Re: [tip: perf/core] perf/x86: Reset the dirty counter to prevent the leak for an RDPMC task

2021-04-16 Thread Liang, Kan




On 4/16/2021 12:45 PM, Peter Zijlstra wrote:

On Fri, Apr 16, 2021 at 03:01:48PM -, tip-bot2 for Kan Liang wrote:

@@ -2331,6 +2367,9 @@ static void x86_pmu_event_unmapped(struct perf_event 
*event, struct mm_struct *m
if (!(event->hw.flags & PERF_X86_EVENT_RDPMC_ALLOWED))
return;
  
+	if (x86_pmu.sched_task && event->hw.target)

+   perf_sched_cb_dec(event->ctx->pmu);
+
	if (atomic_dec_and_test(&mm->context.perf_rdpmc_allowed))
on_each_cpu_mask(mm_cpumask(mm), cr4_update_pce, NULL, 1);
  }


'perf test' on a kernel with CONFIG_DEBUG_PREEMPT=y gives:

[  244.439538] BUG: using smp_processor_id() in preemptible [] code: 
perf/1771


If it's a preemptible environment, I think we should disable interrupts and 
preemption to protect the sched_cb_list.


It seems we don't need perf_ctx_lock() here. I don't think we touch this 
area in NMI context, so disabling the interrupts should be good enough to 
protect the cpuctx.


How about the below patch?

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index e34eb72..45630beed 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2333,6 +2333,8 @@ static void x86_pmu_clear_dirty_counters(void)

 static void x86_pmu_event_mapped(struct perf_event *event, struct 
mm_struct *mm)

 {
+   unsigned long flags;
+
if (!(event->hw.flags & PERF_X86_EVENT_RDPMC_ALLOWED))
return;

@@ -2341,8 +2343,10 @@ static void x86_pmu_event_mapped(struct 
perf_event *event, struct mm_struct *mm)

 * and clear the existing dirty counters.
 */
if (x86_pmu.sched_task && event->hw.target) {
+   local_irq_save(flags);
perf_sched_cb_inc(event->ctx->pmu);
x86_pmu_clear_dirty_counters();
+   local_irq_restore(flags);
}

/*
@@ -2363,12 +2367,16 @@ static void x86_pmu_event_mapped(struct 
perf_event *event, struct mm_struct *mm)


 static void x86_pmu_event_unmapped(struct perf_event *event, struct 
mm_struct *mm)

 {
+   unsigned long flags;

if (!(event->hw.flags & PERF_X86_EVENT_RDPMC_ALLOWED))
return;

-   if (x86_pmu.sched_task && event->hw.target)
+   if (x86_pmu.sched_task && event->hw.target) {
+   local_irq_save(flags);
perf_sched_cb_dec(event->ctx->pmu);
+   local_irq_restore(flags);
+   }

	if (atomic_dec_and_test(&mm->context.perf_rdpmc_allowed))
on_each_cpu_mask(mm_cpumask(mm), cr4_update_pce, NULL, 1);

Thanks,
Kan


Re: [PATCH] perf/x86/intel/uncore: Avoid null dereferences (uncore_extra_pci_dev)

2021-04-15 Thread Liang, Kan




On 4/15/2021 5:19 PM, Liang, Kan wrote:

Hi Steve,

On 4/15/2021 4:37 PM, Steve Wahl wrote:

If an uncore has no pci_init routine, or that routine fails,
uncore_pci_init is not called, and memory is not allocated for
uncore_extra_pci_dev.

So check to make sure uncore_extra_pci_dev is not NULL before use.



I thought about this more after yesterday's discussion. There may be a 
better solution than this.


Actually, we don't have to probe all the PCU devices and store them 
into uncore_extra_pci_dev for the cpu_init().
We just need to pick up the first PCU device and check the existence of 
the SBOX once.


I will send out a patch shortly.


Here is the patch I mentioned.

https://lore.kernel.org/lkml/1618521764-100923-1-git-send-email-kan.li...@linux.intel.com
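
As a rough illustration of that idea, a hedged sketch follows; the linked 
patch is the authoritative version, and the device ID parameter, register 
offset, and bit mask below are placeholders rather than verified values.

/*
 * Illustrative sketch only: look up the first PCU PCI device once and read
 * its CAPID4 register to decide whether only two SBOXes are populated,
 * instead of caching one PCU device per package in uncore_extra_pci_dev.
 * PCU_DID, PCU_CAPID4_OFFSET and CAPID4_LIMIT_SBOX_BIT are placeholders.
 */
static bool hswep_has_limit_sbox(unsigned int device)
{
	struct pci_dev *dev = pci_get_device(PCI_VENDOR_ID_INTEL, device, NULL);
	u32 capid4;

	if (!dev)
		return false;

	pci_read_config_dword(dev, PCU_CAPID4_OFFSET, &capid4);
	pci_dev_put(dev);

	/* Placeholder bit: set when the part only has two SBOXes. */
	return !!(capid4 & CAPID4_LIMIT_SBOX_BIT);
}

hswep_uncore_cpu_init() would then simply do:

	if (hswep_has_limit_sbox(PCU_DID))
		hswep_uncore_sbox.num_boxes = 2;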

Thanks,
Kan




And fix the case that led us to discover the null derefs; don't fail
snbep_pci2phy_map_init if BIOS doesn't supply pcibus_to_node
information.

Fixes: 9a7832ce3d92 ("perf/x86/intel/uncore: With > 8 nodes, get pci 
bus die id from NUMA info")


In theory, the pci_read_config_dword() may fail as well. It's possible 
that the issue could still be observed before the "> 8 nodes" patch.


I think the Fixes tag should be 5306c31c5733 ("perf/x86/uncore/hsw-ep: 
Handle systems with only two SBOXes")


Thanks,
Kan


Signed-off-by: Steve Wahl 
---
  arch/x86/events/intel/uncore_snbep.c | 10 +++---
  1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/intel/uncore_snbep.c 
b/arch/x86/events/intel/uncore_snbep.c

index b79951d0707c..14c24356a2fa 100644
--- a/arch/x86/events/intel/uncore_snbep.c
+++ b/arch/x86/events/intel/uncore_snbep.c
@@ -1373,11 +1373,11 @@ static int snbep_pci2phy_map_init(int devid, 
int nodeid_loc, int idmap_loc, bool

  /*
   * The nodeid and idmap registers only contain enough
   * information to handle 8 nodes.  On systems with more
- * than 8 nodes, we need to rely on NUMA information,
+ * than 8 nodes, if available we rely on NUMA information,
   * filled in from BIOS supplied information, to determine
   * the topology.
   */
-    if (nr_node_ids <= 8) {
+    if ((nr_node_ids <= 8) || (pcibus_to_node(ubox_dev->bus) == -1)) {

  /* get the Node ID of the local register */
  err = pci_read_config_dword(ubox_dev, nodeid_loc, &config);
  if (err)
@@ -2865,7 +2865,9 @@ void hswep_uncore_cpu_init(void)
  hswep_uncore_cbox.num_boxes = boot_cpu_data.x86_max_cores;
  /* Detect 6-8 core systems with only two SBOXes */
-    if (uncore_extra_pci_dev[pkg].dev[HSWEP_PCI_PCU_3]) {
+    if (!uncore_extra_pci_dev)
+    hswep_uncore_sbox.num_boxes = 2;
+    else if (uncore_extra_pci_dev[pkg].dev[HSWEP_PCI_PCU_3]) {
  u32 capid4;
  
pci_read_config_dword(uncore_extra_pci_dev[pkg].dev[HSWEP_PCI_PCU_3],

@@ -3243,6 +3245,8 @@ void bdx_uncore_cpu_init(void)
  if (boot_cpu_data.x86_model == 86) {
  uncore_msr_uncores[BDX_MSR_UNCORE_SBOX] = NULL;
  /* Detect systems with no SBOXes */
+    } else if (!uncore_extra_pci_dev) {
+    bdx_msr_uncores[BDX_MSR_UNCORE_SBOX] = NULL;
  } else if (uncore_extra_pci_dev[pkg].dev[HSWEP_PCI_PCU_3]) {
  struct pci_dev *pdev;
  u32 capid4;



Re: [PATCH] perf/x86/intel/uncore: Avoid null dereferences (uncore_extra_pci_dev)

2021-04-15 Thread Liang, Kan

Hi Steve,

On 4/15/2021 4:37 PM, Steve Wahl wrote:

If an uncore has no pci_init routine, or that routine fails,
uncore_pci_init is not called, and memory is not allocated for
uncore_extra_pci_dev.

So check to make sure uncore_extra_pci_dev is not NULL before use.



I thought about this more after yesterday's discussion. There may be a 
better solution than this.


Actually, we don't have to probe all the PCU devices and store them 
into uncore_extra_pci_dev for the cpu_init().
We just need to pick up the first PCU device and check the existence of 
the SBOX once.


I will send out a patch shortly.


And fix the case that led us to discover the null derefs; don't fail
snbep_pci2phy_map_init if BIOS doesn't supply pcibus_to_node
information.

Fixes: 9a7832ce3d92 ("perf/x86/intel/uncore: With > 8 nodes, get pci bus die id from 
NUMA info")


In theory, the pci_read_config_dword() may fail as well. It's possible 
that the issue could still be observed before the "> 8 nodes" patch.


I think the Fixes tag should be 5306c31c5733 ("perf/x86/uncore/hsw-ep: 
Handle systems with only two SBOXes")


Thanks,
Kan


Signed-off-by: Steve Wahl 
---
  arch/x86/events/intel/uncore_snbep.c | 10 +++---
  1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/intel/uncore_snbep.c 
b/arch/x86/events/intel/uncore_snbep.c
index b79951d0707c..14c24356a2fa 100644
--- a/arch/x86/events/intel/uncore_snbep.c
+++ b/arch/x86/events/intel/uncore_snbep.c
@@ -1373,11 +1373,11 @@ static int snbep_pci2phy_map_init(int devid, int 
nodeid_loc, int idmap_loc, bool
/*
 * The nodeid and idmap registers only contain enough
 * information to handle 8 nodes.  On systems with more
-* than 8 nodes, we need to rely on NUMA information,
+* than 8 nodes, if available we rely on NUMA information,
 * filled in from BIOS supplied information, to determine
 * the topology.
 */
-   if (nr_node_ids <= 8) {
+   if ((nr_node_ids <= 8) || (pcibus_to_node(ubox_dev->bus) == -1)) {
/* get the Node ID of the local register */
		err = pci_read_config_dword(ubox_dev, nodeid_loc, &config);
if (err)
@@ -2865,7 +2865,9 @@ void hswep_uncore_cpu_init(void)
hswep_uncore_cbox.num_boxes = boot_cpu_data.x86_max_cores;
  
  	/* Detect 6-8 core systems with only two SBOXes */

-   if (uncore_extra_pci_dev[pkg].dev[HSWEP_PCI_PCU_3]) {
+   if (!uncore_extra_pci_dev)
+   hswep_uncore_sbox.num_boxes = 2;
+   else if (uncore_extra_pci_dev[pkg].dev[HSWEP_PCI_PCU_3]) {
u32 capid4;
  
  		pci_read_config_dword(uncore_extra_pci_dev[pkg].dev[HSWEP_PCI_PCU_3],

@@ -3243,6 +3245,8 @@ void bdx_uncore_cpu_init(void)
if (boot_cpu_data.x86_model == 86) {
uncore_msr_uncores[BDX_MSR_UNCORE_SBOX] = NULL;
/* Detect systems with no SBOXes */
+   } else if (!uncore_extra_pci_dev) {
+   bdx_msr_uncores[BDX_MSR_UNCORE_SBOX] = NULL;
} else if (uncore_extra_pci_dev[pkg].dev[HSWEP_PCI_PCU_3]) {
struct pci_dev *pdev;
u32 capid4;



Re: [PATCH V3 2/2] perf/x86: Reset the dirty counter to prevent the leak for an RDPMC task

2021-04-14 Thread Liang, Kan




On 4/14/2021 9:51 AM, Namhyung Kim wrote:

Hi Kan,

On Wed, Apr 14, 2021 at 4:04 AM  wrote:

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index dd9f3c2..0d4a1a3 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1585,6 +1585,8 @@ static void x86_pmu_del(struct perf_event *event, int 
flags)
 if (cpuc->txn_flags & PERF_PMU_TXN_ADD)
 goto do_del;

+   __set_bit(event->hw.idx, cpuc->dirty);
+
 /*
  * Not a TXN, therefore cleanup properly.
  */
@@ -2304,12 +2306,46 @@ static int x86_pmu_event_init(struct perf_event *event)
 return err;
  }

+void x86_pmu_clear_dirty_counters(void)
+{
+   struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+   int i;
+
+   if (bitmap_empty(cpuc->dirty, X86_PMC_IDX_MAX))
+   return;


Maybe you can check it after clearing assigned counters.



It should be very likely that cpuc->dirty is non-empty.
Moving the check after the clearing can skip the for_each_set_bit() and 
bitmap_zero().

OK. I will change it in V4.
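
For reference, a minimal sketch of the reordering being discussed, based on 
the V3 function body (also quoted in the kernel test robot report below); 
the final V4 patch may differ. The assigned counters are cleared first, and 
the early return only triggers if nothing is left dirty.

static void x86_pmu_clear_dirty_counters(void)
{
	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
	int i;

	/* Don't need to clear the assigned counters. */
	for (i = 0; i < cpuc->n_events; i++)
		__clear_bit(cpuc->assign[i], cpuc->dirty);

	/* Checked after the clearing, per Namhyung's suggestion. */
	if (bitmap_empty(cpuc->dirty, X86_PMC_IDX_MAX))
		return;

	for_each_set_bit(i, cpuc->dirty, X86_PMC_IDX_MAX) {
		/* Metrics and fake events don't have corresponding HW counters. */
		if (is_metric_idx(i) || (i == INTEL_PMC_IDX_FIXED_VLBR))
			continue;
		else if (i >= INTEL_PMC_IDX_FIXED)
			wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR0 + (i - INTEL_PMC_IDX_FIXED), 0);
		else
			wrmsrl(x86_pmu_event_addr(i), 0);
	}

	bitmap_zero(cpuc->dirty, X86_PMC_IDX_MAX);
}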

Thanks,
Kan




Re: [PATCH V3 2/2] perf/x86: Reset the dirty counter to prevent the leak for an RDPMC task

2021-04-14 Thread Liang, Kan




On 4/13/2021 4:33 PM, kernel test robot wrote:

Hi,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on tip/perf/core]
[also build test WARNING on tip/master linux/master linus/master v5.12-rc7 
next-20210413]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:
https://github.com/0day-ci/linux/commits/kan-liang-linux-intel-com/perf-x86-Move-cpuc-running-into-P4-specific-code/20210414-030649
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 
cface0326a6c2ae5c8f47bd466f07624b3e348a7
config: i386-tinyconfig (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build):
 # 
https://github.com/0day-ci/linux/commit/83f02393e1b5a2723294d8697f4fd5473d70602c
 git remote add linux-review https://github.com/0day-ci/linux
 git fetch --no-tags linux-review 
kan-liang-linux-intel-com/perf-x86-Move-cpuc-running-into-P4-specific-code/20210414-030649
 git checkout 83f02393e1b5a2723294d8697f4fd5473d70602c
 # save the attached .config to linux build tree
 make W=1 ARCH=i386

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot 

All warnings (new ones prefixed by >>):


arch/x86/events/core.c:2309:6: warning: no previous prototype for 
'x86_pmu_clear_dirty_counters' [-Wmissing-prototypes]

 2309 | void x86_pmu_clear_dirty_counters(void)
  |  ^~~~


vim +/x86_pmu_clear_dirty_counters +2309 arch/x86/events/core.c

   2308 

2309void x86_pmu_clear_dirty_counters(void)



Should be "static void x86_pmu_clear_dirty_counters(void)".

I will send V4 shortly to fix it.

Thanks,
Kan


   2310 {
   2311 struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
   2312 int i;
   2313 
   2314 if (bitmap_empty(cpuc->dirty, X86_PMC_IDX_MAX))
   2315 return;
   2316 
   2317  /* Don't need to clear the assigned counter. */
   2318 for (i = 0; i < cpuc->n_events; i++)
   2319 __clear_bit(cpuc->assign[i], cpuc->dirty);
   2320 
   2321 for_each_set_bit(i, cpuc->dirty, X86_PMC_IDX_MAX) {
   2322 /* Metrics and fake events don't have corresponding HW 
counters. */
   2323 if (is_metric_idx(i) || (i == INTEL_PMC_IDX_FIXED_VLBR))
   2324 continue;
   2325 else if (i >= INTEL_PMC_IDX_FIXED)
   2326 wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR0 + (i - 
INTEL_PMC_IDX_FIXED), 0);
   2327 else
   2328 wrmsrl(x86_pmu_event_addr(i), 0);
   2329 }
   2330 
   2331 bitmap_zero(cpuc->dirty, X86_PMC_IDX_MAX);
   2332 }
   2333 

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org



Re: [PATCH V3 2/2] perf/x86: Reset the dirty counter to prevent the leak for an RDPMC task

2021-04-13 Thread Liang, Kan




On 4/13/2021 8:34 PM, Andy Lutomirski wrote:

On Tue, Apr 13, 2021 at 12:05 PM  wrote:


From: Kan Liang 

The counter value of a perf task may leak to another RDPMC task.
For example, a perf stat task as below is running on CPU 0.

 perf stat -e 'branches,cycles' -- taskset -c 0 ./workload


I assume this doesn't fix the leak if the sensitive counter is systemwide?



Right.


Could Intel please add proper security and ideally virtualization for
this?  Ideally RDPMC permission would be a bitmask for all RDPMC-able
counters, not just a single on/off switch.



Yes, we are working on it.

For now, I think this patch is the best we can do.

Thanks,
Kan



Re: [PATCH V5 16/25] perf/x86: Register hybrid PMUs

2021-04-09 Thread Liang, Kan




On 4/9/2021 11:45 AM, Peter Zijlstra wrote:

On Fri, Apr 09, 2021 at 09:50:20AM -0400, Liang, Kan wrote:



On 4/9/2021 2:58 AM, Peter Zijlstra wrote:

On Mon, Apr 05, 2021 at 08:10:58AM -0700, kan.li...@linux.intel.com wrote:

@@ -2089,9 +2119,46 @@ static int __init init_hw_perf_events(void)
if (err)
goto out1;
-   err = perf_pmu_register(&pmu, "cpu", PERF_TYPE_RAW);
-   if (err)
-   goto out2;
+   if (!is_hybrid()) {
+   err = perf_pmu_register(&pmu, "cpu", PERF_TYPE_RAW);
+   if (err)
+   goto out2;
+   } else {
+   u8 cpu_type = get_this_hybrid_cpu_type();
+   struct x86_hybrid_pmu *hybrid_pmu;
+   bool registered = false;
+   int i;
+
+   if (!cpu_type && x86_pmu.get_hybrid_cpu_type)
+   cpu_type = x86_pmu.get_hybrid_cpu_type();
+
+   for (i = 0; i < x86_pmu.num_hybrid_pmus; i++) {
+   hybrid_pmu = &x86_pmu.hybrid_pmu[i];
+
+   hybrid_pmu->pmu = pmu;
+   hybrid_pmu->pmu.type = -1;
+   hybrid_pmu->pmu.attr_update = x86_pmu.attr_update;
+   hybrid_pmu->pmu.capabilities |= 
PERF_PMU_CAP_HETEROGENEOUS_CPUS;
+
+   err = perf_pmu_register(&hybrid_pmu->pmu, hybrid_pmu->name,
+   (hybrid_pmu->cpu_type == hybrid_big) ? PERF_TYPE_RAW : -1);
+   if (err)
+   continue;
+
+   if (cpu_type == hybrid_pmu->cpu_type)
+   x86_pmu_update_cpu_context(&hybrid_pmu->pmu, raw_smp_processor_id());
+
+   registered = true;
+   }
+
+   if (!registered) {
+   pr_warn("Failed to register hybrid PMUs\n");
+   kfree(x86_pmu.hybrid_pmu);
+   x86_pmu.hybrid_pmu = NULL;
+   x86_pmu.num_hybrid_pmus = 0;
+   goto out2;
+   }


I don't think this is quite right. registered will be true even if one
fails, while I think you meant to only have it true when all (both)
types registered correctly.


No, I mean that perf errors out only when all types fail to be registered.


All or nothing seems a better approach to me. There really isn't a good
reason for any one of them to fail.



Sure. I will change it in V6.
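
For reference, one possible shape of the all-or-nothing variant, factored 
as a helper for readability; the helper name, its arguments, and the unwind 
style are assumptions for illustration, not the actual V6 patch.

static int x86_pmu_register_hybrid_pmus(struct pmu *base, u8 cpu_type)
{
	struct x86_hybrid_pmu *hybrid_pmu;
	int i, err;

	for (i = 0; i < x86_pmu.num_hybrid_pmus; i++) {
		hybrid_pmu = &x86_pmu.hybrid_pmu[i];

		hybrid_pmu->pmu = *base;
		hybrid_pmu->pmu.type = -1;
		hybrid_pmu->pmu.attr_update = x86_pmu.attr_update;
		hybrid_pmu->pmu.capabilities |= PERF_PMU_CAP_HETEROGENEOUS_CPUS;

		err = perf_pmu_register(&hybrid_pmu->pmu, hybrid_pmu->name,
					(hybrid_pmu->cpu_type == hybrid_big) ?
					PERF_TYPE_RAW : -1);
		if (err)
			goto err_unregister;

		if (cpu_type == hybrid_pmu->cpu_type)
			x86_pmu_update_cpu_context(&hybrid_pmu->pmu,
						   raw_smp_processor_id());
	}

	return 0;

err_unregister:
	/* Any failure unwinds everything: all or nothing. */
	while (i--)
		perf_pmu_unregister(&x86_pmu.hybrid_pmu[i].pmu);

	kfree(x86_pmu.hybrid_pmu);
	x86_pmu.hybrid_pmu = NULL;
	x86_pmu.num_hybrid_pmus = 0;

	return err;
}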

Thanks,
Kan


Re: [PATCH V5 21/25] perf: Introduce PERF_TYPE_HARDWARE_PMU and PERF_TYPE_HW_CACHE_PMU

2021-04-09 Thread Liang, Kan




On 4/9/2021 5:21 AM, Peter Zijlstra wrote:

On Mon, Apr 05, 2021 at 08:11:03AM -0700, kan.li...@linux.intel.com wrote:

From: Kan Liang 

Current Hardware events and Hardware cache events have special perf
types, PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE. The two types don't
pass the PMU type in the user interface. For a hybrid system, the perf
subsystem doesn't know which PMU the events belong to. The first capable
PMU will always be assigned to the events. The events never get a chance
to run on the other capable PMUs.

Add a PMU aware version PERF_TYPE_HARDWARE_PMU and
PERF_TYPE_HW_CACHE_PMU. The PMU type ID is stored at attr.config[40:32].
Support the new types for X86.


Obviously ARM would need the same, but also, I don't think I see the
need to introduce new types. AFAICT there is nothing that stops this
scheme from working for the existing types.

Also, pmu type is 32bit, not 8bit.

So how about something like this?

---
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 3f7f89ea5e51..074c7687d466 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -260,15 +260,16 @@ struct perf_event;
  /**
   * pmu::capabilities flags
   */
-#define PERF_PMU_CAP_NO_INTERRUPT  0x01
-#define PERF_PMU_CAP_NO_NMI0x02
-#define PERF_PMU_CAP_AUX_NO_SG 0x04
-#define PERF_PMU_CAP_EXTENDED_REGS 0x08
-#define PERF_PMU_CAP_EXCLUSIVE 0x10
-#define PERF_PMU_CAP_ITRACE0x20
-#define PERF_PMU_CAP_HETEROGENEOUS_CPUS0x40
-#define PERF_PMU_CAP_NO_EXCLUDE0x80
-#define PERF_PMU_CAP_AUX_OUTPUT0x100
+#define PERF_PMU_CAP_NO_INTERRUPT  0x0001
+#define PERF_PMU_CAP_NO_NMI0x0002
+#define PERF_PMU_CAP_AUX_NO_SG 0x0004
+#define PERF_PMU_CAP_EXTENDED_REGS 0x0008
+#define PERF_PMU_CAP_EXCLUSIVE 0x0010
+#define PERF_PMU_CAP_ITRACE0x0020
+#define PERF_PMU_CAP_HETEROGENEOUS_CPUS0x0040
+#define PERF_PMU_CAP_NO_EXCLUDE0x0080
+#define PERF_PMU_CAP_AUX_OUTPUT0x0100
+#define PERF_PMU_CAP_EXTENDED_HW_TYPE  0x0200
  
  struct perf_output_handle;
  
diff --git a/kernel/events/core.c b/kernel/events/core.c

index f07943183041..910a0666ebfe 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3,14 +3,21 @@ static struct pmu *perf_init_event(struct perf_event 
*event)
 * are often aliases for PERF_TYPE_RAW.
 */
type = event->attr.type;
-   if (type == PERF_TYPE_HARDWARE || type == PERF_TYPE_HW_CACHE)
-   type = PERF_TYPE_RAW;
+   if (type == PERF_TYPE_HARDWARE || type == PERF_TYPE_HW_CACHE) {
+   type = event->attr.config >> 32;
+   if (!type)
+   type = PERF_TYPE_RAW;
+   }


For the old tool, the default PMU will be the big core. I think it's OK 
for X86.


Since only the low 32 bits of event->attr.config contain the 'real' 
config value, I think all the ARCHs will do 
event->attr.config &= 0xffffffff. Maybe we should move it to the generic code.


+   if (type == PERF_TYPE_HARDWARE || type == PERF_TYPE_HW_CACHE) {
+   type = event->attr.config >> 32;
+   if (!type)
+   type = PERF_TYPE_RAW;
+   else
+   event->attr.config &= 0xffffffff;


again:
rcu_read_lock();
	pmu = idr_find(&pmu_idr, type);
rcu_read_unlock();
if (pmu) {
+   if (event->attr.type != type && type != PERF_TYPE_RAW &&
+   !(pmu->capabilities & PERF_PMU_CAP_EXTENDED_HW_TYPE))
+   goto fail;
+
ret = perf_try_init_event(pmu, event);
if (ret == -ENOENT && event->attr.type != type) {
type = event->attr.type;



I don't think we want to go through all available PMUs again if the user 
has already specified a PMU.


I've updated the patch a little bit. (Not tested yet; I will run some tests.)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 3158cbc..4f5f9a9 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2175,6 +2175,7 @@ static int __init init_hw_perf_events(void)
hybrid_pmu->pmu.type = -1;
hybrid_pmu->pmu.attr_update = x86_pmu.attr_update;
hybrid_pmu->pmu.capabilities |= 
PERF_PMU_CAP_HETEROGENEOUS_CPUS;
+   hybrid_pmu->pmu.capabilities |= 
PERF_PMU_CAP_EXTENDED_HW_TYPE;

		err = perf_pmu_register(&hybrid_pmu->pmu, hybrid_pmu->name,
					(hybrid_pmu->cpu_type == hybrid_big) ? PERF_TYPE_RAW : -1);
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index b832e09..391bfb7 100644
--- 

Re: [PATCH V5 23/25] perf/x86/msr: Add Alder Lake CPU support

2021-04-09 Thread Liang, Kan




On 4/9/2021 5:24 AM, Peter Zijlstra wrote:

On Mon, Apr 05, 2021 at 08:11:05AM -0700, kan.li...@linux.intel.com wrote:

From: Kan Liang 

PPERF and SMI_COUNT MSRs are also supported on Alder Lake.

The External Design Specification (EDS) is not published yet. It comes
from an authoritative internal source.

The patch has been tested on real hardware.

Reviewed-by: Andi Kleen 
Signed-off-by: Kan Liang 
---
  arch/x86/events/msr.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/arch/x86/events/msr.c b/arch/x86/events/msr.c
index 680404c..c853b28 100644
--- a/arch/x86/events/msr.c
+++ b/arch/x86/events/msr.c
@@ -100,6 +100,8 @@ static bool test_intel(int idx, void *data)
case INTEL_FAM6_TIGERLAKE_L:
case INTEL_FAM6_TIGERLAKE:
case INTEL_FAM6_ROCKETLAKE:
+   case INTEL_FAM6_ALDERLAKE:
+   case INTEL_FAM6_ALDERLAKE_L:
if (idx == PERF_MSR_SMI || idx == PERF_MSR_PPERF)
return true;
break;


If only it would be sanely enumerated... What about sapphire rapids?



SPR also supports the MSRs. I will submit a patch separately to support SPR.

Thanks,
Kan


Re: [PATCH V5 16/25] perf/x86: Register hybrid PMUs

2021-04-09 Thread Liang, Kan




On 4/9/2021 2:58 AM, Peter Zijlstra wrote:

On Mon, Apr 05, 2021 at 08:10:58AM -0700, kan.li...@linux.intel.com wrote:

@@ -2089,9 +2119,46 @@ static int __init init_hw_perf_events(void)
if (err)
goto out1;
  
-	err = perf_pmu_register(&pmu, "cpu", PERF_TYPE_RAW);

-   if (err)
-   goto out2;
+   if (!is_hybrid()) {
+   err = perf_pmu_register(&pmu, "cpu", PERF_TYPE_RAW);
+   if (err)
+   goto out2;
+   } else {
+   u8 cpu_type = get_this_hybrid_cpu_type();
+   struct x86_hybrid_pmu *hybrid_pmu;
+   bool registered = false;
+   int i;
+
+   if (!cpu_type && x86_pmu.get_hybrid_cpu_type)
+   cpu_type = x86_pmu.get_hybrid_cpu_type();
+
+   for (i = 0; i < x86_pmu.num_hybrid_pmus; i++) {
+   hybrid_pmu = &x86_pmu.hybrid_pmu[i];
+
+   hybrid_pmu->pmu = pmu;
+   hybrid_pmu->pmu.type = -1;
+   hybrid_pmu->pmu.attr_update = x86_pmu.attr_update;
+   hybrid_pmu->pmu.capabilities |= 
PERF_PMU_CAP_HETEROGENEOUS_CPUS;
+
+   err = perf_pmu_register(&hybrid_pmu->pmu, hybrid_pmu->name,
+   (hybrid_pmu->cpu_type == hybrid_big) ? PERF_TYPE_RAW : -1);
+   if (err)
+   continue;
+
+   if (cpu_type == hybrid_pmu->cpu_type)
+   x86_pmu_update_cpu_context(&hybrid_pmu->pmu, raw_smp_processor_id());
+
+   registered = true;
+   }
+
+   if (!registered) {
+   pr_warn("Failed to register hybrid PMUs\n");
+   kfree(x86_pmu.hybrid_pmu);
+   x86_pmu.hybrid_pmu = NULL;
+   x86_pmu.num_hybrid_pmus = 0;
+   goto out2;
+   }


I don't think this is quite right. registered will be true even if one
fails, while I think you meant to only have it true when all (both)
types registered correctly.


No, I mean that perf errors out only when all types fail to be registered.

For the case of one failure and one success, users can still access the 
registered PMU.
When a CPU that belongs to the unregistered PMU comes online, a warning 
will be displayed, because in init_hybrid_pmu() we check the PMU type 
before updating the CPU mask.


if (WARN_ON_ONCE(!pmu || (pmu->pmu.type == -1))) {
cpuc->pmu = NULL;
return false;
}


Thanks,
Kan


Re: [PATCH V5 04/25] perf/x86/intel: Hybrid PMU support for perf capabilities

2021-04-08 Thread Liang, Kan




On 4/8/2021 9:40 AM, Peter Zijlstra wrote:

@@ -4330,7 +4347,7 @@ static int intel_pmu_check_period(struct perf_event 
*event, u64 value)
  
  static int intel_pmu_aux_output_match(struct perf_event *event)

  {
-   if (!x86_pmu.intel_cap.pebs_output_pt_available)
+   if (!intel_pmu_has_cap(event, PERF_CAP_PT_IDX))
return 0;
  
  	return is_intel_pt_event(event);

these sites can have !event->pmu ?



I don't think event->pmu can be NULL, but it could be pt_pmu.pmu.
If so, it would be a problem.

I think I will still use x86_pmu.intel_cap.pebs_output_pt_available 
here in V6. The worst case is that we lose the PEBS-via-PT support on 
the small core for now.


I guess Alexander may provide a separate patch later to enable the PEBS 
via PT support on the ADL small core.


Thanks,
Kan


Re: [PATCH V5 04/25] perf/x86/intel: Hybrid PMU support for perf capabilities

2021-04-08 Thread Liang, Kan




On 4/8/2021 1:00 PM, Peter Zijlstra wrote:

On Mon, Apr 05, 2021 at 08:10:46AM -0700, kan.li...@linux.intel.com wrote:

+#define is_hybrid()		(!!x86_pmu.num_hybrid_pmus)


Given this is sprinkled all over the place, can you make this a
static_key_false + static_branch_unlikely() such that the hybrid case is
out-of-line?



Sure, I will add a new static_key_false "perf_is_hybrid" to indicate a 
hybrid system, as below (not tested yet).


diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index f8d1222..bd6412e 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -54,6 +54,7 @@ DEFINE_PER_CPU(struct cpu_hw_events, cpu_hw_events) = {

 DEFINE_STATIC_KEY_FALSE(rdpmc_never_available_key);
 DEFINE_STATIC_KEY_FALSE(rdpmc_always_available_key);
+DEFINE_STATIC_KEY_FALSE(perf_is_hybrid);

 /*
  * This here uses DEFINE_STATIC_CALL_NULL() to get a static_call defined
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 2b553d9..7cef3cd 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -6119,6 +6119,7 @@ __init int intel_pmu_init(void)
 GFP_KERNEL);
if (!x86_pmu.hybrid_pmu)
return -ENOMEM;
+   static_branch_enable(&perf_is_hybrid);
x86_pmu.num_hybrid_pmus = X86_HYBRID_NUM_PMUS;

x86_pmu.late_ack = true;
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index bfbecde..d6383d1 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -663,8 +663,8 @@ static __always_inline struct x86_hybrid_pmu 
*hybrid_pmu(struct pmu *pmu)

return container_of(pmu, struct x86_hybrid_pmu, pmu);
 }

-/* The number of hybrid PMUs implies whether it's a hybrid system */
-#define is_hybrid()		(!!x86_pmu.num_hybrid_pmus)
+extern struct static_key_false perf_is_hybrid;
+#define is_hybrid()		static_branch_unlikely(&perf_is_hybrid)

 #define hybrid(_pmu, _field)   \
 ({ \

Thanks,
Kan


Re: [PATCH v4 RESEND 3/5] perf/x86/lbr: Move cpuc->lbr_xsave allocation out of sleeping region

2021-03-23 Thread Liang, Kan




On 3/23/2021 5:41 PM, Peter Zijlstra wrote:

On Mon, Mar 22, 2021 at 02:06:33PM +0800, Like Xu wrote:

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 18df17129695..a4ce669cc78d 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -373,7 +373,7 @@ set_ext_hw_attr(struct hw_perf_event *hwc, struct 
perf_event *event)
return x86_pmu_extra_regs(val, event);
  }
  
-int x86_reserve_hardware(void)

+int x86_reserve_hardware(struct perf_event *event)
  {
int err = 0;
  
@@ -382,8 +382,10 @@ int x86_reserve_hardware(void)

	if (atomic_read(&pmc_refcount) == 0) {
if (!reserve_pmc_hardware())
err = -EBUSY;
-   else
+   else {
reserve_ds_buffers();
+   reserve_lbr_buffers(event);
+   }
}
if (!err)
	atomic_inc(&pmc_refcount);


This makes absolutely no sense what so ever. This is only executed for
the first event that gets here.



The LBR xsave buffer is a per-CPU buffer, not a per-event buffer. I 
think we only need to allocate the buffer once when initializing the 
first event.
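
To illustrate the point, a rough sketch of a per-CPU pre-allocation in 
reserve_lbr_buffers() follows; the function name and the event parameter 
come from the quoted diff, while the feature check and allocation details 
are assumptions for illustration, not the actual hunk.

void reserve_lbr_buffers(struct perf_event *event)
{
	struct kmem_cache *kmem_cache = x86_get_pmu()->task_ctx_cache;
	struct cpu_hw_events *cpuc;
	int cpu;

	/* Only arch LBR uses the XSAVE buffer. */
	if (!static_cpu_has(X86_FEATURE_ARCH_LBR))
		return;

	/*
	 * Allocate once per CPU in a context that may sleep, so that
	 * intel_pmu_lbr_add() never has to allocate under the ctx lock.
	 * The event argument could be used to skip events that don't
	 * need the buffer; it is unused in this sketch.
	 */
	for_each_possible_cpu(cpu) {
		cpuc = per_cpu_ptr(&cpu_hw_events, cpu);
		if (kmem_cache && !cpuc->lbr_xsave)
			cpuc->lbr_xsave = kmem_cache_alloc_node(kmem_cache, GFP_KERNEL,
								cpu_to_node(cpu));
	}
}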


Thanks,
Kan


Re: [PATCH v4 RESEND 3/5] perf/x86/lbr: Move cpuc->lbr_xsave allocation out of sleeping region

2021-03-23 Thread Liang, Kan




On 3/22/2021 2:06 AM, Like Xu wrote:

If the kernel is compiled with the CONFIG_LOCKDEP option, the conditional
might_sleep_if() deep in kmem_cache_alloc() will generate the following
trace, and potentially cause a deadlock when another LBR event is added:

[  243.115549] BUG: sleeping function called from invalid context at 
include/linux/sched/mm.h:196
[  243.117576] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 839, 
name: perf
[  243.119326] INFO: lockdep is turned off.
[  243.120249] irq event stamp: 0
[  243.120967] hardirqs last  enabled at (0): [<>] 0x0
[  243.122415] hardirqs last disabled at (0): [] 
copy_process+0xa45/0x1dc0
[  243.124302] softirqs last  enabled at (0): [] 
copy_process+0xa45/0x1dc0
[  243.126255] softirqs last disabled at (0): [<>] 0x0
[  243.128119] CPU: 0 PID: 839 Comm: perf Not tainted 5.11.0-rc4-guest+ #8
[  243.129654] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
0.0.0 02/06/2015
[  243.131520] Call Trace:
[  243.132112]  dump_stack+0x8d/0xb5
[  243.132896]  ___might_sleep.cold.106+0xb3/0xc3
[  243.133984]  slab_pre_alloc_hook.constprop.85+0x96/0xd0
[  243.135208]  ? intel_pmu_lbr_add+0x152/0x170
[  243.136207]  kmem_cache_alloc+0x36/0x250
[  243.137126]  intel_pmu_lbr_add+0x152/0x170
[  243.138088]  x86_pmu_add+0x83/0xd0
[  243.138889]  ? lock_acquire+0x158/0x350
[  243.139791]  ? lock_acquire+0x158/0x350
[  243.140694]  ? lock_acquire+0x158/0x350
[  243.141625]  ? lock_acquired+0x1e3/0x360
[  243.142544]  ? lock_release+0x1bf/0x340
[  243.143726]  ? trace_hardirqs_on+0x1a/0xd0
[  243.144823]  ? lock_acquired+0x1e3/0x360
[  243.145742]  ? lock_release+0x1bf/0x340
[  243.147107]  ? __slab_free+0x49/0x540
[  243.147966]  ? trace_hardirqs_on+0x1a/0xd0
[  243.148924]  event_sched_in.isra.129+0xf8/0x2a0
[  243.149989]  merge_sched_in+0x261/0x3e0
[  243.150889]  ? trace_hardirqs_on+0x1a/0xd0
[  243.151869]  visit_groups_merge.constprop.135+0x130/0x4a0
[  243.153122]  ? sched_clock_cpu+0xc/0xb0
[  243.154023]  ctx_sched_in+0x101/0x210
[  243.154884]  ctx_resched+0x6f/0xc0
[  243.155686]  perf_event_exec+0x21e/0x2e0
[  243.156641]  begin_new_exec+0x5e5/0xbd0
[  243.157540]  load_elf_binary+0x6af/0x1770
[  243.158478]  ? __kernel_read+0x19d/0x2b0
[  243.159977]  ? lock_acquire+0x158/0x350
[  243.160876]  ? __kernel_read+0x19d/0x2b0
[  243.161796]  bprm_execve+0x3c8/0x840
[  243.162638]  do_execveat_common.isra.38+0x1a5/0x1c0
[  243.163776]  __x64_sys_execve+0x32/0x40
[  243.164676]  do_syscall_64+0x33/0x40
[  243.165514]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  243.166746] RIP: 0033:0x7f6180a26feb
[  243.167590] Code: Unable to access opcode bytes at RIP 0x7f6180a26fc1.
[  243.169097] RSP: 002b:7ffc6558ce18 EFLAGS: 0202 ORIG_RAX: 
003b
[  243.170844] RAX: ffda RBX: 7ffc65592d30 RCX: 7f6180a26feb
[  243.172514] RDX: 55657f408dc0 RSI: 7ffc65592410 RDI: 7ffc65592d30
[  243.174162] RBP: 7ffc6558ce80 R08: 7ffc6558cde0 R09: 
[  243.176042] R10: 0008 R11: 0202 R12: 7ffc65592410
[  243.177696] R13: 55657f408dc0 R14: 0001 R15: 7ffc65592410

One of the solution is to use GFP_ATOMIC, but it will make the code less
reliable under memory pressue. Let's move the memory allocation out of
the sleeping region and put it into the x86_reserve_hardware().

The disadvantage of this fix is that the cpuc->lbr_xsave memory
will be allocated for each cpu like the legacy ds_buffer.

Fixes: c085fb8774 ("perf/x86/intel/lbr: Support XSAVES for arch LBR read")
Suggested-by: Kan Liang 
Signed-off-by: Like Xu 


I observed the same issue when I did LBR test on an ADL machine. This 
patch fixes the issue.


Tested-by: Kan Liang 

Thanks,
Kan

---
  arch/x86/events/core.c   |  8 +---
  arch/x86/events/intel/bts.c  |  2 +-
  arch/x86/events/intel/lbr.c  | 22 --
  arch/x86/events/perf_event.h |  8 +++-
  4 files changed, 29 insertions(+), 11 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 18df17129695..a4ce669cc78d 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -373,7 +373,7 @@ set_ext_hw_attr(struct hw_perf_event *hwc, struct 
perf_event *event)
return x86_pmu_extra_regs(val, event);
  }
  
-int x86_reserve_hardware(void)

+int x86_reserve_hardware(struct perf_event *event)
  {
int err = 0;
  
@@ -382,8 +382,10 @@ int x86_reserve_hardware(void)

	if (atomic_read(&pmc_refcount) == 0) {
if (!reserve_pmc_hardware())
err = -EBUSY;
-   else
+   else {
reserve_ds_buffers();
+   reserve_lbr_buffers(event);
+   }
}
if (!err)
	atomic_inc(&pmc_refcount);
@@ -634,7 +636,7 @@ static int 

Re: [PATCH V2 1/5] perf/x86/intel/uncore: Parse uncore discovery tables

2021-03-19 Thread Liang, Kan




On 3/18/2021 9:10 PM, Namhyung Kim wrote:

Hi Kan,

On Thu, Mar 18, 2021 at 3:05 AM  wrote:


From: Kan Liang 

A self-describing mechanism for the uncore PerfMon hardware has been
introduced with the latest Intel platforms. By reading through an MMIO
page worth of information, perf can 'discover' all the standard uncore
PerfMon registers in a machine.

The discovery mechanism relies on BIOS's support. With a proper BIOS,
a PCI device with the unique capability ID 0x23 can be found on each
die. Perf can retrieve the information of all available uncore PerfMons
from the device via MMIO. The information is composed of one global
discovery table and several unit discovery tables.
- The global discovery table includes global uncore information of the
   die, e.g., the address of the global control register, the offset of
   the global status register, the number of uncore units, the offset of
   unit discovery tables, etc.
- The unit discovery table includes generic uncore unit information,
   e.g., the access type, the counter width, the address of counters,
   the address of the counter control, the unit ID, the unit type, etc.
   The unit is also called "box" in the code.
Perf can provide basic uncore support based on this information
with the following patches.

To locate the PCI device with the discovery tables, check the generic
PCI ID first. If it doesn't match, go through the entire PCI device tree
and locate the device with the unique capability ID.

The uncore information is similar among dies. To save parsing time and
space, only completely parse and store the discovery tables on the first
die and the first box of each die. The parsed information is stored in an
RB tree structure, intel_uncore_discovery_type. The size of the stored
discovery tables varies among platforms. It's around 4KB for a Sapphire
Rapids server.

If a BIOS doesn't support the 'discovery' mechanism, the uncore driver
will exit with -ENODEV. There is nothing changed.

Add a module parameter to disable the discovery feature. If a BIOS gets
the discovery tables wrong, users can have an option to disable the
feature. For the current patchset, the uncore driver will exit with
-ENODEV. In the future, it may fall back to the hardcode uncore driver
on a known platform.

Signed-off-by: Kan Liang 
---
  arch/x86/events/intel/Makefile   |   2 +-
  arch/x86/events/intel/uncore.c   |  31 ++-
  arch/x86/events/intel/uncore_discovery.c | 318 +++
  arch/x86/events/intel/uncore_discovery.h | 105 ++
  4 files changed, 448 insertions(+), 8 deletions(-)
  create mode 100644 arch/x86/events/intel/uncore_discovery.c
  create mode 100644 arch/x86/events/intel/uncore_discovery.h

diff --git a/arch/x86/events/intel/Makefile b/arch/x86/events/intel/Makefile
index e67a588..10bde6c 100644
--- a/arch/x86/events/intel/Makefile
+++ b/arch/x86/events/intel/Makefile
@@ -3,6 +3,6 @@ obj-$(CONFIG_CPU_SUP_INTEL) += core.o bts.o
  obj-$(CONFIG_CPU_SUP_INTEL)+= ds.o knc.o
  obj-$(CONFIG_CPU_SUP_INTEL)+= lbr.o p4.o p6.o pt.o
  obj-$(CONFIG_PERF_EVENTS_INTEL_UNCORE) += intel-uncore.o
-intel-uncore-objs  := uncore.o uncore_nhmex.o uncore_snb.o 
uncore_snbep.o
+intel-uncore-objs  := uncore.o uncore_nhmex.o uncore_snb.o 
uncore_snbep.o uncore_discovery.o
  obj-$(CONFIG_PERF_EVENTS_INTEL_CSTATE) += intel-cstate.o
  intel-cstate-objs  := cstate.o
diff --git a/arch/x86/events/intel/uncore.c b/arch/x86/events/intel/uncore.c
index 33c8180..d111370 100644
--- a/arch/x86/events/intel/uncore.c
+++ b/arch/x86/events/intel/uncore.c
@@ -4,7 +4,12 @@
  #include 
  #include 
  #include "uncore.h"
+#include "uncore_discovery.h"

+static bool uncore_no_discover;
+module_param(uncore_no_discover, bool, 0);


Wouldn't it be better to use a positive form like 'uncore_discover = true'?
To disable, the module param can be set to 'uncore_discover = false'.



I'd like the feature to be enabled by default. The default value of a 
static variable is 0, so I use the current name. It's just a personal preference.



+MODULE_PARM_DESC(uncore_no_discover, "Don't enable the Intel uncore PerfMon 
discovery mechanism "
+"(default: enable the discovery 
mechanism).");
  static struct intel_uncore_type *empty_uncore[] = { NULL, };
  struct intel_uncore_type **uncore_msr_uncores = empty_uncore;
  struct intel_uncore_type **uncore_pci_uncores = empty_uncore;


[SNIP]

+enum uncore_access_type {
+   UNCORE_ACCESS_MSR   = 0,
+   UNCORE_ACCESS_MMIO,
+   UNCORE_ACCESS_PCI,
+
+   UNCORE_ACCESS_MAX,
+};
+
+struct uncore_global_discovery {
+   union {
+   u64 table1;
+   struct {
+   u64 type : 8,
+   stride : 8,
+   max_units : 10,
+   __reserved_1 : 36,
+ 

Re: [PATCH v2 11/27] perf parse-events: Support hardware events inside PMU

2021-03-18 Thread Liang, Kan




On 3/18/2021 9:21 AM, Arnaldo Carvalho de Melo wrote:

Em Thu, Mar 18, 2021 at 01:16:37PM +0100, Jiri Olsa escreveu:

On Wed, Mar 17, 2021 at 10:42:45AM -0300, Arnaldo Carvalho de Melo wrote:

Em Wed, Mar 17, 2021 at 08:17:52PM +0800, Jin, Yao escreveu:

I'm OK to only support 'cpu_core/cpu-cycles/' or 'cpu_atom/cpu-cycles/'. But
what would we do for cache event?



'perf stat -e LLC-loads' is OK, but 'perf stat -e cpu/LLC-loads/' is not 
supported currently.



For hybrid platform, user may only want to enable the LLC-loads on core CPUs
or on atom CPUs. That's reasonable. While if we don't support the pmu style
event, how to satisfy this requirement?



If we can support the pmu style event, we can also use the same way for
cpu_core/cycles/. At least it's not a bad thing, right? :)



While we're discussing, do we really want to use the "core" and "atom"
terms here? I thought cpu/cycles/ would be ok for the main (Big) CPU and
that we should come up with some short name for the "litle" CPUs.



Won't we have the same situation with ARM where we want to know the
number of cycles spent on a BIG core and also on a little one?



Perhaps 'cycles' should mean all cycles, and then we use 'big/cycles/' and
'little/cycles/'?



do arm servers already export multiple pmus like this?
I did not notice


I haven't checked, but AFAIK this BIG/Little kind of arch started there,
Mark?



Here is the cover letter of the ARM big.little patch set. ARM also 
exports multiple PMUs, e.g., armv7_cortex_a15 and armv7_cortex_a7.

https://lore.kernel.org/lkml/1431533549-27715-1-git-send-email-mark.rutl...@arm.com/

We follow a similar way to handle the Intel hybrid PMUs. The naming rule 
is also similar, "cpu_" + CPU type.


We don't use the old name "cpu" for the main CPU type, because we want 
to make sure all software is updated for the hybrid architecture. 
Otherwise, an old script using "cpu//" could still run on a hybrid 
architecture without users noticing that the monitored scope has 
implicitly changed, and the results may not be what they want.


Thanks,
Kan



- Arnaldo
  

it'd be definitely great to have some unite way for this,
so far we have the hybrid pmu detection and support in
hw events like cycles/instructions.. which should be easy
to follow on arm

there's also support to have these events on specific pmu
pmu/cycles/ , which I still need to check on


Re: [PATCH] Revert "perf/x86: Allow zero PEBS status with only single active event"

2021-03-16 Thread Liang, Kan




On 3/16/2021 2:34 PM, Stephane Eranian wrote:

On Tue, Mar 16, 2021 at 5:28 AM Liang, Kan  wrote:




On 3/16/2021 3:22 AM, Namhyung Kim wrote:

Hi Peter and Kan,

On Thu, Mar 4, 2021 at 5:22 AM Peter Zijlstra  wrote:


On Wed, Mar 03, 2021 at 02:53:00PM -0500, Liang, Kan wrote:

On 3/3/2021 1:59 PM, Peter Zijlstra wrote:

On Wed, Mar 03, 2021 at 05:42:18AM -0800, kan.li...@linux.intel.com wrote:



+++ b/arch/x86/events/intel/ds.c
@@ -2000,18 +2000,6 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs 
*iregs, struct perf_sample_d
continue;
}
- /*
-  * On some CPUs the PEBS status can be zero when PEBS is
-  * racing with clearing of GLOBAL_STATUS.
-  *
-  * Normally we would drop that record, but in the
-  * case when there is only a single active PEBS event
-  * we can assume it's for that event.
-  */
- if (!pebs_status && cpuc->pebs_enabled &&
- !(cpuc->pebs_enabled & (cpuc->pebs_enabled-1)))
- pebs_status = cpuc->pebs_enabled;


Wouldn't something like:

  pebs_status = p->status = cpuc->pebs_enabled;



I didn't consider it as a potential solution in this patch because I don't
think it's proper for SW to modify the buffer, which is supposed to be
manipulated by the HW.


Right, but then HW was supposed to write sane values and it doesn't do
that either ;-)


It's just a personal preference. I don't see any issue here. We may try it.


So I mostly agree with you, but I think it's a shame to unsupport such
chips, HSW is still a plenty useable chip today.


I got a similar issue on ivybridge machines which caused kernel crash.
My case it's related to the branch stack with PEBS events but I think
it's the same issue.  And I can confirm that the above approach of
updating p->status fixed the problem.

I've talked to Stephane about this, and he wants to make it more
robust when we see stale (or invalid) PEBS records.  I'll send the
patch soon.



Hi Namhyung,

In case you didn't see it, I've already submitted a patch to fix the
issue last Friday.
https://lore.kernel.org/lkml/161298-140216-1-git-send-email-kan.li...@linux.intel.com/
But if you have a more robust proposal, please feel free to submit it.


This fixes the problem on the older systems. The other problem we
identified related to the
PEBS sample processing code is that you can end up with uninitialized
perf_sample_data
struct passed to perf_event_overflow():

  setup_pebs_fixed_sample_data(pebs, data)
{
 if (!pebs)
 return;
 perf_sample_data_init(data);  <<< must be moved before the if (!pebs)
 ...
}

__intel_pmu_pebs_event(pebs, data)
{
 setup_sample(pebs, data)
 perf_event_overflow(data);
 ...
}

If there is any other reason to get a pebs = NULL in fix_sample_data()
or adaptive_sample_data(), then
you must call perf_sample_data_init(data) BEFORE you return otherwise
you end up in perf_event_overflow()
with uninitialized data and you may die as follows:

[] ? perf_output_copy+0x4d/0xb0
[] perf_output_sample+0x561/0xab0
[] ? __perf_event_header__init_id+0x112/0x130
[] ? perf_prepare_sample+0x1b1/0x730
[] perf_event_output_forward+0x59/0x80
[] ? perf_event_update_userpage+0xf4/0x110
[] perf_event_overflow+0x88/0xe0
[] __intel_pmu_pebs_event+0x328/0x380

This all stems from get_next_pebs_record_by_bit()  potentially
returning NULL but the NULL is not handled correctly
by the callers. This is what I'd like to see cleaned up in
__intel_pmu_pebs_event() to  avoid future problems.



Got it. Yes, we need another patch to correctly handle the potential 
NULL return. Thanks for pointing it out.
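
A minimal sketch of the reordering Stephane describes, assuming the 
fixed-record variant; the real function fills in many more fields and the 
eventual patch may differ. The point is that perf_sample_data_init() runs 
before any early return, so perf_event_overflow() never sees an 
uninitialized perf_sample_data.

static void setup_pebs_fixed_sample_data(struct perf_event *event,
					 struct pt_regs *iregs, void *__pebs,
					 struct perf_sample_data *data,
					 struct pt_regs *regs)
{
	struct pebs_record_skl *pebs = __pebs;

	/* Initialize first, even for a stale/NULL PEBS record. */
	perf_sample_data_init(data, 0, event->hw.last_period);

	if (!pebs)
		return;

	/* ... decode the PEBS record into data/regs as before ... */
}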

Thanks,
Kan


Re: [PATCH 1/5] perf/x86/intel/uncore: Parse uncore discovery tables

2021-03-16 Thread Liang, Kan




On 3/16/2021 10:05 AM, Peter Zijlstra wrote:

On Tue, Mar 16, 2021 at 08:42:25AM -0400, Liang, Kan wrote:



On 3/16/2021 7:43 AM, Peter Zijlstra wrote:

On Fri, Mar 12, 2021 at 08:34:34AM -0800, kan.li...@linux.intel.com wrote:

From: Kan Liang 

A self-describing mechanism for the uncore PerfMon hardware has been
introduced with the latest Intel platforms. By reading through an MMIO
page worth of information, perf can 'discover' all the standard uncore
PerfMon registers in a machine.

The discovery mechanism relies on BIOS's support. With a proper BIOS,
a PCI device with the unique capability ID 0x23 can be found on each
die. Perf can retrieve the information of all available uncore PerfMons
from the device via MMIO. The information is composed of one global
discovery table and several unit discovery tables.



If a BIOS doesn't support the 'discovery' mechanism, there is nothing
changed.


What if the BIOS got it wrong? Will the driver still get it correct if
it is a known platform?


Yes, I will submit a platform specific patch to fix this case.



Do we need a chicken flag to kill the discovery? uncore_no_discover?



Yes, I plan to introduce a .use_discovery_tables flag to indicate whether to
use the discovery tables for the known platform.

The below code is part of the upcoming SPR uncore patches.
The first SPR uncore patch will still rely on the BIOS discovery tables,
because some uncore block information hasn't been published yet. We have to
retrieve the information from the tables. Once all the information is
published, we can kill the discovery by removing the ".use_discovery_tables
= true".


I was thinking of a module parameter, such that we can tell it to skip
discovery on module load time etc.



Sure, I will add a module parameter, uncore_no_discover.
If users don't want the discovery feature, they can set 
uncore_no_discover=true.


Thanks,
Kan


Re: [PATCH 1/5] perf/x86/intel/uncore: Parse uncore discovery tables

2021-03-16 Thread Liang, Kan




On 3/16/2021 7:43 AM, Peter Zijlstra wrote:

On Fri, Mar 12, 2021 at 08:34:34AM -0800, kan.li...@linux.intel.com wrote:

From: Kan Liang 

A self-describing mechanism for the uncore PerfMon hardware has been
introduced with the latest Intel platforms. By reading through an MMIO
page worth of information, perf can 'discover' all the standard uncore
PerfMon registers in a machine.

The discovery mechanism relies on BIOS's support. With a proper BIOS,
a PCI device with the unique capability ID 0x23 can be found on each
die. Perf can retrieve the information of all available uncore PerfMons
from the device via MMIO. The information is composed of one global
discovery table and several unit discovery tables.



If a BIOS doesn't support the 'discovery' mechanism, there is nothing
changed.


What if the BIOS got it wrong? Will the driver still get it correct if
it is a known platform?


Yes, I will submit a platform specific patch to fix this case.



Do we need a chicken flag to kill the discovery? uncore_no_discover?



Yes, I plan to introduce a .use_discovery_tables flag to indicate 
whether to use the discovery tables for the known platform.


The below code is part of the upcoming SPR uncore patches.
The first SPR uncore patch will still rely on the BIOS discovery tables, 
because some uncore block information hasn't been published yet. We have 
to retrieve the information from the tables. Once all the information is 
published, we can kill the discovery by removing the 
".use_discovery_tables = true".


+static const struct intel_uncore_init_fun spr_uncore_init __initconst = {
+   .cpu_init = spr_uncore_cpu_init,
+   .pci_init = spr_uncore_pci_init,
+   .mmio_init = spr_uncore_mmio_init,
+   .use_discovery_tables = true,
+};

Thanks,
Kan


Re: [PATCH 1/5] perf/x86/intel/uncore: Parse uncore discovery tables

2021-03-16 Thread Liang, Kan




On 3/16/2021 7:40 AM, Peter Zijlstra wrote:

On Fri, Mar 12, 2021 at 08:34:34AM -0800, kan.li...@linux.intel.com wrote:

+static struct intel_uncore_discovery_type *
+search_uncore_discovery_type(u16 type_id)
+{
+   struct rb_node *node = discovery_tables.rb_node;
+   struct intel_uncore_discovery_type *type;
+
+   while (node) {
+   type = rb_entry(node, struct intel_uncore_discovery_type, node);
+
+   if (type->type > type_id)
+   node = node->rb_left;
+   else if (type->type < type_id)
+   node = node->rb_right;
+   else
+   return type;
+   }
+
+   return NULL;
+}
+
+static struct intel_uncore_discovery_type *
+add_uncore_discovery_type(struct uncore_unit_discovery *unit)
+{
+   struct intel_uncore_discovery_type *type, *cur;
+   struct rb_node **node = &discovery_tables.rb_node;
+   struct rb_node *parent = *node;
+
+   if (unit->access_type >= UNCORE_ACCESS_MAX) {
+   pr_warn("Unsupported access type %d\n", unit->access_type);
+   return NULL;
+   }
+
+   type = kzalloc(sizeof(struct intel_uncore_discovery_type), GFP_KERNEL);
+   if (!type)
+   return NULL;
+
+   type->box_ctrl_die = kcalloc(__uncore_max_dies, sizeof(u64), 
GFP_KERNEL);
+   if (!type->box_ctrl_die)
+   goto free_type;
+
+   type->access_type = unit->access_type;
+   num_discovered_types[type->access_type]++;
+   type->type = unit->box_type;
+
+   while (*node) {
+   parent = *node;
+   cur = rb_entry(parent, struct intel_uncore_discovery_type, 
node);
+
+   if (cur->type > type->type)
+   node = &parent->rb_left;
+   else
+   node = &parent->rb_right;
+   }
+
+   rb_link_node(&type->node, parent, node);
+   rb_insert_color(&type->node, &discovery_tables);
+
+   return type;
+
+free_type:
+   kfree(type);
+
+   return NULL;
+
+}


I'm thinking this can use some of this:

   2d24dd5798d0 ("rbtree: Add generic add and find helpers")



Sure, I will use the generic rbtree framework in V2.
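
For reference, a minimal sketch of what the lookup side could look like 
with the generic helpers; the helper names are assumptions, not the actual 
V2 patch.

/* rb_add()/rb_find() are provided by <linux/rbtree.h> since 2d24dd5798d0. */
static bool discovery_type_less(struct rb_node *a, const struct rb_node *b)
{
	return rb_entry(a, struct intel_uncore_discovery_type, node)->type <
	       rb_entry(b, struct intel_uncore_discovery_type, node)->type;
}

static int discovery_type_cmp(const void *key, const struct rb_node *b)
{
	const u16 *type_id = key;
	const struct intel_uncore_discovery_type *type;

	type = rb_entry(b, struct intel_uncore_discovery_type, node);

	if (*type_id < type->type)
		return -1;
	if (*type_id > type->type)
		return 1;

	return 0;
}

static struct intel_uncore_discovery_type *
search_uncore_discovery_type(u16 type_id)
{
	struct rb_node *node = rb_find(&type_id, &discovery_tables,
				       discovery_type_cmp);

	return node ? rb_entry(node, struct intel_uncore_discovery_type, node) : NULL;
}

In add_uncore_discovery_type(), the open-coded while loop plus 
rb_link_node()/rb_insert_color() would then collapse to a single 
rb_add(&type->node, &discovery_tables, discovery_type_less).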

Thanks,
Kan


Re: [PATCH] Revert "perf/x86: Allow zero PEBS status with only single active event"

2021-03-16 Thread Liang, Kan




On 3/16/2021 3:22 AM, Namhyung Kim wrote:

Hi Peter and Kan,

On Thu, Mar 4, 2021 at 5:22 AM Peter Zijlstra  wrote:


On Wed, Mar 03, 2021 at 02:53:00PM -0500, Liang, Kan wrote:

On 3/3/2021 1:59 PM, Peter Zijlstra wrote:

On Wed, Mar 03, 2021 at 05:42:18AM -0800, kan.li...@linux.intel.com wrote:



+++ b/arch/x86/events/intel/ds.c
@@ -2000,18 +2000,6 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs 
*iregs, struct perf_sample_d
   continue;
   }
- /*
-  * On some CPUs the PEBS status can be zero when PEBS is
-  * racing with clearing of GLOBAL_STATUS.
-  *
-  * Normally we would drop that record, but in the
-  * case when there is only a single active PEBS event
-  * we can assume it's for that event.
-  */
- if (!pebs_status && cpuc->pebs_enabled &&
- !(cpuc->pebs_enabled & (cpuc->pebs_enabled-1)))
- pebs_status = cpuc->pebs_enabled;


Wouldn't something like:

 pebs_status = p->status = cpuc->pebs_enabled;



I didn't consider it as a potential solution in this patch because I don't
think it's proper for SW to modify the buffer, which is supposed to be
manipulated by the HW.


Right, but then HW was supposed to write sane values and it doesn't do
that either ;-)


It's just a personal preference. I don't see any issue here. We may try it.


So I mostly agree with you, but I think it's a shame to unsupport such
chips, HSW is still a plenty useable chip today.


I got a similar issue on ivybridge machines which caused kernel crash.
My case it's related to the branch stack with PEBS events but I think
it's the same issue.  And I can confirm that the above approach of
updating p->status fixed the problem.

I've talked to Stephane about this, and he wants to make it more
robust when we see stale (or invalid) PEBS records.  I'll send the
patch soon.



Hi Namhyung,

In case you didn't see it, I've already submitted a patch to fix the 
issue last Friday.

https://lore.kernel.org/lkml/161298-140216-1-git-send-email-kan.li...@linux.intel.com/
But if you have a more robust proposal, please feel free to submit it.

BTW: The patch set from last Friday also fixed another bug found by the 
perf_fuzzer test. You may be interested.


Thanks,
Kan



Re: [PATCH V2 20/25] perf/x86/intel: Add Alder Lake Hybrid support

2021-03-11 Thread Liang, Kan




On 3/11/2021 2:58 PM, Peter Zijlstra wrote:

On Thu, Mar 11, 2021 at 11:53:35AM -0500, Liang, Kan wrote:


The "cpu_core" PMU is similar to the Sapphire Rapids PMU, but without
PMEM.
The "cpu_atom" PMU is similar to Tremont, but with different
event_constraints, extra_regs and number of counters.



So do these things use the same event lists as SPR and TNT?


No, there will be two new event lists on ADL. One is for the Atom core, and 
the other is for the big core. They are different from SPR and TNT.


*sigh* how different?


The core PMU event list should be similar between SPR and the big core 
of ADL, because they both utilize the Golden Cove core. But the uncore 
PMU event list is totally different for client and server.


The Atom core of ADL utilizes the Gracemont core, which is a successor 
to Tremont. It introduces many new events. We cannot use the Tremont 
event list instead.






My desktop has: cpu/caps/pmu_name and that gives "skylake", do we want
the above to have cpu_core/caps/pmu_name give "sapphire_rapids" etc.. ?



I think the current implementation should be good enough.

$ cat /sys/devices/cpu_atom/caps/pmu_name
alderlake_hybrid

"alderlake_hybrid" tells the perf tool that it's Alder Lake Hybrid system.
"cpu_atom" tells the perf tool that it's for Atom core.


Yeah, but then I have to ask Google wth those atoms and cores actually
are. Why not tell me upfront?

Since we're now working on it, we all know, but in 6 months time nobody
will remember and then we'll constantly have to look it up and curse
ourselves for not doing better.


I think "sapphire_rapids" is the code name for the server platform.
Maybe we should use the code name of the core instead?

$ cat /sys/devices/cpu_atom/caps/pmu_name
gracemont
$ cat /sys/devices/cpu_core/caps/pmu_name
golden_cove


Thanks,
Kan


Re: [PATCH V2 20/25] perf/x86/intel: Add Alder Lake Hybrid support

2021-03-11 Thread Liang, Kan




On 3/11/2021 11:32 AM, Peter Zijlstra wrote:

On Thu, Mar 11, 2021 at 05:09:25PM +0100, Peter Zijlstra wrote:

On Wed, Mar 10, 2021 at 08:37:56AM -0800, kan.li...@linux.intel.com wrote:

From: Kan Liang 

Alder Lake Hybrid system has two different types of core, Golden Cove
core and Gracemont core. The Golden Cove core is registered to
"cpu_core" PMU. The Gracemont core is registered to "cpu_atom" PMU.

The difference between the two PMUs include:
- Number of GP and fixed counters
- Events
- The "cpu_core" PMU supports Topdown metrics.
   The "cpu_atom" PMU supports PEBS-via-PT.

The "cpu_core" PMU is similar to the Sapphire Rapids PMU, but without
PMEM.
The "cpu_atom" PMU is similar to Tremont, but with different
event_constraints, extra_regs and number of counters.




+   /* Initialize big core specific PerfMon capabilities.*/
+   pmu = &x86_pmu.hybrid_pmu[X86_HYBRID_PMU_CORE_IDX];
+   pmu->name = "cpu_core";



+   /* Initialize Atom core specific PerfMon capabilities.*/
+   pmu = &x86_pmu.hybrid_pmu[X86_HYBRID_PMU_ATOM_IDX];
+   pmu->name = "cpu_atom";


So do these things use the same event lists as SPR and TNT? Is there any
way to discover that, because AFAICT /proc/cpuinfo will say every CPU
is 'Alderlake', and the above also doesn't give any clue.

FWIW, ARM big.LITTLE does discriminate in its /proc/cpuinfo, but I'm not
entirely sure it's really useful. Mark said perf userspace uses
something akin to our CPUID, except exposed through sysfs, to find the
event lists.

My desktop has: cpu/caps/pmu_name and that gives "skylake", do we want
the above to have cpu_core/caps/pmu_name give "sapphire_rapids" etc.. ?


FWIW, "Tremont" is the only pmu_name with a capital :-( I don't suppose
we can still fix that?



The cpu/caps/pmu_name is not used for the event list.

I don't think the perf tool checks "Tremont", so it should be OK to 
fix it. Let me post a patch.


Thanks,
Kan


Re: [PATCH V2 20/25] perf/x86/intel: Add Alder Lake Hybrid support

2021-03-11 Thread Liang, Kan




On 3/11/2021 11:53 AM, Liang, Kan wrote:



On 3/11/2021 11:09 AM, Peter Zijlstra wrote:
On Wed, Mar 10, 2021 at 08:37:56AM -0800, kan.li...@linux.intel.com 
wrote:

From: Kan Liang 

Alder Lake Hybrid system has two different types of core, Golden Cove
core and Gracemont core. The Golden Cove core is registered to
"cpu_core" PMU. The Gracemont core is registered to "cpu_atom" PMU.

The difference between the two PMUs include:
- Number of GP and fixed counters
- Events
- The "cpu_core" PMU supports Topdown metrics.
   The "cpu_atom" PMU supports PEBS-via-PT.

The "cpu_core" PMU is similar to the Sapphire Rapids PMU, but without
PMEM.
The "cpu_atom" PMU is similar to Tremont, but with different
event_constraints, extra_regs and number of counters.




+    /* Initialize big core specific PerfMon capabilities.*/
+    pmu = &x86_pmu.hybrid_pmu[X86_HYBRID_PMU_CORE_IDX];
+    pmu->name = "cpu_core";



+    /* Initialize Atom core specific PerfMon capabilities.*/
+    pmu = &x86_pmu.hybrid_pmu[X86_HYBRID_PMU_ATOM_IDX];
+    pmu->name = "cpu_atom";


So do these things use the same event lists as SPR and TNT?


No, there will be two new event lists on ADL. One is for the Atom core, and 
the other is for the big core. They are different from SPR and TNT.



Is there any
way to discover that, because AFAICT /proc/cpuinfo will say every CPU
is 'Alderlake', and the above also doesn't give any clue.



Ricardo once submitted a patch to expose the CPU type under 
/sys/devices/system/cpu, but I don't know the latest status.
https://lore.kernel.org/lkml/20201003011745.7768-5-ricardo.neri-calde...@linux.intel.com/ 






FWIW, ARM big.LITTLE does discriminate in its /proc/cpuinfo, but I'm not
entirely sure it's really useful. Mark said perf userspace uses
something akin to our CPUID, except exposed through sysfs, to find the
event lists.



Ah, I guess I misunderstood the concern. Let me try again.

Here is how the perf tool finds an event name via the event list.

To get the correct event list file, yes, the perf tool relies on the CPUID.
It searches the CPUID table in 
tools/perf/pmu-events/arch/x86/mapfile.csv.

GenuineIntel-6-97,v1,alderlake,core
Now the perf tool knows that the event list file "alderlake" is for CPUID 
model 0x97.

In the event list file for Alder Lake (CPUID 0x97), we add a new 
field "Unit" to distinguish the type of PMU.

"Unit": "cpu_core"
"Unit": "cpu_atom"

So perf can search the event name for a certain type of PMU via the PMU 
name "cpu_core" or "cpu_atom".


Perf tool doesn't use the "cpu_core/caps/pmu_name" for the event list.

Thanks,
Kan


Re: [PATCH V2 20/25] perf/x86/intel: Add Alder Lake Hybrid support

2021-03-11 Thread Liang, Kan




On 3/11/2021 11:09 AM, Peter Zijlstra wrote:

On Wed, Mar 10, 2021 at 08:37:56AM -0800, kan.li...@linux.intel.com wrote:

From: Kan Liang 

Alder Lake Hybrid system has two different types of core, Golden Cove
core and Gracemont core. The Golden Cove core is registered to
"cpu_core" PMU. The Gracemont core is registered to "cpu_atom" PMU.

The differences between the two PMUs include:
- Number of GP and fixed counters
- Events
- The "cpu_core" PMU supports Topdown metrics.
   The "cpu_atom" PMU supports PEBS-via-PT.

The "cpu_core" PMU is similar to the Sapphire Rapids PMU, but without
PMEM.
The "cpu_atom" PMU is similar to Tremont, but with different
event_constraints, extra_regs and number of counters.




+   /* Initialize big core specific PerfMon capabilities.*/
+   pmu = _pmu.hybrid_pmu[X86_HYBRID_PMU_CORE_IDX];
+   pmu->name = "cpu_core";



+   /* Initialize Atom core specific PerfMon capabilities.*/
+   pmu = _pmu.hybrid_pmu[X86_HYBRID_PMU_ATOM_IDX];
+   pmu->name = "cpu_atom";


So do these things use the same event lists as SPR and TNT?


No, there will be two new event lists on ADL. One is for the Atom core,
and the other is for the big core. They are different from the SPR and
TNT event lists.



Is there any
way to discover that, because AFAICT /proc/cpuinfo will say every CPU
is 'Alderlake', and the above also doesn't give any clue.



Ricardo once submitted a patch to expose the CPU type under 
/sys/devices/system/cpu, but I don't know the latest status.

https://lore.kernel.org/lkml/20201003011745.7768-5-ricardo.neri-calde...@linux.intel.com/




FWIW, ARM big.LITTLE does discriminate in its /proc/cpuinfo, but I'm not
entirely sure it's really useful. Mark said perf userspace uses
something akin to our CPUID, except exposed through sysfs, to find the
event lists.



The perf tool can use the pmu name.
For perf stat -e cpu_atom/EVENT_NAME/, perf will apply the event list
for Atom.
For perf stat -e cpu_core/EVENT_NAME/, perf will apply the event list
for the big core.
For perf stat -e EVENT_NAME, the perf tool will check whether EVENT_NAME
exists. If it's available in both event lists, perf will automatically
create two events, i.e., perf stat -e cpu_atom/EVENT_NAME/,cpu_core/EVENT_NAME/.
If the event name is only available for a certain type, e.g., Atom, the
perf tool will only apply the corresponding event, e.g., perf stat -e
cpu_atom/EVENT_NAME/.



My desktop has: cpu/caps/pmu_name and that gives "skylake", do we want
the above to have cpu_core/caps/pmu_name give "sapphire_rapids" etc.. ?



I think the current implementation should be good enough.

$ cat /sys/devices/cpu_atom/caps/pmu_name
alderlake_hybrid

"alderlake_hybrid" tells the perf tool that it's Alder Lake Hybrid system.
"cpu_atom" tells the perf tool that it's for Atom core.


Thanks,
Kan


Re: [PATCH V2 16/25] perf/x86: Register hybrid PMUs

2021-03-11 Thread Liang, Kan




On 3/11/2021 7:34 AM, Peter Zijlstra wrote:

On Wed, Mar 10, 2021 at 08:37:52AM -0800, kan.li...@linux.intel.com wrote:

@@ -2092,9 +2105,37 @@ static int __init init_hw_perf_events(void)
if (err)
goto out1;
  
-	err = perf_pmu_register(, "cpu", PERF_TYPE_RAW);

-   if (err)
-   goto out2;
+   if (!is_hybrid()) {
+   err = perf_pmu_register(, "cpu", PERF_TYPE_RAW);
+   if (err)
+   goto out2;
+   } else {
+   u8 cpu_type = get_hybrid_cpu_type(smp_processor_id());
+   struct x86_hybrid_pmu *hybrid_pmu;
+   int i;
+
+   for (i = 0; i < x86_pmu.num_hybrid_pmus; i++) {
+   hybrid_pmu = _pmu.hybrid_pmu[i];
+
+   hybrid_pmu->pmu = pmu;
+   hybrid_pmu->pmu.type = -1;
+   hybrid_pmu->pmu.attr_update = x86_pmu.attr_update;
+   hybrid_pmu->pmu.capabilities |= 
PERF_PMU_CAP_HETEROGENEOUS_CPUS;
+
+   /* Only register the PMU for the boot CPU */


Why ?!

AFAICT we could register them all here. That instantly fixes that
CPU_STARTING / CPU_DEAD fail elsewhere in this patch.


It's possible that all CPUs of a certain type are offline, but I cannot
know that here, because the boot CPU is the only online CPU at this
point. I don't know the status of the other CPUs.


If we unconditionally register all PMUs, users may see a PMU in
/sys/devices but cannot use it, because there is no available CPU of
that type.

Is it acceptable to register such an empty PMU?


Thanks,
Kan



Re: [PATCH V2 08/25] perf/x86: Hybrid PMU support for hardware cache event

2021-03-11 Thread Liang, Kan




On 3/11/2021 6:07 AM, Peter Zijlstra wrote:

On Wed, Mar 10, 2021 at 08:37:44AM -0800, kan.li...@linux.intel.com wrote:

From: Kan Liang 

The hardware cache events are different among hybrid PMUs. Each hybrid
PMU should have its own hw cache event table.

Reviewed-by: Andi Kleen 
Signed-off-by: Kan Liang 
---
  arch/x86/events/core.c   | 11 +--
  arch/x86/events/perf_event.h |  9 +
  2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 039a851..1db4a67 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -352,6 +352,7 @@ set_ext_hw_attr(struct hw_perf_event *hwc, struct 
perf_event *event)
  {
struct perf_event_attr *attr = >attr;
unsigned int cache_type, cache_op, cache_result;
+   struct x86_hybrid_pmu *pmu = is_hybrid() ? hybrid_pmu(event->pmu) : 
NULL;
u64 config, val;
  
  	config = attr->config;

@@ -371,7 +372,10 @@ set_ext_hw_attr(struct hw_perf_event *hwc, struct 
perf_event *event)
return -EINVAL;
cache_result = array_index_nospec(cache_result, 
PERF_COUNT_HW_CACHE_RESULT_MAX);
  
-	val = hw_cache_event_ids[cache_type][cache_op][cache_result];

+   if (pmu)
+   val = 
pmu->hw_cache_event_ids[cache_type][cache_op][cache_result];
+   else
+   val = hw_cache_event_ids[cache_type][cache_op][cache_result];
  
  	if (val == 0)

return -ENOENT;
@@ -380,7 +384,10 @@ set_ext_hw_attr(struct hw_perf_event *hwc, struct 
perf_event *event)
return -EINVAL;
  
  	hwc->config |= val;

-   attr->config1 = hw_cache_extra_regs[cache_type][cache_op][cache_result];
+   if (pmu)
+   attr->config1 = 
pmu->hw_cache_extra_regs[cache_type][cache_op][cache_result];
+   else
+   attr->config1 = 
hw_cache_extra_regs[cache_type][cache_op][cache_result];


Why not:

attr->config1 = hybrid(event->pmu, 
hw_cache_extra_regs)[cache_type][cache_op][cache_result];

?


Because hw_cache_extra_regs is not part of the struct x86_pmu.
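
For context, a rough sketch of how I read the hybrid() accessor in your
diff (illustrative only, not the final code):

#define hybrid(_pmu, _field)						\
(*({									\
	typeof(&x86_pmu._field) __Fp = &x86_pmu._field;			\
									\
	if (is_hybrid() && (_pmu))					\
		__Fp = &hybrid_pmu(_pmu)->_field;			\
									\
	__Fp;								\
}))

It needs the field to exist in struct x86_pmu for the non-hybrid
fallback. hw_cache_extra_regs is a standalone global array, so either it
has to be moved into struct x86_pmu, or a variant of the macro that
falls back to the global variable instead of an x86_pmu field would be
needed.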


Thanks,
Kan


Re: [PATCH V2 1/25] x86/cpufeatures: Enumerate Intel Hybrid Technology feature bit

2021-03-10 Thread Liang, Kan




On 3/10/2021 5:25 PM, Ricardo Neri wrote:

On Wed, Mar 10, 2021 at 09:01:47PM +0100, Borislav Petkov wrote:

On Wed, Mar 10, 2021 at 11:46:44AM -0800, Ricardo Neri wrote:

But this series provides the use case, right? Kan's patches handle PMU counters
that may differ cross types of CPUs. In patch 2, get_hybrid_params()
needs to check first if X86_FEATURE_HYBRID_CPU is enabled before
querying the hybrid parameters. Otherwise, we would need to rely on the
maximum level of CPUID, which may not be reliable.


On Wed, Mar 10, 2021 at 11:33:54AM -0800, Srinivas Pandruvada wrote:

We are working on changes to P-State driver for hybrid CPUs using this
define. They are still work in progress.
But this patch can be submitted later with our set of changes.


Answering to both with a single mail:

I don't have a problem with X86_FEATURE_HYBRID_CPU - I simply don't want
to show "hybrid_cpu" in /proc/cpuinfo unless there's a valid use case
for userspace to know that it is running on a hybrid CPU.


Ah, I get your point now. You would like to see

#define X86_FEATURE_HYBRID_CPU   (18*32+15) /* "" This part has 
CPUs of more than one type */

Right? Now your first comment makes sense.

Srinivas, Kan, I don't think we need to expose "hybrid_cpu" in
/proc/cpuinfo, do we?



Right, Perf doesn't use the "hybrid_cpu" in /proc/cpuinfo.

Thanks,
Kan


Re: [PATCH V2 16/25] perf/x86: Register hybrid PMUs

2021-03-10 Thread Liang, Kan




On 3/10/2021 11:50 AM, Dave Hansen wrote:

On 3/10/21 8:37 AM, kan.li...@linux.intel.com wrote:

-   err = perf_pmu_register(, "cpu", PERF_TYPE_RAW);
-   if (err)
-   goto out2;
+   if (!is_hybrid()) {
+   err = perf_pmu_register(, "cpu", PERF_TYPE_RAW);
+   if (err)
+   goto out2;
+   } else {
+   u8 cpu_type = get_hybrid_cpu_type(smp_processor_id());
+   struct x86_hybrid_pmu *hybrid_pmu;
+   int i;


Where's the preempt_disable()?


+static void init_hybrid_pmu(int cpu)
+{
+   unsigned int fixed_mask, unused_eax, unused_ebx, unused_edx;
+   struct cpu_hw_events *cpuc = _cpu(cpu_hw_events, cpu);
+   u8 cpu_type = get_hybrid_cpu_type(cpu);
+   struct x86_hybrid_pmu *pmu = NULL;
+   struct perf_cpu_context *cpuctx;
+   int i;


Ditto.

Are we really sure the IPIs are worth the trouble?  Why don't we just
cache the leaf when we bring the CPU up like just about every other
thing we read from CPUID?


Simple answer: You are right. We don't need it. A simple 
get_this_hybrid_cpu_type() should be fine for perf.


Here is the complete story.
I need the CPU type of the dead CPU in the cpu_dead callback. In the
previous patch set, we could read it from the cached CPU type in struct
cpuinfo_x86.


In the V2 patch, I tried to do a similar thing (but I have to read it at
runtime). Since the CPU is offlined, I asked Ricardo to provide the
function get_hybrid_cpu_type(), which can read the CPU type of a
specific CPU. But I was wrong: we cannot retrieve a valid CPU type from
an offlined CPU. So I dropped that method and used another way to
retrieve the information, but I didn't ask Ricardo to update the
function. My bad.


Now, we only need to read the CPU type of the current CPU. A 
get_this_hybrid_cpu_type() function is enough for me.


I think we can get rid of the IPI trouble with the new
get_this_hybrid_cpu_type() in the next version. We shouldn't need the
preempt_disable() either.
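
Something like the below is what I have in mind -- a sketch only,
assuming CPUID leaf 0x1A reports the core type of the CPU executing the
instruction in EAX[31:24]:

u8 get_this_hybrid_cpu_type(void)
{
	if (!cpu_feature_enabled(X86_FEATURE_HYBRID_CPU))
		return 0;

	/* CPUID.1A.EAX[31:24] is the type of the core we are running on */
	return cpuid_eax(0x0000001a) >> 24;
}

Since it only reads the CPUID of the CPU it is currently running on, no
IPI is needed.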


Thanks for pointing that out.

Kan



Re: [PATCH 00/49] Add Alder Lake support for perf

2021-03-05 Thread Liang, Kan




On 3/5/2021 6:14 AM, Peter Zijlstra wrote:

On Thu, Mar 04, 2021 at 06:50:00PM +0100, Peter Zijlstra wrote:

On Thu, Mar 04, 2021 at 10:50:45AM -0500, Liang, Kan wrote:

Hi Peter,

Could you please take a look at the perf kernel patches (3-25)?

By now, we have got some comments regarding the generic hybrid feature
enumeration code and perf tool patches. I would appreciate it very much if
you could comment on the perf kernel patches.



Yeah, I started staring at it yesterday.. I'll try and respond tomorrow.


OK, so STYLE IS HORRIBLE, please surrender your CAPS-LOCK key until
further notice.

The below is a completely unfinished 'diff' against 3-20. It has various
notes interspersed.

Please rework.



Thanks for the detailed comments. I will rework the code accordingly.

Thanks,
Kan



--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -44,11 +44,13 @@
  
  #include "perf_event.h"
  
+static struct pmu pmu;

+
  struct x86_pmu x86_pmu __read_mostly;
  
  DEFINE_PER_CPU(struct cpu_hw_events, cpu_hw_events) = {

.enabled = 1,
-   .hybrid_pmu_idx = X86_NON_HYBRID_PMU,
+   .pmu = 
  };
  
  DEFINE_STATIC_KEY_FALSE(rdpmc_never_available_key);

@@ -184,7 +186,7 @@ static inline int get_possible_num_count
  {
int bit, num_counters = 0;
  
-	if (!IS_X86_HYBRID)

+   if (!is_hybrid())
return x86_pmu.num_counters;
  
  	for_each_set_bit(bit, _pmu.hybrid_pmu_bitmap, X86_HYBRID_PMU_MAX_INDEX)

@@ -270,6 +272,9 @@ bool check_hw_exists(int num_counters, i
if (ret)
goto msr_fail;
for (i = 0; i < num_counters_fixed; i++) {
+   /*
+* XXX comment that explains why/how NULL below works.
+*/
if (fixed_counter_disabled(i, NULL))
continue;
if (val & (0x03 << i*4)) {
@@ -352,7 +357,6 @@ set_ext_hw_attr(struct hw_perf_event *hw
  {
struct perf_event_attr *attr = >attr;
unsigned int cache_type, cache_op, cache_result;
-   struct x86_hybrid_pmu *pmu = IS_X86_HYBRID ? container_of(event->pmu, 
struct x86_hybrid_pmu, pmu) : NULL;
u64 config, val;
  
  	config = attr->config;

@@ -372,10 +376,7 @@ set_ext_hw_attr(struct hw_perf_event *hw
return -EINVAL;
cache_result = array_index_nospec(cache_result, 
PERF_COUNT_HW_CACHE_RESULT_MAX);
  
-	if (pmu)

-   val = 
pmu->hw_cache_event_ids[cache_type][cache_op][cache_result];
-   else
-   val = hw_cache_event_ids[cache_type][cache_op][cache_result];
+   val = hybrid(event->pmu, 
hw_cache_event_ids)[cache_type][cache_op][cache_result];
  
  	if (val == 0)

return -ENOENT;
@@ -384,10 +385,7 @@ set_ext_hw_attr(struct hw_perf_event *hw
return -EINVAL;
  
  	hwc->config |= val;

-   if (pmu)
-   attr->config1 = 
pmu->hw_cache_extra_regs[cache_type][cache_op][cache_result];
-   else
-   attr->config1 = 
hw_cache_extra_regs[cache_type][cache_op][cache_result];
+   attr->config1 = hybrid(event->pmu, 
hw_cache_extra_regs)[cache_type][cache_op][cache_result];
return x86_pmu_extra_regs(val, event);
  }
  
@@ -742,13 +740,11 @@ void x86_pmu_enable_all(int added)

}
  }
  
-static struct pmu pmu;

-
  static inline int is_x86_event(struct perf_event *event)
  {
int bit;
  
-	if (!IS_X86_HYBRID)

+   if (!is_hybrid())
return event->pmu == 
  
  	for_each_set_bit(bit, _pmu.hybrid_pmu_bitmap, X86_HYBRID_PMU_MAX_INDEX) {

@@ -760,6 +756,7 @@ static inline int is_x86_event(struct pe
  
  struct pmu *x86_get_pmu(void)

  {
+   /* borken */
return 
  }
  /*
@@ -963,7 +960,7 @@ EXPORT_SYMBOL_GPL(perf_assign_events);
  
  int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)

  {
-   int num_counters = X86_HYBRID_READ_FROM_CPUC(num_counters, cpuc);
+   int num_counters = hybrid(cpuc->pmu, num_counters);
struct event_constraint *c;
struct perf_event *e;
int n0, i, wmin, wmax, unsched = 0;
@@ -1124,7 +1121,7 @@ static void del_nr_metric_event(struct c
  static int collect_event(struct cpu_hw_events *cpuc, struct perf_event *event,
 int max_count, int n)
  {
-   union perf_capabilities intel_cap = 
X86_HYBRID_READ_FROM_CPUC(intel_cap, cpuc);
+   union perf_capabilities intel_cap = hybrid(cpuc->pmu, intel_cap);
  
  	if (intel_cap.perf_metrics && add_nr_metric_event(cpuc, event))

return -EINVAL;
@@ -1147,8 +1144,8 @@ static int collect_event(struct cpu_hw_e
   */
  static int collect_events(struct cpu_hw_events *cpuc, struct perf_event 
*leader, bool dogrp)
  {
-   int num_counters = X86_HYBRID_READ_FROM_CPUC(num_counters, cpuc);
-   int num_counters_fixed = X86_HYBRID_REA

Re: [perf] perf_fuzzer causes unchecked MSR access error

2021-03-04 Thread Liang, Kan




On 3/3/2021 3:22 PM, Vince Weaver wrote:

On Wed, 3 Mar 2021, Liang, Kan wrote:


We never use bit 58. It should be a new issue.


Actually, KVM uses it. It creates a fake event called VLBR_EVENT, which
uses bit 58. It was introduced by commit 097e4311cda9 ("perf/x86:
Add constraint to create guest LBR event without hw counter").


Since it's a fake event, it doesn't support PEBS. Perf should reject it 
if it sets the precise_ip.


The below patch should fix the MSR access error.

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 5bac48d..1ea3c67 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3659,6 +3659,10 @@ static int intel_pmu_hw_config(struct perf_event 
*event)

return ret;

if (event->attr.precise_ip) {
+
+		if ((event->attr.config & INTEL_ARCH_EVENT_MASK) == INTEL_FIXED_VLBR_EVENT)
+			return -EINVAL;
+
 		if (!(event->attr.freq || (event->attr.wakeup_events && !event->attr.watermark))) {

event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD;
if (!(event->attr.sample_type &

Thanks,
Kan


Re: [PATCH 00/49] Add Alder Lake support for perf

2021-03-04 Thread Liang, Kan

Hi Peter,

Could you please take a look at the perf kernel patches (3-25)?

By now, we have got some comments regarding the generic hybrid feature 
enumeration code and perf tool patches. I would appreciate it very much 
if you could comment on the perf kernel patches.


Thanks,
Kan

On 2/8/2021 10:24 AM, kan.li...@linux.intel.com wrote:

From: Kan Liang 

(The V1 patchset is a complete patchset for the Alder Lake support on
the Linux perf. It includes both kernel patches (1-25) and the user
space patches (26-49). It tries to give the maintainers/reviewers an
overall picture of the ADL enabling patches. The number of patches is
huge; sorry for that. For future versions, the patchset will be
divided into the kernel patch series and the userspace patch series.
They can be reviewed separately.)

Alder Lake uses a hybrid architecture utilizing Golden Cove cores
and Gracemont cores. On such architectures, all CPUs support the same,
homogeneous and symmetric, instruction set. Also, CPUID enumerates
the same features for all CPUs. There may be model-specific differences,
such as those addressed in this patchset.

The first two patches enumerate the hybrid CPU feature bit and save the
CPU type in a new field x86_cpu_type in struct cpuinfo_x86 for the
following patches. They were posted previously[1] but not merged.
Compared with the initial submission, they address the below two
concerns[2][3],
- Provide a good use case, PMU.
- Clarify what Intel Hybrid Technology is and is not.

The PMU capabilities for Golden Cove core and Gracemont core are not the
same. The key differences include the number of counters, events, perf
metrics feature, and PEBS-via-PT feature. A dedicated hybrid PMU has to
be registered for each of them. However, the current perf X86 assumes
that there is only one CPU PMU. To handle the hybrid PMUs, the patchset
- Introduce a new struct x86_hybrid_pmu to save the unique capabilities
   from different PMUs. It's part of the global x86_pmu. The architecture
   capabilities, which are available for all PMUs, are still saved in
   the global x86_pmu. I once considered dynamically creating a dedicated
   x86_pmu and pmu for each hybrid PMU. If so, they would have to be
   changed to pointers. Since they are used everywhere, the changes could
   be huge and complex. Also, most of the PMU capabilities are the same
   between hybrid PMUs, so duplicated data would be saved many times in
   the big x86_pmu structure. So the dynamic way was dropped.
- The hybrid PMU registration has been moved to the cpu_starting(),
   because only boot CPU is available when invoking the
   init_hw_perf_events().
- Hybrid PMUs have different events and formats. Add new structures and
   helpers for events attribute and format attribute which take the PMU
   type into account.
- Add a PMU aware version PERF_TYPE_HARDWARE_PMU and
   PERF_TYPE_HW_CACHE_PMU to facilitate user space tools

The uncore, MSR and cstate are the same between hybrid CPUs.
Don't need to register hybrid PMUs for them.

The generic code kernel/events/core.c is not hybrid friendly either,
especially for the per-task monitoring. Peter once proposed a
patchset[4], but it hasn't been merged. This patchset doesn't intend to
improve the generic code (which can be improved later separately). It
still uses the capability PERF_PMU_CAP_HETEROGENEOUS_CPUS for each
hybrid PMUs. For per-task and system-wide monitoring, user space tools
have to create events on all available hybrid PMUs. The events which are
from different hybrid PMUs cannot be included in the same group.

[1]. 
https://lore.kernel.org/lkml/20201002201931.2826-1-ricardo.neri-calde...@linux.intel.com/
[2]. https://lore.kernel.org/lkml/20201002203452.ge17...@zn.tnic/
[3]. https://lore.kernel.org/lkml/87r1qgccku@nanos.tec.linutronix.de/
[4]. 
https://lkml.kernel.org/r/20181010104559.go5...@hirez.programming.kicks-ass.net/

Jin Yao (24):
   perf jevents: Support unit value "cpu_core" and "cpu_atom"
   perf util: Save pmu name to struct perf_pmu_alias
   perf pmu: Save detected hybrid pmus to a global pmu list
   perf pmu: Add hybrid helper functions
   perf list: Support --cputype option to list hybrid pmu events
   perf stat: Hybrid evsel uses its own cpus
   perf header: Support HYBRID_TOPOLOGY feature
   perf header: Support hybrid CPU_PMU_CAPS
   tools headers uapi: Update tools's copy of linux/perf_event.h
   perf parse-events: Create two hybrid hardware events
   perf parse-events: Create two hybrid cache events
   perf parse-events: Support hardware events inside PMU
   perf list: Display pmu prefix for partially supported hybrid cache
 events
   perf parse-events: Support hybrid raw events
   perf stat: Support --cputype option for hybrid events
   perf stat: Support metrics with hybrid events
   perf evlist: Create two hybrid 'cycles' events by default
   perf stat: Add default hybrid events
   perf stat: Uniquify hybrid event name
   perf stat: Merge event counts from all hybrid PMUs
   perf 

Re: [PATCH] Revert "perf/x86: Allow zero PEBS status with only single active event"

2021-03-03 Thread Liang, Kan




On 3/3/2021 1:59 PM, Peter Zijlstra wrote:

On Wed, Mar 03, 2021 at 05:42:18AM -0800, kan.li...@linux.intel.com wrote:


For some old CPUs (HSW and earlier), the PEBS status in a PEBS record
may be mistakenly set to 0. To minimize the impact of the defect, the
commit was introduced to try to avoid dropping the PEBS record for some
cases. It adds a check in the intel_pmu_drain_pebs_nhm(), and updates
the local pebs_status accordingly. However, it doesn't correct the PEBS
status in the PEBS record, which may trigger the crash, especially for
the large PEBS.

It's possible that all the PEBS records in a large PEBS have the PEBS
status 0. If so, the first get_next_pebs_record_by_bit() in the
__intel_pmu_pebs_event() returns NULL. The at = NULL. Since it's a large
PEBS, the 'count' parameter must > 1. The second
get_next_pebs_record_by_bit() will crash.

Two solutions were considered to fix the crash.
- Keep the SW workaround and add extra checks in the
   get_next_pebs_record_by_bit() to workaround the issue. The
   get_next_pebs_record_by_bit() is a critical path. The extra checks
   will bring extra overhead for the latest CPUs which don't have the
   defect. Also, the defect can only be observed on some old CPUs
   (For example, the issue can be reproduced on an HSW client, but I
   didn't observe the issue on my Haswell server machine.). The impact
   of the defect should be limit.
   This solution is dropped.
- Drop the SW workaround and revert the commit.
   It seems that the commit never works, because the PEBS status in the
   PEBS record never be changed. The get_next_pebs_record_by_bit() only
   checks the PEBS status in the PEBS record. The record is dropped
   eventually. Reverting the commit should not change the current
   behavior.



+++ b/arch/x86/events/intel/ds.c
@@ -2000,18 +2000,6 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs 
*iregs, struct perf_sample_d
continue;
}
  
-		/*

-* On some CPUs the PEBS status can be zero when PEBS is
-* racing with clearing of GLOBAL_STATUS.
-*
-* Normally we would drop that record, but in the
-* case when there is only a single active PEBS event
-* we can assume it's for that event.
-*/
-   if (!pebs_status && cpuc->pebs_enabled &&
-   !(cpuc->pebs_enabled & (cpuc->pebs_enabled-1)))
-   pebs_status = cpuc->pebs_enabled;


Wouldn't something like:

pebs_status = p->status = cpus->pebs_enabled;



I didn't consider it a potential solution in this patch because I don't
think it's proper for SW to modify a buffer which is supposed to be
manipulated by the HW.

It's just a personal preference. I don't see any issue with it. We may try it.

Vince, could you please help check whether Peter's suggestion fixes the 
crash?
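
To be concrete, I think it would look roughly like this in
intel_pmu_drain_pebs_nhm() (untested sketch, reading "cpus->" as
"cpuc->"):

		/*
		 * On some CPUs the PEBS status can be zero when PEBS is
		 * racing with clearing of GLOBAL_STATUS.
		 *
		 * Normally we would drop that record, but in the
		 * case when there is only a single active PEBS event
		 * we can assume it's for that event.
		 *
		 * Also fix up the record itself, so that the later
		 * get_next_pebs_record_by_bit() walk sees a non-zero status.
		 */
		if (!pebs_status && cpuc->pebs_enabled &&
		    !(cpuc->pebs_enabled & (cpuc->pebs_enabled - 1)))
			pebs_status = p->status = cpuc->pebs_enabled;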


Thanks,
Kan


actually fix things without adding overhead?



Re: [perf] perf_fuzzer causes unchecked MSR access error

2021-03-03 Thread Liang, Kan




On 3/3/2021 2:28 PM, Stephane Eranian wrote:

On Wed, Mar 3, 2021 at 10:16 AM Vince Weaver  wrote:


Hello

on my Haswell machine the perf_fuzzer managed to trigger this message:

[117248.075892] unchecked MSR access error: WRMSR to 0x3f1 (tried to write 
0x0400) at rIP: 0x8106e4f4 (native_write_msr+0x4/0x20)
[117248.089957] Call Trace:
[117248.092685]  intel_pmu_pebs_enable_all+0x31/0x40
[117248.097737]  intel_pmu_enable_all+0xa/0x10
[117248.102210]  __perf_event_task_sched_in+0x2df/0x2f0
[117248.107511]  finish_task_switch.isra.0+0x15f/0x280
[117248.112765]  schedule_tail+0xc/0x40
[117248.116562]  ret_from_fork+0x8/0x30

that shouldn't be possible, should it?  MSR 0x3f1 is MSR_IA32_PEBS_ENABLE


Not possible, bit 58 is not defined in PEBS_ENABLE, AFAIK.



this is on recent-git with the patch causing the pebs-related crash
reverted.



We never use bit 58. It should be a new issue.
Is it repeatable?

Thanks,
Kan


Re: [PATCH] perf test: Test case 27 fails on s390 and non-x86 platforms

2021-03-02 Thread Liang, Kan




On 3/2/2021 12:08 PM, Thomas Richter wrote:

On 3/2/21 4:23 PM, Liang, Kan wrote:



On 3/2/2021 9:48 AM, Thomas Richter wrote:

On 3/2/21 3:03 PM, Liang, Kan wrote:


+ Athira Rajeev

On 3/2/2021 8:31 AM, Thomas Richter wrote:

Executing perf test 27 fails on s390:
    [root@t35lp46 perf]# ./perf test -Fv 27
    27: Sample parsing
    --- start ---
     end 
    Sample parsing: FAILED!
    [root@t35lp46 perf]#

The root cause is
commit c7444297fd3769 ("perf test: Support PERF_SAMPLE_WEIGHT_STRUCT")
This commit introduced a test case for PERF_SAMPLE_WEIGHT_STRUCT
but does not adjust non-x86 weak linkage functions.

The error is in test__sample_parsing() --> do_test()
Function do_test() defines two structures of type struct perf_sample named
sample and sample_out. The first sets member sample.ins_lat = 117

Structure sample_out is constructed dynamically using functions
perf_event__synthesize_sample() and evsel__parse_sample().
Both functions have an x86 specific function version which sets member
ins_lat. The weak common functions do not set member ins_lat.



I don't think Power supports the instruction latency. As a request from Athira 
Rajeev, I moved the PERF_SAMPLE_WEIGHT_STRUCT to the X86 specific codes.
https://lore.kernel.org/lkml/d97fef4f-dd88-4760-885e-9a6161a9b...@linux.vnet.ibm.com/
https://lore.kernel.org/lkml/1612540912-6562-1-git-send-email-kan.li...@linux.intel.com/

I don't think we want to add the ins_lat back in the weak common functions.

Could you please update the perf test and don't apply the 
PERF_SAMPLE_WEIGHT_STRUCT for the non-X86 platform?


I used offical linux git tree
   [root@t35lp46 perf]# git tag | fgrep 5.12
v5.12-rc1
[root@t35lp46 perf]#

So this change is in the pipe. I do not plan to revert individual patches.


No, we shouldn't revert the patch.
I mean can you fix the issue in perf test?
Don't test ins_lat or PERF_SAMPLE_WEIGHT_STRUCT for a non-X86 platform.


That would be very ugly code. We would end up in conditional compiles like
#ifdef __s390x__
#endif
and other architectures like ARM/POWER etc. will come along. This is
something I want to avoid.



The ins_lat is a model-specific variable. Maybe we should move it to an
arch-specific test.




And this fix only touches perf, not the kernel.


The patch changes the behavior of PERF_SAMPLE_WEIGHT. The high 32 bits
will be dropped, which could cause problems if they contain valid
information.
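
For reference, the uapi layout is roughly the below (the big-endian
variant reverses the struct members):

union perf_sample_weight {
	__u64	full;		/* PERF_SAMPLE_WEIGHT: all 64 bits are valid */
	struct {		/* PERF_SAMPLE_WEIGHT_STRUCT, little endian */
		__u32	var1_dw;	/* e.g. memory access latency */
		__u16	var2_w;		/* e.g. instruction latency */
		__u16	var3_w;
	};
};

So masking with 0xffffffff is only correct for the _STRUCT case. A plain
PERF_SAMPLE_WEIGHT user may have valid data in the upper 32 bits, and
that data would be silently lost.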








Later in function samples_same() both data in variable sample and sample_out
are compared. The comparison fails because sample.ins_lat is 117
and samples_out.ins_lat is 0, the weak functions never set member ins_lat.

Output after:
    [root@t35lp46 perf]# ./perf test -Fv 27
    27: Sample parsing
    --- start ---
     end 
    Sample parsing: Ok
[root@t35lp46 perf]#

Fixes:
commit c7444297fd3769 ("perf test: Support PERF_SAMPLE_WEIGHT_STRUCT")


I think the regression should start from
commit fbefe9c2f87f ("perf tools: Support arch specific PERF_SAMPLE_WEIGHT_STRUCT 
processing")


Thanks,
Kan


Kan,

I do not follow you. Your commit c7444297fd3769d10c7ffb52c81d71503b3e268f
adds this line

@@ -242,6 +245,7 @@ static int do_test(u64 sample_type, u64 sample_regs, u64 
read_format)
  .cgroup = 114,
  .data_page_size = 115,
  .code_page_size = 116,
+   .ins_lat    = 117,

And this assignment 117 breaks the test. As mentioned before, member ins_lat is 
never touched
by the weak functions.



Here is the timeline for the patches.

1. The commit c7444297fd3769 and other SPR patches are merged at 2021-02-08. At 
that time, I don't think we have this issue. perf test should work well.


Nope, that line above 'ins_lat = 117.' breaks the test. Comment it out and it 
works well!!!


If you revert commit fbefe9c2f87f, perf test should work well too.

The root cause of the issue is that commit fbefe9c2f87f changed the
ins_lat to a model-specific variable, but perf test still verifies the
variable in the generic test.


The below patch moves the PERF_SAMPLE_WEIGHT test into an X86-specific test.

Does it work for you?

---
 tools/perf/arch/x86/include/arch-tests.h   |   1 +
 tools/perf/arch/x86/tests/Build|   1 +
 tools/perf/arch/x86/tests/arch-tests.c |   4 +
 tools/perf/arch/x86/tests/sample-parsing.c | 125 
+

 tools/perf/tests/sample-parsing.c  |   4 -
 5 files changed, 131 insertions(+), 4 deletions(-)
 create mode 100644 tools/perf/arch/x86/tests/sample-parsing.c

diff --git a/tools/perf/arch/x86/include/arch-tests.h 
b/tools/perf/arch/x86/include/arch-tests.h

index 6a54b94..0e20f3d 100644
--- a/tools/perf/arch/x86/include/arch-tests.h
+++ b/tools/perf/arch/x86/include/arch-tests.h
@@ -10,6 +10,7 @@ int test__rdpmc(struct test *test __maybe_unused, int 
subtest);

 int test__insn_x86(struct test *test __maybe_unused, int subtest)

Re: [PATCH] perf test: Test case 27 fails on s390 and non-x86 platforms

2021-03-02 Thread Liang, Kan




On 3/2/2021 9:48 AM, Thomas Richter wrote:

On 3/2/21 3:03 PM, Liang, Kan wrote:


+ Athira Rajeev

On 3/2/2021 8:31 AM, Thomas Richter wrote:

Executing perf test 27 fails on s390:
   [root@t35lp46 perf]# ./perf test -Fv 27
   27: Sample parsing
   --- start ---
    end 
   Sample parsing: FAILED!
   [root@t35lp46 perf]#

The root cause is
commit c7444297fd3769 ("perf test: Support PERF_SAMPLE_WEIGHT_STRUCT")
This commit introduced a test case for PERF_SAMPLE_WEIGHT_STRUCT
but does not adjust non-x86 weak linkage functions.

The error is in test__sample_parsing() --> do_test()
Function do_test() defines two structures of type struct perf_sample named
sample and sample_out. The first sets member sample.ins_lat = 117

Structure sample_out is constructed dynamically using functions
perf_event__synthesize_sample() and evsel__parse_sample().
Both functions have an x86 specific function version which sets member
ins_lat. The weak common functions do not set member ins_lat.



I don't think Power supports the instruction latency. As a request from Athira 
Rajeev, I moved the PERF_SAMPLE_WEIGHT_STRUCT to the X86 specific codes.
https://lore.kernel.org/lkml/d97fef4f-dd88-4760-885e-9a6161a9b...@linux.vnet.ibm.com/
https://lore.kernel.org/lkml/1612540912-6562-1-git-send-email-kan.li...@linux.intel.com/

I don't think we want to add the ins_lat back in the weak common functions.

Could you please update the perf test and don't apply the 
PERF_SAMPLE_WEIGHT_STRUCT for the non-X86 platform?


I used offical linux git tree
  [root@t35lp46 perf]# git tag | fgrep 5.12
v5.12-rc1
[root@t35lp46 perf]#

So this change is in the pipe. I do not plan to revert individual patches.


No, we shouldn't revert the patch.
I mean can you fix the issue in perf test?
Don't test ins_lat or PERF_SAMPLE_WEIGHT_STRUCT for a non-X86 platform.





Later in function samples_same() both data in variable sample and sample_out
are compared. The comparison fails because sample.ins_lat is 117
and samples_out.ins_lat is 0, the weak functions never set member ins_lat.

Output after:
   [root@t35lp46 perf]# ./perf test -Fv 27
   27: Sample parsing
   --- start ---
    end 
   Sample parsing: Ok
[root@t35lp46 perf]#

Fixes:
commit c7444297fd3769 ("perf test: Support PERF_SAMPLE_WEIGHT_STRUCT")


I think the regression should start from
commit fbefe9c2f87f ("perf tools: Support arch specific PERF_SAMPLE_WEIGHT_STRUCT 
processing")


Thanks,
Kan


Kan,

I do not follow you. Your commit c7444297fd3769d10c7ffb52c81d71503b3e268f
adds this line

@@ -242,6 +245,7 @@ static int do_test(u64 sample_type, u64 sample_regs, u64 
read_format)
 .cgroup = 114,
 .data_page_size = 115,
 .code_page_size = 116,
+   .ins_lat= 117,

And this assignment 117 breaks the test. As mentioned before, member ins_lat is 
never touched
by the weak functions.



Here is the timeline for the patches.

1. The commit c7444297fd3769 and other SPR patches were merged on
2021-02-08. At that time, I don't think we had this issue; perf test
should have worked well.
2. Athira Rajeev told me that Power doesn't support instruction latency,
so I submitted a patch which creates weak functions and moves the
ins_lat handling into X86-specific code.
3. That patch (commit fbefe9c2f87f) was finally merged on 2021-02-18.
The perf test failure should first be observable from this point.


To my understanding, "Fixes" should point to the commit where an issue
starts to be observed. If someone tries to backport the fix later, they
then have an idea where to put it.


Thanks,
Kan



Re: [PATCH] perf test: Test case 27 fails on s390 and non-x86 platforms

2021-03-02 Thread Liang, Kan



+ Athira Rajeev

On 3/2/2021 8:31 AM, Thomas Richter wrote:

Executing perf test 27 fails on s390:
  [root@t35lp46 perf]# ./perf test -Fv 27
  27: Sample parsing
  --- start ---
   end 
  Sample parsing: FAILED!
  [root@t35lp46 perf]#

The root cause is
commit c7444297fd3769 ("perf test: Support PERF_SAMPLE_WEIGHT_STRUCT")
This commit introduced a test case for PERF_SAMPLE_WEIGHT_STRUCT
but does not adjust non-x86 weak linkage functions.

The error is in test__sample_parsing() --> do_test()
Function do_test() defines two structures of type struct perf_sample named
sample and sample_out. The first sets member sample.ins_lat = 117

Structure sample_out is constructed dynamically using functions
perf_event__synthesize_sample() and evsel__parse_sample().
Both functions have an x86 specific function version which sets member
ins_lat. The weak common functions do not set member ins_lat.



I don't think Power supports the instruction latency. At the request of
Athira Rajeev, I moved the PERF_SAMPLE_WEIGHT_STRUCT handling to the
X86-specific code.

https://lore.kernel.org/lkml/d97fef4f-dd88-4760-885e-9a6161a9b...@linux.vnet.ibm.com/
https://lore.kernel.org/lkml/1612540912-6562-1-git-send-email-kan.li...@linux.intel.com/

I don't think we want to add the ins_lat back in the weak common functions.

Could you please update the perf test so it doesn't apply
PERF_SAMPLE_WEIGHT_STRUCT on non-X86 platforms?




Later in function samples_same() both data in variable sample and sample_out
are compared. The comparison fails because sample.ins_lat is 117
and samples_out.ins_lat is 0, the weak functions never set member ins_lat.

Output after:
  [root@t35lp46 perf]# ./perf test -Fv 27
  27: Sample parsing
  --- start ---
   end 
  Sample parsing: Ok
[root@t35lp46 perf]#

Fixes:
commit c7444297fd3769 ("perf test: Support PERF_SAMPLE_WEIGHT_STRUCT")


I think the regression should start from
commit fbefe9c2f87f ("perf tools: Support arch specific 
PERF_SAMPLE_WEIGHT_STRUCT processing")



Thanks,
Kan


Signed-off-by: Thomas Richter 
---
  tools/perf/util/evsel.c| 8 +---
  tools/perf/util/synthetic-events.c | 6 +-
  2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 1bf76864c4f2..c9efed5c173d 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -2106,10 +2106,12 @@ perf_event__check_size(union perf_event *event, 
unsigned int sample_size)
  }
  
  void __weak arch_perf_parse_sample_weight(struct perf_sample *data,

- const __u64 *array,
- u64 type __maybe_unused)
+ const __u64 *array, u64 type)
  {
-   data->weight = *array;
+   if (type & PERF_SAMPLE_WEIGHT)
+   data->weight = *array & 0x;
+   if (type & PERF_SAMPLE_WEIGHT_STRUCT)
+   data->ins_lat = (*array >> 32) & 0x;
  }
  
  int evsel__parse_sample(struct evsel *evsel, union perf_event *event,

diff --git a/tools/perf/util/synthetic-events.c 
b/tools/perf/util/synthetic-events.c
index b698046ec2db..af7ecbc57cbe 100644
--- a/tools/perf/util/synthetic-events.c
+++ b/tools/perf/util/synthetic-events.c
@@ -1507,9 +1507,13 @@ size_t perf_event__sample_event_size(const struct 
perf_sample *sample, u64 type,
  }
  
  void __weak arch_perf_synthesize_sample_weight(const struct perf_sample *data,

-  __u64 *array, u64 type 
__maybe_unused)
+  __u64 *array, u64 type)
  {
*array = data->weight;
+   if (type & PERF_SAMPLE_WEIGHT_STRUCT) {
+   *array &= 0x;
+   *array |= ((u64)data->ins_lat << 32);
+   }
  }
  
  int perf_event__synthesize_sample(union perf_event *event, u64 type, u64 read_format,




Re: [perf] perf_fuzzer causes crash in intel_pmu_drain_pebs_nhm()

2021-03-01 Thread Liang, Kan




On 2/11/2021 9:53 AM, Peter Zijlstra wrote:


Kan, do you have time to look at this?

On Thu, Jan 28, 2021 at 02:49:47PM -0500, Vince Weaver wrote:

On Thu, 28 Jan 2021, Vince Weaver wrote:


the perf_fuzzer has turned up a repeatable crash on my haswell system.

addr2line is not being very helpful, it points to DECLARE_PER_CPU_FIRST.
I'll investigate more when I have the chance.


so I poked around some more.

This seems to be caused in

__intel_pmu_pebs_event()
get_next_pebs_record_by_bit()   ds.c line 1639
get_pebs_status(at) ds.c line 1317
return ((struct pebs_record_nhm *)n)->status;

where "n" has the value of 0xc0 rather than a proper pointer.



I think I found the suspicious patch:
commit 01330d7288e00 ("perf/x86: Allow zero PEBS status with only
single active event")

https://lore.kernel.org/lkml/tip-01330d7288e0050c5aaabc558059ff91589e6...@git.kernel.org/
The patch is an SW workaround for some old CPUs (HSW and earlier), which
may set the PEBS status to 0. It adds a check in
intel_pmu_drain_pebs_nhm() and tries to minimize the impact of the
defect by avoiding dropping the PEBS records which have PEBS status 0.

But it doesn't correct the PEBS status, which may cause problems,
especially for a large PEBS.
It's possible that all the PEBS records in a large PEBS have PEBS
status 0. If so, the first get_next_pebs_record_by_bit() in
__intel_pmu_pebs_event() returns NULL, so at == NULL. Since it's a large
PEBS, the 'count' parameter must be > 1. The second
get_next_pebs_record_by_bit() will then crash.
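
A simplified sketch of the sequence in __intel_pmu_pebs_event() (not the
exact kernel code):

	void *at = get_next_pebs_record_by_bit(base, top, bit);
	/* every record has status == 0, so 'at' is NULL here */

	while (count > 1) {
		/* ... output a sample based on 'at' ... */
		at += cpuc->pebs_record_size;	/* NULL + 0xc0 on HSW */
		at = get_next_pebs_record_by_bit(at, top, bit);	/* boom */
		count--;
	}

If I remember the HSW record layout correctly, 0xc0 is exactly
sizeof(struct pebs_record_hsw), which would also explain the bogus
pointer value Vince reported.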


Could you please revert the patch and check whether it fixes your issue?

Thanks,
Kan


Re: [perf] perf_fuzzer causes crash in intel_pmu_drain_pebs_nhm()

2021-02-25 Thread Liang, Kan




On 2/11/2021 5:14 PM, Vince Weaver wrote:

On Thu, 11 Feb 2021, Liang, Kan wrote:


On Thu, Jan 28, 2021 at 02:49:47PM -0500, Vince Weaver wrote:

I'd like to reproduce it on my machine.
Is this issue only found in a Haswell client machine?

To reproduce the issue, can I use ./perf_fuzzer under perf_event_tests/fuzzer?
Do I need to apply any parameters with ./perf_fuzzer?

Usually how long does it take to reproduce the issue?


On my machine if I run the commands
echo 1 > /proc/sys/kernel/nmi_watchdog
 echo 0 > /proc/sys/kernel/perf_event_paranoid
 echo 1000 > /proc/sys/kernel/perf_event_max_sample_rate
 ./perf_fuzzer -s 3 -r 1611784483

it is repeatable within a minute, but because of the nature of the fuzzer
it probably won't work for you because the random events will diverge
based on the different configs of the system.

I can try to generate a simple reproducer, I've just been extremely busy
here at work and haven't had the chance.

If you want to try to reproduce it the hard way, run the
./fast_repro99.sh
script in the perf_fuzzer directory.  It will start fuzzing.  My machine
turned up the issue within a day or so.

Sorry for the late response. Just want to let you know I'm still trying 
to reproduce the issue.


I only have a Haswell server on hand. I ran ./fast_repro99.sh on the
machine for 3 days, but I didn't observe any crash.
Now I'm looking for an HSW client in my lab. I will check if I can
reproduce it on a client machine.


If you have a simple reproducer, please let me know.

Thanks,
Kan


Re: [perf] perf_fuzzer causes crash in intel_pmu_drain_pebs_nhm()

2021-02-11 Thread Liang, Kan




On 2/11/2021 9:53 AM, Peter Zijlstra wrote:


Kan, do you have time to look at this?

On Thu, Jan 28, 2021 at 02:49:47PM -0500, Vince Weaver wrote:

On Thu, 28 Jan 2021, Vince Weaver wrote:


the perf_fuzzer has turned up a repeatable crash on my haswell system.

addr2line is not being very helpful, it points to DECLARE_PER_CPU_FIRST.
I'll investigate more when I have the chance.


so I poked around some more.

This seems to be caused in

__intel_pmu_pebs_event()
get_next_pebs_record_by_bit()   ds.c line 1639
get_pebs_status(at) ds.c line 1317
return ((struct pebs_record_nhm *)n)->status;

where "n" has the value of 0xc0 rather than a proper pointer.



The issue is pretty strange. I haven't found the root cause yet.

The base->status (aka "n") was just used in the intel_pmu_drain_pebs_nhm().
for (at = base; at < top; at += x86_pmu.pebs_record_size) {
struct pebs_record_nhm *p = at;
u64 pebs_status;

pebs_status = p->status & cpuc->pebs_enabled;
pebs_status &= mask;
Then it seems to be modified to 0xc0, and the crash happens at
get_pebs_status(), based on your investigation.
The "base" is a local variable. The ds->pebs_buffer_base is assigned to
"base" at the beginning of the function. After that, no one changes it.




this does seem to be repetable,


I'd like to reproduce it on my machine.
Is this issue only seen on a Haswell client machine?

To reproduce the issue, can I use ./perf_fuzzer under
perf_event_tests/fuzzer?

Do I need to pass any parameters to ./perf_fuzzer?

Usually, how long does it take to reproduce the issue?

Thanks,
Kan


but fairly deep in a fuzzing run so I
don't have a quick reproducer. >> Vince



[96289.009646] BUG: kernel NULL pointer dereference, address: 0150
[96289.017094] #PF: supervisor read access in kernel mode
[96289.022588] #PF: error_code(0x) - not-present page
[96289.028069] PGD 0 P4D 0
[96289.030796] Oops:  [#1] SMP PTI
[96289.034549] CPU: 0 PID: 0 Comm: swapper/0 Tainted: GW 
5.11.0-rc5+ #151
[96289.043059] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 
01/26/2014
[96289.050946] RIP: 0010:intel_pmu_drain_pebs_nhm+0x464/0x5f0
[96289.056817] Code: 09 00 00 0f b6 c0 49 39 c4 74 2a 48 63 82 78 09 00 00 48 01 c5 
48 39 6c 24 08 76 17 0f b6 05 14 70 3f 01 83 e0 0f 3c 03 77 a4 <48> 8b 85 90 00 
00 00 eb 9f 31 ed 83 eb 01 83 fb 01 0f 85 30 ff ff
[96289.076876] RSP: :822039e0 EFLAGS: 00010097
[96289.082468] RAX: 0002 RBX: 0155 RCX: 0008
[96289.090095] RDX: 88811ac118a0 RSI: 82203980 RDI: 82203980
[96289.097746] RBP: 00c0 R08:  R09: 
[96289.105376] R10:  R11:  R12: 0001
[96289.113008] R13: 82203bc0 R14: 88801c3cf800 R15: 829814a0
[96289.120671] FS:  () GS:88811ac0() 
knlGS:
[96289.129346] CS:  0010 DS:  ES:  CR0: 80050033
[96289.135526] CR2: 0150 CR3: 0220c003 CR4: 001706f0
[96289.143159] DR0:  DR1:  DR2: 
[96289.150803] DR3:  DR6: 0ff0 DR7: 0600
[96289.158414] Call Trace:
[96289.161041]  ? update_blocked_averages+0x532/0x620
[96289.166152]  ? update_group_capacity+0x25/0x1d0
[96289.171025]  ? cpumask_next_and+0x19/0x20
[96289.175339]  ? update_sd_lb_stats.constprop.0+0x702/0x820
[96289.181105]  intel_pmu_drain_pebs_buffer+0x33/0x50
[96289.186259]  ? x86_pmu_commit_txn+0xbc/0xf0
[96289.190749]  ? _raw_spin_lock_irqsave+0x1d/0x30
[96289.195603]  ? timerqueue_add+0x64/0xb0
[96289.199720]  ? update_load_avg+0x6c/0x5e0
[96289.204001]  ? enqueue_task_fair+0x98/0x5a0
[96289.208464]  ? timerqueue_del+0x1e/0x40
[96289.212556]  ? uncore_msr_read_counter+0x10/0x20
[96289.217513]  intel_pmu_pebs_disable+0x12a/0x130
[96289.222324]  x86_pmu_stop+0x48/0xa0
[96289.226076]  x86_pmu_del+0x40/0x160
[96289.229813]  event_sched_out.isra.0+0x81/0x1e0
[96289.234602]  group_sched_out.part.0+0x4f/0xc0
[96289.239257]  __perf_event_disable+0xef/0x1d0
[96289.243831]  event_function+0x8c/0xd0
[96289.247785]  remote_function+0x3e/0x50
[96289.251797]  flush_smp_call_function_queue+0x11b/0x1a0
[96289.257268]  flush_smp_call_function_from_idle+0x38/0x60
[96289.262944]  do_idle+0x15f/0x240
[96289.266421]  cpu_startup_entry+0x19/0x20
[96289.270639]  start_kernel+0x7df/0x804
[96289.274558]  ? apply_microcode_early.cold+0xc/0x27
[96289.279678]  secondary_startup_64_no_verify+0xb0/0xbb
[96289.285078] Modules linked in: nf_tables libcrc32c nfnetlink intel_rapl_msr 
intel_rapl_common snd_hda_codec_realtek snd_hda_codec_generic 
snd_hda_codec_hdmi x86_pkg_temp_thermal ledtrig_audio intel_powerclamp 
snd_hda_intel coretemp snd_intel_dspcfg snd_hda_codec snd_hda_core kvm_intel 
kvm 

Re: [PATCH 00/49] Add Alder Lake support for perf

2021-02-11 Thread Liang, Kan




On 2/11/2021 6:40 AM, Jiri Olsa wrote:

On Mon, Feb 08, 2021 at 07:24:57AM -0800, kan.li...@linux.intel.com wrote:

SNIP


Jin Yao (24):
   perf jevents: Support unit value "cpu_core" and "cpu_atom"
   perf util: Save pmu name to struct perf_pmu_alias
   perf pmu: Save detected hybrid pmus to a global pmu list
   perf pmu: Add hybrid helper functions
   perf list: Support --cputype option to list hybrid pmu events
   perf stat: Hybrid evsel uses its own cpus
   perf header: Support HYBRID_TOPOLOGY feature
   perf header: Support hybrid CPU_PMU_CAPS
   tools headers uapi: Update tools's copy of linux/perf_event.h
   perf parse-events: Create two hybrid hardware events
   perf parse-events: Create two hybrid cache events
   perf parse-events: Support hardware events inside PMU
   perf list: Display pmu prefix for partially supported hybrid cache
 events
   perf parse-events: Support hybrid raw events
   perf stat: Support --cputype option for hybrid events
   perf stat: Support metrics with hybrid events
   perf evlist: Create two hybrid 'cycles' events by default
   perf stat: Add default hybrid events
   perf stat: Uniquify hybrid event name
   perf stat: Merge event counts from all hybrid PMUs
   perf stat: Filter out unmatched aggregation for hybrid event
   perf evlist: Warn as events from different hybrid PMUs in a group
   perf Documentation: Document intel-hybrid support
   perf evsel: Adjust hybrid event and global event mixed group

Kan Liang (22):
   perf/x86/intel: Hybrid PMU support for perf capabilities
   perf/x86: Hybrid PMU support for intel_ctrl
   perf/x86: Hybrid PMU support for counters
   perf/x86: Hybrid PMU support for unconstrained
   perf/x86: Hybrid PMU support for hardware cache event
   perf/x86: Hybrid PMU support for event constraints
   perf/x86: Hybrid PMU support for extra_regs
   perf/x86/intel: Factor out intel_pmu_check_num_counters
   perf/x86/intel: Factor out intel_pmu_check_event_constraints
   perf/x86/intel: Factor out intel_pmu_check_extra_regs
   perf/x86: Expose check_hw_exists
   perf/x86: Remove temporary pmu assignment in event_init
   perf/x86: Factor out x86_pmu_show_pmu_cap
   perf/x86: Register hybrid PMUs
   perf/x86: Add structures for the attributes of Hybrid PMUs
   perf/x86/intel: Add attr_update for Hybrid PMUs
   perf/x86: Support filter_match callback
   perf/x86/intel: Add Alder Lake Hybrid support
   perf: Introduce PERF_TYPE_HARDWARE_PMU and PERF_TYPE_HW_CACHE_PMU
   perf/x86/intel/uncore: Add Alder Lake support
   perf/x86/msr: Add Alder Lake CPU support
   perf/x86/cstate: Add Alder Lake CPU support

Ricardo Neri (2):
   x86/cpufeatures: Enumerate Intel Hybrid Technology feature bit
   x86/cpu: Describe hybrid CPUs in cpuinfo_x86

Zhang Rui (1):
   perf/x86/rapl: Add support for Intel Alder Lake


hi,
would you have git branch with all this somewhere?



Here is the git branch

https://github.com/kliang2/perf.git adl_enabling

Please note that the branch is on top of Peter's perf/core branch, which
doesn't include the latest perf tool changes. The perf tool patches in
the branch only include the critical changes. There will be more tool
patches later, e.g., patches for perf mem, perf test, etc.


Thanks,
Kan


Re: [PATCH 23/49] perf/x86/msr: Add Alder Lake CPU support

2021-02-09 Thread Liang, Kan




On 2/8/2021 10:58 PM, kernel test robot wrote:

Hi,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on tip/perf/core]
[cannot apply to tip/master linus/master tip/x86/core v5.11-rc6 next-20210125]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:
https://github.com/0day-ci/linux/commits/kan-liang-linux-intel-com/Add-Alder-Lake-support-for-perf/20210209-070642
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 
32451614da2a9cf4296f90d3606ac77814fb519d
config: x86_64-randconfig-s021-20210209 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-15) 9.3.0
reproduce:
 # apt-get install sparse
 # sparse version: v0.6.3-215-g0fb77bb6-dirty
 # 
https://github.com/0day-ci/linux/commit/ef3d3e5028f5f70a78fa37d642e8e7e65c60dee7
 git remote add linux-review https://github.com/0day-ci/linux
 git fetch --no-tags linux-review 
kan-liang-linux-intel-com/Add-Alder-Lake-support-for-perf/20210209-070642
 git checkout ef3d3e5028f5f70a78fa37d642e8e7e65c60dee7
 # save the attached .config to linux build tree
 make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=x86_64

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot 

All errors (new ones prefixed by >>):

arch/x86/events/msr.c: In function 'test_intel':

arch/x86/events/msr.c:104:7: error: 'INTEL_FAM6_ALDERLAKE_L' undeclared (first 
use in this function); did you mean 'INTEL_FAM6_ALDERLAKE'?

  104 |  case INTEL_FAM6_ALDERLAKE_L:
  |   ^~
  |   INTEL_FAM6_ALDERLAKE
arch/x86/events/msr.c:104:7: note: each undeclared identifier is reported 
only once for each function it appears in



The patchset is on top of PeterZ's perf/core branch plus
commit id 6e1239c13953 ("x86/cpu: Add another Alder Lake CPU to the 
Intel family")


The above commit is also missing from the tip/perf/core branch. All the
issues should be gone once tip/perf/core syncs with tip/x86/urgent.



Thanks,
Kan




Re: [PATCH 02/49] x86/cpu: Describe hybrid CPUs in cpuinfo_x86

2021-02-08 Thread Liang, Kan




On 2/8/2021 12:56 PM, Borislav Petkov wrote:

On Mon, Feb 08, 2021 at 07:24:59AM -0800, kan.li...@linux.intel.com wrote:

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index c20a52b..1f25ac9 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -139,6 +139,16 @@ struct cpuinfo_x86 {
u32 microcode;
/* Address space bits used by the cache internally */
u8  x86_cache_bits;
+   /*
+* In hybrid processors, there is a CPU type and a native model ID. The
+* CPU type (x86_cpu_type[31:24]) describes the type of micro-
+* architecture families. The native model ID (x86_cpu_type[23:0])
+* describes a specific microarchitecture version. Combining both
+* allows to uniquely identify a CPU.
+*
+* Please note that the native model ID is not related to x86_model.
+*/
+   u32 x86_cpu_type;


Why are you adding it here instead of simply using
X86_FEATURE_HYBRID_CPU at the call site?

How many uses in this patchset?

/me searches...

Exactly one.

Just query X86_FEATURE_HYBRID_CPU at the call site and read what you
need from CPUID and use it there - no need for this.



I think it's good enough for perf, but I'm not sure whether other code
needs the CPU type information.


Ricardo, do you know?

Maybe we should implement a generic function as below for this?
(Not tested. Just an example.)

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index a66c1fd..679f5fe 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -2056,3 +2056,11 @@ void arch_smt_update(void)
/* Check whether IPI broadcasting can be enabled */
apic_smt_update();
 }
+
+u32 x86_read_hybrid_type(void)
+{
+   if (cpu_feature_enabled(X86_FEATURE_HYBRID_CPU))
+   return cpuid_eax(0x001a);
+
+   return 0;
+}


Thanks,
Kan


Re: [PATCH 6/9] perf report: Support instruction latency

2021-02-08 Thread Liang, Kan




On 2/6/2021 3:09 AM, Namhyung Kim wrote:

On Fri, Feb 5, 2021 at 11:38 PM Liang, Kan  wrote:


On 2/5/2021 6:08 AM, Namhyung Kim wrote:

On Wed, Feb 3, 2021 at 5:14 AM  wrote:


From: Kan Liang 

The instruction latency information can be recorded on some platforms,
e.g., the Intel Sapphire Rapids server. With both memory latency
(weight) and the new instruction latency information, users can easily
locate the expensive load instructions, and also understand the time
spent in different stages. The users can optimize their applications
in different pipeline stages.

The 'weight' field is shared among different architectures. Reusing the
'weight' field may impacts other architectures. Add a new field to store
the instruction latency.

Like the 'weight' support, introduce a 'ins_lat' for the global
instruction latency, and a 'local_ins_lat' for the local instruction
latency version.


Could you please clarify the difference between the global latency
and the local latency?



The global means the total latency.
The local means average latency, aka total / number of samples.


Thanks for the explanation, but I think it's confusing.
Why not call it just total_latency and avg_latency?



The instruction latency field is an extension of the weight field, so I
followed the same naming convention. I still think we should keep the
naming consistent.


To address the confusion, I think we may update the document for both 
the weight and the instruction latency fields.


How about the below patch?

From d5e80f541cb7288b24a7c5661ae5faede4747807 Mon Sep 17 00:00:00 2001
From: Kan Liang 
Date: Mon, 8 Feb 2021 05:27:03 -0800
Subject: [PATCH] perf documentation: Add comments to the local/global 
weight related fields


The current 'local' and 'global' prefixes are confusing for the
weight-related fields, e.g., weight, instruction latency.

Add comments to clarify:
'global' means the total weight/instruction latency sum.
'local' means the average weight/instruction latency per sample.

Signed-off-by: Kan Liang 
---
 tools/perf/Documentation/perf-report.txt | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/tools/perf/Documentation/perf-report.txt 
b/tools/perf/Documentation/perf-report.txt

index f546b5e..acc1c1d 100644
--- a/tools/perf/Documentation/perf-report.txt
+++ b/tools/perf/Documentation/perf-report.txt
@@ -92,8 +92,9 @@ OPTIONS
- srcfile: file name of the source file of the samples. Requires dwarf
information.
- weight: Event specific weight, e.g. memory latency or transaction
-   abort cost. This is the global weight.
-   - local_weight: Local weight version of the weight above.
+   abort cost. This is the global weight (total weight sum).
+   - local_weight: Local weight (average weight per sample) version of the
+ weight above.
- cgroup_id: ID derived from cgroup namespace device and inode numbers.
- cgroup: cgroup pathname in the cgroupfs.
- transaction: Transaction abort flags.
@@ -110,8 +111,9 @@ OPTIONS
--time-quantum (default 100ms). Specify with overhead and before it.
- code_page_size: the code page size of sampled code address (ip)
 	- ins_lat: Instruction latency in core cycles. This is the global instruction
-	  latency
-	- local_ins_lat: Local instruction latency version
+	  latency (total instruction latency sum)
+	- local_ins_lat: Local instruction latency (average instruction latency per
+	  sample) version

By default, comm, dso and symbol keys are used.
(i.e. --sort comm,dso,symbol)
--
2.7.4


Thanks,
Kan




Re: [PATCH 2/9] perf tools: Support the auxiliary event

2021-02-05 Thread Liang, Kan




On 2/5/2021 10:26 AM, Arnaldo Carvalho de Melo wrote:

Em Fri, Feb 05, 2021 at 09:13:34AM -0500, Liang, Kan escreveu:



On 2/5/2021 5:52 AM, Namhyung Kim wrote:

On Wed, Feb 3, 2021 at 5:14 AM  wrote:


From: Kan Liang 

On the Intel Sapphire Rapids server, an auxiliary event has to be
enabled simultaneously with the load latency event to retrieve complete
Memory Info.

Add X86 specific perf_mem_events__name() to handle the auxiliary event.
- Users are only interested in the samples of the mem-loads event.
Sample read the auxiliary event.
- The auxiliary event must be in front of the load latency event in a
group. Assume the second event to sample if the auxiliary event is the
leader.
- Add a weak is_mem_loads_aux_event() to check the auxiliary event for
X86. For other ARCHs, it always return false.

Parse the unique event name, mem-loads-aux, for the auxiliary event.

Signed-off-by: Kan Liang 
---
   tools/perf/arch/x86/util/Build|  1 +
   tools/perf/arch/x86/util/mem-events.c | 44 
+++
   tools/perf/util/evsel.c   |  3 +++
   tools/perf/util/mem-events.c  |  5 
   tools/perf/util/mem-events.h  |  2 ++
   tools/perf/util/parse-events.l|  1 +
   tools/perf/util/record.c  |  5 +++-
   7 files changed, 60 insertions(+), 1 deletion(-)
   create mode 100644 tools/perf/arch/x86/util/mem-events.c

diff --git a/tools/perf/arch/x86/util/Build b/tools/perf/arch/x86/util/Build
index 347c39b..d73f548 100644
--- a/tools/perf/arch/x86/util/Build
+++ b/tools/perf/arch/x86/util/Build
@@ -6,6 +6,7 @@ perf-y += perf_regs.o
   perf-y += topdown.o
   perf-y += machine.o
   perf-y += event.o
+perf-y += mem-events.o

   perf-$(CONFIG_DWARF) += dwarf-regs.o
   perf-$(CONFIG_BPF_PROLOGUE) += dwarf-regs.o
diff --git a/tools/perf/arch/x86/util/mem-events.c b/tools/perf/arch/x86/util/mem-events.c
new file mode 100644
index 000..11b8469
--- /dev/null
+++ b/tools/perf/arch/x86/util/mem-events.c
@@ -0,0 +1,44 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "util/pmu.h"
+#include "map_symbol.h"
+#include "mem-events.h"
+
+static char mem_loads_name[100];
+static bool mem_loads_name__init;
+
+#define MEM_LOADS_AUX  0x8203
+#define MEM_LOADS_AUX_NAME "{cpu/mem-loads-aux/,cpu/mem-loads,ldlat=%u/pp}:S"
+
+bool is_mem_loads_aux_event(struct evsel *leader)
+{
+   if (!pmu_have_event("cpu", "mem-loads-aux"))
+   return false;
+
+   return leader->core.attr.config == MEM_LOADS_AUX;
+}
+
+char *perf_mem_events__name(int i)
+{
+   struct perf_mem_event *e = perf_mem_events__ptr(i);
+
+   if (!e)
+   return NULL;
+
+   if (i == PERF_MEM_EVENTS__LOAD) {
+   if (mem_loads_name__init)
+   return mem_loads_name;
+
+   mem_loads_name__init = true;
+
+   if (pmu_have_event("cpu", "mem-loads-aux")) {
+   scnprintf(mem_loads_name, sizeof(MEM_LOADS_AUX_NAME),
+ MEM_LOADS_AUX_NAME, perf_mem_events__loads_ldlat);


It changes "%u" to an actual latency value, right?
What if the value takes 3 or more digits?
I'm not sure scnprintf() will handle it properly.



Yes, you are right. We should use sizeof(mem_loads_name) as below.
I will submit a patch to fix it.

diff --git a/tools/perf/arch/x86/util/mem-events.c b/tools/perf/arch/x86/util/mem-events.c
index 11b8469..588110f 100644
--- a/tools/perf/arch/x86/util/mem-events.c
+++ b/tools/perf/arch/x86/util/mem-events.c
@@ -31,7 +31,7 @@ char *perf_mem_events__name(int i)
mem_loads_name__init = true;

if (pmu_have_event("cpu", "mem-loads-aux")) {
-   scnprintf(mem_loads_name, sizeof(MEM_LOADS_AUX_NAME),
+   scnprintf(mem_loads_name, sizeof(mem_loads_name),
  MEM_LOADS_AUX_NAME, perf_mem_events__loads_ldlat);
} else {
scnprintf(mem_loads_name, sizeof(mem_loads_name),


I'll fold this in the relevant cset.



Thanks!

Kan


Re: [PATCH 2/9] perf tools: Support the auxiliary event

2021-02-05 Thread Liang, Kan




On 2/5/2021 5:52 AM, Namhyung Kim wrote:

On Wed, Feb 3, 2021 at 5:14 AM  wrote:


From: Kan Liang 

On the Intel Sapphire Rapids server, an auxiliary event has to be
enabled simultaneously with the load latency event to retrieve complete
Memory Info.

Add X86 specific perf_mem_events__name() to handle the auxiliary event.
- Users are only interested in the samples of the mem-loads event.
   Sample read the auxiliary event.
- The auxiliary event must be in front of the load latency event in a
   group. Assume the second event to sample if the auxiliary event is the
   leader.
- Add a weak is_mem_loads_aux_event() to check the auxiliary event for
   X86. For other ARCHs, it always returns false.

Parse the unique event name, mem-loads-aux, for the auxiliary event.

Signed-off-by: Kan Liang 
---
  tools/perf/arch/x86/util/Build|  1 +
  tools/perf/arch/x86/util/mem-events.c | 44 +++
  tools/perf/util/evsel.c   |  3 +++
  tools/perf/util/mem-events.c  |  5 
  tools/perf/util/mem-events.h  |  2 ++
  tools/perf/util/parse-events.l|  1 +
  tools/perf/util/record.c  |  5 +++-
  7 files changed, 60 insertions(+), 1 deletion(-)
  create mode 100644 tools/perf/arch/x86/util/mem-events.c

diff --git a/tools/perf/arch/x86/util/Build b/tools/perf/arch/x86/util/Build
index 347c39b..d73f548 100644
--- a/tools/perf/arch/x86/util/Build
+++ b/tools/perf/arch/x86/util/Build
@@ -6,6 +6,7 @@ perf-y += perf_regs.o
  perf-y += topdown.o
  perf-y += machine.o
  perf-y += event.o
+perf-y += mem-events.o

  perf-$(CONFIG_DWARF) += dwarf-regs.o
  perf-$(CONFIG_BPF_PROLOGUE) += dwarf-regs.o
diff --git a/tools/perf/arch/x86/util/mem-events.c b/tools/perf/arch/x86/util/mem-events.c
new file mode 100644
index 000..11b8469
--- /dev/null
+++ b/tools/perf/arch/x86/util/mem-events.c
@@ -0,0 +1,44 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "util/pmu.h"
+#include "map_symbol.h"
+#include "mem-events.h"
+
+static char mem_loads_name[100];
+static bool mem_loads_name__init;
+
+#define MEM_LOADS_AUX  0x8203
+#define MEM_LOADS_AUX_NAME "{cpu/mem-loads-aux/,cpu/mem-loads,ldlat=%u/pp}:S"
+
+bool is_mem_loads_aux_event(struct evsel *leader)
+{
+   if (!pmu_have_event("cpu", "mem-loads-aux"))
+   return false;
+
+   return leader->core.attr.config == MEM_LOADS_AUX;
+}
+
+char *perf_mem_events__name(int i)
+{
+   struct perf_mem_event *e = perf_mem_events__ptr(i);
+
+   if (!e)
+   return NULL;
+
+   if (i == PERF_MEM_EVENTS__LOAD) {
+   if (mem_loads_name__init)
+   return mem_loads_name;
+
+   mem_loads_name__init = true;
+
+   if (pmu_have_event("cpu", "mem-loads-aux")) {
+   scnprintf(mem_loads_name, sizeof(MEM_LOADS_AUX_NAME),
+ MEM_LOADS_AUX_NAME, perf_mem_events__loads_ldlat);


It changes "%u" to an actual latency value, right?
What if the value takes 3 or more digits?
I'm not sure scnprintf() will handle it properly.



Yes, you are right. We should use sizeof(mem_loads_name) as below.
I will submit a patch to fix it.

diff --git a/tools/perf/arch/x86/util/mem-events.c b/tools/perf/arch/x86/util/mem-events.c
index 11b8469..588110f 100644
--- a/tools/perf/arch/x86/util/mem-events.c
+++ b/tools/perf/arch/x86/util/mem-events.c
@@ -31,7 +31,7 @@ char *perf_mem_events__name(int i)
mem_loads_name__init = true;

if (pmu_have_event("cpu", "mem-loads-aux")) {
-   scnprintf(mem_loads_name, sizeof(MEM_LOADS_AUX_NAME),
+   scnprintf(mem_loads_name, sizeof(mem_loads_name),
  MEM_LOADS_AUX_NAME, perf_mem_events__loads_ldlat);
} else {
scnprintf(mem_loads_name, sizeof(mem_loads_name),
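
For anyone who wants to see the truncation concretely, here is a small 
stand-alone illustration (userspace snprintf() as a stand-in for the tool's 
scnprintf(); the 100-cycle ldlat value is just an example):

#include <stdio.h>

#define MEM_LOADS_AUX_NAME "{cpu/mem-loads-aux/,cpu/mem-loads,ldlat=%u/pp}:S"

int main(void)
{
	char buf[100];

	/*
	 * Wrong bound: sizes the format string, not the destination.
	 * A 3-digit ldlat makes the expanded string one byte longer than
	 * the format, so the trailing 'S' of the ":S" modifier is cut off.
	 */
	snprintf(buf, sizeof(MEM_LOADS_AUX_NAME), MEM_LOADS_AUX_NAME, 100u);
	printf("%s\n", buf);

	/* Correct bound: size of the destination buffer. */
	snprintf(buf, sizeof(buf), MEM_LOADS_AUX_NAME, 100u);
	printf("%s\n", buf);

	return 0;
}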




Thanks,
Kan



Re: [PATCH 6/9] perf report: Support instruction latency

2021-02-05 Thread Liang, Kan




On 2/5/2021 7:55 AM, Athira Rajeev wrote:

Because in other archs, the var2_w of ‘perf_sample_weight’ could be used to 
capture something other than the Local INSTR Latency.
Can we have some weak function to populate the header string?

I agree that the var2_w has different meanings among architectures. We should not 
force it to data->ins_lat.

The patch as below should fix it. Does it work for you?

My point about weak function was actually for the arch specific header string. But 
I guess we should not force it to data->ins_lat


Yes, I don't think PowerPC should force var2_w to data->ins_lat. I think 
you can create your own field.



as you mentioned. I checked the below patch defining an 
‘arch_perf_parse_sample_weight' for powerpc and it works.

But one observation is that, for cases with kernel having support for 
PERF_SAMPLE_WEIGHT_STRUCT but missing arch specific support for  
‘arch_perf_parse_sample_weight', it will report ‘Local Weight’ wrongly since 
weak function takes it as 64 bit. Not sure if that is a valid case to consider 
though.


Currently, the PERF_SAMPLE_WEIGHT_STRUCT is only enabled on X86 by default.
https://lore.kernel.org/lkml/1612296553-21962-6-git-send-email-kan.li...@linux.intel.com/

For PowerPC, the PERF_SAMPLE_WEIGHT is still the default setting. There 
is no way to set PERF_SAMPLE_WEIGHT_STRUCT via perf tool.

I don't think the above case will happen.
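
For reference, here is a rough sketch of the weak-function split being 
discussed, modelled on the evsel__parse_sample() logic from the patch. The 
file placement and exact signature used by the merged tool are assumptions 
here; PowerPC would provide its own override mapping var2_w to whatever its 
PMU encodes:

/* tools/perf/util/evsel.c: generic weak default, plain 64-bit weight */
void __weak arch_perf_parse_sample_weight(struct perf_sample *data,
					  const __u64 *array,
					  u64 type __maybe_unused)
{
	data->weight = *array;
}

/* tools/perf/arch/x86/util/: X86 override for the structured layout */
void arch_perf_parse_sample_weight(struct perf_sample *data,
				   const __u64 *array, u64 type)
{
	union perf_sample_weight weight;

	weight.full = *array;
	if (type & PERF_SAMPLE_WEIGHT)
		data->weight = weight.full;
	else {
		data->weight = weight.var1_dw;
		data->ins_lat = weight.var2_w;
	}
}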

Thanks,
Kan


Re: [PATCH 3/9] perf tools: Support data block and addr block

2021-02-05 Thread Liang, Kan




On 2/5/2021 6:02 AM, Namhyung Kim wrote:

On Wed, Feb 3, 2021 at 5:14 AM  wrote:


From: Kan Liang 

Two new data source fields, to indicate the block reasons of a load
instruction, are introduced on the Intel Sapphire Rapids server. The
fields can be used by the memory profiling.

Add a new sort function, SORT_MEM_BLOCKED, for the two fields.

For the previous platforms or the block reason is unknown, print "N/A"
for the block reason.

Add blocked as a default mem sort key for perf report and
perf mem report.

Signed-off-by: Kan Liang 
---

[SNIP]

+int perf_mem__blk_scnprintf(char *out, size_t sz, struct mem_info *mem_info)
+{
+   size_t l = 0;
+   u64 mask = PERF_MEM_BLK_NA;
+
+   sz -= 1; /* -1 for null termination */
+   out[0] = '\0';
+
+   if (mem_info)
+   mask = mem_info->data_src.mem_blk;
+
+   if (!mask || (mask & PERF_MEM_BLK_NA)) {
+   l += scnprintf(out + l, sz - l, " N/A");
+   return l;
+   }
+   if (mask & PERF_MEM_BLK_DATA)
+   l += scnprintf(out + l, sz - l, " Data");
+   if (mask & PERF_MEM_BLK_ADDR)
+   l += scnprintf(out + l, sz - l, " Addr");


So this means it's possible to have BLK_DATA and BLK_ADDR
together and in that case it'll print "DataAddr", right?


Yes, it's possible. If so, it will print "Data Addr".

Thanks,
Kan




+
+   return l;
+}
+
  int perf_script__meminfo_scnprintf(char *out, size_t sz, struct mem_info 
*mem_info)
  {
 int i = 0;
@@ -348,6 +371,8 @@ int perf_script__meminfo_scnprintf(char *out, size_t sz, 
struct mem_info *mem_in
 i += perf_mem__tlb_scnprintf(out + i, sz - i, mem_info);
 i += scnprintf(out + i, sz - i, "|LCK ");
 i += perf_mem__lck_scnprintf(out + i, sz - i, mem_info);
+   i += scnprintf(out + i, sz - i, "|BLK ");
+   i += perf_mem__blk_scnprintf(out + i, sz - i, mem_info);

 return i;
  }


Re: [PATCH 6/9] perf report: Support instruction latency

2021-02-05 Thread Liang, Kan




On 2/5/2021 6:08 AM, Namhyung Kim wrote:

On Wed, Feb 3, 2021 at 5:14 AM  wrote:


From: Kan Liang 

The instruction latency information can be recorded on some platforms,
e.g., the Intel Sapphire Rapids server. With both memory latency
(weight) and the new instruction latency information, users can easily
locate the expensive load instructions, and also understand the time
spent in different stages. The users can optimize their applications
in different pipeline stages.

The 'weight' field is shared among different architectures. Reusing the
'weight' field may impact other architectures. Add a new field to store
the instruction latency.

Like the 'weight' support, introduce a 'ins_lat' for the global
instruction latency, and a 'local_ins_lat' for the local instruction
latency version.


Could you please clarify the difference between the global latency
and the local latency?



'Global' means the total latency.
'Local' means the average latency, i.e., total / number of samples.

Thanks,
Kan


Re: [PATCH V3 1/5] perf/core: Add PERF_SAMPLE_WEIGHT_STRUCT

2021-02-05 Thread Liang, Kan




On 2/5/2021 5:43 AM, Namhyung Kim wrote:

On Fri, Feb 5, 2021 at 12:24 AM Liang, Kan  wrote:


On 2/4/2021 9:00 AM, Namhyung Kim wrote:

Hi Kan,

On Sat, Jan 30, 2021 at 2:25 AM Liang, Kan  wrote:
[SNIP]

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index b15e344..c50718a 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -145,12 +145,14 @@ enum perf_event_sample_format {
  PERF_SAMPLE_CGROUP  = 1U << 21,
  PERF_SAMPLE_DATA_PAGE_SIZE  = 1U << 22,
  PERF_SAMPLE_CODE_PAGE_SIZE  = 1U << 23,
+   PERF_SAMPLE_WEIGHT_STRUCT   = 1U << 24,

-   PERF_SAMPLE_MAX = 1U << 24, /* non-ABI */
+   PERF_SAMPLE_MAX = 1U << 25, /* non-ABI */

  __PERF_SAMPLE_CALLCHAIN_EARLY   = 1ULL << 63, /* non-ABI; 
internal use */
};

+#define PERF_SAMPLE_WEIGHT_TYPE	(PERF_SAMPLE_WEIGHT | PERF_SAMPLE_WEIGHT_STRUCT)


I'm not sure you want to expose it in the uapi header as it's not
intended to be used IMHO.



I'm not sure I understood, but it's indeed used in the tool patch set.

https://lore.kernel.org/lkml/1612296553-21962-6-git-send-email-kan.li...@linux.intel.com/


Well, it's not a big deal.. but I just worried if some users might do

   event.attr.sample_type = PERF_SAMPLE_WEIGHT_TYPE;



Yes, it's possible. For this case, the kernel returns -EINVAL.

Actually, even without the macro, users may still set PERF_SAMPLE_WEIGHT 
| PERF_SAMPLE_WEIGHT_STRUCT manually. We cannot control what a user 
sets. I don't think it's a problem as long as the kernel can handle it 
properly.
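
For reference, the enforcement boils down to a mutual-exclusion test when the 
event attributes are validated, roughly like this (sketch, not the exact hunk):

	if ((attr->sample_type & PERF_SAMPLE_WEIGHT) &&
	    (attr->sample_type & PERF_SAMPLE_WEIGHT_STRUCT))
		return -EINVAL;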


Thanks,
Kan


Re: [PATCH V3 1/5] perf/core: Add PERF_SAMPLE_WEIGHT_STRUCT

2021-02-04 Thread Liang, Kan




On 2/4/2021 9:00 AM, Namhyung Kim wrote:

Hi Kan,

On Sat, Jan 30, 2021 at 2:25 AM Liang, Kan  wrote:
[SNIP]

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index b15e344..c50718a 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -145,12 +145,14 @@ enum perf_event_sample_format {
 PERF_SAMPLE_CGROUP  = 1U << 21,
 PERF_SAMPLE_DATA_PAGE_SIZE  = 1U << 22,
 PERF_SAMPLE_CODE_PAGE_SIZE  = 1U << 23,
+   PERF_SAMPLE_WEIGHT_STRUCT   = 1U << 24,

-   PERF_SAMPLE_MAX = 1U << 24, /* non-ABI */
+   PERF_SAMPLE_MAX = 1U << 25, /* non-ABI */

 __PERF_SAMPLE_CALLCHAIN_EARLY   = 1ULL << 63, /* non-ABI; 
internal use */
   };

+#define PERF_SAMPLE_WEIGHT_TYPE	(PERF_SAMPLE_WEIGHT | PERF_SAMPLE_WEIGHT_STRUCT)


I'm not sure you want to expose it in the uapi header as it's not
intended to be used IMHO.



I'm not sure I understood, but it's indeed used in the tool patch set.

https://lore.kernel.org/lkml/1612296553-21962-6-git-send-email-kan.li...@linux.intel.com/


Thanks,
Kan


Re: [PATCH 6/9] perf report: Support instruction latency

2021-02-04 Thread Liang, Kan




On 2/4/2021 8:11 AM, Athira Rajeev wrote:




On 03-Feb-2021, at 1:39 AM, kan.li...@linux.intel.com wrote:

From: Kan Liang 

The instruction latency information can be recorded on some platforms,
e.g., the Intel Sapphire Rapids server. With both memory latency
(weight) and the new instruction latency information, users can easily
locate the expensive load instructions, and also understand the time
spent in different stages. The users can optimize their applications
in different pipeline stages.

The 'weight' field is shared among different architectures. Reusing the
'weight' field may impact other architectures. Add a new field to store
the instruction latency.

Like the 'weight' support, introduce a 'ins_lat' for the global
instruction latency, and a 'local_ins_lat' for the local instruction
latency version.

Add new sort functions, INSTR Latency and Local INSTR Latency,
accordingly.

Add local_ins_lat to the default_mem_sort_order[].

Signed-off-by: Kan Liang 
---
tools/perf/Documentation/perf-report.txt |  6 +++-
tools/perf/util/event.h  |  1 +
tools/perf/util/evsel.c  |  4 ++-
tools/perf/util/hist.c   | 12 ++--
tools/perf/util/hist.h   |  2 ++
tools/perf/util/intel-pt.c   |  5 ++--
tools/perf/util/session.c|  8 --
tools/perf/util/sort.c   | 47 +++-
tools/perf/util/sort.h   |  3 ++
tools/perf/util/synthetic-events.c   |  4 ++-
10 files changed, 81 insertions(+), 11 deletions(-)

diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
index 826b5a9..0565b7c 100644
--- a/tools/perf/Documentation/perf-report.txt
+++ b/tools/perf/Documentation/perf-report.txt
@@ -108,6 +108,9 @@ OPTIONS
- period: Raw number of event count of sample
- time: Separate the samples by time stamp with the resolution 
specified by
--time-quantum (default 100ms). Specify with overhead and before it.
+   - ins_lat: Instruction latency in core cycles. This is the global
+   instruction latency
+   - local_ins_lat: Local instruction latency version

By default, comm, dso and symbol keys are used.
(i.e. --sort comm,dso,symbol)
@@ -154,7 +157,8 @@ OPTIONS
- blocked: reason of blocked load access for the data at the time of 
the sample

And the default sort keys are changed to local_weight, mem, sym, dso,
-   symbol_daddr, dso_daddr, snoop, tlb, locked, blocked, see '--mem-mode'.
+   symbol_daddr, dso_daddr, snoop, tlb, locked, blocked, local_ins_lat,
+   see '--mem-mode'.

If the data file has tracepoint event(s), following (dynamic) sort keys
are also available:
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index ff403ea..5d50a49 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -141,6 +141,7 @@ struct perf_sample {
u16 insn_len;
u8  cpumode;
u16 misc;
+   u16 ins_lat;
bool no_hw_idx; /* No hw_idx collected in branch_stack */
char insn[MAX_INSN];
void *raw_data;
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 0a2a307..24c0b59 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -2337,8 +2337,10 @@ int evsel__parse_sample(struct evsel *evsel, union 
perf_event *event,
weight.full = *array;
if (type & PERF_SAMPLE_WEIGHT)
data->weight = weight.full;
-   else
+   else {
data->weight = weight.var1_dw;
+   data->ins_lat = weight.var2_w;
+   }
array++;
}

diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index 6866ab0..8a432f8 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -209,6 +209,8 @@ void hists__calc_col_len(struct hists *hists, struct 
hist_entry *h)
hists__new_col_len(hists, HISTC_LOCAL_WEIGHT, 12);
hists__new_col_len(hists, HISTC_GLOBAL_WEIGHT, 12);
hists__new_col_len(hists, HISTC_MEM_BLOCKED, 10);
+   hists__new_col_len(hists, HISTC_LOCAL_INS_LAT, 13);
+   hists__new_col_len(hists, HISTC_GLOBAL_INS_LAT, 13);
if (symbol_conf.nanosecs)
hists__new_col_len(hists, HISTC_TIME, 16);
else
@@ -286,12 +288,13 @@ static long hist_time(unsigned long htime)
}

static void he_stat__add_period(struct he_stat *he_stat, u64 period,
-   u64 weight)
+   u64 weight, u64 ins_lat)
{

he_stat->period  += period;
he_stat->weight  += weight;
he_stat->nr_events   += 1;
+   he_stat->ins_lat += ins_lat;
}

static void he_stat__add_stat(struct he_stat *dest, struct he_stat *src)
@@ -303,6 +306,7 @@ static void he_stat__add_stat(struct he_stat *dest, struct 
he_stat 

Re: [PATCH 2/9] perf tools: Support the auxiliary event

2021-02-03 Thread Liang, Kan




On 2/3/2021 3:02 PM, Arnaldo Carvalho de Melo wrote:

On Tue, Feb 02, 2021 at 12:09:06PM -0800, kan.li...@linux.intel.com wrote:

From: Kan Liang
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index c26ea822..c48f6de 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -2689,6 +2689,9 @@ int evsel__open_strerror(struct evsel *evsel, struct 
target *target,
if (perf_missing_features.aux_output)
return scnprintf(msg, size, "The 'aux_output' feature is not 
supported, update the kernel.");
break;
+   case ENODATA:
+   return scnprintf(msg, size, "Cannot collect data source with the 
load latency event alone. "
+"Please add an auxiliary event in front of the load 
latency event.");

Are you sure this is the only case where ENODATA comes out from
perf_event_open()? Well, according to your comment in:

   61b985e3e775a3a7 ("perf/x86/intel: Add perf core PMU support for Sapphire 
Rapids")

It should be at that point in time, so it's safe to merge as-is, but then
I think this is fragile: what if someone else, in the future, not
knowing that ENODATA is supposed to be used only with that ancient CPU,
Sapphire Rapids, uses it? :-)

Please consider adding a check before assuming ENODATA is for this
specific case.


Sure, I will add a check in V2.
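
Something along these lines, for example: only emit the hint when the running 
PMU actually exposes the auxiliary event. Whether pmu_have_event() is the 
right predicate for the final version is an assumption of this sketch:

	case ENODATA:
		if (!pmu_have_event("cpu", "mem-loads-aux"))
			break;	/* not the SPR load-latency case; use the generic message */
		return scnprintf(msg, size, "Cannot collect data source with the load latency event alone. "
				 "Please add an auxiliary event in front of the load latency event.");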

Thanks,
Kan


Re: [PATCH 5/9] perf tools: Support PERF_SAMPLE_WEIGHT_STRUCT

2021-02-03 Thread Liang, Kan




On 2/3/2021 3:31 PM, Arnaldo Carvalho de Melo wrote:

--- a/tools/perf/util/perf_event_attr_fprintf.c
+++ b/tools/perf/util/perf_event_attr_fprintf.c
@@ -35,7 +35,7 @@ static void __p_sample_type(char *buf, size_t size, u64 value)
bit_name(BRANCH_STACK), bit_name(REGS_USER), 
bit_name(STACK_USER),
bit_name(IDENTIFIER), bit_name(REGS_INTR), bit_name(DATA_SRC),
bit_name(WEIGHT), bit_name(PHYS_ADDR), bit_name(AUX),
-   bit_name(CGROUP), bit_name(DATA_PAGE_SIZE),
+   bit_name(CGROUP), bit_name(DATA_PAGE_SIZE), 
bit_name(WEIGHT_STRUCT),

I have CODE_PAGE_SIZE in my perf/core branch, was this somehow removed?

https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/commit/?h=perf/core=c1de7f3d84ca324c7cda85c3ce27b11741af2124

I see, you did this patchkit on top of upstream, that has just
DATA_PAGE_SIZE, while my perf/core branch has CODE_PAGE_SIZE. I'm
adjusting it, please next time do tooling development on acme/perf/core.


Sorry, I will rebase the patchset on acme/perf/core.

Thanks,
Kan


Re: [PATCH V3 0/5] perf core PMU support for Sapphire Rapids (Kernel)

2021-02-01 Thread Liang, Kan




On 2/1/2021 9:30 AM, Peter Zijlstra wrote:


I made the below changes, does that work?


Yes, it works well.

Thanks,
Kan



---

--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3627,6 +3627,15 @@ static int core_pmu_hw_config(struct per
return intel_pmu_bts_config(event);
  }
  
+#define INTEL_TD_METRIC_AVAILABLE_MAX	(INTEL_TD_METRIC_RETIRING + \

+((x86_pmu.num_topdown_events - 1) << 
8))
+
+static bool is_available_metric_event(struct perf_event *event)
+{
+   return is_metric_event(event) &&
+   event->attr.config <= INTEL_TD_METRIC_AVAILABLE_MAX;
+}
+
  static inline bool is_mem_loads_event(struct perf_event *event)
  {
return (event->attr.config & INTEL_ARCH_EVENT_MASK) == 
X86_CONFIG(.event=0xcd, .umask=0x01);
@@ -3711,7 +3720,7 @@ static int intel_pmu_hw_config(struct pe
if (event->attr.config & X86_ALL_EVENT_FLAGS)
return -EINVAL;
  
-		if (is_metric_event(event)) {

+   if (is_available_metric_event(event)) {
struct perf_event *leader = event->group_leader;
  
  			/* The metric events don't support sampling. */

--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -87,7 +87,14 @@ static inline bool is_topdown_count(stru
return event->hw.flags & PERF_X86_EVENT_TOPDOWN;
  }
  
-static inline bool is_metric_event(struct perf_event *event);

+static inline bool is_metric_event(struct perf_event *event)
+{
+   u64 config = event->attr.config;
+
+   return ((config & ARCH_PERFMON_EVENTSEL_EVENT) == 0) &&
+   ((config & INTEL_ARCH_EVENT_MASK) >= INTEL_TD_METRIC_RETIRING)  
&&
+   ((config & INTEL_ARCH_EVENT_MASK) <= INTEL_TD_METRIC_MAX);
+}
  
  static inline bool is_slots_event(struct perf_event *event)

  {
@@ -901,18 +908,6 @@ static struct perf_pmu_events_ht_attr ev
  struct pmu *x86_get_pmu(void);
  extern struct x86_pmu x86_pmu __read_mostly;
  
-#define INTEL_TD_METRIC_AVAILABLE_MAX	(INTEL_TD_METRIC_RETIRING + \

-((x86_pmu.num_topdown_events - 1) << 
8))
-
-static inline bool is_metric_event(struct perf_event *event)
-{
-   u64 config = event->attr.config;
-
-   return ((config & ARCH_PERFMON_EVENTSEL_EVENT) == 0) &&
-   ((config & INTEL_ARCH_EVENT_MASK) >= INTEL_TD_METRIC_RETIRING)  
&&
-   ((config & INTEL_ARCH_EVENT_MASK) <= 
INTEL_TD_METRIC_AVAILABLE_MAX);
-}
-
  static __always_inline struct x86_perf_task_context_opt 
*task_context_opt(void *ctx)
  {
if (static_cpu_has(X86_FEATURE_ARCH_LBR))
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -284,10 +284,12 @@ struct x86_pmu_capability {
  #define INTEL_TD_METRIC_BAD_SPEC  0x8100  /* Bad speculation 
metric */
  #define INTEL_TD_METRIC_FE_BOUND  0x8200  /* FE bound metric */
  #define INTEL_TD_METRIC_BE_BOUND  0x8300  /* BE bound metric */
-#define INTEL_TD_METRIC_HEAVY_OPS  0x8400  /* Heavy Operations 
metric */
-#define INTEL_TD_METRIC_BR_MISPREDICT  0x8500  /* Branch Mispredict 
metric */
-#define INTEL_TD_METRIC_FETCH_LAT  0x8600  /* Fetch Latency metric 
*/
-#define INTEL_TD_METRIC_MEM_BOUND  0x8700  /* Memory bound metric 
*/
+/* Level 2 metrics */
+#define INTEL_TD_METRIC_HEAVY_OPS  0x8400  /* Heavy Operations 
metric */
+#define INTEL_TD_METRIC_BR_MISPREDICT  0x8500  /* Branch Mispredict 
metric */
+#define INTEL_TD_METRIC_FETCH_LAT  0x8600  /* Fetch Latency metric 
*/
+#define INTEL_TD_METRIC_MEM_BOUND  0x8700  /* Memory bound metric 
*/
+
  #define INTEL_TD_METRIC_MAX   INTEL_TD_METRIC_MEM_BOUND
  #define INTEL_TD_METRIC_NUM   8
  
--- a/include/uapi/linux/perf_event.h

+++ b/include/uapi/linux/perf_event.h
@@ -908,7 +908,6 @@ enum perf_event_type {
 *  u32 var1_dw;
 *  } && PERF_SAMPLE_WEIGHT_STRUCT
 *  #endif
-*
 *   }
 *  }
 *  { u64   data_src; } && PERF_SAMPLE_DATA_SRC
@@ -1276,29 +1275,23 @@ struct perf_branch_entry {
reserved:40;
  };
  
-#if defined(__LITTLE_ENDIAN_BITFIELD)

  union perf_sample_weight {
__u64   full;
+#if defined(__LITTLE_ENDIAN_BITFIELD)
struct {
__u32   var1_dw;
__u16   var2_w;
__u16   var3_w;
};
-};
-
  #elif defined(__BIG_ENDIAN_BITFIELD)
-
-union perf_sample_weight {
-   __u64   full;
struct {
__u16   var3_w;
__u16   var2_w;
__u32   var1_dw;
};
-};
-
  #else
  #error "Unknown endianness"
  #endif
+};
  
  #endif /* _UAPI_LINUX_PERF_EVENT_H */




Re: [PATCH V3 1/5] perf/core: Add PERF_SAMPLE_WEIGHT_STRUCT

2021-01-29 Thread Liang, Kan




On 1/28/2021 5:40 PM, kan.li...@linux.intel.com wrote:

From: Kan Liang 

Current PERF_SAMPLE_WEIGHT sample type is very useful to expresses the
cost of an action represented by the sample. This allows the profiler
to scale the samples to be more informative to the programmer. It could
also help to locate a hotspot, e.g., when profiling by memory latencies,
the expensive load appear higher up in the histograms. But current
PERF_SAMPLE_WEIGHT sample type is solely determined by one factor. This
could be a problem, if users want two or more factors to contribute to
the weight. For example, Golden Cove core PMU can provide both the
instruction latency and the cache Latency information as factors for the
memory profiling.

For current X86 platforms, although meminfo::latency is defined as a
u64, only the lower 32 bits include the valid data in practice (No
memory access could last longer than 4G cycles). The higher 32 bits can be used
to store new factors.

Add a new sample type, PERF_SAMPLE_WEIGHT_STRUCT, to indicate the new
sample weight structure. It shares the same space as the
PERF_SAMPLE_WEIGHT sample type.

Users can apply either the PERF_SAMPLE_WEIGHT sample type or the
PERF_SAMPLE_WEIGHT_STRUCT sample type to retrieve the sample weight, but
they cannot apply both sample types simultaneously.

Currently, only X86 and PowerPC use the PERF_SAMPLE_WEIGHT sample type.
- For PowerPC, there is nothing changed for the PERF_SAMPLE_WEIGHT
   sample type. There is no effect for the new PERF_SAMPLE_WEIGHT_STRUCT
   sample type. PowerPC can re-struct the weight field similarly later.
- For X86, the same value will be dumped for the PERF_SAMPLE_WEIGHT
   sample type or the PERF_SAMPLE_WEIGHT_STRUCT sample type for now.
   The following patches will apply the new factors for the
   PERF_SAMPLE_WEIGHT_STRUCT sample type.

The field in the union perf_sample_weight should be shared among
different architectures. A generic name is required, but it's hard to
abstract a name that applies to all architectures. For example, on X86,
the fields are to store all kinds of latency. While on PowerPC, it
stores MMCRA[TECX/TECM], which should not be latency. So a general name
prefix 'var$NUM' is used here.

Suggested-by: Peter Zijlstra (Intel) 
Signed-off-by: Kan Liang 
---
  arch/powerpc/perf/core-book3s.c |  2 +-
  arch/x86/events/intel/ds.c  | 17 +++---
  include/linux/perf_event.h  |  4 ++--
  include/uapi/linux/perf_event.h | 49 +++--
  kernel/events/core.c| 11 +
  5 files changed, 66 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 28206b1..869d999 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -2195,7 +2195,7 @@ static void record_and_restart(struct perf_event *event, 
unsigned long val,
  
  		if (event->attr.sample_type & PERF_SAMPLE_WEIGHT &&

ppmu->get_mem_weight)
-   ppmu->get_mem_weight();
+   ppmu->get_mem_weight();
  
  		if (perf_event_overflow(event, , regs))

power_pmu_stop(event, 0);
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 67dbc91..2f54b1f 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -960,7 +960,8 @@ static void adaptive_pebs_record_size_update(void)
  }
  
  #define PERF_PEBS_MEMINFO_TYPE	(PERF_SAMPLE_ADDR | PERF_SAMPLE_DATA_SRC |   \

-   PERF_SAMPLE_PHYS_ADDR | PERF_SAMPLE_WEIGHT | \
+   PERF_SAMPLE_PHYS_ADDR |  \
+   PERF_SAMPLE_WEIGHT_TYPE |\
PERF_SAMPLE_TRANSACTION |\
PERF_SAMPLE_DATA_PAGE_SIZE)
  
@@ -987,7 +988,7 @@ static u64 pebs_update_adaptive_cfg(struct perf_event *event)

gprs = (sample_type & PERF_SAMPLE_REGS_INTR) &&
   (attr->sample_regs_intr & PEBS_GP_REGS);
  
-	tsx_weight = (sample_type & PERF_SAMPLE_WEIGHT) &&

+   tsx_weight = (sample_type & PERF_SAMPLE_WEIGHT_TYPE) &&
 ((attr->config & INTEL_ARCH_EVENT_MASK) ==
  x86_pmu.rtm_abort_event);
  
@@ -1369,8 +1370,8 @@ static void setup_pebs_fixed_sample_data(struct perf_event *event,

/*
 * Use latency for weight (only avail with PEBS-LL)
 */
-   if (fll && (sample_type & PERF_SAMPLE_WEIGHT))
-   data->weight = pebs->lat;
+   if (fll && (sample_type & PERF_SAMPLE_WEIGHT_TYPE))
+   data->weight.full = pebs->lat;
  
  	/*

 * data.data_src encodes the data source
@@ -1462,8 +1463,8 @@ static void setup_pebs_fixed_sample_data(struct 
perf_event *event,
  
  	if (x86_pmu.intel_cap.pebs_format >= 2) {

/* Only set the TSX weight when no memory weight. */
-  

Re: [PATCH V2 1/5] perf/core: Add PERF_SAMPLE_WEIGHT_STRUCT

2021-01-27 Thread Liang, Kan




On 1/27/2021 2:03 PM, Peter Zijlstra wrote:

On Wed, Jan 27, 2021 at 07:38:41AM -0800, kan.li...@linux.intel.com wrote:

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index b15e344..13b4019 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -145,12 +145,14 @@ enum perf_event_sample_format {
PERF_SAMPLE_CGROUP  = 1U << 21,
PERF_SAMPLE_DATA_PAGE_SIZE  = 1U << 22,
PERF_SAMPLE_CODE_PAGE_SIZE  = 1U << 23,
+   PERF_SAMPLE_WEIGHT_STRUCT   = 1U << 24,
  
-	PERF_SAMPLE_MAX = 1U << 24,		/* non-ABI */

+   PERF_SAMPLE_MAX = 1U << 25,   /* non-ABI */
  
  	__PERF_SAMPLE_CALLCHAIN_EARLY		= 1ULL << 63, /* non-ABI; internal use */

  };
  
+#define PERF_SAMPLE_WEIGHT_TYPE	(PERF_SAMPLE_WEIGHT | PERF_SAMPLE_WEIGHT_STRUCT)

  /*
   * values to program into branch_sample_type when PERF_SAMPLE_BRANCH is set
   *
@@ -890,7 +892,16 @@ enum perf_event_type {
 *char  data[size];
 *u64   dyn_size; } && PERF_SAMPLE_STACK_USER
 *
-*  { u64   weight;   } && PERF_SAMPLE_WEIGHT
+*  { union perf_sample_weight
+*   {
+*  u64 full; && PERF_SAMPLE_WEIGHT
+*  struct {
+*  u32 low_dword;
+*  u16 high_word;
+*  u16 higher_word;
+*  } && PERF_SAMPLE_WEIGHT_STRUCT
+*   }
+*  }
 *  { u64   data_src; } && PERF_SAMPLE_DATA_SRC
 *  { u64   transaction; } && 
PERF_SAMPLE_TRANSACTION
 *  { u64   abi; # enum perf_sample_regs_abi
@@ -1248,4 +1259,13 @@ struct perf_branch_entry {
reserved:40;
  };
  
+union perf_sample_weight {

+   __u64   full;
+   struct {
+   __u32   low_dword;
+   __u16   high_word;
+   __u16   higher_word;
+   };
+};


*urgh*, my naming lives ... anybody got a better suggestion?


I think we need a generic name here, but the problem is that the 
'weight' field has different meanings among architectures.


On X86, the 'weight' field is used to store all kinds of latencies.
On PowerPC, it stores MMCRA[TECX/TECM], which does not look like a latency.

I don't think I can use the name, 'cache_lat' or 'instruction_lat', 
here. Right?

If so, how about 'var'?

u32 var_1_dw;
u16 var_2_w;
u16 var_3_w;




Also, do we want to care about byte order?


Sure. I will add the big-endian and little-endian support.


Thanks,
Kan


Re: [PATCH V2 3/5] perf/x86/intel: Filter unsupported Topdown metrics event

2021-01-27 Thread Liang, Kan




On 1/27/2021 2:13 PM, Peter Zijlstra wrote:

On Wed, Jan 27, 2021 at 07:38:43AM -0800, kan.li...@linux.intel.com wrote:

From: Kan Liang 

Current perf doesn't check the index of a Topdown metrics event before
updating the event. A perf tool user may get a value from an unsupported
Topdown metrics event.

For example, the L2 topdown event, cpu/event=0x00,umask=0x84/, is not
supported on Ice Lake. A perf tool user may mistakenly use the
unsupported events via RAW format. In this case, the scheduler follows
the unknown event handling and assigns a GP counter to the event. The
icl_get_topdown_value() returns the value of the slots to the event.
The perf tool user will get the value for the unsupported
Topdown metrics event.

Add a check in the __icl_update_topdown_event() and avoid updating
unsupported events.


I was struggling trying to understand how we end up here. Because
userspace can add whatever crap it wants, and why is only this thing a
problem..

But the actual problem is the next patch changing INTEL_TD_METRIC_NUM,
which then means is_metric_idx() will change, and that means that for
ICL we'll accept these raw events as metric events on creation and then
at runtime we get into trouble.

This isn't spelled out.

I do think this is entirely the wrong fix for that though. You're now
adding cycles to the relative hot path, instead of fixing the problem at
event creation, which is the slow path.

Why can't we either refuse the event on ICL or otherwise wreck it on
construction to avoid getting into trouble here?



To reject the unsupported topdown events, I think perf needs to know the 
number of the supported topdown events. Maybe we can add a variable 
num_topdown_events in x86_pmu as below. Is it OK?


diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 3d6fdcf..c7f2602 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1061,6 +1061,18 @@ int x86_schedule_events(struct cpu_hw_events 
*cpuc, int n, int *assign)

return unsched ? -EINVAL : 0;
 }

+#define INTEL_TD_METRIC_AVAILABLE_MAX  (INTEL_TD_METRIC_RETIRING + \
+((x86_pmu.num_topdown_events - 1) << 
8))
+
+inline bool is_metric_event(struct perf_event *event)
+{
+   u64 config = event->attr.config;
+
+   return ((config & ARCH_PERFMON_EVENTSEL_EVENT) == 0) &&
+   ((config & INTEL_ARCH_EVENT_MASK) >= INTEL_TD_METRIC_RETIRING)  
&&
+   ((config & INTEL_ARCH_EVENT_MASK) <= 
INTEL_TD_METRIC_AVAILABLE_MAX);
+}
+
 static int add_nr_metric_event(struct cpu_hw_events *cpuc,
   struct perf_event *event)
 {
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index cd4d542..eab1eba 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -5769,6 +5769,7 @@ __init int intel_pmu_init(void)
x86_pmu.rtm_abort_event = X86_CONFIG(.event=0xc9, .umask=0x04);
x86_pmu.lbr_pt_coexist = true;
intel_pmu_pebs_data_source_skl(pmem);
+   x86_pmu.num_topdown_events = 4;
x86_pmu.update_topdown_event = icl_update_topdown_event;
x86_pmu.set_topdown_event_period = icl_set_topdown_event_period;
pr_cont("Icelake events, ");
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 15977d0..8b05893 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -87,14 +87,7 @@ static inline bool is_topdown_count(struct perf_event 
*event)

return event->hw.flags & PERF_X86_EVENT_TOPDOWN;
 }

-static inline bool is_metric_event(struct perf_event *event)
-{
-   u64 config = event->attr.config;
-
-   return ((config & ARCH_PERFMON_EVENTSEL_EVENT) == 0) &&
-   ((config & INTEL_ARCH_EVENT_MASK) >= INTEL_TD_METRIC_RETIRING)  
&&
-   ((config & INTEL_ARCH_EVENT_MASK) <= INTEL_TD_METRIC_MAX);
-}
+inline bool is_metric_event(struct perf_event *event);

 static inline bool is_slots_event(struct perf_event *event)
 {
@@ -782,6 +775,7 @@ struct x86_pmu {
/*
 * Intel perf metrics
 */
+   int num_topdown_events;
u64 (*update_topdown_event)(struct perf_event *event);
int (*set_topdown_event_period)(struct perf_event *event);


Thanks,
Kan


Re: [PATCH 03/12] perf/x86/intel: Add perf core PMU support for Sapphire Rapids

2021-01-26 Thread Liang, Kan




On 1/26/2021 10:37 AM, Peter Zijlstra wrote:

On Tue, Jan 19, 2021 at 12:38:22PM -0800, kan.li...@linux.intel.com wrote:

@@ -1577,9 +1668,20 @@ static void setup_pebs_adaptive_sample_data(struct 
perf_event *event,
}
  
  	if (format_size & PEBS_DATACFG_MEMINFO) {

+   if (sample_type & PERF_SAMPLE_WEIGHT) {
+   u64 weight = meminfo->latency;
+
+   if (x86_pmu.flags & PMU_FL_INSTR_LATENCY)
+   weight >>= PEBS_CACHE_LATENCY_OFFSET;
+   data->weight = weight & PEBS_LATENCY_MASK ?:
intel_get_tsx_weight(meminfo->tsx_tuning);
+   }
+
+   if (sample_type & PERF_SAMPLE_WEIGHT_EXT) {
+   data->weight_ext.val = 0;
+   if (x86_pmu.flags & PMU_FL_INSTR_LATENCY)
+   data->weight_ext.instr_latency = meminfo->latency 
& PEBS_LATENCY_MASK;
+   }
  
  		if (sample_type & PERF_SAMPLE_DATA_SRC)

data->data_src.val = get_data_src(event, meminfo->aux);


Talk to me about that SAMPLE_WEIGHT stuff I'm not liking it.

Sure you want multiple dimensions, but urgh.

Also, afaict, as proposed you're wasting 80/128 bits. That is, all data
you want to export fits in a single u64 and yet you're using two, which
is mighty daft.

Sure, pebs::lat / pebs_meminfo::latency is defined as a u64, but you
can't tell me that that is ever actually more than 4G cycles. Even the
TSX block latency is u32.

So how about defining SAMPLE_WEIGHT_STRUCT which uses the exact same
data as SAMPLE_WEIGHT but unions it with a struct. I'm not sure if we
want:

union sample_weight {
u64 weight;

struct {
u32 low_dword;
u32 high_dword;
};

/* or */

struct {
u32 low_dword;
u16 high_word;
u16 higher_word;
};
};

Then have the core code enforce SAMPLE_WEIGHT ^ SAMPLE_WEIGHT_STRUCT and
make the existing code never set the high dword.


So the kernel will only accept either SAMPLE_WEIGHT type or 
SAMPLE_WEIGHT_STRUCT type. It should error out if both types are set, right?


I will check if u32 is enough for meminfo::latency on the previous 
platforms.


Thanks,
Kan


Re: [PATCH 03/12] perf/x86/intel: Add perf core PMU support for Sapphire Rapids

2021-01-26 Thread Liang, Kan




On 1/26/2021 9:43 AM, Peter Zijlstra wrote:

On Tue, Jan 19, 2021 at 12:38:22PM -0800, kan.li...@linux.intel.com wrote:

@@ -2319,6 +2474,17 @@ static void __icl_update_topdown_event(struct perf_event 
*event,
  {
u64 delta, last = 0;
  
+	/*

+* Although the unsupported topdown events are not exposed to users,
+* users may mistakenly use the unsupported events via RAW format.
+* For example, using L2 topdown event, cpu/event=0x00,umask=0x84/,
+* on Ice Lake. In this case, the scheduler follows the unknown
+* event handling and assigns a GP counter to the event.
+* Check the case, and avoid updating unsupported events.
+*/
+   if (event->hw.idx < INTEL_PMC_IDX_FIXED)
+   return;
+
delta = icl_get_topdown_value(event, slots, metrics);
if (last_slots)
last = icl_get_topdown_value(event, last_slots, last_metrics);


Is this a separate patch?



I will move it to a separate patch.

Thanks,
Kan


Re: [PATCH 03/12] perf/x86/intel: Add perf core PMU support for Sapphire Rapids

2021-01-26 Thread Liang, Kan




On 1/26/2021 9:44 AM, Peter Zijlstra wrote:

On Tue, Jan 19, 2021 at 12:38:22PM -0800, kan.li...@linux.intel.com wrote:

@@ -3671,6 +3853,31 @@ static int intel_pmu_hw_config(struct perf_event *event)
}
}
  
+	/*

+* To retrieve complete Memory Info of the load latency event, an
+* auxiliary event has to be enabled simultaneously. Add a check for
+* the load latency event.
+*
+* In a group, the auxiliary event must be in front of the load latency
+* event. The rule is to simplify the implementation of the check.
+* That's because perf cannot have a complete group at the moment.
+*/
+   if (x86_pmu.flags & PMU_FL_MEM_LOADS_AUX &&
+   (event->attr.sample_type & PERF_SAMPLE_DATA_SRC) &&
+   is_mem_loads_event(event)) {
+   struct perf_event *leader = event->group_leader;
+   struct perf_event *sibling = NULL;
+
+   if (!is_mem_loads_aux_event(leader)) {
+   for_each_sibling_event(sibling, leader) {
+   if (is_mem_loads_aux_event(sibling))
+   break;
+   }
+   if (list_entry_is_head(sibling, >sibling_list, 
sibling_list))
+   return -ENODATA;
+   }
+   }
+
if (!(event->attr.config & ARCH_PERFMON_EVENTSEL_ANY))
return 0;
  


I have vague memories of this getting mentioned in a call at some point.
Pretend I don't know anything and tell me more.



Adding the auxiliary event is for the new data source fields, data block 
& address block. If perf only samples the load latency event, the value 
of the data block & address block fields in a sample is not correct. To 
get the correct value, we have to sample both the auxiliary event and 
the load latency event together on SPR. So I added the check in the kernel. I 
also modified perf mem in the perf tool accordingly.
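
For illustration on the tool side, the resulting event group is the same 
string the tool patches already build (ldlat=30 below is just an example 
latency threshold):

  # auxiliary event first, load latency event second, sample-read the group
  perf record -e '{cpu/mem-loads-aux/,cpu/mem-loads,ldlat=30/pp}:S' -- <workload>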


Thanks,
Kan


Re: [PATCH 01/12] perf/core: Add PERF_SAMPLE_WEIGHT_EXT

2021-01-26 Thread Liang, Kan




On 1/26/2021 9:42 AM, Peter Zijlstra wrote:

On Tue, Jan 19, 2021 at 12:38:20PM -0800, kan.li...@linux.intel.com wrote:


@@ -900,6 +901,13 @@ enum perf_event_type {
 *char  data[size]; } && PERF_SAMPLE_AUX
 *  { u64   data_page_size;} && 
PERF_SAMPLE_DATA_PAGE_SIZE
 *  { u64   code_page_size;} && 
PERF_SAMPLE_CODE_PAGE_SIZE
+*  { union {
+*  u64 weight_ext;
+*  struct {
+*  u64 instr_latency:16,
+*  reserved:48;
+*  };
+*  } && PERF_SAMPLE_WEIGHT_EXT
 * };
 */
PERF_RECORD_SAMPLE  = 9,
@@ -1248,4 +1256,12 @@ struct perf_branch_entry {
reserved:40;
  };
  
+union perf_weight_ext {

+   __u64   val;
+   struct {
+   __u64   instr_latency:16,
+   reserved:48;
+   };
+};
+
  #endif /* _UAPI_LINUX_PERF_EVENT_H */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 55d1879..9363d12 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1903,6 +1903,9 @@ static void __perf_event_header_size(struct perf_event 
*event, u64 sample_type)
if (sample_type & PERF_SAMPLE_CODE_PAGE_SIZE)
size += sizeof(data->code_page_size);
  
+	if (sample_type & PERF_SAMPLE_WEIGHT_EXT)

+   size += sizeof(data->weight_ext);
+
event->header_size = size;
  }
  
@@ -6952,6 +6955,9 @@ void perf_output_sample(struct perf_output_handle *handle,

perf_aux_sample_output(event, handle, data);
}
  
+	if (sample_type & PERF_SAMPLE_WEIGHT_EXT)

+   perf_output_put(handle, data->weight_ext);
+
if (!event->attr.watermark) {
int wakeup_events = event->attr.wakeup_events;
  


This patch is broken and will expose uninitialized kernel stack.



Could we initialize the 'weight_ext' in perf_sample_data_init()?

I understand that we prefer not to set the field in 
perf_sample_data_init() to minimize the cachelines touched.
However, the perf_sample_data_init() should be the most proper place to 
do the initialization. Also, the 'weight' is already initialized in it. 
As an extension, I think the 'weight_ext' should be initialized in it as 
well.


In the perf_prepare_sample(), I think we can only clear the unused 
fields. The [0:15] bits may still leak the data.
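
Concretely, the suggestion is just to zero the whole union next to the 
existing 'weight' initialization (sketch only; the weight_ext layout is the 
one from the quoted patch):

	/* in perf_sample_data_init(), alongside the existing fields */
	data->weight = 0;
	data->weight_ext.val = 0;	/* clears instr_latency and the reserved bits */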


Thanks,
Kan


Re: [PATCH] perf stat: Add Topdown metrics events as default events

2021-01-21 Thread Liang, Kan




On 1/21/2021 2:57 PM, Arnaldo Carvalho de Melo wrote:

On Thu, Jan 21, 2021 at 05:37:52AM -0800, kan.li...@linux.intel.com wrote:

From: Kan Liang 

The Topdown Microarchitecture Analysis (TMA) Method is a structured
analysis methodology to identify critical performance bottlenecks in
out-of-order processors. From the Ice Lake and later platforms, the
Topdown information can be retrieved from the dedicated "metrics"
register, which isn't impacted by other events. Also, the Topdown
metrics support both per thread/process and per core measuring.
Adding Topdown metrics events as default events can enrich the default
measuring information, and would not cost any extra multiplexing.

Introduce arch_evlist__add_default_attrs() to allow architecture
specific default events. Add the Topdown metrics events in the X86
specific arch_evlist__add_default_attrs(). Other architectures can
add their own default events later separately.

With the patch,

  $perf stat sleep 1

  Performance counter stats for 'sleep 1':

0.82 msec task-clock:u  #0.001 CPUs utilized
   0  context-switches:u#0.000 K/sec
   0  cpu-migrations:u  #0.000 K/sec
  61  page-faults:u #0.074 M/sec
 319,941  cycles:u  #0.388 GHz
 242,802  instructions:u#0.76  insn per cycle
  54,380  branches:u#   66.028 M/sec
   4,043  branch-misses:u   #7.43% of all branches
   1,585,555  slots:u   # 1925.189 M/sec
 238,941  topdown-retiring:u# 15.0% retiring
 410,378  topdown-bad-spec:u# 25.8% bad speculation
 634,222  topdown-fe-bound:u# 39.9% frontend bound
 304,675  topdown-be-bound:u# 19.2% backend bound


Shouldn't we be adding this to one of the -d levels?


The -d levels' events are architecture events, which should be available 
on all architectures. The Topdown events are X86 specific for now.

I don't think we should add the Topdown events into any of the -d levels.

Different architectures have their own unique events. I think we 
should allow each architecture to add its own unique and very useful events by default.




But you say that this is essentially for free, so you check if this
extra register is in place and if it is, hey, it's for free, add it, is
that the rationale?
I.e. it doesn't cause any impact to what we were getting before, so we
should default to use free stuff?


Free is just one of the benefits. The key motivation is that we think 
the Topdown information is very useful information, which can help users 
identify performance bottlenecks in modern out-of-order processors.


Here is an example that triggers lots of itlb misses.

With the patch, users can simply identify the performance issue with 
default perf stat (98.0% frontend bound, which leads the users to check 
the fetch bandwidth or the fetch latency related events, e.g., itlb 
misses).


$perf stat ./itlb -b1 -n200 -c0
 Performance counter stats for './itlb -b1 -n200 -c0':

 14,567.03 msec task-clock:u  #1.000 CPUs utilized
 0  context-switches:u#0.000 K/sec
 0  cpu-migrations:u  #0.000 K/sec
   683  page-faults:u #0.047 K/sec
55,890,568,522  cycles:u  #3.837 GHz
 3,601,934,137  instructions:u#0.06  insn per cycle
   400,353,806  branches:u#   27.484 M/sec
24,078  branch-misses:u   #0.01% of all 
branches

   279,391,452,475  slots:u   # 19179.715 M/sec
 2,230,224,702  topdown-retiring:u#  0.8% retiring
 3,286,958,264  topdown-bad-spec:u#  1.2% bad 
speculation
   273,953,648,626  topdown-fe-bound:u# 98.0% frontend 
bound

 4,579,804  topdown-be-bound:u#  0.0% backend bound



Without the patch, users cannot get any useful hints from the default 
perf stat. Even with -d -d, users still cannot locate the issue to itlb 
misses, because we don't have an architecture iTLB-loads event.

The -d -d also triggers the multiplexing.

$ perf stat -d -d ./itlb -b1 -n200 -c0

 Performance counter stats for './itlb -b1 -n200 -c0':

 14,633.58 msec task-clock:u  #1.000 CPUs utilized
 0  context-switches:u#0.000 K/sec
 0  cpu-migrations:u  #0.000 K/sec
   682  page-faults:u #0.047 K/sec
56,198,668,109  cycles:u  #3.840 GHz 
   (38.41%)
 3,602,460,111  instructions:u#0.06  insn per 
cycle   (46.12%)
   400,445,992  

Re: [PATCH V4 4/6] perf script: Add support for PERF_SAMPLE_CODE_PAGE_SIZE

2021-01-18 Thread Liang, Kan




On 1/15/2021 2:25 PM, Arnaldo Carvalho de Melo wrote:

On Tue, Jan 05, 2021 at 11:57:50AM -0800, kan.li...@linux.intel.com wrote:

From: Stephane Eranian

Display sampled code page sizes when PERF_SAMPLE_CODE_PAGE_SIZE was set.

For example,
perf script --fields comm,event,ip,code_page_size
 dtlb mem-loads:uP:445777 4K
 dtlb mem-loads:uP:40f724 4K
 dtlb mem-loads:uP:474926 4K
 dtlb mem-loads:uP:401075 4K
 dtlb mem-loads:uP:401095 4K
 dtlb mem-loads:uP:401095 4K
 dtlb mem-loads:uP:4010cc 4K
 dtlb mem-loads:uP:440b6f 4K

Acked-by: Namhyung Kim
Acked-by: Jiri Olsa
Signed-off-by: Stephane Eranian

You missed your Signed-off-by, I'm adding it, please ack this change.


Patches 4 and 5 are from Stephane. I only made minor changes so that 
the code could be rebased on the latest perf/core branch (c07b45a355ee).


You may add a tag as below.

[kan.li...@linux.intel.com: Rebase on top of acme's perf/core branch 
commit c07b45a355ee]

Signed-off-by: Kan Liang 


Thanks,
Kan


Re: [PATCH V4 0/6] Add the page size in the perf record (user tools)

2021-01-12 Thread Liang, Kan




On 1/12/2021 12:24 AM, Athira Rajeev wrote:




On 06-Jan-2021, at 1:27 AM, kan.li...@linux.intel.com wrote:

From: Kan Liang 

Changes since V3:
- Rebase on top of acme's perf/core branch
  commit c07b45a355ee ("perf record: Tweak "Lowering..." warning in 
record_opts__config_freq")

Changes since V2:
- Rebase on top of acme perf/core branch
  commit eec7b53d5916 ("perf test: Make sample-parsing test aware of 
PERF_SAMPLE_{CODE,DATA}_PAGE_SIZE")
- Use unit_number__scnprintf() in get_page_size_name()
- Emit warning about kernel not supporting the code page size sample_type bit

Changes since V1:
- Fix the compile warning with GCC 10
- Add Acked-by from Namhyung Kim

Current perf can report both virtual addresses and physical addresses,
but not the page size. Without the page size information of the utilized
page, users cannot decide whether to promote/demote large pages to
optimize memory usage.

The kernel patches have been merged into tip perf/core branch,
commit 8d97e71811aa ("perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE")
commit 76a5433f95f3 ("perf/x86/intel: Support PERF_SAMPLE_DATA_PAGE_SIZE")
commit 4cb6a42e4c4b ("powerpc/perf: Support PERF_SAMPLE_DATA_PAGE_SIZE")
commit 995f088efebe ("perf/core: Add support for PERF_SAMPLE_CODE_PAGE_SIZE")
commit 51b646b2d9f8 ("perf,mm: Handle non-page-table-aligned hugetlbfs")

and Peter's perf/core branch
commit 524680ce47a1 ("mm/gup: Provide gup_get_pte() more generic")
commit 44a35d6937d2 ("mm: Introduce pXX_leaf_size()")
commit 2f1e2f091ad0 ("perf/core: Fix arch_perf_get_page_size()")
commit 7649e44aacdd ("arm64/mm: Implement pXX_leaf_size() support")
commit 1df1ae7e262c ("sparc64/mm: Implement pXX_leaf_size() support")

This patch set is to enable the page size support in user tools.


Hi Kan Liang,

I am trying to check this series on powerpc.

# perf mem --phys-data --data-page-size record 

To my observation, some of the samples returned zero size and comes as ’N/A’ in 
the perf report

# perf mem --phys-data --data-page-size report

For fetching the page size, though initially there was a weak function added ( 
as arch_perf_get_page_size ) here:

https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/commit/?h=perf/core=51b646b2d9f84d6ff6300e3c1d09f2be4329a424

later I see it got removed here:

https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/commit/?h=perf/core=8af26be062721e52eba1550caf50b712f774c5fd

I picked kernel changes from 
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git , or I am missing 
something ?


I believe all the kernel changes have been merged.

According to the commit message of the recent changes, only Power/8xxx 
is supported for power for now. I guess that may be the reason for some 
of the 'N/A's.

https://lore.kernel.org/patchwork/cover/1345521/

Thanks,
Kan




Thanks
Athira



Kan Liang (3):
  perf mem: Clean up output format
  perf mem: Support data page size
  perf tools: Add support for PERF_SAMPLE_CODE_PAGE_SIZE

Stephane Eranian (3):
  perf script: Add support for PERF_SAMPLE_CODE_PAGE_SIZE
  perf report: Add support for PERF_SAMPLE_CODE_PAGE_SIZE
  perf test: Add test case for PERF_SAMPLE_CODE_PAGE_SIZE

tools/perf/Documentation/perf-mem.txt |   3 +
tools/perf/Documentation/perf-record.txt  |   3 +
tools/perf/Documentation/perf-report.txt  |   1 +
tools/perf/Documentation/perf-script.txt  |   2 +-
tools/perf/builtin-mem.c  | 111 +++---
tools/perf/builtin-record.c   |   2 +
tools/perf/builtin-script.c   |  13 ++-
tools/perf/tests/sample-parsing.c |   4 +
tools/perf/util/event.h   |   1 +
tools/perf/util/evsel.c   |  18 +++-
tools/perf/util/evsel.h   |   1 +
tools/perf/util/hist.c|   2 +
tools/perf/util/hist.h|   1 +
tools/perf/util/perf_event_attr_fprintf.c |   2 +-
tools/perf/util/record.h  |   1 +
tools/perf/util/session.c |   3 +
tools/perf/util/sort.c|  26 +
tools/perf/util/sort.h|   2 +
tools/perf/util/synthetic-events.c|   8 ++
19 files changed, 144 insertions(+), 60 deletions(-)

--
2.25.1







Re: [PATCH 0/2] perf/x86/intel/uncore: Derive die id from NUMA info with more than 8 nodes

2021-01-11 Thread Liang, Kan




On 1/8/2021 10:35 AM, Steve Wahl wrote:

For Intel uncore, the registers being used to identify the die don't
contain enough bits to uniquely identify more than 8 dies.  On
systems with more than 8 dies, this results in error messages of the
form "skx_uncore: probe of :XX:XX.X failed with error -22", and
some perf counters showing up as "".

On such systems, use NUMA information to determine die id.

Continue to use the register information with 8 or fewer numa nodes to
cover cases like NUMA not being enabled.

The first patch moves translation from physical to logical die id
earlier in the code, and stores only the logical id.  The logical id
is the only one that is really used.  Without this change the second
patch would have to store both physical and logical id, which was much
more complicated.

The second patch adds the alternative of deriving the logical die id
from the NUMA information when there are more than 8 nodes.

Steve Wahl (2):
   perf/x86/intel/uncore: Store the logical die id instead of the
 physical die id.
   perf/x86/intel/uncore: With > 8 nodes, get pci bus die id from NUMA
 info

  arch/x86/events/intel/uncore.c   |  58 +-
  arch/x86/events/intel/uncore.h   |   5 +-
  arch/x86/events/intel/uncore_snb.c   |   2 +-
  arch/x86/events/intel/uncore_snbep.c | 114 ++-
  4 files changed, 99 insertions(+), 80 deletions(-)



Thanks Steve for working on the issue. The patch set looks good to me.

Reviewed-by: Kan Liang 

Thanks,
Kan


Re: [PATCH] perf intel-pt: Fix 'CPU too large' error

2021-01-07 Thread Liang, Kan




On 1/7/2021 12:41 PM, Adrian Hunter wrote:

In some cases, the number of cpus (nr_cpus_online) is confused with the
maximum cpu number (nr_cpus_avail), which results in the error in the
example below:

Example on system with 8 cpus:

  Before:
# echo 0 > /sys/devices/system/cpu/cpu2/online
# ./perf record --kcore -e intel_pt// taskset --cpu-list 7 uname
Linux
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.147 MB perf.data ]
# ./perf script --itrace=e
Requested CPU 7 too large. Consider raising MAX_NR_CPUS
0x25908 [0x8]: failed to process type: 68 [Invalid argument]

  After:
# ./perf script --itrace=e
#

Fixes: 8c7274691f0d ("perf machine: Replace MAX_NR_CPUS with perf_env::nr_cpus_online")
Fixes: 7df4e36a4785 ("perf session: Replace MAX_NR_CPUS with perf_env::nr_cpus_online")
Cc: sta...@vger.kernel.org
Signed-off-by: Adrian Hunter 



Tested-by: Kan Liang 

Thanks,
Kan


---
  tools/perf/util/machine.c | 4 ++--
  tools/perf/util/session.c | 2 +-
  2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 1ae32a81639c..46844599d25d 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -2977,7 +2977,7 @@ int machines__for_each_thread(struct machines *machines,
  
  pid_t machine__get_current_tid(struct machine *machine, int cpu)

  {
-   int nr_cpus = min(machine->env->nr_cpus_online, MAX_NR_CPUS);
+   int nr_cpus = min(machine->env->nr_cpus_avail, MAX_NR_CPUS);
  
  	if (cpu < 0 || cpu >= nr_cpus || !machine->current_tid)

return -1;
@@ -2989,7 +2989,7 @@ int machine__set_current_tid(struct machine *machine, int 
cpu, pid_t pid,
 pid_t tid)
  {
struct thread *thread;
-   int nr_cpus = min(machine->env->nr_cpus_online, MAX_NR_CPUS);
+   int nr_cpus = min(machine->env->nr_cpus_avail, MAX_NR_CPUS);
  
  	if (cpu < 0)

return -EINVAL;
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 3b3c50b12791..2777c2df7d87 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -2391,7 +2391,7 @@ int perf_session__cpu_bitmap(struct perf_session *session,
  {
int i, err = -1;
struct perf_cpu_map *map;
-   int nr_cpus = min(session->header.env.nr_cpus_online, MAX_NR_CPUS);
+   int nr_cpus = min(session->header.env.nr_cpus_avail, MAX_NR_CPUS);
  
  	for (i = 0; i < PERF_TYPE_MAX; ++i) {

struct evsel *evsel;



Re: [PATCH V3 5/9] perf mem: Support data page size

2021-01-05 Thread Liang, Kan




On 12/19/2020 3:56 PM, Arnaldo Carvalho de Melo wrote:

On Wed, Dec 16, 2020 at 10:58:01AM -0800, kan.li...@linux.intel.com wrote:

From: Kan Liang 

Add option --data-page-size in "perf mem" to record/report data page
size.

Here are some examples.
perf mem --phys-data --data-page-size report -D


So I stopped at this cset, it isn't applying to my tree, I'll test what
I have, which is up to the patch before this one and push to Linus, as
the window is closing.


Hi Arnaldo,

Sorry for the late response. I was on vacation.

I will rebase the rest of the patches on top of your perf/core branch 
and send them out shortly.


Thanks,
Kan



- Arnaldo
  

  # PID, TID, IP, ADDR, PHYS ADDR, DATA PAGE SIZE, LOCAL WEIGHT, DSRC,
  # SYMBOL
20134 20134 0xb5bd2fd0 0x0169a274e96a308 0x00044e96a308
4K  1168 0x5080144
/lib/modules/4.18.0-rc7+/build/vmlinux:perf_ctx_unlock
20134 20134 0xb63f645c 0xb752b814 0xcfb52b814 2M 225
0x26a100142 /lib/modules/4.18.0-rc7+/build/vmlinux:_raw_spin_lock
20134 20134 0xb660300c 0xfe00016b8bb0 0x0 4K 0 0x5080144
/lib/modules/4.18.0-rc7+/build/vmlinux:__x86_indirect_thunk_rax

perf mem --phys-data --data-page-size report --stdio

  # To display the perf.data header info, please use
  # --header/--header-only options.
  #
  #
  # Total Lost Samples: 0
  #
  # Samples: 5K of event 'cpu/mem-loads,ldlat=30/P'
  # Total weight : 281234
  # Sort order   :
  # mem,sym,dso,symbol_daddr,dso_daddr,tlb,locked,phys_daddr,data_page_size
  #
  # Overhead   Samples  Memory access Symbol
  # Shared Object Data Symbol  Data
  # Object  TLB access  Locked  Data Physical
  # Address   Data Page Size
  #     
  #   
  # ...  ...
  # ..  ..  ..
  # ..
  #
 28.54%  1826  L1 or L1 hit  [k]
__x86_indirect_thunk_rax  [kernel.vmlinux]  [k] 0xb0df31b0ff28
[unknown]L1 or L2 hitNo  [k]
4K
  6.02%   256  L1 or L1 hit  [.] touch_buffer
dtlb  [.] 0x7ffd50109da8   [stack]
L1 or L2 hitNo  [.] 0x00042454ada8  4K
  3.23% 5  L1 or L1 hit  [k] clear_huge_page
[kernel.vmlinux]  [k] 0x9a2753b8ce60   [unknown]
L1 or L2 hitNo  [k] 0x000453b8ce60  2M
  2.98% 4  L1 or L1 hit  [k] clear_page_erms
[kernel.vmlinux]  [k] 0xb0df31b0fd00   [unknown]
L1 or L2 hitNo  [k] 4K

Acked-by: Namhyung Kim 
Acked-by: Jiri Olsa 
Signed-off-by: Kan Liang 
---
  tools/perf/Documentation/perf-mem.txt |  3 +++
  tools/perf/builtin-mem.c  | 20 +++-
  2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/tools/perf/Documentation/perf-mem.txt 
b/tools/perf/Documentation/perf-mem.txt
index 199ea0f0a6c0..66177511c5c4 100644
--- a/tools/perf/Documentation/perf-mem.txt
+++ b/tools/perf/Documentation/perf-mem.txt
@@ -63,6 +63,9 @@ OPTIONS
  --phys-data::
Record/Report sample physical addresses
  
+--data-page-size::

+   Record/Report sample data address page size
+
  RECORD OPTIONS
  --
  -e::
diff --git a/tools/perf/builtin-mem.c b/tools/perf/builtin-mem.c
index 7d6ee2208709..f3aac85aa9d4 100644
--- a/tools/perf/builtin-mem.c
+++ b/tools/perf/builtin-mem.c
@@ -30,6 +30,7 @@ struct perf_mem {
booldump_raw;
boolforce;
boolphys_addr;
+   booldata_page_size;
int operation;
const char  *cpu_list;
DECLARE_BITMAP(cpu_bitmap, MAX_NR_CPUS);
@@ -124,6 +125,9 @@ static int __cmd_record(int argc, const char **argv, struct 
perf_mem *mem)
if (mem->phys_addr)
rec_argv[i++] = "--phys-data";
  
+	if (mem->data_page_size)

+   rec_argv[i++] = "--data-page-size";
+
for (j = 0; j < PERF_MEM_EVENTS__MAX; j++) {
e = perf_mem_events__ptr(j);
if (!e->record)
@@ -173,6 +177,7 @@ dump_raw_samples(struct perf_tool *tool,
struct perf_mem *mem = container_of(tool, struct perf_mem, tool);
struct addr_location al;
const char *fmt, *field_sep;
+   char str[PAGE_SIZE_NAME_LEN];
  
	if (machine__resolve(machine, &al, sample) < 0) {

fprintf(stderr, "problem processing %d event, skipping it.\n",
@@ -209,6 +214,12 @@ dump_raw_samples(struct perf_tool *tool,
symbol_conf.field_sep);
}
  
+	if (mem->data_page_size) {

+   printf("%s%s",
+   

Re: [PATCH V2 3/3] perf: Optimize sched_task() in a context switch

2020-12-10 Thread Liang, Kan




On 12/10/2020 2:13 AM, Namhyung Kim wrote:

Hi Peter and Kan,

How can we move this forward?


Hi Namhyung,

Thanks for the test. The changes look good to me.

Hi Peter,

Should we resend the patch set for further review?


Thanks,
Kan




Thanks,
Namhyung

On Fri, Dec 4, 2020 at 4:14 PM Namhyung Kim  wrote:


Hi Peter,

On Wed, Dec 2, 2020 at 2:29 AM Peter Zijlstra  wrote:


On Mon, Nov 30, 2020 at 11:38:42AM -0800, kan.li...@linux.intel.com wrote:

From: Kan Liang 

Some calls to sched_task() in a context switch can be avoided. For
example, large PEBS only requires flushing the buffer in context switch
out. The current code still invokes the sched_task() for large PEBS in
context switch in.


I still hate this one, how's something like this then?
Which I still don't really like.. but at least its simpler.

(completely untested, may contain spurious edits, might ICE the
compiler and set your pets on fire if it doesn't)


I've tested this version... and it worked well besides the optimization.. :)

[SNIP]

+static void context_sched_task(struct perf_cpu_context *cpuctx,
+  struct perf_event_context *ctx,
+  bool sched_in)
+{
+   struct pmu *pmu = ctx->pmu;
+
+   if (cpuctx->sched_cb_dir[sched_in] && pmu->sched_task)
+   pmu->sched_task(ctx, false);


applied: s/false/sched_in/



+}
+
  static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
  struct task_struct *next)
  {
@@ -3424,9 +3433,7 @@ static void perf_event_context_sched_out
 WRITE_ONCE(next_ctx->task, task);

 perf_pmu_disable(pmu);
-
-   if (cpuctx->sched_cb_usage && pmu->sched_task)
-   pmu->sched_task(ctx, false);
+   context_sched_task(cpuctx, ctx, false);

 /*
  * PMU specific parts of task perf context can require
@@ -3465,8 +3472,7 @@ static void perf_event_context_sched_out
 raw_spin_lock(&ctx->lock);
 perf_pmu_disable(pmu);

-   if (cpuctx->sched_cb_usage && pmu->sched_task)
-   pmu->sched_task(ctx, false);
+   context_sched_task(cpuctx, ctx, false);
 task_ctx_sched_out(cpuctx, ctx, EVENT_ALL);

 perf_pmu_enable(pmu);


[SNIP]

@@ -3563,8 +3582,7 @@ void __perf_event_task_sched_out(struct
  {
 int ctxn;

-   if (__this_cpu_read(perf_sched_cb_usage))
-   perf_pmu_sched_task(task, next, false);
+   perf_pmu_sched_task(task, next, false);


I think the reason is this change.  It now calls perf_pmu_sched_task()
without checking the counter.  And this is for per-cpu events.



 if (atomic_read(&nr_switch_events))
 perf_event_switch(task, next, false);
@@ -3828,8 +3846,7 @@ static void perf_event_context_sched_in(
 cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
 perf_event_sched_in(cpuctx, ctx, task);

-   if (cpuctx->sched_cb_usage && pmu->sched_task)
-   pmu->sched_task(cpuctx->task_ctx, true);
+   context_sched_task(cpuctx, ctx, true);

 perf_pmu_enable(pmu);

@@ -3875,8 +3892,7 @@ void __perf_event_task_sched_in(struct t
 if (atomic_read(&nr_switch_events))
 perf_event_switch(task, prev, true);

-   if (__this_cpu_read(perf_sched_cb_usage))
-   perf_pmu_sched_task(prev, task, true);
+   perf_pmu_sched_task(prev, task, true);


Ditto.


  }

  static u64 perf_calculate_period(struct perf_event *event, u64 nsec, u64 
count)


So I made a change like below.. and it could bring the optimization back.

Thanks,
Namhyung


diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9107e7c3ccfb..a30243a9fab5 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3528,6 +3528,9 @@ static void __perf_pmu_sched_task(struct
perf_cpu_context *cpuctx, bool sched_in
  {
 struct pmu *pmu;

+   if (!cpuctx->sched_cb_dir[sched_in])
+   return;
+
 pmu = cpuctx->ctx.pmu; /* software PMUs will not have sched_task */

 if (WARN_ON_ONCE(!pmu->sched_task))


Re: [PATCH V2 02/12] perf record: Support new sample type for data page size

2020-12-07 Thread Liang, Kan




On 12/7/2020 12:07 PM, Arnaldo Carvalho de Melo wrote:

Em Mon, Nov 30, 2020 at 09:27:53AM -0800, kan.li...@linux.intel.com escreveu:

From: Kan Liang 

Support new sample type PERF_SAMPLE_DATA_PAGE_SIZE for page size.

Add new option --data-page-size to record sample data page size.


So, trying this on a kernel without this feature I get:

[acme@five perf]$ perf record --data-page-size sleep 1
Error:
The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event 
(cycles:u).
/bin/dmesg | grep -i perf may provide additional information.

[acme@five perf]$

I'm adding the following patch right after yours, next time please test
this and provide a similar error message.



Sorry, I missed it.

Besides the PERF_SAMPLE_DATA_PAGE_SIZE, I think we have to fix the 
PERF_SAMPLE_CODE_PAGE_SIZE as well.

Should I send a separate patch to fix it?

Thanks,
Kan
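
For what it's worth, the PERF_SAMPLE_CODE_PAGE_SIZE case would presumably
follow the same pattern as the hunk quoted below; a sketch only, where the
perf_missing_features.code_page_size field is an assumption and not part
of the patch:

	/* Sketch: mirrors the data_page_size probing in evsel__open_cpu(). */
	if (!perf_missing_features.code_page_size &&
	    (evsel->core.attr.sample_type & PERF_SAMPLE_CODE_PAGE_SIZE)) {
		perf_missing_features.code_page_size = true;
		pr_debug2_peo("Kernel has no PERF_SAMPLE_CODE_PAGE_SIZE support, bailing out\n");
		goto out_close;
	}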



- Arnaldo

commit 2044fec7fcc6070b09f9b6a67922b0b9e4295dba
Author: Arnaldo Carvalho de Melo 
Date:   Mon Dec 7 14:04:05 2020 -0300

 perf evsel: Emit warning about kernel not supporting the data page size 
sample_type bit
 
 Before we had this unhelpful message:
 
   $ perf record --data-page-size sleep 1

   Error:
   The sys_perf_event_open() syscall returned with 22 (Invalid argument) 
for event (cycles:u).
   /bin/dmesg | grep -i perf may provide additional information.
   $
 
 Add support to the perf_missing_features variable to remember what

 caused evsel__open() to fail and then use that information in
 evsel__open_strerror().
 
   $ perf record --data-page-size sleep 1

   Error:
   Asking for the data page size isn't supported by this kernel.
   $
 
 Cc: Kan Liang 

 Cc: Namhyung Kim 
 Cc: Andi Kleen 
 Cc: Jiri Olsa 
 Cc: Mark Rutland 
 Cc: Michael Ellerman 
 Cc: Stephane Eranian 
 Cc: Will Deacon 
 Signed-off-by: Arnaldo Carvalho de Melo 

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 5e6085c3fc761a55..c26ea82220bd8625 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1873,7 +1873,12 @@ static int evsel__open_cpu(struct evsel *evsel, struct 
perf_cpu_map *cpus,
 * Must probe features in the order they were added to the
 * perf_event_attr interface.
 */
-if (!perf_missing_features.cgroup && evsel->core.attr.cgroup) {
+if (!perf_missing_features.data_page_size &&
+   (evsel->core.attr.sample_type & PERF_SAMPLE_DATA_PAGE_SIZE)) {
+   perf_missing_features.data_page_size = true;
+   pr_debug2_peo("Kernel has no PERF_SAMPLE_DATA_PAGE_SIZE support, 
bailing out\n");
+   goto out_close;
+   } else if (!perf_missing_features.cgroup && evsel->core.attr.cgroup) {
perf_missing_features.cgroup = true;
pr_debug2_peo("Kernel has no cgroup sampling support, bailing 
out\n");
goto out_close;
@@ -2673,6 +2678,8 @@ int evsel__open_strerror(struct evsel *evsel, struct 
target *target,
"We found oprofile daemon running, please stop it and try again.");
break;
case EINVAL:
+   if (evsel->core.attr.sample_type & PERF_SAMPLE_DATA_PAGE_SIZE 
&& perf_missing_features.data_page_size)
+   return scnprintf(msg, size, "Asking for the data page size 
isn't supported by this kernel.");
if (evsel->core.attr.write_backward && 
perf_missing_features.write_backward)
return scnprintf(msg, size, "Reading from overwrite event is 
not supported by this kernel.");
if (perf_missing_features.clockid)
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index 79a860d8e3eefe23..cd1d8dd431997b84 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -144,6 +144,7 @@ struct perf_missing_features {
bool aux_output;
bool branch_hw_idx;
bool cgroup;
+   bool data_page_size;
  };
  
  extern struct perf_missing_features perf_missing_features;




Re: [PATCH V2 06/12] perf mem: Clean up output format

2020-12-07 Thread Liang, Kan




On 12/4/2020 6:27 PM, Jiri Olsa wrote:

On Mon, Nov 30, 2020 at 09:27:57AM -0800, kan.li...@linux.intel.com wrote:

SNIP


@@ -172,7 +172,7 @@ dump_raw_samples(struct perf_tool *tool,
  {
struct perf_mem *mem = container_of(tool, struct perf_mem, tool);
struct addr_location al;
-   const char *fmt;
+   const char *fmt, *field_sep;
  
	if (machine__resolve(machine, &al, sample) < 0) {

fprintf(stderr, "problem processing %d event, skipping it.\n",
@@ -186,60 +186,41 @@ dump_raw_samples(struct perf_tool *tool,
if (al.map != NULL)
al.map->dso->hit = 1;
  
-	if (mem->phys_addr) {

-   if (symbol_conf.field_sep) {
-   fmt = "%d%s%d%s0x%"PRIx64"%s0x%"PRIx64"%s0x%016"PRIx64
- "%s%"PRIu64"%s0x%"PRIx64"%s%s:%s\n";
-   } else {
-   fmt = "%5d%s%5d%s0x%016"PRIx64"%s0x016%"PRIx64
- "%s0x%016"PRIx64"%s%5"PRIu64"%s0x%06"PRIx64
- "%s%s:%s\n";
-   symbol_conf.field_sep = " ";
-   }
-
-   printf(fmt,
-   sample->pid,
-   symbol_conf.field_sep,
-   sample->tid,
-   symbol_conf.field_sep,
-   sample->ip,
-   symbol_conf.field_sep,
-   sample->addr,
-   symbol_conf.field_sep,
-   sample->phys_addr,
-   symbol_conf.field_sep,
-   sample->weight,
-   symbol_conf.field_sep,
-   sample->data_src,
-   symbol_conf.field_sep,
-   al.map ? (al.map->dso ? al.map->dso->long_name : "???") : 
"???",
-   al.sym ? al.sym->name : "???");
+   field_sep = symbol_conf.field_sep;


hum, what's the point of having field_sep?



To keep the fmt consistent.

The patch divides the "printf(fmt, ..." into two parts. In the first
half, symbol_conf.field_sep may be modified to " ".
In the second half, we should not use the modified symbol_conf.field_sep
for the check below. The local field_sep keeps the original value and
should be used instead.







+   if (field_sep) {
+   fmt = "%d%s%d%s0x%"PRIx64"%s0x%"PRIx64"%s";
} else {
-   if (symbol_conf.field_sep) {
-   fmt = "%d%s%d%s0x%"PRIx64"%s0x%"PRIx64"%s%"PRIu64
- "%s0x%"PRIx64"%s%s:%s\n";
-   } else {
-   fmt = "%5d%s%5d%s0x%016"PRIx64"%s0x016%"PRIx64
- "%s%5"PRIu64"%s0x%06"PRIx64"%s%s:%s\n";
-   symbol_conf.field_sep = " ";
-   }
+   fmt = "%5d%s%5d%s0x%016"PRIx64"%s0x016%"PRIx64"%s";
+   symbol_conf.field_sep = " ";
+   }
+   printf(fmt,
+   sample->pid,
+   symbol_conf.field_sep,
+   sample->tid,
+   symbol_conf.field_sep,
+   sample->ip,
+   symbol_conf.field_sep,
+   sample->addr,
+   symbol_conf.field_sep);
  
-		printf(fmt,

-   sample->pid,
-   symbol_conf.field_sep,
-   sample->tid,
-   symbol_conf.field_sep,
-   sample->ip,
-   symbol_conf.field_sep,
-   sample->addr,
-   symbol_conf.field_sep,
-   sample->weight,
-   symbol_conf.field_sep,
-   sample->data_src,
-   symbol_conf.field_sep,
-   al.map ? (al.map->dso ? al.map->dso->long_name : "???") : 
"???",
-   al.sym ? al.sym->name : "???");
+   if (mem->phys_addr) {
+   printf("0x%016"PRIx64"%s",
+   sample->phys_addr,
+   symbol_conf.field_sep);
}
+
+   if (field_sep)
+   fmt = "%"PRIu64"%s0x%"PRIx64"%s%s:%s\n";
+   else
+   fmt = "%5"PRIu64"%s0x%06"PRIx64"%s%s:%s\n";
+
+   printf(fmt,
+   sample->weight,
+   symbol_conf.field_sep,
+   sample->data_src,
+   symbol_conf.field_sep,
+   al.map ? (al.map->dso ? al.map->dso->long_name : "???") : "???",
+   al.sym ? al.sym->name : "???");
  out_put:
	addr_location__put(&al);
return 0;
@@ -287,10 +268,12 @@ static int report_raw_events(struct perf_mem *mem)
if (ret < 0)
goto out_delete;
  
+	printf("# PID, TID, IP, ADDR, ");

+
if (mem->phys_addr)
-   printf("# PID, TID, IP, ADDR, PHYS ADDR, LOCAL WEIGHT, DSRC, 
SYMBOL\n");
-   else
-   printf("# PID, TID, IP, ADDR, LOCAL WEIGHT, DSRC, SYMBOL\n");
+  

Re: [PATCH 00/12] Add the page size in the perf record (user tools)

2020-11-30 Thread Liang, Kan




On 11/27/2020 3:22 PM, Jiri Olsa wrote:

hi,
I can't compile this on Fedora 32, check the log below



The patch set was compiled with GCC 9. I will install GCC 10 and fix
all the warnings.


Thanks,
Kan
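
For reference, a self-contained sketch of the kind of helper GCC 10 is
complaining about, with the destination sized for the worst case so that
-Wformat-truncation has nothing to warn about. The constant and the
suffix table here are assumptions for illustration, not the values in
util/session.c:

#include <inttypes.h>
#include <stdio.h>

/* A u64 needs at most 20 decimal digits, plus one suffix character and the NUL. */
#define PAGE_SIZE_NAME_LEN (20 + 1 + 1)

static char *page_size_name(uint64_t size, char *str)
{
	static const char suffixes[] = { 'B', 'K', 'M', 'G', 'T', 'P', 'E' };
	size_t i = 0;

	while (size >= 1024 && (size % 1024) == 0 && i < sizeof(suffixes) - 1) {
		size /= 1024;
		i++;
	}
	snprintf(str, PAGE_SIZE_NAME_LEN, "%" PRIu64 "%c", size, suffixes[i]);
	return str;
}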


jirka


---
$ make JOBS=1
   BUILD:   Doing 'make -j1' parallel build
Warning: Kernel ABI header at 'tools/include/uapi/linux/perf_event.h' differs 
from latest version at 'include/uapi/linux/perf_event.h'
diff -u tools/include/uapi/linux/perf_event.h include/uapi/linux/perf_event.h
   CC   util/parse-events.o
   CC   util/session.o
util/session.c: In function ‘machines__deliver_event’:
util/session.c:1278:37: error: ‘%lu’ directive output may be truncated writing 
between 1 and 20 bytes into a region of size 10 [-Werror=format-truncation=]
  1278 |  snprintf(str, PAGE_SIZE_NAME_LEN, "%lu%c", size, suffixes[i]);
   | ^~~
util/session.c:1278:36: note: directive argument in the range [1, 
18446744073709551615]
  1278 |  snprintf(str, PAGE_SIZE_NAME_LEN, "%lu%c", size, suffixes[i]);
   |^~~
In file included from /usr/include/stdio.h:867,
  from 
/home/jolsa/kernel/linux-perf/tools/lib/perf/include/perf/cpumap.h:6,
  from util/session.c:13:
/usr/include/bits/stdio2.h:67:10: note: ‘__builtin___snprintf_chk’ output 
between 3 and 22 bytes into a destination of size 10
67 |   return __builtin___snprintf_chk (__s, __n, __USE_FORTIFY_LEVEL - 1,
   |  ^~~~
68 |__bos (__s), __fmt, __va_arg_pack ());
   |~
util/session.c:1278:37: error: ‘%lu’ directive output may be truncated writing 
between 1 and 20 bytes into a region of size 10 [-Werror=format-truncation=]
  1278 |  snprintf(str, PAGE_SIZE_NAME_LEN, "%lu%c", size, suffixes[i]);
   | ^~~
util/session.c:1278:36: note: directive argument in the range [1, 
18446744073709551615]
  1278 |  snprintf(str, PAGE_SIZE_NAME_LEN, "%lu%c", size, suffixes[i]);
   |^~~
In file included from /usr/include/stdio.h:867,
  from 
/home/jolsa/kernel/linux-perf/tools/lib/perf/include/perf/cpumap.h:6,
  from util/session.c:13:
/usr/include/bits/stdio2.h:67:10: note: ‘__builtin___snprintf_chk’ output 
between 3 and 22 bytes into a destination of size 10
67 |   return __builtin___snprintf_chk (__s, __n, __USE_FORTIFY_LEVEL - 1,
   |  ^~~~
68 |__bos (__s), __fmt, __va_arg_pack ());
   |~
util/session.c: In function ‘get_page_size_name’:
util/session.c:1278:37: error: ‘%lu’ directive output may be truncated writing 
between 1 and 20 bytes into a region of size 10 [-Werror=format-truncation=]
  1278 |  snprintf(str, PAGE_SIZE_NAME_LEN, "%lu%c", size, suffixes[i]);
   | ^~~
util/session.c:1278:36: note: directive argument in the range [1, 
18446744073709551615]
  1278 |  snprintf(str, PAGE_SIZE_NAME_LEN, "%lu%c", size, suffixes[i]);
   |^~~
In file included from /usr/include/stdio.h:867,
  from 
/home/jolsa/kernel/linux-perf/tools/lib/perf/include/perf/cpumap.h:6,
  from util/session.c:13:
/usr/include/bits/stdio2.h:67:10: note: ‘__builtin___snprintf_chk’ output 
between 3 and 22 bytes into a destination of size 10
67 |   return __builtin___snprintf_chk (__s, __n, __USE_FORTIFY_LEVEL - 1,
   |  ^~~~
68 |__bos (__s), __fmt, __va_arg_pack ());
   |~
cc1: all warnings being treated as errors
make[4]: *** [/home/jolsa/kernel/linux-perf/tools/build/Makefile.build:97: 
util/session.o] Error 1
make[3]: *** [/home/jolsa/kernel/linux-perf/tools/build/Makefile.build:139: 
util] Error 2
make[2]: *** [Makefile.perf:647: perf-in.o] Error 2
make[1]: *** [Makefile.perf:233: sub-make] Error 2
make: *** [Makefile:70: all] Error 2


Re: [PATCH 1/3] perf/core: Flush PMU internal buffers for per-CPU events

2020-11-24 Thread Liang, Kan




On 11/24/2020 12:42 AM, Madhavan Srinivasan wrote:


On 11/24/20 10:21 AM, Namhyung Kim wrote:

Hello,

On Mon, Nov 23, 2020 at 8:00 PM Michael Ellerman  
wrote:

Namhyung Kim  writes:

Hi Peter and Kan,

(Adding PPC folks)

On Tue, Nov 17, 2020 at 2:01 PM Namhyung Kim  
wrote:

Hello,

On Thu, Nov 12, 2020 at 4:54 AM Liang, Kan 
 wrote:



On 11/11/2020 11:25 AM, Peter Zijlstra wrote:

On Mon, Nov 09, 2020 at 09:49:31AM -0500, Liang, Kan wrote:

- When the large PEBS was introduced (9c964efa4330), the 
sched_task() should
be invoked to flush the PEBS buffer in each context switch. 
However, The
perf_sched_events in account_event() is not updated accordingly. 
The
perf_event_task_sched_* never be invoked for a pure per-CPU 
context. Only

per-task event works.
 At that time, the perf_pmu_sched_task() is outside of
perf_event_context_sched_in/out. It means that perf has to double
perf_pmu_disable() for per-task event.
- The patch 1 tries to fix broken per-CPU events. The CPU 
context cannot be
retrieved from the task->perf_event_ctxp. So it has to be 
tracked in the
sched_cb_list. Yes, the code is very similar to the original 
codes, but it
is actually the new code for per-CPU events. The optimization 
for per-task

events is still kept.
    For the case, which has both a CPU context and a task 
context, yes, the

__perf_pmu_sched_task() in this patch is not invoked. Because the
sched_task() only need to be invoked once in a context switch. The
sched_task() will be eventually invoked in the task context.
The thing is; your first two patches rely on PERF_ATTACH_SCHED_CB 
and
only set that for large pebs. Are you sure the other users (Intel 
LBR

and PowerPC BHRB) don't need it?
I didn't set it for LBR, because the perf_sched_events is always 
enabled

for LBR. But, yes, we should explicitly set the PERF_ATTACH_SCHED_CB
for LBR.

 if (has_branch_stack(event))
 inc = true;


If they indeed do not require the pmu::sched_task() callback for CPU
events, then I still think the whole perf_sched_cb_{inc,dec}() 
interface

No, LBR requires the pmu::sched_task() callback for CPU events.

Now, The LBR registers have to be reset in sched in even for CPU 
events.


To fix the shorter LBR callstack issue for CPU events, we also 
need to

save/restore LBRs in pmu::sched_task().
https://lore.kernel.org/lkml/1578495789-95006-4-git-send-email-kan.li...@linux.intel.com/ 




is confusing at best.

Can't we do something like this instead?


I think the below patch may have two issues.
- PERF_ATTACH_SCHED_CB is required for LBR (maybe PowerPC BHRB as 
well) now.

- We may disable the large PEBS later if not all PEBS events support
large PEBS. The PMU need a way to notify the generic code to decrease
the nr_sched_task.

Any updates on this?  I've reviewed and tested Kan's patches
and they all look good.

Maybe we can talk to PPC folks to confirm the BHRB case?
Can we move this forward?  I saw patch 3/3 also adds 
PERF_ATTACH_SCHED_CB

for PowerPC too.  But it'd be nice if ppc folks can confirm the change.

Sorry I've read the whole thread, but I'm still not entirely sure I
understand the question.

Thanks for your time and sorry about not being clear enough.

We found per-cpu events are not calling pmu::sched_task()
on context switches.  So PERF_ATTACH_SCHED_CB was
added to indicate to the core logic that it needs to invoke the
callback.

Patch 3/3 added the flag to PPC (for BHRB) along with other
changes (I think it should be split like in patch 2/3), and we
want to get ACKs from the PPC folks.


Sorry for delay.

I guess first it will be better to split the ppc change to a separate 
patch,


Both PPC and X86 invoke perf_sched_cb_inc() directly, and the patch
changes the parameters of perf_sched_cb_inc(). I think we have to
update the PPC and X86 code together. Otherwise, there will be a
compile error if someone applies the change to perf_sched_cb_inc()
but forgets to apply the changes in the PPC or X86 specific code.




secondly, we are missing the changes needed in the power_pmu_bhrb_disable()

where perf_sched_cb_dec() needs the "state" to be included.



Ah, right. The below patch should fix the issue.

diff --git a/arch/powerpc/perf/core-book3s.c 
b/arch/powerpc/perf/core-book3s.c

index bced502f64a1..6756d1602a67 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -391,13 +391,18 @@ static void power_pmu_bhrb_enable(struct 
perf_event *event)

 static void power_pmu_bhrb_disable(struct perf_event *event)
 {
	struct cpu_hw_events *cpuhw = this_cpu_ptr(&cpu_hw_events);
+   int state = PERF_SCHED_CB_SW_IN;

if (!ppmu->bhrb_nr)
return;

WARN_ON_ONCE(!cpuhw->bhrb_users);
cpuhw->bhrb_users--;
-   perf_sched_cb_dec(event->ctx->pmu);
+
+   if (!(event->attach_state & PERF_ATTACH_TASK))
+   state |= PERF_SCHED_CB_CPU;
+
+   perf_

Re: [PATCH 3/5] perf/core: Fix arch_perf_get_page_size()

2020-11-16 Thread Liang, Kan




On 11/13/2020 6:19 AM, Peter Zijlstra wrote:

The (new) page-table walker in arch_perf_get_page_size() is broken in
various ways. Specifically while it is used in a locless manner, it
doesn't depend on CONFIG_HAVE_FAST_GUP nor uses the proper _lockless
offset methods, nor is careful to only read each entry only once.

Also the hugetlb support is broken due to calling pte_page() without
first checking pte_special().

Rewrite the whole thing to be a proper lockless page-table walker and
employ the new pXX_leaf_size() pgtable functions to determine the TLB
size without looking at the page-frames.

Fixes: 51b646b2d9f8 ("perf,mm: Handle non-page-table-aligned hugetlbfs")
Fixes: 8d97e71811aa ("perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE")


The issue 
(https://lkml.kernel.org/r/8e88ba79-7c40-ea32-a7ed-bdc4fc04b...@linux.intel.com) 
has been fixed by this patch set.


Tested-by: Kan Liang 



Signed-off-by: Peter Zijlstra (Intel) 
---
  arch/arm64/include/asm/pgtable.h|3 +
  arch/sparc/include/asm/pgtable_64.h |   13 
  arch/sparc/mm/hugetlbpage.c |   19 --
  include/linux/pgtable.h |   16 +
  kernel/events/core.c|  102 
+---
  5 files changed, 82 insertions(+), 71 deletions(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7001,90 +7001,62 @@ static u64 perf_virt_to_phys(u64 virt)
return phys_addr;
  }
  
-#ifdef CONFIG_MMU

-
  /*
- * Return the MMU page size of a given virtual address.
- *
- * This generic implementation handles page-table aligned huge pages, as well
- * as non-page-table aligned hugetlbfs compound pages.
- *
- * If an architecture supports and uses non-page-table aligned pages in their
- * kernel mapping it will need to provide it's own implementation of this
- * function.
+ * Return the MMU/TLB page size of a given virtual address.
   */
-__weak u64 arch_perf_get_page_size(struct mm_struct *mm, unsigned long addr)
+static u64 perf_get_tlb_page_size(struct mm_struct *mm, unsigned long addr)
  {
-   struct page *page;
-   pgd_t *pgd;
-   p4d_t *p4d;
-   pud_t *pud;
-   pmd_t *pmd;
-   pte_t *pte;
+   u64 size = 0;
  
-	pgd = pgd_offset(mm, addr);

-   if (pgd_none(*pgd))
-   return 0;
+#ifdef CONFIG_HAVE_FAST_GUP
+   pgd_t *pgdp, pgd;
+   p4d_t *p4dp, p4d;
+   pud_t *pudp, pud;
+   pmd_t *pmdp, pmd;
+   pte_t *ptep, pte;
  
-	p4d = p4d_offset(pgd, addr);

-   if (!p4d_present(*p4d))
+   pgdp = pgd_offset(mm, addr);
+   pgd = READ_ONCE(*pgdp);
+   if (pgd_none(pgd))
return 0;
  
-	if (p4d_leaf(*p4d))

-   return 1ULL << P4D_SHIFT;
+   if (pgd_leaf(pgd))
+   return pgd_leaf_size(pgd);
  
-	pud = pud_offset(p4d, addr);

-   if (!pud_present(*pud))
+   p4dp = p4d_offset_lockless(pgdp, pgd, addr);
+   p4d = READ_ONCE(*p4dp);
+   if (!p4d_present(p4d))
return 0;
  
-	if (pud_leaf(*pud)) {

-#ifdef pud_page
-   page = pud_page(*pud);
-   if (PageHuge(page))
-   return page_size(compound_head(page));
-#endif
-   return 1ULL << PUD_SHIFT;
-   }
+   if (p4d_leaf(p4d))
+   return p4d_leaf_size(p4d);
  
-	pmd = pmd_offset(pud, addr);

-   if (!pmd_present(*pmd))
+   pudp = pud_offset_lockless(p4dp, p4d, addr);
+   pud = READ_ONCE(*pudp);
+   if (!pud_present(pud))
return 0;
  
-	if (pmd_leaf(*pmd)) {

-#ifdef pmd_page
-   page = pmd_page(*pmd);
-   if (PageHuge(page))
-   return page_size(compound_head(page));
-#endif
-   return 1ULL << PMD_SHIFT;
-   }
+   if (pud_leaf(pud))
+   return pud_leaf_size(pud);
  
-	pte = pte_offset_map(pmd, addr);

-   if (!pte_present(*pte)) {
-   pte_unmap(pte);
+   pmdp = pmd_offset_lockless(pudp, pud, addr);
+   pmd = READ_ONCE(*pmdp);
+   if (!pmd_present(pmd))
return 0;
-   }
  
-	page = pte_page(*pte);

-   if (PageHuge(page)) {
-   u64 size = page_size(compound_head(page));
-   pte_unmap(pte);
-   return size;
-   }
-
-   pte_unmap(pte);
-   return PAGE_SIZE;
-}
+   if (pmd_leaf(pmd))
+   return pmd_leaf_size(pmd);
  
-#else

+   ptep = pte_offset_map(&pmd, addr);
+   pte = ptep_get_lockless(ptep);
+   if (pte_present(pte))
+   size = pte_leaf_size(pte);
+   pte_unmap(ptep);
+#endif /* CONFIG_HAVE_FAST_GUP */
  
-static u64 arch_perf_get_page_size(struct mm_struct *mm, unsigned long addr)

-{
-   return 0;
+   return size;
  }
  
-#endif

-
  static u64 perf_get_page_size(unsigned long addr)
  {
struct mm_struct *mm;
@@ -7109,7 +7081,7 @@ static u64 perf_get_page_size(unsigned l
	mm = &init_mm;
}
  
-	size = arch_perf_get_page_size(mm, addr);

+   

Re: [PATCH 1/3] perf/core: Flush PMU internal buffers for per-CPU events

2020-11-11 Thread Liang, Kan




On 11/11/2020 11:25 AM, Peter Zijlstra wrote:

On Mon, Nov 09, 2020 at 09:49:31AM -0500, Liang, Kan wrote:


- When the large PEBS was introduced (9c964efa4330), the sched_task() should
be invoked to flush the PEBS buffer in each context switch. However, The
perf_sched_events in account_event() is not updated accordingly. The
perf_event_task_sched_* never be invoked for a pure per-CPU context. Only
per-task event works.
At that time, the perf_pmu_sched_task() is outside of
perf_event_context_sched_in/out. It means that perf has to double
perf_pmu_disable() for per-task event.



- The patch 1 tries to fix broken per-CPU events. The CPU context cannot be
retrieved from the task->perf_event_ctxp. So it has to be tracked in the
sched_cb_list. Yes, the code is very similar to the original codes, but it
is actually the new code for per-CPU events. The optimization for per-task
events is still kept.
   For the case, which has both a CPU context and a task context, yes, the
__perf_pmu_sched_task() in this patch is not invoked. Because the
sched_task() only need to be invoked once in a context switch. The
sched_task() will be eventually invoked in the task context.


The thing is; your first two patches rely on PERF_ATTACH_SCHED_CB and
only set that for large pebs. Are you sure the other users (Intel LBR
and PowerPC BHRB) don't need it?


I didn't set it for LBR, because the perf_sched_events is always enabled 
for LBR. But, yes, we should explicitly set the PERF_ATTACH_SCHED_CB 
for LBR.


if (has_branch_stack(event))
inc = true;



If they indeed do not require the pmu::sched_task() callback for CPU
events, then I still think the whole perf_sched_cb_{inc,dec}() interface


No, LBR requires the pmu::sched_task() callback for CPU events.

Now, the LBR registers have to be reset on sched in even for CPU events.

To fix the shorter LBR callstack issue for CPU events, we also need to 
save/restore LBRs in pmu::sched_task().

https://lore.kernel.org/lkml/1578495789-95006-4-git-send-email-kan.li...@linux.intel.com/


is confusing at best.

Can't we do something like this instead?


I think the below patch may have two issues.
- PERF_ATTACH_SCHED_CB is required for LBR (maybe PowerPC BHRB as well) now.
- We may disable the large PEBS later if not all PEBS events support
large PEBS. The PMU needs a way to notify the generic code to decrease
the nr_sched_task.


The current related code in Intel PMU is as below.

static void
pebs_update_state(bool needed_cb, struct cpu_hw_events *cpuc,
  struct perf_event *event, bool add)
{
struct pmu *pmu = event->ctx->pmu;
/*
 * Make sure we get updated with the first PEBS
 * event. It will trigger also during removal, but
 * that does not hurt:
 */
bool update = cpuc->n_pebs == 1;

if (needed_cb != pebs_needs_sched_cb(cpuc)) {
if (!needed_cb)
perf_sched_cb_inc(pmu);
else
perf_sched_cb_dec(pmu);

update = true;
}

Thanks,
Kan


---
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 546cc89217bb..672d6f039fce 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3565,8 +3565,10 @@ static int intel_pmu_hw_config(struct perf_event *event)
if (!(event->attr.freq || (event->attr.wakeup_events && 
!event->attr.watermark))) {
event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD;
if (!(event->attr.sample_type &
- ~intel_pmu_large_pebs_flags(event)))
+ ~intel_pmu_large_pebs_flags(event))) {
event->hw.flags |= PERF_X86_EVENT_LARGE_PEBS;
+   event->attach_state |= PERF_ATTACH_SCHED_CB;
+   }
}
if (x86_pmu.pebs_aliases)
x86_pmu.pebs_aliases(event);
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 9a38f579bc76..af9ee539c179 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -606,6 +606,7 @@ struct swevent_hlist {
  #define PERF_ATTACH_TASK  0x04
  #define PERF_ATTACH_TASK_DATA 0x08
  #define PERF_ATTACH_ITRACE0x10
+#define PERF_ATTACH_SCHED_CB   0x20
  
  struct perf_cgroup;

  struct perf_buffer;
@@ -817,6 +818,7 @@ struct perf_event_context {
int is_active;
int nr_stat;
int nr_freq;
+   int nr_sched_task;
int rotate_disable;
/*
 * Set when nr_events != nr_active, except tolerant to events not
@@ -872,7 +874,7 @@ struct perf_cpu_context {
struct list_headcgrp_cpuc

Re: [PATCH V9 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE

2020-11-10 Thread Liang, Kan




On 11/4/2020 12:11 PM, Liang, Kan wrote:



On 10/13/2020 12:34 PM, Peter Zijlstra wrote:

Subject: perf,mm: Handle non-page-table-aligned hugetlbfs
From: Peter Zijlstra
Date: Fri, 9 Oct 2020 11:09:27 +0200

A limited nunmber of architectures support hugetlbfs sizes that do not
align with the page-tables (ARM64, Power, Sparc64). Add support for
this to the generic perf_get_page_size() implementation, and also
allow an architecture to override this implementation.

This latter is only needed when it uses non-page-table aligned huge
pages in its kernel map.

Signed-off-by: Peter Zijlstra (Intel)
---
  kernel/events/core.c |   39 +--
  1 file changed, 33 insertions(+), 6 deletions(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6996,10 +6996,18 @@ static u64 perf_virt_to_phys(u64 virt)
  #ifdef CONFIG_MMU
  /*
- * Return the MMU page size of a given virtual address
+ * Return the MMU page size of a given virtual address.
+ *
+ * This generic implementation handles page-table aligned huge pages, 
as well

+ * as non-page-table aligned hugetlbfs compound pages.
+ *
+ * If an architecture supports and uses non-page-table aligned pages 
in their
+ * kernel mapping it will need to provide it's own implementation of 
this

+ * function.
   */
-static u64 __perf_get_page_size(struct mm_struct *mm, unsigned long 
addr)
+__weak u64 arch_perf_get_page_size(struct mm_struct *mm, unsigned 
long addr)

  {
+    struct page *page;
  pgd_t *pgd;
  p4d_t *p4d;
  pud_t *pud;
@@ -7021,15 +7029,27 @@ static u64 __perf_get_page_size(struct m
  if (!pud_present(*pud))
  return 0;
-    if (pud_leaf(*pud))
+    if (pud_leaf(*pud)) {
+#ifdef pud_page
+    page = pud_page(*pud);
+    if (PageHuge(page))
+    return page_size(compound_head(page));
+#endif
  return 1ULL << PUD_SHIFT;
+    }
  pmd = pmd_offset(pud, addr);
  if (!pmd_present(*pmd))
  return 0;
-    if (pmd_leaf(*pmd))
+    if (pmd_leaf(*pmd)) {
+#ifdef pmd_page
+    page = pmd_page(*pmd);
+    if (PageHuge(page))
+    return page_size(compound_head(page));
+#endif
  return 1ULL << PMD_SHIFT;
+    }
  pte = pte_offset_map(pmd, addr);
  if (!pte_present(*pte)) {
@@ -7037,13 +7057,20 @@ static u64 __perf_get_page_size(struct m
  return 0;
  }
+    page = pte_page(*pte);
+    if (PageHuge(page)) {
+    u64 size = page_size(compound_head(page));
+    pte_unmap(pte);
+    return size;
+    }
+


The PageHuge() check for PTE crashes my machine when I run the page size
test. (Sorry, I didn't find the issue earlier. I just found some time to
re-run the page size test.)


It seems we don't need the check for PTE here. The size should always be
PAGE_SIZE, no? After I remove the check, everything looks good.




Hi Peter,

Any comments for this issue?

As I understand it, we don't need the PageHuge() check for a PTE page
here. However, I don't expect the page fault to crash here, because
perf already checks pte_present(*pte) before invoking PageHuge().


I'm not sure if it's triggered by other potential issues, so I'm asking
here rather than submitting a patch to simply remove the PageHuge() check.
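
In other words, the tail of the walker could simply mirror the quoted
code with the compound-page check dropped; a sketch of the simplification:

	pte = pte_offset_map(pmd, addr);
	if (!pte_present(*pte)) {
		pte_unmap(pte);
		return 0;
	}

	/* A present PTE maps a base page, so the size is always PAGE_SIZE. */
	pte_unmap(pte);
	return PAGE_SIZE;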


Thanks,
Kan


[  167.383120] BUG: unable to handle page fault for address: 
fca803fb8000

[  167.383121] #PF: supervisor read access in kernel mode
[  167.383121] #PF: error_code(0x) - not-present page
[  167.383121] PGD 4adfc8067 P4D 4adfc8067 PUD 4adfc7067 PMD 0
[  167.383122] Oops:  [#1] SMP NOPTI
[  167.383122] CPU: 0 PID: 2461 Comm: perf Not tainted 
5.10.0-rc1_page_size+ #54
[  167.383123] Hardware name: Intel Corporation Ice Lake Client 
Platform/IceLake U DDR4 SODIMM PD RVP TLC, BIOS 
ICLSFWR1.R00.3162.A00.1904162000 04/16/2019

[  167.383123] traps: PANIC: double fault, error_code: 0x0
[  167.383123] double fault:  [#2] SMP NOPTI
[  167.383123] CPU: 0 PID: 2461 Comm: perf Not tainted 
5.10.0-rc1_page_size+ #54
[  167.383124] Hardware name: Intel Corporation Ice Lake Client 
Platform/IceLake U DDR4 SODIMM PD RVP TLC, BIOS 
ICLSFWR1.R00.3162.A00.1904162000 04/16/2019

[  167.383124] RIP: 0010:__sprint_symbol+0xb/0x100
[  167.383124] Code: 85 e4 75 b9 48 85 c0 75 bc 49 89 d8 4c 89 e1 4c 89 
fa 4c 89 f6 4c 89 ef e8 42 70 04 00 eb a6 0f 1f 44 00 00 55 48 89 e5 41 
57 <41> 56 49 89 f6 41 55 41 89 cd 48 8d 4d b8 41 54 49 89 fc 53 48 63

[  167.383125] RSP: 0018:fe00b000 EFLAGS: 00010046
[  167.383125] RAX: 0053 RBX: 0a00 RCX: 
0001
[  167.383125] RDX:  RSI: 9f8a6176 RDI: 
fe00b029
[  167.383126] RBP: fe00b008 R08: a0bf8661 R09: 
0001
[  167.383126] R10: 000f R11: 9e641dc189c8 R12: 
9e641dc189c9
[  167.383126] R13: 9e641dc1a7e0 R14: 0a00ff05 R15: 
fe00b029
[  167.383126] FS:  () GS:9e641dc0() 
knlG

Re: [PATCH 1/3] perf/core: Flush PMU internal buffers for per-CPU events

2020-11-09 Thread Liang, Kan




On 11/9/2020 12:33 PM, Peter Zijlstra wrote:

On Mon, Nov 09, 2020 at 09:49:31AM -0500, Liang, Kan wrote:

Maybe we can frob x86_pmu_enable()...


Could you please elaborate?


Something horrible like this. It will detect the first time we enable
the PMU on a new task (IOW we did a context switch) and wipe the
counters when user RDPMC is on...



Oh, you mean the RDPMC patch. It should be doable, but I think
sched_task() may be a better place, especially with the new optimization
(patch 3). We can set the PERF_SCHED_CB_SW_IN bit for this case, so we
only do the check for per-task events on sched in.


It looks like the below patch has to unconditionally do the check (even 
for the non-RDPMC cases), which should be unnecessary.


Anyway, I think the RDPMC patch should depend on the implementation of 
the sched_task(). We may have further discussion when the design of 
sched_task() is finalized.



Thanks,
Kan


diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 77b963e5e70a..d862927baaef 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1289,6 +1289,15 @@ static void x86_pmu_enable(struct pmu *pmu)
perf_events_lapic_init();
}
  
+	if (cpuc->current != current) {

+   struct mm_struct *mm = current->mm;
+
+   cpuc->current = current;
+
+   if (mm && atomic_read(&mm->context.perf_rdpmc_allowed))
+   wipe_dirty_counters();
+   }
+
cpuc->enabled = 1;
barrier();
  
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h

index 7895cf4c59a7..d16118cb3bd0 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -248,6 +248,8 @@ struct cpu_hw_events {
unsigned inttxn_flags;
int is_fake;
  
+	void			*current;

+
/*
 * Intel DebugStore bits
 */
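
For completeness, a rough sketch of what the wipe_dirty_counters() helper
referenced in the diff above might do, assuming cpuc keeps a bitmap of the
counters touched since the last wipe. The 'dirty' field and the exact MSR
handling are assumptions, not merged code:

static void wipe_dirty_counters(void)
{
	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
	int i;

	for_each_set_bit(i, cpuc->dirty, X86_PMC_IDX_MAX) {
		if (i >= INTEL_PMC_IDX_FIXED)	/* fixed counters have their own MSR range */
			wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR0 + (i - INTEL_PMC_IDX_FIXED), 0);
		else				/* general purpose counters */
			wrmsrl(x86_pmu_event_addr(i), 0);
	}

	bitmap_zero(cpuc->dirty, X86_PMC_IDX_MAX);
}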



Re: [PATCH 1/3] perf/core: Flush PMU internal buffers for per-CPU events

2020-11-09 Thread Liang, Kan




On 11/9/2020 6:04 AM, Peter Zijlstra wrote:

On Mon, Nov 09, 2020 at 10:52:35AM +0100, Peter Zijlstra wrote:

On Fri, Nov 06, 2020 at 01:29:33PM -0800, kan.li...@linux.intel.com wrote:

From: Kan Liang 

Sometimes the PMU internal buffers have to be flushed for per-CPU events
during a context switch, e.g., large PEBS. Otherwise, the perf tool may
report samples in locations that do not belong to the process where the
samples are processed in, because PEBS does not tag samples with PID/TID.

The current code only flushes the buffers for a per-task event. It doesn't
check per-CPU events.

Add a new event state flag, PERF_ATTACH_SCHED_CB, to indicate that the
PMU internal buffers have to be flushed for this event during a context
switch.

Add sched_cb_entry and perf_sched_cb_usages back to track the PMU/cpuctx
which is required to be flushed.

Only need to invoke the sched_task() for per-CPU events in this patch.
The per-task events have been handled in perf_event_context_sched_in/out
already.

Fixes: 9c964efa4330 ("perf/x86/intel: Drain the PEBS buffer during context 
switches")


Are you sure? In part this patch looks like a revert of:

   44fae179ce73a26733d9e2d346da4e1a1cb94647
   556cccad389717d6eb4f5a24b45ff41cad3aaabf




The patch tries to fix a long-standing PEBS bug with per-CPU events. It's not
a revert. I think we still want to keep the optimization for per-task
events from 44fae179ce73 and 556cccad3897.


Here is the story.

- When large PEBS was introduced (9c964efa4330), sched_task() should be
invoked to flush the PEBS buffer on each context switch. However, the
perf_sched_events count in account_event() was not updated accordingly.
The perf_event_task_sched_* hooks are never invoked for a pure per-CPU
context; only per-task events work.
  At that time, the perf_pmu_sched_task() call was outside of
perf_event_context_sched_in/out, which means that perf had to double
perf_pmu_disable() for per-task events.



- The recent optimization (44fae179ce73 and 556cccad3897) only optimizes
the per-task case by moving the sched_task() call into
perf_event_context_sched_in/out, so perf only needs to invoke
perf_pmu_disable() once.
  But it doesn't fix the issue in a pure per-CPU context. The per-CPU
events are still broken.



- Patch 1 tries to fix the broken per-CPU events. The CPU context cannot
be retrieved from task->perf_event_ctxp, so it has to be tracked in the
sched_cb_list. Yes, the code is very similar to the original code, but it
is actually new code for per-CPU events. The optimization for per-task
events is still kept.
  For the case which has both a CPU context and a task context, yes, the
__perf_pmu_sched_task() in this patch is not invoked, because sched_task()
only needs to be invoked once per context switch. It will eventually be
invoked in the task context.



- The following patch 3 tries to optimize sched_task() further.
  - We handle per-CPU events and per-task events in different places. In
a pure per-task context, we don't need to go through the sched_cb_list;
we can get the information from task->perf_event_ctxp. We introduce a
flag to distinguish the two.
  - For most cases, we don't need to invoke sched_task() on both sched in
and sched out. We also introduce a flag for PMUs to avoid the unnecessary
calls.



All of the implementation and optimization is generic. For example, on
Power, the BHRB entries are reset on context switch in, so it can also
benefit from the optimization.
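
Roughly, the per-direction filtering boils down to something like the
sketch below. The sched_cb_state field and the PERF_SCHED_CB_* bits follow
the naming used in the RFC discussion elsewhere in this thread; this is a
sketch of the idea, not the merged code.

static void __perf_pmu_sched_task(struct perf_cpu_context *cpuctx, bool sched_in)
{
	struct pmu *pmu = cpuctx->ctx.pmu; /* software PMUs will not have sched_task */
	int needed = sched_in ? PERF_SCHED_CB_SW_IN : PERF_SCHED_CB_SW_OUT;

	/* Skip the callback (and the PMU stop/restart) if this direction was not requested. */
	if (!(cpuctx->sched_cb_state & needed))
		return;

	if (WARN_ON_ONCE(!pmu->sched_task))
		return;

	perf_pmu_disable(pmu);
	pmu->sched_task(cpuctx->task_ctx, sched_in);
	perf_pmu_enable(pmu);
}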




*groan*... I think I might've made a mistake with those two patches. I
assumed the whole cpuctx->task_ctx thing was relevant, it is not.

As per perf_sched_cb_{inc,dec}(struct pmu *), the thing we care about is
that the *PMU* gets a context switch callback, we don't give a crap
about the actual task context. Except that LBR code does, but I'm
thinking that started the whole confusion -- and I'm still not sure it's
actually correct either.

Now,.. how did we end up with the above two patches anyway... /me frobs
around in the inbox... Ah! that daft user RDPMC thing. I wanted to avoid
yet another pmu::method().

Hmm.. and the reason I proposed that change is because we'd end up with
the sched_cb for every context switch now, not just large-pebs and or
lbr crud. And this form avoids the double perf_pmu_disable() and all
that.


With the patch set,  I think the double perf_pmu_disable() is still avoided.



Maybe we can frob x86_pmu_enable()...


Could you please elaborate?

Thanks,
Kan



Re: [RFC 1/2] perf/core: Enable sched_task callbacks if PMU has it

2020-11-06 Thread Liang, Kan




On 11/5/2020 7:53 PM, Namhyung Kim wrote:

On Fri, Nov 6, 2020 at 4:01 AM Liang, Kan  wrote:



On 11/5/2020 10:45 AM, Namhyung Kim wrote:

Hello,

On Thu, Nov 5, 2020 at 11:47 PM Liang, Kan  wrote:



On 11/2/2020 9:52 AM, Namhyung Kim wrote:

If an event associated with a PMU which has a sched_task callback,
it should be called regardless of cpu/task context.  For example,


I don't think it's necessary. We should call it when we have to.
Otherwise, it just wastes cycles.
Shouldn't patch 2 be enough?

I'm not sure, without this patch __perf_event_task_sched_in/out
cannot be called for per-cpu events (w/o cgroups)  IMHO.
And I could not find any other place to check the
perf_sched_cb_usages.


Yes, it should be a bug for large PEBS, and it has always been there
since large PEBS was introduced. I just tried some older kernels
(before the recent change). Large PEBS is not flushed with per-cpu events.

But from your description, it looks like the issue was only found after
the recent change. Could you please double check whether the issue can also
be reproduced before the recent change?

Yep, actually Gabriel reported this problem on v4.4 kernel.


Thanks for the confirmation.

So large PEBS never works with per-cpu events. :(
I will send a new patch set to address the issue.

Thanks,
Kan


Re: [RFC 2/2] perf/core: Invoke pmu::sched_task callback for per-cpu events

2020-11-05 Thread Liang, Kan




On 11/5/2020 4:15 PM, Stephane Eranian wrote:

On Thu, Nov 5, 2020 at 11:40 AM Liang, Kan  wrote:




On 11/5/2020 10:54 AM, Namhyung Kim wrote:

-void perf_sched_cb_inc(struct pmu *pmu)
+void perf_sched_cb_inc(struct pmu *pmu, bool systemwide)
{
  struct perf_cpu_context *cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);

-   if (!cpuctx->sched_cb_usage++)
-   list_add(&cpuctx->sched_cb_entry, this_cpu_ptr(&sched_cb_list));
+   cpuctx->sched_cb_usage++;

-   this_cpu_inc(perf_sched_cb_usages);
+   if (systemwide) {
+   this_cpu_inc(perf_sched_cb_usages);
+   list_add(&cpuctx->sched_cb_entry, this_cpu_ptr(&sched_cb_list));

You need to check the value and make sure it's added only once.


Right, maybe we have to add a new variable for that.


Sure, I tend to agree here that we need a narrower scope trigger, only
when needed, i.e., an event
or PMU feature that requires ctxsw work. In fact, I am also interested
in splitting ctxswin and ctswout
callbacks. The reason is that you have overhead in the way the
callback is invoked. You may end up
calling the sched_task on ctxswout when only ctxwin is needed. In
doing that you pay the cost of
stopping/starting the PMU for possibly nothing. Stopping the PMU can
be expensive, like on AMD
where you need multiple wrmsr.

So splitting or adding a flag to convey that either CTXSW_IN or
CTXSW_OUT is needed would help.
I am suggesting this now given you are adding a flag.



Yes, adding a new flag would avoid the unnecessary PMU stop/restart for
both per-cpu and per-process events.


How about the patch as below?
(I just did a simple test. It should need another patch to split the callback.)

diff --git a/arch/powerpc/perf/core-book3s.c 
b/arch/powerpc/perf/core-book3s.c

index 6586f7e71cfb..e5837fdf82e3 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -370,6 +370,7 @@ static void power_pmu_bhrb_reset(void)
 static void power_pmu_bhrb_enable(struct perf_event *event)
 {
	struct cpu_hw_events *cpuhw = this_cpu_ptr(&cpu_hw_events);
+   int state = PERF_SCHED_CB_SW_IN;

if (!ppmu->bhrb_nr)
return;
@@ -380,7 +381,11 @@ static void power_pmu_bhrb_enable(struct perf_event 
*event)

cpuhw->bhrb_context = event->ctx;
}
cpuhw->bhrb_users++;
-   perf_sched_cb_inc(event->ctx->pmu);
+
+   if (!(event->attach_state & PERF_ATTACH_TASK))
+   state |= PERF_SCHED_CB_CPU;
+
+   perf_sched_cb_inc(event->ctx->pmu, state);
 }

 static void power_pmu_bhrb_disable(struct perf_event *event)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 444e5f061d04..f4f9d737ca85 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1019,12 +1019,16 @@ pebs_update_state(bool needed_cb, struct 
cpu_hw_events *cpuc,

 * that does not hurt:
 */
bool update = cpuc->n_pebs == 1;
+   int state = PERF_SCHED_CB_SW_OUT;

if (needed_cb != pebs_needs_sched_cb(cpuc)) {
+   if (!(event->attach_state & PERF_ATTACH_TASK))
+   state |= PERF_SCHED_CB_CPU;
+
if (!needed_cb)
-   perf_sched_cb_inc(pmu);
+   perf_sched_cb_inc(pmu, state);
else
-   perf_sched_cb_dec(pmu);
+   perf_sched_cb_dec(pmu, state);

update = true;
}
diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 8961653c5dd2..e4c500580df5 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -660,6 +660,7 @@ void intel_pmu_lbr_add(struct perf_event *event)
 {
struct kmem_cache *kmem_cache = event->pmu->task_ctx_cache;
	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+   int state = PERF_SCHED_CB_SW_IN;

if (!x86_pmu.lbr_nr)
return;
@@ -693,7 +694,13 @@ void intel_pmu_lbr_add(struct perf_event *event)
 */
if (x86_pmu.intel_cap.pebs_baseline && event->attr.precise_ip > 0)
cpuc->lbr_pebs_users++;
-   perf_sched_cb_inc(event->ctx->pmu);
+
+   if (!(event->attach_state & PERF_ATTACH_TASK))
+   state |= PERF_SCHED_CB_CPU;
+   if (event->attach_state & PERF_ATTACH_TASK_DATA)
+   state |= PERF_SCHED_CB_SW_OUT;
+
+   perf_sched_cb_inc(event->ctx->pmu, state);
if (!cpuc->lbr_users++ && !event->total_time_running)
intel_pmu_lbr_reset();

@@ -724,6 +731,7 @@ void release_lbr_buffers(void)
 void intel_pmu_lbr_del(struct perf_event *event)
 {
	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+   int state = 0;

if (!x86_pmu.lbr_nr)
return;
@@ -740,7 +748,10 @@ void intel_pmu_lbr_del(struct perf_event *event)
cpuc->lbr_use

Re: [RFC 2/2] perf/core: Invoke pmu::sched_task callback for per-cpu events

2020-11-05 Thread Liang, Kan




On 11/5/2020 10:54 AM, Namhyung Kim wrote:

-void perf_sched_cb_inc(struct pmu *pmu)
+void perf_sched_cb_inc(struct pmu *pmu, bool systemwide)
   {
 struct perf_cpu_context *cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);

-   if (!cpuctx->sched_cb_usage++)
-   list_add(&cpuctx->sched_cb_entry, this_cpu_ptr(&sched_cb_list));
+   cpuctx->sched_cb_usage++;

-   this_cpu_inc(perf_sched_cb_usages);
+   if (systemwide) {
+   this_cpu_inc(perf_sched_cb_usages);
+   list_add(&cpuctx->sched_cb_entry, this_cpu_ptr(&sched_cb_list));

You need to check the value and make sure it's added only once.


Right, maybe we have to add a new variable for that.


diff --git a/arch/powerpc/perf/core-book3s.c 
b/arch/powerpc/perf/core-book3s.c

index 6586f7e71cfb..63c9b87cab5e 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -380,7 +380,7 @@ static void power_pmu_bhrb_enable(struct perf_event 
*event)

cpuhw->bhrb_context = event->ctx;
}
cpuhw->bhrb_users++;
-   perf_sched_cb_inc(event->ctx->pmu);
+	perf_sched_cb_inc(event->ctx->pmu, !(event->attach_state & 
PERF_ATTACH_TASK));

 }

 static void power_pmu_bhrb_disable(struct perf_event *event)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 444e5f061d04..a34b90c7fa6d 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1022,9 +1022,9 @@ pebs_update_state(bool needed_cb, struct 
cpu_hw_events *cpuc,


if (needed_cb != pebs_needs_sched_cb(cpuc)) {
if (!needed_cb)
-   perf_sched_cb_inc(pmu);
+   perf_sched_cb_inc(pmu, !(event->attach_state & 
PERF_ATTACH_TASK));
else
-   perf_sched_cb_dec(pmu);
+   perf_sched_cb_dec(pmu, !(event->attach_state & 
PERF_ATTACH_TASK));

update = true;
}
diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 8961653c5dd2..8d4d02cde3d4 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -693,7 +693,7 @@ void intel_pmu_lbr_add(struct perf_event *event)
 */
if (x86_pmu.intel_cap.pebs_baseline && event->attr.precise_ip > 0)
cpuc->lbr_pebs_users++;
-   perf_sched_cb_inc(event->ctx->pmu);
+	perf_sched_cb_inc(event->ctx->pmu, !(event->attach_state & 
PERF_ATTACH_TASK));

if (!cpuc->lbr_users++ && !event->total_time_running)
intel_pmu_lbr_reset();

@@ -740,7 +740,7 @@ void intel_pmu_lbr_del(struct perf_event *event)
cpuc->lbr_users--;
WARN_ON_ONCE(cpuc->lbr_users < 0);
WARN_ON_ONCE(cpuc->lbr_pebs_users < 0);
-   perf_sched_cb_dec(event->ctx->pmu);
+	perf_sched_cb_dec(event->ctx->pmu, !(event->attach_state & 
PERF_ATTACH_TASK));

 }

 static inline bool vlbr_exclude_host(void)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index a1b91f2de264..14f936385cc8 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -875,6 +875,7 @@ struct perf_cpu_context {

	struct list_head		sched_cb_entry;
int sched_cb_usage;
+   int sched_cb_sw_usage;

int online;
/*
@@ -967,8 +968,8 @@ extern const struct perf_event_attr 
*perf_event_attrs(struct perf_event *event);

 extern void perf_event_print_debug(void);
 extern void perf_pmu_disable(struct pmu *pmu);
 extern void perf_pmu_enable(struct pmu *pmu);
-extern void perf_sched_cb_dec(struct pmu *pmu);
-extern void perf_sched_cb_inc(struct pmu *pmu);
+extern void perf_sched_cb_dec(struct pmu *pmu, bool systemwide);
+extern void perf_sched_cb_inc(struct pmu *pmu, bool systemwide);
 extern int perf_event_task_disable(void);
 extern int perf_event_task_enable(void);

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 66a9bd71f3da..af75859c9138 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3484,22 +3484,32 @@ static void perf_event_context_sched_out(struct 
task_struct *task, int ctxn,


 static DEFINE_PER_CPU(struct list_head, sched_cb_list);

-void perf_sched_cb_dec(struct pmu *pmu)
+void perf_sched_cb_dec(struct pmu *pmu, bool systemwide)
 {
struct perf_cpu_context *cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);

+   --cpuctx->sched_cb_usage;
+
+   if (!systemwide)
+   return;
+
this_cpu_dec(perf_sched_cb_usages);

-   if (!--cpuctx->sched_cb_usage)
+   if (!--cpuctx->sched_cb_sw_usage)
 list_del(&cpuctx->sched_cb_entry);
 }


-void perf_sched_cb_inc(struct pmu *pmu)
+void perf_sched_cb_inc(struct pmu *pmu, bool systemwide)
 {
struct perf_cpu_context *cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);

-   if (!cpuctx->sched_cb_usage++)
+   cpuctx->sched_cb_usage++;
+
+   if (!systemwide)
+   return;
+
+   if 

Re: [RFC 1/2] perf/core: Enable sched_task callbacks if PMU has it

2020-11-05 Thread Liang, Kan




On 11/5/2020 10:45 AM, Namhyung Kim wrote:

Hello,

On Thu, Nov 5, 2020 at 11:47 PM Liang, Kan  wrote:




On 11/2/2020 9:52 AM, Namhyung Kim wrote:

If an event associated with a PMU which has a sched_task callback,
it should be called regardless of cpu/task context.  For example,



I don't think it's necessary. We should call it when we have to.
Otherwise, it just wastes cycles.
Shouldn't patch 2 be enough?


I'm not sure, without this patch __perf_event_task_sched_in/out
cannot be called for per-cpu events (w/o cgroups)  IMHO.
And I could not find any other place to check the
perf_sched_cb_usages.



Yes, it should be a bug for large PEBS, and it has always been there
since large PEBS was introduced. I just tried some older kernels
(before the recent change). Large PEBS is not flushed with per-cpu events.


But from your description, it looks like the issue was only found after
the recent change. Could you please double check whether the issue can also
be reproduced before the recent change?




diff --git a/kernel/events/core.c b/kernel/events/core.c
index b458ed3dc81b..aaa0155c4142 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4696,6 +4696,8 @@ static void unaccount_event(struct perf_event *event)
   dec = true;
   if (has_branch_stack(event))
   dec = true;
+ if (event->pmu->sched_task)
+ dec = true;


I think sched_task is too big a hammer. The
__perf_event_task_sched_in/out path will be invoked even for non-PEBS
per-cpu events, which is not necessary.


Maybe we can introduce a flag to indicate the case. How about the patch 
as below?


diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index c79748f6921d..953a4bb98330 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3565,8 +3565,10 @@ static int intel_pmu_hw_config(struct perf_event 
*event)
 		if (!(event->attr.freq || (event->attr.wakeup_events && 
!event->attr.watermark))) {

event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD;
if (!(event->attr.sample_type &
- ~intel_pmu_large_pebs_flags(event)))
+ ~intel_pmu_large_pebs_flags(event))) {
event->hw.flags |= PERF_X86_EVENT_LARGE_PEBS;
+   event->attach_state |= PERF_ATTACH_SCHED_DATA;
+   }
}
if (x86_pmu.pebs_aliases)
x86_pmu.pebs_aliases(event);
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 0defb526cd0c..3eef7142aa11 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -606,6 +606,7 @@ struct swevent_hlist {
 #define PERF_ATTACH_TASK   0x04
 #define PERF_ATTACH_TASK_DATA  0x08
 #define PERF_ATTACH_ITRACE 0x10
+#define PERF_ATTACH_SCHED_DATA 0x20

 struct perf_cgroup;
 struct perf_buffer;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index dba4ea4e648b..a38133b5543a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4675,7 +4675,7 @@ static void unaccount_event(struct perf_event *event)
if (event->parent)
return;

-   if (event->attach_state & PERF_ATTACH_TASK)
+   if (event->attach_state & (PERF_ATTACH_TASK | PERF_ATTACH_SCHED_DATA))
dec = true;
if (event->attr.mmap || event->attr.mmap_data)
atomic_dec(&nr_mmap_events);
@@ -11204,7 +11204,7 @@ static void account_event(struct perf_event *event)
if (event->parent)
return;

-   if (event->attach_state & PERF_ATTACH_TASK)
+   if (event->attach_state & (PERF_ATTACH_TASK | PERF_ATTACH_SCHED_DATA))
inc = true;
if (event->attr.mmap || event->attr.mmap_data)
atomic_inc(&nr_mmap_events);

Thanks,
Kan


Re: [RFC 2/2] perf/core: Invoke pmu::sched_task callback for per-cpu events

2020-11-05 Thread Liang, Kan




On 11/2/2020 9:52 AM, Namhyung Kim wrote:

The commit 44fae179ce73 ("perf/core: Pull pmu::sched_task() into
perf_event_context_sched_out()") moved the pmu::sched_task callback to
be called for task event context.  But it missed to call it for
per-cpu events to flush PMU internal buffers (i.e. for PEBS, ...).

This commit basically reverts the commit but keeps the optimization
for task events and only add missing calls for cpu events.

Fixes: 44fae179ce73 ("perf/core: Pull pmu::sched_task() into 
perf_event_context_sched_out()") > Signed-off-by: Namhyung Kim 
---
  include/linux/perf_event.h |  1 +
  kernel/events/core.c   | 38 --
  2 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 0defb526cd0c..abb70a557cb5 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -872,6 +872,7 @@ struct perf_cpu_context {
struct list_headcgrp_cpuctx_entry;
  #endif
  
+	struct list_head		sched_cb_entry;

int sched_cb_usage;
  
  	intonline;

diff --git a/kernel/events/core.c b/kernel/events/core.c
index aaa0155c4142..c04d9a913537 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -383,6 +383,7 @@ static DEFINE_MUTEX(perf_sched_mutex);
  static atomic_t perf_sched_count;
  
  static DEFINE_PER_CPU(atomic_t, perf_cgroup_events);

+static DEFINE_PER_CPU(int, perf_sched_cb_usages);
  static DEFINE_PER_CPU(struct pmu_event_list, pmu_sb_events);
  
  static atomic_t nr_mmap_events __read_mostly;

@@ -3480,11 +3481,16 @@ static void perf_event_context_sched_out(struct 
task_struct *task, int ctxn,
}
  }
  
+static DEFINE_PER_CPU(struct list_head, sched_cb_list);

+
  void perf_sched_cb_dec(struct pmu *pmu)
  {
struct perf_cpu_context *cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
  
-	--cpuctx->sched_cb_usage;

+   this_cpu_dec(perf_sched_cb_usages);
+
+   if (!--cpuctx->sched_cb_usage)
+   list_del(&cpuctx->sched_cb_entry);
  }
  
  
@@ -3492,7 +3498,10 @@ void perf_sched_cb_inc(struct pmu *pmu)

  {
struct perf_cpu_context *cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
  
-	cpuctx->sched_cb_usage++;

+   if (!cpuctx->sched_cb_usage++)
+   list_add(&cpuctx->sched_cb_entry, this_cpu_ptr(&sched_cb_list));
+
+   this_cpu_inc(perf_sched_cb_usages);
  }
  
  /*

@@ -3521,6 +3530,24 @@ static void __perf_pmu_sched_task(struct 
perf_cpu_context *cpuctx, bool sched_in
perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
  }
  
+static void perf_pmu_sched_task(struct task_struct *prev,

+   struct task_struct *next,
+   bool sched_in)
+{
+   struct perf_cpu_context *cpuctx;
+
+   if (prev == next)
+   return;
+
+   list_for_each_entry(cpuctx, this_cpu_ptr(&sched_cb_list), 
sched_cb_entry) {
+   /* will be handled in perf_event_context_sched_in/out */
+   if (cpuctx->task_ctx)
+   continue;
+
+   __perf_pmu_sched_task(cpuctx, sched_in);
+   }
+}
+
  static void perf_event_switch(struct task_struct *task,
  struct task_struct *next_prev, bool sched_in);
  
@@ -3543,6 +3570,9 @@ void __perf_event_task_sched_out(struct task_struct *task,

  {
int ctxn;
  
+	if (__this_cpu_read(perf_sched_cb_usages))

+   perf_pmu_sched_task(task, next, false);
+
if (atomic_read(&nr_switch_events))
perf_event_switch(task, next, false);
  
@@ -3850,6 +3880,9 @@ void __perf_event_task_sched_in(struct task_struct *prev,
  
if (atomic_read(&nr_switch_events))

perf_event_switch(task, prev, true);
+
+   if (__this_cpu_read(perf_sched_cb_usages))
+   perf_pmu_sched_task(prev, task, true);
  }


It looks like the commit ("perf/core: Pull pmu::sched_task() into 
perf_event_context_sched_in()") is also reverted in the patch.


  
  static u64 perf_calculate_period(struct perf_event *event, u64 nsec, u64 count)

@@ -12999,6 +13032,7 @@ static void __init perf_event_init_all_cpus(void)
  #ifdef CONFIG_CGROUP_PERF
INIT_LIST_HEAD(&per_cpu(cgrp_cpuctx_list, cpu));
  #endif
+   INIT_LIST_HEAD(&per_cpu(sched_cb_list, cpu));
}
  }
  



Can we only update the perf_sched_cb_usages and sched_cb_list for 
per-cpu events, as in the patch below (not tested)?


If the user only uses per-process events, we don't need to go through the 
list.



diff --git a/arch/powerpc/perf/core-book3s.c 
b/arch/powerpc/perf/core-book3s.c

index 6586f7e71cfb..63c9b87cab5e 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -380,7 +380,7 @@ static void power_pmu_bhrb_enable(struct perf_event 
*event)

cpuhw->bhrb_context = event->ctx;
}
cpuhw->bhrb_users++;
-   perf_sched_cb_inc(event->ctx->pmu);
+	perf_sched_cb_inc(event->ctx->pmu, 
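
For reference, a minimal sketch of the idea above -- only system-wide
(per-cpu) events get added to the per-CPU sched_cb_list, while per-task
events keep going through the task context path -- could look like the
following. This is illustrative only, not the actual posted patch; the
'systemwide' parameter and the 'sched_cb_sw_usage' field follow the
fragment quoted earlier.

/*
 * Sketch only: account the callback user, but only put the cpuctx on the
 * per-CPU list (walked from __perf_event_task_sched_in/out) when the
 * event is system-wide.
 */
void perf_sched_cb_inc(struct pmu *pmu, bool systemwide)
{
	struct perf_cpu_context *cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);

	cpuctx->sched_cb_usage++;

	if (!systemwide)
		return;

	this_cpu_inc(perf_sched_cb_usages);

	if (!cpuctx->sched_cb_sw_usage++)
		list_add(&cpuctx->sched_cb_entry, this_cpu_ptr(&sched_cb_list));
}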

Re: [RFC 1/2] perf/core: Enable sched_task callbacks if PMU has it

2020-11-05 Thread Liang, Kan




On 11/2/2020 9:52 AM, Namhyung Kim wrote:

If an event associated with a PMU which has a sched_task callback,
it should be called regardless of cpu/task context.  For example,



I don't think it's necessary. We should call it when we have to. 
Otherwise, it just wastes cycles.

Shouldn't patch 2 be enough?

Thanks,
Kan


a per-cpu event might enable large PEBS buffers so it needs to flush
the buffer whenever task scheduling happens.
The underlying PMU may or may not require this for the given event,
but it will be handled in the pmu::sched_task() callback anyway.

Signed-off-by: Namhyung Kim 
---
  kernel/events/core.c | 4 
  1 file changed, 4 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index b458ed3dc81b..aaa0155c4142 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4696,6 +4696,8 @@ static void unaccount_event(struct perf_event *event)
dec = true;
if (has_branch_stack(event))
dec = true;
+   if (event->pmu->sched_task)
+   dec = true;
if (event->attr.ksymbol)
atomic_dec(&nr_ksymbol_events);
if (event->attr.bpf_event)
@@ -11225,6 +11227,8 @@ static void account_event(struct perf_event *event)
inc = true;
if (is_cgroup_event(event))
inc = true;
+   if (event->pmu->sched_task)
+   inc = true;
if (event->attr.ksymbol)
atomic_inc(&nr_ksymbol_events);
if (event->attr.bpf_event)



Re: [PATCH V9 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE

2020-11-04 Thread Liang, Kan




On 10/13/2020 12:34 PM, Peter Zijlstra wrote:

Subject: perf,mm: Handle non-page-table-aligned hugetlbfs
From: Peter Zijlstra
Date: Fri, 9 Oct 2020 11:09:27 +0200

A limited nunmber of architectures support hugetlbfs sizes that do not
align with the page-tables (ARM64, Power, Sparc64). Add support for
this to the generic perf_get_page_size() implementation, and also
allow an architecture to override this implementation.

This latter is only needed when it uses non-page-table aligned huge
pages in its kernel map.

Signed-off-by: Peter Zijlstra (Intel)
---
  kernel/events/core.c |   39 +--
  1 file changed, 33 insertions(+), 6 deletions(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6996,10 +6996,18 @@ static u64 perf_virt_to_phys(u64 virt)
  #ifdef CONFIG_MMU
  
  /*

- * Return the MMU page size of a given virtual address
+ * Return the MMU page size of a given virtual address.
+ *
+ * This generic implementation handles page-table aligned huge pages, as well
+ * as non-page-table aligned hugetlbfs compound pages.
+ *
+ * If an architecture supports and uses non-page-table aligned pages in their
+ * kernel mapping it will need to provide it's own implementation of this
+ * function.
   */
-static u64 __perf_get_page_size(struct mm_struct *mm, unsigned long addr)
+__weak u64 arch_perf_get_page_size(struct mm_struct *mm, unsigned long addr)
  {
+   struct page *page;
pgd_t *pgd;
p4d_t *p4d;
pud_t *pud;
@@ -7021,15 +7029,27 @@ static u64 __perf_get_page_size(struct m
if (!pud_present(*pud))
return 0;
  
-	if (pud_leaf(*pud))

+   if (pud_leaf(*pud)) {
+#ifdef pud_page
+   page = pud_page(*pud);
+   if (PageHuge(page))
+   return page_size(compound_head(page));
+#endif
return 1ULL << PUD_SHIFT;
+   }
  
  	pmd = pmd_offset(pud, addr);

if (!pmd_present(*pmd))
return 0;
  
-	if (pmd_leaf(*pmd))

+   if (pmd_leaf(*pmd)) {
+#ifdef pmd_page
+   page = pmd_page(*pmd);
+   if (PageHuge(page))
+   return page_size(compound_head(page));
+#endif
return 1ULL << PMD_SHIFT;
+   }
  
  	pte = pte_offset_map(pmd, addr);

if (!pte_present(*pte)) {
@@ -7037,13 +7057,20 @@ static u64 __perf_get_page_size(struct m
return 0;
}
  
+	page = pte_page(*pte);

+   if (PageHuge(page)) {
+   u64 size = page_size(compound_head(page));
+   pte_unmap(pte);
+   return size;
+   }
+


The PageHuge() check for PTE crashes my machine when I did page size 
test. (Sorry, I didn't find the issue earlier. I just found some time to 
re-run the page size test.)


It seems we don't need the check for the PTE here. The size should always be 
PAGE_SIZE, no? After I remove the check, everything looks good.
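
For reference, the PTE-level branch without the PageHuge() check would
simply be (sketch based on the walk quoted above, for illustration only):

	pte = pte_offset_map(pmd, addr);
	if (!pte_present(*pte)) {
		pte_unmap(pte);
		return 0;
	}

	pte_unmap(pte);
	return PAGE_SIZE;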


Thanks,
Kan

[  167.383120] BUG: unable to handle page fault for address: 
fca803fb8000

[  167.383121] #PF: supervisor read access in kernel mode
[  167.383121] #PF: error_code(0x) - not-present page
[  167.383121] PGD 4adfc8067 P4D 4adfc8067 PUD 4adfc7067 PMD 0
[  167.383122] Oops:  [#1] SMP NOPTI
[  167.383122] CPU: 0 PID: 2461 Comm: perf Not tainted 
5.10.0-rc1_page_size+ #54
[  167.383123] Hardware name: Intel Corporation Ice Lake Client 
Platform/IceLake U DDR4 SODIMM PD RVP TLC, BIOS 
ICLSFWR1.R00.3162.A00.1904162000 04/16/2019

[  167.383123] traps: PANIC: double fault, error_code: 0x0
[  167.383123] double fault:  [#2] SMP NOPTI
[  167.383123] CPU: 0 PID: 2461 Comm: perf Not tainted 
5.10.0-rc1_page_size+ #54
[  167.383124] Hardware name: Intel Corporation Ice Lake Client 
Platform/IceLake U DDR4 SODIMM PD RVP TLC, BIOS 
ICLSFWR1.R00.3162.A00.1904162000 04/16/2019

[  167.383124] RIP: 0010:__sprint_symbol+0xb/0x100
[  167.383124] Code: 85 e4 75 b9 48 85 c0 75 bc 49 89 d8 4c 89 e1 4c 89 
fa 4c 89 f6 4c 89 ef e8 42 70 04 00 eb a6 0f 1f 44 00 00 55 48 89 e5 41 
57 <41> 56 49 89 f6 41 55 41 89 cd 48 8d 4d b8 41 54 49 89 fc 53 48 63

[  167.383125] RSP: 0018:fe00b000 EFLAGS: 00010046
[  167.383125] RAX: 0053 RBX: 0a00 RCX: 
0001
[  167.383125] RDX:  RSI: 9f8a6176 RDI: 
fe00b029
[  167.383126] RBP: fe00b008 R08: a0bf8661 R09: 
0001
[  167.383126] R10: 000f R11: 9e641dc189c8 R12: 
9e641dc189c9
[  167.383126] R13: 9e641dc1a7e0 R14: 0a00ff05 R15: 
fe00b029
[  167.383126] FS:  () GS:9e641dc0() 
knlGS:

[  167.383127] CS:  0010 DS:  ES:  CR0: 80050033
[  167.383127] CR2: fe00aff8 CR3: 00014a0ba004 CR4: 
00770ef0

[  167.383127] PKRU: 5554
[  167.383127] Call Trace:
[  167.383127]  
[  167.383128]  sprint_symbol+0x15/0x20
[  167.383128]  symbol_string+0x49/0x90
[  

Re: [PATCH] perf/x86/intel: Add event constraint for CYCLE_ACTIVITY.STALLS_MEM_ANY

2020-10-19 Thread Liang, Kan




On 10/19/2020 12:09 PM, Andi Kleen wrote:

Reported-by: Andi Kleen
Signed-off-by: Kan Liang

I guess this should have a Fixes: tag and also be proposed for
stable.


I will send V2 shortly to update the tag and Cc.

Thanks,
Kan


Re: [PATCH V9 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE

2020-10-13 Thread Liang, Kan




On 10/12/2020 4:48 AM, Will Deacon wrote:

On Sat, Oct 10, 2020 at 12:28:39AM +1100, Michael Ellerman wrote:

Peter Zijlstra  writes:

Patch 4 makes it all far worse by exposing it to pretty much everybody.

Now, I think we can fix at least the user mappings with the below delta,
but if archs are using non-page-table MMU sizes we'll need arch helpers.

ARM64 is in that last boat.


I think we probably need it to be weak so we can provide our own
version.


I guess the same thing applies to us, but I can't really tell how accurate
this stuff needs to be for userspace. If it's trying to use the page-table
configuration to infer properties of the TLB, that's never going to be
reliable because the TLB can choose both to split and coalesce entries
as long as software can't tell.



Hi Peter,

It looks like everybody wants a __weak function. If so, I guess we 
should drop the generic code in this patch. For X86, we have existing 
functions to retrieve the page level and the page size. I think we don't 
need the generic code either.

https://lkml.kernel.org/r/1549648509-12704-2-git-send-email-kan.li...@linux.intel.com/

Should I send a V10 patch to drop the generic code and implement an X86 
specific perf_get_page_size()? Will, Michael, and others can implement 
their version later separately.


Thanks,
Kan


Re: [PATCH V9 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE

2020-10-09 Thread Liang, Kan




On 10/9/2020 5:09 AM, Peter Zijlstra wrote:

(we might not need the #ifdef gunk, but I've not yet dug out my cross
  compilers this morning)

---
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7009,6 +7009,7 @@ static u64 perf_virt_to_phys(u64 virt)
   */
  static u64 __perf_get_page_size(struct mm_struct *mm, unsigned long addr)
  {
+   struct page *page;
pgd_t *pgd;
p4d_t *p4d;
pud_t *pud;
@@ -7030,15 +7031,27 @@ static u64 __perf_get_page_size(struct m
if (!pud_present(*pud))
return 0;
  
-	if (pud_leaf(*pud))

+   if (pud_leaf(*pud)) {
+#ifdef pud_page
+   page = pud_page(*pud);
+   if (PageHuge(page))
+   return page_size(compound_head(page));


I think the page_size() returns the Kernel Page Size of a compound page.
What we want is the MMU page size.

If it's for the generic code, I think it should be a problem for X86.

Thanks,
Kan


+#endif
return 1ULL << PUD_SHIFT;
+   }
  
  	pmd = pmd_offset(pud, addr);

if (!pmd_present(*pmd))
return 0;
  
-	if (pmd_leaf(*pmd))

+   if (pmd_leaf(*pmd)) {
+#ifdef pmd_page
+   page = pmd_page(*pmd);
+   if (PageHuge(page))
+   return page_size(compound_head(page));
+#endif
return 1ULL << PMD_SHIFT;
+   }
  
  	pte = pte_offset_map(pmd, addr);

if (!pte_present(*pte)) {
@@ -7046,6 +7059,10 @@ static u64 __perf_get_page_size(struct m
return 0;
}
  
+	page = pte_page(*pte);

+   if (PageHuge(page))
+   return page_size(compound_head(page));
+
pte_unmap(pte);
return PAGE_SIZE;
  }


Re: [PATCH] perf/x86/intel: Fix n_metric for the canceled group

2020-10-02 Thread Liang, Kan




On 10/2/2020 7:02 AM, Peter Zijlstra wrote:

On Wed, Sep 30, 2020 at 07:29:35AM -0700, kan.li...@linux.intel.com wrote:

From: Kan Liang 

When a group that has TopDown members fails to be scheduled, any
later TopDown groups will not return valid values.

Here is an example.

A background perf that occupies all the GP counters and the fixed
counter 1.
  $perf stat -e "{cycles,cycles,cycles,cycles,cycles,cycles,cycles,
  cycles,cycles}:D" -a

A user monitors a TopDown group. It works well, because the fixed
counter 3 and the PERF_METRICS are available.
  $perf stat -x, --topdown -- ./workload
retiring,bad speculation,frontend bound,backend bound,
18.0,16.1,40.4,25.5,

Then the user tries to monitor a group that has TopDown members.
Because of the cycles event, the group fails to be scheduled.
  $perf stat -x, -e '{slots,topdown-retiring,topdown-be-bound,
  topdown-fe-bound,topdown-bad-spec,cycles}'
  -- ./workload
 ,,slots,0,0.00,,
 ,,topdown-retiring,0,0.00,,
 ,,topdown-be-bound,0,0.00,,
 ,,topdown-fe-bound,0,0.00,,
 ,,topdown-bad-spec,0,0.00,,
 ,,cycles,0,0.00,,

The user tries to monitor a TopDown group again. It doesn't work anymore.
  $perf stat -x, --topdown -- ./workload

 ,

In a txn, cancel_txn() is to truncate the event_list for a canceled
group and update the number of events added in this transaction.
However, the number of TopDown events added in this transaction is not
updated. The kernel will probably fail to add new Topdown events.

Check if the canceled group has Topdown events. If so, subtract the
TopDown events from n_metric accordingly.

Fixes: 7b2c05a15d29 ("perf/x86/intel: Generic support for hardware TopDown 
metrics")
Reported-by: Andi Kleen 
Reviewed-by: Andi Kleen 
Signed-off-by: Kan Liang 
---
  arch/x86/events/core.c | 10 ++
  1 file changed, 10 insertions(+)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 0f3d01562ded..4cb3ccbe2d62 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2017,6 +2017,7 @@ static void x86_pmu_cancel_txn(struct pmu *pmu)
  {
unsigned int txn_flags;
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+   int i;
  
  	WARN_ON_ONCE(!cpuc->txn_flags);	/* no txn in flight */
  
@@ -2031,6 +2032,15 @@ static void x86_pmu_cancel_txn(struct pmu *pmu)

 */
__this_cpu_sub(cpu_hw_events.n_added, 
__this_cpu_read(cpu_hw_events.n_txn));
__this_cpu_sub(cpu_hw_events.n_events, 
__this_cpu_read(cpu_hw_events.n_txn));
+
+   /* Subtract Topdown events in the canceled group from n_metric */
+   if (x86_pmu.intel_cap.perf_metrics && cpuc->n_metric) {
+   for (i = 0; i < cpuc->n_txn; i++) {
+   if (is_metric_event(cpuc->event_list[i + 
cpuc->n_events]))
+   __this_cpu_dec(cpu_hw_events.n_metric);
+   }
+   WARN_ON_ONCE(__this_cpu_read(cpu_hw_events.n_metric) < 0);
+   }
perf_pmu_enable(pmu);
  }



Urgh, I'd much rather we add n_txn_metric. But also, while looking at
this, don't we have the same problem with n_pair ?

Something like this perhaps...



Sure. For the perf metric, the below patch fixes the problem.

Tested-by: Kan Liang 

Thanks,
Kan


---
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 757e49755e7c..9b7792c0b6fb 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1066,6 +1066,7 @@ static int add_nr_metric_event(struct cpu_hw_events *cpuc,
if (cpuc->n_metric == INTEL_TD_METRIC_NUM)
return -EINVAL;
cpuc->n_metric++;
+   cpuc->n_txn_metric++;
}
  
  	return 0;

@@ -1089,8 +1090,10 @@ static int collect_event(struct cpu_hw_events *cpuc, 
struct perf_event *event,
return -EINVAL;
  
  	cpuc->event_list[n] = event;

-   if (is_counter_pair(&event->hw))
+   if (is_counter_pair(&event->hw)) {
cpuc->n_pair++;
+   cpuc->n_txn_pair++;
+   }
  
  	return 0;

  }
@@ -2062,6 +2065,8 @@ static void x86_pmu_start_txn(struct pmu *pmu, unsigned 
int txn_flags)
  
  	perf_pmu_disable(pmu);

__this_cpu_write(cpu_hw_events.n_txn, 0);
+   __this_cpu_write(cpu_hw_events.n_txn_metric, 0);
+   __this_cpu_write(cpu_hw_events.n_txn_pair, 0);
  }
  
  /*

@@ -2087,6 +2092,8 @@ static void x86_pmu_cancel_txn(struct pmu *pmu)
 */
__this_cpu_sub(cpu_hw_events.n_added, 
__this_cpu_read(cpu_hw_events.n_txn));
__this_cpu_sub(cpu_hw_events.n_events, 
__this_cpu_read(cpu_hw_events.n_txn));
+   __this_cpu_sub(cpu_hw_events.n_metric, 
__this_cpu_read(cpu_hw_events.n_txn_metric));
+   __this_cpu_sub(cpu_hw_events.n_pair, 
__this_cpu_read(cpu_hw_events.n_txn_pair));
perf_pmu_enable(pmu);
  }
  
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h

index 

Re: [PATCH V8 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE

2020-10-01 Thread Liang, Kan




On 9/30/2020 6:45 PM, Stephane Eranian wrote:

On Wed, Sep 30, 2020 at 10:30 AM Peter Zijlstra  wrote:


On Wed, Sep 30, 2020 at 07:48:48AM -0700, Dave Hansen wrote:

On 9/30/20 7:42 AM, Liang, Kan wrote:

When I tested on my kernel, it panicked because I suspect
current->active_mm could be NULL. Adding a check for NULL avoided the
problem. But I suspect this is not the correct solution.


I guess the NULL active_mm should be a rare case. If so, I think it's
not bad to add a check and return 0 page size.


I think it would be best to understand why ->active_mm is NULL instead
of just papering over the problem.  If it is papered over, and this is
common, you might end up effectively turning off your shiny new feature
inadvertently.


context_switch() can set prev->active_mm to NULL when it transfers it to
@next. It does this before @current is updated. So an NMI that comes in
between this active_mm swizzling and updating @current will see
!active_mm.


I think Peter is right. This code is called in the context of NMI, so
if active_mm is set to NULL inside
a locked section, this is not enough to protect the perf_events code
from seeing it.


In general though; I think using ->active_mm is a mistake though. That
code should be doing something like:


 mm = current->mm;
 if (!mm)
 mm = &init_mm;




I will send a V9 with the changes Peter suggests.
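
For reference, a minimal sketch of that direction (illustrative only; the
posted V9 may differ in details):

static u64 perf_get_page_size(unsigned long addr)
{
	struct mm_struct *mm;
	unsigned long flags;
	u64 size;

	if (!addr)
		return 0;

	/*
	 * Software page-table walkers must disable IRQs,
	 * which prevents any tear down of the page tables.
	 */
	local_irq_save(flags);

	mm = current->mm;
	if (!mm) {
		/*
		 * For kernel threads and the like, use init_mm so that
		 * we can find kernel memory.
		 */
		mm = &init_mm;
	}

	size = __perf_get_page_size(mm, addr);

	local_irq_restore(flags);

	return size;
}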

Thanks,
Kan


Re: [PATCH V8 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE

2020-09-30 Thread Liang, Kan




On 9/30/2020 3:15 AM, Stephane Eranian wrote:

+static u64 perf_get_page_size(unsigned long addr)
+{
+   unsigned long flags;
+   u64 size;
+
+   if (!addr)
+   return 0;
+
+   /*
+* Software page-table walkers must disable IRQs,
+* which prevents any tear down of the page tables.
+*/
+   local_irq_save(flags);
+
+   size = __perf_get_page_size(current->active_mm, addr);
+

When I tested on my kernel, it panicked because I suspect
current->active_mm could be NULL. Adding a check for NULL avoided the
problem. But I suspect this is not the correct solution.



I guess the NULL active_mm should be a rare case. If so, I think it's 
not bad to add a check and return 0 page size.



Thanks,
Kan


Re: [RESEND PATCH V2 5/6] perf/x86/intel/uncore: Generic support for the PCI sub driver

2020-09-22 Thread Liang, Kan




On 9/21/2020 6:19 PM, Bjorn Helgaas wrote:

On Mon, Sep 14, 2020 at 07:34:19AM -0700, kan.li...@linux.intel.com wrote:

From: Kan Liang 

Some uncore counters may be located in the configuration space of a PCI
device, which already has a bonded driver. Currently, the uncore driver
cannot register a PCI uncore PMU for these counters, because, to
register a PCI uncore PMU, the uncore driver must be bond to the device.
However, one device can only have one bonded driver.

Add an uncore PCI sub driver to support such kind of devices.

The sub driver doesn't own the device. In initialization, the sub
driver searches the device via pci_get_device(), and register the
corresponding PMU for the device. In the meantime, the sub driver
registeris a PCI bus notifier, which is used to notify the sub driver
once the device is removed. The sub driver can unregister the PMU
accordingly.


s/registeris/registers/

It looks like this only handles hot-remove of the device, not hot-add.
Maybe that's OK for your use case, I dunno, so just pointing it out.



Hi Bjorn,

Thanks for the review.

Yes, the patch only supports the hot-remove for now, because
- I didn't get any requests for the hot-add support. I doubt anyone 
hot-adds a Root Port device in practice.
- To be honest, I don't have a test case to verify the hot-add of a Root 
Port device.

  I used the command line below to test the hot-remove.
  sudo bash -c 'echo 1 > /sys/bus/pci/devices/0000\:14\:04.0/remove'
  Then the device is gone. I have no idea how to add it back.

So I only implemented the hot-remove in this patch. If anyone requires 
the hot-add, I can add it later separately.
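
For reference, the remove notification is just the standard bus-notifier
pattern. A sketch (illustrative only; the helper name and signature below
are hypothetical, not the actual uncore code):

static int uncore_pci_sub_bus_notify(struct notifier_block *nb,
				     unsigned long action, void *data)
{
	struct pci_dev *pdev = to_pci_dev(data);

	if (action != BUS_NOTIFY_DEL_DEVICE)
		return NOTIFY_DONE;

	/* look up the PMU registered for this device and tear it down */
	uncore_pci_pmu_unregister(pdev);	/* hypothetical signature */

	return NOTIFY_OK;
}

static struct notifier_block uncore_pci_sub_notifier = {
	.notifier_call = uncore_pci_sub_bus_notify,
};

/* at init time: bus_register_notifier(&pci_bus_type, &uncore_pci_sub_notifier); */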


Thanks,
Kan


Re: [PATCH V7 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE

2020-09-17 Thread Liang, Kan




On 9/17/2020 6:02 PM, Dave Hansen wrote:

The problem is that the warning from the perf tool usually includes some
hints regarding the cause of the warning or possible solution to
workaround/fix the warning. What message should we deliver to the users?
"Warning: Too many error page size. Address space isolation feature may
be enabled, please check"?

That's not particularly actionable for an end user.  Do any of the perf
errors just tell them to submit a bug report?  "Excessive number of page
size lookup errors, please report to..."


Yes, I found some perf tool warnings that end with "Consider reporting to 
linux-kernel@vger.kernel.org."


I will add the warning "Excessive number of page size lookup errors, 
please report to linux-kernel@vger.kernel.org." if 90% of the samples 
have a page size of 0.
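
A minimal user-space sketch of that heuristic (illustrative only; the
counters and the exact hook into the perf tool are hypothetical):

#include <stdio.h>

/* Track how many samples failed to resolve a data page size. */
static unsigned long long nr_samples, nr_zero_page_size;

static void account_data_page_size(unsigned long long page_size)
{
	nr_samples++;
	if (!page_size)
		nr_zero_page_size++;
}

static void warn_on_page_size_errors(void)
{
	/* warn when 90% or more of the samples have a page size of 0 */
	if (nr_samples && nr_zero_page_size * 10 >= nr_samples * 9)
		fprintf(stderr,
			"Excessive number of page size lookup errors, "
			"please report to linux-kernel@vger.kernel.org.\n");
}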


Thanks,
Kan


Re: [PATCH V7 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE

2020-09-17 Thread Liang, Kan




On 9/17/2020 5:24 PM, Dave Hansen wrote:

On 9/17/20 2:16 PM, Liang, Kan wrote:

One last concern as I look at this: I wish it was a bit more
future-proof.  There are lots of weird things folks are trying to do
with the page tables, like Address Space Isolation.  For instance, if
you get a perf NMI when running userspace, current->mm->pgd is
*different* than the PGD that was in use when userspace was running.
It's close enough today, but it might not stay that way.  But I can't
think of any great ways to future proof this code, other than spitting
out an error message if too many of the page table walks fail when they
shouldn't.



If the page table walk fails, a page size of 0 will be returned. So the worst case
is that the page size is not available to users, which is not a fatal
error.


Right, it's not a fatal error.  It will just more or less silently break
this feature.


If my understanding is correct, when the above case happens, there is
nothing we can do for now (because we have no idea what it will become),
except disabling the page size support and throwing an error/warning.

 From the user's perspective, throwing an error message or marking the
page size unavailable should be the same. I think we may leave the code
as-is. We can fix the future case later separately.


The only thing I can think of is to record the number of consecutive
page walk errors without a success.  Occasional failures are OK and
expected, such as if reclaim zeroes a PTE and a later perf event goes
and looks at it.  But a *LOT* of consecutive errors indicates a real
problem somewhere.

Maybe if you have 10,000 or 1,000,000 successive walk failures, you do a
WARN_ON_ONCE().


The user-space perf tool looks like a better place for this kind of 
warning. The perf tool knows the total number of samples. It also 
knows the number of samples with a page size of 0. We can set a threshold, 
e.g., 90%. If 90% of the samples have a page size of 0, the perf tool will 
print a warning message.


The problem is that the warning from the perf tool usually includes some 
hints regarding the cause of the warning or possible solution to 
workaround/fix the warning. What message should we deliver to the users?
"Warning: Too many error page size. Address space isolation feature may 
be enabled, please check"?



Thanks,
Kan



Re: [PATCH V7 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE

2020-09-17 Thread Liang, Kan




On 9/17/2020 3:00 PM, Dave Hansen wrote:

On 9/17/20 6:52 AM, kan.li...@linux.intel.com wrote:

+   mm = current->mm;
+   if (!mm) {
+   /*
+* For kernel threads and the like, use init_mm so that
+* we can find kernel memory.
+*/
+   mm = &init_mm;
+   }


I think it might be better to use current->active_mm instead of
current->mm.  Kernel threads can "take over" the mm of the threads that
switched to them, so they're not actually using all of the page tables
from the init_mm all the time.

It's not _common_, though.  The only time that I think they can diverge
is when vmalloc PGD sync'ing needs to be done, and there's even an
effort to remove that.

But, it's probably more more precise to just use ->active_mm since
that's what will actually be more consistent with the values loaded into
CR3.

I _think_ ->active_mm is always non-NULL, too.



Thanks. Yes, active_mm looks better here. I will use active_mm to 
replace the mm and init_mm.



One last concern as I look at this: I wish it was a bit more
future-proof.  There are lots of weird things folks are trying to do
with the page tables, like Address Space Isolation.  For instance, if
you get a perf NMI when running userspace, current->mm->pgd is
*different* than the PGD that was in use when userspace was running.
It's close enough today, but it might not stay that way.  But I can't
think of any great ways to future proof this code, other than spitting
out an error message if too many of the page table walks fail when they
shouldn't.



If the page table walk fails, a page size of 0 will be returned. So the worst case 
is that the page size is not available to users, which is not a fatal 
error.


If my understanding is correct, when the above case happens, there is 
nothing we can do for now (because we have no idea what it will become), 
except disabling the page size support and throwing an error/warning.


From the user's perspective, throwing an error message or marking the 
page size unavailable should be the same. I think we may leave the code 
as-is. We can fix the future case later separately.


Thanks,
Kan


Re: [PATCH V2 3/4] perf stat: Support new per thread TopDown metrics

2020-09-11 Thread Liang, Kan




On 9/10/2020 11:37 PM, Namhyung Kim wrote:

Hello,

On Thu, Sep 10, 2020 at 10:48 PM  wrote:


From: Andi Kleen 

Icelake has support for reporting per thread TopDown metrics.
These are reported differently than the previous TopDown support,
each metric is standalone, but scaled to pipeline "slots".
We don't need to do anything special for HyperThreading anymore.
Teach perf stat --topdown to handle these new metrics and
print them in the same way as the previous TopDown metrics.
The restrictions of only being able to report information per core is
gone.

Acked-by: Jiri Olsa 
Co-developed-by: Kan Liang 
Signed-off-by: Kan Liang 
Signed-off-by: Andi Kleen 
---
  tools/perf/Documentation/perf-stat.txt |  7 +-
  tools/perf/builtin-stat.c  | 30 -
  tools/perf/util/stat-shadow.c  | 89 ++
  tools/perf/util/stat.c |  4 ++
  tools/perf/util/stat.h |  8 +++
  5 files changed, 134 insertions(+), 4 deletions(-)

diff --git a/tools/perf/Documentation/perf-stat.txt 
b/tools/perf/Documentation/perf-stat.txt
index c9bfefc051fb..e803dbdc88a8 100644
--- a/tools/perf/Documentation/perf-stat.txt
+++ b/tools/perf/Documentation/perf-stat.txt
@@ -357,6 +357,11 @@ if the workload is actually bound by the CPU and not by 
something else.
  For best results it is usually a good idea to use it with interval
  mode like -I 1000, as the bottleneck of workloads can change often.

+This enables --metric-only, unless overridden with --no-metric-only.
+
+The following restrictions only apply to older Intel CPUs and Atom,
+on newer CPUs (IceLake and later) TopDown can be collected for any thread:
+
  The top down metrics are collected per core instead of per
  CPU thread. Per core mode is automatically enabled
  and -a (global monitoring) is needed, requiring root rights or
@@ -368,8 +373,6 @@ echo 0 > /proc/sys/kernel/nmi_watchdog
  for best results. Otherwise the bottlenecks may be inconsistent
  on workload with changing phases.

-This enables --metric-only, unless overridden with --no-metric-only.
-
  To interpret the results it is usually needed to know on which
  CPUs the workload runs on. If needed the CPUs can be forced using
  taskset.
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 5583e22ca808..6290da5bd142 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -128,6 +128,15 @@ static const char * topdown_attrs[] = {
 NULL,
  };

+static const char *topdown_metric_attrs[] = {
+   "slots",
+   "topdown-retiring",
+   "topdown-bad-spec",
+   "topdown-fe-bound",
+   "topdown-be-bound",
+   NULL,
+};
+
  static const char *smi_cost_attrs = {
 "{"
 "msr/aperf/,"
@@ -1691,6 +1700,24 @@ static int add_default_attributes(void)
 char *str = NULL;
 bool warn = false;

+   if (!force_metric_only)
+   stat_config.metric_only = true;
+
+   if (topdown_filter_events(topdown_metric_attrs, &str, 1) < 0) {
+   pr_err("Out of memory\n");
+   return -1;
+   }
+   if (topdown_metric_attrs[0] && str) {
+   if (!stat_config.interval && !stat_config.metric_only) {
+   fprintf(stat_config.output,
+   "Topdown accuracy may decrease when 
measuring long periods.\n"
+   "Please print the result regularly, e.g. 
-I1000\n");
+   }
+   goto setup_metrics;
+   }
+
+   str = NULL;


zfree() ?


Yes, even if the topdown events don't exist, str is still allocated.
The str should be freed.

I will send V3 to fix it shortly.
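
For reference, the fix can be as small as freeing the string instead of
just clearing the pointer (sketch only, assuming the tools' zfree() helper
is available here):

	if (topdown_metric_attrs[0] && str) {
		/* ... print the accuracy note and ... */
		goto setup_metrics;
	}

	zfree(&str);	/* free the string from topdown_filter_events() */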

Thanks,
Kan



Thanks
Namhyung



+
 if (stat_config.aggr_mode != AGGR_GLOBAL &&
 stat_config.aggr_mode != AGGR_CORE) {
 pr_err("top down event configuration requires --per-core 
mode\n");
@@ -1702,8 +1729,6 @@ static int add_default_attributes(void)
 return -1;
 }

-   if (!force_metric_only)
-   stat_config.metric_only = true;
if (topdown_filter_events(topdown_attrs, &str,
 arch_topdown_check_group()) < 0) {
 pr_err("Out of memory\n");
@@ -1712,6 +1737,7 @@ static int add_default_attributes(void)
 if (topdown_attrs[0] && str) {
 if (warn)
 arch_topdown_group_warn();
+setup_metrics:
err = parse_events(evsel_list, str, &errinfo);
 if (err) {
 fprintf(stderr,


Re: [PATCH V2 3/3] perf/x86: Reset the dirty counter to prevent the leak for an RDPMC task

2020-09-09 Thread Liang, Kan




On 9/8/2020 11:58 AM, pet...@infradead.org wrote:
> On Mon, Sep 07, 2020 at 06:01:15PM +0200, pet...@infradead.org wrote:
>> On Fri, Aug 21, 2020 at 12:57:54PM -0700, kan.li...@linux.intel.com
>>> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
>>> index 0f3d01562ded..fa08d810dcd2 100644
>>> --- a/arch/x86/events/core.c
>>> +++ b/arch/x86/events/core.c
>>> @@ -1440,7 +1440,10 @@ static void x86_pmu_start(struct perf_event
>>> *event, int flags)
>>>
>>>cpuc->events[idx] = event;
>>>__set_bit(idx, cpuc->active_mask);
>>> -  __set_bit(idx, cpuc->running);
>>> +  /* The cpuc->running is only used by the P4 PMU */
>>> +  if (!cpu_has(_cpu_data, X86_FEATURE_ARCH_PERFMON) &&
>>> +  (boot_cpu_data.x86 == 0xf))
>>> +  __set_bit(idx, cpuc->running);
>>>x86_pmu.enable(event);
>>>perf_event_update_userpage(event);
>>>   }
>>
>> Yuck! Use a static_branch() or something. This is a gnarly nest of
>> code
>> that runs 99.9% of the time to conclude a negative. IOW a complete
>> waste
>> of cycles for everybody not running a P4 space heater.
>
> Better still, move it into p4_pmu_enable_event().
>

Sure, I will move the "cpuc->running" handling into the P4-specific code.
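
For reference, a sketch of that direction (illustrative only; the real V3
change may differ):

/* p4.c: mark the counter as running from the P4-specific enable path */
static void p4_pmu_enable_event(struct perf_event *event)
{
	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);

	__set_bit(event->hw.idx, cpuc->running);

	/* ... existing ESCR/CCCR programming of the event follows ... */
}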

On 9/7/2020 12:01 PM, pet...@infradead.org wrote:

@@ -1544,6 +1547,9 @@ static void x86_pmu_del(struct perf_event *event, int 
flags)
if (cpuc->txn_flags & PERF_PMU_TXN_ADD)
goto do_del;
  
+	if (READ_ONCE(x86_pmu.attr_rdpmc) && x86_pmu.sched_task &&

+   test_bit(event->hw.idx, cpuc->active_mask))
+   __set_bit(event->hw.idx, cpuc->dirty);


And that too seems like an overly complicated set of tests and branches.
This should be effectivly true for the 99% common case.


I made a mistake in the V2 version regarding whether the P4 PMU supports 
RDPMC. The SDM explicitly documents that the RDPMC instruction was 
introduced in P6. I once thought P4 was older than P6, but I'm wrong. P4 
has the NetBurst microarchitecture, which is the successor to P6.

So P4 should also support the RDPMC instruction.

We should not share the memory space between the 'dirty' and the 
'running' variables. I will modify it in V3.


I will also unconditionally set cpuc->dirty here. The worst case for an 
RDPMC task is that we may have to clear all counters for the first time 
in x86_pmu_event_mapped. After that, the sched_task() will clear/update 
the 'dirty'. Only the real 'dirty' counters are cleared. For a non-RDPMC 
task, it's harmless to unconditionally set cpuc->dirty.
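
In other words, the V3 direction would look roughly like the sketch below
(names as in the quoted patch; illustrative only):

	/* x86_pmu_del(): unconditionally mark the counter as dirty */
	__set_bit(event->hw.idx, cpuc->dirty);

	/*
	 * x86_pmu_event_mapped(): for an RDPMC task (counting or sampling),
	 * enable sched_task() and clear whatever is already dirty.
	 */
	if (x86_pmu.sched_task && event->hw.target) {
		perf_sched_cb_inc(event->ctx->pmu);
		x86_pmu_clear_dirty_counters();
	}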




  static void x86_pmu_event_mapped(struct perf_event *event, struct mm_struct 
*mm)
  {
if (!(event->hw.flags & PERF_X86_EVENT_RDPMC_ALLOWED))
return;
  
+	/*

+* Enable sched_task() for the RDPMC task,
+* and clear the existing dirty counters.
+*/
+   if (x86_pmu.sched_task && event->hw.target && 
!is_sampling_event(event)) {
+   perf_sched_cb_inc(event->ctx->pmu);
+   x86_pmu_clear_dirty_counters();
+   }


I'm failing to see the correctness of the !is_sampling_event() part
there.


It looks like an RDPMC task can do sampling as well. I once thought it 
only did counting. I will fix it in V3.






/*
 * This function relies on not being called concurrently in two
 * tasks in the same mm.  Otherwise one task could observe
@@ -2246,6 +2286,9 @@ static void x86_pmu_event_unmapped(struct perf_event 
*event, struct mm_struct *m
if (!(event->hw.flags & PERF_X86_EVENT_RDPMC_ALLOWED))
return;
  
+	if (x86_pmu.sched_task && event->hw.target && !is_sampling_event(event))

+   perf_sched_cb_dec(event->ctx->pmu);
+


Idem.


if (atomic_dec_and_test(&mm->context.perf_rdpmc_allowed))
on_each_cpu_mask(mm_cpumask(mm), cr4_update_pce, NULL, 1);
  }
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index c72e4904e056..e67713bfa33a 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4166,11 +4166,39 @@ static void intel_pmu_cpu_dead(int cpu)
intel_cpuc_finish(&per_cpu(cpu_hw_events, cpu));
  }
  
+static void intel_pmu_rdpmc_sched_task(struct perf_event_context *ctx)

+{
+   struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+   struct perf_event *event;
+
+   if (bitmap_empty(cpuc->dirty, X86_PMC_IDX_MAX))
+   return;
+
+   /*
+* If the new task has the RDPMC enabled, clear the dirty counters to
+* prevent the potential leak. If the new task doesn't have the RDPMC
+* enabled, do nothing.
+*/
+   list_for_each_entry(event, &ctx->event_list, event_entry) {
+   if (event->hw.target &&
+   (event->hw.flags & PERF_X86_EVENT_RDPMC_ALLOWED) &&
+   !is_sampling_event(event) &&
atomic_read(&event->mmap_count))
+   break;
+   }
+   if (&event->event_entry == &ctx->event_list)
+   return;


That's 

Re: [PATCH V2 0/6] Support PCIe3 uncore PMU on Snow Ridge

2020-08-31 Thread Liang, Kan

Hi Peter,

Could you please share the comments for this patch set?

Thanks,
Kan

On 8/5/2020 4:24 PM, kan.li...@linux.intel.com wrote:

From: Kan Liang 

Changes since V1:
- Drop the platform device solution
- A new uncore PCI sub driver solution is introduced which searches
   the PCIe Root Port device via pci_get_device() and id table.
   Register a PCI bus notifier for the remove notification. Once the
   device is removed, the uncore driver can be notified to unregister
   the corresponding PMU.
- Modify the parameters of uncore_pci_pmu_unregister() function.

The Snow Ridge integrated PCIe3 uncore performance monitoring unit (PMU)
can be used to collect the performance data, e.g., the utilization
between the PCIe devices and the components (in M2IOSF) which are
responsible for translating and managing the requests to/from the
device. The performance data is very useful for analyzing the
performance of the PCIe devices.

The PCIe3 uncore PMU was once supported in the Linux kernel, but it was
removed by the commit 2167f1625c2f ("perf/x86/intel/uncore: Remove
PCIe3 unit for SNR") due to the conflict between the PCIe Root Port
driver and the perf uncore driver. The counters of the PCIe3 uncore PMU
are located in the configuration space of the PCIe Root Port device,
which already has a bonded driver portdrv_pci. One device can only have
one bonded driver. The uncore driver always fails to load.

To re-enable the PCIe3 uncore PMU support in the uncore driver, a new
solution should be introduced, which has to meet the requirements as
below:
- must have a reliable way to find the PCIe Root Port device from the
   uncore driver;
- must be able to access the uncore counters of the PCIe Root Port
   device from the uncore driver;
- must support hotplug. When the PCIe Root Port device is removed, the
   uncore driver has to be notified and unregisters the PCIe3 uncore
   PMU.

In the v1 patch set, a new platform device solution is implemented to
address the issue, but it's rejected.
https://lkml.kernel.org/r/20200708183034.GA466341@bjorn-Precision-5520

A new uncore PCI sub driver solution is introduced from the V2 patch
set, which:
- searches the PCIe Root Port device via pci_get_device() and id table.
   The matched pci_dev can be used to register a PMU for accessing the
   counters in the PCIe Root Port device.
- registers a PCI bus notifier. Once the device is removed, the uncore
   driver can be notified to unregister the corresponding PMU.

Kan Liang (6):
   perf/x86/intel/uncore: Factor out uncore_pci_get_dev_die_info()
   perf/x86/intel/uncore: Factor out uncore_pci_find_dev_pmu()
   perf/x86/intel/uncore: Factor out uncore_pci_pmu_register()
   perf/x86/intel/uncore: Factor out uncore_pci_pmu_unregister()
   perf/x86/intel/uncore: Generic support for the PCI sub driver
   perf/x86/intel/uncore: Support PCIe3 unit on Snow Ridge

  arch/x86/events/intel/uncore.c   | 273 +++
  arch/x86/events/intel/uncore.h   |   1 +
  arch/x86/events/intel/uncore_snbep.c |  53 +++
  3 files changed, 265 insertions(+), 62 deletions(-)



Re: [PATCH 0/4] TopDown metrics support for Ice Lake (perf tool)

2020-08-27 Thread Liang, Kan




On 8/26/2020 11:54 AM, Jiri Olsa wrote:

On Thu, Aug 20, 2020 at 09:45:28AM -0700, kan.li...@linux.intel.com wrote:

From: Kan Liang 

The kernel patches have been merged into the tip's perf/core branch.
The patch set is on top of commit 2cb5383b30d4 ("perf/x86/intel: Support
per-thread RDPMC TopDown metrics") of the tip's perf/core branch.

The changes for the perf tool include:
- Extend --topdown option to support per thread TopDown metrics
- Support sample-read topdown metric group
- Add a complete document for the TopDown usage.

Ice Lake has support for measuring the level 1 TopDown metrics
directly in hardware. This is implemented by an additional METRICS
register, and a new Fixed Counter 3 that measures pipeline SLOTS.

New in Icelake
- Do not require generic counters. This allows to collect TopDown always
   in addition to other events.
- Measuring TopDown per thread/process instead of only per core

For the Ice Lake implementation of performance metrics, the values in
PERF_METRICS MSR are derived from fixed counter 3. Software should start
both registers, PERF_METRICS and fixed counter 3, from zero.
Additionally, software is recommended to periodically clear both
registers in order to maintain accurate measurements. The latter is
required for certain scenarios that involve sampling metrics at high
rates. Software should always write fixed counter 3 before write to
PERF_METRICS.

IA32_PERF_GLOBAL_STATUS. OVF_PERF_METRICS[48]: If this bit is set,
it indicates that some PERF_METRICS-related counter has overflowed and
a PMI is triggered. Software has to synchronize, e.g. re-start,
PERF_METRICS as well as fixed counter 3. Otherwise, PERF_METRICS may
return invalid values.

Limitation
- To get accurate result and avoid reading the METRICS register multiple
   times, the TopDown metrics events and SLOTS event have to be in the
   same group.
- METRICS and SLOTS registers have to be cleared after each read by SW.
   That is to prevent the lose of precision.
- Cannot do sampling read SLOTS and TopDown metric events

Please refer SDM Vol3, 18.3.9.3 Performance Metrics for the details of
TopDown metrics.

Andi Kleen (2):
   perf stat: Support new per thread TopDown metrics
   perf, tools: Add documentation for topdown metrics

Kan Liang (2):
   perf tools: Rename group to topdown
   perf record: Support sample-read topdown metric group


I don't have Ice lake to actualy check, but it looks good

Acked-by: Jiri Olsa 




Thanks Jirka for the review.

Hi Arnaldo,

Andi has a comment regarding a grammar error in the printf message. I 
plan to apply a fix in the next version. I'm not sure whether you 
will have more comments for this patch set.

If yes, I will wait for the comments.
If not, should I send a V2 patch set with this fixed?

On 8/20/2020 4:05 PM, Andi Kleen wrote:
>> +  fprintf(stat_config.output,
>> +  "Topdown accuracy may decreases when 
measuring long period.\n"
> "may decrease" ... "long periods".
>
> (s needs to move to the end)


Thanks,
Kan


Re: [PATCH V6 06/16] perf script: Use ULL for enum perf_output_field

2020-08-12 Thread Liang, Kan




On 8/12/2020 8:21 AM, Arnaldo Carvalho de Melo wrote:

Em Mon, Aug 10, 2020 at 02:24:26PM -0700, Kan Liang escreveu:

The Bitwise-Shift operator (1U << ) is used in the enum
perf_output_field, which has already reached its capacity (32 items).
If more items are added, a compile error will be triggered.

Change the U to ULL, which extends the capacity to 64 items.

The enum perf_output_field is only used to calculate a value for the
'fields' in the output structure. The 'fields' is u64. The change
doesn't break anything.


Jiri did this already:

https://git.kernel.org/torvalds/c/60e5eeb56a1


Thanks for pointing it out.

I will rebase the code on top of it.

Thanks,
Kan

  

Signed-off-by: Kan Liang 
---
  tools/perf/builtin-script.c | 64 ++---
  1 file changed, 32 insertions(+), 32 deletions(-)

diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index 447457786362..214bec350971 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -82,38 +82,38 @@ static bool native_arch;
  unsigned int scripting_max_stack = PERF_MAX_STACK_DEPTH;
  
  enum perf_output_field {

-   PERF_OUTPUT_COMM= 1U << 0,
-   PERF_OUTPUT_TID = 1U << 1,
-   PERF_OUTPUT_PID = 1U << 2,
-   PERF_OUTPUT_TIME= 1U << 3,
-   PERF_OUTPUT_CPU = 1U << 4,
-   PERF_OUTPUT_EVNAME  = 1U << 5,
-   PERF_OUTPUT_TRACE   = 1U << 6,
-   PERF_OUTPUT_IP  = 1U << 7,
-   PERF_OUTPUT_SYM = 1U << 8,
-   PERF_OUTPUT_DSO = 1U << 9,
-   PERF_OUTPUT_ADDR= 1U << 10,
-   PERF_OUTPUT_SYMOFFSET   = 1U << 11,
-   PERF_OUTPUT_SRCLINE = 1U << 12,
-   PERF_OUTPUT_PERIOD  = 1U << 13,
-   PERF_OUTPUT_IREGS   = 1U << 14,
-   PERF_OUTPUT_BRSTACK = 1U << 15,
-   PERF_OUTPUT_BRSTACKSYM  = 1U << 16,
-   PERF_OUTPUT_DATA_SRC= 1U << 17,
-   PERF_OUTPUT_WEIGHT  = 1U << 18,
-   PERF_OUTPUT_BPF_OUTPUT  = 1U << 19,
-   PERF_OUTPUT_CALLINDENT  = 1U << 20,
-   PERF_OUTPUT_INSN= 1U << 21,
-   PERF_OUTPUT_INSNLEN = 1U << 22,
-   PERF_OUTPUT_BRSTACKINSN = 1U << 23,
-   PERF_OUTPUT_BRSTACKOFF  = 1U << 24,
-   PERF_OUTPUT_SYNTH   = 1U << 25,
-   PERF_OUTPUT_PHYS_ADDR   = 1U << 26,
-   PERF_OUTPUT_UREGS   = 1U << 27,
-   PERF_OUTPUT_METRIC  = 1U << 28,
-   PERF_OUTPUT_MISC= 1U << 29,
-   PERF_OUTPUT_SRCCODE = 1U << 30,
-   PERF_OUTPUT_IPC = 1U << 31,
+   PERF_OUTPUT_COMM= 1ULL << 0,
+   PERF_OUTPUT_TID = 1ULL << 1,
+   PERF_OUTPUT_PID = 1ULL << 2,
+   PERF_OUTPUT_TIME= 1ULL << 3,
+   PERF_OUTPUT_CPU = 1ULL << 4,
+   PERF_OUTPUT_EVNAME  = 1ULL << 5,
+   PERF_OUTPUT_TRACE   = 1ULL << 6,
+   PERF_OUTPUT_IP  = 1ULL << 7,
+   PERF_OUTPUT_SYM = 1ULL << 8,
+   PERF_OUTPUT_DSO = 1ULL << 9,
+   PERF_OUTPUT_ADDR= 1ULL << 10,
+   PERF_OUTPUT_SYMOFFSET   = 1ULL << 11,
+   PERF_OUTPUT_SRCLINE = 1ULL << 12,
+   PERF_OUTPUT_PERIOD  = 1ULL << 13,
+   PERF_OUTPUT_IREGS   = 1ULL << 14,
+   PERF_OUTPUT_BRSTACK = 1ULL << 15,
+   PERF_OUTPUT_BRSTACKSYM  = 1ULL << 16,
+   PERF_OUTPUT_DATA_SRC= 1ULL << 17,
+   PERF_OUTPUT_WEIGHT  = 1ULL << 18,
+   PERF_OUTPUT_BPF_OUTPUT  = 1ULL << 19,
+   PERF_OUTPUT_CALLINDENT  = 1ULL << 20,
+   PERF_OUTPUT_INSN= 1ULL << 21,
+   PERF_OUTPUT_INSNLEN = 1ULL << 22,
+   PERF_OUTPUT_BRSTACKINSN = 1ULL << 23,
+   PERF_OUTPUT_BRSTACKOFF  = 1ULL << 24,
+   PERF_OUTPUT_SYNTH   = 1ULL << 25,
+   PERF_OUTPUT_PHYS_ADDR   = 1ULL << 26,
+   PERF_OUTPUT_UREGS   = 1ULL << 27,
+   PERF_OUTPUT_METRIC  = 1ULL << 28,
+   PERF_OUTPUT_MISC= 1ULL << 29,
+   PERF_OUTPUT_SRCCODE = 1ULL << 30,
+   PERF_OUTPUT_IPC = 1ULL << 31,
  };
  
  struct output_option {

--
2.17.1




