Re: Copying TLS/user register data per perf-sample?

2024-04-09 Thread Namhyung Kim
Hello,

On Thu, Apr 4, 2024 at 12:26 PM Beau Belgrave  wrote:
>
> Hello,
>
> I'm looking into the possibility of capturing user data that is pointed
> to by a user register (i.e. fs/gs for TLS on x86-64) for each sample via
> perf_events.
>
> I was hoping to find a way to do this similar to PERF_SAMPLE_STACK_USER.
> I think it could even use roughly the same ABI in the perf ring buffer.
> Or it may be possible by some kprobe linked to the perf sample function.
>
> This would allow a profiler to collect TLS (or other values) on x86-64. In
> the OpenTelemetry profiling SIG [1], we are trying to find a fast way
> to grab a tracing association quickly on a per-thread basis. The team
> at Elastic has a bespoke way to do this [2]; however, I'd like to see a
> more general way to achieve this. The folks I've been talking with seem
> open to the idea of just having a TLS value for this that we could capture
> upon each sample. We could then just state that OpenTelemetry SDKs should
> have a TLS value for span correlation. However, we need a way to sample
> the TLS value(s) when a sampling event is generated.
>
> Is this already possible via some other means? It'd be great to be able
> to do this directly at the perf_event sample via the ABI or a probe.

I don't think the current perf ABI allows capturing %fs/%gs + offset.
IIRC kprobes/uprobes don't have that either, but I could be wrong.

Thanks,
Namhyung

>
> 1. https://opentelemetry.io/blog/2024/profiling/
> 2. 
> https://www.elastic.co/blog/continuous-profiling-distributed-tracing-correlation
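A rough sketch of the kind of workaround [2] uses: a BPF program attached
to the sampling perf event reads a TLS slot relative to the task's FS base
at sample time.  TLS_SPAN_OFFSET and the output plumbing are hypothetical
(the real offset would be defined by the SDK), and this assumes the usual
libbpf setup with a generated vmlinux.h:

  /* sample_tls.bpf.c -- hedged sketch, not a tested implementation */
  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_core_read.h>

  #define TLS_SPAN_OFFSET 0x10  /* hypothetical offset defined by the SDK */

  SEC("perf_event")
  int sample_tls(void *ctx)
  {
          struct task_struct *task = (void *)bpf_get_current_task();
          unsigned long fsbase;
          __u64 span_id = 0;

          /* on x86-64 the user FS base is kept in task->thread.fsbase */
          fsbase = BPF_CORE_READ(task, thread.fsbase);
          bpf_probe_read_user(&span_id, sizeof(span_id),
                              (void *)(fsbase + TLS_SPAN_OFFSET));

          /* emit span_id alongside the sample, e.g. via a ring buffer
           * (omitted here)
           */
          return 0;
  }

  char LICENSE[] SEC("license") = "GPL";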



Re: [PATCH v1 0/4] perf parse-regs: Cleanup config and building

2024-02-16 Thread Namhyung Kim
On Wed, 14 Feb 2024 19:39:43 +0800, Leo Yan wrote:
> Currently, the perf build enables register parsing based on whether the
> target architecture supports the register feature.
> 
> Furthermore, the perf build system needs to maintain the variable
> 'NO_PERF_REGS' and define the macro 'HAVE_PERF_REGS_SUPPORT' for
> statically compiling the tool.
> 
> [...]

Applied to perf-tools-next, thanks!

Best regards,
-- 
Namhyung Kim 



[PATCH 05/14] tools headers UAPI: Update tools' copy of vhost.h header

2023-11-21 Thread Namhyung Kim
tldr; Just FYI, I'm carrying this on the perf tools tree.

Full explanation:

There used to be no copies, with tools/ code using kernel headers
directly. From time to time tools/perf/ broke due to legitimate kernel
hacking. At some point Linus complained about such direct usage. Then we
adopted the current model.

The way these headers are used in perf is not restricted to just
including them to compile something.

They are sometimes used in scripts that convert defines into string
tables, etc., so a change may break one of these scripts, or new MSRs
may use some different #define pattern, etc.

E.g.:

  $ ls -1 tools/perf/trace/beauty/*.sh | head -5
  tools/perf/trace/beauty/arch_errno_names.sh
  tools/perf/trace/beauty/drm_ioctl.sh
  tools/perf/trace/beauty/fadvise.sh
  tools/perf/trace/beauty/fsconfig.sh
  tools/perf/trace/beauty/fsmount.sh
  $
  $ tools/perf/trace/beauty/fadvise.sh
  static const char *fadvise_advices[] = {
[0] = "NORMAL",
[1] = "RANDOM",
[2] = "SEQUENTIAL",
[3] = "WILLNEED",
[4] = "DONTNEED",
[5] = "NOREUSE",
  };
  $

The tools/perf/check-headers.sh script, part of the tools/ build
process, points out changes in the original files.
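E.g., when a copy drifts, the build prints a warning along these lines
(output approximate):

  $ make -C tools/perf
  ...
  Warning: Kernel ABI header at 'tools/include/uapi/linux/vhost.h' differs
  from latest version at 'include/uapi/linux/vhost.h'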

So it's important not to touch the copies in tools/ when doing changes in
the original kernel headers; that will be done later, when
check-headers.sh informs the perf tools hackers about the change.

Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: k...@vger.kernel.org
Cc: virtualizat...@lists.linux.dev
Cc: net...@vger.kernel.org
Signed-off-by: Namhyung Kim 
---
 tools/include/uapi/linux/vhost.h | 8 
 1 file changed, 8 insertions(+)

diff --git a/tools/include/uapi/linux/vhost.h b/tools/include/uapi/linux/vhost.h
index f5c48b61ab62..649560c685f1 100644
--- a/tools/include/uapi/linux/vhost.h
+++ b/tools/include/uapi/linux/vhost.h
@@ -219,4 +219,12 @@
  */
 #define VHOST_VDPA_RESUME  _IO(VHOST_VIRTIO, 0x7E)
 
+/* Get the group for the descriptor table including driver & device areas
+ * of a virtqueue: read index, write group in num.
+ * The virtqueue index is stored in the index field of vhost_vring_state.
+ * The group ID of the descriptor table for this specific virtqueue
+ * is returned via num field of vhost_vring_state.
+ */
+#define VHOST_VDPA_GET_VRING_DESC_GROUP	_IOWR(VHOST_VIRTIO, 0x7F,	\
+					      struct vhost_vring_state)
 #endif
-- 
2.43.0.rc1.413.gea7ed67945-goog




Re: [PATCH v3 1/2] perf/core: Share an event with multiple cgroups

2021-04-20 Thread Namhyung Kim
Hi Peter,

On Tue, Apr 20, 2021 at 7:28 PM Peter Zijlstra  wrote:
>
> On Fri, Apr 16, 2021 at 06:49:09PM +0900, Namhyung Kim wrote:
> > On Thu, Apr 15, 2021 at 11:51 PM Peter Zijlstra  
> > wrote:
> > > > +static void perf_update_cgroup_node(struct perf_event *event, struct cgroup *cgrp)
> > > > +{
> > > > + u64 delta_count, delta_time_enabled, delta_time_running;
> > > > + int i;
> > > > +
> > > > + if (event->cgrp_node_count == 0)
> > > > + goto out;
> > > > +
> > > > + delta_count = local64_read(&event->count) - event->cgrp_node_count;
>
> From here...
>
> > > > + delta_time_enabled = event->total_time_enabled - event->cgrp_node_time_enabled;
> > > > + delta_time_running = event->total_time_running - event->cgrp_node_time_running;
> > > > +
> > > > + /* account delta to all ancestor cgroups */
> > > > + for (i = 0; i <= cgrp->level; i++) {
> > > > + struct perf_cgroup_node *node;
> > > > +
> > > > + node = find_cgroup_node(event, cgrp->ancestor_ids[i]);
> > > > + if (node) {
> > > > + node->count += delta_count;
> > > > + node->time_enabled += delta_time_enabled;
> > > > + node->time_running += delta_time_running;
> > > > + }
> > > > + }
>
> ... till here, NMI could hit and increment event->count, which then
> means that:
>
> > > > +
> > > > +out:
> > > > + event->cgrp_node_count = local64_read(&event->count);
>
> This load doesn't match the delta_count load and events will go missing.
>
> Obviously correct solution is:
>
> event->cgrp_node_count += delta_count;
>
>
> > > > + event->cgrp_node_time_enabled = event->total_time_enabled;
> > > > + event->cgrp_node_time_running = event->total_time_running;
>
> And while total_time doesn't have that problem, consistency would then
> have you do:
>
> event->cgrp_node_time_foo += delta_time_foo;
>
> > >
> > > This is wrong; there's no guarantee these are the same values you read
> > > at the beginning, IOW you could be losing events.
> >
> > Could you please elaborate?
>
> You forgot NMI.

Thanks for your explanation.  Maybe I'm missing something, but
this event is basically for counting and doesn't allow sampling.
Are you saying it's affected by other sampling events?  Note that
it's not reading from the PMU here; what it reads is a snapshot
of the last pmu->read(event), afaik.

Thanks,
Namhyung


Re: [PATCH v3 3/3] perf tools: Add 'cgroup-switches' software event

2021-04-19 Thread Namhyung Kim
Hi Arnaldo,

Could you please pick this up?  The kernel part has landed in
tip.git already.

Thanks,
Namhyung

On Wed, Feb 10, 2021 at 5:33 PM Namhyung Kim  wrote:
>
> It counts how often the cgroup actually changes during context
> switches.
>
>   # perf stat -a -e context-switches,cgroup-switches sleep 1
>
>Performance counter stats for 'system wide':
>
>   11,267  context-switches
>   10,950  cgroup-switches
>
>  1.015634369 seconds time elapsed
>
> Signed-off-by: Namhyung Kim 
> ---
>  tools/include/uapi/linux/perf_event.h | 1 +
>  tools/perf/util/parse-events.c| 4 
>  tools/perf/util/parse-events.l| 1 +
>  3 files changed, 6 insertions(+)
>
> diff --git a/tools/include/uapi/linux/perf_event.h 
> b/tools/include/uapi/linux/perf_event.h
> index b15e3447cd9f..16b9538ad89b 100644
> --- a/tools/include/uapi/linux/perf_event.h
> +++ b/tools/include/uapi/linux/perf_event.h
> @@ -112,6 +112,7 @@ enum perf_sw_ids {
> PERF_COUNT_SW_EMULATION_FAULTS  = 8,
> PERF_COUNT_SW_DUMMY = 9,
> PERF_COUNT_SW_BPF_OUTPUT= 10,
> +   PERF_COUNT_SW_CGROUP_SWITCHES   = 11,
>
> PERF_COUNT_SW_MAX,  /* non-ABI */
>  };
> diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
> index 42c84adeb2fb..09ff678519f3 100644
> --- a/tools/perf/util/parse-events.c
> +++ b/tools/perf/util/parse-events.c
> @@ -145,6 +145,10 @@ struct event_symbol event_symbols_sw[PERF_COUNT_SW_MAX] = {
> .symbol = "bpf-output",
> .alias  = "",
> },
> +   [PERF_COUNT_SW_CGROUP_SWITCHES] = {
> +   .symbol = "cgroup-switches",
> +   .alias  = "",
> +   },
>  };
>
>  #define __PERF_EVENT_FIELD(config, name) \
> diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
> index 9db5097317f4..88f203bb6fab 100644
> --- a/tools/perf/util/parse-events.l
> +++ b/tools/perf/util/parse-events.l
> @@ -347,6 +347,7 @@ emulation-faults	{ return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_EMULATION_FAULTS); }
>  dummy			{ return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_DUMMY); }
>  duration_time		{ return tool(yyscanner, PERF_TOOL_DURATION_TIME); }
>  bpf-output		{ return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_BPF_OUTPUT); }
> +cgroup-switches		{ return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_CGROUP_SWITCHES); }
>
> /*
>  * We have to handle the kernel PMU event 
> cycles-ct/cycles-t/mem-loads/mem-stores separately.
> --
> 2.30.0.478.g8a0d178c01-goog
>


Re: [PATCH v3 3/4] perf-stat: introduce config stat.bpf-counter-events

2021-04-17 Thread Namhyung Kim
Hi Song,

On Sat, Apr 17, 2021 at 7:13 AM Song Liu  wrote:
>
> Currently, to use BPF to aggregate perf event counters, the user uses
> --bpf-counters option. Enable "use bpf by default" events with a config
> option, stat.bpf-counter-events. Events with name in the option will use
> BPF.
>
> This also enables mixing BPF events and regular events in the same session.
> For example:
>
>perf config stat.bpf-counter-events=instructions
>perf stat -e instructions,cs
>
> The second command will use BPF for "instructions" but not "cs".
>
> Signed-off-by: Song Liu 
> ---
> @@ -535,12 +549,13 @@ static int enable_counters(void)
> struct evsel *evsel;
> int err;
>
> -   if (target__has_bpf()) {
> -   evlist__for_each_entry(evsel_list, evsel) {
> -   err = bpf_counter__enable(evsel);
> -   if (err)
> -   return err;
> -   }
> +   evlist__for_each_entry(evsel_list, evsel) {
> +   if (!evsel__is_bpf(evsel))
> +   continue;
> +
> +   err = bpf_counter__enable(evsel);
> +   if (err)
> +   return err;

I just realized it doesn't have a disable counterpart.

> }
>
> if (stat_config.initial_delay < 0) {
> @@ -784,11 +799,9 @@ static int __run_perf_stat(int argc, const char **argv, 
> int run_idx)
> if (affinity__setup() < 0)
> return -1;
>
> -   if (target__has_bpf()) {
> -   evlist__for_each_entry(evsel_list, counter) {
> -   if (bpf_counter__load(counter, &target))
> -   return -1;
> -   }
> +   evlist__for_each_entry(evsel_list, counter) {
> +   if (bpf_counter__load(counter, &target))
> +   return -1;
> }
>
> evlist__for_each_cpu (evsel_list, i, cpu) {

[SNIP]
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index 2d2614eeaa20e..080ddcfefbcd2 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -492,6 +492,28 @@ const char *evsel__hw_names[PERF_COUNT_HW_MAX] = {
> "ref-cycles",
>  };
>
> +char *evsel__bpf_counter_events;
> +
> +bool evsel__match_bpf_counter_events(const char *name)
> +{
> +   int name_len;
> +   bool match;
> +   char *ptr;
> +
> +   if (!evsel__bpf_counter_events)
> +   return false;
> +
> +   ptr = strstr(evsel__bpf_counter_events, name);
> +   name_len = strlen(name);
> +
> +   /* check name matches a full token in evsel__bpf_counter_events */
> +   match = (ptr != NULL) &&
> +   ((ptr == evsel__bpf_counter_events) || (*(ptr - 1) == ',')) &&
> +   ((*(ptr + name_len) == ',') || (*(ptr + name_len) == '\0'));

I'm not sure we have an event name which is a substring of another.
Maybe it can retry if it fails to match.
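("instructions" inside "branch-instructions" would be such a case.)
A token-wise retry, sketched here just to illustrate the idea (not part
of the patch), could look like:

  #include <stdbool.h>
  #include <string.h>

  /* keep searching past partial matches instead of giving up */
  static bool match_full_token(const char *list, const char *name)
  {
          size_t len = strlen(name);
          const char *ptr = list;

          while ((ptr = strstr(ptr, name)) != NULL) {
                  bool head = (ptr == list) || (*(ptr - 1) == ',');
                  bool tail = (ptr[len] == ',') || (ptr[len] == '\0');

                  if (head && tail)
                          return true;
                  ptr++;  /* skip this partial match and retry */
          }
          return false;
  }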

Thanks,
Namhyung

> +
> +   return match;
> +}
> +
>  static const char *__evsel__hw_name(u64 config)
>  {
> if (config < PERF_COUNT_HW_MAX && evsel__hw_names[config])


[tip: perf/core] perf core: Add PERF_COUNT_SW_CGROUP_SWITCHES event

2021-04-16 Thread tip-bot2 for Namhyung Kim
The following commit has been merged into the perf/core branch of tip:

Commit-ID: d0d1dd628527c77db2391ce0293c1ed344b2365f
Gitweb:
https://git.kernel.org/tip/d0d1dd628527c77db2391ce0293c1ed344b2365f
Author:Namhyung Kim 
AuthorDate:Wed, 10 Feb 2021 17:33:26 +09:00
Committer: Peter Zijlstra 
CommitterDate: Fri, 16 Apr 2021 18:58:52 +02:00

perf core: Add PERF_COUNT_SW_CGROUP_SWITCHES event

This patch adds a new software event to count context switches
involving cgroup switches.  So it's counted only if cgroups of
previous and next tasks are different.  Note that it only checks the
cgroups in the perf_event subsystem.  For cgroup v2, it shouldn't
matter anyway.

One can argue that we can do this by using existing sched_switch event
with eBPF.  But some systems might not have eBPF for some reason so
I'd like to add this as a simple way.

Signed-off-by: Namhyung Kim 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20210210083327.22726-2-namhy...@kernel.org
---
 include/linux/perf_event.h  | 7 +++
 include/uapi/linux/perf_event.h | 1 +
 2 files changed, 8 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 92d51a7..8989b2b 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1218,6 +1218,13 @@ static inline void perf_event_task_sched_out(struct task_struct *prev,
if (__perf_sw_enabled(PERF_COUNT_SW_CONTEXT_SWITCHES))
__perf_sw_event_sched(PERF_COUNT_SW_CONTEXT_SWITCHES, 1, 0);
 
+#ifdef CONFIG_CGROUP_PERF
+   if (__perf_sw_enabled(PERF_COUNT_SW_CGROUP_SWITCHES) &&
+   perf_cgroup_from_task(prev, NULL) !=
+   perf_cgroup_from_task(next, NULL))
+   __perf_sw_event_sched(PERF_COUNT_SW_CGROUP_SWITCHES, 1, 0);
+#endif
+
if (static_branch_unlikely(&perf_sched_events))
__perf_event_task_sched_out(prev, next);
 }
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 31b00e3..0b58970 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -112,6 +112,7 @@ enum perf_sw_ids {
PERF_COUNT_SW_EMULATION_FAULTS  = 8,
PERF_COUNT_SW_DUMMY = 9,
PERF_COUNT_SW_BPF_OUTPUT= 10,
+   PERF_COUNT_SW_CGROUP_SWITCHES   = 11,
 
PERF_COUNT_SW_MAX,  /* non-ABI */
 };


[tip: perf/core] perf core: Factor out __perf_sw_event_sched

2021-04-16 Thread tip-bot2 for Namhyung Kim
The following commit has been merged into the perf/core branch of tip:

Commit-ID: 7c8056bb366b1b2dc8e4a3cc0b876e15a8ebca2c
Gitweb:
https://git.kernel.org/tip/7c8056bb366b1b2dc8e4a3cc0b876e15a8ebca2c
Author:Namhyung Kim 
AuthorDate:Wed, 10 Feb 2021 17:33:25 +09:00
Committer: Peter Zijlstra 
CommitterDate: Fri, 16 Apr 2021 18:58:52 +02:00

perf core: Factor out __perf_sw_event_sched

In some cases, we need to check more than whether the software event
is enabled.  So split the condition check and the actual event
handling.  This is a preparation for the next change.

Suggested-by: Peter Zijlstra 
Signed-off-by: Namhyung Kim 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20210210083327.22726-1-namhy...@kernel.org
---
 include/linux/perf_event.h | 33 -
 1 file changed, 12 insertions(+), 21 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 7d7280a..92d51a7 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1178,30 +1178,24 @@ DECLARE_PER_CPU(struct pt_regs, __perf_regs[4]);
  * which is guaranteed by us not actually scheduling inside other swevents
  * because those disable preemption.
  */
-static __always_inline void
-perf_sw_event_sched(u32 event_id, u64 nr, u64 addr)
+static __always_inline void __perf_sw_event_sched(u32 event_id, u64 nr, u64 addr)
 {
-   if (static_key_false(&perf_swevent_enabled[event_id])) {
-   struct pt_regs *regs = this_cpu_ptr(&__perf_regs[0]);
+   struct pt_regs *regs = this_cpu_ptr(&__perf_regs[0]);
 
-   perf_fetch_caller_regs(regs);
-   ___perf_sw_event(event_id, nr, regs, addr);
-   }
+   perf_fetch_caller_regs(regs);
+   ___perf_sw_event(event_id, nr, regs, addr);
 }
 
 extern struct static_key_false perf_sched_events;
 
-static __always_inline bool
-perf_sw_migrate_enabled(void)
+static __always_inline bool __perf_sw_enabled(int swevt)
 {
-   if (static_key_false(&perf_swevent_enabled[PERF_COUNT_SW_CPU_MIGRATIONS]))
-   return true;
-   return false;
+   return static_key_false(&perf_swevent_enabled[swevt]);
 }
 
 static inline void perf_event_task_migrate(struct task_struct *task)
 {
-   if (perf_sw_migrate_enabled())
+   if (__perf_sw_enabled(PERF_COUNT_SW_CPU_MIGRATIONS))
task->sched_migrated = 1;
 }
 
@@ -1211,11 +1205,9 @@ static inline void perf_event_task_sched_in(struct task_struct *prev,
if (static_branch_unlikely(&perf_sched_events))
__perf_event_task_sched_in(prev, task);
 
-   if (perf_sw_migrate_enabled() && task->sched_migrated) {
-   struct pt_regs *regs = this_cpu_ptr(&__perf_regs[0]);
-
-   perf_fetch_caller_regs(regs);
-   ___perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, regs, 0);
+   if (__perf_sw_enabled(PERF_COUNT_SW_CPU_MIGRATIONS) &&
+   task->sched_migrated) {
+   __perf_sw_event_sched(PERF_COUNT_SW_CPU_MIGRATIONS, 1, 0);
task->sched_migrated = 0;
}
 }
@@ -1223,7 +1215,8 @@ static inline void perf_event_task_sched_in(struct task_struct *prev,
 static inline void perf_event_task_sched_out(struct task_struct *prev,
 struct task_struct *next)
 {
-   perf_sw_event_sched(PERF_COUNT_SW_CONTEXT_SWITCHES, 1, 0);
+   if (__perf_sw_enabled(PERF_COUNT_SW_CONTEXT_SWITCHES))
+   __perf_sw_event_sched(PERF_COUNT_SW_CONTEXT_SWITCHES, 1, 0);
 
if (static_branch_unlikely(&perf_sched_events))
__perf_event_task_sched_out(prev, next);
@@ -1480,8 +1473,6 @@ static inline int perf_event_refresh(struct perf_event *event, int refresh)
 static inline void
 perf_sw_event(u32 event_id, u64 nr, struct pt_regs *regs, u64 addr){ }
 static inline void
-perf_sw_event_sched(u32 event_id, u64 nr, u64 addr){ }
-static inline void
 perf_bp_event(struct perf_event *event, void *data){ }
 
 static inline int perf_register_guest_info_callbacks


[tip: perf/core] perf core: Factor out __perf_sw_event_sched

2021-04-16 Thread tip-bot2 for Namhyung Kim
The following commit has been merged into the perf/core branch of tip:

Commit-ID: 64f6aeb6dc7a2426278fd9017264cf24bfdbebd6
Gitweb:
https://git.kernel.org/tip/64f6aeb6dc7a2426278fd9017264cf24bfdbebd6
Author:Namhyung Kim 
AuthorDate:Wed, 10 Feb 2021 17:33:25 +09:00
Committer: Peter Zijlstra 
CommitterDate: Fri, 16 Apr 2021 16:32:43 +02:00

perf core: Factor out __perf_sw_event_sched

In some cases, we need to check more than whether the software event
is enabled.  So split the condition check and the actual event
handling.  This is a preparation for the next change.

Suggested-by: Peter Zijlstra 
Signed-off-by: Namhyung Kim 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20210210083327.22726-1-namhy...@kernel.org
---
 include/linux/perf_event.h | 33 -
 1 file changed, 12 insertions(+), 21 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 7d7280a..92d51a7 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1178,30 +1178,24 @@ DECLARE_PER_CPU(struct pt_regs, __perf_regs[4]);
  * which is guaranteed by us not actually scheduling inside other swevents
  * because those disable preemption.
  */
-static __always_inline void
-perf_sw_event_sched(u32 event_id, u64 nr, u64 addr)
+static __always_inline void __perf_sw_event_sched(u32 event_id, u64 nr, u64 addr)
 {
-   if (static_key_false(&perf_swevent_enabled[event_id])) {
-   struct pt_regs *regs = this_cpu_ptr(&__perf_regs[0]);
+   struct pt_regs *regs = this_cpu_ptr(&__perf_regs[0]);
 
-   perf_fetch_caller_regs(regs);
-   ___perf_sw_event(event_id, nr, regs, addr);
-   }
+   perf_fetch_caller_regs(regs);
+   ___perf_sw_event(event_id, nr, regs, addr);
 }
 
 extern struct static_key_false perf_sched_events;
 
-static __always_inline bool
-perf_sw_migrate_enabled(void)
+static __always_inline bool __perf_sw_enabled(int swevt)
 {
-   if (static_key_false(&perf_swevent_enabled[PERF_COUNT_SW_CPU_MIGRATIONS]))
-   return true;
-   return false;
+   return static_key_false(&perf_swevent_enabled[swevt]);
 }
 
 static inline void perf_event_task_migrate(struct task_struct *task)
 {
-   if (perf_sw_migrate_enabled())
+   if (__perf_sw_enabled(PERF_COUNT_SW_CPU_MIGRATIONS))
task->sched_migrated = 1;
 }
 
@@ -1211,11 +1205,9 @@ static inline void perf_event_task_sched_in(struct task_struct *prev,
if (static_branch_unlikely(&perf_sched_events))
__perf_event_task_sched_in(prev, task);
 
-   if (perf_sw_migrate_enabled() && task->sched_migrated) {
-   struct pt_regs *regs = this_cpu_ptr(&__perf_regs[0]);
-
-   perf_fetch_caller_regs(regs);
-   ___perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, regs, 0);
+   if (__perf_sw_enabled(PERF_COUNT_SW_CPU_MIGRATIONS) &&
+   task->sched_migrated) {
+   __perf_sw_event_sched(PERF_COUNT_SW_CPU_MIGRATIONS, 1, 0);
task->sched_migrated = 0;
}
 }
@@ -1223,7 +1215,8 @@ static inline void perf_event_task_sched_in(struct task_struct *prev,
 static inline void perf_event_task_sched_out(struct task_struct *prev,
 struct task_struct *next)
 {
-   perf_sw_event_sched(PERF_COUNT_SW_CONTEXT_SWITCHES, 1, 0);
+   if (__perf_sw_enabled(PERF_COUNT_SW_CONTEXT_SWITCHES))
+   __perf_sw_event_sched(PERF_COUNT_SW_CONTEXT_SWITCHES, 1, 0);
 
if (static_branch_unlikely(&perf_sched_events))
__perf_event_task_sched_out(prev, next);
@@ -1480,8 +1473,6 @@ static inline int perf_event_refresh(struct perf_event *event, int refresh)
 static inline void
 perf_sw_event(u32 event_id, u64 nr, struct pt_regs *regs, u64 addr){ }
 static inline void
-perf_sw_event_sched(u32 event_id, u64 nr, u64 addr){ }
-static inline void
 perf_bp_event(struct perf_event *event, void *data){ }
 
 static inline int perf_register_guest_info_callbacks


[tip: perf/core] perf core: Add PERF_COUNT_SW_CGROUP_SWITCHES event

2021-04-16 Thread tip-bot2 for Namhyung Kim
The following commit has been merged into the perf/core branch of tip:

Commit-ID: a389ea9c161d142bf11fd4c553988c2daa9f5404
Gitweb:
https://git.kernel.org/tip/a389ea9c161d142bf11fd4c553988c2daa9f5404
Author:Namhyung Kim 
AuthorDate:Wed, 10 Feb 2021 17:33:26 +09:00
Committer: Peter Zijlstra 
CommitterDate: Fri, 16 Apr 2021 16:32:43 +02:00

perf core: Add PERF_COUNT_SW_CGROUP_SWITCHES event

This patch adds a new software event to count context switches
involving cgroup switches.  So it's counted only if cgroups of
previous and next tasks are different.  Note that it only checks the
cgroups in the perf_event subsystem.  For cgroup v2, it shouldn't
matter anyway.

One can argue that we can do this by using existing sched_switch event
with eBPF.  But some systems might not have eBPF for some reason so
I'd like to add this as a simple way.

Signed-off-by: Namhyung Kim 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20210210083327.22726-2-namhy...@kernel.org
---
 include/linux/perf_event.h  | 7 +++
 include/uapi/linux/perf_event.h | 1 +
 2 files changed, 8 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 92d51a7..8989b2b 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1218,6 +1218,13 @@ static inline void perf_event_task_sched_out(struct task_struct *prev,
if (__perf_sw_enabled(PERF_COUNT_SW_CONTEXT_SWITCHES))
__perf_sw_event_sched(PERF_COUNT_SW_CONTEXT_SWITCHES, 1, 0);
 
+#ifdef CONFIG_CGROUP_PERF
+   if (__perf_sw_enabled(PERF_COUNT_SW_CGROUP_SWITCHES) &&
+   perf_cgroup_from_task(prev, NULL) !=
+   perf_cgroup_from_task(next, NULL))
+   __perf_sw_event_sched(PERF_COUNT_SW_CGROUP_SWITCHES, 1, 0);
+#endif
+
if (static_branch_unlikely(&perf_sched_events))
__perf_event_task_sched_out(prev, next);
 }
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 31b00e3..0b58970 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -112,6 +112,7 @@ enum perf_sw_ids {
PERF_COUNT_SW_EMULATION_FAULTS  = 8,
PERF_COUNT_SW_DUMMY = 9,
PERF_COUNT_SW_BPF_OUTPUT= 10,
+   PERF_COUNT_SW_CGROUP_SWITCHES   = 11,
 
PERF_COUNT_SW_MAX,  /* non-ABI */
 };


Re: [PATCH v3 1/2] perf/core: Share an event with multiple cgroups

2021-04-16 Thread Namhyung Kim
On Fri, Apr 16, 2021 at 8:59 PM Peter Zijlstra  wrote:
>
> On Fri, Apr 16, 2021 at 08:22:38PM +0900, Namhyung Kim wrote:
> > On Fri, Apr 16, 2021 at 7:28 PM Peter Zijlstra  wrote:
> > >
> > > On Fri, Apr 16, 2021 at 11:29:30AM +0200, Peter Zijlstra wrote:
> > >
> > > > > So I think we've had proposals for being able to close fds in the 
> > > > > past;
> > > > > while preserving groups etc. We've always pushed back on that because 
> > > > > of
> > > > > the resource limit issue. By having each counter be a filedesc we get 
> > > > > a
> > > > > natural limit on the amount of resources you can consume. And in that
> > > > > respect, having to use 400k fds is things working as designed.
> > > > >
> > > > > Anyway, there might be a way around this..
> > >
> > > So how about we flip the whole thing sideways, instead of doing one
> > > event for multiple cgroups, do an event for multiple-cpus.
> > >
> > > Basically, allow:
> > >
> > > perf_event_open(.pid=fd, cpu=-1, .flag=PID_CGROUP);
> > >
> > > Which would have the kernel create nr_cpus events [the corollary is that
> > > we'd probably also allow: (.pid=-1, cpu=-1) ].
> >
> > Do you mean it'd have separate perf_events per cpu internally?
> > From a cpu's perspective, there's nothing changed, right?
> > Then it will have the same performance problem as of now.
>
> Yes, but we'll not end up in ioctl() hell. The interface is sooo much
> better. The performance thing just means we need to think harder.
>
> I thought cgroup scheduling got a lot better with the work Ian did a
> while back? What's the actual bottleneck now?

Yep, that's true, but it still comes with a high cost of multiplexing on
context (cgroup) switch.  It's inefficient that it programs the PMU
with exactly the same config just for a different cgroup.  You know
accessing the MSRs is not a cheap operation.

>
> > > Output could be done by adding FORMAT_PERCPU, which takes the current
> > > read() format and writes a copy for each CPU event. (p)read(v)() could
> > > be used to explode or partial read that.
> >
> > Yeah, I think it's good for read.  But what about mmap?
> > I don't think we can use file offset since it's taken for auxtrace.
> > Maybe we can simply disallow that..
>
> Are you actually using mmap() to read? I had a proposal for FORMAT_GROUP
> like thing for mmap(), but I never implemented that (didn't get the
> enthousiatic response I thought it would). But yeah, there's nowhere
> near enough space in there for PERCPU.

Recently there's a patch to do it with rdpmc which needs to mmap first.

https://lore.kernel.org/lkml/20210414155412.3697605-1-r...@kernel.org/

>
> Not sure how to do that, these counters must not be sampling counters
> because we can't be sharing a buffer from multiple CPUs, so data/aux
> just isn't a concern. But it's weird to have them magically behave
> differently.

Yeah it's weird, and we should limit the sampling use case.

Thanks,
Namhyung


Re: [PATCH v3 1/2] perf/core: Share an event with multiple cgroups

2021-04-16 Thread Namhyung Kim
On Fri, Apr 16, 2021 at 7:28 PM Peter Zijlstra  wrote:
>
> On Fri, Apr 16, 2021 at 11:29:30AM +0200, Peter Zijlstra wrote:
>
> > > So I think we've had proposals for being able to close fds in the past;
> > > while preserving groups etc. We've always pushed back on that because of
> > > the resource limit issue. By having each counter be a filedesc we get a
> > > natural limit on the amount of resources you can consume. And in that
> > > respect, having to use 400k fds is things working as designed.
> > >
> > > Anyway, there might be a way around this..
>
> So how about we flip the whole thing sideways, instead of doing one
> event for multiple cgroups, do an event for multiple-cpus.
>
> Basically, allow:
>
> perf_event_open(.pid=fd, cpu=-1, .flag=PID_CGROUP);
>
> Which would have the kernel create nr_cpus events [the corollary is that
> we'd probably also allow: (.pid=-1, cpu=-1) ].

Do you mean it'd have separate perf_events per cpu internally?
From a cpu's perspective, there's nothing changed, right?
Then it will have the same performance problem as of now.

>
> Output could be done by adding FORMAT_PERCPU, which takes the current
> read() format and writes a copy for each CPU event. (p)read(v)() could
> be used to explode or partial read that.

Yeah, I think it's good for read.  But what about mmap?
I don't think we can use file offset since it's taken for auxtrace.
Maybe we can simply disallow that..

>
> This gets rid of the nasty variadic nature of the
> 'get-me-these-n-cgroups'. While still getting rid of the n*m fd issue
> you're facing.

As I said, it's not just a file descriptor problem.  In fact, performance
is more concerning.

Thanks,
Namhyung


Re: [PATCH v3 1/2] perf/core: Share an event with multiple cgroups

2021-04-16 Thread Namhyung Kim
On Fri, Apr 16, 2021 at 6:29 PM Peter Zijlstra  wrote:
>
>
> Duh.. this is a half-finished email I meant to save for later. Anyway,
> I'll reply more.

Nevermind, and thanks for your time! :-)

Namhyung


Re: [PATCH v3 1/2] perf/core: Share an event with multiple cgroups

2021-04-16 Thread Namhyung Kim
On Fri, Apr 16, 2021 at 6:27 PM Peter Zijlstra  wrote:
>
> On Fri, Apr 16, 2021 at 08:48:12AM +0900, Namhyung Kim wrote:
> > On Thu, Apr 15, 2021 at 11:51 PM Peter Zijlstra  
> > wrote:
> > > On Tue, Apr 13, 2021 at 08:53:36AM -0700, Namhyung Kim wrote:
>
> > > > cgroup event counting (i.e. perf stat).
> > > >
> > > >  * PERF_EVENT_IOC_ATTACH_CGROUP - it takes a buffer consists of a
> > > >  64-bit array to attach given cgroups.  The first element is a
> > > >  number of cgroups in the buffer, and the rest is a list of cgroup
> > > >  ids to add a cgroup info to the given event.
> > >
> > > WTH is a cgroup-id? The syscall takes a fd to the path, why have two
> > > different ways?
> >
> > As you know, we already use cgroup-id for sampling.  Yeah we
> > can do it with the fd but one of the points of this patch is to reduce
> > the number of file descriptors. :)
>
> Well, I found those patches again after I wrote that. But I'm still not
> sure what a cgroup-id is from userspace.

It's a file handle that can be obtained from name_to_handle_at(2).
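For example (a minimal sketch; cgroupfs encodes the 64-bit id in the
handle payload, similar to what the perf tool itself does):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdint.h>

  /* read the 64-bit cgroup id for a cgroup path */
  static uint64_t read_cgroup_id(const char *path)
  {
          struct {
                  struct file_handle fh;
                  uint64_t cgroup_id;     /* f_handle payload */
          } handle = { .fh.handle_bytes = sizeof(uint64_t) };
          int mount_id;

          if (name_to_handle_at(AT_FDCWD, path, &handle.fh, &mount_id, 0) < 0)
                  return 0;

          return handle.cgroup_id;
  }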

>
> How does userspace get one given a cgroup? (I actually mounted cgroupfs
> in order to see if there's some new 'id' file to read, there is not)
> Does having the cgroup-id ensure the cgroup exists? Can the cgroup-id
> get re-used?

It doesn't guarantee the existence of the cgroup as far as I know.
The cgroup can go away anytime.  Actually it doesn't matter for
this interface as users will get a 0 result for it.  So I didn't check
the validity of the cgroup-id in the code.

And I don't think the cgroup-id is reused without a reboot, at least
on 64-bit systems.  It comes from a 64-bit integer that is incremented
when a new cgroup is created.  Tejun?

>
> I really don't know what the thing is. I don't use cgroups, like ever,
> except when I'm forced to due to some regression or bugreport.

I hope I made it clear.

>
> > Also, having cgroup-id is good to match with the result (from read)
> > as it contains the cgroup information.
>
> What?

I mean we need to match the result to a cgroup, either by passing
the cgroup-id through ioctl or by adding the info in the read format.

>
> > > >  * PERF_EVENT_IOC_READ_CGROUP - it takes a buffer consists of a 64-bit
> > > >  array to get the event counter values.  The first element is size
> > > >  of the array in byte, and the second element is a cgroup id to
> > > >  read.  The rest is to save the counter value and timings.
> > >
> > > :-(
> > >
> > > So basically you're doing a whole second cgroup interface, one that
> > > violates the one counter per file premise and lives off of ioctl()s.
> >
> > Right, but I'm not sure that we really want a separate event for each
> > cgroup if underlying hardware events are all the same.
>
> Sure, I see where you're coming from; I just don't much like where it
> got you :-)

Ok, let's make it better. :-)

>
> > > *IF* we're going to do something like this, I feel we should explore the
> > > whole vector-per-fd concept before proceeding. Can we make it less yuck
> > > (less special ioctl() and more regular file ops). Can we apply the
> > > concept to more things?
> >
> > Ideally it'd do without keeping file descriptors open.  Maybe we can make
> > the vector accept various types like vector-per-cgroup_id or so.
>
> So I think we've had proposals for being able to close fds in the past;
> while preserving groups etc. We've always pushed back on that because of
> the resource limit issue. By having each counter be a filedesc we get a
> natural limit on the amount of resources you can consume. And in that
> respect, having to use 400k fds is things working as designed.
>
> Anyway, there might be a way around this..

It's not just a file descriptor problem.  By having each counter per cgroup
it should pay the price on multiplexing or event scheduling.  That caused
serious performance problems in production environment so we had
to limit the number of cgroups monitored at a time.

>
> > > The second patch extends the ioctl() to be more read() like, instead of
> > > doing the sane things and extending read() by adding PERF_FORMAT_VECTOR
> > > or whatever. In fact, this whole second ioctl() doesn't make sense to
> > > have if we do indeed want to do vector-per-fd.
> >
> > One of the upsides of the ioctl() is that we can pass cgroup-id to read.
> > Probably we can keep the index in the vector and set the file offset
> > with it.  Or else just read the whole vector, and then it has a cgroup-id
> > in the output like PERF_FORMAT_CGROUP?

Re: [PATCH v3 1/2] perf/core: Share an event with multiple cgroups

2021-04-16 Thread Namhyung Kim
On Thu, Apr 15, 2021 at 11:51 PM Peter Zijlstra  wrote:
> Lots of random comments below.
>
> > This attaches all cgroups in a single syscall and I didn't add the
> > DETACH command deliberately to make the implementation simple.  The
> > attached cgroup nodes would be deleted when the file descriptor of the
> > perf_event is closed.
> >
> > Cc: Tejun Heo 
> > Reported-by: kernel test robot 
>
> What, the whole thing?

Oh, it's just for build issues when !CONFIG_CGROUP_PERF

>
> > Acked-by: Song Liu 
> > Signed-off-by: Namhyung Kim 
> > ---
> >  include/linux/perf_event.h  |  22 ++
> >  include/uapi/linux/perf_event.h |   2 +
> >  kernel/events/core.c| 480 ++--
> >  3 files changed, 477 insertions(+), 27 deletions(-)
> >
> > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> > index 3f7f89ea5e51..4b03cbadf4a0 100644
> > --- a/include/linux/perf_event.h
> > +++ b/include/linux/perf_event.h
> > @@ -771,6 +771,19 @@ struct perf_event {
> >
> >  #ifdef CONFIG_CGROUP_PERF
> >   struct perf_cgroup  *cgrp; /* cgroup event is attach to */
> > +
> > + /* to share an event for multiple cgroups */
> > + struct hlist_head   *cgrp_node_hash;
> > + struct perf_cgroup_node *cgrp_node_entries;
> > + int nr_cgrp_nodes;
> > + int cgrp_node_hash_bits;
> > +
> > + struct list_headcgrp_node_entry;
>
> Not related to perf_cgroup_node below, afaict the name is just plain
> wrong.

Right, it should be cgrp_event_entry or something, but we
have the notion of "cgroup event" for a different thing.
Maybe cgrp_node_event_entry or cgrp_vec_event_entry
(once we get the vector support)?

>
> > +
> > + /* snapshot of previous reading (for perf_cgroup_node below) */
> > + u64 cgrp_node_count;
> > + u64 cgrp_node_time_enabled;
> > + u64 cgrp_node_time_running;
> >  #endif
> >
> >  #ifdef CONFIG_SECURITY
> > @@ -780,6 +793,13 @@ struct perf_event {
> >  #endif /* CONFIG_PERF_EVENTS */
> >  };
> >
> > +struct perf_cgroup_node {
> > + struct hlist_node   node;
> > + u64 id;
> > + u64 count;
> > + u64 time_enabled;
> > + u64 time_running;
> > +} cacheline_aligned;
> >
> >  struct perf_event_groups {
> >   struct rb_root  tree;
> > @@ -843,6 +863,8 @@ struct perf_event_context {
> >   int pin_count;
> >  #ifdef CONFIG_CGROUP_PERF
> >   int nr_cgroups;  /* cgroup evts */
> > + struct list_headcgrp_node_list;
>
> AFAICT this is actually a list of events, not a list of cgroup_node
> thingies, hence the name is wrong.

Correct, will update.

>
> > + struct list_headcgrp_ctx_entry;
> >  #endif
> >   void*task_ctx_data; /* pmu specific data 
> > */
> >   struct rcu_head rcu_head;
> > diff --git a/include/uapi/linux/perf_event.h 
> > b/include/uapi/linux/perf_event.h
> > index ad15e40d7f5d..06bc7ab13616 100644
> > --- a/include/uapi/linux/perf_event.h
> > +++ b/include/uapi/linux/perf_event.h
> > @@ -479,6 +479,8 @@ struct perf_event_query_bpf {
> >  #define PERF_EVENT_IOC_PAUSE_OUTPUT  _IOW('$', 9, __u32)
> >  #define PERF_EVENT_IOC_QUERY_BPF _IOWR('$', 10, struct 
> > perf_event_query_bpf *)
> >  #define PERF_EVENT_IOC_MODIFY_ATTRIBUTES _IOW('$', 11, struct 
> > perf_event_attr *)
> > +#define PERF_EVENT_IOC_ATTACH_CGROUP _IOW('$', 12, __u64 *)
> > +#define PERF_EVENT_IOC_READ_CGROUP   _IOWR('$', 13, __u64 *)
> >
> >  enum perf_event_ioc_flags {
> >   PERF_IOC_FLAG_GROUP = 1U << 0,
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index f07943183041..bcf51c0b7855 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -379,6 +379,7 @@ enum event_type_t {
> >   * perf_cgroup_events: >0 per-cpu cgroup events exist on this cpu
> >   */
> >
> > +static void perf_sched_enable(void);
> >  static void perf_sched_delayed(struct work_struct *work);
DEFINE_STATIC_KEY_FALSE(perf_sched_events);

Re: [PATCH v3 1/2] perf/core: Share an event with multiple cgroups

2021-04-15 Thread Namhyung Kim
Hi Peter,

Thanks for your review!

On Thu, Apr 15, 2021 at 11:51 PM Peter Zijlstra  wrote:
>
> On Tue, Apr 13, 2021 at 08:53:36AM -0700, Namhyung Kim wrote:
> > As we can run many jobs (in container) on a big machine, we want to
> > measure each job's performance during the run.  To do that, the
> > perf_event can be associated to a cgroup to measure it only.
> >
> > However such cgroup events need to be opened separately and it causes
> > significant overhead in event multiplexing during the context switch
> > as well as resource consumption like in file descriptors and memory
> > footprint.
> >
> > As a cgroup event is basically a cpu event, we can share a single cpu
> > event for multiple cgroups.  All we need is a separate counter (and
> > two timing variables) for each cgroup.  I added a hash table to map
> > from cgroup id to the attached cgroups.
> >
> > With this change, the cpu event needs to calculate a delta of event
> > counter values when the cgroups of current and the next task are
> > different.  And it attributes the delta to the current task's cgroup.
> >
> > This patch adds two new ioctl commands to perf_event for light-weight
>
> git grep "This patch" Documentation/

Ok, will change.

>
> > cgroup event counting (i.e. perf stat).
> >
> >  * PERF_EVENT_IOC_ATTACH_CGROUP - it takes a buffer consists of a
> >  64-bit array to attach given cgroups.  The first element is a
> >  number of cgroups in the buffer, and the rest is a list of cgroup
> >  ids to add a cgroup info to the given event.
>
> WTH is a cgroup-id? The syscall takes a fd to the path, why have two
> different ways?

As you know, we already use cgroup-id for sampling.  Yeah we
can do it with the fd but one of the points of this patch is to reduce
the number of file descriptors. :)

Also, having cgroup-id is good to match with the result (from read)
as it contains the cgroup information.


>
> >  * PERF_EVENT_IOC_READ_CGROUP - it takes a buffer consists of a 64-bit
> >  array to get the event counter values.  The first element is size
> >  of the array in byte, and the second element is a cgroup id to
> >  read.  The rest is to save the counter value and timings.
>
> :-(
>
> So basically you're doing a whole second cgroup interface, one that
> violates the one counter per file premise and lives off of ioctl()s.

Right, but I'm not sure that we really want a separate event for each
cgroup if underlying hardware events are all the same.

>
> *IF* we're going to do something like this, I feel we should explore the
> whole vector-per-fd concept before proceeding. Can we make it less yuck
> (less special ioctl() and more regular file ops). Can we apply the
> concept to more things?

Ideally it'd do without keeping file descriptors open.  Maybe we can make
the vector accept various types like vector-per-cgroup_id or so.

>
> The second patch extends the ioctl() to be more read() like, instead of
> doing the sane things and extending read() by adding PERF_FORMAT_VECTOR
> or whatever. In fact, this whole second ioctl() doesn't make sense to
> have if we do indeed want to do vector-per-fd.

One of the upsides of the ioctl() is that we can pass cgroup-id to read.
Probably we can keep the index in the vector and set the file offset
with it.  Or else just read the whole vector, and then it has a cgroup-id
in the output like PERF_FORMAT_CGROUP?

>
> Also, I suppose you can already fake this, by having a
> SW_CGROUP_SWITCHES (sorry, I though I picked those up, done now) event

Thanks!

> with PERF_SAMPLE_READ|PERF_SAMPLE_CGROUP and PERF_FORMAT_GROUP in a
> group with a bunch of events. Then the buffer will fill with the values
> you use here.

Right, I'll do an experiment with it.

>
> Yes, I suppose it has higher overhead, but you get the data you want
> without having to do terrible things like this.

That's true.  And we don't need many of the things perf record does,
like synthesizing task/mmap info.  Also there's a risk that we could
miss some samples for some reason.

Another concern is that it'd add a huge slowdown to the perf event
open as it creates a mixed sw/hw group.  The synchronize_rcu in
the move_cgroup path caused significant problems in my
environment as it adds up in proportion to the number of cpus.

>
>
>
>
> Lots of random comments below.

Thanks for the review, I'll reply in a separate thread.

Namhyung


Re: [PATCH] libperf: xyarray: Add bounds checks to xyarray__entry()

2021-04-14 Thread Namhyung Kim
On Thu, Apr 15, 2021 at 4:58 AM Rob Herring  wrote:
>
> xyarray__entry() is missing any bounds checking yet often the x and y
> parameters come from external callers. Add bounds checks and an
> unchecked __xyarray__entry().
>
> Cc: Peter Zijlstra 
> Cc: Ingo Molnar 
> Cc: Arnaldo Carvalho de Melo 
> Cc: Mark Rutland 
> Cc: Alexander Shishkin 
> Cc: Jiri Olsa 
> Cc: Namhyung Kim 
> Signed-off-by: Rob Herring 
> ---
>  tools/lib/perf/include/internal/xyarray.h | 9 -
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/tools/lib/perf/include/internal/xyarray.h 
> b/tools/lib/perf/include/internal/xyarray.h
> index 51e35d6c8ec4..f0896c00b494 100644
> --- a/tools/lib/perf/include/internal/xyarray.h
> +++ b/tools/lib/perf/include/internal/xyarray.h
> @@ -18,11 +18,18 @@ struct xyarray *xyarray__new(int xlen, int ylen, size_t 
> entry_size);
>  void xyarray__delete(struct xyarray *xy);
>  void xyarray__reset(struct xyarray *xy);
>
> -static inline void *xyarray__entry(struct xyarray *xy, int x, int y)
> +static inline void *__xyarray__entry(struct xyarray *xy, int x, int y)
>  {
> return &xy->contents[x * xy->row_size + y * xy->entry_size];
>  }
>
> +static inline void *xyarray__entry(struct xyarray *xy, int x, int y)
> +{
> +   if (x >= xy->max_x || y >= xy->max_y)
> +   return NULL;

Maybe better to check negatives as well.
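i.e. something like (sketch):

  static inline void *xyarray__entry(struct xyarray *xy, int x, int y)
  {
          /* reject negative indices as well as overflows */
          if (x < 0 || y < 0 || x >= xy->max_x || y >= xy->max_y)
                  return NULL;
          return __xyarray__entry(xy, x, y);
  }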

Thanks,
Namhyung


> +   return __xyarray__entry(xy, x, y);
> +}
> +
>  static inline int xyarray__max_y(struct xyarray *xy)
>  {
> return xy->max_y;
> --
> 2.27.0
>


Re: [PATCH v8 2/4] libperf: Add evsel mmap support

2021-04-14 Thread Namhyung Kim
On Thu, Apr 15, 2021 at 3:23 AM Arnaldo Carvalho de Melo
 wrote:
>
> Em Wed, Apr 14, 2021 at 03:02:08PM -0300, Arnaldo Carvalho de Melo escreveu:
> > Em Thu, Apr 15, 2021 at 01:41:35AM +0900, Namhyung Kim escreveu:
> > > Hello,
> > >
> > > On Thu, Apr 15, 2021 at 1:07 AM Rob Herring  wrote:
> > > > +void *perf_evsel__mmap_base(struct perf_evsel *evsel, int cpu, int 
> > > > thread)
> > > > +{
> > > > +   if (FD(evsel, cpu, thread) < 0 || MMAP(evsel, cpu, thread) == 
> > > > NULL)
> > > > +   return NULL;
> > >
> > > I think you should check the cpu and the thread is in
> > > a valid range.  Currently xyarray__entry() simply accesses
> > > the content without checking the boundaries.
> >
> > So, since xyarray has the bounds, it should check it, i.e. we need to
> > have a __xyarray__entry() that is what xyarray__entry() does, i.e.
> > assume the values have been bounds checked, then a new method,
> > xyarray__entry() that does bounds check, if it fails, return NULL,
> > otherwise calls __xyarray__entry().
> >
> > I see this is frustrating and I should've chimed in earlier, but at
> > least now this is getting traction, and the end result will be better
> > not just for the feature you've been diligently working on,
> >
> > Thank you for your persistence,
>
> Re-reading, yeah, this can be done in a separate patch, Namhyung, can I
> have your Reviewed-by? That or an Acked-by?

Sure, for the series:

Acked-by: Namhyung Kim 

Thanks,
Namhyung


Re: [PATCH v8 2/4] libperf: Add evsel mmap support

2021-04-14 Thread Namhyung Kim
On Thu, Apr 15, 2021 at 1:53 AM Rob Herring  wrote:
>
> On Wed, Apr 14, 2021 at 11:41 AM Namhyung Kim  wrote:
> >
> > Hello,
> >
> > On Thu, Apr 15, 2021 at 1:07 AM Rob Herring  wrote:
> > > +void *perf_evsel__mmap_base(struct perf_evsel *evsel, int cpu, int 
> > > thread)
> > > +{
> > > +   if (FD(evsel, cpu, thread) < 0 || MMAP(evsel, cpu, thread) == 
> > > NULL)
> > > +   return NULL;
> >
> > I think you should check the cpu and the thread is in
> > a valid range.  Currently xyarray__entry() simply accesses
> > the content without checking the boundaries.
>
> Happy to add a patch to do that if desired, but I think that's
> separate from this series. That would be something to add to
> xyarray__entry().

Sure, we can do that separately.

Thanks,
Namhyung


Re: [PATCH v8 2/4] libperf: Add evsel mmap support

2021-04-14 Thread Namhyung Kim
Hello,

On Thu, Apr 15, 2021 at 1:07 AM Rob Herring  wrote:
> +void *perf_evsel__mmap_base(struct perf_evsel *evsel, int cpu, int thread)
> +{
> +   if (FD(evsel, cpu, thread) < 0 || MMAP(evsel, cpu, thread) == NULL)
> +   return NULL;

I think you should check the cpu and the thread is in
a valid range.  Currently xyarray__entry() simply accesses
the content without checking the boundaries.

Thanks,
Namhyung


> +
> +   return MMAP(evsel, cpu, thread)->base;
> +}
> +
>  int perf_evsel__read_size(struct perf_evsel *evsel)
>  {
> u64 read_format = evsel->attr.read_format;
> diff --git a/tools/lib/perf/include/internal/evsel.h 
> b/tools/lib/perf/include/internal/evsel.h
> index 1ffd083b235e..1c067d088bc6 100644
> --- a/tools/lib/perf/include/internal/evsel.h
> +++ b/tools/lib/perf/include/internal/evsel.h
> @@ -41,6 +41,7 @@ struct perf_evsel {
> struct perf_cpu_map *own_cpus;
> struct perf_thread_map  *threads;
> struct xyarray  *fd;
> +   struct xyarray  *mmap;
> struct xyarray  *sample_id;
> u64 *id;
> u32  ids;
> diff --git a/tools/lib/perf/include/perf/evsel.h 
> b/tools/lib/perf/include/perf/evsel.h
> index c82ec39a4ad0..60eae25076d3 100644
> --- a/tools/lib/perf/include/perf/evsel.h
> +++ b/tools/lib/perf/include/perf/evsel.h
> @@ -27,6 +27,9 @@ LIBPERF_API int perf_evsel__open(struct perf_evsel *evsel, 
> struct perf_cpu_map *
>  struct perf_thread_map *threads);
>  LIBPERF_API void perf_evsel__close(struct perf_evsel *evsel);
>  LIBPERF_API void perf_evsel__close_cpu(struct perf_evsel *evsel, int cpu);
> +LIBPERF_API int perf_evsel__mmap(struct perf_evsel *evsel, int pages);
> +LIBPERF_API void perf_evsel__munmap(struct perf_evsel *evsel);
> +LIBPERF_API void *perf_evsel__mmap_base(struct perf_evsel *evsel, int cpu, 
> int thread);
>  LIBPERF_API int perf_evsel__read(struct perf_evsel *evsel, int cpu, int 
> thread,
>  struct perf_counts_values *count);
>  LIBPERF_API int perf_evsel__enable(struct perf_evsel *evsel);
> diff --git a/tools/lib/perf/libperf.map b/tools/lib/perf/libperf.map
> index 7be1af8a546c..c0c7ceb11060 100644
> --- a/tools/lib/perf/libperf.map
> +++ b/tools/lib/perf/libperf.map
> @@ -23,6 +23,9 @@ LIBPERF_0.0.1 {
> perf_evsel__disable;
> perf_evsel__open;
> perf_evsel__close;
> +   perf_evsel__mmap;
> +   perf_evsel__munmap;
> +   perf_evsel__mmap_base;
> perf_evsel__read;
> perf_evsel__cpus;
> perf_evsel__threads;
> --
> 2.27.0


Re: [PATCH V3 2/2] perf/x86: Reset the dirty counter to prevent the leak for an RDPMC task

2021-04-14 Thread Namhyung Kim
Hi Kan,

On Wed, Apr 14, 2021 at 4:04 AM  wrote:
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index dd9f3c2..0d4a1a3 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -1585,6 +1585,8 @@ static void x86_pmu_del(struct perf_event *event, int 
> flags)
> if (cpuc->txn_flags & PERF_PMU_TXN_ADD)
> goto do_del;
>
> +   __set_bit(event->hw.idx, cpuc->dirty);
> +
> /*
>  * Not a TXN, therefore cleanup properly.
>  */
> @@ -2304,12 +2306,46 @@ static int x86_pmu_event_init(struct perf_event 
> *event)
> return err;
>  }
>
> +void x86_pmu_clear_dirty_counters(void)
> +{
> +   struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> +   int i;
> +
> +   if (bitmap_empty(cpuc->dirty, X86_PMC_IDX_MAX))
> +   return;

Maybe you can check it after clearing assigned counters.
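i.e. (a sketch of the suggested order):

  /* drop the still-assigned counters first, then bail out
   * if nothing dirty is left
   */
  for (i = 0; i < cpuc->n_events; i++)
          __clear_bit(cpuc->assign[i], cpuc->dirty);

  if (bitmap_empty(cpuc->dirty, X86_PMC_IDX_MAX))
          return;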

Thanks,
Namhyung

> +
> +   /* Don't need to clear the assigned counters. */
> +   for (i = 0; i < cpuc->n_events; i++)
> +   __clear_bit(cpuc->assign[i], cpuc->dirty);
> +
> +   for_each_set_bit(i, cpuc->dirty, X86_PMC_IDX_MAX) {
> +   /* Metrics and fake events don't have corresponding HW 
> counters. */
> +   if (is_metric_idx(i) || (i == INTEL_PMC_IDX_FIXED_VLBR))
> +   continue;
> +   else if (i >= INTEL_PMC_IDX_FIXED)
> +   wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR0 + (i - 
> INTEL_PMC_IDX_FIXED), 0);
> +   else
> +   wrmsrl(x86_pmu_event_addr(i), 0);
> +   }
> +
> +   bitmap_zero(cpuc->dirty, X86_PMC_IDX_MAX);
> +}


Re: [PATCH v7] perf annotate: Fix sample events lost in stdio mode

2021-04-14 Thread Namhyung Kim
Hi Arnaldo,

On Wed, Apr 14, 2021 at 9:23 PM Arnaldo Carvalho de Melo
 wrote:
>
> Em Mon, Apr 12, 2021 at 03:22:29PM +0800, Yang Jihong escreveu:
> > On 2021/3/31 10:18, Yang Jihong wrote:
> > > On 2021/3/30 15:26, Namhyung Kim wrote:
> > > > On Sat, Mar 27, 2021 at 11:16 AM Yang Jihong  
> > > > wrote:
> > > > > On 2021/3/26 20:06, Arnaldo Carvalho de Melo wrote:
> > > > > > So it seems to be working, what am I missing? Is this strictly non
> > > > > > group related?
>
> > > > > Yes, it is non group related.
> > > > > This problem occurs only when different events need to be recorded at
> > > > > the same time, i.e.:
> > > > > perf record -e branch-misses -e branch-instructions -a sleep 1
>
> > > > > The output results of perf script and perf annotate do not match.
> > > > > Some events are not output in perf annotate.
>
> > > > Yeah I think it's related to sort keys.  The code works with a single
> > > > hist_entry for each event and symbol.  But the default sort key
> > > > creates multiple entries for different threads and it causes the
> > > > confusion.
>
> > > Yes, after removing zfree from hists__find_annotations, the output of perf
> > > annotate is repeated, which is related to sort keys.
>
> > > The original problem is that notes->src may correspond to multiple
> > > sample events. Therefore, we cannot simply zfree notes->src to avoid
> > > repeated output.
>
> > > Arnaldo, is there any problem with this patch? :)
>
> > PING :)
> > Is there any problem with this patch that needs to be modified?
>
> I continue having a feeling this is kinda a bandaid, i.e. avoid the
> problem, and since we have a way to work this when using a group, I fail
> to see why it couldn't work when not grouping events.

When we use a group, there's a single iteration only for the leader event.
But if not, it'll iterate the hist entries twice (for two events).
Each iteration
used to have multiple entries for the same symbol (due to the sort key),
so it marked the symbol (by freeing notes->src) to skip the same symbol
during the iteration.

However as the first iteration freed sym->notes->src, then the second
(or later) event cannot see the annotation for the deleted symbols
for that event even if it has some samples.

>
> But since I have no time to dive into this and Namhyung is ok with it,
> I'll merge it now.

Thanks,
Namhyung


[PATCH v3 2/2] perf/core: Support reading group events with shared cgroups

2021-04-13 Thread Namhyung Kim
This enables reading event group's counter values together with a
PERF_EVENT_IOC_READ_CGROUP command like we do in the regular read().
Users should pass a correctly sized buffer to be read, which includes
the total buffer size and the cgroup id.

Acked-by: Song Liu 
Signed-off-by: Namhyung Kim 
---
 kernel/events/core.c | 120 +--
 1 file changed, 117 insertions(+), 3 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index bcf51c0b7855..7440857d680e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2232,13 +2232,24 @@ static void perf_add_cgrp_node_list(struct perf_event *event,
 {
struct list_head *cgrp_ctx_list = this_cpu_ptr(&cgroup_ctx_list);
struct perf_cgroup *cgrp = perf_cgroup_from_task(current, ctx);
+   struct perf_event *sibling;
bool is_first;
 
lockdep_assert_irqs_disabled();
lockdep_assert_held(&ctx->lock);
 
+   /* only group leader can be added directly */
+   if (event->group_leader != event)
+   return;
+
+   if (!event_has_cgroup_node(event))
+   return;
+
is_first = list_empty(&ctx->cgrp_node_list);
+
list_add_tail(&event->cgrp_node_entry, &ctx->cgrp_node_list);
+   for_each_sibling_event(sibling, event)
+   list_add_tail(&sibling->cgrp_node_entry, &ctx->cgrp_node_list);
 
if (is_first)
list_add_tail(&ctx->cgrp_ctx_entry, cgrp_ctx_list);
@@ -2250,15 +2261,25 @@ static void perf_del_cgrp_node_list(struct perf_event *event,
struct perf_event_context *ctx)
 {
struct perf_cgroup *cgrp = perf_cgroup_from_task(current, ctx);
+   struct perf_event *sibling;
 
lockdep_assert_irqs_disabled();
lockdep_assert_held(&ctx->lock);
 
+   /* only group leader can be deleted directly */
+   if (event->group_leader != event)
+   return;
+
+   if (!event_has_cgroup_node(event))
+   return;
+
update_cgroup_node(event, cgrp->css.cgroup);
/* to refresh delta when it's enabled */
event->cgrp_node_count = 0;
 
list_del(&event->cgrp_node_entry);
+   for_each_sibling_event(sibling, event)
+   list_del(&sibling->cgrp_node_entry);
 
if (list_empty(&ctx->cgrp_node_list))
list_del(&ctx->cgrp_ctx_entry);
@@ -2333,7 +2354,7 @@ static int perf_event_attach_cgroup_node(struct 
perf_event *event, u64 nr_cgrps,
 
raw_spin_unlock_irqrestore(&ctx->lock, flags);
 
-   if (is_first && enabled)
+   if (is_first && enabled && event->group_leader == event)
event_function_call(event, perf_attach_cgroup_node, NULL);
 
return 0;
@@ -2370,8 +2391,8 @@ static void __perf_read_cgroup_node(struct perf_event 
*event)
}
 }
 
-static int perf_event_read_cgroup_node(struct perf_event *event, u64 read_size,
-  u64 cgrp_id, char __user *buf)
+static int perf_event_read_cgrp_node_one(struct perf_event *event, u64 cgrp_id,
+char __user *buf)
 {
struct perf_cgroup_node *cgrp;
struct perf_event_context *ctx = event->ctx;
@@ -2406,6 +2427,92 @@ static int perf_event_read_cgroup_node(struct perf_event 
*event, u64 read_size,
 
return n * sizeof(u64);
 }
+
+static int perf_event_read_cgrp_node_sibling(struct perf_event *event,
+u64 read_format, u64 cgrp_id,
+u64 *values)
+{
+   struct perf_cgroup_node *cgrp;
+   int n = 0;
+
+   cgrp = find_cgroup_node(event, cgrp_id);
+   if (cgrp == NULL)
+   return (read_format & PERF_FORMAT_ID) ? 2 : 1;
+
+   values[n++] = cgrp->count;
+   if (read_format & PERF_FORMAT_ID)
+   values[n++] = primary_event_id(event);
+   return n;
+}
+
+static int perf_event_read_cgrp_node_group(struct perf_event *event, u64 
cgrp_id,
+  char __user *buf)
+{
+   struct perf_cgroup_node *cgrp;
+   struct perf_event_context *ctx = event->ctx;
+   struct perf_event *sibling;
+   u64 read_format = event->attr.read_format;
+   unsigned long flags;
+   u64 *values;
+   int n = 1;
+   int ret;
+
+   values = kzalloc(event->read_size, GFP_KERNEL);
+   if (!values)
+   return -ENOMEM;
+
+   values[0] = 1 + event->nr_siblings;
+
+   /* update event count and times (possibly run on other cpu) */
+   (void)perf_event_read(event, true);
+
+   raw_spin_lock_irqsave(&ctx->lock, flags);
+
+   cgrp = find_cgroup_node(event, cgrp_id);
+   if (cgrp == NULL) {
+   raw_spin_unlock_irqrestore(&ctx->lock, flags);
+   kfree(values);
+   return -ENOENT;
+   }
+
+   if (read_format & PERF_FORMAT_TOTAL_TIME_E

[PATCH v3 0/2] perf core: Sharing events with multiple cgroups

2021-04-13 Thread Namhyung Kim
Hello,

This work is to make perf stat more scalable with a lot of cgroups.

Changes in v3)
 * fix build error when !CONFIG_CGROUP_PERF

Changes in v2)
 * use cacheline_aligned macro instead of the padding
 * enclose the cgroup node list initialization
 * add more comments
 * add Acked-by from Song Liu


Currently we need to open a separate perf_event to count an event in a
cgroup.  For a big machine, this requires lots of events like

  256 cpu x 8 events x 200 cgroups = 409600 events

This is very wasteful and not scalable.  In this case, perf stat
actually counts exactly the same events for each cgroup.  I think we can
just use a single event to measure all cgroups running on that cpu.

So I added new ioctl commands to add per-cgroup counters to an
existing perf_event and to read the per-cgroup counters from the
event.  The per-cgroup counters are updated during the context switch
if tasks' cgroups are different (and no need to change the HW PMU).
It keeps the counters in a hash table with cgroup id as a key.
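For illustration, a user-space consumer of this interface could look
roughly like the sketch below.  The ioctl name and buffer layout follow
the descriptions in the patches; the event fd and the cgroup ids are
assumed to be set up elsewhere, and the fixed buffer size is arbitrary.

  /* sketch only, not part of the patches */
  #include <stdint.h>
  #include <string.h>
  #include <sys/ioctl.h>
  #include <linux/perf_event.h>

  #ifndef PERF_EVENT_IOC_ATTACH_CGROUP
  #define PERF_EVENT_IOC_ATTACH_CGROUP _IOW('$', 12, __u64 *)
  #endif

  static int attach_cgroups(int perf_fd, const uint64_t *ids, uint64_t nr)
  {
          uint64_t buf[1 + 64];

          if (nr > 64)
                  return -1;
          buf[0] = nr;                             /* number of cgroup ids */
          memcpy(&buf[1], ids, nr * sizeof(*ids)); /* the ids to attach */
          return ioctl(perf_fd, PERF_EVENT_IOC_ATTACH_CGROUP, buf);
  }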

With this change, average processing time of my internal test workload
which runs tasks in a different cgroup and communicates by pipes
dropped from 11.3 usec to 5.8 usec.

Thanks,
Namhyung


Namhyung Kim (2):
  perf/core: Share an event with multiple cgroups
  perf/core: Support reading group events with shared cgroups

 include/linux/perf_event.h  |  22 ++
 include/uapi/linux/perf_event.h |   2 +
 kernel/events/core.c| 591 ++--
 3 files changed, 588 insertions(+), 27 deletions(-)

-- 
2.31.1.295.g9ea45b61b8-goog



Namhyung Kim (2):
  perf/core: Share an event with multiple cgroups
  perf/core: Support reading group events with shared cgroups

 include/linux/perf_event.h  |  22 ++
 include/uapi/linux/perf_event.h |   2 +
 kernel/events/core.c| 594 ++--
 3 files changed, 591 insertions(+), 27 deletions(-)


base-commit: cface0326a6c2ae5c8f47bd466f07624b3e348a7
-- 
2.31.1.295.g9ea45b61b8-goog



[PATCH v3 1/2] perf/core: Share an event with multiple cgroups

2021-04-13 Thread Namhyung Kim
As we can run many jobs (in container) on a big machine, we want to
measure each job's performance during the run.  To do that, the
perf_event can be associated with a cgroup to measure only that cgroup.

However such cgroup events need to be opened separately and it causes
significant overhead in event multiplexing during the context switch
as well as resource consumption like in file descriptors and memory
footprint.

As a cgroup event is basically a cpu event, we can share a single cpu
event for multiple cgroups.  All we need is a separate counter (and
two timing variables) for each cgroup.  I added a hash table to map
from cgroup id to the attached cgroups.

With this change, the cpu event needs to calculate a delta of event
counter values when the cgroups of current and the next task are
different.  And it attributes the delta to the current task's cgroup.
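In pseudo-code, the switch-time accounting amounts to the sketch below.
The names come from the diff that follows, but this is a simplification
rather than the literal patch code, and prev_cgrp_id stands in for the
outgoing task's cgroup id:

  /* on context switch, when prev and next run in different cgroups */
  u64 delta = local64_read(&event->count) - event->cgrp_node_count;
  struct perf_cgroup_node *node = find_cgroup_node(event, prev_cgrp_id);

  if (node)
          node->count += delta;            /* charge the outgoing cgroup */
  event->cgrp_node_count += delta;         /* refresh the snapshot */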

This patch adds two new ioctl commands to perf_event for light-weight
cgroup event counting (i.e. perf stat).

 * PERF_EVENT_IOC_ATTACH_CGROUP - it takes a buffer consisting of a
 64-bit array to attach given cgroups.  The first element is the
 number of cgroups in the buffer, and the rest is a list of cgroup
 ids whose info should be added to the given event.

 * PERF_EVENT_IOC_READ_CGROUP - it takes a buffer consisting of a 64-bit
 array to get the event counter values.  The first element is the size
 of the array in bytes, and the second element is the cgroup id to
 read.  The rest is to save the counter value and timings.

This attaches all cgroups in a single syscall and I deliberately didn't
add a DETACH command, to keep the implementation simple.  The
attached cgroup nodes would be deleted when the file descriptor of the
perf_event is closed.
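Reading one cgroup's counter back would then look like this (again a
sketch: the ioctl and the two-element header are from the description
above, error handling and read_format specifics are elided):

  uint64_t buf[8] = { 0 };

  buf[0] = sizeof(buf);   /* total buffer size in bytes */
  buf[1] = cgrp_id;       /* which cgroup to read */
  if (ioctl(perf_fd, PERF_EVENT_IOC_READ_CGROUP, buf) < 0)
          return -1;
  /* buf[2..] now holds the counter value and timings, per read_format */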

Cc: Tejun Heo 
Reported-by: kernel test robot 
Acked-by: Song Liu 
Signed-off-by: Namhyung Kim 
---
 include/linux/perf_event.h  |  22 ++
 include/uapi/linux/perf_event.h |   2 +
 kernel/events/core.c| 480 ++--
 3 files changed, 477 insertions(+), 27 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 3f7f89ea5e51..4b03cbadf4a0 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -771,6 +771,19 @@ struct perf_event {
 
 #ifdef CONFIG_CGROUP_PERF
struct perf_cgroup  *cgrp; /* cgroup event is attach to */
+
+   /* to share an event for multiple cgroups */
+   struct hlist_head   *cgrp_node_hash;
+   struct perf_cgroup_node *cgrp_node_entries;
+   int nr_cgrp_nodes;
+   int cgrp_node_hash_bits;
+
+   struct list_head        cgrp_node_entry;
+
+   /* snapshot of previous reading (for perf_cgroup_node below) */
+   u64 cgrp_node_count;
+   u64 cgrp_node_time_enabled;
+   u64 cgrp_node_time_running;
 #endif
 
 #ifdef CONFIG_SECURITY
@@ -780,6 +793,13 @@ struct perf_event {
 #endif /* CONFIG_PERF_EVENTS */
 };
 
+struct perf_cgroup_node {
+   struct hlist_node   node;
+   u64 id;
+   u64 count;
+   u64 time_enabled;
+   u64 time_running;
+} ____cacheline_aligned;
 
 struct perf_event_groups {
struct rb_root  tree;
@@ -843,6 +863,8 @@ struct perf_event_context {
int pin_count;
 #ifdef CONFIG_CGROUP_PERF
int nr_cgroups;  /* cgroup evts */
+   struct list_head        cgrp_node_list;
+   struct list_head        cgrp_ctx_entry;
 #endif
void*task_ctx_data; /* pmu specific data */
struct rcu_head rcu_head;
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index ad15e40d7f5d..06bc7ab13616 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -479,6 +479,8 @@ struct perf_event_query_bpf {
 #define PERF_EVENT_IOC_PAUSE_OUTPUT_IOW('$', 9, __u32)
 #define PERF_EVENT_IOC_QUERY_BPF   _IOWR('$', 10, struct 
perf_event_query_bpf *)
 #define PERF_EVENT_IOC_MODIFY_ATTRIBUTES   _IOW('$', 11, struct 
perf_event_attr *)
+#define PERF_EVENT_IOC_ATTACH_CGROUP   _IOW('$', 12, __u64 *)
+#define PERF_EVENT_IOC_READ_CGROUP _IOWR('$', 13, __u64 *)
 
 enum perf_event_ioc_flags {
PERF_IOC_FLAG_GROUP = 1U << 0,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f07943183041..bcf51c0b7855 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -379,6 +379,7 @@ enum event_type_t {
  * perf_cgroup_events: >0 per-cpu cgroup events exist on this cpu
  */
 
+static void perf_sched_enable(void);
 static void perf_sched_delayed(struct work_str

[PATCH v2 2/2] perf/core: Support reading group events with shared cgroups

2021-04-12 Thread Namhyung Kim
This enables reading event group's counter values together with a
PERF_EVENT_IOC_READ_CGROUP command like we do in the regular read().
Users should give a correct size of buffer to be read which includes
the total buffer size and the cgroup id.

Acked-by: Song Liu 
Signed-off-by: Namhyung Kim 
---
 kernel/events/core.c | 120 +--
 1 file changed, 117 insertions(+), 3 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 0c6b3848a61f..d483b4b42fe2 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2232,13 +2232,24 @@ static void perf_add_cgrp_node_list(struct perf_event 
*event,
 {
struct list_head *cgrp_ctx_list = this_cpu_ptr(&cgroup_ctx_list);
struct perf_cgroup *cgrp = perf_cgroup_from_task(current, ctx);
+   struct perf_event *sibling;
bool is_first;
 
lockdep_assert_irqs_disabled();
lockdep_assert_held(&ctx->lock);
 
+   /* only group leader can be added directly */
+   if (event->group_leader != event)
+   return;
+
+   if (!event_has_cgroup_node(event))
+   return;
+
is_first = list_empty(&ctx->cgrp_node_list);
+
list_add_tail(&event->cgrp_node_entry, &ctx->cgrp_node_list);
+   for_each_sibling_event(sibling, event)
+   list_add_tail(&sibling->cgrp_node_entry, &ctx->cgrp_node_list);

if (is_first)
list_add_tail(&ctx->cgrp_ctx_entry, cgrp_ctx_list);
@@ -2250,15 +2261,25 @@ static void perf_del_cgrp_node_list(struct perf_event 
*event,
struct perf_event_context *ctx)
 {
struct perf_cgroup *cgrp = perf_cgroup_from_task(current, ctx);
+   struct perf_event *sibling;
 
lockdep_assert_irqs_disabled();
lockdep_assert_held(&ctx->lock);
 
+   /* only group leader can be deleted directly */
+   if (event->group_leader != event)
+   return;
+
+   if (!event_has_cgroup_node(event))
+   return;
+
update_cgroup_node(event, cgrp->css.cgroup);
/* to refresh delta when it's enabled */
event->cgrp_node_count = 0;
 
list_del(&event->cgrp_node_entry);
+   for_each_sibling_event(sibling, event)
+   list_del(&sibling->cgrp_node_entry);

if (list_empty(&ctx->cgrp_node_list))
list_del(&ctx->cgrp_ctx_entry);
@@ -2333,7 +2354,7 @@ static int perf_event_attach_cgroup_node(struct 
perf_event *event, u64 nr_cgrps,
 
raw_spin_unlock_irqrestore(&ctx->lock, flags);
 
-   if (is_first && enabled)
+   if (is_first && enabled && event->group_leader == event)
event_function_call(event, perf_attach_cgroup_node, NULL);
 
return 0;
@@ -2370,8 +2391,8 @@ static void __perf_read_cgroup_node(struct perf_event 
*event)
}
 }
 
-static int perf_event_read_cgroup_node(struct perf_event *event, u64 read_size,
-  u64 cgrp_id, char __user *buf)
+static int perf_event_read_cgrp_node_one(struct perf_event *event, u64 cgrp_id,
+char __user *buf)
 {
struct perf_cgroup_node *cgrp;
struct perf_event_context *ctx = event->ctx;
@@ -2406,6 +2427,92 @@ static int perf_event_read_cgroup_node(struct perf_event 
*event, u64 read_size,
 
return n * sizeof(u64);
 }
+
+static int perf_event_read_cgrp_node_sibling(struct perf_event *event,
+u64 read_format, u64 cgrp_id,
+u64 *values)
+{
+   struct perf_cgroup_node *cgrp;
+   int n = 0;
+
+   cgrp = find_cgroup_node(event, cgrp_id);
+   if (cgrp == NULL)
+   return (read_format & PERF_FORMAT_ID) ? 2 : 1;
+
+   values[n++] = cgrp->count;
+   if (read_format & PERF_FORMAT_ID)
+   values[n++] = primary_event_id(event);
+   return n;
+}
+
+static int perf_event_read_cgrp_node_group(struct perf_event *event, u64 
cgrp_id,
+  char __user *buf)
+{
+   struct perf_cgroup_node *cgrp;
+   struct perf_event_context *ctx = event->ctx;
+   struct perf_event *sibling;
+   u64 read_format = event->attr.read_format;
+   unsigned long flags;
+   u64 *values;
+   int n = 1;
+   int ret;
+
+   values = kzalloc(event->read_size, GFP_KERNEL);
+   if (!values)
+   return -ENOMEM;
+
+   values[0] = 1 + event->nr_siblings;
+
+   /* update event count and times (possibly run on other cpu) */
+   (void)perf_event_read(event, true);
+
+   raw_spin_lock_irqsave(&ctx->lock, flags);
+
+   cgrp = find_cgroup_node(event, cgrp_id);
+   if (cgrp == NULL) {
+   raw_spin_unlock_irqrestore(&ctx->lock, flags);
+   kfree(values);
+   return -ENOENT;
+   }
+
+   if (read_format & PERF_FORMAT_TOTAL_TIME_E

[PATCH v2 0/2] perf core: Sharing events with multiple cgroups

2021-04-12 Thread Namhyung Kim
Hello,

This work is to make perf stat more scalable with a lot of cgroups.

Changes in v2)
 * use cacheline_aligned macro instead of the padding
 * enclose the cgroup node list initialization
 * add more comments
 * add Acked-by from Song Liu


Currently we need to open a separate perf_event to count an event in a
cgroup.  For a big machine, this requires lots of events like

  256 cpu x 8 events x 200 cgroups = 409600 events

This is very wasteful and not scalable.  In this case, perf stat
actually counts exactly the same events for each cgroup.  I think we can
just use a single event to measure all cgroups running on that cpu.

So I added new ioctl commands to add per-cgroup counters to an
existing perf_event and to read the per-cgroup counters from the
event.  The per-cgroup counters are updated during the context switch
if tasks' cgroups are different (and no need to change the HW PMU).
It keeps the counters in a hash table with cgroup id as a key.

With this change, average processing time of my internal test workload
which runs tasks in a different cgroup and communicates by pipes
dropped from 11.3 usec to 5.8 usec.

Thanks,
Namhyung


Namhyung Kim (2):
  perf/core: Share an event with multiple cgroups
  perf/core: Support reading group events with shared cgroups

 include/linux/perf_event.h  |  22 ++
 include/uapi/linux/perf_event.h |   2 +
 kernel/events/core.c| 591 ++--
 3 files changed, 588 insertions(+), 27 deletions(-)

-- 
2.31.1.295.g9ea45b61b8-goog



[PATCH v2 1/2] perf/core: Share an event with multiple cgroups

2021-04-12 Thread Namhyung Kim
As we can run many jobs (in container) on a big machine, we want to
measure each job's performance during the run.  To do that, the
perf_event can be associated with a cgroup to measure only that cgroup.

However such cgroup events need to be opened separately and it causes
significant overhead in event multiplexing during the context switch
as well as resource consumption like in file descriptors and memory
footprint.

As a cgroup event is basically a cpu event, we can share a single cpu
event for multiple cgroups.  All we need is a separate counter (and
two timing variables) for each cgroup.  I added a hash table to map
from cgroup id to the attached cgroups.

With this change, the cpu event needs to calculate a delta of event
counter values when the cgroups of current and the next task are
different.  And it attributes the delta to the current task's cgroup.

This patch adds two new ioctl commands to perf_event for light-weight
cgroup event counting (i.e. perf stat).

 * PERF_EVENT_IOC_ATTACH_CGROUP - it takes a buffer consisting of a
 64-bit array to attach given cgroups.  The first element is the
 number of cgroups in the buffer, and the rest is a list of cgroup
 ids whose info should be added to the given event.

 * PERF_EVENT_IOC_READ_CGROUP - it takes a buffer consisting of a 64-bit
 array to get the event counter values.  The first element is the size
 of the array in bytes, and the second element is the cgroup id to
 read.  The rest is to save the counter value and timings.

This attaches all cgroups in a single syscall and I deliberately didn't
add a DETACH command, to keep the implementation simple.  The
attached cgroup nodes would be deleted when the file descriptor of the
perf_event is closed.

Cc: Tejun Heo 
Acked-by: Song Liu 
Signed-off-by: Namhyung Kim 
---
 include/linux/perf_event.h  |  22 ++
 include/uapi/linux/perf_event.h |   2 +
 kernel/events/core.c| 477 ++--
 3 files changed, 474 insertions(+), 27 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 3f7f89ea5e51..4b03cbadf4a0 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -771,6 +771,19 @@ struct perf_event {
 
 #ifdef CONFIG_CGROUP_PERF
struct perf_cgroup  *cgrp; /* cgroup event is attach to */
+
+   /* to share an event for multiple cgroups */
+   struct hlist_head   *cgrp_node_hash;
+   struct perf_cgroup_node *cgrp_node_entries;
+   int nr_cgrp_nodes;
+   int cgrp_node_hash_bits;
+
+   struct list_head        cgrp_node_entry;
+
+   /* snapshot of previous reading (for perf_cgroup_node below) */
+   u64 cgrp_node_count;
+   u64 cgrp_node_time_enabled;
+   u64 cgrp_node_time_running;
 #endif
 
 #ifdef CONFIG_SECURITY
@@ -780,6 +793,13 @@ struct perf_event {
 #endif /* CONFIG_PERF_EVENTS */
 };
 
+struct perf_cgroup_node {
+   struct hlist_node   node;
+   u64 id;
+   u64 count;
+   u64 time_enabled;
+   u64 time_running;
+} ____cacheline_aligned;
 
 struct perf_event_groups {
struct rb_root  tree;
@@ -843,6 +863,8 @@ struct perf_event_context {
int pin_count;
 #ifdef CONFIG_CGROUP_PERF
int nr_cgroups;  /* cgroup evts */
+   struct list_head        cgrp_node_list;
+   struct list_head        cgrp_ctx_entry;
 #endif
void*task_ctx_data; /* pmu specific data */
struct rcu_head rcu_head;
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index ad15e40d7f5d..06bc7ab13616 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -479,6 +479,8 @@ struct perf_event_query_bpf {
 #define PERF_EVENT_IOC_PAUSE_OUTPUT_IOW('$', 9, __u32)
 #define PERF_EVENT_IOC_QUERY_BPF   _IOWR('$', 10, struct 
perf_event_query_bpf *)
 #define PERF_EVENT_IOC_MODIFY_ATTRIBUTES   _IOW('$', 11, struct 
perf_event_attr *)
+#define PERF_EVENT_IOC_ATTACH_CGROUP   _IOW('$', 12, __u64 *)
+#define PERF_EVENT_IOC_READ_CGROUP _IOWR('$', 13, __u64 *)
 
 enum perf_event_ioc_flags {
PERF_IOC_FLAG_GROUP = 1U << 0,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f07943183041..0c6b3848a61f 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -379,6 +379,7 @@ enum event_type_t {
  * perf_cgroup_events: >0 per-cpu cgroup events exist on this cpu
  */
 
+static void perf_sched_enable(void);
 static void perf_sched_delayed(struct work_struct *work);
 DEFINE_STATIC_KEY_FALSE(perf_sch

Re: [PATCH v4 08/12] perf record: introduce --threads= command line option

2021-04-12 Thread Namhyung Kim
Hello,

On Tue, Apr 6, 2021 at 5:49 PM Bayduraev, Alexey V
 wrote:
>
>
> Provide --threads option in perf record command line interface.
> The option can have a value in the form of masks that specify
> cpus to be monitored with data streaming threads and its layout
> in system topology. The masks can be filtered using cpu mask
> provided via -C option.
>
> The specification value can be user defined list of masks. Masks
> separated by colon define cpus to be monitored by one thread and
> affinity mask of that thread is separated by slash. For example:
> <cpus mask 1>/<affinity mask 1>:<cpus mask 2>/<affinity mask 2>
> specifies parallel threads layout that consists of two threads
> with corresponding assigned cpus to be monitored.
>
> The specification value can be a string e.g. "cpu", "core" or
> "socket" meaning creation of data streaming thread for every
> cpu or core or socket to monitor distinct cpus or cpus grouped
> by core or socket.
>
> The option provided with no or empty value defaults to per-cpu
> parallel threads layout creating data streaming thread for every
> cpu being monitored.
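
(For a concrete reading of the spec format above: an illustrative value
like 0-1/0:2-3/2 would request two data streaming threads, the first
monitoring cpus 0-1 with thread affinity to cpu 0 and the second
monitoring cpus 2-3 with affinity to cpu 2.)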
>
> Feature design and implementation are based on prototypes [1], [2].
>
> [1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git 
> -b perf/record_threads
> [2] https://lore.kernel.org/lkml/20180913125450.21342-1-jo...@kernel.org/
>
> Suggested-by: Jiri Olsa 
> Suggested-by: Namhyung Kim 
> Signed-off-by: Alexey Bayduraev 
> ---
[SNIP]
> +static int record__init_thread_masks_spec(struct record *rec, struct 
> perf_cpu_map *cpus,
> + char **maps_spec, char 
> **affinity_spec, u32 nr_spec)
> +{
> +   u32 s;
> +   int ret, nr_threads = 0;
> +   struct mmap_cpu_mask cpus_mask;
> +   struct thread_mask thread_mask, full_mask;
> +
> +   ret = record__mmap_cpu_mask_alloc(&cpus_mask, cpu__max_cpu());
> +   if (ret)
> +   return ret;
> +   record__mmap_cpu_mask_init(&cpus_mask, cpus);
> +   ret = record__thread_mask_alloc(&thread_mask, cpu__max_cpu());
> +   if (ret)
> +   goto out_free_cpu_mask;
> +   ret = record__thread_mask_alloc(&full_mask, cpu__max_cpu());
> +   if (ret)
> +   goto out_free_thread_mask;
> +   record__thread_mask_clear(&full_mask);
> +
> +   for (s = 0; s < nr_spec; s++) {
> +   record__thread_mask_clear(&thread_mask);
> +
> +   record__mmap_cpu_mask_init_spec(&thread_mask.maps, maps_spec[s]);
> +   record__mmap_cpu_mask_init_spec(&thread_mask.affinity, affinity_spec[s]);
> +
> +   if (!bitmap_and(thread_mask.maps.bits, thread_mask.maps.bits,
> +   cpus_mask.bits, thread_mask.maps.nbits) ||
> +   !bitmap_and(thread_mask.affinity.bits, 
> thread_mask.affinity.bits,
> +   cpus_mask.bits, thread_mask.affinity.nbits))
> +   continue;
> +
> +   ret = record__thread_mask_intersects(&thread_mask, &full_mask);
> +   if (ret)
> +   return ret;

I think you should free other masks.

> +   record__thread_mask_or(&full_mask, &full_mask, &thread_mask);
> +
> +   rec->thread_masks = realloc(rec->thread_masks,
> +   (nr_threads + 1) * sizeof(struct 
> thread_mask));
> +   if (!rec->thread_masks) {
> +   pr_err("Failed to allocate thread masks\n");
> +   ret = -ENOMEM;
> +   goto out_free_full_mask;

But this will leak rec->thread_masks as it's overwritten.


> +   }
> +   rec->thread_masks[nr_threads] = thread_mask;
> +   pr_debug("thread_masks[%d]: addr=", nr_threads);
> +   mmap_cpu_mask__scnprintf(&rec->thread_masks[nr_threads].maps, "maps");
> +   pr_debug("thread_masks[%d]: addr=", nr_threads);
> +   mmap_cpu_mask__scnprintf(&rec->thread_masks[nr_threads].affinity, "affinity");
> +   nr_threads++;
> +   ret = record__thread_mask_alloc(&thread_mask, cpu__max_cpu());
> +   if (ret)
> +   return ret;

Ditto, use goto.

> +   }
> +
> +   rec->nr_threads = nr_threads;
> +   pr_debug("threads: nr_threads=%d\n", rec->nr_threads);
> +
> +out_free_full_mask:
> +   record__thread_mask_free(&full_mask);
> +out_free_thread_mask:
> +   record__thread_mask_free(&thread_mask);
> +out_free_cpu_mask:
> +   record__mmap_cpu_mask_free(&cpus_mask);
> +
> +   return 0;
> +}

[SNIP]
> +
> +static int 

[PATCH] perf record: Disallow -c and -F option at the same time

2021-04-02 Thread Namhyung Kim
It's confusing which one is effective when both options are given.
The current code happens to use -c in this case but users might not be
aware of it.  We can change it to complain about that instead of
relying on the implicit priority.

Before:
  $ perf record -c 11 -F 99 true
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.031 MB perf.data (8 samples) ]

  $ perf evlist -F
  cycles: sample_period=11

After:
  $ perf record -c 11 -F 99 true
  cannot set frequency and period at the same time

So this change can break existing usages, but I think it's rare to
have both options at the same time, and it'd be better to change them.

Suggested-by: Alexey Alexandrov 
Signed-off-by: Namhyung Kim 
---
 tools/perf/util/record.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/record.c b/tools/perf/util/record.c
index f99852d54b14..43e5b563dee8 100644
--- a/tools/perf/util/record.c
+++ b/tools/perf/util/record.c
@@ -157,9 +157,15 @@ static int get_max_rate(unsigned int *rate)
 static int record_opts__config_freq(struct record_opts *opts)
 {
bool user_freq = opts->user_freq != UINT_MAX;
+   bool user_interval = opts->user_interval != ULLONG_MAX;
unsigned int max_rate;
 
-   if (opts->user_interval != ULLONG_MAX)
+   if (user_interval && user_freq) {
+   pr_err("cannot set frequency and period at the same time\n");
+   return -1;
+   }
+
+   if (user_interval)
opts->default_interval = opts->user_interval;
if (user_freq)
opts->freq = opts->user_freq;
-- 
2.31.0.208.g409f899ff0-goog



Re: [PATCH] tools: perf: util: Remove duplicate struct declaration

2021-04-01 Thread Namhyung Kim
Hello,

On Thu, Apr 1, 2021 at 3:25 PM Wan Jiabing  wrote:
>
> struct target is declared twice. One has been declared
> at 21st line. Remove the duplicate.
>
> Signed-off-by: Wan Jiabing 

Acked-by: Namhyung Kim 

I think we can move all the forward declarations to the top
(and sort them) as well.
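i.e. the top of evsel.h would then carry something like:

  struct perf_cpu_map;
  struct record_opts;
  struct target;
  struct thread_map;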

Thanks,
Namhyung


> ---
>  tools/perf/util/evsel.h | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
> index 6026487353dd..998e5b806696 100644
> --- a/tools/perf/util/evsel.h
> +++ b/tools/perf/util/evsel.h
> @@ -157,7 +157,6 @@ struct perf_missing_features {
>  extern struct perf_missing_features perf_missing_features;
>
>  struct perf_cpu_map;
> -struct target;
>  struct thread_map;
>  struct record_opts;
>
> --
> 2.25.1
>


Re: [PATCH 1/2] perf/core: Share an event with multiple cgroups

2021-03-30 Thread Namhyung Kim
On Tue, Mar 30, 2021 at 3:33 PM Song Liu  wrote:
> > On Mar 29, 2021, at 4:33 AM, Namhyung Kim  wrote:
> >
> > On Mon, Mar 29, 2021 at 2:17 AM Song Liu  wrote:
> >>> On Mar 23, 2021, at 9:21 AM, Namhyung Kim  wrote:
> >>>
> >>> As we can run many jobs (in container) on a big machine, we want to
> >>> measure each job's performance during the run.  To do that, the
> >>> perf_event can be associated to a cgroup to measure it only.
> >>>
>
> [...]
>
> >>> + return 0;
> >>> +}
> >>
> >> Could you please explain why we need this logic in can_attach?
> >
> > IIUC the ss->attach() is called after a task's cgroup membership
> > is changed.  But we want to collect the performance numbers for
> > the old cgroup just before the change.  As the logic merely checks
> > the current task's cgroup, it should be done in the can_attach()
> > which is called before the cgroup change.
>
> Thanks for the explanations.
>
> Overall, I really like the core idea, especially that the overhead on
> context switch is bounded (by the depth of cgroup tree).

Thanks!

>
> Is it possible to make PERF_EVENT_IOC_ATTACH_CGROUP more flexible?
> Specifically, if we can have
>
>   PERF_EVENT_IOC_ADD_CGROUP add a cgroup to the list
>   PERF_EVENT_IOC_EL_CGROUP  delete a cgroup from the list
>
> we can probably share these events among multiple processes, and
> these processes don't need to know others' cgroup list. I think
> this will be useful for users to build customized monitoring in
> its own container.
>
> Does this make sense?

Maybe we can add an ADD/DEL interface for more flexible monitoring,
but I'm not sure in which use cases it'd actually be used.

For your multi-process sharing case, the original events' file
descriptors should be shared first.  Also adding and deleting
(or just reading) arbitrary cgroups from a container can be a
security concern IMHO.

So I just focused on the single-process multi-cgroup case which is
already used (perf stat --for-each-cgroup) and very important in my
company's setup.  In this case we have a list of interested cgroups
from the beginning so it's more efficient to create a properly sized
hash table and all the nodes at once.

Thanks,
Namhyung


Re: [PATCH v5 0/4] perf stat: Introduce iostat mode to provide I/O performance metrics

2021-03-30 Thread Namhyung Kim
Hello,

On Wed, Mar 24, 2021 at 11:30 PM Alexander Antonov
 wrote:
>
> The previous version can be found at:
> v4: 
> https://lkml.kernel.org/r/20210203135830.38568-1-alexander.anto...@linux.intel.com/
> Changes in this revision are:
> v4 -> v5:
> - Addressed comments from Namhyung Kim:
>   1. Removed AGGR_PCIE_PORT aggregation mode
>   2. Added iostat_prepare() function
>   3. Moved implementation specific fprintf() calls to separate x86-related 
> function
>   4. Fixed code-related issues
> - Moved __weak iostat's functions to separate util/iostat.c file
>
> The previous version can be found at:
> v3: 
> https://lkml.kernel.org/r/20210126080619.30275-1-alexander.anto...@linux.intel.com/
> Changes in this revision are:
> v3 -> v4:
> - Addressed comment from Namhyung Kim:
>   1. Removed NULL-termination of root ports list
>
> The previous version can be found at:
> v2: 
> https://lkml.kernel.org/r/20201223130320.3930-1-alexander.anto...@linux.intel.com
>
> Changes in this revision are:
> v2 -> v3:
> - Addressed comments from Namhyung Kim:
>   1. Removed perf_device pointer from evsel structure. Use priv field instead
>   2. Renamed 'iiostat' to 'iostat'
>   3. Renamed 'show' mode to 'list' mode
>   4. Renamed iiostat_delete_root_ports() to iiostat_release() and
>  iostat_show_root_ports() to iostat_list()
>
> The previous version can be found at:
> v1: 
> https://lkml.kernel.org/r/20201210090340.14358-1-alexander.anto...@linux.intel.com
>
> Changes in this revision are:
> v1 -> v2:
> - Addressed comment from Arnaldo Carvalho de Melo:
>   1. Using 'perf iiostat' subcommand instead of 'perf stat --iiostat':
> - Added perf-iiostat.sh script to use short command
> - Updated manual pages to get help for 'perf iiostat'
> - Added 'perf-iiostat' to perf's gitignore file
>
> Mode is intended to provide four I/O performance metrics in MB per each
> root port:
>  - Inbound Read:   I/O devices below root port read from the host memory
>  - Inbound Write:  I/O devices below root port write to the host memory
>  - Outbound Read:  CPU reads from I/O devices below root port
>  - Outbound Write: CPU writes to I/O devices below root port
>
> Each metric requiries only one uncore event which increments at every 4B
> transfer in corresponding direction. The formulas to compute metrics
> are generic:
> #EventCount * 4B / (1024 * 1024)
>
> Note: iostat introduces new perf data aggregation mode - per PCIe root port
> hence -e and -M options are not supported.
>
> Usage examples:
>
> 1. List all PCIe root ports (example for 2-S platform):
>$ perf iostat list
>S0-uncore_iio_0<0000:00>
>S1-uncore_iio_0<0000:80>
>S0-uncore_iio_1<0000:17>
>S1-uncore_iio_1<0000:85>
>S0-uncore_iio_2<0000:3a>
>S1-uncore_iio_2<0000:ae>
>S0-uncore_iio_3<0000:5d>
>S1-uncore_iio_3<0000:d7>
>
> 2. Collect metrics for all PCIe root ports:
>$ perf iostat -- dd if=/dev/zero of=/dev/nvme0n1 bs=1M oflag=direct
>357708+0 records in
>357707+0 records out
>375083606016 bytes (375 GB, 349 GiB) copied, 215.974 s, 1.7 GB/s
>
> Performance counter stats for 'system wide':
>
>   port    Inbound Read(MB)    Inbound Write(MB)    Outbound Read(MB)    Outbound Write(MB)
>:00102 
>3
>:80000 
>0
>:17   352552   430 
>   21
>:85000 
>0
>:3a300 
>0
>:ae000 
>0
>:5d000 
>0
>:d7000 
>0
>
> 3. Collect metrics for comma separated list of PCIe root ports:
>$ perf iostat 0000:17,0000:3a -- dd if=/dev/zero of=/dev/nvme0n1 bs=1M oflag=direct
>357708+0 records in
>357707+0 records out
>375083606016 bytes (375 GB, 349 GiB) copied, 197.08 s, 1.9 GB/s
>
> Performance counter stats for 'system wide':
>
>   port    Inbound Read(MB)    Inbound Write(MB)    Outbound Read(MB)    Outbound Write(MB)
>:17   358559   440 
>   22
>

Re: [PATCH v7] perf annotate: Fix sample events lost in stdio mode

2021-03-30 Thread Namhyung Kim
Hi Yang and Arnaldo,

On Sat, Mar 27, 2021 at 11:16 AM Yang Jihong  wrote:
> On 2021/3/26 20:06, Arnaldo Carvalho de Melo wrote:
> > So it seems to be working, what am I missing? Is this strictly non
> > group related?
> >
> Yes, it is non group related.
> This problem occurs only when different events need to be recorded at
> the same time, i.e.:
> perf record -e branch-misses -e branch-instructions -a sleep 1
>
> The output results of perf script and perf annotate do not match.
> Some events are not output in perf annotate.

Yeah I think it's related to sort keys.  The code works with a single
hist_entry for each event and symbol.  But the default sort key
creates multiple entries for different threads and it causes the
confusion.

Thanks,
Namhyung


Re: [PATCH 2/2] perf/core: Support reading group events with shared cgroups

2021-03-29 Thread Namhyung Kim
On Mon, Mar 29, 2021 at 2:32 AM Song Liu  wrote:
> > On Mar 23, 2021, at 9:21 AM, Namhyung Kim  wrote:
> >
> > This enables reading event group's counter values together with a
> > PERF_EVENT_IOC_READ_CGROUP command like we do in the regular read().
> > Users should give a correct size of buffer to be read.
> >
> > Signed-off-by: Namhyung Kim 
> > ---
> > kernel/events/core.c | 119 +--
> > 1 file changed, 116 insertions(+), 3 deletions(-)
> >
>
> [...]
>
> > +}
> > +
> > +static int perf_event_read_cgrp_node_group(struct perf_event *event, u64 
> > cgrp_id,
> > +char __user *buf)
> > +{
> > + struct perf_cgroup_node *cgrp;
> > + struct perf_event_context *ctx = event->ctx;
> > + struct perf_event *sibling;
> > + u64 read_format = event->attr.read_format;
> > + unsigned long flags;
> > + u64 *values;
> > + int n = 1;
> > + int ret;
> > +
> > + values = kzalloc(event->read_size, GFP_KERNEL);
> > + if (!values)
> > + return -ENOMEM;
> > +
> > + values[0] = 1 + event->nr_siblings;
> > +
> > + /* update event count and times (possibly run on other cpu) */
> > + (void)perf_event_read(event, true);
> > +
> > + raw_spin_lock_irqsave(&ctx->lock, flags);
> > +
> > + cgrp = find_cgroup_node(event, cgrp_id);
> > + if (cgrp == NULL) {
> > + raw_spin_unlock_irqrestore(&ctx->lock, flags);
> > + kfree(values);
> > + return -ENOENT;
> > + }
> > +
> > + if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
> > + values[n++] = cgrp->time_enabled;
> > + if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
> > + values[n++] = cgrp->time_running;
> > +
> > + values[n++] = cgrp->count;
> > + if (read_format & PERF_FORMAT_ID)
> > + values[n++] = primary_event_id(event);
> > +
> > + for_each_sibling_event(sibling, event) {
> > + n += perf_event_read_cgrp_node_sibling(sibling, read_format,
> > +cgrp_id, &values[n]);
> > + }
> > +
> > + raw_spin_unlock_irqrestore(&ctx->lock, flags);
> > +
> > + ret = copy_to_user(buf, values, n * sizeof(u64));
> > + kfree(values);
> > + if (ret)
> > + return -EFAULT;
> > +
> > + return n * sizeof(u64);
> > +}
> > +
> > +static int perf_event_read_cgroup_node(struct perf_event *event, u64 
> > read_size,
> > +u64 cgrp_id, char __user *buf)
> > +{
> > + u64 read_format = event->attr.read_format;
> > +
> > + if (read_size < event->read_size + 2 * sizeof(u64))
>
> Why do we need read_size + 2 u64 here?

I should've repeated the following description in the patch 1.

 * PERF_EVENT_IOC_READ_CGROUP - it takes a buffer consisting of a 64-bit
 array to get the event counter values.  The first element is the size
 of the array in bytes, and the second element is the cgroup id to
 read.  The rest is to save the counter value and timings.
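
Concretely, with both time flags and PERF_FORMAT_ID set, the caller's
buffer would be laid out as below (a sketch derived from the code quoted
earlier in this thread):

  u64 buf[6];

  buf[0] = sizeof(buf);  /* total size: event->read_size + 2 * sizeof(u64) */
  buf[1] = cgrp_id;      /* cgroup id to read */
  /* filled in by the kernel on return:
   * buf[2] = time_enabled
   * buf[3] = time_running
   * buf[4] = count
   * buf[5] = id          (PERF_FORMAT_ID)
   */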

Thanks,
Namhyung


Re: [PATCH 1/2] perf/core: Share an event with multiple cgroups

2021-03-29 Thread Namhyung Kim
On Mon, Mar 29, 2021 at 2:17 AM Song Liu  wrote:
> > On Mar 23, 2021, at 9:21 AM, Namhyung Kim  wrote:
> >
> > As we can run many jobs (in container) on a big machine, we want to
> > measure each job's performance during the run.  To do that, the
> > perf_event can be associated to a cgroup to measure it only.
> >
> > However such cgroup events need to be opened separately and it causes
> > significant overhead in event multiplexing during the context switch
> > as well as resource consumption like in file descriptors and memory
> > footprint.
> >
> > As a cgroup event is basically a cpu event, we can share a single cpu
> > event for multiple cgroups.  All we need is a separate counter (and
> > two timing variables) for each cgroup.  I added a hash table to map
> > from cgroup id to the attached cgroups.
> >
> > With this change, the cpu event needs to calculate a delta of event
> > counter values when the cgroups of current and the next task are
> > different.  And it attributes the delta to the current task's cgroup.
> >
> > This patch adds two new ioctl commands to perf_event for light-weight
> > cgroup event counting (i.e. perf stat).
> >
> > * PERF_EVENT_IOC_ATTACH_CGROUP - it takes a buffer consists of a
> > 64-bit array to attach given cgroups.  The first element is a
> > number of cgroups in the buffer, and the rest is a list of cgroup
> > ids to add a cgroup info to the given event.
> >
> > * PERF_EVENT_IOC_READ_CGROUP - it takes a buffer consists of a 64-bit
> > array to get the event counter values.  The first element is size
> > of the array in byte, and the second element is a cgroup id to
> > read.  The rest is to save the counter value and timings.
> >
> > This attaches all cgroups in a single syscall and I didn't add the
> > DETACH command deliberately to make the implementation simple.  The
> > attached cgroup nodes would be deleted when the file descriptor of the
> > perf_event is closed.
> >
> > Cc: Tejun Heo 
> > Signed-off-by: Namhyung Kim 
> > ---
> > include/linux/perf_event.h  |  22 ++
> > include/uapi/linux/perf_event.h |   2 +
> > kernel/events/core.c| 474 ++--
> > 3 files changed, 471 insertions(+), 27 deletions(-)
>
> [...]
>
> > @@ -4461,6 +4787,8 @@ static void __perf_event_init_context(struct 
> > perf_event_context *ctx)
> >   INIT_LIST_HEAD(&ctx->event_list);
> >   INIT_LIST_HEAD(&ctx->pinned_active);
> >   INIT_LIST_HEAD(&ctx->flexible_active);
> > + INIT_LIST_HEAD(&ctx->cgrp_ctx_entry);
> > + INIT_LIST_HEAD(&ctx->cgrp_node_list);
>
> I guess we need ifdef CONFIG_CGROUP_PERF here?

Correct.  Thanks for pointing that out.

>
> >   refcount_set(&ctx->refcount, 1);
> > }
> >
> > @@ -4851,6 +5179,8 @@ static void _free_event(struct perf_event *event)
> >   if (is_cgroup_event(event))
> >   perf_detach_cgroup(event);
> >
> > + perf_event_destroy_cgroup_nodes(event);
> > +
> >   if (!event->parent) {
> >   if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN)
> >   put_callchain_buffers();
>
> [...]
>
> > +static void perf_sched_enable(void)
> > +{
> > + /*
> > +  * We need the mutex here because static_branch_enable()
> > +  * must complete *before* the perf_sched_count increment
> > +  * becomes visible.
> > +  */
> > + if (atomic_inc_not_zero(&perf_sched_count))
> > + return;
>
> Why don't we use perf_cgroup_events for the new use case?

Maybe.. The two methods are mutually exclusive and I think
this will be preferred in the future due to the lower overhead.
And I'd like to separate it from the existing code to avoid
possible confusions.

For the perf_sched_enable(), the difference between the
existing cgroup events and this approach is when it calls
the function above.  Usually it calls during account_event()
which is a part of the event initialization.  But this approach
calls the function after an event is created.  That's why I
have the do_sched_enable variable in the perf_ioctl below
to ensure it's called exactly once for each event.
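
i.e. roughly the following shape in perf_ioctl() (a sketch; only
do_sched_enable and perf_sched_enable() are from this discussion, the
surrounding details are assumed):

  case PERF_EVENT_IOC_ATTACH_CGROUP:
          ret = perf_event_attach_cgroup_node(event, nr_cgrps, buf);
          if (!ret && do_sched_enable)
                  perf_sched_enable();    /* exactly once per event */
          break;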


>
> > +
> > + mutex_lock(&perf_sched_mutex);
> > + if (!atomic_read(&perf_sched_count)) {
> > + static_branch_enable(&perf_sched_events);
> > + /*
> > +  * Guarantee that all CPUs observe the key change and
> > +  * call the perf scheduling hooks before proceeding to
> > +  * install events t

Re: [PATCH v7] perf annotate: Fix sample events lost in stdio mode

2021-03-25 Thread Namhyung Kim
Hello,

On Fri, Mar 26, 2021 at 11:24 AM Yang Jihong  wrote:
>
> Hello,
> ping :)
>
> On 2021/3/19 20:35, Yang Jihong wrote:
> > In hist__find_annotations function, since different hist_entry may point to 
> > same
> > symbol, we free notes->src to signal already processed this symbol in stdio 
> > mode;
> > when annotate, entry will skipped if notes->src is NULL to avoid repeated 
> > output.
> >
> > However, there is a problem, for example, run the following command:
> >
> >   # perf record -e branch-misses -e branch-instructions -a sleep 1
> >
> > perf.data file contains different types of sample event.
> >
> > If the same IP sample event exists in branch-misses and branch-instructions,
> > this event uses the same symbol. When annotate branch-misses events, 
> > notes->src
> > corresponding to this event is set to null, as a result, when annotate
> > branch-instructions events, this event is skipped and no annotate is output.
> >
> > Solution of this patch is to remove zfree in hists__find_annotations and
> > change sort order to "dso,symbol" to avoid duplicate output when different
> > processes correspond to the same symbol.
> >
> > Signed-off-by: Yang Jihong 

Acked-by: Namhyung Kim 

Thanks,
Namhyung


> > ---
> >
> > Changes since v6:
> >- Remove separate setup_sorting() for branch mode.
> >
> > Changes since v5:
> >- Add Signed-off-by tag.
> >
> > Changes since v4:
> >- Use the same sort key "dso,symbol" in branch stack mode.
> >
> > Changes since v3:
> >- Modify the first line of comments.
> >
> > Changes since v2:
> >- Remove zfree in hists__find_annotations.
> >- Change sort order to avoid duplicate output.
> >
> > Changes since v1:
> >- Change processed flag variable from u8 to bool.
> >
> >   tools/perf/builtin-annotate.c | 29 +++--
> >   1 file changed, 15 insertions(+), 14 deletions(-)
> >
> > diff --git a/tools/perf/builtin-annotate.c b/tools/perf/builtin-annotate.c
> > index a23ba6bb99b6..795c2ac7fcd1 100644
> > --- a/tools/perf/builtin-annotate.c
> > +++ b/tools/perf/builtin-annotate.c
> > @@ -374,13 +374,6 @@ static void hists__find_annotations(struct hists 
> > *hists,
> >   } else {
> >   hist_entry__tty_annotate(he, evsel, ann);
> >   nd = rb_next(nd);
> > - /*
> > -  * Since we have a hist_entry per IP for the same
> > -  * symbol, free he->ms.sym->src to signal we already
> > -  * processed this symbol.
> > -  */
> > - zfree(&notes->src->cycles_hist);
> > - zfree(&notes->src);
> >   }
> >   }
> >   }
> > @@ -619,14 +612,22 @@ int cmd_annotate(int argc, const char **argv)
> >
> >   setup_browser(true);
> >
> > - if ((use_browser == 1 || annotate.use_stdio2) && 
> > annotate.has_br_stack) {
> > + /*
> > +  * Events of different processes may correspond to the same
> > +  * symbol, we do not care about the processes in annotate,
> > +  * set sort order to avoid repeated output.
> > +  */
> > + sort_order = "dso,symbol";
> > +
> > + /*
> > +  * Set SORT_MODE__BRANCH so that annotate display IPC/Cycle
> > +  * if branch info is in perf data in TUI mode.
> > +  */
> > + if ((use_browser == 1 || annotate.use_stdio2) && 
> > annotate.has_br_stack)
> >   sort__mode = SORT_MODE__BRANCH;
> > - if (setup_sorting(annotate.session->evlist) < 0)
> > - usage_with_options(annotate_usage, options);
> > - } else {
> > - if (setup_sorting(NULL) < 0)
> > - usage_with_options(annotate_usage, options);
> > - }
> > +
> > + if (setup_sorting(NULL) < 0)
> > + usage_with_options(annotate_usage, options);
> >
> >   ret = __cmd_annotate();
> >
> >


Re: [PATCH 1/2] perf/core: Share an event with multiple cgroups

2021-03-24 Thread Namhyung Kim
Hi Song,

Thanks for your review!

On Thu, Mar 25, 2021 at 9:56 AM Song Liu  wrote:
> > On Mar 23, 2021, at 9:21 AM, Namhyung Kim  wrote:
> >
> > As we can run many jobs (in container) on a big machine, we want to
> > measure each job's performance during the run.  To do that, the
> > perf_event can be associated to a cgroup to measure it only.
> >
> > However such cgroup events need to be opened separately and it causes
> > significant overhead in event multiplexing during the context switch
> > as well as resource consumption like in file descriptors and memory
> > footprint.
> >
> > As a cgroup event is basically a cpu event, we can share a single cpu
> > event for multiple cgroups.  All we need is a separate counter (and
> > two timing variables) for each cgroup.  I added a hash table to map
> > from cgroup id to the attached cgroups.
> >
> > With this change, the cpu event needs to calculate a delta of event
> > counter values when the cgroups of current and the next task are
> > different.  And it attributes the delta to the current task's cgroup.
> >
> > This patch adds two new ioctl commands to perf_event for light-weight
> > cgroup event counting (i.e. perf stat).
> >
> > * PERF_EVENT_IOC_ATTACH_CGROUP - it takes a buffer consists of a
> > 64-bit array to attach given cgroups.  The first element is a
> > number of cgroups in the buffer, and the rest is a list of cgroup
> > ids to add a cgroup info to the given event.
> >
> > * PERF_EVENT_IOC_READ_CGROUP - it takes a buffer consists of a 64-bit
> > array to get the event counter values.  The first element is size
> > of the array in byte, and the second element is a cgroup id to
> > read.  The rest is to save the counter value and timings.
> >
> > This attaches all cgroups in a single syscall and I didn't add the
> > DETACH command deliberately to make the implementation simple.  The
> > attached cgroup nodes would be deleted when the file descriptor of the
> > perf_event is closed.
> >
> > Cc: Tejun Heo 
> > Signed-off-by: Namhyung Kim 
> > ---
> > include/linux/perf_event.h  |  22 ++
> > include/uapi/linux/perf_event.h |   2 +
> > kernel/events/core.c| 474 ++--
> > 3 files changed, 471 insertions(+), 27 deletions(-)
> >
> > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> > index 3f7f89ea5e51..2760f3b07534 100644
> > --- a/include/linux/perf_event.h
> > +++ b/include/linux/perf_event.h
> > @@ -771,6 +771,18 @@ struct perf_event {
> >
> > #ifdef CONFIG_CGROUP_PERF
> >   struct perf_cgroup  *cgrp; /* cgroup event is attach to */
> > +
> > + /* to share an event for multiple cgroups */
> > + struct hlist_head   *cgrp_node_hash;
> > + struct perf_cgroup_node *cgrp_node_entries;
> > + int nr_cgrp_nodes;
> > + int cgrp_node_hash_bits;
> > +
> > + struct list_head        cgrp_node_entry;
> > +
> > + u64 cgrp_node_count;
> > + u64 cgrp_node_time_enabled;
> > + u64 cgrp_node_time_running;
>
> A comment saying the above values are from previous reading would be helpful.

Sure, will add.

>
> > #endif
> >
> > #ifdef CONFIG_SECURITY
> > @@ -780,6 +792,14 @@ struct perf_event {
> > #endif /* CONFIG_PERF_EVENTS */
> > };
> >
> > +struct perf_cgroup_node {
> > + struct hlist_node   node;
> > + u64 id;
> > + u64 count;
> > + u64 time_enabled;
> > + u64 time_running;
> > + u64 padding[2];
>
> Do we really need the padding? For cache line alignment?

Yeah I was thinking about it.  It seems I need to use the
____cacheline_aligned macro instead.

>
> > +};
> >
> > struct perf_event_groups {
> >   struct rb_root  tree;
> > @@ -843,6 +863,8 @@ struct perf_event_context {
> >   int pin_count;
> > #ifdef CONFIG_CGROUP_PERF
> >   int nr_cgroups;  /* cgroup evts */
> > + struct list_head        cgrp_node_list;
> > + struct list_head        cgrp_ctx_entry;
> > #endif
> >   void*task_ctx_data; /* pmu specifi

Re: [PATCH v4 RESEND 3/5] perf/x86/lbr: Move cpuc->lbr_xsave allocation out of sleeping region

2021-03-23 Thread Namhyung Kim
On Wed, Mar 24, 2021 at 12:47 PM Like Xu  wrote:
>
> Hi Namhyung,
>
> On 2021/3/24 9:32, Namhyung Kim wrote:
> > Hello,
> >
> > On Mon, Mar 22, 2021 at 3:14 PM Like Xu  wrote:
> >> +void reserve_lbr_buffers(struct perf_event *event)
> >> +{
> >> +   struct kmem_cache *kmem_cache = x86_get_pmu()->task_ctx_cache;
> >> +   struct cpu_hw_events *cpuc;
> >> +   int cpu;
> >> +
> >> +   if (!static_cpu_has(X86_FEATURE_ARCH_LBR))
> >> +   return;
> >> +
> >> +   for_each_possible_cpu(cpu) {
> +   cpuc = per_cpu_ptr(&cpu_hw_events, cpu);
> >> +   if (kmem_cache && !cpuc->lbr_xsave && 
> >> !event->attr.precise_ip)
> >> +   cpuc->lbr_xsave = kmem_cache_alloc(kmem_cache, 
> >> GFP_KERNEL);
> >> +   }
> >> +}
> >
> > I think we should use kmem_cache_alloc_node().
>
> "kmem_cache_alloc_node - Allocate an object on the specified node"
>
> The reserve_lbr_buffers() is called in __x86_pmu_event_init().
> When the LBR perf_event is scheduled to another node, it seems
> that we will not call init() and allocate again.
>
> Do you mean use kmem_cache_alloc_node() for each numa_nodes_parsed ?

I assume cpuc->lbr_xsave will be accessed for that cpu only.
Then we need to allocate it on the node that cpu belongs to.
Something like below..

cpuc->lbr_xsave = kmem_cache_alloc_node(kmem_cache, GFP_KERNEL,
   cpu_to_node(cpu));
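
Folded back into the loop quoted above, the suggestion would make the
allocation look roughly like this (sketch):

  for_each_possible_cpu(cpu) {
          cpuc = per_cpu_ptr(&cpu_hw_events, cpu);
          if (kmem_cache && !cpuc->lbr_xsave && !event->attr.precise_ip)
                  cpuc->lbr_xsave = kmem_cache_alloc_node(kmem_cache, GFP_KERNEL,
                                                          cpu_to_node(cpu));
  }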

Thanks,
Namhyung


Re: [PATCH] perf test: Change to use bash for daemon test

2021-03-23 Thread Namhyung Kim
Hi Leo,

On Sat, Mar 20, 2021 at 7:46 PM Leo Yan  wrote:
>
> When executed the daemon test on Arm64 and x86 with Debian (Buster)
> distro, both skip the test case with the log:
>
>   # ./perf test -v 76
>   76: daemon operations   :
>   --- start ---
>   test child forked, pid 11687
>   test daemon list
>   trap: SIGINT: bad trap
>   ./tests/shell/daemon.sh: 173: local: cpu-clock: bad variable name
>   test child finished with -2
>    end 
>   daemon operations: Skip
>
> So the error happens for the variable expansion when use local variable
> in the shell script.  Since Debian Buster uses dash but not bash as
> non-interactive shell, when execute the daemon testing, it hits a
> known issue for dash which was reported [1].
>
> To resolve this issue, one option is to add double quotes for all local
> variables assignment, so need to change the code from:
>
>   local line=`perf daemon --config ${config} -x: | head -2 | tail -1`
>
>   ... to:
>
>   local line="`perf daemon --config ${config} -x: | head -2 | tail -1`"
>
> But the testing script has bunch of local variables, this leads to big
> changes for whole script.
>
> On the other hand, the testing script asks to use the "local" feature
> which is bash-specific, so this patch explicitly uses "#!/bin/bash" to
> ensure running the script with bash.
>
> After:
>
>   # ./perf test -v 76
>   76: daemon operations   :
>   --- start ---
>   test child forked, pid 11329
>   test daemon list
>   test daemon reconfig
>   test daemon stop
>   test daemon signal
>   signal 12 sent to session 'test [11596]'
>   signal 12 sent to session 'test [11596]'
>   test daemon ping
>   test daemon lock
>   test child finished with 0
>    end 
>   daemon operations: Ok
>
> [1] https://bugs.launchpad.net/ubuntu/+source/dash/+bug/139097
>
> Fixes: 2291bb915b55 ("perf tests: Add daemon 'list' command test")
> Signed-off-by: Leo Yan 

Acked-by: Namhyung Kim 

Thanks,
Namhyung


> ---
>  tools/perf/tests/shell/daemon.sh | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/tools/perf/tests/shell/daemon.sh 
> b/tools/perf/tests/shell/daemon.sh
> index ee4a30ca3f57..45fc24af5b07 100755
> --- a/tools/perf/tests/shell/daemon.sh
> +++ b/tools/perf/tests/shell/daemon.sh
> @@ -1,4 +1,4 @@
> -#!/bin/sh
> +#!/bin/bash
>  # daemon operations
>  # SPDX-License-Identifier: GPL-2.0
>
> --
> 2.25.1
>


Re: [PATCH v4 RESEND 3/5] perf/x86/lbr: Move cpuc->lbr_xsave allocation out of sleeping region

2021-03-23 Thread Namhyung Kim
Hello,

On Mon, Mar 22, 2021 at 3:14 PM Like Xu  wrote:
> +void reserve_lbr_buffers(struct perf_event *event)
> +{
> +   struct kmem_cache *kmem_cache = x86_get_pmu()->task_ctx_cache;
> +   struct cpu_hw_events *cpuc;
> +   int cpu;
> +
> +   if (!static_cpu_has(X86_FEATURE_ARCH_LBR))
> +   return;
> +
> +   for_each_possible_cpu(cpu) {
> +   cpuc = per_cpu_ptr(&cpu_hw_events, cpu);
> +   if (kmem_cache && !cpuc->lbr_xsave && !event->attr.precise_ip)
> +   cpuc->lbr_xsave = kmem_cache_alloc(kmem_cache, 
> GFP_KERNEL);
> +   }
> +}

I think we should use kmem_cache_alloc_node().

Thanks,
Namhyung


Re: [PATCH 1/2] perf/core: Share an event with multiple cgroups

2021-03-23 Thread Namhyung Kim
Hi Song,

On Wed, Mar 24, 2021 at 9:30 AM Song Liu  wrote:
>
>
>
> > On Mar 23, 2021, at 9:21 AM, Namhyung Kim  wrote:
> >
> > As we can run many jobs (in container) on a big machine, we want to
> > measure each job's performance during the run.  To do that, the
> > perf_event can be associated to a cgroup to measure it only.
> >
> > However such cgroup events need to be opened separately and it causes
> > significant overhead in event multiplexing during the context switch
> > as well as resource consumption like in file descriptors and memory
> > footprint.
> >
> > As a cgroup event is basically a cpu event, we can share a single cpu
> > event for multiple cgroups.  All we need is a separate counter (and
> > two timing variables) for each cgroup.  I added a hash table to map
> > from cgroup id to the attached cgroups.
> >
> > With this change, the cpu event needs to calculate a delta of event
> > counter values when the cgroups of current and the next task are
> > different.  And it attributes the delta to the current task's cgroup.
> >
> > This patch adds two new ioctl commands to perf_event for light-weight
> > cgroup event counting (i.e. perf stat).
> >
> > * PERF_EVENT_IOC_ATTACH_CGROUP - it takes a buffer consists of a
> > 64-bit array to attach given cgroups.  The first element is a
> > number of cgroups in the buffer, and the rest is a list of cgroup
> > ids to add a cgroup info to the given event.
> >
> > * PERF_EVENT_IOC_READ_CGROUP - it takes a buffer consists of a 64-bit
> > array to get the event counter values.  The first element is size
> > of the array in byte, and the second element is a cgroup id to
> > read.  The rest is to save the counter value and timings.
> >
> > This attaches all cgroups in a single syscall and I didn't add the
> > DETACH command deliberately to make the implementation simple.  The
> > attached cgroup nodes would be deleted when the file descriptor of the
> > perf_event is closed.
>
> This is very interesting idea!

Thanks!

>
> Could you please add some description of the relationship among
> perf_event and contexts? The code is a little confusing. For example,
> why do we need cgroup_ctx_list?

Sure, a perf_event belongs to an event context (hw or sw, mostly) which
takes care of multiplexing, timing, locking and so on.  So many of the
fields in the perf_event are protected by the context lock.  A context has
a list of perf_events and there are per-cpu contexts and per-task contexts.

The cgroup_ctx_list is to traverse contexts (in that cpu) only having
perf_events with attached cgroups.
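
Roughly, the links look like this (a sketch based on the fields in the
patch):

  per-cpu cgroup_ctx_list
    -> perf_event_context            (via ctx->cgrp_ctx_entry)
         ctx->cgrp_node_list
           -> perf_event             (via event->cgrp_node_entry)
                event->cgrp_node_hash -> perf_cgroup_node (one per cgroup id)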

Hope this makes it clear.  Please let me know if you need more. :)

Thanks,
Namhyung


[PATCH 1/2] perf/core: Share an event with multiple cgroups

2021-03-23 Thread Namhyung Kim
As we can run many jobs (in container) on a big machine, we want to
measure each job's performance during the run.  To do that, the
perf_event can be associated with a cgroup to measure only that cgroup.

However such cgroup events need to be opened separately and it causes
significant overhead in event multiplexing during the context switch
as well as resource consumption like in file descriptors and memory
footprint.

As a cgroup event is basically a cpu event, we can share a single cpu
event for multiple cgroups.  All we need is a separate counter (and
two timing variables) for each cgroup.  I added a hash table to map
from cgroup id to the attached cgroups.

With this change, the cpu event needs to calculate a delta of event
counter values when the cgroups of current and the next task are
different.  And it attributes the delta to the current task's cgroup.

This patch adds two new ioctl commands to perf_event for light-weight
cgroup event counting (i.e. perf stat).

 * PERF_EVENT_IOC_ATTACH_CGROUP - it takes a buffer consisting of a
 64-bit array to attach given cgroups.  The first element is the
 number of cgroups in the buffer, and the rest is a list of cgroup
 ids whose info should be added to the given event.

 * PERF_EVENT_IOC_READ_CGROUP - it takes a buffer consisting of a 64-bit
 array to get the event counter values.  The first element is the size
 of the array in bytes, and the second element is the cgroup id to
 read.  The rest is to save the counter value and timings.

This attaches all cgroups in a single syscall and I deliberately didn't
add a DETACH command, to keep the implementation simple.  The
attached cgroup nodes would be deleted when the file descriptor of the
perf_event is closed.

Cc: Tejun Heo 
Signed-off-by: Namhyung Kim 
---
 include/linux/perf_event.h  |  22 ++
 include/uapi/linux/perf_event.h |   2 +
 kernel/events/core.c| 474 ++--
 3 files changed, 471 insertions(+), 27 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 3f7f89ea5e51..2760f3b07534 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -771,6 +771,18 @@ struct perf_event {
 
 #ifdef CONFIG_CGROUP_PERF
struct perf_cgroup  *cgrp; /* cgroup event is attach to */
+
+   /* to share an event for multiple cgroups */
+   struct hlist_head   *cgrp_node_hash;
+   struct perf_cgroup_node *cgrp_node_entries;
+   int nr_cgrp_nodes;
+   int cgrp_node_hash_bits;
+
+   struct list_head        cgrp_node_entry;
+
+   u64 cgrp_node_count;
+   u64 cgrp_node_time_enabled;
+   u64 cgrp_node_time_running;
 #endif
 
 #ifdef CONFIG_SECURITY
@@ -780,6 +792,14 @@ struct perf_event {
 #endif /* CONFIG_PERF_EVENTS */
 };
 
+struct perf_cgroup_node {
+   struct hlist_node   node;
+   u64 id;
+   u64 count;
+   u64 time_enabled;
+   u64 time_running;
+   u64 padding[2];
+};
 
 struct perf_event_groups {
struct rb_root  tree;
@@ -843,6 +863,8 @@ struct perf_event_context {
int pin_count;
 #ifdef CONFIG_CGROUP_PERF
int nr_cgroups;  /* cgroup evts */
+   struct list_head        cgrp_node_list;
+   struct list_head        cgrp_ctx_entry;
 #endif
void*task_ctx_data; /* pmu specific data */
struct rcu_head rcu_head;
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index ad15e40d7f5d..06bc7ab13616 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -479,6 +479,8 @@ struct perf_event_query_bpf {
 #define PERF_EVENT_IOC_PAUSE_OUTPUT_IOW('$', 9, __u32)
 #define PERF_EVENT_IOC_QUERY_BPF   _IOWR('$', 10, struct 
perf_event_query_bpf *)
 #define PERF_EVENT_IOC_MODIFY_ATTRIBUTES   _IOW('$', 11, struct 
perf_event_attr *)
+#define PERF_EVENT_IOC_ATTACH_CGROUP   _IOW('$', 12, __u64 *)
+#define PERF_EVENT_IOC_READ_CGROUP _IOWR('$', 13, __u64 *)
 
 enum perf_event_ioc_flags {
PERF_IOC_FLAG_GROUP = 1U << 0,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f07943183041..38c26a23418a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -379,6 +379,7 @@ enum event_type_t {
  * perf_cgroup_events: >0 per-cpu cgroup events exist on this cpu
  */
 
+static void perf_sched_enable(void);
 static void perf_sched_delayed(struct work_struct *work);
 DEFINE_STATIC_KEY_FALSE(perf_sched_events);
 static DECLARE_DELAYED_WORK(perf_sched_work, perf_sche

[PATCH 2/2] perf/core: Support reading group events with shared cgroups

2021-03-23 Thread Namhyung Kim
This enables reading an event group's counter values together with the
PERF_EVENT_IOC_READ_CGROUP command, like we do in the regular read().
Users should pass a buffer large enough to hold all of the values.
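
For reference, with PERF_FORMAT_GROUP the values for one cgroup follow
the usual group read layout, roughly as below (inferred from the code;
which fields are present depends on the event's read_format):

  nr            /* 1 + nr_siblings */
  time_enabled  /* if PERF_FORMAT_TOTAL_TIME_ENABLED */
  time_running  /* if PERF_FORMAT_TOTAL_TIME_RUNNING */
  value         /* leader's count for the cgroup */
  id            /* if PERF_FORMAT_ID */
  ...           /* then value (and id) for each sibling */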

Signed-off-by: Namhyung Kim 
---
 kernel/events/core.c | 119 +--
 1 file changed, 116 insertions(+), 3 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 38c26a23418a..3225177e54d5 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2232,13 +2232,24 @@ static void perf_add_cgrp_node_list(struct perf_event 
*event,
 {
struct list_head *cgrp_ctx_list = this_cpu_ptr(&cgroup_ctx_list);
struct perf_cgroup *cgrp = perf_cgroup_from_task(current, ctx);
+   struct perf_event *sibling;
bool is_first;
 
lockdep_assert_irqs_disabled();
lockdep_assert_held(&ctx->lock);
 
+   /* only group leader can be added directly */
+   if (event->group_leader != event)
+   return;
+
+   if (!event_has_cgroup_node(event))
+   return;
+
is_first = list_empty(&ctx->cgrp_node_list);
+
list_add_tail(&event->cgrp_node_entry, &ctx->cgrp_node_list);
+   for_each_sibling_event(sibling, event)
+   list_add_tail(&sibling->cgrp_node_entry, &ctx->cgrp_node_list);
 
if (is_first)
list_add_tail(&ctx->cgrp_ctx_entry, cgrp_ctx_list);
@@ -2250,15 +2261,25 @@ static void perf_del_cgrp_node_list(struct perf_event 
*event,
struct perf_event_context *ctx)
 {
struct perf_cgroup *cgrp = perf_cgroup_from_task(current, ctx);
+   struct perf_event *sibling;
 
lockdep_assert_irqs_disabled();
lockdep_assert_held(&ctx->lock);
 
+   /* only group leader can be deleted directly */
+   if (event->group_leader != event)
+   return;
+
+   if (!event_has_cgroup_node(event))
+   return;
+
update_cgroup_node(event, cgrp->css.cgroup);
/* to refresh delta when it's enabled */
event->cgrp_node_count = 0;
 
list_del(&event->cgrp_node_entry);
+   for_each_sibling_event(sibling, event)
+   list_del(&sibling->cgrp_node_entry);
 
if (list_empty(&ctx->cgrp_node_list))
list_del(&ctx->cgrp_ctx_entry);
@@ -2333,7 +2354,7 @@ static int perf_event_attach_cgroup_node(struct 
perf_event *event, u64 nr_cgrps,
 
raw_spin_unlock_irqrestore(&ctx->lock, flags);
 
-   if (is_first && enabled)
+   if (is_first && enabled && event->group_leader == event)
event_function_call(event, perf_attach_cgroup_node, NULL);
 
return 0;
@@ -2370,8 +2391,8 @@ static void __perf_read_cgroup_node(struct perf_event 
*event)
}
 }
 
-static int perf_event_read_cgroup_node(struct perf_event *event, u64 read_size,
-  u64 cgrp_id, char __user *buf)
+static int perf_event_read_cgrp_node_one(struct perf_event *event, u64 cgrp_id,
+char __user *buf)
 {
struct perf_cgroup_node *cgrp;
struct perf_event_context *ctx = event->ctx;
@@ -2406,6 +2427,91 @@ static int perf_event_read_cgroup_node(struct perf_event 
*event, u64 read_size,
 
return n * sizeof(u64);
 }
+
+static int perf_event_read_cgrp_node_sibling(struct perf_event *event,
+u64 read_format, u64 cgrp_id,
+u64 *values)
+{
+   struct perf_cgroup_node *cgrp;
+   int n = 0;
+
+   cgrp = find_cgroup_node(event, cgrp_id);
+   if (cgrp == NULL)
+   return (read_format & PERF_FORMAT_ID) ? 2 : 1;
+
+   values[n++] = cgrp->count;
+   if (read_format & PERF_FORMAT_ID)
+   values[n++] = primary_event_id(event);
+   return n;
+}
+
+static int perf_event_read_cgrp_node_group(struct perf_event *event, u64 
cgrp_id,
+  char __user *buf)
+{
+   struct perf_cgroup_node *cgrp;
+   struct perf_event_context *ctx = event->ctx;
+   struct perf_event *sibling;
+   u64 read_format = event->attr.read_format;
+   unsigned long flags;
+   u64 *values;
+   int n = 1;
+   int ret;
+
+   values = kzalloc(event->read_size, GFP_KERNEL);
+   if (!values)
+   return -ENOMEM;
+
+   values[0] = 1 + event->nr_siblings;
+
+   /* update event count and times (possibly run on other cpu) */
+   (void)perf_event_read(event, true);
+
+   raw_spin_lock_irqsave(&ctx->lock, flags);
+
+   cgrp = find_cgroup_node(event, cgrp_id);
+   if (cgrp == NULL) {
+   raw_spin_unlock_irqrestore(&ctx->lock, flags);
+   kfree(values);
+   return -ENOENT;
+   }
+
+   if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
+   values[n++] = cgrp->time_enabled;
+   if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
+   values[n++] = cgrp->time_running;

[RFC 0/2] perf core: Sharing events with multiple cgroups

2021-03-23 Thread Namhyung Kim
Hello,

This work is to make perf stat more scalable with a lot of cgroups.

Currently we need to open a separate perf_event to count an event in a
cgroup.  For a big machine, this requires lots of events like

  256 cpu x 8 events x 200 cgroups = 409600 events

This is very wasteful and not scalable.  In this case, perf stat
actually counts exactly the same events for each cgroup.  I think we
can just use a single event to measure all cgroups running on that
cpu: the same setup would then need only 256 cpu x 8 events = 2048
events, regardless of the number of cgroups.

So I added new ioctl commands to add per-cgroup counters to an
existing perf_event and to read the per-cgroup counters from the
event.  The per-cgroup counters are updated during the context switch
if tasks' cgroups are different (and no need to change the HW PMU).
It keeps the counters in a hash table with cgroup id as a key.
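
Conceptually, the switch-time update looks like this (a simplified
sketch, not the patch code; read_event_count() is a made-up stand-in
for reading the hardware counter):

  /* fold the delta since the last switch into the outgoing cgroup */
  static void account_cgroup_switch(struct perf_event *event, u64 prev_cgrp_id)
  {
          u64 now = read_event_count(event);
          struct perf_cgroup_node *node = find_cgroup_node(event, prev_cgrp_id);

          if (node)
                  node->count += now - event->cgrp_node_count;

          /* new baseline for the next cgroup on this cpu */
          event->cgrp_node_count = now;
  }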

With this change, the average processing time of my internal test
workload, which runs tasks in different cgroups communicating over
pipes, dropped from 11.3 usec to 5.8 usec.

Thanks,
Namhyung


Namhyung Kim (2):
  perf/core: Share an event with multiple cgroups
  perf/core: Support reading group events with shared cgroups

 include/linux/perf_event.h  |  22 ++
 include/uapi/linux/perf_event.h |   2 +
 kernel/events/core.c| 588 ++--
 3 files changed, 585 insertions(+), 27 deletions(-)

-- 
2.31.0.rc2.261.g7f71774620-goog



Re: [PATCH v2 0/3] perf-stat: share hardware PMCs with BPF

2021-03-19 Thread Namhyung Kim
Hi Arnaldo,

On Sat, Mar 20, 2021 at 12:35 AM Arnaldo Carvalho de Melo
 wrote:
>
> Em Fri, Mar 19, 2021 at 09:54:59AM +0900, Namhyung Kim escreveu:
> > On Fri, Mar 19, 2021 at 9:22 AM Song Liu  wrote:
> > > > On Mar 18, 2021, at 5:09 PM, Arnaldo  wrote:
> > > > On March 18, 2021 6:14:34 PM GMT-03:00, Jiri Olsa  
> > > > wrote:
> > > >> On Thu, Mar 18, 2021 at 03:52:51AM +, Song Liu wrote:
> > > >>> perf stat -C 1,3,5  107.063 [sec]
> > > >>> perf stat -C 1,3,5 --bpf-counters   106.406 [sec]
>
> > > >> I can't see why it's actualy faster than normal perf ;-)
> > > >> would be worth to find out
>
> > > > Isn't this all about contended cases?
>
> > > Yeah, the normal perf is doing time multiplexing; while --bpf-counters
> > > doesn't need it.
>
> > Yep, so for uncontended cases, normal perf should be the same as the
> > baseline (faster than the bperf).  But for contended cases, the bperf
> > works faster.
>
> The difference should be small enough that for people that use this in a
> machine where contention happens most of the time, setting a
> ~/.perfconfig to use it by default should be advantageous, i.e. no need
> to use --bpf-counters on the command line all the time.
>
> So, Namhyung, can I take that as an Acked-by or a Reviewed-by? I'll take
> a look again now but I want to have this merged on perf/core so that I
> can work on a new BPF SKEL to use this:

I have a concern about the per-cpu target, but it can be done later, so

Acked-by: Namhyung Kim 

>
> https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/log/?h=tmp.bpf/bpf_perf_enable

Interesting!  Actually I was thinking about something similar too. :)

Thanks,
Namhyung


Re: [PATCH V2 1/5] perf/x86/intel/uncore: Parse uncore discovery tables

2021-03-18 Thread Namhyung Kim
Hi Kan,

On Thu, Mar 18, 2021 at 3:05 AM  wrote:
>
> From: Kan Liang 
>
> A self-describing mechanism for the uncore PerfMon hardware has been
> introduced with the latest Intel platforms. By reading through an MMIO
> page worth of information, perf can 'discover' all the standard uncore
> PerfMon registers in a machine.
>
> The discovery mechanism relies on BIOS's support. With a proper BIOS,
> a PCI device with the unique capability ID 0x23 can be found on each
> die. Perf can retrieve the information of all available uncore PerfMons
> from the device via MMIO. The information is composed of one global
> discovery table and several unit discovery tables.
> - The global discovery table includes global uncore information of the
>   die, e.g., the address of the global control register, the offset of
>   the global status register, the number of uncore units, the offset of
>   unit discovery tables, etc.
> - The unit discovery table includes generic uncore unit information,
>   e.g., the access type, the counter width, the address of counters,
>   the address of the counter control, the unit ID, the unit type, etc.
>   The unit is also called "box" in the code.
> Perf can provide basic uncore support based on this information
> with the following patches.
>
> To locate the PCI device with the discovery tables, check the generic
> PCI ID first. If it doesn't match, go through the entire PCI device tree
> and locate the device with the unique capability ID.
>
> The uncore information is similar among dies. To save parsing time and
> space, only completely parse and store the discovery tables on the first
> die and the first box of each die. The parsed information is stored in
> an
> RB tree structure, intel_uncore_discovery_type. The size of the stored
> discovery tables varies among platforms. It's around 4KB for a Sapphire
> Rapids server.
>
> If a BIOS doesn't support the 'discovery' mechanism, the uncore driver
> will exit with -ENODEV. There is nothing changed.
>
> Add a module parameter to disable the discovery feature. If a BIOS gets
> the discovery tables wrong, users can have an option to disable the
> feature. For the current patchset, the uncore driver will exit with
> -ENODEV. In the future, it may fall back to the hardcode uncore driver
> on a known platform.
>
> Signed-off-by: Kan Liang 
> ---
>  arch/x86/events/intel/Makefile   |   2 +-
>  arch/x86/events/intel/uncore.c   |  31 ++-
>  arch/x86/events/intel/uncore_discovery.c | 318 
> +++
>  arch/x86/events/intel/uncore_discovery.h | 105 ++
>  4 files changed, 448 insertions(+), 8 deletions(-)
>  create mode 100644 arch/x86/events/intel/uncore_discovery.c
>  create mode 100644 arch/x86/events/intel/uncore_discovery.h
>
> diff --git a/arch/x86/events/intel/Makefile b/arch/x86/events/intel/Makefile
> index e67a588..10bde6c 100644
> --- a/arch/x86/events/intel/Makefile
> +++ b/arch/x86/events/intel/Makefile
> @@ -3,6 +3,6 @@ obj-$(CONFIG_CPU_SUP_INTEL) += core.o bts.o
>  obj-$(CONFIG_CPU_SUP_INTEL)+= ds.o knc.o
>  obj-$(CONFIG_CPU_SUP_INTEL)+= lbr.o p4.o p6.o pt.o
>  obj-$(CONFIG_PERF_EVENTS_INTEL_UNCORE) += intel-uncore.o
> -intel-uncore-objs  := uncore.o uncore_nhmex.o 
> uncore_snb.o uncore_snbep.o
> +intel-uncore-objs  := uncore.o uncore_nhmex.o 
> uncore_snb.o uncore_snbep.o uncore_discovery.o
>  obj-$(CONFIG_PERF_EVENTS_INTEL_CSTATE) += intel-cstate.o
>  intel-cstate-objs  := cstate.o
> diff --git a/arch/x86/events/intel/uncore.c b/arch/x86/events/intel/uncore.c
> index 33c8180..d111370 100644
> --- a/arch/x86/events/intel/uncore.c
> +++ b/arch/x86/events/intel/uncore.c
> @@ -4,7 +4,12 @@
>  #include 
>  #include 
>  #include "uncore.h"
> +#include "uncore_discovery.h"
>
> +static bool uncore_no_discover;
> +module_param(uncore_no_discover, bool, 0);

Wouldn't it be better to use a positive form like 'uncore_discover = true'?
To disable, the module param can be set to 'uncore_discover = false'.
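
For example (an untested sketch of the suggestion):

static bool uncore_discover = true;
module_param(uncore_discover, bool, 0);
MODULE_PARM_DESC(uncore_discover, "Enable the Intel uncore PerfMon discovery "
                                  "mechanism (default: enabled).");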

> +MODULE_PARM_DESC(uncore_no_discover, "Don't enable the Intel uncore PerfMon 
> discovery mechanism "
> +"(default: enable the discovery 
> mechanism).");
>  static struct intel_uncore_type *empty_uncore[] = { NULL, };
>  struct intel_uncore_type **uncore_msr_uncores = empty_uncore;
>  struct intel_uncore_type **uncore_pci_uncores = empty_uncore;

[SNIP]
> +enum uncore_access_type {
> +   UNCORE_ACCESS_MSR   = 0,
> +   UNCORE_ACCESS_MMIO,
> +   UNCORE_ACCESS_PCI,
> +
> +   UNCORE_ACCESS_MAX,
> +};
> +
> +struct uncore_global_discovery {
> +   union {
> +   u64 table1;
> +   struct {
> +   u64 type : 8,
> +   stride : 8,
> +   max_units : 10,
> +   __reserved_1 : 36,
> +   access_type 

Re: [RESEND PATCH v2] perf stat: improve readability of shadow stats

2021-03-18 Thread Namhyung Kim
Hello,

On Mon, Mar 15, 2021 at 11:31 PM Changbin Du  wrote:
>
> This adds function convert_unit_double() and selects appropriate
> unit for shadow stats between K/M/G.
>
> $ sudo ./perf stat -a -- sleep 1
>
> Before: Unit 'M' is selected even the number is very small.
>  Performance counter stats for 'system wide':
>
>   4,003.06 msec cpu-clock #3.998 CPUs utilized
> 16,179  context-switches  #0.004 M/sec
>161  cpu-migrations#0.040 K/sec
>  4,699  page-faults   #0.001 M/sec
>  6,135,801,925  cycles#1.533 GHz  
> (83.21%)
>  5,783,308,491  stalled-cycles-frontend   #   94.26% frontend cycles 
> idle (83.21%)
>  4,543,694,050  stalled-cycles-backend#   74.05% backend cycles 
> idle  (66.49%)
>  4,720,130,587  instructions  #0.77  insn per cycle
>   #1.23  stalled cycles 
> per insn  (83.28%)
>753,848,078  branches  #  188.318 M/sec
> (83.61%)
> 37,457,747  branch-misses #4.97% of all branches  
> (83.48%)
>
>1.001283725 seconds time elapsed
>
> After:
> $ sudo ./perf stat -a -- sleep 2
>
>  Performance counter stats for 'system wide':
>
>   8,005.52 msec cpu-clock #3.999 CPUs utilized
> 10,715  context-switches  #1.338 K/sec
>785  cpu-migrations#   98.057 /sec
>102  page-faults   #   12.741 /sec
>  1,948,202,279  cycles#0.243 GHz
>  2,816,470,932  stalled-cycles-frontend   #  144.57% frontend cycles 
> idle
>  2,661,172,207  stalled-cycles-backend#  136.60% backend cycles 
> idle
>464,172,105  instructions  #0.24  insn per cycle
>   #6.07  stalled cycles 
> per insn
> 91,567,662  branches  #   11.438 M/sec
>  7,756,054  branch-misses #8.47% of all branches
>
>2.002040043 seconds time elapsed
>
> Signed-off-by: Changbin Du 

Acked-by: Namhyung Kim 

Thanks,
Namhyung

>
> v2:
>   o do not change 'sec' to 'cpu-sec'.
>   o use convert_unit_double to implement convert_unit.
> ---
>  tools/perf/util/stat-shadow.c | 16 +++-
>  tools/perf/util/units.c   | 21 ++---
>  tools/perf/util/units.h   |  1 +
>  3 files changed, 22 insertions(+), 16 deletions(-)
>
> diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c
> index 6ccf21a72f06..3f800e71126f 100644
> --- a/tools/perf/util/stat-shadow.c
> +++ b/tools/perf/util/stat-shadow.c
> @@ -9,6 +9,7 @@
>  #include "expr.h"
>  #include "metricgroup.h"
>  #include "cgroup.h"
> +#include "units.h"
>  #include 
>
>  /*
> @@ -1270,18 +1271,15 @@ void perf_stat__print_shadow_stats(struct 
> perf_stat_config *config,
> generic_metric(config, evsel->metric_expr, 
> evsel->metric_events, NULL,
> evsel->name, evsel->metric_name, NULL, 1, 
> cpu, out, st);
> } else if (runtime_stat_n(st, STAT_NSECS, cpu, &rsd) != 0) {
> -   char unit = 'M';
> -   char unit_buf[10];
> +   char unit = ' ';
> +   char unit_buf[10] = "/sec";
>
> total = runtime_stat_avg(st, STAT_NSECS, cpu, &rsd);
> -
> if (total)
> -   ratio = 1000.0 * avg / total;
> -   if (ratio < 0.001) {
> -   ratio *= 1000;
> -   unit = 'K';
> -   }
> -   snprintf(unit_buf, sizeof(unit_buf), "%c/sec", unit);
> +   ratio = convert_unit_double(1000000000.0 * avg / 
> total, &unit);
> +
> +   if (unit != ' ')
> +   snprintf(unit_buf, sizeof(unit_buf), "%c/sec", unit);
> print_metric(config, ctxp, NULL, "%8.3f", unit_buf, ratio);
> } else if (perf_stat_evsel__is(evsel, SMI_NUM)) {
> print_smi_cost(config, cpu, out, st, &rsd);
> diff --git a/tools/perf/util/units.c b/tools/perf/util/units.c
> index a46762aec4c9..32c39cfe209b 100644
> --- a/tools/perf/util/units.c
> +++ b/tools/perf/util/units.c
> @@ -33,28 +33,35 @@ unsigned long parse_tag_value(const char *str, struct 
> parse_tag *tags)
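
As a rough sketch of what the changelog describes (my reconstruction
of convert_unit_double(), not necessarily the exact patch), the value
is scaled down in steps of 1000 and the chosen suffix is reported:

double convert_unit_double(double value, char *unit)
{
	*unit = ' ';

	if (value > 1000.0) {
		value /= 1000.0;
		*unit = 'K';
	}
	if (value > 1000.0) {
		value /= 1000.0;
		*unit = 'M';
	}
	if (value > 1000.0) {
		value /= 1000.0;
		*unit = 'G';
	}
	return value;
}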

Re: [PATCH v2 0/3] perf-stat: share hardware PMCs with BPF

2021-03-18 Thread Namhyung Kim
On Fri, Mar 19, 2021 at 9:22 AM Song Liu  wrote:
>
>
>
> > On Mar 18, 2021, at 5:09 PM, Arnaldo  wrote:
> >
> >
> >
> > On March 18, 2021 6:14:34 PM GMT-03:00, Jiri Olsa  wrote:
> >> On Thu, Mar 18, 2021 at 03:52:51AM +, Song Liu wrote:
> >>>
> >>>
> >>>> On Mar 17, 2021, at 6:11 AM, Arnaldo Carvalho de Melo
> >>  wrote:
> >>>>
> >>>> Em Wed, Mar 17, 2021 at 02:29:28PM +0900, Namhyung Kim escreveu:
> >>>>> Hi Song,
> >>>>>
> >>>>> On Wed, Mar 17, 2021 at 6:18 AM Song Liu 
> >> wrote:
> >>>>>>
> >>>>>> perf uses performance monitoring counters (PMCs) to monitor
> >> system
> >>>>>> performance. The PMCs are limited hardware resources. For
> >> example,
> >>>>>> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
> >>>>>>
> >>>>>> Modern data center systems use these PMCs in many different ways:
> >>>>>> system level monitoring, (maybe nested) container level
> >> monitoring, per
> >>>>>> process monitoring, profiling (in sample mode), etc. In some
> >> cases,
> >>>>>> there are more active perf_events than available hardware PMCs.
> >> To allow
> >>>>>> all perf_events to have a chance to run, it is necessary to do
> >> expensive
> >>>>>> time multiplexing of events.
> >>>>>>
> >>>>>> On the other hand, many monitoring tools count the common metrics
> >> (cycles,
> >>>>>> instructions). It is a waste to have multiple tools create
> >> multiple
> >>>>>> perf_events of "cycles" and occupy multiple PMCs.
> >>>>>
> >>>>> Right, it'd be really helpful when the PMCs are frequently or
> >> mostly shared.
> >>>>> But it'd also increase the overhead for uncontended cases as BPF
> >> programs
> >>>>> need to run on every context switch.  Depending on the workload,
> >> it may
> >>>>> cause a non-negligible performance impact.  So users should be
> >> aware of it.
> >>>>
> >>>> Would be interesting to, humm, measure both cases to have a firm
> >> number
> >>>> of the impact, how many instructions are added when sharing using
> >>>> --bpf-counters?
> >>>>
> >>>> I.e. compare the "expensive time multiplexing of events" with its
> >>>> avoidance by using --bpf-counters.
> >>>>
> >>>> Song, have you perfmormed such measurements?
> >>>
> >>> I have got some measurements with perf-bench-sched-messaging:
> >>>
> >>> The system: x86_64 with 23 cores (46 HT)
> >>>
> >>> The perf-stat command:
> >>> perf stat -e
> >> cycles,cycles,instructions,instructions,ref-cycles,ref-cycles <benchmark etc.>
> >>>
> >>> The benchmark command and output:
> >>> ./perf bench sched messaging -g 40 -l 5 -t
> >>> # Running 'sched/messaging' benchmark:
> >>> # 20 sender and receiver threads per group
> >>> # 40 groups == 1600 threads run
> >>> Total time: 10X.XXX [sec]
> >>>
> >>>
> >>> I use the "Total time" as measurement, so smaller number is better.
> >>>
> >>> For each condition, I run the command 5 times, and took the median of
> >>
> >>> "Total time".
> >>>
> >>> Baseline (no perf-stat) 104.873 [sec]
> >>> # global
> >>> perf stat -a107.887 [sec]
> >>> perf stat -a --bpf-counters 106.071 [sec]
> >>> # per task
> >>> perf stat   106.314 [sec]
> >>> perf stat --bpf-counters105.965 [sec]
> >>> # per cpu
> >>> perf stat -C 1,3,5  107.063 [sec]
> >>> perf stat -C 1,3,5 --bpf-counters   106.406 [sec]
> >>
> >> I can't see why it's actualy faster than normal perf ;-)
> >> would be worth to find out
> >
> > Isn't this all about contended cases?
>
> Yeah, the normal perf is doing time multiplexing; while --bpf-counters
> doesn't need it.

Yep, so for uncontended cases, normal perf should be the same as the
baseline (faster than the bperf).  But for contended cases, the bperf
works faster.

Thanks,
Namhyung


Re: [PATCH v2 1/3] perf-stat: introduce bperf, share hardware PMCs with BPF

2021-03-18 Thread Namhyung Kim
On Thu, Mar 18, 2021 at 4:22 PM Song Liu  wrote:
>
>
>
> > On Mar 17, 2021, at 10:54 PM, Namhyung Kim  wrote:
> >
>
> [...]
>
> >> +
> >> +static int bperf_reload_leader_program(struct evsel *evsel, int 
> >> attr_map_fd,
> >> +  struct perf_event_attr_map_entry 
> >> *entry)
> >> +{
> >> +   struct bperf_leader_bpf *skel = bperf_leader_bpf__open();
> >> +   int link_fd, diff_map_fd, err;
> >> +   struct bpf_link *link = NULL;
> >> +
> >> +   if (!skel) {
> >> +   pr_err("Failed to open leader skeleton\n");
> >> +   return -1;
> >> +   }
> >> +
> >> +   bpf_map__resize(skel->maps.events, libbpf_num_possible_cpus());
> >> +   err = bperf_leader_bpf__load(skel);
> >> +   if (err) {
> >> +   pr_err("Failed to load leader skeleton\n");
> >> +   goto out;
> >> +   }
> >> +
> >> +   err = -1;
> >> +   link = bpf_program__attach(skel->progs.on_switch);
> >> +   if (!link) {
> >> +   pr_err("Failed to attach leader program\n");
> >> +   goto out;
> >> +   }
> >> +
> >> +   link_fd = bpf_link__fd(link);
> >> +   diff_map_fd = bpf_map__fd(skel->maps.diff_readings);
> >> +   entry->link_id = bpf_link_get_id(link_fd);
> >> +   entry->diff_map_id = bpf_map_get_id(diff_map_fd);
> >> +   err = bpf_map_update_elem(attr_map_fd, &evsel->core.attr, entry, 
> >> BPF_ANY);
> >> +   assert(err == 0);
> >> +
> >> +   evsel->bperf_leader_link_fd = 
> >> bpf_link_get_fd_by_id(entry->link_id);
> >> +   assert(evsel->bperf_leader_link_fd >= 0);
> >
> > Isn't it the same as link_fd?
>
> This is a different fd on the same link.

Ok

>
> >
> >> +
> >> +   /*
> >> +* save leader_skel for install_pe, which is called within
> >> +* following evsel__open_per_cpu call
> >> +*/
> >> +   evsel->leader_skel = skel;
> >> +   evsel__open_per_cpu(evsel, all_cpu_map, -1);
> >> +
> >> +out:
> >> +   bperf_leader_bpf__destroy(skel);
> >> +   bpf_link__destroy(link);
> >
> > Why do we destroy it?  Is it because we get an another reference?
>
> Yes. We only need evsel->bperf_leader_link_fd to keep the whole
> skeleton attached.
>
> When multiple perf-stat sessions are sharing the leader skeleton,
> only the first one loads the leader skeleton, by calling
> bperf_reload_leader_program(). Other sessions simply hold a fd to
> the bpf_link. More explanation in bperf__load() below.

Ok.

>
>
> >
> >> +   return err;
> >> +}
> >> +
> >> +static int bperf__load(struct evsel *evsel, struct target *target)
> >> +{
> >> +   struct perf_event_attr_map_entry entry = {0xffffffff, 0xffffffff};
> >> +   int attr_map_fd, diff_map_fd = -1, err;
> >> +   enum bperf_filter_type filter_type;
> >> +   __u32 filter_entry_cnt, i;
> >> +
> >> +   if (bperf_check_target(evsel, target, &filter_type, 
> >> &filter_entry_cnt))
> >> +   return -1;
> >> +
> >> +   if (!all_cpu_map) {
> >> +   all_cpu_map = perf_cpu_map__new(NULL);
> >> +   if (!all_cpu_map)
> >> +   return -1;
> >> +   }
> >> +
> >> +   evsel->bperf_leader_prog_fd = -1;
> >> +   evsel->bperf_leader_link_fd = -1;
> >> +
> >> +   /*
> >> +* Step 1: hold a fd on the leader program and the bpf_link, if
> >> +* the program is not already gone, reload the program.
> >> +* Use flock() to ensure exclusive access to the perf_event_attr
> >> +* map.
> >> +*/
> >> +   attr_map_fd = bperf_lock_attr_map(target);
> >> +   if (attr_map_fd < 0) {
> >> +   pr_err("Failed to lock perf_event_attr map\n");
> >> +   return -1;
> >> +   }
> >> +
> >> +   err = bpf_map_lookup_elem(attr_map_fd, &evsel->core.attr, &entry);
> >> +   if (err) {
> >> +   err = bpf_map_update_elem(attr_map_fd, &evsel->core.attr, 
> >> &entry, BPF_ANY);
> >> +   if 

Re: [PATCH v6] perf annotate: Fix sample events lost in stdio mode

2021-03-18 Thread Namhyung Kim
Hello,

On Wed, Mar 17, 2021 at 6:44 PM Yang Jihong  wrote:
>
> In hist__find_annotations function, since different hist_entry may point to 
> same
> symbol, we free notes->src to signal already processed this symbol in stdio 
> mode;
> when annotate, entry will skipped if notes->src is NULL to avoid repeated 
> output.
>
> However, there is a problem, for example, run the following command:
>
>  # perf record -e branch-misses -e branch-instructions -a sleep 1
>
> perf.data file contains different types of sample event.
>
> If the same IP sample event exists in branch-misses and branch-instructions,
> this event uses the same symbol. When annotate branch-misses events, 
> notes->src
> corresponding to this event is set to null, as a result, when annotate
> branch-instructions events, this event is skipped and no annotate is output.
>
> Solution of this patch is to remove zfree in hists__find_annotations and
> change sort order to "dso,symbol" to avoid duplicate output when different
> processes correspond to the same symbol.
>
> Signed-off-by: Yang Jihong 
> ---
>
> Changes since v5:
>   - Add Signed-off-by tag.
>
> Changes since v4:
>   - Use the same sort key "dso,symbol" in branch stack mode.
>
> Changes since v3:
>   - Modify the first line of comments.
>
> Changes since v2:
>   - Remove zfree in hists__find_annotations.
>   - Change sort order to avoid duplicate output.
>
> Changes since v1:
>   - Change processed flag variable from u8 to bool.
>
>  tools/perf/builtin-annotate.c | 13 ++---
>  1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/tools/perf/builtin-annotate.c b/tools/perf/builtin-annotate.c
> index a23ba6bb99b6..92c55f292c11 100644
> --- a/tools/perf/builtin-annotate.c
> +++ b/tools/perf/builtin-annotate.c
> @@ -374,13 +374,6 @@ static void hists__find_annotations(struct hists *hists,
> } else {
> hist_entry__tty_annotate(he, evsel, ann);
> nd = rb_next(nd);
> -   /*
> -* Since we have a hist_entry per IP for the same
> -* symbol, free he->ms.sym->src to signal we already
> -* processed this symbol.
> -*/
> -   zfree(&notes->src->cycles_hist);
> -   zfree(&notes->src);
> }
> }
>  }
> @@ -619,6 +612,12 @@ int cmd_annotate(int argc, const char **argv)
>
> setup_browser(true);
>
> +   /*
> +* Events of different processes may correspond to the same
> +* symbol, we do not care about the processes in annotate,
> +* set sort order to avoid repeated output.
> +*/
> +   sort_order = "dso,symbol";

At this point, I think there's not much value having separate
setup_sorting() for branch mode.

Thanks,
Namhyung


> if ((use_browser == 1 || annotate.use_stdio2) && 
> annotate.has_br_stack) {
> sort__mode = SORT_MODE__BRANCH;
> if (setup_sorting(annotate.session->evlist) < 0)
> --
> 2.30.GIT
>


Re: [PATCH v2 3/3] perf-test: add a test for perf-stat --bpf-counters option

2021-03-18 Thread Namhyung Kim
On Wed, Mar 17, 2021 at 6:18 AM Song Liu  wrote:
>
> Add a test to compare the output of perf-stat with and without option
> --bpf-counters. If the difference is more than 10%, the test is considered
> as failed.
>
> For stable results between two runs (w/ and w/o --bpf-counters), the test
> program should: 1) be long enough for better signal-noise-ratio; 2) not
> depend on the behavior of IO subsystem (for less noise from caching). So
> far, the best option we found is stressapptest.
>
> Signed-off-by: Song Liu 
> ---
>  tools/perf/tests/shell/stat_bpf_counters.sh | 34 +
>  1 file changed, 34 insertions(+)
>  create mode 100755 tools/perf/tests/shell/stat_bpf_counters.sh
>
> diff --git a/tools/perf/tests/shell/stat_bpf_counters.sh 
> b/tools/perf/tests/shell/stat_bpf_counters.sh
> new file mode 100755
> index 0..c0bcb38d6b53c
> --- /dev/null
> +++ b/tools/perf/tests/shell/stat_bpf_counters.sh
> @@ -0,0 +1,34 @@
> +#!/bin/sh
> +# perf stat --bpf-counters test
> +# SPDX-License-Identifier: GPL-2.0
> +
> +set -e
> +
> +# check whether $2 is within +/- 10% of $1
> +compare_number()
> +{
> +   first_num=$1
> +   second_num=$2
> +
> +   # upper bound is first_num * 110%
> +   upper=$(( $first_num + $first_num / 10 ))
> +   # lower bound is first_num * 90%
> +   lower=$(( $first_num - $first_num / 10 ))
> +
> +   if [ $second_num -gt $upper ] || [ $second_num -lt $lower ]; then
> +   echo "The difference between $first_num and $second_num are 
> greater than 10%."
> +   exit 1
> +   fi
> +}
> +
> +# skip if --bpf-counters is not supported
> +perf stat --bpf-counters true > /dev/null 2>&1 || exit 2
> +
> +# skip if stressapptest is not available
> +stressapptest -s 1 -M 100 -m 1 > /dev/null 2>&1 || exit 2

I don't know how popular it is, but we could print some info
in case it's missing.
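
For example (untested):

stressapptest -s 1 -M 100 -m 1 > /dev/null 2>&1 || {
	echo "Skip: stressapptest is not available"
	exit 2
}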

> +
> +base_cycles=$(perf stat --no-big-num -e cycles -- stressapptest -s 3 -M 100 
> -m 1 2>&1 | grep -e cycles | awk '{print $1}')
> +bpf_cycles=$(perf stat --no-big-num --bpf-counters -e cycles -- 
> stressapptest -s 3 -M 100 -m 1 2>&1 | grep -e cycles | awk '{print $1}')

I think just awk '/cycles/ {print $1}' should work.

Thanks,
Namhyung


> +
> +compare_number $base_cycles $bpf_cycles
> +exit 0
> --
> 2.30.2
>


Re: [PATCH v2 1/3] perf-stat: introduce bperf, share hardware PMCs with BPF

2021-03-17 Thread Namhyung Kim
On Wed, Mar 17, 2021 at 6:18 AM Song Liu  wrote:
> +static int bperf_check_target(struct evsel *evsel,
> + struct target *target,
> + enum bperf_filter_type *filter_type,
> + __u32 *filter_entry_cnt)
> +{
> +   if (evsel->leader->core.nr_members > 1) {
> +   pr_err("bpf managed perf events do not yet support 
> groups.\n");
> +   return -1;
> +   }
> +
> +   /* determine filter type based on target */
> +   if (target->system_wide) {
> +   *filter_type = BPERF_FILTER_GLOBAL;
> +   *filter_entry_cnt = 1;
> +   } else if (target->cpu_list) {
> +   *filter_type = BPERF_FILTER_CPU;
> +   *filter_entry_cnt = perf_cpu_map__nr(evsel__cpus(evsel));
> +   } else if (target->tid) {
> +   *filter_type = BPERF_FILTER_PID;
> +   *filter_entry_cnt = perf_thread_map__nr(evsel->core.threads);
> +   } else if (target->pid || evsel->evlist->workload.pid != -1) {
> +   *filter_type = BPERF_FILTER_TGID;
> +   *filter_entry_cnt = perf_thread_map__nr(evsel->core.threads);
> +   } else {
> +   pr_err("bpf managed perf events do not yet support these 
> targets.\n");
> +   return -1;
> +   }
> +
> +   return 0;
> +}
> +
> +static struct perf_cpu_map *all_cpu_map;
> +
> +static int bperf_reload_leader_program(struct evsel *evsel, int attr_map_fd,
> +  struct perf_event_attr_map_entry 
> *entry)
> +{
> +   struct bperf_leader_bpf *skel = bperf_leader_bpf__open();
> +   int link_fd, diff_map_fd, err;
> +   struct bpf_link *link = NULL;
> +
> +   if (!skel) {
> +   pr_err("Failed to open leader skeleton\n");
> +   return -1;
> +   }
> +
> +   bpf_map__resize(skel->maps.events, libbpf_num_possible_cpus());
> +   err = bperf_leader_bpf__load(skel);
> +   if (err) {
> +   pr_err("Failed to load leader skeleton\n");
> +   goto out;
> +   }
> +
> +   err = -1;
> +   link = bpf_program__attach(skel->progs.on_switch);
> +   if (!link) {
> +   pr_err("Failed to attach leader program\n");
> +   goto out;
> +   }
> +
> +   link_fd = bpf_link__fd(link);
> +   diff_map_fd = bpf_map__fd(skel->maps.diff_readings);
> +   entry->link_id = bpf_link_get_id(link_fd);
> +   entry->diff_map_id = bpf_map_get_id(diff_map_fd);
> +   err = bpf_map_update_elem(attr_map_fd, &evsel->core.attr, entry, 
> BPF_ANY);
> +   assert(err == 0);
> +
> +   evsel->bperf_leader_link_fd = bpf_link_get_fd_by_id(entry->link_id);
> +   assert(evsel->bperf_leader_link_fd >= 0);

Isn't it the same as link_fd?

> +
> +   /*
> +* save leader_skel for install_pe, which is called within
> +* following evsel__open_per_cpu call
> +*/
> +   evsel->leader_skel = skel;
> +   evsel__open_per_cpu(evsel, all_cpu_map, -1);
> +
> +out:
> +   bperf_leader_bpf__destroy(skel);
> +   bpf_link__destroy(link);

Why do we destroy it?  Is it because we get an another reference?

> +   return err;
> +}
> +
> +static int bperf__load(struct evsel *evsel, struct target *target)
> +{
> +   struct perf_event_attr_map_entry entry = {0xffffffff, 0xffffffff};
> +   int attr_map_fd, diff_map_fd = -1, err;
> +   enum bperf_filter_type filter_type;
> +   __u32 filter_entry_cnt, i;
> +
> +   if (bperf_check_target(evsel, target, &filter_type, 
> &filter_entry_cnt))
> +   return -1;
> +
> +   if (!all_cpu_map) {
> +   all_cpu_map = perf_cpu_map__new(NULL);
> +   if (!all_cpu_map)
> +   return -1;
> +   }
> +
> +   evsel->bperf_leader_prog_fd = -1;
> +   evsel->bperf_leader_link_fd = -1;
> +
> +   /*
> +* Step 1: hold a fd on the leader program and the bpf_link, if
> +* the program is not already gone, reload the program.
> +* Use flock() to ensure exclusive access to the perf_event_attr
> +* map.
> +*/
> +   attr_map_fd = bperf_lock_attr_map(target);
> +   if (attr_map_fd < 0) {
> +   pr_err("Failed to lock perf_event_attr map\n");
> +   return -1;
> +   }
> +
> +   err = bpf_map_lookup_elem(attr_map_fd, &evsel->core.attr, &entry);
> +   if (err) {
> +   err = bpf_map_update_elem(attr_map_fd, &evsel->core.attr, 
> &entry, BPF_ANY);
> +   if (err)
> +   goto out;
> +   }
> +
> +   evsel->bperf_leader_link_fd = bpf_link_get_fd_by_id(entry.link_id);
> +   if (evsel->bperf_leader_link_fd < 0 &&
> +   bperf_reload_leader_program(evsel, attr_map_fd, &entry))
> +   goto out;
> +
> +   /*
> +* The bpf_link holds reference to the leader program, and the
> +* leader program holds reference to the maps. 

Re: [PATCH v2 0/3] perf-stat: share hardware PMCs with BPF

2021-03-17 Thread Namhyung Kim
On Thu, Mar 18, 2021 at 12:52 PM Song Liu  wrote:
>
>
>
> > On Mar 17, 2021, at 6:11 AM, Arnaldo Carvalho de Melo  
> > wrote:
> >
> > Em Wed, Mar 17, 2021 at 02:29:28PM +0900, Namhyung Kim escreveu:
> >> Hi Song,
> >>
> >> On Wed, Mar 17, 2021 at 6:18 AM Song Liu  wrote:
> >>>
> >>> perf uses performance monitoring counters (PMCs) to monitor system
> >>> performance. The PMCs are limited hardware resources. For example,
> >>> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
> >>>
> >>> Modern data center systems use these PMCs in many different ways:
> >>> system level monitoring, (maybe nested) container level monitoring, per
> >>> process monitoring, profiling (in sample mode), etc. In some cases,
> >>> there are more active perf_events than available hardware PMCs. To allow
> >>> all perf_events to have a chance to run, it is necessary to do expensive
> >>> time multiplexing of events.
> >>>
> >>> On the other hand, many monitoring tools count the common metrics (cycles,
> >>> instructions). It is a waste to have multiple tools create multiple
> >>> perf_events of "cycles" and occupy multiple PMCs.
> >>
> >> Right, it'd be really helpful when the PMCs are frequently or mostly 
> >> shared.
> >> But it'd also increase the overhead for uncontended cases as BPF programs
> >> need to run on every context switch.  Depending on the workload, it may
> >> cause a non-negligible performance impact.  So users should be aware of it.
> >
> > Would be interesting to, humm, measure both cases to have a firm number
> > of the impact, how many instructions are added when sharing using
> > --bpf-counters?
> >
> > I.e. compare the "expensive time multiplexing of events" with its
> > avoidance by using --bpf-counters.
> >
> > Song, have you perfmormed such measurements?
>
> I have got some measurements with perf-bench-sched-messaging:
>
> The system: x86_64 with 23 cores (46 HT)
>
> The perf-stat command:
> perf stat -e cycles,cycles,instructions,instructions,ref-cycles,ref-cycles 
> 
>
> The benchmark command and output:
> ./perf bench sched messaging -g 40 -l 5 -t
> # Running 'sched/messaging' benchmark:
> # 20 sender and receiver threads per group
> # 40 groups == 1600 threads run
>  Total time: 10X.XXX [sec]
>
>
> I use the "Total time" as measurement, so smaller number is better.
>
> For each condition, I run the command 5 times, and took the median of
> "Total time".
>
> Baseline (no perf-stat) 104.873 [sec]
> # global
> perf stat -a107.887 [sec]
> perf stat -a --bpf-counters 106.071 [sec]
> # per task
> perf stat   106.314 [sec]
> perf stat --bpf-counters105.965 [sec]
> # per cpu
> perf stat -C 1,3,5  107.063 [sec]
> perf stat -C 1,3,5 --bpf-counters   106.406 [sec]
>
> From the data, --bpf-counters is slightly better than the regular event
> for all targets. I noticed that the results are not very stable. There
> are a couple 108.xx runs in some of the conditions (w/ and w/o
> --bpf-counters).

Hmm.. so this result is when multiplexing happened, right?
I wondered how/why the regular perf stat is slower..

Thanks,
Namhyung

>
>
> I also measured the average runtime of the BPF programs, with
>
> sysctl kernel.bpf_stats_enabled=1
>
> For each event, if we have one leader and two followers, the total run
> time is about 340ns. IOW, 340ns for two perf-stat reading instructions,
> 340ns for two perf-stat reading cycles, etc.
>
> Thanks,
> Song


[PATCH] libbpf: Fix error path in bpf_object__elf_init()

2021-03-17 Thread Namhyung Kim
When it fails to get the section names, it should call
bpf_object__elf_finish() like the other error paths do.

Signed-off-by: Namhyung Kim 
---
 tools/lib/bpf/libbpf.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 2f351d3ad3e7..8d610259f4be 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -1194,7 +1194,8 @@ static int bpf_object__elf_init(struct bpf_object *obj)
if (!elf_rawdata(elf_getscn(obj->efile.elf, obj->efile.shstrndx), 
NULL)) {
pr_warn("elf: failed to get section names strings from %s: 
%s\n",
obj->path, elf_errmsg(-1));
-   return -LIBBPF_ERRNO__FORMAT;
+   err = -LIBBPF_ERRNO__FORMAT;
+   goto errout;
}
 
/* Old LLVM set e_machine to EM_NONE */
-- 
2.31.0.rc2.261.g7f71774620-goog



[tip: perf/core] perf core: Allocate perf_buffer in the target node memory

2021-03-17 Thread tip-bot2 for Namhyung Kim
The following commit has been merged into the perf/core branch of tip:

Commit-ID: 9483409ab5067941860754e78a4a44a60311d276
Gitweb:
https://git.kernel.org/tip/9483409ab5067941860754e78a4a44a60311d276
Author:Namhyung Kim 
AuthorDate:Mon, 15 Mar 2021 12:34:36 +09:00
Committer: Peter Zijlstra 
CommitterDate: Tue, 16 Mar 2021 21:44:42 +01:00

perf core: Allocate perf_buffer in the target node memory

I found that the ring buffer pages are allocated in the target node
but the ring buffer itself is not.  Let's convert it to use
kzalloc_node() too.

Signed-off-by: Namhyung Kim 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20210315033436.682438-1-namhy...@kernel.org
---
 kernel/events/ring_buffer.c |  9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index ef91ae7..bd55ccc 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -804,7 +804,7 @@ struct perf_buffer *rb_alloc(int nr_pages, long watermark, 
int cpu, int flags)
 {
struct perf_buffer *rb;
unsigned long size;
-   int i;
+   int i, node;
 
size = sizeof(struct perf_buffer);
size += nr_pages * sizeof(void *);
@@ -812,7 +812,8 @@ struct perf_buffer *rb_alloc(int nr_pages, long watermark, 
int cpu, int flags)
if (order_base_2(size) >= PAGE_SHIFT+MAX_ORDER)
goto fail;
 
-   rb = kzalloc(size, GFP_KERNEL);
+   node = (cpu == -1) ? cpu : cpu_to_node(cpu);
+   rb = kzalloc_node(size, GFP_KERNEL, node);
if (!rb)
goto fail;
 
@@ -906,11 +907,13 @@ struct perf_buffer *rb_alloc(int nr_pages, long 
watermark, int cpu, int flags)
struct perf_buffer *rb;
unsigned long size;
void *all_buf;
+   int node;
 
size = sizeof(struct perf_buffer);
size += sizeof(void *);
 
-   rb = kzalloc(size, GFP_KERNEL);
+   node = (cpu == -1) ? cpu : cpu_to_node(cpu);
+   rb = kzalloc_node(size, GFP_KERNEL, node);
if (!rb)
goto fail;
 


[tip: perf/core] perf core: Allocate perf_event in the target node memory

2021-03-17 Thread tip-bot2 for Namhyung Kim
The following commit has been merged into the perf/core branch of tip:

Commit-ID: ff65338e78418e5970a7aabbabb94c46f2bb821d
Gitweb:
https://git.kernel.org/tip/ff65338e78418e5970a7aabbabb94c46f2bb821d
Author:Namhyung Kim 
AuthorDate:Thu, 11 Mar 2021 20:54:13 +09:00
Committer: Peter Zijlstra 
CommitterDate: Tue, 16 Mar 2021 21:44:43 +01:00

perf core: Allocate perf_event in the target node memory

For cpu events, it'd be better to allocate them in the corresponding
node's memory as they would be mostly accessed by the target cpu.
Although perf tools sets the cpu affinity before calling
perf_event_open, there are places it doesn't (notably perf record) and
we should consider other external users too.

Signed-off-by: Namhyung Kim 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/2021035413.07-2-namhy...@kernel.org
---
 kernel/events/core.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index f526ddb..6182cb1 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -11288,13 +11288,16 @@ perf_event_alloc(struct perf_event_attr *attr, int 
cpu,
struct perf_event *event;
struct hw_perf_event *hwc;
long err = -EINVAL;
+   int node;
 
if ((unsigned)cpu >= nr_cpu_ids) {
if (!task || cpu != -1)
return ERR_PTR(-EINVAL);
}
 
-   event = kmem_cache_zalloc(perf_event_cache, GFP_KERNEL);
+   node = (cpu >= 0) ? cpu_to_node(cpu) : -1;
+   event = kmem_cache_alloc_node(perf_event_cache, GFP_KERNEL | __GFP_ZERO,
+ node);
if (!event)
return ERR_PTR(-ENOMEM);
 


[tip: perf/core] perf core: Add a kmem_cache for struct perf_event

2021-03-17 Thread tip-bot2 for Namhyung Kim
The following commit has been merged into the perf/core branch of tip:

Commit-ID: bdacfaf26da166dd56c62f23f27a4b3e71f2d89e
Gitweb:
https://git.kernel.org/tip/bdacfaf26da166dd56c62f23f27a4b3e71f2d89e
Author:Namhyung Kim 
AuthorDate:Thu, 11 Mar 2021 20:54:12 +09:00
Committer: Peter Zijlstra 
CommitterDate: Tue, 16 Mar 2021 21:44:42 +01:00

perf core: Add a kmem_cache for struct perf_event

The kernel can allocate a lot of struct perf_event when profiling. For
example, 256 cpu x 8 events x 20 cgroups = 40K instances of the struct
would be allocated on a large system.

The size of struct perf_event in my setup is 1152 bytes. As it's
allocated by kmalloc, the actual allocation size would be rounded up
to 2K.

Then there's 896 bytes (~43%) of waste per instance resulting in total
~35MB with 40K instances. We can create a dedicated kmem_cache to
avoid such a big unnecessary memory consumption.

With this change, I can see below (note this machine has 112 cpus).

  # grep perf_event /proc/slabinfo
  perf_event    224    784   1152    7    2 : tunables   24   12    8 : slabdata    112    112      0

The sixth column is pages-per-slab which is 2, and the fifth column is
obj-per-slab which is 7.  Thus actually it can use 1152 x 7 = 8064
bytes in the 8K, and the wasted memory is (8192 - 8064) / 7 = ~18
bytes per instance.

Signed-off-by: Namhyung Kim 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/2021035413.07-1-namhy...@kernel.org
---
 kernel/events/core.c |  9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 03db40f..f526ddb 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -405,6 +405,7 @@ static LIST_HEAD(pmus);
 static DEFINE_MUTEX(pmus_lock);
 static struct srcu_struct pmus_srcu;
 static cpumask_var_t perf_online_mask;
+static struct kmem_cache *perf_event_cache;
 
 /*
  * perf event paranoia level:
@@ -4611,7 +4612,7 @@ static void free_event_rcu(struct rcu_head *head)
if (event->ns)
put_pid_ns(event->ns);
perf_event_free_filter(event);
-   kfree(event);
+   kmem_cache_free(perf_event_cache, event);
 }
 
 static void ring_buffer_attach(struct perf_event *event,
@@ -11293,7 +11294,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
return ERR_PTR(-EINVAL);
}
 
-   event = kzalloc(sizeof(*event), GFP_KERNEL);
+   event = kmem_cache_zalloc(perf_event_cache, GFP_KERNEL);
if (!event)
return ERR_PTR(-ENOMEM);
 
@@ -11497,7 +11498,7 @@ err_ns:
put_pid_ns(event->ns);
if (event->hw.target)
put_task_struct(event->hw.target);
-   kfree(event);
+   kmem_cache_free(perf_event_cache, event);
 
return ERR_PTR(err);
 }
@@ -13130,6 +13131,8 @@ void __init perf_event_init(void)
ret = init_hw_breakpoint();
WARN(ret, "hw_breakpoint initialization failed with: %d", ret);
 
+   perf_event_cache = KMEM_CACHE(perf_event, SLAB_PANIC);
+
/*
 * Build time assertion that we keep the data_head at the intended
 * location.  IOW, validation we got the __reserved[] size right.


Re: [PATCH v2 0/3] perf-stat: share hardware PMCs with BPF

2021-03-16 Thread Namhyung Kim
Hi Song,

On Wed, Mar 17, 2021 at 6:18 AM Song Liu  wrote:
>
> perf uses performance monitoring counters (PMCs) to monitor system
> performance. The PMCs are limited hardware resources. For example,
> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
>
> Modern data center systems use these PMCs in many different ways:
> system level monitoring, (maybe nested) container level monitoring, per
> process monitoring, profiling (in sample mode), etc. In some cases,
> there are more active perf_events than available hardware PMCs. To allow
> all perf_events to have a chance to run, it is necessary to do expensive
> time multiplexing of events.
>
> On the other hand, many monitoring tools count the common metrics (cycles,
> instructions). It is a waste to have multiple tools create multiple
> perf_events of "cycles" and occupy multiple PMCs.

Right, it'd be really helpful when the PMCs are frequently or mostly shared.
But it'd also increase the overhead for uncontended cases as BPF programs
need to run on every context switch.  Depending on the workload, it may
cause a non-negligible performance impact.  So users should be aware of it.

Thanks,
Namhyung

>
> bperf tries to reduce such wastes by allowing multiple perf_events of
> "cycles" or "instructions" (at different scopes) to share PMUs. Instead
> of having each perf-stat session to read its own perf_events, bperf uses
> BPF programs to read the perf_events and aggregate readings to BPF maps.
> Then, the perf-stat session(s) reads the values from these BPF maps.
>
> Changes v1 => v2:
>   1. Add documentation.
>   2. Add a shell test.
>   3. Rename options, default path of the atto-map, and some variables.
>   4. Add a separate patch that moves clock_gettime() in __run_perf_stat()
>  to after enable_counters().
>   5. Make perf_cpu_map for all cpus a global variable.
>   6. Use sysfs__mountpoint() for default attr-map path.
>   7. Use cpu__max_cpu() instead of libbpf_num_possible_cpus().
>   8. Add flag "enabled" to the follower program. Then move follower attach
>  to bperf__load() and simplify bperf__enable().
>
> Song Liu (3):
>   perf-stat: introduce bperf, share hardware PMCs with BPF
>   perf-stat: measure t0 and ref_time after enable_counters()
>   perf-test: add a test for perf-stat --bpf-counters option
>
>  tools/perf/Documentation/perf-stat.txt|  11 +
>  tools/perf/Makefile.perf  |   1 +
>  tools/perf/builtin-stat.c |  20 +-
>  tools/perf/tests/shell/stat_bpf_counters.sh   |  34 ++
>  tools/perf/util/bpf_counter.c | 519 +-
>  tools/perf/util/bpf_skel/bperf.h  |  14 +
>  tools/perf/util/bpf_skel/bperf_follower.bpf.c |  69 +++
>  tools/perf/util/bpf_skel/bperf_leader.bpf.c   |  46 ++
>  tools/perf/util/bpf_skel/bperf_u.h|  14 +
>  tools/perf/util/evsel.h   |  20 +-
>  tools/perf/util/target.h  |   4 +-
>  11 files changed, 742 insertions(+), 10 deletions(-)
>  create mode 100755 tools/perf/tests/shell/stat_bpf_counters.sh
>  create mode 100644 tools/perf/util/bpf_skel/bperf.h
>  create mode 100644 tools/perf/util/bpf_skel/bperf_follower.bpf.c
>  create mode 100644 tools/perf/util/bpf_skel/bperf_leader.bpf.c
>  create mode 100644 tools/perf/util/bpf_skel/bperf_u.h
>
> --
> 2.30.2


Re: [PATCH 1/2] perf/x86/intel: Fix a crash caused by zero PEBS status

2021-03-16 Thread Namhyung Kim
On Fri, Mar 12, 2021 at 05:21:37AM -0800, kan.li...@linux.intel.com wrote:
> From: Kan Liang 
> 
> A repeatable crash can be triggered by the perf_fuzzer on some Haswell
> system.
> https://lore.kernel.org/lkml/7170d3b-c17f-1ded-52aa-cc6d9ae99...@maine.edu/
> 
> For some old CPUs (HSW and earlier), the PEBS status in a PEBS record
> may be mistakenly set to 0. To minimize the impact of the defect, the
> commit was introduced to try to avoid dropping the PEBS record for some
> cases. It adds a check in the intel_pmu_drain_pebs_nhm(), and updates
> the local pebs_status accordingly. However, it doesn't correct the PEBS
> status in the PEBS record, which may trigger the crash, especially for
> the large PEBS.
> 
> It's possible that all the PEBS records in a large PEBS have the PEBS
> status 0. If so, the first get_next_pebs_record_by_bit() in the
> __intel_pmu_pebs_event() returns NULL. The at = NULL. Since it's a large
> PEBS, the 'count' parameter must > 1. The second
> get_next_pebs_record_by_bit() will crash.
> 
> Besides the local pebs_status, correct the PEBS status in the PEBS
> record as well.
> 
> Fixes: 01330d7288e0 ("perf/x86: Allow zero PEBS status with only single 
> active event")
> Reported-by: Vince Weaver 
> Suggested-by: Peter Zijlstra (Intel) 
> Signed-off-by: Kan Liang 
> Cc: sta...@vger.kernel.org

Tested-by: Namhyung Kim 

Thanks,
Namhyung


> ---
>  arch/x86/events/intel/ds.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
> index 7ebae18..bcf4fa5 100644
> --- a/arch/x86/events/intel/ds.c
> +++ b/arch/x86/events/intel/ds.c
> @@ -2010,7 +2010,7 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs 
> *iregs, struct perf_sample_d
>*/
>   if (!pebs_status && cpuc->pebs_enabled &&
>   !(cpuc->pebs_enabled & (cpuc->pebs_enabled-1)))
> - pebs_status = cpuc->pebs_enabled;
> + pebs_status = p->status = cpuc->pebs_enabled;
>  
>   bit = find_first_bit((unsigned long *)_status,
>   x86_pmu.max_pebs_events);
> -- 
> 2.7.4
> 


Re: [PATCH] Revert "perf/x86: Allow zero PEBS status with only single active event"

2021-03-16 Thread Namhyung Kim
On Tue, Mar 16, 2021 at 9:28 PM Liang, Kan  wrote:
>
>
>
> On 3/16/2021 3:22 AM, Namhyung Kim wrote:
> > Hi Peter and Kan,
> >
> > On Thu, Mar 4, 2021 at 5:22 AM Peter Zijlstra  wrote:
> >>
> >> On Wed, Mar 03, 2021 at 02:53:00PM -0500, Liang, Kan wrote:
> >>> On 3/3/2021 1:59 PM, Peter Zijlstra wrote:
> >>>> On Wed, Mar 03, 2021 at 05:42:18AM -0800, kan.li...@linux.intel.com 
> >>>> wrote:
> >>
> >>>>> +++ b/arch/x86/events/intel/ds.c
> >>>>> @@ -2000,18 +2000,6 @@ static void intel_pmu_drain_pebs_nhm(struct 
> >>>>> pt_regs *iregs, struct perf_sample_d
> >>>>>continue;
> >>>>>}
> >>>>> - /*
> >>>>> -  * On some CPUs the PEBS status can be zero when PEBS is
> >>>>> -  * racing with clearing of GLOBAL_STATUS.
> >>>>> -  *
> >>>>> -  * Normally we would drop that record, but in the
> >>>>> -  * case when there is only a single active PEBS event
> >>>>> -  * we can assume it's for that event.
> >>>>> -  */
> >>>>> - if (!pebs_status && cpuc->pebs_enabled &&
> >>>>> - !(cpuc->pebs_enabled & (cpuc->pebs_enabled-1)))
> >>>>> - pebs_status = cpuc->pebs_enabled;
> >>>>
> >>>> Wouldn't something like:
> >>>>
> >>>>  pebs_status = p->status = cpus->pebs_enabled;
> >>>>
> >>>
> >>> I didn't consider it as a potential solution in this patch because I don't
> >>> think it's a proper way that SW modifies the buffer, which is supposed to 
> >>> be
> >>> manipulated by the HW.
> >>
> >> Right, but then HW was supposed to write sane values and it doesn't do
> >> that either ;-)
> >>
> >>> It's just a personal preference. I don't see any issue here. We may try 
> >>> it.
> >>
> >> So I mostly agree with you, but I think it's a shame to unsupport such
> >> chips, HSW is still a plenty useable chip today.
> >
> > I got a similar issue on Ivy Bridge machines which caused a kernel
> > crash.  In my case it's related to the branch stack with PEBS events,
> > but I think it's the same issue.  And I can confirm that the above
> > approach of updating p->status fixed the problem.
> >
> > I've talked to Stephane about this, and he wants to make it more
> > robust when we see stale (or invalid) PEBS records.  I'll send the
> > patch soon.
> >
>
> Hi Namhyung,
>
> In case you didn't see it, I've already submitted a patch to fix the
> issue last Friday.
> https://lore.kernel.org/lkml/161298-140216-1-git-send-email-kan.li...@linux.intel.com/
> But if you have a more robust proposal, please feel free to submit it.
>
> BTW: The patch set from last Friday also fixed another bug found by the
> perf_fuzzer test. You may be interested.

Right, I missed it.  It'd be nice if you could CC me for perf patches later.

Thanks,
Namhyung


Re: [PATCH] Revert "perf/x86: Allow zero PEBS status with only single active event"

2021-03-16 Thread Namhyung Kim
Hi Peter and Kan,

On Thu, Mar 4, 2021 at 5:22 AM Peter Zijlstra  wrote:
>
> On Wed, Mar 03, 2021 at 02:53:00PM -0500, Liang, Kan wrote:
> > On 3/3/2021 1:59 PM, Peter Zijlstra wrote:
> > > On Wed, Mar 03, 2021 at 05:42:18AM -0800, kan.li...@linux.intel.com wrote:
>
> > > > +++ b/arch/x86/events/intel/ds.c
> > > > @@ -2000,18 +2000,6 @@ static void intel_pmu_drain_pebs_nhm(struct 
> > > > pt_regs *iregs, struct perf_sample_d
> > > >   continue;
> > > >   }
> > > > - /*
> > > > -  * On some CPUs the PEBS status can be zero when PEBS is
> > > > -  * racing with clearing of GLOBAL_STATUS.
> > > > -  *
> > > > -  * Normally we would drop that record, but in the
> > > > -  * case when there is only a single active PEBS event
> > > > -  * we can assume it's for that event.
> > > > -  */
> > > > - if (!pebs_status && cpuc->pebs_enabled &&
> > > > - !(cpuc->pebs_enabled & (cpuc->pebs_enabled-1)))
> > > > - pebs_status = cpuc->pebs_enabled;
> > >
> > > Wouldn't something like:
> > >
> > > pebs_status = p->status = cpus->pebs_enabled;
> > >
> >
> > I didn't consider it as a potential solution in this patch because I don't
> > think it's a proper way that SW modifies the buffer, which is supposed to be
> > manipulated by the HW.
>
> Right, but then HW was supposed to write sane values and it doesn't do
> that either ;-)
>
> > It's just a personal preference. I don't see any issue here. We may try it.
>
> So I mostly agree with you, but I think it's a shame to unsupport such
> chips, HSW is still a plenty useable chip today.

I got a similar issue on Ivy Bridge machines which caused a kernel
crash.  In my case it's related to the branch stack with PEBS events,
but I think it's the same issue.  And I can confirm that the above
approach of updating p->status fixed the problem.

I've talked to Stephane about this, and he wants to make it more
robust when we see stale (or invalid) PEBS records.  I'll send the
patch soon.

Thanks,
Namhyung


Re: [PATCH] perf record: Fix memory leak in vDSO

2021-03-15 Thread Namhyung Kim
On Mon, Mar 15, 2021 at 10:28 PM Jiri Olsa  wrote:
>
> On Mon, Mar 15, 2021 at 01:56:41PM +0900, Namhyung Kim wrote:
> > I got several memory leak reports from Asan with a simple command.  It
> > was because the vDSO dso is not released due to the refcount.  Like in
> > __dsos__addnew_id(), it should put the refcount after adding to the
> > list.
> >
> >   $ perf record true
> >   [ perf record: Woken up 1 times to write data ]
> >   [ perf record: Captured and wrote 0.030 MB perf.data (10 samples) ]
> >
> >   =
> >   ==692599==ERROR: LeakSanitizer: detected memory leaks
> >
> >   Direct leak of 439 byte(s) in 1 object(s) allocated from:
> > #0 0x7fea52341037 in __interceptor_calloc 
> > ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:154
> > #1 0x559bce4aa8ee in dso__new_id util/dso.c:1256
> > #2 0x559bce59245a in __machine__addnew_vdso util/vdso.c:132
> > #3 0x559bce59245a in machine__findnew_vdso util/vdso.c:347
> > #4 0x559bce50826c in map__new util/map.c:175
> > #5 0x559bce503c92 in machine__process_mmap2_event util/machine.c:1787
> > #6 0x559bce512f6b in machines__deliver_event util/session.c:1481
> > #7 0x559bce515107 in perf_session__deliver_event util/session.c:1551
> > #8 0x559bce51d4d2 in do_flush util/ordered-events.c:244
> > #9 0x559bce51d4d2 in __ordered_events__flush util/ordered-events.c:323
> > #10 0x559bce519bea in __perf_session__process_events util/session.c:2268
> > #11 0x559bce519bea in perf_session__process_events util/session.c:2297
> > #12 0x559bce2e7a52 in process_buildids 
> > /home/namhyung/project/linux/tools/perf/builtin-record.c:1017
> > #13 0x559bce2e7a52 in record__finish_output 
> > /home/namhyung/project/linux/tools/perf/builtin-record.c:1234
> > #14 0x559bce2ed4f6 in __cmd_record 
> > /home/namhyung/project/linux/tools/perf/builtin-record.c:2026
> > #15 0x559bce2ed4f6 in cmd_record 
> > /home/namhyung/project/linux/tools/perf/builtin-record.c:2858
> > #16 0x559bce422db4 in run_builtin 
> > /home/namhyung/project/linux/tools/perf/perf.c:313
> > #17 0x559bce2acac8 in handle_internal_command 
> > /home/namhyung/project/linux/tools/perf/perf.c:365
> > #18 0x559bce2acac8 in run_argv 
> > /home/namhyung/project/linux/tools/perf/perf.c:409
> > #19 0x559bce2acac8 in main 
> > /home/namhyung/project/linux/tools/perf/perf.c:539
> > #20 0x7fea51e76d09 in __libc_start_main ../csu/libc-start.c:308
> >
> >   Indirect leak of 32 byte(s) in 1 object(s) allocated from:
> > #0 0x7fea52341037 in __interceptor_calloc 
> > ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:154
> > #1 0x559bce520907 in nsinfo__copy util/namespaces.c:169
> > #2 0x559bce50821b in map__new util/map.c:168
> > #3 0x559bce503c92 in machine__process_mmap2_event util/machine.c:1787
> > #4 0x559bce512f6b in machines__deliver_event util/session.c:1481
> > #5 0x559bce515107 in perf_session__deliver_event util/session.c:1551
> > #6 0x559bce51d4d2 in do_flush util/ordered-events.c:244
> > #7 0x559bce51d4d2 in __ordered_events__flush util/ordered-events.c:323
> > #8 0x559bce519bea in __perf_session__process_events util/session.c:2268
> > #9 0x559bce519bea in perf_session__process_events util/session.c:2297
> > #10 0x559bce2e7a52 in process_buildids 
> > /home/namhyung/project/linux/tools/perf/builtin-record.c:1017
> > #11 0x559bce2e7a52 in record__finish_output 
> > /home/namhyung/project/linux/tools/perf/builtin-record.c:1234
> > #12 0x559bce2ed4f6 in __cmd_record 
> > /home/namhyung/project/linux/tools/perf/builtin-record.c:2026
> > #13 0x559bce2ed4f6 in cmd_record 
> > /home/namhyung/project/linux/tools/perf/builtin-record.c:2858
> > #14 0x559bce422db4 in run_builtin 
> > /home/namhyung/project/linux/tools/perf/perf.c:313
> > #15 0x559bce2acac8 in handle_internal_command 
> > /home/namhyung/project/linux/tools/perf/perf.c:365
> > #16 0x559bce2acac8 in run_argv 
> > /home/namhyung/project/linux/tools/perf/perf.c:409
> > #17 0x559bce2acac8 in main 
> > /home/namhyung/project/linux/tools/perf/perf.c:539
> > #18 0x7fea51e76d09 in __libc_start_main ../csu/libc-start.c:308
> >
> >   SUMMARY: AddressSanitizer: 471 byte(s) leaked in 2 allocation(s).
> >
> > Signed-off-by: Namhyung Kim 
> > ---
> >  tools/perf/util/vdso.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/tools/perf/util/vdso.c b/tools/perf/util

[PATCH] perf record: Fix memory leak in vDSO

2021-03-14 Thread Namhyung Kim
I got several memory leak reports from Asan with a simple command.  It
was because the vDSO DSO was not released due to a leftover refcount.  Like in
__dsos__addnew_id(), it should put the refcount after adding to the list.
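
For illustration, a minimal sketch of the refcount convention this fix
relies on (simplified, not the exact perf code paths): the constructor
returns an object holding one reference and the list takes its own, so
the creator must drop the initial reference.

	struct dso *dso = dso__new_id(name, NULL);	/* refcount == 1 */

	if (dso != NULL) {
		__dsos__add(&machine->dsos, dso);	/* the list takes its own ref */
		dso__put(dso);	/* drop the creation ref; the list keeps dso alive */
	}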

  $ perf record true
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.030 MB perf.data (10 samples) ]

  =
  ==692599==ERROR: LeakSanitizer: detected memory leaks

  Direct leak of 439 byte(s) in 1 object(s) allocated from:
#0 0x7fea52341037 in __interceptor_calloc 
../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x559bce4aa8ee in dso__new_id util/dso.c:1256
#2 0x559bce59245a in __machine__addnew_vdso util/vdso.c:132
#3 0x559bce59245a in machine__findnew_vdso util/vdso.c:347
#4 0x559bce50826c in map__new util/map.c:175
#5 0x559bce503c92 in machine__process_mmap2_event util/machine.c:1787
#6 0x559bce512f6b in machines__deliver_event util/session.c:1481
#7 0x559bce515107 in perf_session__deliver_event util/session.c:1551
#8 0x559bce51d4d2 in do_flush util/ordered-events.c:244
#9 0x559bce51d4d2 in __ordered_events__flush util/ordered-events.c:323
#10 0x559bce519bea in __perf_session__process_events util/session.c:2268
#11 0x559bce519bea in perf_session__process_events util/session.c:2297
#12 0x559bce2e7a52 in process_buildids 
/home/namhyung/project/linux/tools/perf/builtin-record.c:1017
#13 0x559bce2e7a52 in record__finish_output 
/home/namhyung/project/linux/tools/perf/builtin-record.c:1234
#14 0x559bce2ed4f6 in __cmd_record 
/home/namhyung/project/linux/tools/perf/builtin-record.c:2026
#15 0x559bce2ed4f6 in cmd_record 
/home/namhyung/project/linux/tools/perf/builtin-record.c:2858
#16 0x559bce422db4 in run_builtin 
/home/namhyung/project/linux/tools/perf/perf.c:313
#17 0x559bce2acac8 in handle_internal_command 
/home/namhyung/project/linux/tools/perf/perf.c:365
#18 0x559bce2acac8 in run_argv 
/home/namhyung/project/linux/tools/perf/perf.c:409
#19 0x559bce2acac8 in main 
/home/namhyung/project/linux/tools/perf/perf.c:539
#20 0x7fea51e76d09 in __libc_start_main ../csu/libc-start.c:308

  Indirect leak of 32 byte(s) in 1 object(s) allocated from:
#0 0x7fea52341037 in __interceptor_calloc 
../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x559bce520907 in nsinfo__copy util/namespaces.c:169
#2 0x559bce50821b in map__new util/map.c:168
#3 0x559bce503c92 in machine__process_mmap2_event util/machine.c:1787
#4 0x559bce512f6b in machines__deliver_event util/session.c:1481
#5 0x559bce515107 in perf_session__deliver_event util/session.c:1551
#6 0x559bce51d4d2 in do_flush util/ordered-events.c:244
#7 0x559bce51d4d2 in __ordered_events__flush util/ordered-events.c:323
#8 0x559bce519bea in __perf_session__process_events util/session.c:2268
#9 0x559bce519bea in perf_session__process_events util/session.c:2297
#10 0x559bce2e7a52 in process_buildids 
/home/namhyung/project/linux/tools/perf/builtin-record.c:1017
#11 0x559bce2e7a52 in record__finish_output 
/home/namhyung/project/linux/tools/perf/builtin-record.c:1234
#12 0x559bce2ed4f6 in __cmd_record 
/home/namhyung/project/linux/tools/perf/builtin-record.c:2026
#13 0x559bce2ed4f6 in cmd_record 
/home/namhyung/project/linux/tools/perf/builtin-record.c:2858
#14 0x559bce422db4 in run_builtin 
/home/namhyung/project/linux/tools/perf/perf.c:313
#15 0x559bce2acac8 in handle_internal_command 
/home/namhyung/project/linux/tools/perf/perf.c:365
#16 0x559bce2acac8 in run_argv 
/home/namhyung/project/linux/tools/perf/perf.c:409
#17 0x559bce2acac8 in main 
/home/namhyung/project/linux/tools/perf/perf.c:539
#18 0x7fea51e76d09 in __libc_start_main ../csu/libc-start.c:308

  SUMMARY: AddressSanitizer: 471 byte(s) leaked in 2 allocation(s).

Signed-off-by: Namhyung Kim 
---
 tools/perf/util/vdso.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/perf/util/vdso.c b/tools/perf/util/vdso.c
index 3cc91ad048ea..43beb169631d 100644
--- a/tools/perf/util/vdso.c
+++ b/tools/perf/util/vdso.c
@@ -133,6 +133,8 @@ static struct dso *__machine__addnew_vdso(struct machine 
*machine, const char *s
if (dso != NULL) {
__dsos__add(&machine->dsos, dso);
dso__set_long_name(dso, long_name, false);
+   /* Put dso here because __dsos__add() already got it */
+   dso__put(dso);
}
 
return dso;
-- 
2.31.0.rc2.261.g7f71774620-goog



[PATCH] perf core: Allocate perf_buffer in the target node memory

2021-03-14 Thread Namhyung Kim
I found that the ring buffer pages are allocated on the target node but the
ring buffer struct itself is not.  Let's convert it to use kzalloc_node() too.
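
As a side note (illustration only, not part of the change below): a node
value of -1 is NUMA_NO_NODE, so cpu == -1 (task contexts) means no node
preference:

	int node = (cpu == -1) ? cpu : cpu_to_node(cpu);	/* -1 == NUMA_NO_NODE */
	rb = kzalloc_node(size, GFP_KERNEL, node);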

Signed-off-by: Namhyung Kim 
---
 kernel/events/ring_buffer.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index ef91ae75ca56..bd55ccc91373 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -804,7 +804,7 @@ struct perf_buffer *rb_alloc(int nr_pages, long watermark, 
int cpu, int flags)
 {
struct perf_buffer *rb;
unsigned long size;
-   int i;
+   int i, node;
 
size = sizeof(struct perf_buffer);
size += nr_pages * sizeof(void *);
@@ -812,7 +812,8 @@ struct perf_buffer *rb_alloc(int nr_pages, long watermark, 
int cpu, int flags)
if (order_base_2(size) >= PAGE_SHIFT+MAX_ORDER)
goto fail;
 
-   rb = kzalloc(size, GFP_KERNEL);
+   node = (cpu == -1) ? cpu : cpu_to_node(cpu);
+   rb = kzalloc_node(size, GFP_KERNEL, node);
if (!rb)
goto fail;
 
@@ -906,11 +907,13 @@ struct perf_buffer *rb_alloc(int nr_pages, long 
watermark, int cpu, int flags)
struct perf_buffer *rb;
unsigned long size;
void *all_buf;
+   int node;
 
size = sizeof(struct perf_buffer);
size += sizeof(void *);
 
-   rb = kzalloc(size, GFP_KERNEL);
+   node = (cpu == -1) ? cpu : cpu_to_node(cpu);
+   rb = kzalloc_node(size, GFP_KERNEL, node);
if (!rb)
goto fail;
 
-- 
2.31.0.rc2.261.g7f71774620-goog



Re: [PATCH v4] perf annotate: Fix sample events lost in stdio mode

2021-03-12 Thread Namhyung Kim
On Sat, Mar 13, 2021 at 11:16 AM Yang Jihong  wrote:
>
> In the hist__find_annotations function, since different hist_entries may
> point to the same symbol, we free notes->src to signal that we already
> processed this symbol in stdio mode; when annotating, the entry will be
> skipped if notes->src is NULL to avoid repeated output.
>
> However, there is a problem, for example, run the following command:
>
>  # perf record -e branch-misses -e branch-instructions -a sleep 1
>
> The perf.data file contains different types of sample events.
>
> If the same IP sample event exists in both branch-misses and
> branch-instructions, this event uses the same symbol. When annotating
> branch-misses events, the notes->src corresponding to this event is set to
> NULL; as a result, when annotating branch-instructions events, this event
> is skipped and no annotation is output.
>
> The solution of this patch is to remove the zfree in hists__find_annotations
> and change the sort order to "dso,symbol" to avoid duplicate output when
> different processes correspond to the same symbol.

Looks good.  But I'm not sure about the branch stack mode.
I suspect we can use the same sort key there.

Jin Yao, what do you think?

Thanks,
Namhyung

> ---
>  tools/perf/builtin-annotate.c | 13 ++---
>  1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/tools/perf/builtin-annotate.c b/tools/perf/builtin-annotate.c
> index a23ba6bb99b6..ad169e3e2e8f 100644
> --- a/tools/perf/builtin-annotate.c
> +++ b/tools/perf/builtin-annotate.c
> @@ -374,13 +374,6 @@ static void hists__find_annotations(struct hists *hists,
> } else {
> hist_entry__tty_annotate(he, evsel, ann);
> nd = rb_next(nd);
> -   /*
> -* Since we have a hist_entry per IP for the same
> -* symbol, free he->ms.sym->src to signal we already
> -* processed this symbol.
> -*/
> -   zfree(&notes->src->cycles_hist);
> -   zfree(&notes->src);
> }
> }
>  }
> @@ -624,6 +617,12 @@ int cmd_annotate(int argc, const char **argv)
> if (setup_sorting(annotate.session->evlist) < 0)
> usage_with_options(annotate_usage, options);
> } else {
> +   /*
> +* Events of different processes may correspond to the same
> +* symbol, we do not care about the processes in annotate,
> +* set sort order to avoid repeated output.
> +*/
> +   sort_order = "dso,symbol";
> if (setup_sorting(NULL) < 0)
> usage_with_options(annotate_usage, options);
> }
> --
> 2.30.GIT
>


Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF

2021-03-12 Thread Namhyung Kim
On Sat, Mar 13, 2021 at 12:38 AM Song Liu  wrote:
>
>
>
> > On Mar 12, 2021, at 12:36 AM, Namhyung Kim  wrote:
> >
> > Hi,
> >
> > On Fri, Mar 12, 2021 at 11:03 AM Song Liu  wrote:
> >>
> >> perf uses performance monitoring counters (PMCs) to monitor system
> >> performance. The PMCs are limited hardware resources. For example,
> >> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
> >>
> >> Modern data center systems use these PMCs in many different ways:
> >> system level monitoring, (maybe nested) container level monitoring, per
> >> process monitoring, profiling (in sample mode), etc. In some cases,
> >> there are more active perf_events than available hardware PMCs. To allow
> >> all perf_events to have a chance to run, it is necessary to do expensive
> >> time multiplexing of events.
> >>
> >> On the other hand, many monitoring tools count the common metrics (cycles,
> >> instructions). It is a waste to have multiple tools create multiple
> >> perf_events of "cycles" and occupy multiple PMCs.
> >>
> >> bperf tries to reduce such wastes by allowing multiple perf_events of
> >> "cycles" or "instructions" (at different scopes) to share PMUs. Instead
> >> of having each perf-stat session to read its own perf_events, bperf uses
> >> BPF programs to read the perf_events and aggregate readings to BPF maps.
> >> Then, the perf-stat session(s) reads the values from these BPF maps.
> >>
> >> Please refer to the comment before the definition of bperf_ops for the
> >> description of bperf architecture.
> >
> > Interesting!  Actually I thought about something similar before,
> > but my BPF knowledge is outdated.  So I need to catch up but
> > failed to have some time for it so far. ;-)
> >
> >>
> >> bperf is off by default. To enable it, pass --use-bpf option to perf-stat.
> >> bperf uses a BPF hashmap to share information about BPF programs and maps
> >> used by bperf. This map is pinned to bpffs. The default address is
> >> /sys/fs/bpf/bperf_attr_map. The user could change the address with option
> >> --attr-map.
> >>
> >> ---
> >> Known limitations:
> >> 1. Do not support per cgroup events;
> >> 2. Do not support monitoring of BPF program (perf-stat -b);
> >> 3. Do not support event groups.
> >
> > In my case, per cgroup event counting is very important.
> > And I'd like to do that with lots of cpus and cgroups.
>
> We can easily extend this approach to support cgroups events. I didn't
> implement it to keep the first version simple.

OK.

>
> > So I'm working on an in-kernel solution (without BPF),
> > I hope to share it soon.
>
> This is interesting! I cannot wait to see how it looks. I spent
> quite some time trying to enable in-kernel sharing (not just cgroup
> events), but finally decided to try the BPF approach.

Well I found it hard to support generic event sharing that works
for all use cases.  So I'm focusing on the per cgroup case only.

>
> >
> > And for event groups, it seems the current implementation
> > cannot handle more than one event (not even in a group).
> > That could be a serious limitation..
>
> It supports multiple events. Multiple events are independent, i.e.,
> "cycles" and "instructions" would use two independent leader programs.

OK, then do you need multiple bperf_attr_maps?  Does it work
for an arbitrary number of events?

>
> >
> >>
> >> The following commands have been tested:
> >>
> >>   perf stat --use-bpf -e cycles -a
> >>   perf stat --use-bpf -e cycles -C 1,3,4
> >>   perf stat --use-bpf -e cycles -p 123
> >>   perf stat --use-bpf -e cycles -t 100,101
> >
> > Hmm... so it loads both leader and follower programs if needed, right?
> > Does it support multiple followers with different targets at the same time?
>
> Yes, the whole idea is to have one leader program and multiple follower
> programs. If we only run one of these commands at a time, it will load
> one leader and one follower. If we run multiple of them in parallel,
> they will share the same leader program and load multiple follower
> programs.
>
> I actually tested more than the commands above. The list actually means
> we support -a, -C, -p, and -t.
>
> Currently, this works for multiple events, and different parallel
> perf-stat. The two commands below will work well in parallel:
>
>   perf stat --use-bpf -e ref-cycles,instructions -a
>   perf stat --use-bpf -e ref-cycles,cycles -C 1,3,5
>
> Note the use of ref-cycles, which can only use one counter on Intel CPUs.
> With this approach, the above two commands will not do time multiplexing
> on ref-cycles.

Awesome!

Thanks,
Namhyung


Re: [PATCH] perf annotate: Fix sample events lost in stdio mode

2021-03-12 Thread Namhyung Kim
On Fri, Mar 12, 2021 at 4:19 PM Yang Jihong  wrote:
>
>
> Hello,
> On 2021/3/12 13:49, Namhyung Kim wrote:
> > Hi,
> >
> > On Fri, Mar 12, 2021 at 12:24 PM Yang Jihong  wrote:
> >>
> >> Hello, Namhyung
> >>
> >> On 2021/3/11 22:42, Namhyung Kim wrote:
> >>> Hi,
> >>>
> >>> On Thu, Mar 11, 2021 at 5:48 PM Yang Jihong  
> >>> wrote:
> >>>>
> >>>> Hello,
> >>>>
> >>>> On 2021/3/6 16:28, Yang Jihong wrote:
> >>>>> In the hist__find_annotations function, since we have a hist_entry per
> >>>>> IP for the same symbol, we free notes->src to signal that we already
> >>>>> processed this symbol in stdio mode; when annotating, the entry will be
> >>>>> skipped if notes->src is NULL to avoid repeated output.
> >>>
> >>> I'm not sure it's still true that we have a hist_entry per IP.
> >>> Afaik the default sort key is comm,dso,sym which means it should have a 
> >>> single
> >>> hist_entry for each symbol.  It seems like an old comment..
> >>>
> >> Emm, yes, we have a hist_entry per IP.
> >> A member named "sym" in struct "hist_entry" points to a symbol, and
> >> different IPs may point to the same symbol.
> >
> > Are you sure about this?  It seems like a bug then.
> >
> Yes, now each IP corresponds to a hist_entry :)
>
> Last week I found that some sample events were missing when running perf
> annotate in stdio mode, so I went through the annotate code carefully.
>
> The event handling process is as follows:
> process_sample_event
>evsel_add_sample
>  hists__add_entry
>__hists__add_entry
>  hists__findnew_entry
> >hist_entry__new  -> here alloc new hist_entry

Yeah, so this is for a symbol.

>
>  hist_entry__inc_addr_samples
>symbol__inc_addr_samples
>  symbol__hists
> >annotated_source__new -> here alloc annotate source
>annotated_source__alloc_histograms -> here alloc histograms

This should be for each IP (ideally it should be per instruction).

>
> By bugs, do you mean there's something wrong?

No. I think we were talking about different things.  :)


> >>> diff --git a/tools/perf/builtin-annotate.c b/tools/perf/builtin-annotate.c
> >>> index a23ba6bb99b6..a91fe45bd69f 100644
> >>> --- a/tools/perf/builtin-annotate.c
> >>> +++ b/tools/perf/builtin-annotate.c
> >>> @@ -374,13 +374,6 @@ static void hists__find_annotations(struct hists 
> >>> *hists,
> >>>   } else {
> >>>   hist_entry__tty_annotate(he, evsel, ann);
> >>>   nd = rb_next(nd);
> >>> -   /*
> >>> -* Since we have a hist_entry per IP for the same
> >>> -* symbol, free he->ms.sym->src to signal we 
> >>> already
> >>> -* processed this symbol.
> >>> -*/
> >>> -   zfree(&notes->src->cycles_hist);
> >>> -   zfree(&notes->src);
> >>>   }
> >>>   }
> >>>}
> >>>
> >> This solution may have the following problem:
> >> For example, if two sample events are in two different processes but in
> >> the same symbol, repeated output may occur.
> >> Therefore, a flag is required to indicate whether the symbol has been
> >> processed to avoid repeated output.
> >
> > Hmm.. ok.  Yeah we don't care about the processes here.
> > Then we should remove it from the sort key like below:
> >
> > @@ -624,6 +617,7 @@ int cmd_annotate(int argc, const char **argv)
> >  if (setup_sorting(annotate.session->evlist) < 0)
> >  usage_with_options(annotate_usage, options);
> >  } else {
> > +   sort_order = "dso,symbol";
> >  if (setup_sorting(NULL) < 0)
> >  usage_with_options(annotate_usage, options);
> >  }
> >
> >
> Are you referring to this solution?
> --- a/tools/perf/builtin-annotate.c
> +++ b/tools/perf/builtin-annotate.c
> @@ -374,13 +374,6 @@ static void hists__find_annotations(struct hists
> *hists,
>  } else {
>   

Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF

2021-03-12 Thread Namhyung Kim
Hi,

On Fri, Mar 12, 2021 at 11:03 AM Song Liu  wrote:
>
> perf uses performance monitoring counters (PMCs) to monitor system
> performance. The PMCs are limited hardware resources. For example,
> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
>
> Modern data center systems use these PMCs in many different ways:
> system level monitoring, (maybe nested) container level monitoring, per
> process monitoring, profiling (in sample mode), etc. In some cases,
> there are more active perf_events than available hardware PMCs. To allow
> all perf_events to have a chance to run, it is necessary to do expensive
> time multiplexing of events.
>
> On the other hand, many monitoring tools count the common metrics (cycles,
> instructions). It is a waste to have multiple tools create multiple
> perf_events of "cycles" and occupy multiple PMCs.
>
> bperf tries to reduce such wastes by allowing multiple perf_events of
> "cycles" or "instructions" (at different scopes) to share PMUs. Instead
> of having each perf-stat session to read its own perf_events, bperf uses
> BPF programs to read the perf_events and aggregate readings to BPF maps.
> Then, the perf-stat session(s) reads the values from these BPF maps.
>
> Please refer to the comment before the definition of bperf_ops for the
> description of bperf architecture.

Interesting!  Actually I thought about something similar before,
but my BPF knowledge is outdated.  So I need to catch up but
failed to have some time for it so far. ;-)

>
> bperf is off by default. To enable it, pass --use-bpf option to perf-stat.
> bperf uses a BPF hashmap to share information about BPF programs and maps
> used by bperf. This map is pinned to bpffs. The default address is
> /sys/fs/bpf/bperf_attr_map. The user could change the address with option
> --attr-map.
>
> ---
> Known limitations:
> 1. Do not support per cgroup events;
> 2. Do not support monitoring of BPF program (perf-stat -b);
> 3. Do not support event groups.

In my case, per cgroup event counting is very important.
And I'd like to do that with lots of cpus and cgroups.
So I'm working on an in-kernel solution (without BPF),
I hope to share it soon.

And for event groups, it seems the current implementation
cannot handle more than one event (not even in a group).
That could be a serious limitation..

>
> The following commands have been tested:
>
>perf stat --use-bpf -e cycles -a
>perf stat --use-bpf -e cycles -C 1,3,4
>perf stat --use-bpf -e cycles -p 123
>perf stat --use-bpf -e cycles -t 100,101

Hmm... so it loads both leader and follower programs if needed, right?
Does it support multiple followers with different targets at the same time?

Thanks,
Namhyung


Re: [PATCH] perf annotate: Fix sample events lost in stdio mode

2021-03-11 Thread Namhyung Kim
Hi,

On Fri, Mar 12, 2021 at 12:24 PM Yang Jihong  wrote:
>
> Hello, Namhyung
>
> On 2021/3/11 22:42, Namhyung Kim wrote:
> > Hi,
> >
> > On Thu, Mar 11, 2021 at 5:48 PM Yang Jihong  wrote:
> >>
> >> Hello,
> >>
> >> On 2021/3/6 16:28, Yang Jihong wrote:
> >>> In the hist__find_annotations function, since we have a hist_entry per IP
> >>> for the same symbol, we free notes->src to signal that we already processed
> >>> this symbol in stdio mode; when annotating, the entry will be skipped if
> >>> notes->src is NULL to avoid repeated output.
> >
> > I'm not sure it's still true that we have a hist_entry per IP.
> > Afaik the default sort key is comm,dso,sym which means it should have a 
> > single
> > hist_entry for each symbol.  It seems like an old comment..
> >
> Emm, yes, we have a hist_entry per IP.
> A member named "sym" in struct "hist_entry" points to a symbol, and
> different IPs may point to the same symbol.

Are you sure about this?  It seems like a bug then.

>
> The hist_entry struct is as follows:
> struct hist_entry {
>  ...
>  struct map_symbol ms;
>  ...
> };
> struct map_symbol {
>  struct maps *maps;
>  struct map *map;
>  struct symbol *sym;
> };
>
> >>>
> >>> However, there is a problem, for example, run the following command:
> >>>
> >>># perf record -e branch-misses -e branch-instructions -a sleep 1
> >>>
> >>> The perf.data file contains different types of sample events.
> >>>
> >>> If the same IP sample event exists in both branch-misses and
> >>> branch-instructions, this event uses the same symbol. When annotating
> >>> branch-misses events, the notes->src corresponding to this event is set to
> >>> NULL; as a result, when annotating branch-instructions events, this event
> >>> is skipped and no annotation is output.
> >>>
> >>> The solution of this patch is to add a u8 member to struct sym_hist and
> >>> use a bit to indicate whether the symbol has been processed.
> >>> Because different types of events correspond to different sym_hist, no
> >>> conflict occurs.
> >>> ---
> >>>tools/perf/builtin-annotate.c | 22 ++
> >>>tools/perf/util/annotate.h|  4 
> >>>2 files changed, 18 insertions(+), 8 deletions(-)
> >>>
> >>> diff --git a/tools/perf/builtin-annotate.c b/tools/perf/builtin-annotate.c
> >>> index a23ba6bb99b6..c8c67892ae82 100644
> >>> --- a/tools/perf/builtin-annotate.c
> >>> +++ b/tools/perf/builtin-annotate.c
> >>> @@ -372,15 +372,21 @@ static void hists__find_annotations(struct hists 
> >>> *hists,
> >>>if (next != NULL)
> >>>nd = next;
> >>>} else {
> >>> - hist_entry__tty_annotate(he, evsel, ann);
> >>> + struct sym_hist *h = 
> >>> annotated_source__histogram(notes->src,
> >>> +  
> >>> evsel->idx);
> >>> +
> >>> + if (h->processed == 0) {
> >>> + hist_entry__tty_annotate(he, evsel, ann);
> >>> +
> >>> + /*
> >>> +  * Since we have a hist_entry per IP for 
> >>> the same
> >>> +  * symbol, set processed flag of evsel in 
> >>> sym_hist
> >>> +  * to signal we already processed this 
> >>> symbol.
> >>> +  */
> >>> + h->processed = 1;
> >>> + }
> >>> +
> >>>nd = rb_next(nd);
> >>> - /*
> >>> -  * Since we have a hist_entry per IP for the same
> >>> -  * symbol, free he->ms.sym->src to signal we already
> >>> -  * processed this symbol.
> >>> -  */
> >>> - zfree(&notes->src->cycles_hist);
> >>> - zfree(&notes->src);
> >

Re: [PATCH] perf annotate: Fix sample events lost in stdio mode

2021-03-11 Thread Namhyung Kim
Hi,

On Thu, Mar 11, 2021 at 5:48 PM Yang Jihong  wrote:
>
> Hello,
>
> On 2021/3/6 16:28, Yang Jihong wrote:
> > In the hist__find_annotations function, since we have a hist_entry per IP
> > for the same symbol, we free notes->src to signal that we already processed
> > this symbol in stdio mode; when annotating, the entry will be skipped if
> > notes->src is NULL to avoid repeated output.

I'm not sure it's still true that we have a hist_entry per IP.
Afaik the default sort key is comm,dso,sym which means it should have a single
hist_entry for each symbol.  It seems like an old comment..

> >
> > However, there is a problem, for example, run the following command:
> >
> >   # perf record -e branch-misses -e branch-instructions -a sleep 1
> >
> > The perf.data file contains different types of sample events.
> >
> > If the same IP sample event exists in both branch-misses and
> > branch-instructions, this event uses the same symbol. When annotating
> > branch-misses events, the notes->src corresponding to this event is set to
> > NULL; as a result, when annotating branch-instructions events, this event
> > is skipped and no annotation is output.
> >
> > The solution of this patch is to add a u8 member to struct sym_hist and
> > use a bit to indicate whether the symbol has been processed.
> > Because different types of events correspond to different sym_hist, no
> > conflict occurs.
> > ---
> >   tools/perf/builtin-annotate.c | 22 ++
> >   tools/perf/util/annotate.h|  4 
> >   2 files changed, 18 insertions(+), 8 deletions(-)
> >
> > diff --git a/tools/perf/builtin-annotate.c b/tools/perf/builtin-annotate.c
> > index a23ba6bb99b6..c8c67892ae82 100644
> > --- a/tools/perf/builtin-annotate.c
> > +++ b/tools/perf/builtin-annotate.c
> > @@ -372,15 +372,21 @@ static void hists__find_annotations(struct hists 
> > *hists,
> >   if (next != NULL)
> >   nd = next;
> >   } else {
> > - hist_entry__tty_annotate(he, evsel, ann);
> > + struct sym_hist *h = 
> > annotated_source__histogram(notes->src,
> > +  
> > evsel->idx);
> > +
> > + if (h->processed == 0) {
> > + hist_entry__tty_annotate(he, evsel, ann);
> > +
> > + /*
> > +  * Since we have a hist_entry per IP for the 
> > same
> > +  * symbol, set processed flag of evsel in 
> > sym_hist
> > +  * to signal we already processed this symbol.
> > +  */
> > + h->processed = 1;
> > + }
> > +
> >   nd = rb_next(nd);
> > - /*
> > -  * Since we have a hist_entry per IP for the same
> > -  * symbol, free he->ms.sym->src to signal we already
> > -  * processed this symbol.
> > -  */
> > - zfree(&notes->src->cycles_hist);
> > - zfree(&notes->src);
> >   }
> >   }
> >   }
> > diff --git a/tools/perf/util/annotate.h b/tools/perf/util/annotate.h
> > index 096cdaf21b01..89872bfdc958 100644
> > --- a/tools/perf/util/annotate.h
> > +++ b/tools/perf/util/annotate.h
> > @@ -228,6 +228,10 @@ void symbol__calc_percent(struct symbol *sym, struct 
> > evsel *evsel);
> >   struct sym_hist {
> >   u64   nr_samples;
> >   u64   period;
> > +
> > + u8 processed  : 1, /* whether symbol has been processed, used for annotate */
> > +   __reserved : 7;

I think just a bool member is fine.

> > +
> >   struct sym_hist_entry addr[];
> >   };
> >
> >
> Please check whether this solution is feasible; I look forward to your review.

What about this?  (not tested)

diff --git a/tools/perf/builtin-annotate.c b/tools/perf/builtin-annotate.c
index a23ba6bb99b6..a91fe45bd69f 100644
--- a/tools/perf/builtin-annotate.c
+++ b/tools/perf/builtin-annotate.c
@@ -374,13 +374,6 @@ static void hists__find_annotations(struct hists *hists,
} else {
hist_entry__tty_annotate(he, evsel, ann);
nd = rb_next(nd);
-   /*
-* Since we have a hist_entry per IP for the same
-* symbol, free he->ms.sym->src to signal we already
-* processed this symbol.
-*/
-   zfree(&notes->src->cycles_hist);
-   zfree(&notes->src);
}
}
 }

Thanks,
Namhyung


[PATCH 1/2] perf core: Add a kmem_cache for struct perf_event

2021-03-11 Thread Namhyung Kim
From: Namhyung Kim 

The kernel can allocate a lot of struct perf_event instances when profiling.
For example, 256 cpus x 8 events x 20 cgroups = 40K instances of the struct
would be allocated on a large system.

The size of struct perf_event in my setup is 1152 bytes. As it's
allocated by kmalloc, the actual allocation size would be rounded up
to 2K.

Then there are 896 bytes (~43%) of waste per instance, resulting in ~35MB
total with 40K instances. We can create a dedicated kmem_cache to
avoid such big, unnecessary memory consumption.

With this change, I can see the following (note this machine has 112 cpus).

  # grep perf_event /proc/slabinfo
  perf_event    224    784   1152    7    2 : tunables   24   12    8 : slabdata    112    112      0

The sixth column is pages-per-slab, which is 2, and the fifth column is
obj-per-slab, which is 7.  Thus it can actually use 1152 x 7 = 8064 bytes
of the 8K, and the wasted memory is (8192 - 8064) / 7 = ~18 bytes per
instance.
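
A quick back-of-the-envelope check of those numbers (illustration only,
using the 40K-instance example above):

  40,960 instances x (2048 - 1152) bytes = 36,700,160 bytes  (~35 MB wasted)
  8192 / 1152 = 7 objects per 2-page slab, so (8192 - 7 x 1152) / 7 = ~18 bytes/object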

Signed-off-by: Namhyung Kim 
---
 kernel/events/core.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 5206097d4d3d..10f2548211d0 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -402,6 +402,7 @@ static LIST_HEAD(pmus);
 static DEFINE_MUTEX(pmus_lock);
 static struct srcu_struct pmus_srcu;
 static cpumask_var_t perf_online_mask;
+static struct kmem_cache *perf_event_cache;
 
 /*
  * perf event paranoia level:
@@ -4591,7 +4592,7 @@ static void free_event_rcu(struct rcu_head *head)
if (event->ns)
put_pid_ns(event->ns);
perf_event_free_filter(event);
-   kfree(event);
+   kmem_cache_free(perf_event_cache, event);
 }
 
 static void ring_buffer_attach(struct perf_event *event,
@@ -11251,7 +11252,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
return ERR_PTR(-EINVAL);
}
 
-   event = kzalloc(sizeof(*event), GFP_KERNEL);
+   event = kmem_cache_zalloc(perf_event_cache, GFP_KERNEL);
if (!event)
return ERR_PTR(-ENOMEM);
 
@@ -11455,7 +11456,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
put_pid_ns(event->ns);
if (event->hw.target)
put_task_struct(event->hw.target);
-   kfree(event);
+   kmem_cache_free(perf_event_cache, event);
 
return ERR_PTR(err);
 }
@@ -13087,6 +13088,8 @@ void __init perf_event_init(void)
ret = init_hw_breakpoint();
WARN(ret, "hw_breakpoint initialization failed with: %d", ret);
 
+   perf_event_cache = KMEM_CACHE(perf_event, SLAB_PANIC);
+
/*
 * Build time assertion that we keep the data_head at the intended
 * location.  IOW, validation we got the __reserved[] size right.
-- 
2.31.0.rc2.261.g7f71774620-goog



[PATCH 2/2] perf core: Allocate perf_event in the target node memory

2021-03-11 Thread Namhyung Kim
For cpu events, it'd be better to allocate them in the corresponding node's
memory as they will be mostly accessed by the target cpu.  Although perf
tools set the cpu affinity before calling perf_event_open, there are places
that don't (notably perf record) and we should consider other external
users too.

Signed-off-by: Namhyung Kim 
---
 kernel/events/core.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 10f2548211d0..519faf0b7b54 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -11246,13 +11246,16 @@ perf_event_alloc(struct perf_event_attr *attr, int 
cpu,
struct perf_event *event;
struct hw_perf_event *hwc;
long err = -EINVAL;
+   int node;
 
if ((unsigned)cpu >= nr_cpu_ids) {
if (!task || cpu != -1)
return ERR_PTR(-EINVAL);
}
 
-   event = kmem_cache_zalloc(perf_event_cache, GFP_KERNEL);
+   node = (cpu >= 0) ? cpu_to_node(cpu) : -1;
+   event = kmem_cache_alloc_node(perf_event_cache, GFP_KERNEL | __GFP_ZERO,
+ node);
if (!event)
return ERR_PTR(-ENOMEM);
 
-- 
2.31.0.rc2.261.g7f71774620-goog



Re: [PATCH] perf diff: Don't crash on freeing errno-session

2021-03-02 Thread Namhyung Kim
Hello,

On Tue, Mar 2, 2021 at 11:35 AM Dmitry Safonov  wrote:
>
> __cmd_diff() sets result of perf_session__new() to d->session.
> In case of failure, it's errno and perf-diff may crash with:
> failed to open perf.data: Permission denied
> Failed to open perf.data
> Segmentation fault (core dumped)
>
> From the coredump:
> 0  0x5569a62b5955 in auxtrace__free (session=0x)
> at util/auxtrace.c:2681
> 1  0x5569a626b37d in perf_session__delete (session=0x)
> at util/session.c:295
> 2  perf_session__delete (session=0x) at util/session.c:291
> 3  0x5569a618008a in __cmd_diff () at builtin-diff.c:1239
> 4  cmd_diff (argc=, argv=) at 
> builtin-diff.c:2011
> [..]
>
> Funny enough, it won't always crash. For me it crashes only if the failed
> file is second on the cmd-line: the reason is that cmd_diff() checks files for
> branch-stacks [in check_file_brstack()] and if the first file doesn't
> have brstacks, it doesn't proceed to try to open the other files from the cmd-line.
>
> Check d->session before calling perf_session__delete().
>
> Another solution would be assigning to temporary variable, checking it,
> but I find it easier to follow with IS_ERR() check in the same function.
> After some time it's still obvious why the check is needed, and with
> temp variable it's possible to make the same mistake.
>
> Cc: Alexander Shishkin 
> Cc: Arnaldo Carvalho de Melo 
> Cc: Ingo Molnar 
> Cc: Jiri Olsa 
> Cc: Mark Rutland 
> Cc: Namhyung Kim 
> Cc: Peter Zijlstra 
> Signed-off-by: Dmitry Safonov 

Acked-by: Namhyung Kim 

Thanks,
Namhyung


> ---
>  tools/perf/builtin-diff.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/tools/perf/builtin-diff.c b/tools/perf/builtin-diff.c
> index cefc71506409..b0c57e55052d 100644
> --- a/tools/perf/builtin-diff.c
> +++ b/tools/perf/builtin-diff.c
> @@ -1236,7 +1236,8 @@ static int __cmd_diff(void)
>
>   out_delete:
> data__for_each_file(i, d) {
> -   perf_session__delete(d->session);
> +   if (!IS_ERR(d->session))
> +   perf_session__delete(d->session);
> data__free(d);
> }
>
> --
> 2.30.1
>


Re: [PATCH] perf stat: improve readability of shadow stats

2021-03-01 Thread Namhyung Kim
On Tue, Mar 2, 2021 at 4:19 AM Jiri Olsa  wrote:
>
> On Tue, Mar 02, 2021 at 01:24:02AM +0800, Changbin Du wrote:
> > This does follow two changes:
> >   1) Select appropriate unit between K/M/G.
> >   2) Use 'cpu-sec' instead of 'sec' to state this is not the wall-time.
> >
> > $ sudo ./perf stat -a -- sleep 1
> >
> > Before: Unit 'M' is selected even if the number is very small.
> >  Performance counter stats for 'system wide':
> >
> >   4,003.06 msec cpu-clock #3.998 CPUs utilized
> > 16,179  context-switches  #0.004 M/sec
> >161  cpu-migrations#0.040 K/sec
> >  4,699  page-faults   #0.001 M/sec
> >  6,135,801,925  cycles#1.533 GHz
> >   (83.21%)
> >  5,783,308,491  stalled-cycles-frontend   #   94.26% frontend 
> > cycles idle (83.21%)
> >  4,543,694,050  stalled-cycles-backend#   74.05% backend cycles 
> > idle  (66.49%)
> >  4,720,130,587  instructions  #0.77  insn per cycle
> >   #1.23  stalled cycles 
> > per insn  (83.28%)
> >753,848,078  branches  #  188.318 M/sec  
> >   (83.61%)
> > 37,457,747  branch-misses #4.97% of all 
> > branches  (83.48%)
> >
> >1.001283725 seconds time elapsed
> >
> > After:
> > $ sudo ./perf stat -a -- sleep 2
> >
> >  Performance counter stats for 'system wide':
> >
> >   8,003.20 msec cpu-clock #3.998 CPUs utilized
> >  9,768  context-switches  #1.221 K/cpu-sec
> >164  cpu-migrations#   20.492  /cpu-sec
>
> should you remove also the leading '/' in ' /cpu-sec' ?

The change looks good.  And I think we should keep '/' otherwise it'd be
more confusing.

>
>
> SNIP
>
> > @@ -1270,18 +1271,14 @@ void perf_stat__print_shadow_stats(struct 
> > perf_stat_config *config,
> >   generic_metric(config, evsel->metric_expr, 
> > evsel->metric_events, NULL,
> >   evsel->name, evsel->metric_name, NULL, 1, 
> > cpu, out, st);
> >   } else if (runtime_stat_n(st, STAT_NSECS, cpu, &rsd) != 0) {
> > - char unit = 'M';
> > + char unit = ' ';
> >   char unit_buf[10];
> >
> >   total = runtime_stat_avg(st, STAT_NSECS, cpu, &rsd);
> > -
> >   if (total)
> > - ratio = 1000.0 * avg / total;
> > - if (ratio < 0.001) {
> > - ratio *= 1000;
> > - unit = 'K';
> > - }
> > - snprintf(unit_buf, sizeof(unit_buf), "%c/sec", unit);
> > + ratio = convert_unit_double(1000000000.0 * avg / total, &unit);
> > +
> > + snprintf(unit_buf, sizeof(unit_buf), "%c/cpu-sec", unit);
> >   print_metric(config, ctxp, NULL, "%8.3f", unit_buf, ratio);
>
> hum this will change -x output that people parse, so I don't think we can do 
> that

Agreed.

>
> >   } else if (perf_stat_evsel__is(evsel, SMI_NUM)) {
> >   print_smi_cost(config, cpu, out, st, &rsd);
> > diff --git a/tools/perf/util/units.c b/tools/perf/util/units.c
> > index a46762aec4c9..ac13b5ecde31 100644
> > --- a/tools/perf/util/units.c
> > +++ b/tools/perf/util/units.c
> > @@ -55,6 +55,28 @@ unsigned long convert_unit(unsigned long value, char 
> > *unit)
> >   return value;
> >  }
> >
> > +double convert_unit_double(double value, char *unit)
> > +{
> > + *unit = ' ';
> > +
> > + if (value > 1000.0) {
> > + value /= 1000.0;
> > + *unit = 'K';
> > + }
> > +
> > + if (value > 1000.0) {
> > + value /= 1000.0;
> > + *unit = 'M';
> > + }
> > +
> > + if (value > 1000.0) {
> > + value /= 1000.0;
> > + *unit = 'G';
> > + }
> > +
> > + return value;
> > +}
>
> we have convert_unit function just above doing the same only with
> unsigned long.. let's have one base function with double values and
> another one casting the result to unsigned long

Sounds good.
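
For instance, an untested sketch of that refactor, keeping the double
version from this patch as the base implementation and making the integer
variant a thin wrapper:

double convert_unit_double(double value, char *unit)
{
	*unit = ' ';

	if (value > 1000.0) {
		value /= 1000.0;
		*unit = 'K';
	}
	if (value > 1000.0) {
		value /= 1000.0;
		*unit = 'M';
	}
	if (value > 1000.0) {
		value /= 1000.0;
		*unit = 'G';
	}
	return value;
}

/* the integer variant keeps its old truncating behavior via the cast */
unsigned long convert_unit(unsigned long value, char *unit)
{
	return (unsigned long)convert_unit_double((double)value, unit);
}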

Thanks,
Namhyung


Re: [PATCH 04/11] perf test: Fix cpu and thread map leaks in sw_clock_freq test

2021-03-01 Thread Namhyung Kim
Hi Jiri,

On Tue, Mar 2, 2021 at 2:24 AM Jiri Olsa  wrote:
>
> On Mon, Mar 01, 2021 at 11:04:02PM +0900, Namhyung Kim wrote:
> > The evlist holds its own references to the maps, so we don't need to set
> > the pointers to NULL.  Otherwise the following error is reported by Asan.
> >
> > Also merge the goto labels since we don't need two of them.
> >
> >   # perf test -v 25
> >   25: Software clock events period values:
> >   --- start ---
> >   test child forked, pid 149154
> >   mmap size 528384B
> >   mmap size 528384B
> >
> >   =
> >   ==149154==ERROR: LeakSanitizer: detected memory leaks
> >
> >   Direct leak of 32 byte(s) in 1 object(s) allocated from:
> > #0 0x7fef5cd071f8 in __interceptor_realloc 
> > ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:164
> > #1 0x56260d5e8b8e in perf_thread_map__realloc 
> > /home/namhyung/project/linux/tools/lib/perf/threadmap.c:23
> > #2 0x56260d3df7a9 in thread_map__new_by_tid util/thread_map.c:63
> > #3 0x56260d2ac6b2 in __test__sw_clock_freq tests/sw-clock.c:65
> > #4 0x56260d26d8fb in run_test tests/builtin-test.c:428
> > #5 0x56260d26d8fb in test_and_print tests/builtin-test.c:458
> > #6 0x56260d26fa53 in __cmd_test tests/builtin-test.c:679
> > #7 0x56260d26fa53 in cmd_test tests/builtin-test.c:825
> > #8 0x56260d2dbb64 in run_builtin 
> > /home/namhyung/project/linux/tools/perf/perf.c:313
> > #9 0x56260d165a88 in handle_internal_command 
> > /home/namhyung/project/linux/tools/perf/perf.c:365
> > #10 0x56260d165a88 in run_argv 
> > /home/namhyung/project/linux/tools/perf/perf.c:409
> > #11 0x56260d165a88 in main 
> > /home/namhyung/project/linux/tools/perf/perf.c:539
> >     #12 0x7fef5c83cd09 in __libc_start_main ../csu/libc-start.c:308
> >
> > ...
> >   test child finished with 1
> >    end 
> >   Software clock events period values  : FAILED!
> >
> > Signed-off-by: Namhyung Kim 
> > ---
> >  tools/perf/tests/sw-clock.c | 12 
> >  1 file changed, 4 insertions(+), 8 deletions(-)
> >
> > diff --git a/tools/perf/tests/sw-clock.c b/tools/perf/tests/sw-clock.c
> > index a49c9e23053b..74988846be1d 100644
> > --- a/tools/perf/tests/sw-clock.c
> > +++ b/tools/perf/tests/sw-clock.c
> > @@ -42,8 +42,8 @@ static int __test__sw_clock_freq(enum perf_sw_ids 
> > clock_id)
> >   .disabled = 1,
> >   .freq = 1,
> >   };
> > - struct perf_cpu_map *cpus;
> > - struct perf_thread_map *threads;
> > + struct perf_cpu_map *cpus = NULL;
> > + struct perf_thread_map *threads = NULL;
> >   struct mmap *md;
> >
> >   attr.sample_freq = 500;
> > @@ -66,14 +66,11 @@ static int __test__sw_clock_freq(enum perf_sw_ids 
> > clock_id)
> >   if (!cpus || !threads) {
> >   err = -ENOMEM;
> >   pr_debug("Not enough memory to create thread/cpu maps\n");
> > - goto out_free_maps;
> > + goto out_delete_evlist;
> >   }
> >
> >   perf_evlist__set_maps(&evlist->core, cpus, threads);
> >
> > - cpus    = NULL;
> > - threads = NULL;
>
> hum, so IIUC we added these and the others you remove in your patches a
> long time ago, because there was no refcounting at that time, right?

It seems my original patch just set the maps directly.

  bc96b361cbf9 perf tests: Add a test case for checking sw clock event frequency

And after that Adrian changed it to use the set_maps() helper.

  c5e6bd2ed3e8 perf tests: Fix software clock events test setting maps

It seems we already had the refcounting at that time.  And then the libperf
renaming happened later.

Thanks,
Namhyung


Re: [PATCH v3 07/12] perf record: init data file at mmap buffer object

2021-03-01 Thread Namhyung Kim
On Mon, Mar 1, 2021 at 10:33 PM Bayduraev, Alexey V
 wrote:
>
> On 01.03.2021 14:44, Namhyung Kim wrote:
> > Hello,
> >
> > On Mon, Mar 1, 2021 at 8:16 PM Bayduraev, Alexey V
> >  wrote:
> >>
> >> Hi,
> >>
> >> On 20.11.2020 13:49, Namhyung Kim wrote:
> >>> On Mon, Nov 16, 2020 at 03:19:41PM +0300, Alexey Budankov wrote:
> >>
> >> 
> >>
> >>>>
> >>>> @@ -1400,8 +1417,12 @@ static int record__mmap_read_evlist(struct record 
> >>>> *rec, struct evlist *evlist,
> >>>>  /*
> >>>>   * Mark the round finished in case we wrote
> >>>>   * at least one event.
> >>>> + *
> >>>> + * No need for round events in directory mode,
> >>>> + * because per-cpu maps and files have data
> >>>> + * sorted by kernel.
> >>>
> >>> But it's not just for single cpu since task can migrate so we need to
> >>> look at other cpu's data too.  Thus we use the ordered events queue
> >>> and round events help to determine when to flush the data.  Without
> >>> the round events, it'd consume huge amount of memory during report.
> >>>
> >>> If we separate tracking records and process them first, we should be
> >>> able to process samples immediately without sorting them in the
> >>> ordered event queue.  This will save both cpu cycles and memory
> >>> footprint significantly IMHO.
> >>>
> >>> Thanks,
> >>> Namhyung
> >>>
> >>
> >> As far as I understand, to split tracing records (FORK/MMAP/COMM) into
> >> a separate file, we need to implement a runtime trace decoder on the
> >> perf-record side to recognize such tracing records coming from the kernel.
> >> Is that what you mean?
> >
> > No, I meant separating the mmap buffers so that the record process
> > can save the data without decoding.
> >
>
> Thanks,
>
> Do you think this can be implemented only on the user side by creating a dummy
> event and manipulating the mmap/comm/task flags of struct perf_event_attr?
> Or some changes on the kernel side are necessary?

It's only user space changes but it can be large.  Actually I worked on
parallelizing perf report several years ago (not finished, but I don't have
time for it now).  At the time, perf record didn't support directory output
so I made it have indexes to different data parts. But you can get the idea
from the code in

  
https://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git/log/?h=perf/threaded-v5

Thanks,
Namhyung


[PATCH 11/11] perf test: Fix cpu and thread map leaks in perf_time_to_tsc test

2021-03-01 Thread Namhyung Kim
It should release the maps at the end.

  $ perf test -v 71
  71: Convert perf time to TSC   :
  --- start ---
  test child forked, pid 178744
  mmap size 528384B
  1st event perf time 59207256505278 tsc 13187166645142
  rdtsc  time 59207256542151 tsc 13187166723020
  2nd event perf time 59207256543749 tsc 13187166726393

  =
  ==178744==ERROR: LeakSanitizer: detected memory leaks

  Direct leak of 40 byte(s) in 1 object(s) allocated from:
#0 0x7faf601f9e8f in __interceptor_malloc 
../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:145
#1 0x55b620cfc00a in cpu_map__trim_new 
/home/namhyung/project/linux/tools/lib/perf/cpumap.c:79
#2 0x55b620cfca2f in perf_cpu_map__read 
/home/namhyung/project/linux/tools/lib/perf/cpumap.c:149
#3 0x55b620cfd1ef in cpu_map__read_all_cpu_map 
/home/namhyung/project/linux/tools/lib/perf/cpumap.c:166
#4 0x55b620cfd1ef in perf_cpu_map__new 
/home/namhyung/project/linux/tools/lib/perf/cpumap.c:181
#5 0x55b6209ef1b2 in test__perf_time_to_tsc tests/perf-time-to-tsc.c:73
#6 0x55b6209828fb in run_test tests/builtin-test.c:428
#7 0x55b6209828fb in test_and_print tests/builtin-test.c:458
#8 0x55b620984a53 in __cmd_test tests/builtin-test.c:679
#9 0x55b620984a53 in cmd_test tests/builtin-test.c:825
#10 0x55b6209f0cd4 in run_builtin 
/home/namhyung/project/linux/tools/perf/perf.c:313
#11 0x55b62087aa88 in handle_internal_command 
/home/namhyung/project/linux/tools/perf/perf.c:365
#12 0x55b62087aa88 in run_argv 
/home/namhyung/project/linux/tools/perf/perf.c:409
#13 0x55b62087aa88 in main 
/home/namhyung/project/linux/tools/perf/perf.c:539
#14 0x7faf5fd2fd09 in __libc_start_main ../csu/libc-start.c:308

  SUMMARY: AddressSanitizer: 72 byte(s) leaked in 2 allocation(s).
  test child finished with 1
   end 
  Convert perf time to TSC: FAILED!

Signed-off-by: Namhyung Kim 
---
 tools/perf/tests/perf-time-to-tsc.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/perf/tests/perf-time-to-tsc.c 
b/tools/perf/tests/perf-time-to-tsc.c
index 7cff02664d0e..680c3cffb128 100644
--- a/tools/perf/tests/perf-time-to-tsc.c
+++ b/tools/perf/tests/perf-time-to-tsc.c
@@ -167,6 +167,8 @@ int test__perf_time_to_tsc(struct test *test 
__maybe_unused, int subtest __maybe
 
 out_err:
evlist__delete(evlist);
+   perf_cpu_map__put(cpus);
+   perf_thread_map__put(threads);
return err;
 }
 
-- 
2.30.1.766.gb4fecdf3b7-goog



[PATCH 10/11] perf test: Fix cpu map leaks in cpu_map_print test

2021-03-01 Thread Namhyung Kim
It should be released after printing the map.

  $ perf test -v 52
  52: Print cpu map  :
  --- start ---
  test child forked, pid 172233

  =
  ==172233==ERROR: LeakSanitizer: detected memory leaks

  Direct leak of 156 byte(s) in 1 object(s) allocated from:
#0 0x7fc472518e8f in __interceptor_malloc 
../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:145
#1 0x55e63b378f7a in cpu_map__trim_new 
/home/namhyung/project/linux/tools/lib/perf/cpumap.c:79
#2 0x55e63b37a05c in perf_cpu_map__new 
/home/namhyung/project/linux/tools/lib/perf/cpumap.c:237
#3 0x55e63b056d16 in cpu_map_print tests/cpumap.c:102
#4 0x55e63b056d16 in test__cpu_map_print tests/cpumap.c:120
#5 0x55e63afff8fb in run_test tests/builtin-test.c:428
#6 0x55e63afff8fb in test_and_print tests/builtin-test.c:458
#7 0x55e63b001a53 in __cmd_test tests/builtin-test.c:679
#8 0x55e63b001a53 in cmd_test tests/builtin-test.c:825
#9 0x55e63b06dc44 in run_builtin 
/home/namhyung/project/linux/tools/perf/perf.c:313
#10 0x55e63aef7a88 in handle_internal_command 
/home/namhyung/project/linux/tools/perf/perf.c:365
#11 0x55e63aef7a88 in run_argv 
/home/namhyung/project/linux/tools/perf/perf.c:409
#12 0x55e63aef7a88 in main 
/home/namhyung/project/linux/tools/perf/perf.c:539
#13 0x7fc47204ed09 in __libc_start_main ../csu/libc-start.c:308
  ...

  SUMMARY: AddressSanitizer: 448 byte(s) leaked in 7 allocation(s).
  test child finished with 1
   end 
  Print cpu map: FAILED!

Signed-off-by: Namhyung Kim 
---
 tools/perf/tests/cpumap.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/perf/tests/cpumap.c b/tools/perf/tests/cpumap.c
index 29c793ac7d10..0472b110fe65 100644
--- a/tools/perf/tests/cpumap.c
+++ b/tools/perf/tests/cpumap.c
@@ -106,6 +106,8 @@ static int cpu_map_print(const char *str)
return -1;
 
cpu_map__snprint(map, buf, sizeof(buf));
+   perf_cpu_map__put(map);
+
return !strcmp(buf, str);
 }
 
-- 
2.30.1.766.gb4fecdf3b7-goog



[PATCH 06/11] perf test: Fix cpu and thread map leaks in keep_tracking test

2021-03-01 Thread Namhyung Kim
The evlist and the cpu/thread maps should be released together.
Otherwise the following error is reported by Asan.

  $ perf test -v 28
  28: Use a dummy software event to keep tracking:
  --- start ---
  test child forked, pid 156810
  mmap size 528384B

  =
  ==156810==ERROR: LeakSanitizer: detected memory leaks

  Direct leak of 40 byte(s) in 1 object(s) allocated from:
#0 0x7f637d2bce8f in __interceptor_malloc 
../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:145
#1 0x55cc6295cffa in cpu_map__trim_new 
/home/namhyung/project/linux/tools/lib/perf/cpumap.c:79
#2 0x55cc6295da1f in perf_cpu_map__read 
/home/namhyung/project/linux/tools/lib/perf/cpumap.c:149
#3 0x55cc6295e1df in cpu_map__read_all_cpu_map 
/home/namhyung/project/linux/tools/lib/perf/cpumap.c:166
#4 0x55cc6295e1df in perf_cpu_map__new 
/home/namhyung/project/linux/tools/lib/perf/cpumap.c:181
#5 0x55cc626287cf in test__keep_tracking tests/keep-tracking.c:84
#6 0x55cc625e38fb in run_test tests/builtin-test.c:428
#7 0x55cc625e38fb in test_and_print tests/builtin-test.c:458
#8 0x55cc625e5a53 in __cmd_test tests/builtin-test.c:679
#9 0x55cc625e5a53 in cmd_test tests/builtin-test.c:825
#10 0x55cc62651cc4 in run_builtin 
/home/namhyung/project/linux/tools/perf/perf.c:313
#11 0x55cc624dba88 in handle_internal_command 
/home/namhyung/project/linux/tools/perf/perf.c:365
#12 0x55cc624dba88 in run_argv 
/home/namhyung/project/linux/tools/perf/perf.c:409
#13 0x55cc624dba88 in main 
/home/namhyung/project/linux/tools/perf/perf.c:539
#14 0x7f637cdf2d09 in __libc_start_main ../csu/libc-start.c:308

  SUMMARY: AddressSanitizer: 72 byte(s) leaked in 2 allocation(s).
  test child finished with 1
   end 
  Use a dummy software event to keep tracking: FAILED!

Signed-off-by: Namhyung Kim 
---
 tools/perf/tests/keep-tracking.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/tools/perf/tests/keep-tracking.c b/tools/perf/tests/keep-tracking.c
index e6f1b2a38e03..a0438b0f0805 100644
--- a/tools/perf/tests/keep-tracking.c
+++ b/tools/perf/tests/keep-tracking.c
@@ -154,10 +154,9 @@ int test__keep_tracking(struct test *test __maybe_unused, 
int subtest __maybe_un
if (evlist) {
evlist__disable(evlist);
evlist__delete(evlist);
-   } else {
-   perf_cpu_map__put(cpus);
-   perf_thread_map__put(threads);
}
+   perf_cpu_map__put(cpus);
+   perf_thread_map__put(threads);
 
return err;
 }
-- 
2.30.1.766.gb4fecdf3b7-goog



[PATCH 08/11] perf test: Fix a thread map leak in thread_map_synthesize test

2021-03-01 Thread Namhyung Kim
It misses a call to perf_thread_map__put() after using the map.

  $ perf test -v 43
  43: Synthesize thread map  :
  --- start ---
  test child forked, pid 162640

  =
  ==162640==ERROR: LeakSanitizer: detected memory leaks

  Direct leak of 32 byte(s) in 1 object(s) allocated from:
#0 0x7fd48cdaa1f8 in __interceptor_realloc 
../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:164
#1 0x563e6d5f8d0e in perf_thread_map__realloc 
/home/namhyung/project/linux/tools/lib/perf/threadmap.c:23
#2 0x563e6d3ef69a in thread_map__new_by_pid util/thread_map.c:46
#3 0x563e6d2cec90 in test__thread_map_synthesize tests/thread-map.c:97
#4 0x563e6d27d8fb in run_test tests/builtin-test.c:428
#5 0x563e6d27d8fb in test_and_print tests/builtin-test.c:458
#6 0x563e6d27fa53 in __cmd_test tests/builtin-test.c:679
#7 0x563e6d27fa53 in cmd_test tests/builtin-test.c:825
#8 0x563e6d2ebce4 in run_builtin 
/home/namhyung/project/linux/tools/perf/perf.c:313
#9 0x563e6d175a88 in handle_internal_command 
/home/namhyung/project/linux/tools/perf/perf.c:365
#10 0x563e6d175a88 in run_argv 
/home/namhyung/project/linux/tools/perf/perf.c:409
#11 0x563e6d175a88 in main 
/home/namhyung/project/linux/tools/perf/perf.c:539
#12 0x7fd48c8dfd09 in __libc_start_main ../csu/libc-start.c:308

  SUMMARY: AddressSanitizer: 8224 byte(s) leaked in 2 allocation(s).
  test child finished with 1
   end 
  Synthesize thread map: FAILED!

Signed-off-by: Namhyung Kim 
---
 tools/perf/tests/thread-map.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/perf/tests/thread-map.c b/tools/perf/tests/thread-map.c
index 28f51c4bd373..9e1cf11149ef 100644
--- a/tools/perf/tests/thread-map.c
+++ b/tools/perf/tests/thread-map.c
@@ -102,6 +102,7 @@ int test__thread_map_synthesize(struct test *test 
__maybe_unused, int subtest __
TEST_ASSERT_VAL("failed to synthesize map",
!perf_event__synthesize_thread_map2(NULL, threads, 
process_event, NULL));
 
+   perf_thread_map__put(threads);
return 0;
 }
 
-- 
2.30.1.766.gb4fecdf3b7-goog



[PATCH 02/11] perf test: Fix a memory leak in attr test

2021-03-01 Thread Namhyung Kim
get_argv_exec_path() returns dynamically allocated memory, so it should be
freed after use.

  $ perf test -v 17
  ...
  ==141682==ERROR: LeakSanitizer: detected memory leaks

  Direct leak of 33 byte(s) in 1 object(s) allocated from:
#0 0x7f09107d2e8f in __interceptor_malloc 
../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:145
#1 0x7f091035f6a7 in __vasprintf_internal libio/vasprintf.c:71

  SUMMARY: AddressSanitizer: 33 byte(s) leaked in 1 allocation(s).

Signed-off-by: Namhyung Kim 
---
 tools/perf/tests/attr.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/tools/perf/tests/attr.c b/tools/perf/tests/attr.c
index ec972e0892ab..dd39ce9b0277 100644
--- a/tools/perf/tests/attr.c
+++ b/tools/perf/tests/attr.c
@@ -182,14 +182,20 @@ int test__attr(struct test *test __maybe_unused, int 
subtest __maybe_unused)
struct stat st;
char path_perf[PATH_MAX];
char path_dir[PATH_MAX];
+   char *exec_path;
 
/* First try development tree tests. */
if (!lstat("./tests", ))
return run_dir("./tests", "./perf");
 
+   exec_path = get_argv_exec_path();
+   if (exec_path == NULL)
+   return -1;
+
/* Then installed path. */
-   snprintf(path_dir,  PATH_MAX, "%s/tests", get_argv_exec_path());
+   snprintf(path_dir,  PATH_MAX, "%s/tests", exec_path);
snprintf(path_perf, PATH_MAX, "%s/perf", BINDIR);
+   free(exec_path);
 
if (!lstat(path_dir, &st) &&
!lstat(path_perf, &st))
-- 
2.30.1.766.gb4fecdf3b7-goog



[PATCH 05/11] perf test: Fix cpu and thread map leaks in code_reading test

2021-03-01 Thread Namhyung Kim
The evlist and the cpu/thread maps should be released together.
Otherwise the following error is reported by Asan.

Note that this test still has memory leaks in DSOs so it still fails
even after this change.  I'll take a look at that too.

  # perf test -v 26
  26: Object code reading:
  --- start ---
  test child forked, pid 154184
  Looking at the vmlinux_path (8 entries long)
  symsrc__init: build id mismatch for vmlinux.
  symsrc__init: cannot get elf header.
  Using /proc/kcore for kernel data
  Using /proc/kallsyms for symbols
  Parsing event 'cycles'
  mmap size 528384B
  ...
  =
  ==154184==ERROR: LeakSanitizer: detected memory leaks

  Direct leak of 439 byte(s) in 1 object(s) allocated from:
#0 0x7fcb66e77037 in __interceptor_calloc 
../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x55ad9b7e821e in dso__new_id util/dso.c:1256
#2 0x55ad9b8cfd4a in __machine__addnew_vdso util/vdso.c:132
#3 0x55ad9b8cfd4a in machine__findnew_vdso util/vdso.c:347
#4 0x55ad9b845b7e in map__new util/map.c:176
#5 0x55ad9b8415a2 in machine__process_mmap2_event util/machine.c:1787
#6 0x55ad9b8fab16 in perf_tool__process_synth_event 
util/synthetic-events.c:64
#7 0x55ad9b8fab16 in perf_event__synthesize_mmap_events 
util/synthetic-events.c:499
#8 0x55ad9b8fbfdf in __event__synthesize_thread util/synthetic-events.c:741
#9 0x55ad9b8ff3e3 in perf_event__synthesize_thread_map 
util/synthetic-events.c:833
#10 0x55ad9b738585 in do_test_code_reading tests/code-reading.c:608
#11 0x55ad9b73b25d in test__code_reading tests/code-reading.c:722
#12 0x55ad9b6f28fb in run_test tests/builtin-test.c:428
#13 0x55ad9b6f28fb in test_and_print tests/builtin-test.c:458
#14 0x55ad9b6f4a53 in __cmd_test tests/builtin-test.c:679
#15 0x55ad9b6f4a53 in cmd_test tests/builtin-test.c:825
#16 0x55ad9b760cc4 in run_builtin 
/home/namhyung/project/linux/tools/perf/perf.c:313
#17 0x55ad9b5eaa88 in handle_internal_command 
/home/namhyung/project/linux/tools/perf/perf.c:365
#18 0x55ad9b5eaa88 in run_argv 
/home/namhyung/project/linux/tools/perf/perf.c:409
#19 0x55ad9b5eaa88 in main 
/home/namhyung/project/linux/tools/perf/perf.c:539
#20 0x7fcb669acd09 in __libc_start_main ../csu/libc-start.c:308

...
  SUMMARY: AddressSanitizer: 471 byte(s) leaked in 2 allocation(s).
  test child finished with 1
   end 
  Object code reading: FAILED!

Signed-off-by: Namhyung Kim 
---
 tools/perf/tests/code-reading.c | 10 +++---
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/tools/perf/tests/code-reading.c b/tools/perf/tests/code-reading.c
index 280f0348a09c..2fdc7b2f996e 100644
--- a/tools/perf/tests/code-reading.c
+++ b/tools/perf/tests/code-reading.c
@@ -706,13 +706,9 @@ static int do_test_code_reading(bool try_kcore)
 out_put:
thread__put(thread);
 out_err:
-
-   if (evlist) {
-   evlist__delete(evlist);
-   } else {
-   perf_cpu_map__put(cpus);
-   perf_thread_map__put(threads);
-   }
+   evlist__delete(evlist);
+   perf_cpu_map__put(cpus);
+   perf_thread_map__put(threads);
machine__delete_threads(machine);
machine__delete(machine);
 
-- 
2.30.1.766.gb4fecdf3b7-goog
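
The ownership rule stated in the message -- release the evlist and the
maps together -- falls out of reference counting:
perf_evlist__set_maps() takes its own references on the maps, so the
creator's initial references always need a matching put, whether or not
an evlist took them over.  A toy model of that ownership (a sketch, not
the libperf API):

#include <stdio.h>
#include <stdlib.h>

/* Toy refcounted map standing in for perf_cpu_map/perf_thread_map. */
struct map { int refcnt; };

static struct map *map_new(void)
{
	struct map *m = calloc(1, sizeof(*m));

	if (m)
		m->refcnt = 1;
	return m;
}

static struct map *map_get(struct map *m)
{
	if (m)
		m->refcnt++;
	return m;
}

static void map_put(struct map *m)
{
	if (m && --m->refcnt == 0) {
		printf("map freed\n");
		free(m);
	}
}

/* Stand-in for perf_evlist__set_maps(): the evlist takes its own reference. */
struct evlist { struct map *cpus; };

static void evlist_set_maps(struct evlist *e, struct map *cpus)
{
	e->cpus = map_get(cpus);
}

static void evlist_delete(struct evlist *e)
{
	map_put(e->cpus);	/* drops only the evlist's reference */
}

int main(void)
{
	struct map *cpus = map_new();	/* creator's reference */
	struct evlist evlist = { NULL };

	if (cpus == NULL)
		return 1;

	evlist_set_maps(&evlist, cpus);
	evlist_delete(&evlist);
	map_put(cpus);		/* creator's put: only now is the map freed */
	return 0;
}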



[PATCH 03/11] perf test: Fix cpu and thread map leaks in task_exit test

2021-03-01 Thread Namhyung Kim
The evlist holds the maps with its own refcounts, so we don't need to
set the pointers to NULL.  Otherwise the following error was reported
by Asan.

Also merge the goto labels since two are no longer needed.

  # perf test -v 24
  24: Number of exit events of a simple workload :
  --- start ---
  test child forked, pid 145915
  mmap size 528384B

  =
  ==145915==ERROR: LeakSanitizer: detected memory leaks

  Direct leak of 32 byte(s) in 1 object(s) allocated from:
#0 0x7fc44e50d1f8 in __interceptor_realloc 
../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:164
#1 0x561cf50f4d2e in perf_thread_map__realloc 
/home/namhyung/project/linux/tools/lib/perf/threadmap.c:23
#2 0x561cf4eeb949 in thread_map__new_by_tid util/thread_map.c:63
#3 0x561cf4db7fd2 in test__task_exit tests/task-exit.c:74
#4 0x561cf4d798fb in run_test tests/builtin-test.c:428
#5 0x561cf4d798fb in test_and_print tests/builtin-test.c:458
#6 0x561cf4d7ba53 in __cmd_test tests/builtin-test.c:679
#7 0x561cf4d7ba53 in cmd_test tests/builtin-test.c:825
#8 0x561cf4de7d04 in run_builtin 
/home/namhyung/project/linux/tools/perf/perf.c:313
#9 0x561cf4c71a88 in handle_internal_command 
/home/namhyung/project/linux/tools/perf/perf.c:365
#10 0x561cf4c71a88 in run_argv 
/home/namhyung/project/linux/tools/perf/perf.c:409
#11 0x561cf4c71a88 in main 
/home/namhyung/project/linux/tools/perf/perf.c:539
#12 0x7fc44e042d09 in __libc_start_main ../csu/libc-start.c:308

...
  test child finished with 1
   end 
  Number of exit events of a simple workload: FAILED!

Signed-off-by: Namhyung Kim 
---
 tools/perf/tests/task-exit.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/tools/perf/tests/task-exit.c b/tools/perf/tests/task-exit.c
index bbf94e4aa145..4c2969db59b0 100644
--- a/tools/perf/tests/task-exit.c
+++ b/tools/perf/tests/task-exit.c
@@ -75,14 +75,11 @@ int test__task_exit(struct test *test __maybe_unused, int 
subtest __maybe_unused
if (!cpus || !threads) {
err = -ENOMEM;
pr_debug("Not enough memory to create thread/cpu maps\n");
-   goto out_free_maps;
+   goto out_delete_evlist;
}
 
 	perf_evlist__set_maps(&evlist->core, cpus, threads);
 
-   cpus= NULL;
-   threads = NULL;
-
 	err = evlist__prepare_workload(evlist, &target, argv, false,
 				       workload_exec_failed_signal);
if (err < 0) {
pr_debug("Couldn't run the workload!\n");
@@ -137,7 +134,7 @@ int test__task_exit(struct test *test __maybe_unused, int 
subtest __maybe_unused
if (retry_count++ > 1000) {
pr_debug("Failed after retrying 1000 times\n");
err = -1;
-   goto out_free_maps;
+   goto out_delete_evlist;
}
 
goto retry;
@@ -148,10 +145,9 @@ int test__task_exit(struct test *test __maybe_unused, int 
subtest __maybe_unused
err = -1;
}
 
-out_free_maps:
+out_delete_evlist:
perf_cpu_map__put(cpus);
perf_thread_map__put(threads);
-out_delete_evlist:
evlist__delete(evlist);
return err;
 }
-- 
2.30.1.766.gb4fecdf3b7-goog



[PATCH 04/11] perf test: Fix cpu and thread map leaks in sw_clock_freq test

2021-03-01 Thread Namhyung Kim
The evlist holds the maps with its own refcounts, so we don't need to
set the pointers to NULL.  Otherwise the following error was reported
by Asan.

Also merge the goto labels since two are no longer needed.

  # perf test -v 25
  25: Software clock events period values:
  --- start ---
  test child forked, pid 149154
  mmap size 528384B
  mmap size 528384B

  =
  ==149154==ERROR: LeakSanitizer: detected memory leaks

  Direct leak of 32 byte(s) in 1 object(s) allocated from:
#0 0x7fef5cd071f8 in __interceptor_realloc 
../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:164
#1 0x56260d5e8b8e in perf_thread_map__realloc 
/home/namhyung/project/linux/tools/lib/perf/threadmap.c:23
#2 0x56260d3df7a9 in thread_map__new_by_tid util/thread_map.c:63
#3 0x56260d2ac6b2 in __test__sw_clock_freq tests/sw-clock.c:65
#4 0x56260d26d8fb in run_test tests/builtin-test.c:428
#5 0x56260d26d8fb in test_and_print tests/builtin-test.c:458
#6 0x56260d26fa53 in __cmd_test tests/builtin-test.c:679
#7 0x56260d26fa53 in cmd_test tests/builtin-test.c:825
#8 0x56260d2dbb64 in run_builtin 
/home/namhyung/project/linux/tools/perf/perf.c:313
#9 0x56260d165a88 in handle_internal_command 
/home/namhyung/project/linux/tools/perf/perf.c:365
#10 0x56260d165a88 in run_argv 
/home/namhyung/project/linux/tools/perf/perf.c:409
#11 0x56260d165a88 in main 
/home/namhyung/project/linux/tools/perf/perf.c:539
#12 0x7fef5c83cd09 in __libc_start_main ../csu/libc-start.c:308

...
  test child finished with 1
   end 
  Software clock events period values  : FAILED!

Signed-off-by: Namhyung Kim 
---
 tools/perf/tests/sw-clock.c | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/tools/perf/tests/sw-clock.c b/tools/perf/tests/sw-clock.c
index a49c9e23053b..74988846be1d 100644
--- a/tools/perf/tests/sw-clock.c
+++ b/tools/perf/tests/sw-clock.c
@@ -42,8 +42,8 @@ static int __test__sw_clock_freq(enum perf_sw_ids clock_id)
.disabled = 1,
.freq = 1,
};
-   struct perf_cpu_map *cpus;
-   struct perf_thread_map *threads;
+   struct perf_cpu_map *cpus = NULL;
+   struct perf_thread_map *threads = NULL;
struct mmap *md;
 
attr.sample_freq = 500;
@@ -66,14 +66,11 @@ static int __test__sw_clock_freq(enum perf_sw_ids clock_id)
if (!cpus || !threads) {
err = -ENOMEM;
pr_debug("Not enough memory to create thread/cpu maps\n");
-   goto out_free_maps;
+   goto out_delete_evlist;
}
 
 	perf_evlist__set_maps(&evlist->core, cpus, threads);
 
-   cpus= NULL;
-   threads = NULL;
-
if (evlist__open(evlist)) {
const char *knob = 
"/proc/sys/kernel/perf_event_max_sample_rate";
 
@@ -129,10 +126,9 @@ static int __test__sw_clock_freq(enum perf_sw_ids clock_id)
err = -1;
}
 
-out_free_maps:
+out_delete_evlist:
perf_cpu_map__put(cpus);
perf_thread_map__put(threads);
-out_delete_evlist:
evlist__delete(evlist);
return err;
 }
-- 
2.30.1.766.gb4fecdf3b7-goog
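
Two details make the single cleanup label above safe: the maps are now
initialized to NULL, and the put functions accept NULL as a no-op, just
like free().  A small sketch of the resulting error-handling shape:

#include <stdlib.h>

struct cpu_map { int refcnt; };

/* Modeled on perf_cpu_map__put(): a NULL argument is a no-op. */
static void cpu_map_put(struct cpu_map *m)
{
	if (m && --m->refcnt == 0)
		free(m);
}

static int run(int simulate_failure)
{
	struct cpu_map *cpus = NULL;	/* NULL-init: cleanup is always safe */
	int err = -1;

	if (!simulate_failure) {
		cpus = calloc(1, sizeof(*cpus));
		if (cpus)
			cpus->refcnt = 1;
	}
	if (cpus == NULL)
		goto out_delete;	/* one cleanup label is enough */

	err = 0;			/* ... the real test body ... */

out_delete:
	cpu_map_put(cpus);		/* no-op when the allocation failed */
	return err;
}

int main(void)
{
	return (run(1) == -1 && run(0) == 0) ? 0 : 1;
}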



[PATCHSET 00/11] perf test: Fix cpu/thread map leaks

2021-03-01 Thread Namhyung Kim
Hi,

This patchset fixes memory leaks in the perf test code.  In my company
setup, it runs daily with various sanitizers on, so I want to reduce
the failures due to the leaks, not the logic.

This time I've focused on the cpu and thread maps as they are obvious
and easy to fix.  I'll take a look at the rest failures.

I didn't add the Fixes: tags since most changes seem to predate the
libperf change.  I'm not sure if I could just add the original commit
hash as this fix is meaningful only if Asan is enabled.  I'm afraid
the stable tree maintainers will see patches that don't apply cleanly.
But I can add them if you want, so please let me know.

It's also available at perf/asan-fix-v1 branch in

  git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git

Thanks,
Namhyung


Namhyung Kim (11):
  perf test: Fix cpu and thread map leaks in basic mmap test
  perf test: Fix a memory leak in attr test
  perf test: Fix cpu and thread map leaks in task_exit test
  perf test: Fix cpu and thread map leaks in sw_clock_freq test
  perf test: Fix cpu and thread map leaks in code_reading test
  perf test: Fix cpu and thread map leaks in keep_tracking test
  perf test: Fix cpu and thread map leaks in switch_tracking test
  perf test: Fix a thread map leak in thread_map_synthesize test
  perf test: Fix a memory leak in thread_map_remove test
  perf test: Fix cpu map leaks in cpu_map_print test
  perf test: Fix cpu and thread map leaks in perf_time_to_tsc test

 tools/perf/tests/attr.c |  8 +++++++-
 tools/perf/tests/code-reading.c | 10 +++-------
 tools/perf/tests/cpumap.c   |  2 ++
 tools/perf/tests/keep-tracking.c|  5 ++---
 tools/perf/tests/mmap-basic.c   |  2 --
 tools/perf/tests/perf-time-to-tsc.c |  2 ++
 tools/perf/tests/sw-clock.c | 12 ++++--------
 tools/perf/tests/switch-tracking.c  |  5 ++---
 tools/perf/tests/task-exit.c| 10 +++-------
 tools/perf/tests/thread-map.c   |  8 +++-----
 10 files changed, 28 insertions(+), 36 deletions(-)

-- 
2.30.1.766.gb4fecdf3b7-goog



[PATCH 01/11] perf test: Fix cpu and thread map leaks in basic mmap test

2021-03-01 Thread Namhyung Kim
The evlist holds the maps with its own refcounts, so we don't need to
set the pointers to NULL.  Otherwise the following error was reported by Asan.

  # perf test -v 4
   4: Read samples using the mmap interface  :
  --- start ---
  test child forked, pid 139782
  mmap size 528384B

  =
  ==139782==ERROR: LeakSanitizer: detected memory leaks

  Direct leak of 40 byte(s) in 1 object(s) allocated from:
#0 0x7f1f76daee8f in __interceptor_malloc 
../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:145
#1 0x564ba21a0fea in cpu_map__trim_new 
/home/namhyung/project/linux/tools/lib/perf/cpumap.c:79
#2 0x564ba21a1a0f in perf_cpu_map__read 
/home/namhyung/project/linux/tools/lib/perf/cpumap.c:149
#3 0x564ba21a21cf in cpu_map__read_all_cpu_map 
/home/namhyung/project/linux/tools/lib/perf/cpumap.c:166
#4 0x564ba21a21cf in perf_cpu_map__new 
/home/namhyung/project/linux/tools/lib/perf/cpumap.c:181
#5 0x564ba1e48298 in test__basic_mmap tests/mmap-basic.c:55
#6 0x564ba1e278fb in run_test tests/builtin-test.c:428
#7 0x564ba1e278fb in test_and_print tests/builtin-test.c:458
#8 0x564ba1e29a53 in __cmd_test tests/builtin-test.c:679
#9 0x564ba1e29a53 in cmd_test tests/builtin-test.c:825
#10 0x564ba1e95cb4 in run_builtin 
/home/namhyung/project/linux/tools/perf/perf.c:313
#11 0x564ba1d1fa88 in handle_internal_command 
/home/namhyung/project/linux/tools/perf/perf.c:365
#12 0x564ba1d1fa88 in run_argv 
/home/namhyung/project/linux/tools/perf/perf.c:409
#13 0x564ba1d1fa88 in main 
/home/namhyung/project/linux/tools/perf/perf.c:539
#14 0x7f1f768e4d09 in __libc_start_main ../csu/libc-start.c:308

...
  test child finished with 1
   end 
  Read samples using the mmap interface: FAILED!
  failed to open shell test directory: 
/home/namhyung/libexec/perf-core/tests/shell

Signed-off-by: Namhyung Kim 
---
 tools/perf/tests/mmap-basic.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/tools/perf/tests/mmap-basic.c b/tools/perf/tests/mmap-basic.c
index 57093aeacc6f..73ae8f7aa066 100644
--- a/tools/perf/tests/mmap-basic.c
+++ b/tools/perf/tests/mmap-basic.c
@@ -158,8 +158,6 @@ int test__basic_mmap(struct test *test __maybe_unused, int 
subtest __maybe_unuse
 
 out_delete_evlist:
evlist__delete(evlist);
-   cpus= NULL;
-   threads = NULL;
 out_free_cpus:
perf_cpu_map__put(cpus);
 out_free_threads:
-- 
2.30.1.766.gb4fecdf3b7-goog



[PATCH 09/11] perf test: Fix a memory leak in thread_map_remove test

2021-03-01 Thread Namhyung Kim
The str should be freed after creating the thread map.  Also replace
the open-coded thread map deletion with a call to perf_thread_map__put().

  $ perf test -v 44
  44: Remove thread map  :
  --- start ---
  test child forked, pid 165536
  2 threads: 165535, 165536
  1 thread: 165536
  0 thread:

  =
  ==165536==ERROR: LeakSanitizer: detected memory leaks

  Direct leak of 14 byte(s) in 1 object(s) allocated from:
#0 0x7f54453ffe8f in __interceptor_malloc 
../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:145
#1 0x7f5444f8c6a7 in __vasprintf_internal libio/vasprintf.c:71

  SUMMARY: AddressSanitizer: 14 byte(s) leaked in 1 allocation(s).
  test child finished with 1
   end 
  Remove thread map: FAILED!

Signed-off-by: Namhyung Kim 
---
 tools/perf/tests/thread-map.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/tools/perf/tests/thread-map.c b/tools/perf/tests/thread-map.c
index 9e1cf11149ef..d1e208b4a571 100644
--- a/tools/perf/tests/thread-map.c
+++ b/tools/perf/tests/thread-map.c
@@ -110,12 +110,12 @@ int test__thread_map_remove(struct test *test 
__maybe_unused, int subtest __mayb
 {
struct perf_thread_map *threads;
char *str;
-   int i;
 
TEST_ASSERT_VAL("failed to allocate map string",
 			asprintf(&str, "%d,%d", getpid(), getppid()) >= 0);
 
threads = thread_map__new_str(str, NULL, 0, false);
+   free(str);
 
TEST_ASSERT_VAL("failed to allocate thread_map",
threads);
@@ -142,9 +142,6 @@ int test__thread_map_remove(struct test *test 
__maybe_unused, int subtest __mayb
TEST_ASSERT_VAL("failed to not remove thread",
thread_map__remove(threads, 0));
 
-   for (i = 0; i < threads->nr; i++)
-		zfree(&threads->map[i].comm);
-
-   free(threads);
+   perf_thread_map__put(threads);
return 0;
 }
-- 
2.30.1.766.gb4fecdf3b7-goog
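
The free(str) added above follows from asprintf() semantics: it
allocates the buffer and hands ownership to the caller, while
thread_map__new_str() only parses the string and does not take that
ownership.  A minimal sketch of the idiom:

#define _GNU_SOURCE	/* for asprintf() */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	char *str;

	/* asprintf() allocates; the caller owns the result. */
	if (asprintf(&str, "%d,%d", getpid(), getppid()) < 0)
		return 1;

	/* A consumer that only parses the string (as thread_map__new_str()
	 * does) leaves ownership with us, so free right after the call. */
	printf("thread list: %s\n", str);
	free(str);
	return 0;
}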



[PATCH 07/11] perf test: Fix cpu and thread map leaks in switch_tracking test

2021-03-01 Thread Namhyung Kim
The evlist and cpu/thread maps should be released together.
Otherwise the following error was reported by Asan.

  $ perf test -v 35
  35: Track with sched_switch:
  --- start ---
  test child forked, pid 159287
  Using CPUID GenuineIntel-6-8E-C
  mmap size 528384B
  1295 events recorded

  =
  ==159287==ERROR: LeakSanitizer: detected memory leaks

  Direct leak of 40 byte(s) in 1 object(s) allocated from:
#0 0x7fa28d9a2e8f in __interceptor_malloc 
../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:145
#1 0x5652f5a5affa in cpu_map__trim_new 
/home/namhyung/project/linux/tools/lib/perf/cpumap.c:79
#2 0x5652f5a5ba1f in perf_cpu_map__read 
/home/namhyung/project/linux/tools/lib/perf/cpumap.c:149
#3 0x5652f5a5c1df in cpu_map__read_all_cpu_map 
/home/namhyung/project/linux/tools/lib/perf/cpumap.c:166
#4 0x5652f5a5c1df in perf_cpu_map__new 
/home/namhyung/project/linux/tools/lib/perf/cpumap.c:181
#5 0x5652f5723bbf in test__switch_tracking tests/switch-tracking.c:350
#6 0x5652f56e18fb in run_test tests/builtin-test.c:428
#7 0x5652f56e18fb in test_and_print tests/builtin-test.c:458
#8 0x5652f56e3a53 in __cmd_test tests/builtin-test.c:679
#9 0x5652f56e3a53 in cmd_test tests/builtin-test.c:825
#10 0x5652f574fcc4 in run_builtin 
/home/namhyung/project/linux/tools/perf/perf.c:313
#11 0x5652f55d9a88 in handle_internal_command 
/home/namhyung/project/linux/tools/perf/perf.c:365
#12 0x5652f55d9a88 in run_argv 
/home/namhyung/project/linux/tools/perf/perf.c:409
#13 0x5652f55d9a88 in main 
/home/namhyung/project/linux/tools/perf/perf.c:539
#14 0x7fa28d4d8d09 in __libc_start_main ../csu/libc-start.c:308

  SUMMARY: AddressSanitizer: 72 byte(s) leaked in 2 allocation(s).
  test child finished with 1
   end 
  Track with sched_switch: FAILED!

Signed-off-by: Namhyung Kim 
---
 tools/perf/tests/switch-tracking.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/tools/perf/tests/switch-tracking.c 
b/tools/perf/tests/switch-tracking.c
index 15a2ab765d89..3ebaa758df77 100644
--- a/tools/perf/tests/switch-tracking.c
+++ b/tools/perf/tests/switch-tracking.c
@@ -574,10 +574,9 @@ int test__switch_tracking(struct test *test 
__maybe_unused, int subtest __maybe_
if (evlist) {
evlist__disable(evlist);
evlist__delete(evlist);
-   } else {
-   perf_cpu_map__put(cpus);
-   perf_thread_map__put(threads);
}
+   perf_cpu_map__put(cpus);
+   perf_thread_map__put(threads);
 
return err;
 
-- 
2.30.1.766.gb4fecdf3b7-goog



Re: [PATCH v3 07/12] perf record: init data file at mmap buffer object

2021-03-01 Thread Namhyung Kim
Hello,

On Mon, Mar 1, 2021 at 8:16 PM Bayduraev, Alexey V
 wrote:
>
> Hi,
>
> On 20.11.2020 13:49, Namhyung Kim wrote:
> > On Mon, Nov 16, 2020 at 03:19:41PM +0300, Alexey Budankov wrote:
>
> 
>
> >>
> >> @@ -1400,8 +1417,12 @@ static int record__mmap_read_evlist(struct record 
> >> *rec, struct evlist *evlist,
> >>  /*
> >>   * Mark the round finished in case we wrote
> >>   * at least one event.
> >> + *
> >> + * No need for round events in directory mode,
> >> + * because per-cpu maps and files have data
> >> + * sorted by kernel.
> >
> > But it's not just for single cpu since task can migrate so we need to
> > look at other cpu's data too.  Thus we use the ordered events queue
> > and round events help to determine when to flush the data.  Without
> > the round events, it'd consume huge amount of memory during report.
> >
> > If we separate tracking records and process them first, we should be
> > able to process samples immediately without sorting them in the
> > ordered event queue.  This will save both cpu cycles and memory
> > footprint significantly IMHO.
> >
> > Thanks,
> > Namhyung
> >
>
> As far as I understand, to split tracing records (FORK/MMAP/COMM) into
> a separate file, we need to implement a runtime trace decoder on the
> perf-record side to recognize such tracing records coming from the kernel.
> Is that what you mean?

No, I meant separating the mmap buffers so that the record process
can save the data without decoding.

>
> IMHO this can be tricky to implement and adds some overhead that can lead
> to possible data loss. Do you have any other ideas how to optimize memory
> consumption on perf-report side without a runtime trace decoder?
> Maybe "round events" would somehow help in directory mode?
>
> BTW In our tool we use another approach: two-pass trace file loading.
> The first loads tracing records, the second loads samples.

Yeah, something like that.  With the separated data, we can do it
more efficiently IMHO.

Thanks,
Namhyung
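
For readers following the thread: the round events discussed above
(PERF_RECORD_FINISHED_ROUND) mark points where perf record has drained
all the mmap buffers, so the report side can sort and flush what it has
queued instead of buffering the whole file.  A deliberately simplified
model of that flush logic -- the real code in util/ordered-events.c
keeps a two-round lookahead and is considerably more careful:

#include <stdio.h>
#include <stdlib.h>

#define MAX_QUEUED 1024

/* Minimal stand-in for a perf sample: only the timestamp matters here. */
struct event { unsigned long long time; };

static struct event queue[MAX_QUEUED];
static int nr_queued;

static int cmp_time(const void *a, const void *b)
{
	const struct event *ea = a, *eb = b;

	return (ea->time > eb->time) - (ea->time < eb->time);
}

/* On a round marker, what is buffered so far can be sorted and emitted;
 * the marker bounds how long events must be held in memory. */
static void flush_round(void)
{
	qsort(queue, nr_queued, sizeof(queue[0]), cmp_time);
	for (int i = 0; i < nr_queued; i++)
		printf("event @%llu\n", queue[i].time);
	nr_queued = 0;	/* the queue never outlives a round */
}

static void deliver(unsigned long long time)
{
	if (nr_queued < MAX_QUEUED)
		queue[nr_queued++] = (struct event){ .time = time };
}

int main(void)
{
	deliver(30); deliver(10); deliver(20);	/* out-of-order per-CPU data */
	flush_round();				/* round boundary */
	deliver(50); deliver(40);
	flush_round();
	return 0;
}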


Re: [PATCH] perf trace: Ensure read cmdlines are null terminated.

2021-02-28 Thread Namhyung Kim
Hi Ian,

On Sat, Feb 27, 2021 at 7:14 AM Ian Rogers  wrote:
>
> Issue detected by address sanitizer.
>
> Fixes: cd4ceb63438e (perf util: Save pid-cmdline mapping into tracing header)
> Signed-off-by: Ian Rogers 

Acked-by: Namhyung Kim 

Thanks,
Namhyung

> ---
>  tools/perf/util/trace-event-read.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/tools/perf/util/trace-event-read.c 
> b/tools/perf/util/trace-event-read.c
> index f507dff713c9..8a01af783310 100644
> --- a/tools/perf/util/trace-event-read.c
> +++ b/tools/perf/util/trace-event-read.c
> @@ -361,6 +361,7 @@ static int read_saved_cmdline(struct tep_handle *pevent)
> pr_debug("error reading saved cmdlines\n");
> goto out;
> }
> +   buf[ret] = '\0';
>
> parse_saved_cmdline(pevent, buf, size);
> ret = 0;
> --
> 2.30.1.766.gb4fecdf3b7-goog
>
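
The one-line fix relies on a general rule: read(2) returns raw bytes
and never NUL-terminates, so a buffer that is later handed to
string-parsing code needs one spare byte and an explicit terminator.
A minimal sketch of the safe pattern:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

static char *read_file_string(const char *path, size_t size)
{
	/* One extra byte reserved for the terminator. */
	char *buf = malloc(size + 1);
	ssize_t ret;
	int fd;

	if (buf == NULL)
		return NULL;

	fd = open(path, O_RDONLY);
	if (fd < 0) {
		free(buf);
		return NULL;
	}

	ret = read(fd, buf, size);
	close(fd);
	if (ret < 0) {
		free(buf);
		return NULL;
	}

	buf[ret] = '\0';	/* read() does not terminate the data itself */
	return buf;
}

int main(void)
{
	char *s = read_file_string("/proc/version", 255);

	if (s) {
		printf("%s\n", s);	/* safe: s is a proper C string now */
		free(s);
	}
	return 0;
}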


[PATCH v2 2/2] perf stat: Fix use-after-free when -r option is used

2021-02-24 Thread Namhyung Kim
I got a segfault when using the -r option with event groups.  The option
makes it run the workload multiple times, reusing the evlist and evsel
for each run.

While most of the resources are allocated and freed properly, the id hash
in the evlist was not, and that resulted in the bug.  You can see it with
the address sanitizer like below:

  $ perf stat -r 100 -e '{cycles,instructions}' true
  =
  ==693052==ERROR: AddressSanitizer: heap-use-after-free on
  address 0x608003d0 at pc 0x558c57732835 bp 0x7fff1526adb0 sp 
0x7fff1526ada8
  WRITE of size 8 at 0x608003d0 thread T0
#0 0x558c57732834 in hlist_add_head 
/home/namhyung/project/linux/tools/include/linux/list.h:644
#1 0x558c57732834 in perf_evlist__id_hash 
/home/namhyung/project/linux/tools/lib/perf/evlist.c:237
#2 0x558c57732834 in perf_evlist__id_add 
/home/namhyung/project/linux/tools/lib/perf/evlist.c:244
#3 0x558c57732834 in perf_evlist__id_add_fd 
/home/namhyung/project/linux/tools/lib/perf/evlist.c:285
#4 0x558c5747733e in store_evsel_ids util/evsel.c:2765
#5 0x558c5747733e in evsel__store_ids util/evsel.c:2782
#6 0x558c5730b717 in __run_perf_stat 
/home/namhyung/project/linux/tools/perf/builtin-stat.c:895
#7 0x558c5730b717 in run_perf_stat 
/home/namhyung/project/linux/tools/perf/builtin-stat.c:1014
#8 0x558c5730b717 in cmd_stat 
/home/namhyung/project/linux/tools/perf/builtin-stat.c:2446
#9 0x558c57427c24 in run_builtin 
/home/namhyung/project/linux/tools/perf/perf.c:313
#10 0x558c572b1a48 in handle_internal_command 
/home/namhyung/project/linux/tools/perf/perf.c:365
#11 0x558c572b1a48 in run_argv 
/home/namhyung/project/linux/tools/perf/perf.c:409
#12 0x558c572b1a48 in main 
/home/namhyung/project/linux/tools/perf/perf.c:539
#13 0x7fcadb9f7d09 in __libc_start_main ../csu/libc-start.c:308
#14 0x558c572b60f9 in _start 
(/home/namhyung/project/linux/tools/perf/perf+0x45d0f9)

Actually the nodes in the hash table are struct perf_stream_id and
they were freed in the previous run.  Fix it by resetting the hash.

Signed-off-by: Namhyung Kim 
---
 tools/perf/util/evlist.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 5121b4db66fe..882cd1f721d9 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -1306,6 +1306,7 @@ void evlist__close(struct evlist *evlist)
 		perf_evsel__free_fd(&evsel->core);
 		perf_evsel__free_id(&evsel->core);
 	}
+	perf_evlist__reset_id_hash(&evlist->core);
 }
 
 static int evlist__create_syswide_maps(struct evlist *evlist)
-- 
2.30.0.617.g56c4b15f3c-goog
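
In list terms, the bug is that the bucket heads outlive the nodes:
closing the evlist frees the per-evsel id entries, but the evlist's
hash heads keep pointing at them, so the next run's hlist_add_head()
writes through dangling pointers.  A toy reproduction of the shape of
the problem, using plain pointers instead of the kernel's hlist:

#include <stdio.h>
#include <stdlib.h>

/* Toy singly-linked hash bucket in the spirit of hlist_head. */
struct node { struct node *next; long id; };
struct head { struct node *first; };

#define NR_BUCKETS 4
static struct head heads[NR_BUCKETS];

static void add(long id)
{
	struct node *n = malloc(sizeof(*n));
	struct head *h = &heads[id % NR_BUCKETS];

	if (!n)
		return;
	n->id = id;
	n->next = h->first;
	h->first = n;
}

/* Frees the nodes, as closing the evlist does; without do_reset the
 * bucket head would keep pointing at freed memory. */
static void free_nodes(struct head *h, int do_reset)
{
	struct node *n = h->first, *next;

	while (n) {
		next = n->next;
		free(n);
		n = next;
	}
	if (do_reset)
		h->first = NULL;	/* the equivalent of the reset fix */
}

int main(void)
{
	int i;

	add(1);
	add(5);
	for (i = 0; i < NR_BUCKETS; i++)
		free_nodes(&heads[i], /*do_reset=*/1);

	/* Second run: only safe because the heads were reset above. */
	add(1);
	printf("second run ok: id=%ld\n", heads[1 % NR_BUCKETS].first->id);
	return 0;
}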



[PATCH v2 1/2] libperf: Add perf_evlist__reset_id_hash()

2021-02-24 Thread Namhyung Kim
Add the perf_evlist__reset_id_hash() function as an internal function
so that it can be called by perf to reset the hash table.  This is
necessary for perf stat to run the workload multiple times.

Signed-off-by: Namhyung Kim 
---
 tools/lib/perf/evlist.c  | 13 +++++++++----
 tools/lib/perf/include/internal/evlist.h |  2 ++
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/tools/lib/perf/evlist.c b/tools/lib/perf/evlist.c
index 17465d454a0e..a0aaf385cbb5 100644
--- a/tools/lib/perf/evlist.c
+++ b/tools/lib/perf/evlist.c
@@ -26,13 +26,10 @@
 
 void perf_evlist__init(struct perf_evlist *evlist)
 {
-   int i;
-
-   for (i = 0; i < PERF_EVLIST__HLIST_SIZE; ++i)
-		INIT_HLIST_HEAD(&evlist->heads[i]);
 	INIT_LIST_HEAD(&evlist->entries);
 	evlist->nr_entries = 0;
 	fdarray__init(&evlist->pollfd, 64);
+   perf_evlist__reset_id_hash(evlist);
 }
 
 static void __perf_evlist__propagate_maps(struct perf_evlist *evlist,
@@ -237,6 +234,14 @@ static void perf_evlist__id_hash(struct perf_evlist 
*evlist,
 	hlist_add_head(&sid->node, &evlist->heads[hash]);
 }
 
+void perf_evlist__reset_id_hash(struct perf_evlist *evlist)
+{
+   int i;
+
+   for (i = 0; i < PERF_EVLIST__HLIST_SIZE; ++i)
+		INIT_HLIST_HEAD(&evlist->heads[i]);
+}
+
 void perf_evlist__id_add(struct perf_evlist *evlist,
 struct perf_evsel *evsel,
 int cpu, int thread, u64 id)
diff --git a/tools/lib/perf/include/internal/evlist.h 
b/tools/lib/perf/include/internal/evlist.h
index 2d0fa02b036f..212c29063ad4 100644
--- a/tools/lib/perf/include/internal/evlist.h
+++ b/tools/lib/perf/include/internal/evlist.h
@@ -124,4 +124,6 @@ int perf_evlist__id_add_fd(struct perf_evlist *evlist,
   struct perf_evsel *evsel,
   int cpu, int thread, int fd);
 
+void perf_evlist__reset_id_hash(struct perf_evlist *evlist);
+
 #endif /* __LIBPERF_INTERNAL_EVLIST_H */
-- 
2.30.0.617.g56c4b15f3c-goog



Re: [PATCH 2/2] perf stat: Fix segfault when -r option is used

2021-02-24 Thread Namhyung Kim
On Wed, Feb 24, 2021 at 5:11 PM Namhyung Kim  wrote:
>
> I got a segfault when using -r option with event groups.  The option
> makes it run the workload multiple times and it will reuse the evlist
> and evsel for each run.

Well, you might not see a segfault because the freed memory region is
likely to be reused.  But you can see the bug clearly with Asan.

Thanks,
Namhyung

>
> While most of resources are allocated and freed properly, the id hash
> in the evlist was not and it resulted in a crash.  You can see it with
> the address sanitizer like below:
>
>   $ perf stat -r 100 -e '{cycles,instructions}' true
>   =
>   ==693052==ERROR: AddressSanitizer: heap-use-after-free on
>   address 0x608003d0 at pc 0x558c57732835 bp 0x7fff1526adb0 sp 
> 0x7fff1526ada8
>   WRITE of size 8 at 0x608003d0 thread T0
> #0 0x558c57732834 in hlist_add_head 
> /home/namhyung/project/linux/tools/include/linux/list.h:644
> #1 0x558c57732834 in perf_evlist__id_hash 
> /home/namhyung/project/linux/tools/lib/perf/evlist.c:237
> #2 0x558c57732834 in perf_evlist__id_add 
> /home/namhyung/project/linux/tools/lib/perf/evlist.c:244
> #3 0x558c57732834 in perf_evlist__id_add_fd 
> /home/namhyung/project/linux/tools/lib/perf/evlist.c:285
> #4 0x558c5747733e in store_evsel_ids util/evsel.c:2765
> #5 0x558c5747733e in evsel__store_ids util/evsel.c:2782
> #6 0x558c5730b717 in __run_perf_stat 
> /home/namhyung/project/linux/tools/perf/builtin-stat.c:895
> #7 0x558c5730b717 in run_perf_stat 
> /home/namhyung/project/linux/tools/perf/builtin-stat.c:1014
> #8 0x558c5730b717 in cmd_stat 
> /home/namhyung/project/linux/tools/perf/builtin-stat.c:2446
> #9 0x558c57427c24 in run_builtin 
> /home/namhyung/project/linux/tools/perf/perf.c:313
> #10 0x558c572b1a48 in handle_internal_command 
> /home/namhyung/project/linux/tools/perf/perf.c:365
> #11 0x558c572b1a48 in run_argv 
> /home/namhyung/project/linux/tools/perf/perf.c:409
> #12 0x558c572b1a48 in main 
> /home/namhyung/project/linux/tools/perf/perf.c:539
> #13 0x7fcadb9f7d09 in __libc_start_main ../csu/libc-start.c:308
> #14 0x558c572b60f9 in _start 
> (/home/namhyung/project/linux/tools/perf/perf+0x45d0f9)
>
> Actually the nodes in the hash table are struct perf_stream_id and
> they were freed in the previous run.  Fix it by resetting the hash.
>
> Signed-off-by: Namhyung Kim 
> ---
>  tools/perf/util/evlist.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
> index 5121b4db66fe..882cd1f721d9 100644
> --- a/tools/perf/util/evlist.c
> +++ b/tools/perf/util/evlist.c
> @@ -1306,6 +1306,7 @@ void evlist__close(struct evlist *evlist)
> 		perf_evsel__free_fd(&evsel->core);
> 		perf_evsel__free_id(&evsel->core);
> 	}
> +	perf_evlist__reset_id_hash(&evlist->core);
>  }
>
>  static int evlist__create_syswide_maps(struct evlist *evlist)
> --
> 2.30.0.617.g56c4b15f3c-goog
>


Re: [PATCH 1/2] libperf: Add perf_evlist__reset_id_hash()

2021-02-24 Thread Namhyung Kim
On Wed, Feb 24, 2021 at 5:11 PM Namhyung Kim  wrote:
>
> Add the perf_evlist__reset_id_hash() function to libperf API so that
> it can be called to reset the hash table.  This is necessary for perf
> stat to run the workload multiple times.
>
> Signed-off-by: Namhyung Kim 
> ---
[SNIP]
> diff --git a/tools/lib/perf/libperf.map b/tools/lib/perf/libperf.map
> index 7be1af8a546c..285100143d89 100644
> --- a/tools/lib/perf/libperf.map
> +++ b/tools/lib/perf/libperf.map
> @@ -42,6 +42,7 @@ LIBPERF_0.0.1 {
> perf_evlist__munmap;
> perf_evlist__filter_pollfd;
> perf_evlist__next_mmap;
> +   perf_evlist__reset_id_hash;
> perf_mmap__consume;
> perf_mmap__read_init;
> perf_mmap__read_done;

I saw that perf_evsel__free_fd and perf_evsel__free_id are called from
util/evlist.c without being listed here.  Do we need to add them?

Thanks,
Namhyung
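
For background, libperf.map is a linker version script: symbols listed
under the version tag are exported from libperf.so and, with the
local: catch-all, everything else is hidden.  perf itself links the
in-tree libperf objects statically, which would explain why unlisted
functions such as perf_evsel__free_fd() are still reachable from
util/evlist.c.  A minimal example of the mechanism, using a
hypothetical library rather than libperf's actual map:

/* mylib.c: a hypothetical two-function library, not libperf itself.
 *
 * A linker version script (mylib.map) decides what the shared object
 * exports:
 *
 *	LIBMYLIB_0.0.1 {
 *		global:
 *			mylib_public;
 *		local:
 *			*;
 *	};
 *
 * Build:  gcc -shared -fPIC -Wl,--version-script=mylib.map -o libmylib.so mylib.c
 * 'nm -D libmylib.so' then lists mylib_public but not mylib_internal.
 * Linking mylib.o statically, as perf does with libperf's objects,
 * still allows direct calls to mylib_internal().
 */

int mylib_internal(int x)
{
	return x * 2;
}

int mylib_public(int x)
{
	return mylib_internal(x) + 1;
}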


[PATCH 2/2] perf stat: Fix segfault when -r option is used

2021-02-24 Thread Namhyung Kim
I got a segfault when using the -r option with event groups.  The option
makes it run the workload multiple times, reusing the evlist and evsel
for each run.

While most of the resources are allocated and freed properly, the id hash
in the evlist was not, and that resulted in a crash.  You can see it with
the address sanitizer like below:

  $ perf stat -r 100 -e '{cycles,instructions}' true
  =
  ==693052==ERROR: AddressSanitizer: heap-use-after-free on
  address 0x608003d0 at pc 0x558c57732835 bp 0x7fff1526adb0 sp 
0x7fff1526ada8
  WRITE of size 8 at 0x608003d0 thread T0
#0 0x558c57732834 in hlist_add_head 
/home/namhyung/project/linux/tools/include/linux/list.h:644
#1 0x558c57732834 in perf_evlist__id_hash 
/home/namhyung/project/linux/tools/lib/perf/evlist.c:237
#2 0x558c57732834 in perf_evlist__id_add 
/home/namhyung/project/linux/tools/lib/perf/evlist.c:244
#3 0x558c57732834 in perf_evlist__id_add_fd 
/home/namhyung/project/linux/tools/lib/perf/evlist.c:285
#4 0x558c5747733e in store_evsel_ids util/evsel.c:2765
#5 0x558c5747733e in evsel__store_ids util/evsel.c:2782
#6 0x558c5730b717 in __run_perf_stat 
/home/namhyung/project/linux/tools/perf/builtin-stat.c:895
#7 0x558c5730b717 in run_perf_stat 
/home/namhyung/project/linux/tools/perf/builtin-stat.c:1014
#8 0x558c5730b717 in cmd_stat 
/home/namhyung/project/linux/tools/perf/builtin-stat.c:2446
#9 0x558c57427c24 in run_builtin 
/home/namhyung/project/linux/tools/perf/perf.c:313
#10 0x558c572b1a48 in handle_internal_command 
/home/namhyung/project/linux/tools/perf/perf.c:365
#11 0x558c572b1a48 in run_argv 
/home/namhyung/project/linux/tools/perf/perf.c:409
#12 0x558c572b1a48 in main 
/home/namhyung/project/linux/tools/perf/perf.c:539
#13 0x7fcadb9f7d09 in __libc_start_main ../csu/libc-start.c:308
#14 0x558c572b60f9 in _start 
(/home/namhyung/project/linux/tools/perf/perf+0x45d0f9)

Actually the nodes in the hash table are struct perf_stream_id and
they were freed in the previous run.  Fix it by resetting the hash.

Signed-off-by: Namhyung Kim 
---
 tools/perf/util/evlist.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 5121b4db66fe..882cd1f721d9 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -1306,6 +1306,7 @@ void evlist__close(struct evlist *evlist)
 		perf_evsel__free_fd(&evsel->core);
 		perf_evsel__free_id(&evsel->core);
 	}
+	perf_evlist__reset_id_hash(&evlist->core);
 }
 
 static int evlist__create_syswide_maps(struct evlist *evlist)
-- 
2.30.0.617.g56c4b15f3c-goog



[PATCH 1/2] libperf: Add perf_evlist__reset_id_hash()

2021-02-24 Thread Namhyung Kim
Add the perf_evlist__reset_id_hash() function to libperf API so that
it can be called to reset the hash table.  This is necessary for perf
stat to run the workload multiple times.

Signed-off-by: Namhyung Kim 
---
 tools/lib/perf/evlist.c  | 13 +++++++++----
 tools/lib/perf/include/perf/evlist.h |  2 ++
 tools/lib/perf/libperf.map   |  1 +
 3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/tools/lib/perf/evlist.c b/tools/lib/perf/evlist.c
index 17465d454a0e..a0aaf385cbb5 100644
--- a/tools/lib/perf/evlist.c
+++ b/tools/lib/perf/evlist.c
@@ -26,13 +26,10 @@
 
 void perf_evlist__init(struct perf_evlist *evlist)
 {
-   int i;
-
-   for (i = 0; i < PERF_EVLIST__HLIST_SIZE; ++i)
-		INIT_HLIST_HEAD(&evlist->heads[i]);
 	INIT_LIST_HEAD(&evlist->entries);
 	evlist->nr_entries = 0;
 	fdarray__init(&evlist->pollfd, 64);
+   perf_evlist__reset_id_hash(evlist);
 }
 
 static void __perf_evlist__propagate_maps(struct perf_evlist *evlist,
@@ -237,6 +234,14 @@ static void perf_evlist__id_hash(struct perf_evlist 
*evlist,
 	hlist_add_head(&sid->node, &evlist->heads[hash]);
 }
 
+void perf_evlist__reset_id_hash(struct perf_evlist *evlist)
+{
+   int i;
+
+   for (i = 0; i < PERF_EVLIST__HLIST_SIZE; ++i)
+		INIT_HLIST_HEAD(&evlist->heads[i]);
+}
+
 void perf_evlist__id_add(struct perf_evlist *evlist,
 struct perf_evsel *evsel,
 int cpu, int thread, u64 id)
diff --git a/tools/lib/perf/include/perf/evlist.h 
b/tools/lib/perf/include/perf/evlist.h
index 0a7479dc13bf..0085732e8cd9 100644
--- a/tools/lib/perf/include/perf/evlist.h
+++ b/tools/lib/perf/include/perf/evlist.h
@@ -46,4 +46,6 @@ LIBPERF_API struct perf_mmap *perf_evlist__next_mmap(struct 
perf_evlist *evlist,
 (pos) != NULL; \
 (pos) = perf_evlist__next_mmap((evlist), (pos), overwrite))
 
+LIBPERF_API void perf_evlist__reset_id_hash(struct perf_evlist *evlist);
+
 #endif /* __LIBPERF_EVLIST_H */
diff --git a/tools/lib/perf/libperf.map b/tools/lib/perf/libperf.map
index 7be1af8a546c..285100143d89 100644
--- a/tools/lib/perf/libperf.map
+++ b/tools/lib/perf/libperf.map
@@ -42,6 +42,7 @@ LIBPERF_0.0.1 {
perf_evlist__munmap;
perf_evlist__filter_pollfd;
perf_evlist__next_mmap;
+   perf_evlist__reset_id_hash;
perf_mmap__consume;
perf_mmap__read_init;
perf_mmap__read_done;
-- 
2.30.0.617.g56c4b15f3c-goog



[PATCH] perf daemon: Fix compile error with Asan

2021-02-23 Thread Namhyung Kim
I'm seeing a build failure when building with Address Sanitizer.
It seems we could write one byte past the end of name[100] if the
var string is longer.

  $ make EXTRA_CFLAGS=-fsanitize=address
  ...
CC   builtin-daemon.o
  In function ‘get_session_name’,
inlined from ‘session_config’ at builtin-daemon.c:164:6,
inlined from ‘server_config’ at builtin-daemon.c:223:10:
  builtin-daemon.c:155:11: error: writing 1 byte into a region of size 0 
[-Werror=stringop-overflow=]
155 |  *session = 0;
|  ~^~~
  builtin-daemon.c: In function ‘server_config’:
  builtin-daemon.c:162:7: note: at offset 100 to object ‘name’ with size 100 
declared here
162 |  char name[100];
|   ^~~~

Fixes: c0666261ff38 ("perf daemon: Add config file support")
Signed-off-by: Namhyung Kim 
---
 tools/perf/builtin-daemon.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/perf/builtin-daemon.c b/tools/perf/builtin-daemon.c
index 617feaf020f6..8f9fc61691da 100644
--- a/tools/perf/builtin-daemon.c
+++ b/tools/perf/builtin-daemon.c
@@ -161,7 +161,7 @@ static int session_config(struct daemon *daemon, const char 
*var, const char *value)
struct daemon_session *session;
char name[100];
 
-   if (get_session_name(var, name, sizeof(name)))
+   if (get_session_name(var, name, sizeof(name) - 1))
return -EINVAL;
 
var = strchr(var, '.');
-- 
2.30.0.617.g56c4b15f3c-goog
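
To make the overflow concrete: if a helper copies up to len characters
and then writes the terminating zero at the current index, that zero
can land at buf[len], one byte past the caller's array -- which is what
the diagnostic above flags at offset 100 of name[100].  The stand-in
below is illustrative only; the real get_session_name() lives in
builtin-daemon.c and its exact loop differs:

#include <stdio.h>

/* Illustrative stand-in, not the real helper: copies chars up to 'len',
 * then terminates.  When it consumes all 'len' slots, the '\0' lands at
 * buf[len] -- one past the count the caller passed in. */
static int get_name(const char *src, char *buf, int len)
{
	int i;

	for (i = 0; i < len && src[i] && src[i] != '.'; i++)
		buf[i] = src[i];
	buf[i] = '\0';
	return 0;
}

int main(void)
{
	char name[8];

	/*
	 * Wrong: with a long session name, the terminator would be written
	 * to name[8], one byte past the array (left commented out):
	 *
	 *	get_name("verylongsession.run", name, sizeof(name));
	 */

	/* Right, matching the fix: reserve the last byte for the '\0'. */
	get_name("verylongsession.run", name, sizeof(name) - 1);
	printf("%s\n", name);
	return 0;
}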



Re: [PATCH V2 3/3] perf: Optimize sched_task() in a context switch

2021-02-22 Thread Namhyung Kim
On Wed, Jan 27, 2021 at 1:41 PM Namhyung Kim  wrote:
>
> Hi,
>
> On Mon, Jan 18, 2021 at 4:04 PM Namhyung Kim  wrote:
> >
> > Hi Peter and Kan,
> >
> > On Thu, Dec 10, 2020 at 11:25 PM Peter Zijlstra  
> > wrote:
> > >
> > > On Thu, Dec 10, 2020 at 08:52:55AM -0500, Liang, Kan wrote:
> > > >
> > > >
> > > > On 12/10/2020 2:13 AM, Namhyung Kim wrote:
> > > > > Hi Peter and Kan,
> > > > >
> > > > > How can we move this forward?
> > > >
> > > > Hi Namhyung,
> > > >
> > > > Thanks for the test. The changes look good to me.
> > > >
> > > > Hi Peter,
> > > >
> > > > Should we resend the patch set for further review?
> > >
> > > I've not yet seen a coherent replacement of #3, what I send was just a
> > > PoC.
>
> If it's the only problem of #3 which is an optimization,
> can we merge the actual fixes in #1 and #2 first?
>
> I know some people waiting for the fix..

Ping again...

Thanks,
Namhyung


Re: [PATCH 3/3] tools/lib/fs: Cache cgroupfs mount point

2021-02-19 Thread Namhyung Kim
Hi Arnaldo,

On Wed, Feb 17, 2021 at 9:58 PM Arnaldo Carvalho de Melo
 wrote:
>
> Em Fri, Jan 08, 2021 at 02:51:44PM +0900, Namhyung Kim escreveu:
> > On Wed, Jan 6, 2021 at 10:33 AM Namhyung Kim  wrote:
> > >
> > > Hi Arnaldo,
> > >
> > > On Tue, Dec 29, 2020 at 8:51 PM Arnaldo Carvalho de Melo
> > >  wrote:
> > > >
> > > > Em Wed, Dec 16, 2020 at 06:05:56PM +0900, Namhyung Kim escreveu:
> > > > > Currently it parses the /proc file everytime it opens a file in the
> > > > > cgroupfs.  Save the last result to avoid it (assuming it won't be
> > > > > changed between the accesses).
> > > >
> > > > Which is the most likely case, but can't we use something like inotify
> > > > to detect that and bail out or warn the user?
> > >
> > > Hmm.. looks doable.  Will check.
> >
> > So I've played with inotify a little bit, and it seems it needs to monitor
> > changes on the file or the directory.  I didn't get any notification from
> > the /proc/mounts file even when I did some mount/umount operations.
> >
> > Instead, I could get IN_UNMOUNT when the cgroup filesystem was
> > unmounted.  But for the monitoring, we need to do one of: a) a select-like
> > syscall to wait for the events, b) signal-driven IO notification, or
> > c) reading the inotify fd in non-blocking mode every time.
> >
> > In library code, I don't think we can do a) or b) since they can affect
> > the user program's behavior.  Then we should go with c), but I think
> > that defeats the purpose of this patch. :)
> >
> > As you said, I think mostly we don't care as the accesses will happen
> > in a short period of time.  But if you really care, maybe for the upcoming
> > perf daemon changes, I think we can add an API to invalidate the cache
> > or internal time-based invalidation logic (like remove it after 10 sec.).
>
> Ok, we can have something in 'perf daemon' to periodically invalidate
> this, maybe do a poor man inotify and when asking for the cgroup
> mountpoint, check some characteristic of that file that changes when it
> is modified, or plain use a timestamp and have some threshold.

I thought about this again.

We don't directly access the cgroups in the perf daemon.
It just creates new record processes, so they'll see a new
mountpoint whenever they start, since this cache is only
shared within a process.

That means we don't need to care about invalidation in the
daemon, but each perf record and perf stat should do it when
they need to do the work repeatedly.

But looking at the code, the cgroup is set during event parsing
(-G option) or early in the command (--for-each-cgroup option).
So cgroup info would not be changed even if the command
runs repeatedly.

So I think you can take the patch as is.

Thanks,
Namhyung
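
One way to realize the time-based invalidation floated earlier in the
thread, as a minimal sketch with hypothetical helper names (the real
cache from the patch under discussion lives under tools/lib/): keep the
cached mountpoint next to a timestamp and re-parse /proc/mounts once
the entry is older than a threshold.

#include <stdio.h>
#include <time.h>

#define CACHE_TTL_SEC 10

static char cached_mount[4096];
static time_t cached_at;

/* Stand-in for the /proc/mounts scan the patch wants to avoid repeating. */
static int parse_cgroupfs_mount(char *buf, size_t size)
{
	snprintf(buf, size, "/sys/fs/cgroup");	/* pretend we parsed it */
	return 0;
}

/* Returns the cached mountpoint, re-parsing after CACHE_TTL_SEC. */
static const char *cgroupfs_mountpoint(void)
{
	time_t now = time(NULL);

	if (cached_mount[0] == '\0' || now - cached_at > CACHE_TTL_SEC) {
		if (parse_cgroupfs_mount(cached_mount, sizeof(cached_mount)))
			return NULL;
		cached_at = now;
	}
	return cached_mount;
}

int main(void)
{
	printf("%s\n", cgroupfs_mountpoint());	/* parses once */
	printf("%s\n", cgroupfs_mountpoint());	/* served from the cache */
	return 0;
}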

