Re: [PATCH 1/9] perf/core: Add PERF_RECORD_CGROUP event

2019-08-30 Thread Stephane Eranian
On Fri, Aug 30, 2019 at 3:49 PM Namhyung Kim  wrote:
>
> On Fri, Aug 30, 2019 at 4:35 PM Peter Zijlstra  wrote:
> >
> > On Fri, Aug 30, 2019 at 12:46:51PM +0900, Namhyung Kim wrote:
> > > Hi Peter,
> > >
> > > On Wed, Aug 28, 2019 at 6:45 PM Peter Zijlstra  
> > > wrote:
> > > >
> > > > On Wed, Aug 28, 2019 at 04:31:22PM +0900, Namhyung Kim wrote:
> > > > > To support cgroup tracking, add CGROUP event to save a link between
> > > > > cgroup path and inode number.  The attr.cgroup bit was also added to
> > > > > enable cgroup tracking from userspace.
> > > > >
> > > > > This event will be generated when a new cgroup becomes active.
> > > > > Userspace might need to synthesize those events for existing cgroups.
> > > > >
> > > > > As aux_output change is also going on, I just added the bit here as
> > > > > well to remove possible conflicts later.
> > > >
> > > > Why do we want this?
> > >
> > > I saw below [1] and thought you have the patch introduced aux_output
> > > and it's gonna to be merged soon.
> > > Also the tooling patches are already in the acme/perf/core
> > > so I just wanted to avoid conflicts.
> > >
> > > Anyway, I'm ok with changing it.  Will remove in v2.
> >
> > I seem to have confused both you and Arnaldo with this. This email
> > questions the Changelog as a whole, not just the aux thing (I send a
> > separate email for that).
> >
> > This Changelog utterly fails to explain to me _why_ we need/want cgroup
> > tracking. So why do I want to review and possibly merge this? Changelog
> > needs to answer this.
>
> OK.  How about this?
>
> Systems running a large number of jobs in different cgroups want to
> profiling such jobs precisely.  This includes container hosting systems
> widely used today.  Currently perf supports namespace tracking but
> the systems may not use (cgroup) namespace for their jobs.  Also
> it'd be more intuitive to see cgroup names (as they're given by user
> or sysadmin) rather than numeric cgroup/namespace id even if they
> use the namespaces.
>

In data centers you care about attributing samples to a job not such
much to a process.
A job may have multiple processes which may come and go. The cgroup on
the other hand
stays around for the entire lifetime of the job. It is much easier to
map a cgroup name to a particular
job than it is to map a pid back to a job name, especially for offline
post-processing.

Hope this clarifies why we would like this feature upstream.


>
> Thanks,
> Namhyung


Re: [PATCH 1/9] perf/core: Add PERF_RECORD_CGROUP event

2019-08-30 Thread Namhyung Kim
On Fri, Aug 30, 2019 at 4:35 PM Peter Zijlstra  wrote:
>
> On Fri, Aug 30, 2019 at 12:46:51PM +0900, Namhyung Kim wrote:
> > Hi Peter,
> >
> > On Wed, Aug 28, 2019 at 6:45 PM Peter Zijlstra  wrote:
> > >
> > > On Wed, Aug 28, 2019 at 04:31:22PM +0900, Namhyung Kim wrote:
> > > > To support cgroup tracking, add CGROUP event to save a link between
> > > > cgroup path and inode number.  The attr.cgroup bit was also added to
> > > > enable cgroup tracking from userspace.
> > > >
> > > > This event will be generated when a new cgroup becomes active.
> > > > Userspace might need to synthesize those events for existing cgroups.
> > > >
> > > > As aux_output change is also going on, I just added the bit here as
> > > > well to remove possible conflicts later.
> > >
> > > Why do we want this?
> >
> > I saw below [1] and thought you have the patch introduced aux_output
> > and it's gonna to be merged soon.
> > Also the tooling patches are already in the acme/perf/core
> > so I just wanted to avoid conflicts.
> >
> > Anyway, I'm ok with changing it.  Will remove in v2.
>
> I seem to have confused both you and Arnaldo with this. This email
> questions the Changelog as a whole, not just the aux thing (I send a
> separate email for that).
>
> This Changelog utterly fails to explain to me _why_ we need/want cgroup
> tracking. So why do I want to review and possibly merge this? Changelog
> needs to answer this.

OK.  How about this?

Systems running a large number of jobs in different cgroups want to
profiling such jobs precisely.  This includes container hosting systems
widely used today.  Currently perf supports namespace tracking but
the systems may not use (cgroup) namespace for their jobs.  Also
it'd be more intuitive to see cgroup names (as they're given by user
or sysadmin) rather than numeric cgroup/namespace id even if they
use the namespaces.

Thanks,
Namhyung


Re: [PATCH 1/9] perf/core: Add PERF_RECORD_CGROUP event

2019-08-30 Thread Peter Zijlstra
On Fri, Aug 30, 2019 at 12:46:51PM +0900, Namhyung Kim wrote:
> Hi Peter,
> 
> On Wed, Aug 28, 2019 at 6:45 PM Peter Zijlstra  wrote:
> >
> > On Wed, Aug 28, 2019 at 04:31:22PM +0900, Namhyung Kim wrote:
> > > To support cgroup tracking, add CGROUP event to save a link between
> > > cgroup path and inode number.  The attr.cgroup bit was also added to
> > > enable cgroup tracking from userspace.
> > >
> > > This event will be generated when a new cgroup becomes active.
> > > Userspace might need to synthesize those events for existing cgroups.
> > >
> > > As aux_output change is also going on, I just added the bit here as
> > > well to remove possible conflicts later.
> >
> > Why do we want this?
> 
> I saw below [1] and thought you have the patch introduced aux_output
> and it's gonna to be merged soon.
> Also the tooling patches are already in the acme/perf/core
> so I just wanted to avoid conflicts.
> 
> Anyway, I'm ok with changing it.  Will remove in v2.

I seem to have confused both you and Arnaldo with this. This email
questions the Changelog as a whole, not just the aux thing (I send a
separate email for that).

This Changelog utterly fails to explain to me _why_ we need/want cgroup
tracking. So why do I want to review and possibly merge this? Changelog
needs to answer this.


Re: [PATCH 1/9] perf/core: Add PERF_RECORD_CGROUP event

2019-08-29 Thread Namhyung Kim
Hi Tejun,

On Wed, Aug 28, 2019 at 11:48 PM Tejun Heo  wrote:
>
> Hello, Namhyung.
>
> On Wed, Aug 28, 2019 at 04:31:22PM +0900, Namhyung Kim wrote:
> > +  * struct {
> > +  *  struct perf_event_headerheader;
> > +  *  u64 ino;
> > +  *  u64 path_len;
> > +  *  charpath[];
>
> ino and path aren't great identifers for cgroups because both can get
> recycled pretty quickly.  Can you please take a look at
> KERNFS_ROOT_SUPPORT_EXPORTOP?  That's the fhandle that cgroup uses,
> currently the standard ino+gen which isn't ideal but good enough.
> Another benefit is that the path can also be resolved efficiently from
> userspace given the handle.

Thanks for the info, I'll take a look at it.
Namhyung


Re: [PATCH 1/9] perf/core: Add PERF_RECORD_CGROUP event

2019-08-29 Thread Namhyung Kim
Hi Peter,

On Wed, Aug 28, 2019 at 6:45 PM Peter Zijlstra  wrote:
>
> On Wed, Aug 28, 2019 at 04:31:22PM +0900, Namhyung Kim wrote:
> > To support cgroup tracking, add CGROUP event to save a link between
> > cgroup path and inode number.  The attr.cgroup bit was also added to
> > enable cgroup tracking from userspace.
> >
> > This event will be generated when a new cgroup becomes active.
> > Userspace might need to synthesize those events for existing cgroups.
> >
> > As aux_output change is also going on, I just added the bit here as
> > well to remove possible conflicts later.
>
> Why do we want this?

I saw below [1] and thought you have the patch introduced aux_output
and it's gonna to be merged soon.
Also the tooling patches are already in the acme/perf/core
so I just wanted to avoid conflicts.

Anyway, I'm ok with changing it.  Will remove in v2.

Thanks
Namhyung

[1] https://lkml.org/lkml/2019/8/6/586


Re: [PATCH 1/9] perf/core: Add PERF_RECORD_CGROUP event

2019-08-28 Thread Tejun Heo
Hello, Namhyung.

On Wed, Aug 28, 2019 at 04:31:22PM +0900, Namhyung Kim wrote:
> +  * struct {
> +  *  struct perf_event_headerheader;
> +  *  u64 ino;
> +  *  u64 path_len;
> +  *  charpath[];

ino and path aren't great identifers for cgroups because both can get
recycled pretty quickly.  Can you please take a look at
KERNFS_ROOT_SUPPORT_EXPORTOP?  That's the fhandle that cgroup uses,
currently the standard ino+gen which isn't ideal but good enough.
Another benefit is that the path can also be resolved efficiently from
userspace given the handle.

Thanks.

-- 
tejun


Re: [PATCH 1/9] perf/core: Add PERF_RECORD_CGROUP event

2019-08-28 Thread Arnaldo Carvalho de Melo
Em Wed, Aug 28, 2019 at 11:44:59AM +0200, Peter Zijlstra escreveu:
> On Wed, Aug 28, 2019 at 04:31:22PM +0900, Namhyung Kim wrote:
> > To support cgroup tracking, add CGROUP event to save a link between
> > cgroup path and inode number.  The attr.cgroup bit was also added to
> > enable cgroup tracking from userspace.
> > 
> > This event will be generated when a new cgroup becomes active.
> > Userspace might need to synthesize those events for existing cgroups.
> > 
> > As aux_output change is also going on, I just added the bit here as
> > well to remove possible conflicts later.
> 
> Why do we want this?

Yeah, I'd resolve the conflict later, to avoid mixing things like this.

- Arnaldo


Re: [PATCH 1/9] perf/core: Add PERF_RECORD_CGROUP event

2019-08-28 Thread Peter Zijlstra
On Wed, Aug 28, 2019 at 04:31:22PM +0900, Namhyung Kim wrote:
> To support cgroup tracking, add CGROUP event to save a link between
> cgroup path and inode number.  The attr.cgroup bit was also added to
> enable cgroup tracking from userspace.
> 
> This event will be generated when a new cgroup becomes active.
> Userspace might need to synthesize those events for existing cgroups.
> 
> As aux_output change is also going on, I just added the bit here as
> well to remove possible conflicts later.

Why do we want this?



Re: [PATCH 1/9] perf/core: Add PERF_RECORD_CGROUP event

2019-08-28 Thread Peter Zijlstra
On Wed, Aug 28, 2019 at 04:31:22PM +0900, Namhyung Kim wrote:
> To support cgroup tracking, add CGROUP event to save a link between
> cgroup path and inode number.  The attr.cgroup bit was also added to
> enable cgroup tracking from userspace.
> 
> This event will be generated when a new cgroup becomes active.
> Userspace might need to synthesize those events for existing cgroups.
> 
> As aux_output change is also going on, I just added the bit here as
> well to remove possible conflicts later.
> 
> Cc: Tejun Heo 
> Cc: Li Zefan 
> Cc: Johannes Weiner 
> Cc: Adrian Hunter 
> Signed-off-by: Namhyung Kim 
> ---
>  include/uapi/linux/perf_event.h |  15 -
>  kernel/events/core.c| 112 
>  2 files changed, 126 insertions(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 7198ddd0c6b1..cb07c24b715f 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -374,7 +374,9 @@ struct perf_event_attr {
>   namespaces :  1, /* include namespaces data 
> */
>   ksymbol:  1, /* include ksymbol events 
> */
>   bpf_event  :  1, /* include bpf events */
> - __reserved_1   : 33;
> + aux_output :  1, /* generate AUX records 
> instead of events */
> + cgroup :  1, /* include cgroup events */
> + __reserved_1   : 31;

That looks like a rebase fail, aux_output is not from these patches.


[PATCH 1/9] perf/core: Add PERF_RECORD_CGROUP event

2019-08-28 Thread Namhyung Kim
To support cgroup tracking, add CGROUP event to save a link between
cgroup path and inode number.  The attr.cgroup bit was also added to
enable cgroup tracking from userspace.

This event will be generated when a new cgroup becomes active.
Userspace might need to synthesize those events for existing cgroups.

As aux_output change is also going on, I just added the bit here as
well to remove possible conflicts later.

Cc: Tejun Heo 
Cc: Li Zefan 
Cc: Johannes Weiner 
Cc: Adrian Hunter 
Signed-off-by: Namhyung Kim 
---
 include/uapi/linux/perf_event.h |  15 -
 kernel/events/core.c| 112 
 2 files changed, 126 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 7198ddd0c6b1..cb07c24b715f 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -374,7 +374,9 @@ struct perf_event_attr {
namespaces :  1, /* include namespaces data 
*/
ksymbol:  1, /* include ksymbol events 
*/
bpf_event  :  1, /* include bpf events */
-   __reserved_1   : 33;
+   aux_output :  1, /* generate AUX records 
instead of events */
+   cgroup :  1, /* include cgroup events */
+   __reserved_1   : 31;
 
union {
__u32   wakeup_events;/* wakeup every n events */
@@ -999,6 +1001,17 @@ enum perf_event_type {
 */
PERF_RECORD_BPF_EVENT   = 18,
 
+   /*
+* struct {
+*  struct perf_event_headerheader;
+*  u64 ino;
+*  u64 path_len;
+*  charpath[];
+*  struct sample_idsample_id;
+* };
+*/
+   PERF_RECORD_CGROUP  = 19,
+
PERF_RECORD_MAX,/* non-ABI */
 };
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 0463c1151bae..d898263db4fc 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -386,6 +386,7 @@ static atomic_t nr_freq_events __read_mostly;
 static atomic_t nr_switch_events __read_mostly;
 static atomic_t nr_ksymbol_events __read_mostly;
 static atomic_t nr_bpf_events __read_mostly;
+static atomic_t nr_cgroup_events __read_mostly;
 
 static LIST_HEAD(pmus);
 static DEFINE_MUTEX(pmus_lock);
@@ -4314,6 +4315,8 @@ static void unaccount_event(struct perf_event *event)
atomic_dec(&nr_comm_events);
if (event->attr.namespaces)
atomic_dec(&nr_namespaces_events);
+   if (event->attr.cgroup)
+   atomic_dec(&nr_cgroup_events);
if (event->attr.task)
atomic_dec(&nr_task_events);
if (event->attr.freq)
@@ -7217,6 +7220,106 @@ void perf_event_namespaces(struct task_struct *task)
NULL);
 }
 
+/*
+ * cgroup tracking
+ */
+#ifdef CONFIG_CGROUPS
+
+struct perf_cgroup_event {
+   char*path;
+   struct {
+   struct perf_event_headerheader;
+   u64 ino;
+   u64 path_len;
+   charpath[];
+   } event_id;
+};
+
+static int perf_event_cgroup_match(struct perf_event *event)
+{
+   return event->attr.cgroup;
+}
+
+static void perf_event_cgroup_output(struct perf_event *event, void *data)
+{
+   struct perf_cgroup_event *cgroup_event = data;
+   struct perf_output_handle handle;
+   struct perf_sample_data sample;
+   u16 header_size = cgroup_event->event_id.header.size;
+   int ret;
+
+   if (!perf_event_cgroup_match(event))
+   return;
+
+   perf_event_header__init_id(&cgroup_event->event_id.header,
+  &sample, event);
+   ret = perf_output_begin(&handle, event,
+   cgroup_event->event_id.header.size);
+   if (ret)
+   goto out;
+
+   perf_output_put(&handle, cgroup_event->event_id);
+   __output_copy(&handle, cgroup_event->path,
+ cgroup_event->event_id.path_len);
+
+   perf_event__output_id_sample(event, &handle, &sample);
+
+   perf_output_end(&handle);
+out:
+   cgroup_event->event_id.header.size = header_size;
+}
+
+void perf_event_cgroup(struct cgroup *cgrp)
+{
+   struct perf_cgroup_event cgroup_event;
+   char path_enomem[16] = "//enomem";
+   char *pathname;
+   size_t size;
+
+   if (!atomic_read(&nr_cgroup_events))
+   return;
+
+   cgroup_event = (struct perf_cgroup_event){
+   .event_id  = {
+   .header = {
+