Re: [PATCH v2] perf: Synchronously cleanup child events

2016-01-27 Thread Alexei Starovoitov
On Wed, Jan 27, 2016 at 10:58:22AM +0100, Peter Zijlstra wrote:
> 
> > Meaning there's always got to be a user space process
> > that will be holding perf_event FDs.
> 
> By using fget() the BPF array thing will hold the FDs, right? I mean
> once you do a full fget() userspace can go and kill itself, the struct
> file will persist.

Right. I mistakenly thought that fget() would prevent the task from
being freed, or that some fs/directory resources would be left hanging.
But the file comes from anon_inode, so all good. All concerns cleared.



Re: [PATCH v2] perf: Synchronously cleanup child events

2016-01-27 Thread Peter Zijlstra
On Tue, Jan 26, 2016 at 03:31:59PM -0800, Alexei Starovoitov wrote:

> This patch will conflict with kernel/bpf/arraymap.c and
> kernel/trace/bpf_trace.c that are planned for net-next,
> but the conflicts in kernel/events/core.c are probably harder
> to resolve, so yes please take it into tip/perf.

Thanks.

> I think your scm_rights fixes depend on this patch and together
> it's an important bug fix, so probably makes sense to send
> them right now without waiting for the next merge window?

I'll leave that up to Ingo, but likely yes.

> As soon as you get the whole thing into tip, I'll test it
> to make sure bpf side is ok and I hope Wang will test it too.
> 
> I'm still a bit concerned about taking a file reference for this,
> since bpf programs that use perf_events won't be able to be
> 'detached'.

I was not aware BPF could be detached like that.

> Meaning there's always got to be a user space process
> that will be holding perf_event FDs.

By using fget() the BPF array thing will hold the FDs, right? I mean
once you do a full fget() userspace can go and kill itself, the struct
file will persist.

> On networking side we
> don't have this limitation. Like we can attach bpf to TC,
> iproute2 will exit and reattach some time later. So it
> kinda sux, but sounds like you want to get rid of
> perf_event->refcnt completely, 

We cannot actually get rid of it, we need it for some existence stuff.
But yes, we need stricter cleanup.

> so I don't see any other way.
> We can fix it later if it really becomes an issue.

One possible avenue would be to allow BPF to create its own (kernel)
events and manage their lifetime along with the BPF object. But so far I
don't see a problem with the file thing.
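
For illustration, a minimal sketch of that avenue, assuming the existing
perf_event_create_kernel_counter()/perf_event_release_kernel() API; the
function and variable names below are hypothetical and not part of any
patch in this thread:

/*
 * Sketch only: a kernel-side user creating and destroying its own
 * counter, so the event's lifetime is tied to a kernel object rather
 * than to a userspace fd.
 */
#include <linux/err.h>
#include <linux/perf_event.h>

static struct perf_event *my_kernel_event;	/* hypothetical owner object */

static int my_create_cycles_counter(int cpu)
{
	struct perf_event_attr attr = {
		.type	= PERF_TYPE_HARDWARE,
		.config	= PERF_COUNT_HW_CPU_CYCLES,
		.size	= sizeof(attr),
	};

	/* per-cpu counter, no task, no overflow handler */
	my_kernel_event = perf_event_create_kernel_counter(&attr, cpu, NULL,
							   NULL, NULL);
	if (IS_ERR(my_kernel_event))
		return PTR_ERR(my_kernel_event);

	return 0;
}

static void my_destroy_cycles_counter(void)
{
	/* tear the event down when the owning kernel object goes away */
	perf_event_release_kernel(my_kernel_event);
}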


Re: [PATCH v2] perf: Synchronously cleanup child events

2016-01-26 Thread Alexei Starovoitov
On Tue, Jan 26, 2016 at 06:24:25PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 26, 2016 at 05:16:37PM +0100, Peter Zijlstra wrote:
> > > +struct file *perf_event_get(unsigned int fd)
> > >  {
> > > + struct file *file;
> > >  
> > > + file = fget_raw(fd);
> > 
> > fget_raw() to guarantee the return value isn't NULL? afaict the O_PATH
> > stuff does not apply to perf events, so you'd put any fd for which the
> > distinction matters anyway.

yeah good catch. the following is needed:
file = fget_raw(fd);
+   if (!file)
+   return ERR_PTR(-EBADF);

> > 
> > > + if (file->f_op != &perf_fops) {
> > > + fput(file);
> > > + return ERR_PTR(-EBADF);
> > > + }
> > >  
> > > + return file;
> > >  }
> 
> Is it not possible for one thread to concurrently call close() while
> this thread tries to fget()? In which case, we must check the return
> value anyway?

the !file check is definitely needed; the fd passed by the user can be bogus.
The caller of perf_event_get() checks for errors too.
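
For reference, folding the NULL check into the quoted hunk gives roughly
the following helper (a sketch assembled from the snippets above, assuming
the &perf_fops comparison shown in the quoted diff):

struct file *perf_event_get(unsigned int fd)
{
	struct file *file;

	file = fget_raw(fd);
	if (!file)
		return ERR_PTR(-EBADF);	/* bogus or concurrently closed fd */

	if (file->f_op != &perf_fops) {
		/* not a perf event file */
		fput(file);
		return ERR_PTR(-EBADF);
	}

	return file;
}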

This patch will conflict with kernel/bpf/arraymap.c and
kernel/trace/bpf_trace.c that are planned for net-next,
but the conflicts in kernel/events/core.c are probably harder
to resolve, so yes please take it into tip/perf.
I think your scm_rights fixes depend on this patch and together
it's an important bug fix, so probably makes sense to send
them right now without waiting for the next merge window?
As soon as you get the whole thing into tip, I'll test it
to make sure bpf side is ok and I hope Wang will test it too.

I'm still a bit concerned about taking a file reference for this,
since bpf programs that use perf_events won't be able to be
'detached'. Meaning there's always got to be a user space process
that will be holding perf_event FDs. On networking side we
don't have this limitation. Like we can attach bpf to TC,
iproute2 will exit and reattach some time later. So it
kinda sux, but sounds like you want to get rid of
perf_event->refcnt completely, so I don't see any other way.
We can fix it later if it really becomes an issue.



Re: [PATCH v2] perf: Synchronously cleanup child events

2016-01-26 Thread Peter Zijlstra
On Tue, Jan 26, 2016 at 05:16:37PM +0100, Peter Zijlstra wrote:
> > +struct file *perf_event_get(unsigned int fd)
> >  {
> > +   struct file *file;
> >  
> > +   file = fget_raw(fd);
> 
> fget_raw() to guarantee the return value isn't NULL? afaict the O_PATH
> stuff does not apply to perf events, so you'd put any fd for which the
> distinction matters anyway.
> 
> > +   if (file->f_op != &perf_fops) {
> > +   fput(file);
> > +   return ERR_PTR(-EBADF);
> > +   }
> >  
> > +   return file;
> >  }

Is it not possible for one thread to concurrently call close() while
this thread tries to fget()? In which case, we must check the return
value anyway?


Re: [PATCH v2] perf: Synchronously cleanup child events

2016-01-26 Thread Peter Zijlstra
On Mon, Jan 25, 2016 at 08:59:49PM -0800, Alexei Starovoitov wrote:
> I think I understand what you're trying to do and
> the patch looks good to me.

Great, thanks!

> As far as BPF side I did the following...
> does it match the model you outlined above?

Yes, a few comments/questions below.

> 
> Subject: [PATCH ] perf,bpf: convert perf_event_array to use struct file
> 
> Signed-off-by: Alexei Starovoitov 

Can I take this through the tip/perf tree so that all these changes land
together?

> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 06ae52e99ac2..2a95e0d2370f 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -8896,21 +8896,17 @@ void perf_event_delayed_put(struct task_struct *task)
>   WARN_ON_ONCE(task->perf_event_ctxp[ctxn]);
>  }
>  
> +struct file *perf_event_get(unsigned int fd)
>  {
> + struct file *file;
>  
> + file = fget_raw(fd);

fget_raw() to guarantee the return value isn't NULL? afaict the O_PATH
stuff does not apply to perf events, so you'd put any fd for which the
distinction matters anyway.

> + if (file->f_op != &perf_fops) {
> + fput(file);
> + return ERR_PTR(-EBADF);
> + }
>  
> + return file;
>  }


Re: [PATCH v2] perf: Synchronously cleanup child events

2016-01-25 Thread Alexei Starovoitov
On Mon, Jan 25, 2016 at 10:04:10PM +0100, Peter Zijlstra wrote:
> On Mon, Jan 25, 2016 at 03:54:14PM +0100, Peter Zijlstra wrote:
> > Alexander, Alexei,
> > 
> > How about the below? That uses event->state == PERF_EVENT_STATE_EXIT to
> > indicate the event has been given up by its 'owner' and decouples us
> > from the actual event->owner logic.
> > 
> > This retains the event->owner and event->owner_list thing purely for the
> > prctl(.option = PR_TASK_PERF_EVENTS_{EN,DIS}ABLE) calls, but does give
> > us strict 'owner' semantics in that:
> > 
> >   struct perf_event *my_event = perf_event_create_kernel_counter();
> > 
> >   /* ... */
> > 
> >   perf_event_release_kernel(my_event);
> > 
> > Or
> > 
> >   int fd = sys_perf_event_open(...);
> > 
> >   close(fd); /* last, calls fops::release */
> > 
> > Will destroy the event dead. event::refcount will 'retain' the object
> > but it will become non functional and is strictly meant as a temporal
> > existence guarantee (for when RCU isn't good enough).
> > 
> > So this should restore the scm_rights case, which preserves the fd but
> > could result in not having event->owner (and therefore being removed
> > from its owner_list), which is fine.
> > 
> > BPF still needs to get fixed to use filedesc references instead.
> 
> Still no BPF, but this one actually 'works', as in it doesn't have the
> blatant exit races and has survived a few hours of runtime.
> 
> ---
>  include/linux/perf_event.h |3 
>  kernel/events/core.c   |  304 ++---
>  2 files changed, 150 insertions(+), 157 deletions(-)

I think I understand what you're trying to do and
the patch looks good to me.
As far as BPF side I did the following...
does it match the model you outlined above?
I did basic testing and it looks fine.

Subject: [PATCH ] perf,bpf: convert perf_event_array to use struct file

Signed-off-by: Alexei Starovoitov 
---
 include/linux/perf_event.h |  4 ++--
 kernel/bpf/arraymap.c  | 21 +++--
 kernel/events/core.c   | 20 
 kernel/trace/bpf_trace.c   | 14 ++
 4 files changed, 31 insertions(+), 28 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index f9828a48f16a..df275020fde9 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -729,7 +729,7 @@ extern int perf_event_init_task(struct task_struct *child);
 extern void perf_event_exit_task(struct task_struct *child);
 extern void perf_event_free_task(struct task_struct *task);
 extern void perf_event_delayed_put(struct task_struct *task);
-extern struct perf_event *perf_event_get(unsigned int fd);
+extern struct file *perf_event_get(unsigned int fd);
 extern const struct perf_event_attr *perf_event_attrs(struct perf_event *event);
 extern void perf_event_print_debug(void);
 extern void perf_pmu_disable(struct pmu *pmu);
@@ -1070,7 +1070,7 @@ static inline int perf_event_init_task(struct task_struct *child) { return 0; }
 static inline void perf_event_exit_task(struct task_struct *child) { }
 static inline void perf_event_free_task(struct task_struct *task)  { }
 static inline void perf_event_delayed_put(struct task_struct *task){ }
-static inline struct perf_event *perf_event_get(unsigned int fd)   { return ERR_PTR(-EINVAL); }
+static inline struct file *perf_event_get(unsigned int fd) { return ERR_PTR(-EINVAL); }
 static inline const struct perf_event_attr *perf_event_attrs(struct perf_event *event)
 {
 return ERR_PTR(-EINVAL);
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index b0799bced518..89ebbc4d1164 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -291,10 +291,13 @@ static void *perf_event_fd_array_get_ptr(struct bpf_map *map, int fd)
 {
struct perf_event *event;
const struct perf_event_attr *attr;
+   struct file *file;
 
-   event = perf_event_get(fd);
-   if (IS_ERR(event))
-   return event;
+   file = perf_event_get(fd);
+   if (IS_ERR(file))
+   return file;
+
+   event = file->private_data;
 
attr = perf_event_attrs(event);
if (IS_ERR(attr))
@@ -304,24 +307,22 @@ static void *perf_event_fd_array_get_ptr(struct bpf_map *map, int fd)
goto err;
 
if (attr->type == PERF_TYPE_RAW)
-   return event;
+   return file;
 
if (attr->type == PERF_TYPE_HARDWARE)
-   return event;
+   return file;
 
if (attr->type == PERF_TYPE_SOFTWARE &&
attr->config == PERF_COUNT_SW_BPF_OUTPUT)
-   return event;
+   return file;
 err:
-   perf_event_release_kernel(event);
+   fput(file);
return ERR_PTR(-EINVAL);
 }
 
 static void perf_event_fd_array_put_ptr(void *ptr)
 {
-   struct perf_event *event = ptr;
-
-   perf_event_release_kernel(event);
+   fput((struct file *)ptr);
 }
 
 static const struct 

Re: [PATCH v2] perf: Synchronously cleanup child events

2016-01-25 Thread Peter Zijlstra
On Mon, Jan 25, 2016 at 03:54:14PM +0100, Peter Zijlstra wrote:
> Alexander, Alexei,
> 
> How about the below? That uses event->state == PERF_EVENT_STATE_EXIT to
> indicate the event has been given up by its 'owner' and decouples us
> from the actual event->owner logic.
> 
> This retains the event->owner and event->owner_list thing purely for the
> prctl(.option = PR_TASK_PERF_EVENTS_{EN,DIS}ABLE) calls, but does give
> us strict 'owner' semantics in that:
> 
>   struct perf_event *my_event = perf_event_create_kernel_counter();
> 
>   /* ... */
> 
>   perf_event_release_kernel(my_event);
> 
> Or
> 
>   int fd = sys_perf_event_open(...);
> 
>   close(fd); /* last, calls fops::release */
> 
> Will destroy the event dead. event::refcount will 'retain' the object
> but it will become non functional and is strictly meant as a temporal
> existence guarantee (for when RCU isn't good enough).
> 
> So this should restore the scm_rights case, which preserves the fd but
> could result in not having event->owner (and therefore being removed
> from its owner_list), which is fine.
> 
> BPF still needs to get fixed to use filedesc references instead.

Still no BPF, but this one actually 'works', as in it doesn't have the
blatant exit races and has survived a few hours of runtime.

---
 include/linux/perf_event.h |3 
 kernel/events/core.c   |  304 ++---
 2 files changed, 150 insertions(+), 157 deletions(-)

--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -634,9 +634,6 @@ struct perf_event_context {
int nr_cgroups;  /* cgroup evts */
void*task_ctx_data; /* pmu specific data */
struct rcu_head rcu_head;
-
-   struct delayed_work orphans_remove;
-   boolorphans_remove_sched;
 };
 
 /*
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -49,8 +49,6 @@
 
 #include <asm/irq_regs.h>
 
-static struct workqueue_struct *perf_wq;
-
 typedef int (*remote_function_f)(void *);
 
 struct remote_function_call {
@@ -1086,8 +1084,8 @@ static void put_ctx(struct perf_event_co
  * Lock order:
  * task_struct::perf_event_mutex
  *   perf_event_context::mutex
- * perf_event_context::lock
  * perf_event::child_mutex;
+ *   perf_event_context::lock
  * perf_event::mmap_mutex
  * mmap_sem
  */
@@ -1646,45 +1644,11 @@ static void perf_group_detach(struct per
perf_event__header_size(tmp);
 }
 
-/*
- * User event without the task.
- */
 static bool is_orphaned_event(struct perf_event *event)
 {
-   return event && !is_kernel_event(event) && !event->owner;
-}
-
-/*
- * Event has a parent but parent's task finished and it's
- * alive only because of children holding refference.
- */
-static bool is_orphaned_child(struct perf_event *event)
-{
-   return is_orphaned_event(event->parent);
+   return event->state == PERF_EVENT_STATE_EXIT;
 }
 
-static void orphans_remove_work(struct work_struct *work);
-
-static void schedule_orphans_remove(struct perf_event_context *ctx)
-{
-   if (!ctx->task || ctx->orphans_remove_sched || !perf_wq)
-   return;
-
-   if (queue_delayed_work(perf_wq, &ctx->orphans_remove, 1)) {
-   get_ctx(ctx);
-   ctx->orphans_remove_sched = true;
-   }
-}
-
-static int __init perf_workqueue_init(void)
-{
-   perf_wq = create_singlethread_workqueue("perf");
-   WARN(!perf_wq, "failed to create perf workqueue\n");
-   return perf_wq ? 0 : -1;
-}
-
-core_initcall(perf_workqueue_init);
-
 static inline int pmu_filter_match(struct perf_event *event)
 {
struct pmu *pmu = event->pmu;
@@ -1745,9 +1709,6 @@ event_sched_out(struct perf_event *event
if (event->attr.exclusive || !cpuctx->active_oncpu)
cpuctx->exclusive = 0;
 
-   if (is_orphaned_child(event))
-   schedule_orphans_remove(ctx);
-
perf_pmu_enable(event->pmu);
 }
 
@@ -1771,6 +1732,9 @@ group_sched_out(struct perf_event *group
cpuctx->exclusive = 0;
 }
 
+#define DETACH_GROUP   0x01
+#define DETACH_STATE   0x02
+
 /*
  * Cross CPU call to remove a performance event
  *
@@ -1783,13 +1747,19 @@ __perf_remove_from_context(struct perf_e
   struct perf_event_context *ctx,
   void *info)
 {
-   bool detach_group = (unsigned long)info;
+   unsigned long flags = (unsigned long)info;
 
event_sched_out(event, cpuctx, ctx);
-   if (detach_group)
+
+   if (flags & DETACH_GROUP)
perf_group_detach(event);
list_del_event(event, ctx);
 
+   if (flags & DETACH_STATE) {
+   event->state = PERF_EVENT_STATE_EXIT;
+   perf_event_wakeup(event);
+   }
+
if (!ctx->nr_events && ctx->is_active) {
ctx->is_active = 0;
if (ctx->task) {
@@ 

Re: [PATCH v2] perf: Synchronously cleanup child events

2016-01-25 Thread Peter Zijlstra
On Mon, Jan 25, 2016 at 12:48:46PM +0100, Peter Zijlstra wrote:
> On Fri, Jan 22, 2016 at 11:44:03AM -0800, Alexei Starovoitov wrote:
> > On Fri, Jan 22, 2016 at 01:38:47PM +0100, Peter Zijlstra wrote:
> > > On Fri, Jan 22, 2016 at 01:35:40PM +0200, Alexander Shishkin wrote:
> > > > Peter Zijlstra  writes:
> > > > 
> > > > > So I think there's a number of problems still :-(
> > 
> > I've been looking at how perf_event->owner is handled and couldn't
> > figure out how you deal with the case of passing perf_event_fd via 
> > scm_rights.
> > It seems one process can open an event, pass it to another process,
> > but when the current process exits, owner will still point to the dead task,
> > since refcount > 0.
> > Which part am I missing?
> 
> Nothing, you raised a good point. I think this shows we cannot link
> !event->owner to an event being 'dead'.
> 
> If we keep these two states separate, the scm_rights thing should work
> again.

Alexander, Alexei,

How about the below? That uses event->state == PERF_EVENT_STATE_EXIT to
indicate the event has been given up by its 'owner' and decouples us
from the actual event->owner logic.

This retains the event->owner and event->owner_list thing purely for the
prctl(.option = PR_TASK_PERF_EVENTS_{EN,DIS}ABLE) calls, but does give
us strict 'owner' semantics in that:

  struct perf_event *my_event = perf_event_create_kernel_counter();

  /* ... */

  perf_event_release_kernel(my_event);

Or

  int fd = sys_perf_event_open(...);

  close(fd); /* last, calls fops::release */

Will destroy the event dead. event::refcount will 'retain' the object
but it will become non functional and is strictly meant as a temporal
existence guarantee (for when RCU isn't good enough).

So this should restore the scm_rights case, which preserves the fd but
could result in not having event->owner (and therefore being removed
from its owner_list), which is fine.

BPF still needs to get fixed to use filedesc references instead.

---

--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -634,9 +634,6 @@ struct perf_event_context {
int nr_cgroups;  /* cgroup evts */
void*task_ctx_data; /* pmu specific data */
struct rcu_head rcu_head;
-
-   struct delayed_work orphans_remove;
-   boolorphans_remove_sched;
 };
 
 /*
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -49,8 +49,6 @@
 
 #include <asm/irq_regs.h>
 
-static struct workqueue_struct *perf_wq;
-
 typedef int (*remote_function_f)(void *);
 
 struct remote_function_call {
@@ -1086,8 +1084,8 @@ static void put_ctx(struct perf_event_co
  * Lock order:
  * task_struct::perf_event_mutex
  *   perf_event_context::mutex
- * perf_event_context::lock
  * perf_event::child_mutex;
+ *   perf_event_context::lock
  * perf_event::mmap_mutex
  * mmap_sem
  */
@@ -1646,45 +1644,11 @@ static void perf_group_detach(struct per
perf_event__header_size(tmp);
 }
 
-/*
- * User event without the task.
- */
 static bool is_orphaned_event(struct perf_event *event)
 {
-   return event && !is_kernel_event(event) && !event->owner;
-}
-
-/*
- * Event has a parent but parent's task finished and it's
- * alive only because of children holding refference.
- */
-static bool is_orphaned_child(struct perf_event *event)
-{
-   return is_orphaned_event(event->parent);
-}
-
-static void orphans_remove_work(struct work_struct *work);
-
-static void schedule_orphans_remove(struct perf_event_context *ctx)
-{
-   if (!ctx->task || ctx->orphans_remove_sched || !perf_wq)
-   return;
-
-   if (queue_delayed_work(perf_wq, &ctx->orphans_remove, 1)) {
-   get_ctx(ctx);
-   ctx->orphans_remove_sched = true;
-   }
+   return event->state == PERF_EVENT_STATE_EXIT;
 }
 
-static int __init perf_workqueue_init(void)
-{
-   perf_wq = create_singlethread_workqueue("perf");
-   WARN(!perf_wq, "failed to create perf workqueue\n");
-   return perf_wq ? 0 : -1;
-}
-
-core_initcall(perf_workqueue_init);
-
 static inline int pmu_filter_match(struct perf_event *event)
 {
struct pmu *pmu = event->pmu;
@@ -1745,9 +1709,6 @@ event_sched_out(struct perf_event *event
if (event->attr.exclusive || !cpuctx->active_oncpu)
cpuctx->exclusive = 0;
 
-   if (is_orphaned_child(event))
-   schedule_orphans_remove(ctx);
-
perf_pmu_enable(event->pmu);
 }
 
@@ -1771,6 +1732,9 @@ group_sched_out(struct perf_event *group
cpuctx->exclusive = 0;
 }
 
+#define DETACH_GROUP   0x01
+#define DETACH_STATE   0x02
+
 /*
  * Cross CPU call to remove a performance event
  *
@@ -1783,13 +1747,19 @@ __perf_remove_from_context(struct perf_e
   struct perf_event_context *ctx,
   void *info)
 {
-   bool detach_group = (unsigned long)info;
+ 

Re: [PATCH v2] perf: Synchronously cleanup child events

2016-01-25 Thread Peter Zijlstra
On Fri, Jan 22, 2016 at 11:44:03AM -0800, Alexei Starovoitov wrote:
> On Fri, Jan 22, 2016 at 01:38:47PM +0100, Peter Zijlstra wrote:
> > On Fri, Jan 22, 2016 at 01:35:40PM +0200, Alexander Shishkin wrote:
> > > Peter Zijlstra  writes:
> > > 
> > > > So I think there's a number of problems still :-(
> 
> I've been looking at how perf_event->owner is handled and couldn't
> figure out how you deal with the case of passing perf_event_fd via scm_rights.
> It seems one process can open an event, pass it to another process,
> but when the current process exits, owner will still point to the dead task,
> since refcount > 0.
> Which part am I missing?

Nothing, you raised a good point. I think this shows we cannot link
!event->owner to an event being 'dead'.

If we keep these two states separate, the scm_rights thing should work
again.
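
(For readers unfamiliar with the scenario under discussion: passing an fd,
such as a perf event fd, to another process over a unix socket via
SCM_RIGHTS looks roughly like the sketch below. This is a hypothetical
userspace example for context, not code from the thread.)

/* Minimal sketch: hand an open fd to a peer over a connected AF_UNIX socket. */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

static int send_fd(int unix_sock, int fd_to_pass)
{
	char dummy = 'x';
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	char cbuf[CMSG_SPACE(sizeof(int))];
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&msg, 0, sizeof(msg));
	memset(cbuf, 0, sizeof(cbuf));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = cbuf;
	msg.msg_controllen = sizeof(cbuf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

	/* the receiver gets a new fd referencing the same struct file */
	return sendmsg(unix_sock, &msg, 0) < 0 ? -1 : 0;
}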


Re: [PATCH v2] perf: Synchronously cleanup child events

2016-01-22 Thread Alexei Starovoitov
On Fri, Jan 22, 2016 at 01:38:47PM +0100, Peter Zijlstra wrote:
> On Fri, Jan 22, 2016 at 01:35:40PM +0200, Alexander Shishkin wrote:
> > Peter Zijlstra  writes:
> > 
> > > So I think there's a number of problems still :-(

I've been looking at how perf_event->owner is handled and couldn't
figure out how you deal with the case of passing perf_event_fd via scm_rights.
It seems one process can open an event, pass it to another process,
but when the current process exits, owner will still point to the dead task,
since refcount > 0.
Which part am I missing?



Re: [PATCH v2] perf: Synchronously cleanup child events

2016-01-22 Thread Peter Zijlstra
On Fri, Jan 22, 2016 at 01:35:40PM +0200, Alexander Shishkin wrote:
> Peter Zijlstra  writes:
> 
> > So I think there's a number of problems still :-(
> 
> Also, it does indeed race with
> __perf_event_exit_task()/sync_child_event(), but that one I'd fix by
> simply wrapping the sync_child_event()/free_event() in
> 
> mutex_lock(&parent_event->child_mutex);
> if (!is_orphan_event(parent_event)) {
>         sync_child_event(child_event);
>         free_event(child_event);
> }
> mutex_unlock(&parent_event->child_mutex);

Also, note the comment with _perf_event_disable(), that relies on
sync_child_event() taking event->parent->child_mutex.


Re: [PATCH v2] perf: Synchronously cleanup child events

2016-01-22 Thread Peter Zijlstra
On Fri, Jan 22, 2016 at 01:35:40PM +0200, Alexander Shishkin wrote:
> Peter Zijlstra  writes:
> 
> > So I think there's a number of problems still :-(
> 
> Also, it does indeed race with
> __perf_event_exit_task()/sync_child_event(), but that one I'd fix by
> simply wrapping the sync_child_event()/free_event() in
> 
> mutex_lock(&parent_event->child_mutex);
> if (!is_orphan_event(parent_event)) {
>         sync_child_event(child_event);
>         free_event(child_event);
> }
> mutex_unlock(&parent_event->child_mutex);

So I've been staring at exactly that code for a while because Ingo
managed to trigger that race (and I could reproduce with his
'workload').

But I'm not seeing how; both sites hold ctx->mutex and remove the event
from the ctx->event_list.

So the way I'm seeing it, either the orphan_work finds and frees it, or
the __perf_event_exit_task() one does, but I'm a bit stumped on how they
can both do so.

Sure, the sync stuff is pointless if we're orphan, but I don't see how
it can harm.

> At some later point in time the code there could use a bit of
> reshuffling, I guess.

Yes.


Re: [PATCH v2] perf: Synchronously cleanup child events

2016-01-22 Thread Alexander Shishkin
Peter Zijlstra  writes:

> So I think there's a number of problems still :-(

Also, it does indeed race with
__perf_event_exit_task()/sync_child_event(), but that one I'd fix by
simply wrapping the sync_child_event()/free_event() in

mutex_lock(&parent_event->child_mutex);
if (!is_orphan_event(parent_event)) {
        sync_child_event(child_event);
        free_event(child_event);
}
mutex_unlock(&parent_event->child_mutex);

At some later point in time the code there could use a bit of
reshuffling, I guess.

Regards,
--
Alex


Re: [PATCH v2] perf: Synchronously cleanup child events

2016-01-20 Thread Alexei Starovoitov
On Wed, Jan 20, 2016 at 09:32:22AM +0100, Peter Zijlstra wrote:
> On Tue, Jan 19, 2016 at 01:58:19PM -0800, Alexei Starovoitov wrote:
> > On Tue, Jan 19, 2016 at 09:05:58PM +0100, Peter Zijlstra wrote:
> 
> > > The most obvious place that generates such magical references would be
> > > the bpf arraymap doing perf_event_get() on things. There are a few other
> > > places that take temp references (perf_mmap_close), but those are
> > > 'short' lived and while ugly will not cause massive grief. The BPF one
> > > OTOH is a real problem here.
> > > 
> > > And looking at the BPF stuff, that code seems to assume
> > > perf_event_kernel_release() := put_event(), so this patch breaks that
> > > too.
> > > 
> > > 
> > > Alexei, is there a reason the arraymap stuff needs a perf event ref as
> > > opposed to a file ref? I'm forever a little confused on how perf<->bpf
> > > works.
> > 
> > A file ref will not work, since user space could have closed that
> > perf_event file to avoid unnecessary FDs.
> 
> So I'm (possibly again) confused on how BPF works.
> 
> I thought the reason you handed in perf events from userspace; as
> opposed to creating your own with perf_event_create_kernel_counter();
> was because userspace was interested in the output.

yes. There are two use cases of perf_events from bpf:
1. a sw_bpf_output event is used by bpf to push samples into it, and
   user space reads them as usual via mmap
2. a PERF_TYPE_HARDWARE event is used by a bpf program to read
   counters, to measure things like the number of cycles or tlb misses
   in a given function.
   In this case user space typically leaves the FDs around, but it doesn't
   use them for anything.

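A rough sketch of the second use case, modeled on the samples/bpf programs
of the time (the map and section names below are illustrative, and it
assumes the bpf_perf_event_read() helper and the samples/bpf build setup):

#include <linux/ptrace.h>
#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"	/* SEC(), bpf_* helper declarations */

/* userspace stores one PERF_TYPE_HARDWARE event fd per cpu in this map */
struct bpf_map_def SEC("maps") cycles_map = {
	.type		= BPF_MAP_TYPE_PERF_EVENT_ARRAY,
	.key_size	= sizeof(int),
	.value_size	= sizeof(u32),
	.max_entries	= 64,
};

SEC("kprobe/sys_write")
int count_cycles(struct pt_regs *ctx)
{
	u32 cpu = bpf_get_smp_processor_id();
	u64 cycles;

	/* read the hardware counter userspace stored for this cpu */
	cycles = bpf_perf_event_read(&cycles_map, cpu);

	/* ... accumulate or compare cycle counts here ... */
	return 0;
}
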
> Also, BPF should not be a way to get around the filedesc resource limit.

all bpf tracing stuff is root only and maps are charged for every element.

> > The program only needs a stable pointer to 'struct perf_event', which
> > it will use while running.
> > At the end it will call perf_event_kernel_release() which
> > is == put_event().
> > It was the case that 'perf_events' were normal refcnt-ed structures
> > and the last guy frees it.
> 
> Sort-of, but user events are (or should be, rather) tied to the filedesc
> to account for the resources used.
> 
> There is also the event->owner field, we track the task that created the
> event, with your current scheme that is left dangling once userspace
> closes the last filedesc and you still have a ref open.
> 
> > This put_event_last() logic definitely looks problematic.
> > There are no ordering guarantees.
> > User space may close FD, while struct perf_event is still alive.
> > The loop around perf_event_last() looks buggy.
> > I'm obviously missing the main goal of this patch.
> 
> Right, so the patch in question tries to synchronously clean up
> everything related to the counter when we close the file. Such that the
> file better reflects the actual resource usage.
> 
> Currently we do this async (and with holes).
> 
> In short, user created event really should be filedesc based, yes we
> have event references, but those 'should' be short lived.

I'm still missing why this is a problem.
Which counter do you want to bump as part of perf_event_get()?
still event->refcount, right?
but the same perf_event can be stored in multiple bpf maps
and many bpf programs can be using it, while nothing can
possibly prevent user space from doing close(perf_event_fd)
while programs are still running and collecting tlb miss data
from the counters.
So what do you propose?


