Re: [PATCH/RFC v2] perf core: Allow setting up max frame stack depth via sysctl

2016-04-25 Thread Alexei Starovoitov
On Mon, Apr 25, 2016 at 08:41:39PM -0300, Arnaldo Carvalho de Melo wrote:
>  
> +int sysctl_perf_event_max_stack __read_mostly = PERF_MAX_STACK_DEPTH;
> +
> +static inline size_t perf_callchain_entry__sizeof(void)
> +{
> + return (sizeof(struct perf_callchain_entry) +
> + sizeof(__u64) * sysctl_perf_event_max_stack);
> +}
> @@ -1144,6 +1144,14 @@ static struct ctl_table kern_table[] = {
>   .extra1 = &zero,
>   .extra2 = &one_hundred,
>   },
> + {
> + .procname   = "perf_event_max_stack",
> + .data   = NULL, /* filled in by handler */
> + .maxlen = sizeof(sysctl_perf_event_max_stack),
> + .mode   = 0644,
> + .proc_handler   = perf_event_max_stack_handler,
> + .extra1 = &zero,
> + },

you need to define a max value otherwise perf_callchain_entry__sizeof
will overflow. Sure it's root only facility, but still not nice.
1M? Anything above 1M stack frames would be insane anyway.
The rest looks good. Thanks!



Re: [PATCH/RFC v2] perf core: Allow setting up max frame stack depth via sysctl

2016-04-25 Thread Arnaldo Carvalho de Melo
Em Mon, Apr 25, 2016 at 02:59:49PM -0700, Alexei Starovoitov escreveu:
> On Mon, Apr 25, 2016 at 05:17:50PM -0300, Arnaldo Carvalho de Melo wrote:
> > Em Mon, Apr 25, 2016 at 01:06:48PM -0700, Alexei Starovoitov escreveu:
> > > On Mon, Apr 25, 2016 at 04:22:29PM -0300, Arnaldo Carvalho de Melo wrote:
> > Well, I thought that multiline expressions required parentheses, to make
> > them easier to read for someone, maybe Ingo? ;-)
> > 
> > Whatever, both generate the same result, really want me to change this?
> 
> I was mainly complaining about unnecessary &..[0]
> 
> > > ?
> > > if didn't mixed up the ordering...
> > 
> > If you are not sure, then its not clearer, huh? ;-P
> 
> matter of taste. No strong opinion
> 
> > > and probably we could do the math on the spot instead of introducing
> > > perf_callchain_entry_size global variable?
> > 
> > I was trying to avoid the calc for each alloc, just doing it when we
> > change that number via the sysctl, probably not that big a deal, do you
> > really think we should do the math per-alloc instead of
> > per-sysctl-changing?
> 
> The math is trivial:
> #define __perf_callchain_entry_size(entries) \
>   (sizeof(struct perf_callchain_entry) + sizeof(__u64) * entries)
> shift and add after load is almost the same speed as load, since
> integer multiply right after is dominating the cost.
> whereas updating two global variables can potentially cause
> race conditions that need to be analyzed.

Ok, I thought the places where we changed those variables were always
under that callchain_mutex, but nah, too much trouble checking, find v3
below, see if I can have your Acked-by.

Thanks,

- Arnaldo

commit 565e8b138157fbe1c4495f083b88db681f52c6b3
Author: Arnaldo Carvalho de Melo 
Date:   Thu Apr 21 12:28:50 2016 -0300

perf core: Allow setting up max frame stack depth via sysctl

The default remains 127, which is good for most cases, and not even hit
most of the time, but then for some cases, as reported by Brendan, 1024+
deep frames are appearing on the radar for things like groovy, ruby.

And in some workloads putting a _lower_ cap on this may make sense. One
that is per event still needs to be put in place tho.

The new file is:

  # cat /proc/sys/kernel/perf_event_max_stack
  127

Chaging it:

  # echo 256 > /proc/sys/kernel/perf_event_max_stack
  # cat /proc/sys/kernel/perf_event_max_stack
  256

But as soon as there is some event using callchains we get:

  # echo 512 > /proc/sys/kernel/perf_event_max_stack
  -bash: echo: write error: Device or resource busy
  #

Because we only allocate the callchain percpu data structures when there
is a user, which allows for changing the max easily, its just a matter
of having no callchain users at that point.

Reported-and-Tested-by: Brendan Gregg 
Acked-by: Alexei Starovoitov 
Acked-by: David Ahern 
Cc: Adrian Hunter 
Cc: Alexander Shishkin 
Cc: He Kuang 
Cc: Jiri Olsa 
Cc: Linus Torvalds 
Cc: Masami Hiramatsu 
Cc: Milian Wolff 
Cc: Namhyung Kim 
Cc: Peter Zijlstra 
Cc: Stephane Eranian 
Cc: Thomas Gleixner 
Cc: Vince Weaver 
Cc: Wang Nan 
Cc: Zefan Li 
Link: http://lkml.kernel.org/n/tip-cgls6uuncwjtq969tys1j...@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo 

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 57653a44b128..260cde08e92e 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -60,6 +60,7 @@ show up in /proc/sys/kernel:
 - panic_on_warn
 - perf_cpu_time_max_percent
 - perf_event_paranoid
+- perf_event_max_stack
 - pid_max
 - powersave-nap   [ PPC only ]
 - printk
@@ -654,6 +655,19 @@ users (without CAP_SYS_ADMIN).  The default value is 1.
 
 ==
 
+perf_event_max_stack:
+
+Controls maximum number of stack frames to copy for (attr.sample_type &
+PERF_SAMPLE_CALLCHAIN) configured events, for instance, when using
+'perf record -g' or 'perf trace --call-graph fp'.
+
+This can only be done when no events are in use that have callchains
+enabled, otherwise writing to this file will return -EBUSY.
+
+The default value is 127.
+
+==
+
 pid_max:
 
 PID allocation wrap value.  When the kernel's next PID value
diff --git a/arch/arm/kernel/perf_callchain.c b/arch/arm/kernel/perf_callchain.c
index 4e02ae5950ff..27563befa8a2 100644
--- a/arch/arm/kernel/perf_callchain.c
+++ b/arch/arm/kernel/perf_callchain.c
@@ -75,7 +75,7 @@ perf_callchain_user(struct perf_callchain_entry *entry, 
struct pt_regs *regs)
 
tail = (struct frame_tail __user *)regs->ARM_fp - 1;
 
-   while ((entry->nr < PERF_MAX_STACK_DEPTH) &&
+   while ((entry->nr < sysctl_perf_event_max_stack) &&
   tail && !((unsigned long)tail & 0x3))

Re: [PATCH/RFC v2] perf core: Allow setting up max frame stack depth via sysctl

2016-04-25 Thread Alexei Starovoitov
On Mon, Apr 25, 2016 at 05:17:50PM -0300, Arnaldo Carvalho de Melo wrote:
> Em Mon, Apr 25, 2016 at 01:06:48PM -0700, Alexei Starovoitov escreveu:
> > On Mon, Apr 25, 2016 at 04:22:29PM -0300, Arnaldo Carvalho de Melo wrote:
> > > Em Mon, Apr 25, 2016 at 01:27:06PM -0300, Arnaldo Carvalho de Melo 
> > > escreveu:
> > > > Em Mon, Apr 25, 2016 at 01:14:25PM -0300, Arnaldo Carvalho de Melo 
> > > > escreveu:
> > > > > Em Fri, Apr 22, 2016 at 03:18:08PM -0700, Alexei Starovoitov escreveu:
> > > > > > right... and looking into it further, realized that the patch is 
> > > > > > broken,
> > > > > > since get_callchain_entry() is doing:
> > > > > >   return &entries->cpu_entries[cpu][*rctx];
> > > > > > whereas it should be dynamic offset based on 
> > > > > > sysctl_perf_event_max_stack*8
> > > > > > So definitely needs another respin.
> > > 
> > > > struct perf_callchain_entry {
> > > > __u64 nr;
> > > > __u64 ip[0]; /* /proc/sys/kernel/perf_event_max_stack */
> > > > };
> > >  
> > > > But perf_callchain_entry->ip is not a pointer... Got it ;-\
> > > 
> > > This is what I am building now, a patch on top of the previous, that
> > > will be combined and sent as v3, if I don't find any more funnies:
> > > 
> > > diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
> > > index 6fe77349fa9d..40657892a7e0 100644
> > > --- a/kernel/events/callchain.c
> > > +++ b/kernel/events/callchain.c
> > > @@ -20,11 +20,10 @@ struct callchain_cpus_entries {
> > >  
> > >  int sysctl_perf_event_max_stack __read_mostly = PERF_MAX_STACK_DEPTH;
> > >  
> > > -static size_t perf_callchain_entry__sizeof(void)
> > > -{
> > > - return sizeof(struct perf_callchain_entry) +
> > > -sizeof(__u64) * sysctl_perf_event_max_stack;
> > > -}
> > > +#define __perf_callchain_entry_size(entries) \
> > > + (sizeof(struct perf_callchain_entry) + sizeof(__u64) * entries)
> > > +
> > > +static size_t perf_callchain_entry_size __read_mostly = 
> > > __perf_callchain_entry_size(PERF_MAX_STACK_DEPTH);
> > >  
> > >  static DEFINE_PER_CPU(int, callchain_recursion[PERF_NR_CONTEXTS]);
> > >  static atomic_t nr_callchain_events;
> > > @@ -81,7 +80,7 @@ static int alloc_callchain_buffers(void)
> > >   if (!entries)
> > >   return -ENOMEM;
> > >  
> > > - size = perf_callchain_entry__sizeof() * PERF_NR_CONTEXTS;
> > > + size = perf_callchain_entry_size * PERF_NR_CONTEXTS;
> > >  
> > >   for_each_possible_cpu(cpu) {
> > >   entries->cpu_entries[cpu] = kmalloc_node(size, GFP_KERNEL,
> > > @@ -155,7 +154,8 @@ static struct perf_callchain_entry 
> > > *get_callchain_entry(int *rctx)
> > >  
> > >   cpu = smp_processor_id();
> > >  
> > > - return &entries->cpu_entries[cpu][*rctx];
> > > + return (((void *)&entries->cpu_entries[cpu][0]) +
> > > + (*rctx * perf_callchain_entry_size));
> > 
> > I think the following would be easier to read:
> > 
> > +   return (void *)entries->cpu_entries[cpu] +
> > +   *rctx * perf_callchain_entry_size;
> 
> Well, I thought that multiline expressions required parentheses, to make
> them easier to read for someone, maybe Ingo? ;-)
> 
> Whatever, both generate the same result, really want me to change this?

I was mainly complaining about unnecessary &..[0]

> > ?
> > if didn't mixed up the ordering...
> 
> If you are not sure, then its not clearer, huh? ;-P

matter of taste. No strong opinion

> > and probably we could do the math on the spot instead of introducing
> > perf_callchain_entry_size global variable?
> 
> I was trying to avoid the calc for each alloc, just doing it when we
> change that number via the sysctl, probably not that big a deal, do you
> really think we should do the math per-alloc instead of
> per-sysctl-changing?

The math is trivial:
#define __perf_callchain_entry_size(entries) \
(sizeof(struct perf_callchain_entry) + sizeof(__u64) * entries)
shift and add after load is almost the same speed as load, since
integer multiply right after is dominating the cost.
whereas updating two global variables can potentially cause
race conditions that need to be analyzed.



Re: [PATCH/RFC v2] perf core: Allow setting up max frame stack depth via sysctl

2016-04-25 Thread Arnaldo Carvalho de Melo
Em Mon, Apr 25, 2016 at 01:06:48PM -0700, Alexei Starovoitov escreveu:
> On Mon, Apr 25, 2016 at 04:22:29PM -0300, Arnaldo Carvalho de Melo wrote:
> > Em Mon, Apr 25, 2016 at 01:27:06PM -0300, Arnaldo Carvalho de Melo escreveu:
> > > Em Mon, Apr 25, 2016 at 01:14:25PM -0300, Arnaldo Carvalho de Melo 
> > > escreveu:
> > > > Em Fri, Apr 22, 2016 at 03:18:08PM -0700, Alexei Starovoitov escreveu:
> > > > > right... and looking into it further, realized that the patch is 
> > > > > broken,
> > > > > since get_callchain_entry() is doing:
> > > > >   return &entries->cpu_entries[cpu][*rctx];
> > > > > whereas it should be dynamic offset based on 
> > > > > sysctl_perf_event_max_stack*8
> > > > > So definitely needs another respin.
> > 
> > > struct perf_callchain_entry {
> > > __u64 nr;
> > > __u64 ip[0]; /* /proc/sys/kernel/perf_event_max_stack */
> > > };
> >  
> > > But perf_callchain_entry->ip is not a pointer... Got it ;-\
> > 
> > This is what I am building now, a patch on top of the previous, that
> > will be combined and sent as v3, if I don't find any more funnies:
> > 
> > diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
> > index 6fe77349fa9d..40657892a7e0 100644
> > --- a/kernel/events/callchain.c
> > +++ b/kernel/events/callchain.c
> > @@ -20,11 +20,10 @@ struct callchain_cpus_entries {
> >  
> >  int sysctl_perf_event_max_stack __read_mostly = PERF_MAX_STACK_DEPTH;
> >  
> > -static size_t perf_callchain_entry__sizeof(void)
> > -{
> > -   return sizeof(struct perf_callchain_entry) +
> > -  sizeof(__u64) * sysctl_perf_event_max_stack;
> > -}
> > +#define __perf_callchain_entry_size(entries) \
> > +   (sizeof(struct perf_callchain_entry) + sizeof(__u64) * entries)
> > +
> > +static size_t perf_callchain_entry_size __read_mostly = 
> > __perf_callchain_entry_size(PERF_MAX_STACK_DEPTH);
> >  
> >  static DEFINE_PER_CPU(int, callchain_recursion[PERF_NR_CONTEXTS]);
> >  static atomic_t nr_callchain_events;
> > @@ -81,7 +80,7 @@ static int alloc_callchain_buffers(void)
> > if (!entries)
> > return -ENOMEM;
> >  
> > -   size = perf_callchain_entry__sizeof() * PERF_NR_CONTEXTS;
> > +   size = perf_callchain_entry_size * PERF_NR_CONTEXTS;
> >  
> > for_each_possible_cpu(cpu) {
> > entries->cpu_entries[cpu] = kmalloc_node(size, GFP_KERNEL,
> > @@ -155,7 +154,8 @@ static struct perf_callchain_entry 
> > *get_callchain_entry(int *rctx)
> >  
> > cpu = smp_processor_id();
> >  
> > -   return &entries->cpu_entries[cpu][*rctx];
> > +   return (((void *)&entries->cpu_entries[cpu][0]) +
> > +   (*rctx * perf_callchain_entry_size));
> 
> I think the following would be easier to read:
> 
> + return (void *)entries->cpu_entries[cpu] +
> + *rctx * perf_callchain_entry_size;

Well, I thought that multiline expressions required parentheses, to make
them easier to read for someone, maybe Ingo? ;-)

Whatever, both generate the same result, really want me to change this?

> ?
> if didn't mixed up the ordering...

If you are not sure, then its not clearer, huh? ;-P
 
> and probably we could do the math on the spot instead of introducing
> perf_callchain_entry_size global variable?

I was trying to avoid the calc for each alloc, just doing it when we
change that number via the sysctl, probably not that big a deal, do you
really think we should do the math per-alloc instead of
per-sysctl-changing?

- Arnaldo


Re: [PATCH/RFC v2] perf core: Allow setting up max frame stack depth via sysctl

2016-04-25 Thread Alexei Starovoitov
On Mon, Apr 25, 2016 at 04:22:29PM -0300, Arnaldo Carvalho de Melo wrote:
> Em Mon, Apr 25, 2016 at 01:27:06PM -0300, Arnaldo Carvalho de Melo escreveu:
> > Em Mon, Apr 25, 2016 at 01:14:25PM -0300, Arnaldo Carvalho de Melo escreveu:
> > > Em Fri, Apr 22, 2016 at 03:18:08PM -0700, Alexei Starovoitov escreveu:
> > > > right... and looking into it further, realized that the patch is broken,
> > > > since get_callchain_entry() is doing:
> > > >   return &entries->cpu_entries[cpu][*rctx];
> > > > whereas it should be dynamic offset based on 
> > > > sysctl_perf_event_max_stack*8
> > > > So definitely needs another respin.
> 
> > struct perf_callchain_entry {
> > __u64 nr;
> > __u64 ip[0]; /* /proc/sys/kernel/perf_event_max_stack */
> > };
>  
> > But perf_callchain_entry->ip is not a pointer... Got it ;-\
> 
> This is what I am building now, a patch on top of the previous, that
> will be combined and sent as v3, if I don't find any more funnies:
> 
> diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
> index 6fe77349fa9d..40657892a7e0 100644
> --- a/kernel/events/callchain.c
> +++ b/kernel/events/callchain.c
> @@ -20,11 +20,10 @@ struct callchain_cpus_entries {
>  
>  int sysctl_perf_event_max_stack __read_mostly = PERF_MAX_STACK_DEPTH;
>  
> -static size_t perf_callchain_entry__sizeof(void)
> -{
> - return sizeof(struct perf_callchain_entry) +
> -sizeof(__u64) * sysctl_perf_event_max_stack;
> -}
> +#define __perf_callchain_entry_size(entries) \
> + (sizeof(struct perf_callchain_entry) + sizeof(__u64) * entries)
> +
> +static size_t perf_callchain_entry_size __read_mostly = 
> __perf_callchain_entry_size(PERF_MAX_STACK_DEPTH);
>  
>  static DEFINE_PER_CPU(int, callchain_recursion[PERF_NR_CONTEXTS]);
>  static atomic_t nr_callchain_events;
> @@ -81,7 +80,7 @@ static int alloc_callchain_buffers(void)
>   if (!entries)
>   return -ENOMEM;
>  
> - size = perf_callchain_entry__sizeof() * PERF_NR_CONTEXTS;
> + size = perf_callchain_entry_size * PERF_NR_CONTEXTS;
>  
>   for_each_possible_cpu(cpu) {
>   entries->cpu_entries[cpu] = kmalloc_node(size, GFP_KERNEL,
> @@ -155,7 +154,8 @@ static struct perf_callchain_entry 
> *get_callchain_entry(int *rctx)
>  
>   cpu = smp_processor_id();
>  
> - return &entries->cpu_entries[cpu][*rctx];
> + return (((void *)&entries->cpu_entries[cpu][0]) +
> + (*rctx * perf_callchain_entry_size));

I think the following would be easier to read:

+   return (void *)entries->cpu_entries[cpu] +
+   *rctx * perf_callchain_entry_size;
?
if didn't mixed up the ordering...

and probably we could do the math on the spot instead of introducing
perf_callchain_entry_size global variable?



Re: [PATCH/RFC v2] perf core: Allow setting up max frame stack depth via sysctl

2016-04-25 Thread Arnaldo Carvalho de Melo
Em Mon, Apr 25, 2016 at 01:27:06PM -0300, Arnaldo Carvalho de Melo escreveu:
> Em Mon, Apr 25, 2016 at 01:14:25PM -0300, Arnaldo Carvalho de Melo escreveu:
> > Em Fri, Apr 22, 2016 at 03:18:08PM -0700, Alexei Starovoitov escreveu:
> > > right... and looking into it further, realized that the patch is broken,
> > > since get_callchain_entry() is doing:
> > >   return &entries->cpu_entries[cpu][*rctx];
> > > whereas it should be dynamic offset based on sysctl_perf_event_max_stack*8
> > > So definitely needs another respin.

> struct perf_callchain_entry {
> __u64 nr;
> __u64 ip[0]; /* /proc/sys/kernel/perf_event_max_stack */
> };
 
> But perf_callchain_entry->ip is not a pointer... Got it ;-\

This is what I am building now, a patch on top of the previous, that
will be combined and sent as v3, if I don't find any more funnies:

diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 6fe77349fa9d..40657892a7e0 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -20,11 +20,10 @@ struct callchain_cpus_entries {
 
 int sysctl_perf_event_max_stack __read_mostly = PERF_MAX_STACK_DEPTH;
 
-static size_t perf_callchain_entry__sizeof(void)
-{
-   return sizeof(struct perf_callchain_entry) +
-  sizeof(__u64) * sysctl_perf_event_max_stack;
-}
+#define __perf_callchain_entry_size(entries) \
+   (sizeof(struct perf_callchain_entry) + sizeof(__u64) * entries)
+
+static size_t perf_callchain_entry_size __read_mostly = 
__perf_callchain_entry_size(PERF_MAX_STACK_DEPTH);
 
 static DEFINE_PER_CPU(int, callchain_recursion[PERF_NR_CONTEXTS]);
 static atomic_t nr_callchain_events;
@@ -81,7 +80,7 @@ static int alloc_callchain_buffers(void)
if (!entries)
return -ENOMEM;
 
-   size = perf_callchain_entry__sizeof() * PERF_NR_CONTEXTS;
+   size = perf_callchain_entry_size * PERF_NR_CONTEXTS;
 
for_each_possible_cpu(cpu) {
entries->cpu_entries[cpu] = kmalloc_node(size, GFP_KERNEL,
@@ -155,7 +154,8 @@ static struct perf_callchain_entry *get_callchain_entry(int 
*rctx)
 
cpu = smp_processor_id();
 
-   return &entries->cpu_entries[cpu][*rctx];
+   return (((void *)&entries->cpu_entries[cpu][0]) +
+   (*rctx * perf_callchain_entry_size));
 }
 
 static void
@@ -236,10 +236,12 @@ int perf_event_max_stack_handler(struct ctl_table *table, 
int write,
return ret;
 
mutex_lock(&callchain_mutex);
-   if (atomic_read(&nr_callchain_events))
+   if (atomic_read(&nr_callchain_events)) {
ret = -EBUSY;
-   else
+   } else {
sysctl_perf_event_max_stack = new_value;
+   perf_callchain_entry_size   = 
__perf_callchain_entry_size(new_value);
+   }
 
mutex_unlock(&callchain_mutex);
 


Re: [PATCH/RFC v2] perf core: Allow setting up max frame stack depth via sysctl

2016-04-25 Thread Arnaldo Carvalho de Melo
Em Mon, Apr 25, 2016 at 01:14:25PM -0300, Arnaldo Carvalho de Melo escreveu:
> Em Fri, Apr 22, 2016 at 03:18:08PM -0700, Alexei Starovoitov escreveu:
> > On Fri, Apr 22, 2016 at 04:05:31PM -0600, David Ahern wrote:
> > > On 4/22/16 2:52 PM, Arnaldo Carvalho de Melo wrote:
> > > >+int sysctl_perf_event_max_stack __read_mostly = PERF_MAX_STACK_DEPTH;
> > > >+
> > > >+static size_t perf_callchain_entry__sizeof(void)
> > > >+{
> > > >+return sizeof(struct perf_callchain_entry) +
> > > >+   sizeof(__u64) * sysctl_perf_event_max_stack;
> > > >+}
> > > >+
> 
> > > To Alexei's comment, a max_stack of 0 still has a non-zero alloc size so
> > > that should be ok.
> 
> > > >  static DEFINE_PER_CPU(int, callchain_recursion[PERF_NR_CONTEXTS]);
> > > >  static atomic_t nr_callchain_events;
> > > >  static DEFINE_MUTEX(callchain_mutex);
> > > >@@ -73,7 +81,7 @@ static int alloc_callchain_buffers(void)
> > > > if (!entries)
> > > > return -ENOMEM;
> 
> > > >-size = sizeof(struct perf_callchain_entry) * PERF_NR_CONTEXTS;
> > > >+size = perf_callchain_entry__sizeof() * PERF_NR_CONTEXTS;
> 
> > > > for_each_possible_cpu(cpu) {
> > > > entries->cpu_entries[cpu] = kmalloc_node(size, 
> > > > GFP_KERNEL,
>  
> > right... and looking into it further, realized that the patch is broken,
> > since get_callchain_entry() is doing:
> >   return &entries->cpu_entries[cpu][*rctx];
> > whereas it should be dynamic offset based on sysctl_perf_event_max_stack*8
> > So definitely needs another respin.
> 
> Huh? Can you elaborate a bit more?
> 
> Are you saying this is a bug introduced by this patch or something
> pre-existing?

When we alloc we alloc:

/*
 * Number of contexts where an event can trigger:
 *  task, softirq, hardirq, nmi.
 */
#define PERF_NR_CONTEXTS4

callchain_cpus_entries[0 .. (nr_cpu_ids - 1)][in_nmi() (3)]
callchain_cpus_entries[0 .. (nr_cpu_ids - 1)][in_irq() (2)]
callchain_cpus_entries[0 .. (nr_cpu_ids - 1)][in_sortirq() (1)]
callchain_cpus_entries[0 .. (nr_cpu_ids - 1)][(0)]

And all of these have this type:

struct perf_callchain_entry {
__u64 nr;
__u64 ip[0]; /* /proc/sys/kernel/perf_event_max_stack */
};

So, it will return a struct_callchain_entry pointer to a 8-byte sized
chunk, with just perf_callchain_entry->nr, and will not try to touch
perf_callchain_entry->ip[] since sysctl_perf_event_max_stack is zero.

But perf_callchain_entry->ip is not a pointer... Got it ;-\

Touché, respinning...

Brendan, this probably affected you results...

- Arnaldo


Re: [PATCH/RFC v2] perf core: Allow setting up max frame stack depth via sysctl

2016-04-25 Thread Arnaldo Carvalho de Melo
Em Fri, Apr 22, 2016 at 03:18:08PM -0700, Alexei Starovoitov escreveu:
> On Fri, Apr 22, 2016 at 04:05:31PM -0600, David Ahern wrote:
> > On 4/22/16 2:52 PM, Arnaldo Carvalho de Melo wrote:
> > >+int sysctl_perf_event_max_stack __read_mostly = PERF_MAX_STACK_DEPTH;
> > >+
> > >+static size_t perf_callchain_entry__sizeof(void)
> > >+{
> > >+  return sizeof(struct perf_callchain_entry) +
> > >+ sizeof(__u64) * sysctl_perf_event_max_stack;
> > >+}
> > >+

> > To Alexei's comment, a max_stack of 0 still has a non-zero alloc size so
> > that should be ok.

> > >  static DEFINE_PER_CPU(int, callchain_recursion[PERF_NR_CONTEXTS]);
> > >  static atomic_t nr_callchain_events;
> > >  static DEFINE_MUTEX(callchain_mutex);
> > >@@ -73,7 +81,7 @@ static int alloc_callchain_buffers(void)
> > >   if (!entries)
> > >   return -ENOMEM;

> > >-  size = sizeof(struct perf_callchain_entry) * PERF_NR_CONTEXTS;
> > >+  size = perf_callchain_entry__sizeof() * PERF_NR_CONTEXTS;

> > >   for_each_possible_cpu(cpu) {
> > >   entries->cpu_entries[cpu] = kmalloc_node(size, GFP_KERNEL,
 
> right... and looking into it further, realized that the patch is broken,
> since get_callchain_entry() is doing:
>   return &entries->cpu_entries[cpu][*rctx];
> whereas it should be dynamic offset based on sysctl_perf_event_max_stack*8
> So definitely needs another respin.

Huh? Can you elaborate a bit more?

Are you saying this is a bug introduced by this patch or something
pre-existing?

- Arnaldo


Re: [PATCH/RFC v2] perf core: Allow setting up max frame stack depth via sysctl

2016-04-22 Thread Alexei Starovoitov
On Fri, Apr 22, 2016 at 04:05:31PM -0600, David Ahern wrote:
> On 4/22/16 2:52 PM, Arnaldo Carvalho de Melo wrote:
> >Em Wed, Apr 20, 2016 at 04:04:12PM -0700, Alexei Starovoitov escreveu:
> >>On Wed, Apr 20, 2016 at 07:47:30PM -0300, Arnaldo Carvalho de Melo wrote:
> >
> >>Nice. I like it. That's a great approach to hard problem.
> >>Java guys will be happy too.
> >>Please also adjust two places in kernel/bpf/stackmap.c
> >
> >>>+  {
> >>>+  .procname   = "perf_event_max_stack",
> >>>+  .data   = NULL, /* filled in by handler */
> >>>+  .maxlen = sizeof(sysctl_perf_event_max_stack),
> >>>+  .mode   = 0644,
> >>>+  .proc_handler   = perf_event_max_stack_handler,
> >>>+  .extra1 = &zero,
> >
> >>zero seems to be the wrong minimum. I think it should be at least 2 to
> >>to fit user/kernel tags ?  Probably needs to define max as well.
> >
> >So, if someone asks for zero, it will not copy anything, but then, this
> >would be what the user had asked for :-)
> >
> >Ditto for the max, if someone asks for too big a callchain, then when
> >allocating it it will fail and no callchain will be produced, that or it
> >will be able to allocate but will take too long copying that many
> >addresses, and we would be prevented from doing so by some other
> >protection, iirc there is perf_cpu_time_max_percent, and then buffer
> >space will run out.
> >
> >So I think that leaving it as is is enough, no?
> >
> >Can I keep your Acked-by? David, can I keep yours?
> 
> Yes
> 
> 
> >diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
> >index 343c22f5e867..6fe77349fa9d 100644
> >--- a/kernel/events/callchain.c
> >+++ b/kernel/events/callchain.c
> >@@ -18,6 +18,14 @@ struct callchain_cpus_entries {
> > struct perf_callchain_entry *cpu_entries[0];
> >  };
> >
> >+int sysctl_perf_event_max_stack __read_mostly = PERF_MAX_STACK_DEPTH;
> >+
> >+static size_t perf_callchain_entry__sizeof(void)
> >+{
> >+return sizeof(struct perf_callchain_entry) +
> >+   sizeof(__u64) * sysctl_perf_event_max_stack;
> >+}
> >+
> 
> To Alexei's comment, a max_stack of 0 still has a non-zero alloc size so
> that should be ok.
> 
> 
> >  static DEFINE_PER_CPU(int, callchain_recursion[PERF_NR_CONTEXTS]);
> >  static atomic_t nr_callchain_events;
> >  static DEFINE_MUTEX(callchain_mutex);
> >@@ -73,7 +81,7 @@ static int alloc_callchain_buffers(void)
> > if (!entries)
> > return -ENOMEM;
> >
> >-size = sizeof(struct perf_callchain_entry) * PERF_NR_CONTEXTS;
> >+size = perf_callchain_entry__sizeof() * PERF_NR_CONTEXTS;
> >
> > for_each_possible_cpu(cpu) {
> > entries->cpu_entries[cpu] = kmalloc_node(size, GFP_KERNEL,

right... and looking into it further, realized that the patch is broken,
since get_callchain_entry() is doing:
  return &entries->cpu_entries[cpu][*rctx];
whereas it should be dynamic offset based on sysctl_perf_event_max_stack*8
So definitely needs another respin.



Re: [PATCH/RFC v2] perf core: Allow setting up max frame stack depth via sysctl

2016-04-22 Thread David Ahern

On 4/22/16 2:52 PM, Arnaldo Carvalho de Melo wrote:

Em Wed, Apr 20, 2016 at 04:04:12PM -0700, Alexei Starovoitov escreveu:

On Wed, Apr 20, 2016 at 07:47:30PM -0300, Arnaldo Carvalho de Melo wrote:



Nice. I like it. That's a great approach to hard problem.
Java guys will be happy too.
Please also adjust two places in kernel/bpf/stackmap.c



+   {
+   .procname   = "perf_event_max_stack",
+   .data   = NULL, /* filled in by handler */
+   .maxlen = sizeof(sysctl_perf_event_max_stack),
+   .mode   = 0644,
+   .proc_handler   = perf_event_max_stack_handler,
+   .extra1 = &zero,



zero seems to be the wrong minimum. I think it should be at least 2 to
to fit user/kernel tags ?  Probably needs to define max as well.


So, if someone asks for zero, it will not copy anything, but then, this
would be what the user had asked for :-)

Ditto for the max, if someone asks for too big a callchain, then when
allocating it it will fail and no callchain will be produced, that or it
will be able to allocate but will take too long copying that many
addresses, and we would be prevented from doing so by some other
protection, iirc there is perf_cpu_time_max_percent, and then buffer
space will run out.

So I think that leaving it as is is enough, no?

Can I keep your Acked-by? David, can I keep yours?


Yes



diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 343c22f5e867..6fe77349fa9d 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -18,6 +18,14 @@ struct callchain_cpus_entries {
struct perf_callchain_entry *cpu_entries[0];
  };

+int sysctl_perf_event_max_stack __read_mostly = PERF_MAX_STACK_DEPTH;
+
+static size_t perf_callchain_entry__sizeof(void)
+{
+   return sizeof(struct perf_callchain_entry) +
+  sizeof(__u64) * sysctl_perf_event_max_stack;
+}
+


To Alexei's comment, a max_stack of 0 still has a non-zero alloc size so 
that should be ok.




  static DEFINE_PER_CPU(int, callchain_recursion[PERF_NR_CONTEXTS]);
  static atomic_t nr_callchain_events;
  static DEFINE_MUTEX(callchain_mutex);
@@ -73,7 +81,7 @@ static int alloc_callchain_buffers(void)
if (!entries)
return -ENOMEM;

-   size = sizeof(struct perf_callchain_entry) * PERF_NR_CONTEXTS;
+   size = perf_callchain_entry__sizeof() * PERF_NR_CONTEXTS;

for_each_possible_cpu(cpu) {
entries->cpu_entries[cpu] = kmalloc_node(size, GFP_KERNEL,





Re: [PATCH/RFC v2] perf core: Allow setting up max frame stack depth via sysctl

2016-04-22 Thread Alexei Starovoitov
On Fri, Apr 22, 2016 at 05:52:32PM -0300, Arnaldo Carvalho de Melo wrote:
> Em Wed, Apr 20, 2016 at 04:04:12PM -0700, Alexei Starovoitov escreveu:
> > On Wed, Apr 20, 2016 at 07:47:30PM -0300, Arnaldo Carvalho de Melo wrote:
>  
> > Nice. I like it. That's a great approach to hard problem.
> > Java guys will be happy too.
> > Please also adjust two places in kernel/bpf/stackmap.c
>  
> > > + {
> > > + .procname   = "perf_event_max_stack",
> > > + .data   = NULL, /* filled in by handler */
> > > + .maxlen = sizeof(sysctl_perf_event_max_stack),
> > > + .mode   = 0644,
> > > + .proc_handler   = perf_event_max_stack_handler,
> > > + .extra1 = &zero,
>  
> > zero seems to be the wrong minimum. I think it should be at least 2 to
> > to fit user/kernel tags ?  Probably needs to define max as well.
> 
> So, if someone asks for zero, it will not copy anything, but then, this
> would be what the user had asked for :-)
> 
> Ditto for the max, if someone asks for too big a callchain, then when
> allocating it it will fail and no callchain will be produced, that or it
> will be able to allocate but will take too long copying that many
> addresses, and we would be prevented from doing so by some other
> protection, iirc there is perf_cpu_time_max_percent, and then buffer
> space will run out.
> 
> So I think that leaving it as is is enough, no?
> 
> Can I keep your Acked-by? David, can I keep yours?

I'm still a bit concerned with perf_event_max_stack==0, since it
means that alloc_callchain_buffers() will be failing,
so perf_event_open() will also be failing with ENOMEM which
will be hard to understand for any user and not clear whether
perf user space can print any hints, since such errno is ambiguous.
Also I'm concerned with very large perf_event_max_stack that
can cause integer overflow.
So I still think we have to set reasonable min and max.
Other than that it's ack.



[PATCH/RFC v2] perf core: Allow setting up max frame stack depth via sysctl

2016-04-22 Thread Arnaldo Carvalho de Melo
Em Wed, Apr 20, 2016 at 04:04:12PM -0700, Alexei Starovoitov escreveu:
> On Wed, Apr 20, 2016 at 07:47:30PM -0300, Arnaldo Carvalho de Melo wrote:
 
> Nice. I like it. That's a great approach to hard problem.
> Java guys will be happy too.
> Please also adjust two places in kernel/bpf/stackmap.c
 
> > +   {
> > +   .procname   = "perf_event_max_stack",
> > +   .data   = NULL, /* filled in by handler */
> > +   .maxlen = sizeof(sysctl_perf_event_max_stack),
> > +   .mode   = 0644,
> > +   .proc_handler   = perf_event_max_stack_handler,
> > +   .extra1 = &zero,
 
> zero seems to be the wrong minimum. I think it should be at least 2 to
> to fit user/kernel tags ?  Probably needs to define max as well.

So, if someone asks for zero, it will not copy anything, but then, this
would be what the user had asked for :-)

Ditto for the max, if someone asks for too big a callchain, then when
allocating it it will fail and no callchain will be produced, that or it
will be able to allocate but will take too long copying that many
addresses, and we would be prevented from doing so by some other
protection, iirc there is perf_cpu_time_max_percent, and then buffer
space will run out.

So I think that leaving it as is is enough, no?

Can I keep your Acked-by? David, can I keep yours?

I'll improve tools/perf to look at this file and then push this to Ingo,
if nobody hollers.

Peter, I think my response was enough about reporting allocation
failure, right?

Brendan tested it and it solved the problem he reported, the callchains
he got are really that big and now he has good looking flamegraphs, yay!

- Arnaldo

Here is V2, with stackmap.c adjusted, only build tested.

commit 7c24bb249e392e3dd7dd71a26277c64e313bab4e
Author: Arnaldo Carvalho de Melo 
Date:   Thu Apr 21 12:28:50 2016 -0300

perf core: Allow setting up max frame stack depth via sysctl

The default remains 127, which is good for most cases, and not even hit
most of the time, but then for some cases, as reported by Brendan, 1024+
deep frames are appearing on the radar for things like groovy, ruby.

And in some workloads putting a _lower_ cap on this may make sense. One
that is per event still needs to be put in place tho.

The new file is:

  # cat /proc/sys/kernel/perf_event_max_stack
  127

Chaging it:

  # echo 256 > /proc/sys/kernel/perf_event_max_stack
  # cat /proc/sys/kernel/perf_event_max_stack
  256

But as soon as there is some event using callchains we get:

  # echo 512 > /proc/sys/kernel/perf_event_max_stack
  -bash: echo: write error: Device or resource busy
  #

Because we only allocate the callchain percpu data structures when there
is a user, which allows for changing the max easily, its just a matter
of having no callchain users at that point.

Reported-and-Tested-by: Brendan Gregg 
Acked-by: Alexei Starovoitov 
Acked-by: David Ahern 
Cc: Adrian Hunter 
Cc: Alexander Shishkin 
Cc: He Kuang 
Cc: Jiri Olsa 
Cc: Linus Torvalds 
Cc: Masami Hiramatsu 
Cc: Milian Wolff 
Cc: Namhyung Kim 
Cc: Peter Zijlstra 
Cc: Stephane Eranian 
Cc: Thomas Gleixner 
Cc: Vince Weaver 
Cc: Wang Nan 
Cc: Zefan Li 
Link: http://lkml.kernel.org/n/tip-cgls6uuncwjtq969tys1j...@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo 

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 57653a44b128..260cde08e92e 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -60,6 +60,7 @@ show up in /proc/sys/kernel:
 - panic_on_warn
 - perf_cpu_time_max_percent
 - perf_event_paranoid
+- perf_event_max_stack
 - pid_max
 - powersave-nap   [ PPC only ]
 - printk
@@ -654,6 +655,19 @@ users (without CAP_SYS_ADMIN).  The default value is 1.
 
 ==
 
+perf_event_max_stack:
+
+Controls maximum number of stack frames to copy for (attr.sample_type &
+PERF_SAMPLE_CALLCHAIN) configured events, for instance, when using
+'perf record -g' or 'perf trace --call-graph fp'.
+
+This can only be done when no events are in use that have callchains
+enabled, otherwise writing to this file will return -EBUSY.
+
+The default value is 127.
+
+==
+
 pid_max:
 
 PID allocation wrap value.  When the kernel's next PID value
diff --git a/arch/arm/kernel/perf_callchain.c b/arch/arm/kernel/perf_callchain.c
index 4e02ae5950ff..27563befa8a2 100644
--- a/arch/arm/kernel/perf_callchain.c
+++ b/arch/arm/kernel/perf_callchain.c
@@ -75,7 +75,7 @@ perf_callchain_user(struct perf_callchain_entry *entry, 
struct pt_regs *regs)
 
tail = (struct frame_tail __user *)regs->ARM_fp - 1;
 
-   while ((entry->nr < PERF_MAX_STACK_DEPTH) &&
+   while ((e