Re: [PATCH] perf record: Fix segfault when running with suid and kptr_restrict is 1

2016-09-20 Thread Wangnan (F)



On 2016/9/21 11:48, Wang Nan wrote:

Before this patch, perf segfaults if kptr_restrict is set to 1 and perf is
owned by root with the suid bit set:

  $ whoami
  wangnan
  $ ls -l ./perf
  -rwsr-xr-x 1 root root 19781908 Sep 21 19:29 /home/wangnan/perf
  $ cat /proc/sys/kernel/kptr_restrict
  1
  $ cat /proc/sys/kernel/perf_event_paranoid
  -1
  $ ./perf record -a
  Segmentation fault (core dumped)

The reason is that perf assumes it is allowed to read kernel pointers from
/proc/kallsyms when the euid is root, but in fact the kernel doesn't allow
reading them when the euid and uid do not match:

  $ cp /bin/cat .
  $ sudo chown root:root ./cat
  $ sudo chmod u+s ./cat
  $ cat /proc/kallsyms | grep do_fork
   T _do_fork  <--- kptr is hidden even though euid is root
  $ sudo cat /proc/kallsyms | grep do_fork
  81080230 T _do_fork

See lib/vsprintf.c for kernel side code.

This patch fixes this problem by checking both uid and euid.

Signed-off-by: Wang Nan 
Cc: Arnaldo Carvalho de Melo 
---
  tools/perf/util/symbol.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c
index 19c9c55..9528702 100644
--- a/tools/perf/util/symbol.c
+++ b/tools/perf/util/symbol.c
@@ -1946,8 +1946,9 @@ static bool symbol__read_kptr_restrict(void)
if (fp != NULL) {
char line[8];
  
+

Sorry for this blank line. Will fix it.


if (fgets(line, sizeof(line), fp) != NULL)
-   value = (geteuid() != 0) ?
+   value = ((geteuid() != 0) || (getuid() != 0)) ?
(atoi(line) != 0) :
(atoi(line) == 2);
  





Re: [PATCH 2/2] perf record: Add --dry-run option to check cmdline options

2016-06-19 Thread Wangnan (F)



On 2016/6/17 0:48, Arnaldo Carvalho de Melo wrote:

Em Thu, Jun 16, 2016 at 08:02:41AM +, Wang Nan escreveu:

With '--dry-run', 'perf record' doesn't do real recording. Combined with
the llvm.dump-obj option, --dry-run can be used to help compile BPF objects
for embedded platforms.

So these are nice and have value, but can we have a subcommand to do all
this with an expressive name, something like:

   perf bpfcc foo.c -o foo

or shorter:

   perf bcc foo.c -o foo

Just like one would use gcc or some other compiler to generate something
for later use?


I'll try it today. I thought a subcommand requires a bigger feature,
and wrapping clang is not big enough.


That if called as:

   perf bcc foo.c

Would default to generating a foo.o file.

   Then, later, one could use this as a event name, i.e.

   trace --event foo

Would, knowing that there is no event named "foo", look at the current
directory (and in some other places perhaps) for a file named "foo" that
was a bpf object file to use as it would a foo.c, shortcircuiting the
bpf compilation code.
If this was done instead:

   trace --event foo.c

And foo.c wasn't present, it would fallback to the behaviour described
in the previous paragraph: look for a foo.o or foo bpf object file, etc.

What do you think?


I'm not sure how many people can benefit from this feature. The only
advantage I can see is that we can skip the '.c', '.o' or '.bpf' suffix.

I guess what you really want is to introduce something like buildid-cache
for BPF objects. One can compile his/her BPF scriptlets into .o using
'perf bcc' and insert it into the cache, then use the resulting object
without remembering its path.

About fallback: if the user explicitly uses '.o' or '.bpf' as the suffix,
our parser can be simpler. Technically we need a boundary to split the
event name and the configuration. '.c', '.o' and '.bpf' are boundaries.
In addition, is there any difference between '-e mybpf' and '-e mybpf.bpf'?
We can define that, when using '-e mybpf', the search path would be the
BPF object cache, and when using '-e mybpf.bpf' the search path is the
current directory. It is acceptable, but why not make '-e mybpf.bpf'
search the BPF object cache also?

Thank you.



Re: [PATCH v7 7/8] perf tools: Check write_backward during evlist config

2016-06-19 Thread Wangnan (F)



On 2016/6/17 5:47, Arnaldo Carvalho de Melo wrote:

Em Wed, Jun 15, 2016 at 02:23:34AM +, Wang Nan escreveu:

Before this patch, when using an overwritable ring buffer on an old
kernel, the error message is misleading:

  # ~/perf record -m 1 -e raw_syscalls:*/overwrite/ -a
  Error:
  The raw_syscalls:sys_enter event is not supported.

This patch outputs a clear error message to tell the user his/her kernel
is too old:

  # ~/perf record -m 1 -e raw_syscalls:*/overwrite/ -a
  Reading from overwrite event is not supported by this kernel
  Error:
  The raw_syscalls:sys_enter event is not supported.

So I went to see if exposing that missing_features struct outside
evsel.c was strictly needed and found that we already have fallbacking
for this feature (attr.write_backward), i.e. if we set it and
sys_perf_event_open() fails, we will check if we are asking the kernel
for some attr. field that it doesn't support, set that missing_features
and try again.

But the way this was done for attr.write_backward was buggy, as we need
to check features in the inverse order of their introduction to the
kernel, so that a newer tool checks first the newest perf_event_attr
fields, detecting that the older kernel doesn't have support for them.
The patch that introduced write_backward support ([1]) in perf_evsel__open()
did this checking after all the other older attributes, wrongly.

[1]: b90dc17a5d14 ("perf evsel: Add overwrite attribute and check write_backward")

Also, we shouldn't even try to call sys_perf_event_open() if
perf_missing_features.write_backward is true and evsel->overwrite is
also true; the old code would check this only after successfully opening
the fd, so do it before the open loop.

Please take a look at the following patch, see if it is sufficient for
handling older kernels, probably we need to emit a message to the user,
but that has to be done at the builtin- level, i.e. at the tool, i.e.
perf_evsel_open__strerror() should have what it takes to figure out this
extra error and provide a proper string, lemme add this to the patch...
done, please check:

write_backwards_fallback.patch:


[SNIP]

  
@@ -1496,7 +1493,10 @@ try_fallback:

 * Must probe features in the order they were added to the
 * perf_event_attr interface.
 */


I read this comment but misunderstood it. I thought 'order' meant newest last.

Will try your patch. Thank you.


-   if (!perf_missing_features.clockid_wrong && evsel->attr.use_clockid) {
+   if (!perf_missing_features.write_backward && evsel->attr.write_backward) {
+   perf_missing_features.write_backward = true;
+   goto fallback_missing_features;
+   } else if (!perf_missing_features.clockid_wrong && evsel->attr.use_clockid) {
perf_missing_features.clockid_wrong = true;
goto fallback_missing_features;
} else if (!perf_missing_features.clockid && evsel->attr.use_clockid) {
@@ -1521,10 +1521,6 @@ try_fallback:
  PERF_SAMPLE_BRANCH_NO_FLAGS))) {
perf_missing_features.lbr_flags = true;
goto fallback_missing_features;
-   } else if (!perf_missing_features.write_backward &&
-   evsel->attr.write_backward) {
-   perf_missing_features.write_backward = true;
-   goto fallback_missing_features;
}
  
  out_close:

@@ -2409,6 +2405,8 @@ int perf_evsel__open_strerror(struct perf_evsel *evsel, struct target *target,
"We found oprofile daemon running, please stop it and try again.");
break;
case EINVAL:
+   if (evsel->overwrite && perf_missing_features.write_backward)
+   return scnprintf(msg, size, "Reading from overwrite event is not supported by this kernel.");
if (perf_missing_features.clockid)
return scnprintf(msg, size, "clockid feature not supported.");
if (perf_missing_features.clockid_wrong)





Re: [PATCH v7 8/8] perf record: Unmap overwrite evlist when event terminate

2016-06-20 Thread Wangnan (F)



On 2016/6/17 4:59, Arnaldo Carvalho de Melo wrote:

Em Wed, Jun 15, 2016 at 02:23:35AM +, Wang Nan escreveu:

When seeing POLLERR or POLLHUP, unmap the ring buffer from both the main
evlist and the overwrite evlist.

When you use an auxiliary evlist this makes evlist->parent be different
than evlist, right? So can't we just hide this from tools and do it all
in perf_evlist__filter_pollfd?


So we need to find all children from the parent. Currently we can only track
the parent from a child, which is not enough. Although currently we can pass
rec->overwrite_evlist to the filter so perf_evlist__filter_pollfd knows
it should release mmaps from both evlists, if in the future we have more
than one aux evlist that won't work.

Thank you.



This way we will not need to expose perf_evlist__mmap_put(), etc.

I haven't tried to _actually_ do this, just asking from browsing the
patchkit and with the goal of hiding the internals of this thing inside
evlist.c as much as possible.

- Arnaldo
  

Signed-off-by: Wang Nan 
Cc: He Kuang 
Cc: Arnaldo Carvalho de Melo 
Cc: Jiri Olsa 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Zefan Li 
Cc: pi3or...@163.com
---
  tools/perf/builtin-record.c | 30 +-
  tools/perf/util/evlist.c|  3 +--
  tools/perf/util/evlist.h|  2 +-
  3 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index b9094f0..ca6376c 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -914,6 +914,26 @@ out:
return err;
  }
  
+static void record__munmap_filtered(struct fdarray *fda, int fd, void *arg)

+{
+   struct record *rec = (struct record *)arg;
+
+   perf_evlist__mmap_put(rec->evlist, fda->priv[fd].idx);
+   if (rec->overwrite_evlist)
+   perf_evlist__mmap_put(rec->overwrite_evlist, fda->priv[fd].idx);
+}
+
+static int record__filter_pollfd(struct record *rec)
+{
+   /*
+* Although we may have auxiliary evlist, there is
+* only one pollfd, so we don't need to filter pollfd
+* for auxiliary evlist.
+*/
+   return fdarray__filter(&rec->evlist->pollfd, POLLERR | POLLHUP,
+  record__munmap_filtered, rec);
+}
+
  static int __cmd_record(struct record *rec, int argc, const char **argv)
  {
int err;
@@ -1150,15 +1170,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
	err = 0;
	waking++;

-			/*
-			 * Although we may have auxiliary evlist, there is
-			 * only one pollfd, so we don't need to filter pollfd
-			 * for auxiliary evlist.
-			 *
-			 * TODO: if an event is terminated (POLLERR | POLLHUP),
-			 * unmap mmaps for auxiliary evlist too.
-			 */
-			if (perf_evlist__filter_pollfd(rec->evlist, POLLERR | POLLHUP) == 0)
+			if (record__filter_pollfd(rec) == 0)
				draining = true;
	}
  
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c

index e8fcb22..d43ee81 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -27,7 +27,6 @@
  #include 
  #include 
  
-static void perf_evlist__mmap_put(struct perf_evlist *evlist, int idx);

  static void __perf_evlist__munmap(struct perf_evlist *evlist, int idx);
  
  #define FD(e, x, y) (*(int *)xyarray__entry(e->fd, x, y))

@@ -863,7 +862,7 @@ static void perf_evlist__mmap_get(struct perf_evlist *evlist, int idx)
atomic_inc(&evlist->mmap[idx].refcnt);
  }
  
-static void perf_evlist__mmap_put(struct perf_evlist *evlist, int idx)

+void perf_evlist__mmap_put(struct perf_evlist *evlist, int idx)
  {
struct perf_mmap *md = &evlist->mmap[idx];
  
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h

index 41e65ac..ba5e006 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -141,7 +141,7 @@ union perf_event *perf_evlist__mmap_read_backward(struct perf_evlist *evlist,
  void perf_evlist__mmap_read_catchup(struct perf_evlist *evlist, int idx);
  
  void perf_evlist__mmap_consume(struct perf_evlist *evlist, int idx);

-
+void perf_evlist__mmap_put(struct perf_evlist *evlist, int idx);
  int perf_evlist__pause(struct perf_evlist *evlist);
  int perf_evlist__resume(struct perf_evlist *evlist);
  int perf_evlist__open(struct perf_evlist *evlist);
--
1.8.3.4





Re: [PATCH v2] tools/perf: Fix the mask in regs_dump__printf and

2016-06-20 Thread Wangnan (F)



On 2016/6/20 17:18, Jiri Olsa wrote:

On Mon, Jun 20, 2016 at 02:14:01PM +0530, Madhavan Srinivasan wrote:

When decoding the perf_regs mask in regs_dump__printf(),
we loop through the mask using find_first_bit and find_next_bit functions.
"mask" is of type "u64", but sent as a "unsigned long *" to
lib functions along with sizeof(). While the existing code works fine in
most cases, the logic is broken when using a 32-bit perf on a
64-bit big-endian kernel. We end up reading the wrong word of the u64
first in the lib functions.

hum, I still don't see why this happens.. why do we read the
wrong word in this case?


If you read a u64 using ((u32 *)&val)[0] and ((u32 *)&val)[1]
you can get the wrong value. This is what _find_next_bit() is doing.

In a big-endian environment where 'unsigned long' is 32 bits
long, "((u32 *)&val)[0]" gets the upper 32 bits, but without this patch
perf assumes it gets the lower 32 bits. The root cause is wrongly
converting a u64 value to a bitmap.


Thank you.



Re: [PATCH v8 2/8] perf evlist: Introduce aux evlist

2016-06-20 Thread Wangnan (F)



On 2016/6/21 4:36, Arnaldo Carvalho de Melo wrote:

Em Mon, Jun 20, 2016 at 10:47:19AM +, Wang Nan escreveu:

An auxiliary evlist is created by perf_evlist__new_aux() using an
existing evlist as its parent. An auxiliary evlist can have its own
'struct perf_mmap', but can't have any other data. Users should use its
parent instead when accessing other data.

Auxiliary evlists are containers of 'struct perf_mmap'. It is introduced
to allow its parent evlist to map different events into separated mmaps.

Following commits create an auxiliary evlist for overwritable
events, because overwritable events need a read only and backwards ring
buffer, which is different from normal events.

To achieve this goal, this patch carefully changes 'evlist' to
'evlist->parent' in all functions in the path of 'perf_evlist__mmap_ex',
except 'evlist->mmap' related operations, to make sure all evlist
modifications (like pollfd and event id hash tables) goes to original
evlist.

An 'evlist->parent' pointer is added to 'struct perf_evlist' and points to
the evlist itself for normal evlists.

Children of one evlist are linked into it so one can find all children
from its parent.
To avoid potential complexity, forbid creating aux evlist from another
aux evlist.

Improve perf_evlist__munmap_filtered(), so when recording, if an event
is terminated, unmap mmaps from both parent and children.
  

Signed-off-by: Wang Nan 
Cc: He Kuang 
Cc: Jiri Olsa 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Zefan Li 
Cc: pi3or...@163.com
---
  tools/perf/util/evlist.c | 49 +---
  tools/perf/util/evlist.h | 12 
  2 files changed, 50 insertions(+), 11 deletions(-)


[SNIP]


diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index 68cb136..5b50692 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -37,6 +37,10 @@ struct perf_mmap {
  
  struct perf_evlist {

struct list_head entries;
+   union {
+   struct list_head children;
+   struct list_head list;

This is something new, right? I.e. having a list of evlists from a
parent evlist, one where there can be at most one level, i.e. no
grandchildren...

Is this strictly needed? I.e. the existing code calling
perf_evlist__munmap_filtered() has just the parent pointer and thus, to
not change it, we need a way to get to the children?

Also, can you see a parent with more than one child?


If we further restrict aux evlist and parent list, allow only one aux
evlist for a parent, we can avoid the linked list. Simply linking parent
and child using pointers would be enough.

However, I plan to introduce multiple children for a parent. There are
two possible aux evlists:

  1. Trigger aux evlist: we can use a BPF script to measure events, and do
     something (dumping, printing, computing, reporting...) when it detects
     something (for example, FPS dropping to 10 frames/sec). We can create a
     bpf-output event and put it in a separate aux evlist with buffering
     turned off, so once the BPF script generates an event perf receives it
     immediately.

  2. Non-sampling aux evlist: we can use a dummy event to collect fork, comm
     and mmap{,2} events in a separate aux evlist. We can use this evlist to
     maintain system state during recording, so we can totally solve the
     problem described in patch 8/8.

Thank you.



Re: [PATCH 2/2] perf record: Add --dry-run option to check cmdline options

2016-06-20 Thread Wangnan (F)



On 2016/6/20 22:38, Arnaldo Carvalho de Melo wrote:

Em Mon, Jun 20, 2016 at 11:29:13AM +0800, Wangnan (F) escreveu:

On 2016/6/17 0:48, Arnaldo Carvalho de Melo wrote:

Em Thu, Jun 16, 2016 at 08:02:41AM +, Wang Nan escreveu:


[SNIP]


About fallback: if the user explicitly uses '.o' or '.bpf' as the suffix,
our parser can be simpler. Technically we need a boundary to split the
event name and the configuration. '.c', '.o' and '.bpf' are boundaries.
In addition, is there any difference between '-e mybpf' and '-e mybpf.bpf'?
We can define that, when using '-e mybpf', the search path would be the
BPF object cache, and when using '-e mybpf.bpf' the search path is the
current directory. It is acceptable, but why not make '-e mybpf.bpf'
search the BPF object cache also?

Well there is a namespace issue here, if we say:

perf record -e cycles

then this is well known, we want PERF_TYPE_HARDWARE,
PERF_COUNT_HW_CPU_CYCLES. If we instead use:

perf record -e cycles.c

Then this also is well known, we need to build this somehow, and right
now the only way to do this is to use the llvm/clang infrastructure and
then load it into the kernel via sys_bpf.

If we say:

perf record -e cycles.bpf

Then we don't have anything associated with this and may go on trying to
map it to a PERF_TYPE_HARDWARE, PERF_TYPE_SOFTWARE, etc till we find a
suitable event, i.e. if it doesn't match anything, we would end up
looking at a file in the current directory, figure out it is an ELF file
and that its contents are a BPF proggie, that we would load via sys_bpf,
etc.


cycles.bpf is not a good example. See tools/perf/util/parse-events.l:
  ...
  bpf_object  .*\.(o|bpf)
  ...

currently '.o' is equivalent to '.bpf'.

Thank you.




Re: [PATCH 2/2] perf record: Add --dry-run option to check cmdline options

2016-06-20 Thread Wangnan (F)



On 2016/6/21 0:22, Alexei Starovoitov wrote:

On Mon, Jun 20, 2016 at 11:38:18AM -0300, Arnaldo Carvalho de Melo wrote:

Em Mon, Jun 20, 2016 at 11:29:13AM +0800, Wangnan (F) escreveu:

On 2016/6/17 0:48, Arnaldo Carvalho de Melo wrote:

Em Thu, Jun 16, 2016 at 08:02:41AM +, Wang Nan escreveu:

With '--dry-run', 'perf record' doesn't do real recording. Combined with
the llvm.dump-obj option, --dry-run can be used to help compile BPF objects
for embedded platforms.

So these are nice and have value, but can we have a subcommand to do all
this with an expressive name, something like:
   perf bpfcc foo.c -o foo
or shorter:
   perf bcc foo.c -o foo
Just like one would use gcc or some other compiler to generate something
for later use?

I'll try it today. I thought a subcommand requires a bigger feature,
and wrapping clang is not big enough.

Not really, we may have as many as we like, given that they provide
something useful, like I think is the case here.

Having to edit ~/.perfconfig, create a new section, a variable in it
with a boolean value (at first, just reading the changeset comment, I
thought I had to provide a directory where to store the objects
"dumped"), then use a tool to record a .c event, but not recording
(use dry-run, which is useful to test the command line, etc), to then
get, in the current directory, the end result, looked to me like a
convoluted way to ask perf to compile the given .c file into a .o for
later use.

Doing:

perf bcc -c foo.c

Looks so much simpler and similar to an existing compile source code
into object file workflow (gcc's, any C compiler) that I think it would
fit in the workflow being discussed really nicely.

I'm hopeful that eventually we'll be able to merge the iovisor/bcc project
with perf, so it would be good to reserve the 'perf bcc' command for that
future use. Also, picking a different name for compiling would be less
confusing to users who are already familiar with bcc. Instead we can use:
perf bpfcc foo.c -o foo.o
perf cc foo.c
perf compile foo.c



I think perf should eventually be independent of the LLVM runtime.
I suggest a 'perf bpf' subcommand to deal with all BPF related things,
including compiling, configuration and a potential cache.

Thank you.




Re: [PATCH v8 3/8] perf tests: Add testcase for auxiliary evlist

2016-06-21 Thread Wangnan (F)



On 2016/6/22 5:05, Arnaldo Carvalho de Melo wrote:

Em Mon, Jun 20, 2016 at 10:47:20AM +, Wang Nan escreveu:

Improve test backward


[SNIP]



diff --git a/tools/perf/tests/backward-ring-buffer.c b/tools/perf/tests/backward-ring-buffer.c
index d9ba991..76e34c0 100644
--- a/tools/perf/tests/backward-ring-buffer.c
+++ b/tools/perf/tests/backward-ring-buffer.c
@@ -31,16 +31,19 @@ static int count_samples(struct perf_evlist *evlist, int *sample_count,
for (i = 0; i < evlist->nr_mmaps; i++) {
union perf_event *event;
  
-		perf_evlist__mmap_read_catchup(evlist, i);
-		while ((event = perf_evlist__mmap_read_backward(evlist, i)) != NULL) {
+		if (evlist->backward)
+			perf_evlist__mmap_read_catchup(evlist, i);

So, shouldn't this be done at perf_evlist__mmap_read_catchup()? I.e. you
will use this only when you know that one of the evlists count_samples()
will deal with can be a backwards one...

I.e. do with perf_evlist__mmap_read_catchup the same thing you did in
perf_evlist__mmap_read, checking there this evlist->backward.


I can make the code clearer, but I don't agree with hiding the
evlist->backward check in perf_evlist__mmap_read_catchup():

1. If we make perf_evlist__mmap_read_catchup() implicitly ignore
   non-backward evlists, then we are creating a new rule for reading from
   mmaps: before calling perf_evlist__mmap_read() we need to call
   perf_evlist__mmap_read_catchup() first. Theoretically, existing code
   should be adjusted to satisfy this new rule, but in practice most of
   the catchup calls would do nothing.

   If we don't require existing code to be adjusted, then we still need
   to clarify when catchup() is required, so evlist->backward is still
   exposed.

2. I think we don't need to restrict perf_evlist__mmap_read_catchup() to
   backward ring buffers. It is a generic operation and can be used on a
   normal evlist to consume existing data in the ring buffer.



This can be done on top, so I'll continue tentatively merging this.


+   while ((event = perf_evlist__mmap_read(evlist, i)) != NULL) {
const u32 type = event->header.type;
  
  			switch (type) {

case PERF_RECORD_SAMPLE:
-   (*sample_count)++;
+   if (sample_count)
+   (*sample_count)++;
break;
case PERF_RECORD_COMM:
-   (*comm_count)++;
+   if (comm_count)
+   (*comm_count)++;

You could've avoided all this by passing some dummy integer pointer for
the enter_sample_count case. Making the patch smaller helps reviewing
:-)


Will do.

Thank you.




Re: [PATCH] perf record: Fix segfault when running with suid and kptr_restrict is 1

2016-09-23 Thread Wangnan (F)



On 2016/9/22 23:41, Arnaldo Carvalho de Melo wrote:

Em Wed, Sep 21, 2016 at 03:56:20AM +, Wang Nan escreveu:

Before this patch, perf segfaults if kptr_restrict is set to 1 and perf is
owned by root with the suid bit set:

  $ whoami
  wangnan
  $ ls -l ./perf
  -rwsr-xr-x 1 root root 19781908 Sep 21 19:29 /home/wangnan/perf
  $ cat /proc/sys/kernel/kptr_restrict
  1
  $ cat /proc/sys/kernel/perf_event_paranoid
  -1
  $ ./perf record -a
  Segmentation fault (core dumped)

Trying to reproduce this here, and failing, where am I making a mistake?

  


I tried again. It is not related to paranoid. If
/proc/sys/kernel/kptr_restrict is set to 1 and perf runs with
(euid == 0 && uid != 0) it will crash at:

int perf_event__synthesize_kernel_mmap(struct perf_tool *tool,
...
	kmap = map__kmap(map);
	size = snprintf(event->mmap.filename, sizeof(event->mmap.filename),
			"%s%s", mmap_name, kmap->ref_reloc_sym->name) + 1; <-- *kmap->ref_reloc_sym is NULL*
...
}

because the following call chain:

machine__create_kernel_maps
 machine__get_running_kernel_start

returns NULL, so kmap->ref_reloc_sym is never set.

A non-root user never reaches the crash point because in
perf_event__synthesize_kernel_mmap() symbol_conf.kptr_restrict is true,
so perf_event__synthesize_kernel_mmap directly returns failure.

Have you tried the following 'cat' test?


The reason is that perf assumes it is allowed to read kernel pointers from
/proc/kallsyms when the euid is root, but in fact the kernel doesn't allow
reading them when the euid and uid do not match:
  

  $ cp /bin/cat .
  $ sudo chown root:root ./cat
  $ sudo chmod u+s ./cat
  $ cat /proc/kallsyms | grep do_fork
   T _do_fork  <--- kptr is hidden even though euid is root
  $ sudo cat /proc/kallsyms | grep do_fork
  81080230 T _do_fork

See lib/vsprintf.c for kernel side code.

This patch fixes this problem by checking both uid and euid.

Humm, can't we just do:

-   value = (geteuid() != 0) ?
+   value = (getuid() != 0) ?


No. Though not very useful, with chown and chmod u+s one can make
((uid == 0) && (euid != 0)):

 # chown wangnan:wangnan ./perf
 # chmod u+s ./perf

Then perf fails if run by root:
 # ./perf record -a
 Segmentation fault (core dumped)

But okay for normal user.

Thank you.


I did it here and it seems to work:

[acme@jouet linux]$ whoami
acme
[acme@jouet linux]$ ls -la ~/bin/perf
-rwsr-xr-x. 2 root root 15539952 Sep 22 12:38 /home/acme/bin/perf
[acme@jouet linux]$ cat /proc/sys/kernel/kptr_restrict
1
[acme@jouet linux]$ cat /proc/sys/kernel/perf_event_paranoid
-1
[acme@jouet linux]$ ~acme/bin/perf record -a
   Warning: File /home/acme/.perfconfig not owned by current user or root, ignoring it.
WARNING: Kernel address maps (/proc/{kallsyms,modules}) are restricted,
check /proc/sys/kernel/kptr_restrict.

Samples in kernel functions may not be resolved if a suitable vmlinux
file is not found in the buildid cache or in the vmlinux path.

Samples in kernel modules won't be resolved at all.

If some relocation was applied (e.g. kexec) symbols may be misresolved
even with a suitable vmlinux or kallsyms file.

Couldn't record kernel reference relocation symbol
Symbol resolution may be skewed if relocation was used (e.g. kexec).
Check /proc/kallsyms permission or run as root.
^C[ perf record: Woken up 17 times to write data ]
[ perf record: Captured and wrote 6.474 MB perf.data (90771 samples) ]

[acme@jouet linux]$ perf report -D 2>&1 | grep PERF_RECORD_SAMPLE | wc -l
90770
[acme@jouet linux]$ perf evlist
   Warning: File /home/acme/.perfconfig not owned by current user or root, ignoring it.
cycles:ppp
[acme@jouet linux]$ ls -la ~/.perfconfig
-rw-rw-r--. 1 acme acme 27 Aug 12 17:57 /home/acme/.perfconfig
[acme@jouet linux]$


- Arnaldo

  

Signed-off-by: Wang Nan 
Cc: Arnaldo Carvalho de Melo 
---

Resend with a meaningless blank line removed.

---
  tools/perf/util/symbol.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c
index 19c9c55..c55e781 100644
--- a/tools/perf/util/symbol.c
+++ b/tools/perf/util/symbol.c
@@ -1947,7 +1947,7 @@ static bool symbol__read_kptr_restrict(void)
char line[8];
  
  		if (fgets(line, sizeof(line), fp) != NULL)

-   value = (geteuid() != 0) ?
+   value = ((geteuid() != 0) || (getuid() != 0)) ?
(atoi(line) != 0) :
(atoi(line) == 2);
  
--

1.8.3.4





Re: [PATCH 00/14] perf clang: Support compiling BPF script use builtin clang

2016-09-25 Thread Wangnan (F)



On 2016/9/24 23:16, Alexei Starovoitov wrote:

On Fri, Sep 23, 2016 at 12:49:47PM +, Wang Nan wrote:

This patch set is the first step to implement features I announced
in LinuxCon NA 2016. See page 31 of:

  
http://events.linuxfoundation.org/sites/events/files/slides/Performance%20Monitoring%20and%20Analysis%20Using%20perf%20and%20BPF_1.pdf

This patch set links LLVM and Clang libraries to perf, so perf
is able to compile BPF script to BPF object on the fly.

Nice!
So single perf binary won't have llvm external dependency anymore
or both ways will be maintained?
The command line stays the same?


Yes. This patch set doesn't change the interface. It compiles a BPF script
with the builtin clang, and if that fails, falls back to external clang.


If I understand the patches correctly, this set is establishing
the basic functionality and more complex features coming?



Yes. Following steps are:

 1. Ease of use improvement: automatically include BPF functions
declaration and macros.

 2. Perf's hook: compile part of BPF script into native code, run
them in perf when something happen. Create a channel, coordinate
BPF and native code use bpf-output event.

  3. Define a new language to support common profiling tasks. I'm not
     very clear what the new language should be. It may look like lua;
     perf converts it to C and then to LLVM IR with the builtin clang.

Thank you.



Re: [PATCH 00/14] perf clang: Support compiling BPF script use builtin clang

2016-09-26 Thread Wangnan (F)



On 2016/9/27 7:58, Arnaldo Carvalho de Melo wrote:


On Sep 26, 2016, 8:47 PM, "Alexei Starovoitov"  wrote:

>
> On Mon, Sep 26, 2016 at 09:49:30AM +0800, Wangnan (F) wrote:
> >
> >
> > On 2016/9/24 23:16, Alexei Starovoitov wrote:
> > >On Fri, Sep 23, 2016 at 12:49:47PM +, Wang Nan wrote:
> > >>This patch set is the first step to implement features I announced
> > >>in LinuxCon NA 2016. See page 31 of:
> > >>
> > >> http://events.linuxfoundation.org/sites/events/files/slides/Performance%20Monitoring%20and%20Analysis%20Using%20perf%20and%20BPF_1.pdf

> > >>
> > >>This patch set links LLVM and Clang libraries to perf, so perf
> > >>is able to compile BPF script to BPF object on the fly.
> > >Nice!
> > >So single perf binary won't have llvm external dependency anymore
> > >or both ways will be maintained?
> > >The command line stays the same?
> >
> > Yes. This patch set doesn't change interface. It compiles BPF script
> > with builtin clang, and if it fail, fall back to external clang.
> >
> > >If I understand the patches correctly, this set is establishing
> > >the basic functionality and more complex features coming?
> > >
> >
> > Yes. Following steps are:
> >
> >  1. Ease of use improvement: automatically include BPF functions
> > declaration and macros.
>
> +1
>
> >  2. Perf's hook: compile part of BPF script into native code, run
> > them in perf when something happen. Create a channel, coordinate
> > BPF and native code use bpf-output event.
>
> +1
>
> >  3. Define a new language to support common profiling tasks. I'm not
> > very clear what the new language should be. It may look like lua;
> > perf converts it to C and then to LLVM IR with the builtin clang.
>
> Many tracing languages were invented in the past.
> At this point I'm not sure what exactly a new language would solve.
> To make it easier to write bpf programs?
> I think it will be more fruitful to tweak clang/llvm to add
> good warnings/errors for cases where we know that C is not going
> to be compiled into code that the kernel verifier will accept.
> Like we can complain about loops, uninitialized variables,
> non-inlined and unknown helper functions... all from clang/llvm.
> imo that would be the better path forward and will help
> both tracing and networking users who write in this restricted C.

++1
>



OK. Now let's focus on the first two goals. After that let's consider
how to help with writing BPF programs.

Thank you.




Re: [PATCH] perf/core: Add a tracepoint for perf sampling

2016-07-29 Thread Wangnan (F)



On 2016/7/30 2:05, Brendan Gregg wrote:

On Tue, Jul 19, 2016 at 4:20 PM, Brendan Gregg  wrote:

When perf is performing hrtimer-based sampling, this tracepoint can be used
by BPF to run additional logic on each sample. For example, BPF can fetch
stack traces and frequency count them in kernel context, for an efficient
profiler.

Any comments on this patch? Thanks,

Brendan


Sorry for the late reply.

I think it is a useful feature. Could you please provide an example
to show how to use it in perf?

If I understand correctly, I can have a BPF script run 99 times per
second using

  # perf -e cpu-clock/freq=99/ -e mybpf.c ...

And in mybpf.c, attach a BPF program to the new tracepoint. Right?

Also, since we already have timer:hrtimer_expire_entry, please provide
some further information about why we need a new tracepoint.

Thank you.



Re: [PATCH 1/3] tools include: Add uapi mman.h for each architecture

2016-09-12 Thread Wangnan (F)



On 2016/9/13 5:15, Arnaldo Carvalho de Melo wrote:

Em Mon, Sep 12, 2016 at 04:07:42PM -0300, Arnaldo Carvalho de Melo escreveu:

Em Mon, Sep 12, 2016 at 12:54:29PM +, Wang Nan escreveu:

Some mmap-related macros have different values on different
architectures. This patch introduces a uapi mman.h for each
architecture.

Three headers are cloned from kernel include to tools/include:

  tools/include/uapi/asm-generic/mman-common.h
  tools/include/uapi/asm-generic/mman.h
  tools/include/uapi/linux/mman.h

Cool, the above was done as copies, why not the rest? IIRC you mentioned
some reasoning behind that decision, but we need it spelled out here.

For instance, I went on and look at arch/xtensa/include/uapi/asm/mman.h,
and couldn't find why we shouldn't copy it just like the three files
above.

I'm looking now at why the build breaks in so many systems, first hunch
is that bits/ part (the ones without the failure details probably have
the same errors), alpine uses musl libc, but some that broke are glibc
based.

So, please take a look at my perf/core branch, I applied 1/3 and 3/3,
but took a different path for 2/3, now it builds for all systems I have
containers for:


Clever! The key is to use the uapi mman.h instead of the libc header.
The code that requires these macros coincidentally doesn't need the
declarations of functions like mmap, munmap and mprotect. Your solution
is better.

Thank you.


   # time dm
1 67.700 alpine:3.4: Ok
2 23.565 android-ndk:r12b-arm: Ok
3 67.823 archlinux:latest: Ok
4 37.277 centos:5: Ok
5 57.621 centos:6: Ok
6 68.348 centos:7: Ok
7 61.971 debian:7: Ok
8 65.711 debian:8: Ok
9 36.894 debian:experimental: Ok
   10 66.274 fedora:20: Ok
   11 70.054 fedora:21: Ok
   12 68.310 fedora:22: Ok
   13 68.033 fedora:23: Ok
   14 72.322 fedora:24: Ok
   15 29.529 fedora:24-x-ARC-uClibc: Ok
   16 77.458 fedora:rawhide: Ok
   17 80.110 mageia:5: Ok
   18 72.664 opensuse:13.2: Ok
   19 70.878 opensuse:42.1: Ok
   20 80.322 opensuse:tumbleweed: Ok
   21 62.237 ubuntu:12.04.5: Ok
   22 39.998 ubuntu:14.04: Ok
   23 69.383 ubuntu:14.04.4: Ok
   24 76.120 ubuntu:15.10: Ok
   25 69.668 ubuntu:16.04: Ok
   26 69.061 ubuntu:16.04-x-arm: Ok
   27 73.337 ubuntu:16.04-x-arm64: Ok
   28 77.061 ubuntu:16.04-x-powerpc64: Ok
   29 55.340 ubuntu:16.04-x-powerpc64el: Ok
   30 85.579 ubuntu:16.10: Ok
   31 59.645 ubuntu:16.10-x-s390: Ok

   real 32m59.385s
   user 0m1.856s
   sys  0m2.077s
   #






Re: [PATCH 1/3] tools include: Add uapi mman.h for each architecture

2016-09-14 Thread Wangnan (F)



On 2016/9/14 17:28, Naveen N. Rao wrote:

On 2016/09/12 06:15PM, Arnaldo Carvalho de Melo wrote:

Em Mon, Sep 12, 2016 at 04:07:42PM -0300, Arnaldo Carvalho de Melo escreveu:

Em Mon, Sep 12, 2016 at 12:54:29PM +, Wang Nan escreveu:

Some mmap related macros have different values for different
architectures. This patch introduces uapi mman.h for each
architectures.

Three headers are cloned from kernel include to tools/include:

  tools/include/uapi/asm-generic/mman-common.h
  tools/include/uapi/asm-generic/mman.h
  tools/include/uapi/linux/mman.h

Cool, the above was done as copies, why not the rest? IIRC you mentioned
some reasoning behind that decision, but we need it spelled out here.

For instance, I went on and look at arch/xtensa/include/uapi/asm/mman.h,
and couldn't find why we shouldn't copy it just like the three files
above.

I'm looking now at why the build breaks in so many systems, first hunch
is that bits/ part (the ones without the failure details probably have
the same errors), alpine uses musl libc, but some that broke are glibc
based.

So, please take a look at my perf/core branch, I applied 1/3 and 3/3,
but took a different path for 2/3, now it builds for all systems I have
containers for:

This still fails for me on ppc64. Perhaps we should guard
P_MMAP_FLAG(32BIT) and potentially others with a #ifdef, which was
earlier reverted by commit 256763b0 ("perf trace beauty mmap: Add more
conditional defines")?


Perhaps we should set all non-existent flags to 0 in each uapi mman.h?

Thank you.


- Naveen

---
In file included from builtin-trace.c:560:0:
trace/beauty/mmap.c: In function ‘syscall_arg__scnprintf_mmap_flags’:
trace/beauty/mmap.c:38:14: error: ‘MAP_32BIT’ undeclared (first use in this 
function)
   if (flags & MAP_##n) { \
   ^
trace/beauty/mmap.c:45:2: note: in expansion of macro ‘P_MMAP_FLAG’
   P_MMAP_FLAG(32BIT);
   ^
trace/beauty/mmap.c:38:14: note: each undeclared identifier is reported only 
once for each function it appears in
   if (flags & MAP_##n) { \
   ^
trace/beauty/mmap.c:45:2: note: in expansion of macro ‘P_MMAP_FLAG’
   P_MMAP_FLAG(32BIT);
   ^
   CC   bench/mem-functions.o
mv: cannot stat ‘./.builtin-trace.o.tmp’: No such file or directory
make[2]: *** [builtin-trace.o] Error 1
make[2]: *** Waiting for unfinished jobs






Re: [PATCH 1/3] tools include: Add uapi mman.h for each architecture

2016-09-14 Thread Wangnan (F)



On 2016/9/14 18:00, Naveen N. Rao wrote:

On 2016/09/14 05:36PM, Wang Nan wrote:


On 2016/9/14 17:28, Naveen N. Rao wrote:

On 2016/09/12 06:15PM, Arnaldo Carvalho de Melo wrote:

Em Mon, Sep 12, 2016 at 04:07:42PM -0300, Arnaldo Carvalho de Melo escreveu:

Em Mon, Sep 12, 2016 at 12:54:29PM +, Wang Nan escreveu:

Some mmap related macros have different values for different
architectures. This patch introduces uapi mman.h for each
architectures.

Three headers are cloned from kernel include to tools/include:

   tools/include/uapi/asm-generic/mman-common.h
   tools/include/uapi/asm-generic/mman.h
   tools/include/uapi/linux/mman.h

Cool, the above was done as copies, why not the rest? IIRC you mentioned
some reasoning behind that decision, but we need it spelled out here.

For instance, I went on and look at arch/xtensa/include/uapi/asm/mman.h,
and couldn't find why we shouldn't copy it just like the three files
above.

I'm looking now at why the build breaks in so many systems, first hunch
is that bits/ part (the ones without the failure details probably have
the same errors), alpine uses musl libc, but some that broke are glibc
based.

So, please take a look at my perf/core branch, I applied 1/3 and 3/3,
but took a different path for 2/3, now it builds for all systems I have
containers for:

This still fails for me on ppc64. Perhaps we should guard
P_MMAP_FLAG(32BIT) and potentially others with a #ifdef, which was
earlier reverted by commit 256763b0 ("perf trace beauty mmap: Add more
conditional defines")?

Perhaps we should set all non-exist flag to 0 in each uapi mman.h?

Will that work for MADV_* since the macro there is for a case statement?


Then fall back to include/uapi/asm-generic/mman-common.h. And I
realized that MADV_FREE is missing in tools/perf/trace/beauty/mmap.c.
Is that intentional?

Thank you.




Re: [PATCH 1/3] tools include: Add uapi mman.h for each architecture

2016-09-14 Thread Wangnan (F)



On 2016/9/14 18:46, Naveen N. Rao wrote:

On 2016/09/14 06:23PM, Wang Nan wrote:


On 2016/9/14 18:00, Naveen N. Rao wrote:

On 2016/09/14 05:36PM, Wang Nan wrote:

On 2016/9/14 17:28, Naveen N. Rao wrote:

On 2016/09/12 06:15PM, Arnaldo Carvalho de Melo wrote:

Em Mon, Sep 12, 2016 at 04:07:42PM -0300, Arnaldo Carvalho de Melo escreveu:

Em Mon, Sep 12, 2016 at 12:54:29PM +, Wang Nan escreveu:

Some mmap related macros have different values for different
architectures. This patch introduces uapi mman.h for each
architectures.

Three headers are cloned from kernel include to tools/include:

tools/include/uapi/asm-generic/mman-common.h
tools/include/uapi/asm-generic/mman.h
tools/include/uapi/linux/mman.h

Cool, the above was done as copies, why not the rest? IIRC you mentioned
some reasoning behind that decision, but we need it spelled out here.

For instance, I went on and look at arch/xtensa/include/uapi/asm/mman.h,
and couldn't find why we shouldn't copy it just like the three files
above.

I'm looking now at why the build breaks in so many systems, first hunch
is that bits/ part (the ones without the failure details probably have
the same errors), alpine uses musl libc, but some that broke are glibc
based.

So, please take a look at my perf/core branch, I applied 1/3 and 3/3,
but took a different path for 2/3, now it builds for all systems I have
containers for:

This still fails for me on ppc64. Perhaps we should guard
P_MMAP_FLAG(32BIT) and potentially others with a #ifdef, which was
earlier reverted by commit 256763b0 ("perf trace beauty mmap: Add more
conditional defines")?

Perhaps we should set all non-exist flag to 0 in each uapi mman.h?

Will that work for MADV_* since the macro there is for a case statement?

Then fall back to include/uapi/asm-generic/mman-common.h. And I

mman-common.h is already included on powerpc:

# cat tools/arch/powerpc/include/uapi/asm/mman.h
#ifndef TOOLS_ARCH_POWERPC_UAPI_ASM_MMAN_FIX_H
#define TOOLS_ARCH_POWERPC_UAPI_ASM_MMAN_FIX_H
#define MAP_DENYWRITE	0x0800
#define MAP_EXECUTABLE	0x1000
#define MAP_GROWSDOWN	0x0100
#define MAP_HUGETLB	0x40000
#define MAP_LOCKED	0x80
#define MAP_NONBLOCK	0x10000
#define MAP_NORESERVE	0x40
#define MAP_POPULATE	0x8000
#define MAP_STACK	0x20000
#include <uapi/asm-generic/mman-common.h>
#endif


So powerpc is free from the MADV_* problem.

Alpha, MIPS, parisc and xtensa miss some MADV_* macros;
we should define them in their uapi/mman.h.

Thank you.


realized that MADV_FREE is missing in tools/perf/trace/beauty/mmap.c.
Is that intentional?

Not sure, Arnaldo should know.

Thanks,
Naveen






[BUG] perf: memory leak in perf top

2016-08-05 Thread Wangnan (F)

perf top consumes much more memory than it needs.

Using following commands:

 # yes '' > /dev/null
 # perf top -e raw_syscalls:sys_enter

System will quickly become unresponsive because of out of memory.

Using GDB I find two call stacks of malloc. See below.

[call stack 1]

#0  __GI___libc_malloc (bytes=140723951319552) at malloc.c:2841
#1  0x004b7246 in memdup (src=0x7f515e04b054, len=68) at ../lib/string.c:29
#2  0x004f9299 in hist_entry__init (sample_self=true, template=0x7ffcd9213a80, he=0xa37da50) at util/hist.c:401
#3  hist_entry__new (template=template@entry=0x7ffcd9213a80, sample_self=sample_self@entry=true) at util/hist.c:458
#4  0x004fa13f in hists__findnew_entry (hists=0x3549d30, hists=0x3549d30, al=0x7ffcd9213ca0, sample_self=true, entry=0x7ffcd9213a80) at util/hist.c:544
#5  __hists__add_entry (hists=0x3549d30, al=0x7ffcd9213ca0, sym_parent=<optimized out>, bi=bi@entry=0x0, mi=mi@entry=0x0, sample=<optimized out>, sample_self=sample_self@entry=true, ops=ops@entry=0x0) at util/hist.c:600
#6  0x004fa3d6 in hists__add_entry (sample_self=true, sample=<optimized out>, mi=0x0, bi=0x0, sym_parent=<optimized out>, al=<optimized out>, hists=<optimized out>) at util/hist.c:611
#7  iter_add_single_normal_entry (iter=0x7ffcd9213ce0, al=<optimized out>) at util/hist.c:823
#8  0x004fc3c5 in hist_entry_iter__add (iter=0x7ffcd9213ce0, al=0x7ffcd9213ca0, max_stack_depth=<optimized out>, arg=0x7ffcd9214120) at util/hist.c:1030
#9  0x004536d7 in perf_event__process_sample (machine=<optimized out>, sample=0x7ffcd9213d30, evsel=0x3549bb0, event=<optimized out>, tool=0x7ffcd9214120) at builtin-top.c:800
#10 perf_top__mmap_read_idx (top=top@entry=0x7ffcd9214120, idx=idx@entry=0) at builtin-top.c:866
#11 0x004559fd in perf_top__mmap_read (top=0x7ffcd9214120) at builtin-top.c:883
#12 __cmd_top (top=0x7ffcd9214120) at builtin-top.c:1018
#13 cmd_top (argc=<optimized out>, argv=<optimized out>, prefix=<optimized out>) at builtin-top.c:1344
#14 0x00493561 in run_builtin (p=p@entry=0x948000, argc=argc@entry=4, argv=argv@entry=0x7ffcd92199b0) at perf.c:357
#15 0x00437157 in handle_internal_command (argv=0x7ffcd92199b0, argc=4) at perf.c:419
#16 run_argv (argv=0x7ffcd9219720, argcp=0x7ffcd921972c) at perf.c:465
#17 main (argc=4, argv=0x7ffcd92199b0) at perf.c:609

[call stack 2]

#0  __GI___libc_malloc (bytes=140723951319456) at malloc.c:2841
#1  0x005766de in trace_seq_init (s=0x7ffcd92139b0) at trace-seq.c:60
#2  0x004f7047 in get_trace_output (he=0x7ffcd9213a80) at util/sort.c:584
#3  sort__trace_cmp (left=0x40e6d30, right=0x7ffcd9213a80) at util/sort.c:606
#4  0x004fa007 in hist_entry__cmp (right=0x7ffcd9213a80, left=0x40e6d30) at util/hist.c:1072
#5  hists__findnew_entry (hists=0x3549d30, hists=0x3549d30, al=0x7ffcd9213ca0, sample_self=true, entry=0x7ffcd9213a80) at util/hist.c:509
#6  __hists__add_entry (hists=0x3549d30, al=0x7ffcd9213ca0, sym_parent=<optimized out>, bi=bi@entry=0x0, mi=mi@entry=0x0, sample=<optimized out>, sample_self=sample_self@entry=true, ops=ops@entry=0x0) at util/hist.c:600
#7  0x004fa3d6 in hists__add_entry (sample_self=true, sample=<optimized out>, mi=0x0, bi=0x0, sym_parent=<optimized out>, al=<optimized out>, hists=<optimized out>) at util/hist.c:611
#8  iter_add_single_normal_entry (iter=0x7ffcd9213ce0, al=<optimized out>) at util/hist.c:823
#9  0x004fc3c5 in hist_entry_iter__add (iter=0x7ffcd9213ce0, al=0x7ffcd9213ca0, max_stack_depth=<optimized out>, arg=0x7ffcd9214120) at util/hist.c:1030
#10 0x004536d7 in perf_event__process_sample (machine=<optimized out>, sample=0x7ffcd9213d30, evsel=0x3549bb0, event=<optimized out>, tool=0x7ffcd9214120) at builtin-top.c:800
#11 perf_top__mmap_read_idx (top=top@entry=0x7ffcd9214120, idx=idx@entry=0) at builtin-top.c:866
#12 0x004559fd in perf_top__mmap_read (top=0x7ffcd9214120) at builtin-top.c:883
#13 __cmd_top (top=0x7ffcd9214120) at builtin-top.c:1018
#14 cmd_top (argc=<optimized out>, argv=<optimized out>, prefix=<optimized out>) at builtin-top.c:1344
#15 0x00493561 in run_builtin (p=p@entry=0x948000, argc=argc@entry=4, argv=argv@entry=0x7ffcd92199b0) at perf.c:357
#16 0x00437157 in handle_internal_command (argv=0x7ffcd92199b0, argc=4) at perf.c:419
#17 run_argv (argv=0x7ffcd9219720, argcp=0x7ffcd921972c) at perf.c:465
#18 main (argc=4, argv=0x7ffcd92199b0) at perf.c:609



Re: [BUG] perf: memory leak in perf top

2016-08-08 Thread Wangnan (F)



On 2016/8/5 22:12, Arnaldo Carvalho de Melo wrote:

Em Fri, Aug 05, 2016 at 06:58:12PM +0800, Wangnan (F) escreveu:

perf top consumes much more memory than it needs.

Using following commands:

  # yes '' > /dev/null
  # perf top -e raw_syscalls:sys_enter

System will quickly become unresponsive because of out of memory.

Using GDB I find two call stacks of malloc. See below.

[call stack 1]

#0  __GI___libc_malloc (bytes=140723951319552) at malloc.c:2841
#1  0x004b7246 in memdup (src=0x7f515e04b054, len=68) at ../lib/string.c:29
#2  0x004f9299 in hist_entry__init (sample_self=true, template=0x7ffcd9213a80, he=0xa37da50) at util/hist.c:401
#3  hist_entry__new (template=template@entry=0x7ffcd9213a80, sample_self=sample_self@entry=true) at util/hist.c:458
#4  0x004fa13f in hists__findnew_entry (hists=0x3549d30, hists=0x3549d30, al=0x7ffcd9213ca0, sample_self=true, entry=0x7ffcd9213a80) at util/hist.c:544
#5  __hists__add_entry (hists=0x3549d30, al=0x7ffcd9213ca0, sym_parent=<optimized out>, bi=bi@entry=0x0, mi=mi@entry=0x0, sample=<optimized out>, sample_self=sample_self@entry=true, ops=ops@entry=0x0) at util/hist.c:600
#6  0x004fa3d6 in hists__add_entry (sample_self=true, sample=<optimized out>, mi=0x0, bi=0x0, sym_parent=<optimized out>, al=<optimized out>, hists=<optimized out>) at util/hist.c:611
#7  iter_add_single_normal_entry (iter=0x7ffcd9213ce0, al=<optimized out>) at util/hist.c:823
#8  0x004fc3c5 in hist_entry_iter__add (iter=0x7ffcd9213ce0, al=0x7ffcd9213ca0, max_stack_depth=<optimized out>, arg=0x7ffcd9214120) at util/hist.c:1030
#9  0x004536d7 in perf_event__process_sample (machine=<optimized out>, sample=0x7ffcd9213d30, evsel=0x3549bb0, event=<optimized out>, tool=0x7ffcd9214120) at builtin-top.c:800
#10 perf_top__mmap_read_idx (top=top@entry=0x7ffcd9214120, idx=idx@entry=0) at builtin-top.c:866
#11 0x004559fd in perf_top__mmap_read (top=0x7ffcd9214120) at builtin-top.c:883
#12 __cmd_top (top=0x7ffcd9214120) at builtin-top.c:1018
#13 cmd_top (argc=<optimized out>, argv=<optimized out>, prefix=<optimized out>) at builtin-top.c:1344
#14 0x00493561 in run_builtin (p=p@entry=0x948000, argc=argc@entry=4, argv=argv@entry=0x7ffcd92199b0) at perf.c:357
#15 0x00437157 in handle_internal_command (argv=0x7ffcd92199b0, argc=4) at perf.c:419
#16 run_argv (argv=0x7ffcd9219720, argcp=0x7ffcd921972c) at perf.c:465
#17 main (argc=4, argv=0x7ffcd92199b0) at perf.c:609

[call stack 2]

#0  __GI___libc_malloc (bytes=140723951319456) at malloc.c:2841
#1  0x005766de in trace_seq_init (s=0x7ffcd92139b0) at trace-seq.c:60
#2  0x004f7047 in get_trace_output (he=0x7ffcd9213a80) at util/sort.c:584

Yeah, this allocates 4K, way too much, can you try with this patch,
which amounts to a trace_seq_trim()?

diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index 947d21f38398..37ddcf57500f 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -588,7 +588,7 @@ static char *get_trace_output(struct hist_entry *he)
} else {
pevent_event_info(&seq, evsel->tp_format, &rec);
}
-   return seq.buffer;
+   return realloc(seq.buffer, seq.len);
  }
  
  static int64_t


It works: from 1.8g/10s to 40m/10s, but unfortunately your
patch is wrong: please use realloc(seq.buffer, seq.len + 1)
because of the ending '\0' :)

In my experiment RSS continuously increases because I use
raw_syscalls, which can result in too many entries to record
in perf top because of the nature of the argument list (garbage
arguments are also considered by perf). It becomes much better
if I use --fields to reduce the number of possible entries.
However, theoretically perf will eventually consume all
system memory, unless we carefully choose tracepoints and
fields to make sure the number of possible entries is limited.

Thank you.



Re: [PATCH 2/2] perf record: Replace 'overwrite' by 'flightrecorder' for better naming

2017-11-01 Thread Wangnan (F)



On 2017/11/1 18:03, Namhyung Kim wrote:

On Wed, Nov 01, 2017 at 05:53:27AM +, Wang Nan wrote:

The meaning of perf record's "overwrite" option and of many "overwrite"s in
the source code is not clear. In perf's code, 'overwrite' has 2 meanings:
  1. Make the ring buffer read-only (perf_evlist__mmap_ex's argument).
  2. Set an evsel's "backward" attribute (in apply_config_terms).

perf record doesn't use meaning 1 at all, but it has an overwrite option whose
real meaning is setting backward.

This patch separates these two concepts and introduces a 'flightrecorder' mode,
which is what we really want. It combines these 2 concepts, wrapping
them into a record mode. In flight recorder mode, perf only dumps data before
something happens.

I'm OK with it, but removing the old name doesn't look good. How about
keeping it for a while (as deprecated)?


Is there a way to hide '--overwrite' from 'perf record --help' and print
something when the user actually uses it?


And 'flightrecorder' seems too long.  Maybe you can use an acronym
like FDR or fdr-mode?


fdr-mode is a good name.

Thank you.



Re: [PATCH 1/2] perf mmap: Fix perf backward recording

2017-11-01 Thread Wangnan (F)



On 2017/11/1 17:49, Namhyung Kim wrote:

Hi,

On Wed, Nov 01, 2017 at 05:53:26AM +, Wang Nan wrote:

perf record backward recording doesn't work as we expected: it never
overwrites when the ring buffer is full.

Test:

(Run a busy printing python task background like this:

  while True:
  print 123

send SIGUSR2 to perf to capture snapshot.)


[SNIP]



Signed-off-by: Wang Nan 
---
  tools/perf/util/evlist.c | 8 +++-
  1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index c6c891e..4c5daba 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -799,22 +799,28 @@ perf_evlist__should_poll(struct perf_evlist *evlist __maybe_unused,
 }
 
 static int perf_evlist__mmap_per_evsel(struct perf_evlist *evlist, int idx,
-				       struct mmap_params *mp, int cpu_idx,
+				       struct mmap_params *_mp, int cpu_idx,
 				       int thread, int *_output, int *_output_backward)
 {
 	struct perf_evsel *evsel;
 	int revent;
 	int evlist_cpu = cpu_map__cpu(evlist->cpus, cpu_idx);
+	struct mmap_params *mp;
 
 	evlist__for_each_entry(evlist, evsel) {
 		struct perf_mmap *maps = evlist->mmap;
+		struct mmap_params rdonly_mp;
 		int *output = _output;
 		int fd;
 		int cpu;
 
+		mp = _mp;
 		if (evsel->attr.write_backward) {
 			output = _output_backward;
 			maps = evlist->backward_mmap;
+			rdonly_mp = *_mp;
+			rdonly_mp.prot &= ~PROT_WRITE;
+			mp = &rdonly_mp;
 
 			if (!maps) {
 				maps = perf_evlist__alloc_mmap(evlist);
--

What about this instead (not tested)?


diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index c6c891e154a6..27ebe355e794 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -838,6 +838,11 @@ static int perf_evlist__mmap_per_evsel(struct perf_evlist *evlist, int idx,
 		if (*output == -1) {
 			*output = fd;
 
+			if (evsel->attr.write_backward)
+				mp->prot = PROT_READ;
+			else
+				mp->prot = PROT_READ | PROT_WRITE;
+

If evlist->overwrite is true, PROT_WRITE should be unset even if
write_backward is not set. If you want to delay the setting of mp->prot,
you need to consider both evlist->overwrite and evsel->attr.write_backward.

Then why not remove mp->prot entirely and add a prot argument to
perf_mmap__mmap, since the prot setting is always decided independently?


 			if (perf_mmap__mmap(&maps[idx], mp, *output) < 0)
 				return -1;
 		} else {
@@ -1058,9 +1063,7 @@ int perf_evlist__mmap_ex(struct perf_evlist *evlist, unsigned int pages,
 	struct perf_evsel *evsel;
 	const struct cpu_map *cpus = evlist->cpus;
 	const struct thread_map *threads = evlist->threads;
-	struct mmap_params mp = {
-		.prot = PROT_READ | (overwrite ? 0 : PROT_WRITE),
-	};
+	struct mmap_params mp;
 
 	if (!evlist->mmap)
 		evlist->mmap = perf_evlist__alloc_mmap(evlist);


The 'overwrite' argument in perf_evlist__mmap_ex() controls 
evlist->overwrite.


Thank you.



Re: [PATCH 1/2] perf mmap: Fix perf backward recording

2017-11-01 Thread Wangnan (F)



On 2017/11/1 20:00, Namhyung Kim wrote:

On Wed, Nov 01, 2017 at 06:32:50PM +0800, Wangnan (F) wrote:


On 2017/11/1 17:49, Namhyung Kim wrote:

Hi,

On Wed, Nov 01, 2017 at 05:53:26AM +, Wang Nan wrote:

perf record backward recording doesn't work as we expected: it never
overwrites when the ring buffer is full.

Test:

(Run a busy printing python task background like this:

   while True:
   print 123

send SIGUSR2 to perf to capture snapshot.)

[SNIP]


Signed-off-by: Wang Nan 
---
   tools/perf/util/evlist.c | 8 +++-
   1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index c6c891e..4c5daba 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -799,22 +799,28 @@ perf_evlist__should_poll(struct perf_evlist *evlist 
__maybe_unused,
   }
   static int perf_evlist__mmap_per_evsel(struct perf_evlist *evlist, int idx,
-  struct mmap_params *mp, int cpu_idx,
+  struct mmap_params *_mp, int cpu_idx,
   int thread, int *_output, int 
*_output_backward)
   {
struct perf_evsel *evsel;
int revent;
int evlist_cpu = cpu_map__cpu(evlist->cpus, cpu_idx);
+   struct mmap_params *mp;
evlist__for_each_entry(evlist, evsel) {
struct perf_mmap *maps = evlist->mmap;
+   struct mmap_params rdonly_mp;
int *output = _output;
int fd;
int cpu;
+   mp = _mp;
if (evsel->attr.write_backward) {
output = _output_backward;
maps = evlist->backward_mmap;
+   rdonly_mp = *_mp;
+   rdonly_mp.prot &= ~PROT_WRITE;
+   mp = &rdonly_mp;
if (!maps) {
maps = perf_evlist__alloc_mmap(evlist);
--

What about this instead (not tested)?


diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index c6c891e154a6..27ebe355e794 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -838,6 +838,11 @@ static int perf_evlist__mmap_per_evsel(struct perf_evlist 
*evlist, int idx,
  if (*output == -1) {
  *output = fd;
+   if (evsel->attr.write_backward)
+   mp->prot = PROT_READ;
+   else
+   mp->prot = PROT_READ | PROT_WRITE;
+

If evlist->overwrite is true, PROT_WRITE should be unset even if
write_backward is
not set. If you want to delay the setting of mp->prot, you need to consider
both evlist->overwrite and evsel->attr.write_backward.

I thought evsel->attr.write_backward should be set when
evlist->overwrite is set.  Do you mean the following case?

   perf record --overwrite -e 'cycles/no-overwrite/'



No. evlist->overwrite is unrelated to '--overwrite'. This is why I
said the concepts of 'overwrite' and 'backward' are ambiguous.

perf record never sets 'evlist->overwrite'. What '--overwrite' actually
does is set write_backward. Some test cases need an overwritable evlist.

Thank you.



Re: [PATCH 1/2] perf mmap: Fix perf backward recording

2017-11-01 Thread Wangnan (F)



On 2017/11/1 20:39, Jiri Olsa wrote:

On Wed, Nov 01, 2017 at 08:10:49PM +0800, Wangnan (F) wrote:

SNIP


diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index c6c891e154a6..27ebe355e794 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -838,6 +838,11 @@ static int perf_evlist__mmap_per_evsel(struct perf_evlist 
*evlist, int idx,
   if (*output == -1) {
   *output = fd;
+   if (evsel->attr.write_backward)
+   mp->prot = PROT_READ;
+   else
+   mp->prot = PROT_READ | PROT_WRITE;
+

If evlist->overwrite is true, PROT_WRITE should be unset even if
write_backward is
not set. If you want to delay the setting of mp->prot, you need to consider
both evlist->overwrite and evsel->attr.write_backward.

I thought evsel->attr.write_backward should be set when
evlist->overwrite is set.  Do you mean following case?

perf record --overwrite -e 'cycles/no-overwrite/'


No. evlist->overwrite is unrelated to '--overwrite'. This is why I
said the concept of 'overwrite' and 'backward' is ambiguous.

perf record never sets 'evlist->overwrite'. What '--overwrite' actually
does is setting write_backward. Some testcases needs overwrite evlist.

I did not check deeply, but why can't we do the below?

jirka


---
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index f4d9fc54b382..4643c487bd29 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -300,7 +300,7 @@ static int record__mmap_evlist(struct record *rec,
struct record_opts *opts = &rec->opts;
char msg[512];
  
-	if (perf_evlist__mmap_ex(evlist, opts->mmap_pages, false,

+   if (perf_evlist__mmap_ex(evlist, opts->mmap_pages, opts->overwrite,
 opts->auxtrace_mmap_pages,
 opts->auxtrace_snapshot_mode) < 0) {


perf_evlist__mmap_ex's overwrite argument is used to control evlist->mmap.

Consider Namhyung's example:

  perf record --overwrite -e 'cycles/no-overwrite/'

In this case both evlist->mmap and evlist->backward_mmap would be set to
overwrite. 'cycles' will be put into evlist->mmap, so it would be set to
overwrite incorrectly.

Patch 2/2 clarifies these concepts. What we want is the flight-recorder mode,
not just a read-only ring buffer. Even when flight-recorder mode is selected,
we can still have a normal ring buffer which keeps outputting data.

Thank you.




Re: [PATCH 2/2] perf record: Replace 'overwrite' by 'flightrecorder' for better naming

2017-11-01 Thread Wangnan (F)



On 2017/11/1 21:26, Liang, Kan wrote:

The meaning of perf record's "overwrite" option and many "overwrite" in
source code are not clear. In perf's code, the 'overwrite' has 2 meanings:
  1. Make ringbuffer readonly (perf_evlist__mmap_ex's argument).
  2. Set evsel's "backward" attribute (in apply_config_terms).

perf record doesn't use meaning 1 at all, but have a overwrite option, its
real meaning is setting backward.


I don't understand here.
'overwrite' has 2 meanings. perf record only supports 1.
It should be a bug and needs to be fixed.


Not a bug, but ambiguous.

Perf record doesn't need to overwrite the main channel (we have two channels:
evlist->mmap is the main channel and evlist->backward_mmap is the backward
channel), but some test cases require it, and your new patch set may require it.
'perf record --overwrite' doesn't make the main channel overwritable. What it
does is move all evsels to the backward channel, and we can move some evsels
back to the main channel with a /no-overwrite/ setting. This behavior is hard
to understand.

Thank you.



Re: [PATCH 2/2] perf record: Replace 'overwrite' by 'flightrecorder' for better naming

2017-11-01 Thread Wangnan (F)



On 2017/11/1 22:22, Liang, Kan wrote:

On 2017/11/1 21:26, Liang, Kan wrote:

The meaning of perf record's "overwrite" option and many "overwrite"
in source code are not clear. In perf's code, the 'overwrite' has 2 meanings:
   1. Make ringbuffer readonly (perf_evlist__mmap_ex's argument).
   2. Set evsel's "backward" attribute (in apply_config_terms).

perf record doesn't use meaning 1 at all, but have a overwrite
option, its real meaning is setting backward.


I don't understand here.
'overwrite' has 2 meanings. perf record only support 1.
It should be a bug, and need to be fixed.

Not a bug, but ambiguous.

Perf record doesn't need to overwrite the main channel (we have two channels:
evlist->mmap is the main channel and evlist->backward_mmap is the backward
channel), but some test cases require it, and your new patch set may require it.
'perf record --overwrite' doesn't make the main channel overwritable. What it
does is move all evsels to the backward channel, and we can move some evsels
back to the main channel with a /no-overwrite/ setting. This behavior is hard
to understand.


In my understanding, the 'main channel' should depend on what the user sets.
If --overwrite is applied, then evlist->backward_mmap should be the
'main channel', and evlist->overwrite should be set to true as well.


Then it introduces a main-channel switching mechanism, and we need to
check evlist->overwrite and another factor to determine which one is the
main channel. That makes things more complex.


The /no-overwrite/ setting is a per-event setting.
Only after the global setting is finished will the per-event setting be
considered. You may refer to apply_config_terms.


Yes.


Thanks,
Kan









Re: [PATCH 2/2] perf record: Replace 'overwrite' by 'flightrecorder' for better naming

2017-11-01 Thread Wangnan (F)



On 2017/11/1 23:04, Liang, Kan wrote:

On 2017/11/1 22:22, Liang, Kan wrote:

On 2017/11/1 21:26, Liang, Kan wrote:

The meaning of perf record's "overwrite" option and many "overwrite"
in source code are not clear. In perf's code, the 'overwrite' has 2

meanings:

1. Make ringbuffer readonly (perf_evlist__mmap_ex's argument).
2. Set evsel's "backward" attribute (in apply_config_terms).

perf record doesn't use meaning 1 at all, but have a overwrite
option, its real meaning is setting backward.


I don't understand here.
'overwrite' has 2 meanings. perf record only support 1.
It should be a bug, and need to be fixed.

Not a bug, but ambiguous.

Perf record doesn't need overwrite main channel (we have two channels:
evlist->mmap is main channel and evlist->backward_mmap is backward
evlist->channel),
but some testcases require it, and your new patchset may require it.
'perf record --overwrite' doesn't set main channel overwrite. What it does

is

moving all evsels to backward channel, and we can move some evsels

back to

the main channel by /no-overwrite/ setting. This behavior is hard to
understand.


As my understanding, the 'main channel' should depend on what the user sets.
If --overwrite is applied, then evlist->backward_mmap should be the
'main channel'. evlist->overwrite should be set to true as well.

Then it introduces a main channel switching mechanism, and we would need
to check evlist->overwrite plus another factor to determine which one is
the main channel. That makes things more complex.

We should check the evlist->overwrite.
Now, all perf tools force evlist->overwrite = false. I think it doesn’t make 
sense.

What is another factor?


If we support mixed channel as well as forward overwrite ring buffer,
evlist->overwrite is not enough.


I don't think it will be too complex.

In perf_evlist__mmap_ex, we just need to add a check.
If (!overwrite)
evlist->mmap = perf_evlist__alloc_mmap(evlist);
else
evlist->backward_mmap = perf_evlist__alloc_mmap(evlist);

In perf_evlist__mmap_per_evsel, we already handle per-event overwrite.
We just need to add similar code to handle per-event non-overwrite.


Then the logic becomes:

	if (check write_backward) {
		maps = evlist->backward_mmap;
		if (!maps) {
			maps = perf_evlist__alloc_mmap(...);
			if (!maps) {
				/* error processing */
			}
			evlist->backward_mmap = maps;
			if (evlist->bkw_mmap_state == BKW_MMAP_NOTREADY)
				perf_evlist__toggle_bkw_mmap(evlist, BKW_MMAP_RUNNING);
		}
	} else {
		maps = evlist->mmap;
		if (!maps) {
			maps = perf_evlist__alloc_mmap(...);
			if (!maps) {
				/* error processing */
			}
			evlist->mmap = maps;
		}
	}


Other code should already check evlist->mmap and evlist->backward_mmap,
so it probably doesn't need to change.


Thanks,
Kan







Re: [PATCH 1/2] perf mmap: Fix perf backward recording

2017-11-01 Thread Wangnan (F)



On 2017/11/1 21:57, Liang, Kan wrote:

On 2017/11/1 20:00, Namhyung Kim wrote:

On Wed, Nov 01, 2017 at 06:32:50PM +0800, Wangnan (F) wrote:

On 2017/11/1 17:49, Namhyung Kim wrote:

Hi,

On Wed, Nov 01, 2017 at 05:53:26AM +, Wang Nan wrote:

perf record backward recording doesn't work as we expected: it never
overwrites when the ring buffer is full.

Test:

(Run a busy printing python task background like this:

while True:
print 123

send SIGUSR2 to perf to capture snapshot.)

[SNIP]


Signed-off-by: Wang Nan 
---
tools/perf/util/evlist.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index c6c891e..4c5daba 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -799,22 +799,28 @@ perf_evlist__should_poll(struct perf_evlist *evlist __maybe_unused,
 }
 static int perf_evlist__mmap_per_evsel(struct perf_evlist *evlist, int idx,
-				       struct mmap_params *mp, int cpu_idx,
+				       struct mmap_params *_mp, int cpu_idx,
 				       int thread, int *_output, int *_output_backward)
 {
struct perf_evsel *evsel;
int revent;
int evlist_cpu = cpu_map__cpu(evlist->cpus, cpu_idx);
+   struct mmap_params *mp;
evlist__for_each_entry(evlist, evsel) {
struct perf_mmap *maps = evlist->mmap;
+   struct mmap_params rdonly_mp;
int *output = _output;
int fd;
int cpu;
+   mp = _mp;
if (evsel->attr.write_backward) {
output = _output_backward;
maps = evlist->backward_mmap;
+   rdonly_mp = *_mp;
+   rdonly_mp.prot &= ~PROT_WRITE;
+   mp = &rdonly_mp;
if (!maps) {
maps = perf_evlist__alloc_mmap(evlist);
--

What about this instead (not tested)?


diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index c6c891e154a6..27ebe355e794 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -838,6 +838,11 @@ static int perf_evlist__mmap_per_evsel(struct perf_evlist *evlist, int idx,
   if (*output == -1) {
   *output = fd;
+   if (evsel->attr.write_backward)
+   mp->prot = PROT_READ;
+   else
+   mp->prot = PROT_READ | PROT_WRITE;
+

If evlist->overwrite is true, PROT_WRITE should be unset even if
write_backward is
not set. If you want to delay the setting of mp->prot, you need to consider
both evlist->overwrite and evsel->attr.write_backward.

I thought evsel->attr.write_backward should be set when
evlist->overwrite is set.  Do you mean following case?

perf record --overwrite -e 'cycles/no-overwrite/'


No. evlist->overwrite is unrelated to '--overwrite'. This is why I
said the concept of 'overwrite' and 'backward' is ambiguous.


Yes, I think we should make it clear.

As we discussed previously, there are four possible combinations
to access ring buffer , 'forward non-overwrite', 'forward overwrite',
'backward non-overwrite' and 'backward overwrite'.

Actually, not all of the combinations are necessary.
- 'forward overwrite' mode brings some problems which were mentioned
   in commit ID 9ecda41acb97 ("perf/core: Add ::write_backward attribute
   to perf event").
- 'backward non-overwrite' mode is very similar to 'forward non-overwrite'.
There is no extra advantage, so keeping only one non-overwrite mode is enough.
So 'forward non-overwrite' and 'backward overwrite' are enough for all perf
tools.

Furthermore, 'forward' and 'backward' only indicate the direction of the
ring buffer. They don't impact the result or performance, so they are not
as important as the concept of overwrite/non-overwrite.

To simplify the concept, only 'non-overwrite' mode and 'overwrite' mode
should be kept. 'non-overwrite' mode indicates the forward ring buffer;
'overwrite' mode indicates the backward ring buffer.


perf record never sets 'evlist->overwrite'. What '--overwrite' actually
does is set write_backward. Some testcases need an overwrite evlist.


There are only four test cases which set overwrite: sw-clock, task-exit,
mmap-basic and backward-ring-buffer.
Only backward-ring-buffer is 'backward overwrite'.
The other three are all 'forward overwrite'. We just need to set write_backward
to convert them to 'backward overwrite'.
I think it's not hard to clean up.


If we add a new rule that overwrite ring buffers are always backward
then it is not hard to clean up. However, support for forward
overwrite ring buffers has a long history and the code was not written
by me. I'd like to check whether there is some reason to keep supporting
this configuration.

Thank you.



Re: [RFC PATCH] mm, oom_reaper: gather each vma to prevent leaking TLB entry

2017-11-06 Thread Wangnan (F)



On 2017/11/6 16:52, Michal Hocko wrote:

On Mon 06-11-17 15:04:40, Bob Liu wrote:

On Mon, Nov 6, 2017 at 11:36 AM, Wang Nan  wrote:

tlb_gather_mmu(&tlb, mm, 0, -1) means gathering the whole virtual memory
space. In this case, tlb->fullmm is true. Some archs like arm64 don't
flush the TLB when tlb->fullmm is true:

   commit 5a7862e83000 ("arm64: tlbflush: avoid flushing when fullmm == 1").


CC'ed Will Deacon.


This leaks TLB entries. For example, when the oom_reaper
selects a task and reaps its virtual memory space, another thread
in its task group may still be running on another core and access
the already-freed memory through stale TLB entries.

No threads should be running in userspace by the time the reaper gets to
unmap their address space. So the only potential case is they are
accessing the user memory from the kernel when we should fault and we
have MMF_UNSTABLE to cause a SIGBUS. So is the race you are describing
real?


This patch gathers each vma instead of gathering the full vm space, so
tlb->fullmm is not true. The behavior of the oom reaper becomes similar
to munmapping before do_exit, which should be safe for all archs.

I do not have any objections to do per vma tlb flushing because it would
free gathered pages sooner but I am not sure I see any real problem
here. Have you seen any real issues or this is more of a review driven
fix?


We saw the problem when we tried to reuse the oom reaper's code in
another situation. In our situation, we allow reaping a task
before all other tasks in its task group finish their exiting
procedure.

I'd like to know what ensures "No threads should be running in
userspace by the time the reaper"?

Thank you.


Signed-off-by: Wang Nan 
Cc: Bob Liu 
Cc: Michal Hocko 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: David Rientjes 
Cc: Ingo Molnar 
Cc: Roman Gushchin 
Cc: Konstantin Khlebnikov 
Cc: Andrea Arcangeli 
---
  mm/oom_kill.c | 7 ++++---
  1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index dee0f75..18c5b35 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -532,7 +532,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, 
struct mm_struct *mm)
  */
 set_bit(MMF_UNSTABLE, &mm->flags);

-   tlb_gather_mmu(&tlb, mm, 0, -1);
 for (vma = mm->mmap ; vma; vma = vma->vm_next) {
 if (!can_madv_dontneed_vma(vma))
 continue;
@@ -547,11 +546,13 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, 
struct mm_struct *mm)
  * we do not want to block exit_mmap by keeping mm ref
  * count elevated without a good reason.
  */
-   if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
+   if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) {
+   tlb_gather_mmu(&tlb, mm, vma->vm_start, vma->vm_end);
 unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
  NULL);
+   tlb_finish_mmu(&tlb, vma->vm_start, vma->vm_end);
+   }
 }
-   tlb_finish_mmu(&tlb, 0, -1);
 pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, 
file-rss:%lukB, shmem-rss:%lukB\n",
 task_pid_nr(tsk), tsk->comm,
 K(get_mm_counter(mm, MM_ANONPAGES)),

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majord...@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: em...@kvack.org





Re: [RFC PATCH] mm, oom_reaper: gather each vma to prevent leaking TLB entry

2017-11-06 Thread Wangnan (F)



On 2017/11/6 18:40, Michal Hocko wrote:

On Mon 06-11-17 17:59:54, Wangnan (F) wrote:


On 2017/11/6 16:52, Michal Hocko wrote:

On Mon 06-11-17 15:04:40, Bob Liu wrote:

On Mon, Nov 6, 2017 at 11:36 AM, Wang Nan  wrote:

tlb_gather_mmu(&tlb, mm, 0, -1) means gathering the whole virtual memory
space. In this case, tlb->fullmm is true. Some archs like arm64 don't
flush the TLB when tlb->fullmm is true:

commit 5a7862e83000 ("arm64: tlbflush: avoid flushing when fullmm == 1").


CC'ed Will Deacon.


This leaks TLB entries. For example, when the oom_reaper
selects a task and reaps its virtual memory space, another thread
in its task group may still be running on another core and access
the already-freed memory through stale TLB entries.

No threads should be running in userspace by the time the reaper gets to
unmap their address space. So the only potential case is they are
accessing the user memory from the kernel when we should fault and we
have MMF_UNSTABLE to cause a SIGBUS. So is the race you are describing
real?


This patch gathers each vma instead of gathering the full vm space, so
tlb->fullmm is not true. The behavior of the oom reaper becomes similar
to munmapping before do_exit, which should be safe for all archs.

I do not have any objections to do per vma tlb flushing because it would
free gathered pages sooner but I am not sure I see any real problem
here. Have you seen any real issues or this is more of a review driven
fix?

We saw the problem when we tried to reuse the oom reaper's code in
another situation. In our situation, we allow reaping a task
before all other tasks in its task group finish their exiting
procedure.

I'd like to know what ensures "No threads should be running in
userspace by the time the reaper"?

All tasks are killed by the time. So they should be taken out to the
kernel.


Sorry. I read oom_kill_process() but am still unable to understand
why all tasks are killed.

oom_kill_process() kills the victim by sending SIGKILL. It will be
broadcast to all tasks in its task group, but that is asynchronous.
In the following case, a race can happen (Thread1 is in Task1's task group):

core 1                        core 2
Thread1 running   oom_kill_process() selects Task1 as victim
  oom_kill_process() sends SIGKILL to Task1
  oom_kill_process() sends SIGKILL to Thread1
  oom_kill_process() wakes up oom reaper
  switch to oom_reaper
  __oom_reap_task_mm
  tlb_gather_mmu
  unmap_page_range, reap Task1
  tlb_finish_mmu
Write page
be kicked off from core
Receives SIGKILL

So what makes Thread1 get kicked off from core 1 before core 2
starts unmapping?

Thank you.



Re: [PATCH 06/32] perf tools: Enable passing bpf object file to --event

2015-08-28 Thread Wangnan (F)



On 2015/8/28 15:05, Wang Nan wrote:

diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
index ef5fde6..24c8b63 100644
--- a/tools/perf/builtin-trace.c
+++ b/tools/perf/builtin-trace.c
@@ -3090,6 +3090,7 @@ int cmd_trace(int argc, const char **argv, const char 
*prefix __maybe_unused)
if (trace.evlist->nr_entries > 0)
evlist__set_evsel_handler(trace.evlist, trace__event_handler);
  
+	/* trace__record calls cmd_record, which calls bpf__clear() */

if ((argc >= 1) && (strcmp(argv[0], "record") == 0))
return trace__record(&trace, argc-1, &argv[1]);
  
@@ -3100,7 +3101,8 @@ int cmd_trace(int argc, const char **argv, const char *prefix __maybe_unused)

if (!trace.trace_syscalls && !trace.trace_pgfaults &&
trace.evlist->nr_entries == 0 /* Was --events used? */) {
pr_err("Please specify something to trace.\n");
-   return -1;
+   err = -1;
+   goto out;
}
  
  	if (output_name != NULL) {

@@ -3159,5 +3161,6 @@ out_close:
if (output_name != NULL)
fclose(trace.output);
  out:
+   bpf__clear();
return err;
  }



Sorry, there is a silly mistake: I missed

#include "bpf-loader.h"

at the head of builtin-trace.c. In my default environment builtin-trace.c
is not compiled, so I only found this problem today when compiling it on
another machine. I'll fix it in my tree.


Arnaldo, since you suggested Ingo pull directly, shall I make another pull
request with the whole 32 patches to fix that line?

Thank you.



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 32/32] bpf: Introduce function for outputing data to perf event

2015-08-28 Thread Wangnan (F)



On 2015/8/29 8:45, Alexei Starovoitov wrote:

On 8/28/15 12:06 AM, Wang Nan wrote:

This patch adds a new trace event to establish infrastructure for bpf to
output data to perf. Userspace perf tools can detect and use this event
as they use the existing tracepoint events.

New bpf trace event entry in debugfs:

  /sys/kernel/debug/tracing/events/bpf/bpf_output_data

Userspace perf tools detect the new tracepoint event as:

  bpf:bpf_output_data  [Tracepoint event]

Data in ring-buffer of perf events added to this event will be polled
out, sample types and other attributes can be adjusted to those events
directly without touching the original kprobe events.


Wang,
I have 2nd thoughts on this.
I've played with it, but global bpf:bpf_output_data event is limiting.
I'd like to use this bpf_output_trace_data() helper for tcp estats
gathering, but global collector will prevent other similar bpf programs
running in parallel.


So the current model works for you, but the problem is that all output goes
into one place, which prevents similar BPF programs from running in parallel
because the receiver is unable to tell which message was generated by whom.
So actually you want a publish-subscribe model, where a subscriber gets
messages only from the publisher it is interested in. Do I understand your
problem correctly?


So as a concept I think it's very useful, but we need a way to select
which ring-buffer to output data to.
proposal A:
Can we use ftrace:instances concept and make bpf_output_trace_data()
into that particular trace_pipe ?
proposal B:
bpf_perf_event_read() model is using nice concept of an array of
perf_events. Can we perf_event_open a 'new' event that can be mmaped
in user space and bpf_output_trace_data(idx, buf, buf_size) into it.
Where 'idx' will be an index of FD from perf_even_open of such
new event?



I've also been thinking about adding an extra id parameter to
bpf_output_trace_data(), but that is for encoding the type of output data,
which is totally different from what you want.

For me, I use bpf_output_trace_data() to output information like PMU count
values. Perf is the only receiver, so a global collector is perfect. Could
you please describe your use case in more detail?

Thank you for using that feature!


Thanks!






Re: [PATCH 32/32] bpf: Introduce function for outputing data to perf event

2015-08-28 Thread Wangnan (F)



On 2015/8/29 9:34, Alexei Starovoitov wrote:

On 8/28/15 6:19 PM, Wangnan (F) wrote:
For me, I use bpf_output_trace_data() to output information like PMU count
values. Perf is the only receiver, so a global collector is perfect. Could
you please describe your use case in more detail?


there is a special receiver in user space that only wants the data from
the bpf program that it loaded. It shouldn't conflict with any other
processes. Like when it's running, I still should be able to use perf
for other performance analysis. There is no way to share single
bpf:bpf_output_data event, since these user processes are completely
independent.


I'd like to see whether it is possible to create dynamic tracepoints so
different receivers can listen on different tracepoints. For my side, maybe
I can encode format information into the new tracepoints so I don't need
those LLVM patches.

For example:

# echo 'dynamic_tracepoint:mytracepoint ' >> 
/sys/kernel/debug/tracing/dynamic_trace_events

# perf list
  ...
  dynamic_tracepoint:mytracepoint
  ...

On the perf side we can encode the creation of the dynamic tracepoint into
bpf-loader, like what we currently do for probing kprobes.

This way requires us to create a fresh new event source, in parallel with
tracepoints. I'm not sure how much work it needs. What do you think?

Thank you.



Re: [PATCH 32/32] bpf: Introduce function for outputing data to perf event

2015-08-28 Thread Wangnan (F)



On 2015/8/29 10:22, Alexei Starovoitov wrote:

On 8/28/15 7:15 PM, Wangnan (F) wrote:

I'd like to see whether it is possible to create dynamic tracepoints so
different receivers can listen on different tracepoints.


see my proposal A. I think ftrace instances might work for this.

I'm not sure about 'format' part though. Kernel side shouldn't be
aware of it. It's only the contract between bpf program and user process
that deals with it.


It is an option. Let's keep an open mind now :)

For the current patch 32/32, I think it is useful enough for some simple
cases, and we have already started using it internally. What about keeping
it as it is now and creating an independent method for your use case?

Thank you.



Re: [PATCH 32/32] bpf: Introduce function for outputing data to perf event

2015-08-28 Thread Wangnan (F)



On 2015/8/29 10:49, Alexei Starovoitov wrote:

On 8/28/15 7:36 PM, Wangnan (F) wrote:
For the current patch 32/32, I think it is useful enough for some simple
cases, and we have already started using it internally. What about keeping
it as it is now and creating an independent method for your use case?


well, though the patch is small and contained, I think we can do better
and define more generic helper. I believe Namhyung back in July had
the same concern.


OK. I'll drop this one in my next pull request.

Thank you.



Re: [GIT PULL 00/32] perf tools: filtering events using eBPF programs

2015-08-31 Thread Wangnan (F)



On 2015/8/31 21:59, Arnaldo Carvalho de Melo wrote:

Em Fri, Aug 28, 2015 at 05:25:27PM -0700, Alexei Starovoitov escreveu:

On 8/28/15 12:05 AM, Wang Nan wrote:

This time I adjusted all Cc and Link fields in each patch.
Four new patches (1,2,3,12/32) are newly introduced for fixing a bug
related to the '--filter' option. Patch 06/32 is also modified. Please keep
an eye on it.
  

Arnaldo, what is the latest news on this set?
I think you've looked at most of them over the last months and few patch
reorders were necessary. Is it all addressed ? All further work is
sadly blocked, because these core patches need to come in first.
I took another look today and to me patches 1-30 look good.

I asked Ingo if he had anything else to mention about changelog format
so that I could try pulling it directly, i.e. I need give it a last
look, and this is not a per-patchkit cost, its just fine tuning that
_should_ make processing subsequent patchkits faster, by pulling instead
of me going thru each patch.

But I disagree it "prevents further work", nobody has to wait for
everything to get upstream to do work, anyway, it will be processed.


I think Xia Kaixu's BPF read-pmu patch series is waiting for this. Please
have a look at [1]. His kernel-side patches have already been collected by
net-next, and he is waiting for the user-side update.

However, he is also waiting for net-next to be merged, and currently we are
the only user of that feature :)

[1] http://lkml.kernel.org/r/1440672142-89311-1-git-send-email-xiaka...@huawei.com



- Arnaldo





Re: [PATCH 23/31] perf tools: Introduce arch_get_reg_info() for x86

2015-08-31 Thread Wangnan (F)



On 2015/9/1 4:43, Arnaldo Carvalho de Melo wrote:

Em Sat, Aug 29, 2015 at 04:21:57AM +, Wang Nan escreveu:

From: He Kuang 

arch_get_reg_info() is a helper function which converts register name
like "%rax" to offset of a register in 'struct pt_regs', which is
required by BPF prologue generator.

Is this something like:

/* Query offset/name of register from its name/offset */
extern int regs_query_register_offset(const char *name);

in ptrace? Can't we reuse that name and even code?


Unfortunately we can't reuse its code, because pt_regs is defined
differently on the user and kernel sides.

In arch/x86/kernel/ptrace.c we have:

struct pt_regs_offset {
const char *name;
int offset;
};

#define REG_OFFSET_NAME(r) {.name = #r, .offset = offsetof(struct pt_regs, r)}

#define REG_OFFSET_END {.name = NULL, .offset = 0}

static const struct pt_regs_offset regoffset_table[] = {
#ifdef CONFIG_X86_64
...
REG_OFFSET_NAME(r15),
REG_OFFSET_NAME(r14),
REG_OFFSET_NAME(r13),
...

The definition of REG_OFFSET_NAME relies on the field name and the string
name of a register being identical. This is true for the kernel, but not
true for userspace.

For example, for x86_64, the kernel's 'struct pt_regs' is defined in
arch/x86/include/asm/ptrace.h, and the register names are
'ax, cx, dx, si, di ...'. In contrast, the userspace version is defined in
arch/x86/include/uapi/asm/ptrace.h, and the register names become
'rax, rcx, rdx, rsi, rdi ...'.

Since the logic of regs_query_register_offset() is very simple, changing
REG_OFFSET_NAME() effectively makes it a totally different function.

But yes, let's reuse its name. And it may be worth considering reusing its
code for other archs.

Thank you.


Was this that was done and only a rename was made?

- Arnaldo


This patch replaces original string table by a 'struct reg_info' table,
which records offset of registers according to its name.

For x86, since there are two sub-archs (x86_32 and x86_64) but we can
only get pt_regs for the arch we are currently on, this patch fills the
offset with '-1' for the other sub-arch. This introduces a limitation to
the perf prologue: we are unable to generate a prologue on an x86_32
compiled perf for BPF programs targeted at an x86_64 kernel. This
limitation is acceptable, because this is a very rare usecase.

Signed-off-by: Wang Nan 
Signed-off-by: He Kuang 
Cc: Alexei Starovoitov 
Cc: Brendan Gregg 
Cc: Daniel Borkmann 
Cc: David Ahern 
Cc: He Kuang 
Cc: Jiri Olsa 
Cc: Kaixu Xia 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Paul Mackerras 
Cc: Peter Zijlstra 
Cc: Zefan Li 
Cc: pi3or...@163.com
Cc: Arnaldo Carvalho de Melo 
Link: 
http://lkml.kernel.org/n/1436445342-1402-34-git-send-email-wangn...@huawei.com
---
  tools/perf/arch/x86/Makefile  |   1 +
  tools/perf/arch/x86/util/Build|   2 +
  tools/perf/arch/x86/util/dwarf-regs.c | 104 --
  3 files changed, 78 insertions(+), 29 deletions(-)

diff --git a/tools/perf/arch/x86/Makefile b/tools/perf/arch/x86/Makefile
index 21322e0..a84a6f6f 100644
--- a/tools/perf/arch/x86/Makefile
+++ b/tools/perf/arch/x86/Makefile
@@ -2,3 +2,4 @@ ifndef NO_DWARF
  PERF_HAVE_DWARF_REGS := 1
  endif
  HAVE_KVM_STAT_SUPPORT := 1
+PERF_HAVE_ARCH_GET_REG_INFO := 1
diff --git a/tools/perf/arch/x86/util/Build b/tools/perf/arch/x86/util/Build
index 2c55e1b..09429f6 100644
--- a/tools/perf/arch/x86/util/Build
+++ b/tools/perf/arch/x86/util/Build
@@ -3,6 +3,8 @@ libperf-y += tsc.o
  libperf-y += pmu.o
  libperf-y += kvm-stat.o
  
+# BPF_PROLOGUE also needs dwarf-regs.o. However, if CONFIG_BPF_PROLOGUE
+# is true, CONFIG_DWARF must be true.
  libperf-$(CONFIG_DWARF) += dwarf-regs.o
  
  libperf-$(CONFIG_LIBUNWIND)  += unwind-libunwind.o

diff --git a/tools/perf/arch/x86/util/dwarf-regs.c 
b/tools/perf/arch/x86/util/dwarf-regs.c
index be22dd4..9928caf 100644
--- a/tools/perf/arch/x86/util/dwarf-regs.c
+++ b/tools/perf/arch/x86/util/dwarf-regs.c
@@ -22,44 +22,67 @@
  
  #include 

  #include 
+#include 
+#include 
+#include  /* for offsetof */
+#include 
+
+struct reg_info {
+   const char  *name;  /* Reg string in debuginfo  */
+   int offset; /* Reg offset in struct pt_regs */
+};
  
  /*

   * Generic dwarf analysis helpers
   */
-
+/*
+ * x86_64 compiling can't access pt_regs for x86_32, so fill offset
+ * with -1.
+ */
+#ifdef __x86_64__
+# define REG_INFO(n, f) { .name = n, .offset = -1, }
+#else
+# define REG_INFO(n, f) { .name = n, .offset = offsetof(struct pt_regs, f), }
+#endif
  #define X86_32_MAX_REGS 8
-const char *x86_32_regs_table[X86_32_MAX_REGS] = {
-   "%ax",
-   "%cx",
-   "%dx",
-   "%bx",
-   "$stack", /* Stack address instead of %sp */
-   "%bp",
-   "%si",
-   "%di",
+
+struct reg_info x86_32_regs_table[X86_32_MAX_REGS] = {
+   REG_INFO("%ax", eax),
+   REG_INFO("%cx", ecx),
+   REG_INFO("%dx", edx),
+   REG_INFO("%bx", ebx),
+   REG_INFO("$stack", esp),  

Re: [PATCH 17/31] perf tests: Enforce LLVM test for BPF test

2015-08-31 Thread Wangnan (F)



On 2015/8/29 12:21, Wang Nan wrote:

diff --git a/tools/perf/tests/Build b/tools/perf/tests/Build
index c1518bd..8c98409 100644
--- a/tools/perf/tests/Build
+++ b/tools/perf/tests/Build
@@ -32,7 +32,14 @@ perf-y += sample-parsing.o
  perf-y += parse-no-sample-id-all.o
  perf-y += kmod-path.o
  perf-y += thread-map.o
-perf-y += llvm.o
+perf-y += llvm.o llvm-src.o
+
+$(OUTPUT)tests/llvm-src.c: tests/bpf-script-example.c

This rule requires a $(call rule_mkdir). Will be fixed in the next pull request.

+   $(Q)echo '#include ' > $@
+   $(Q)echo 'const char test_llvm__bpf_prog[] =' >> $@
+   $(Q)sed -e 's/"/\\"/g' -e 's/\(.*\)/"\1\\n"/g' $< >> $@
+   $(Q)echo ';' >> $@
+
  
  perf-$(CONFIG_X86) += perf-time-to-tsc.o
  






Re: [PATCH] perf tools: Fix random building error

2015-08-31 Thread Wangnan (F)



On 2015/9/1 14:24, Adrian Hunter wrote:

On 01/09/15 08:56, Wang Nan wrote:

I hit the following building error randomly:

Random presumably because there is a race to use/create the directory by
different make jobs (i.e. -j option).


Yes, I use -j 160 :)


   ...
/bin/sh: /path/to/kernel/buildperf/util/intel-pt-decoder/inat-tables.c: No such 
file or directory
   ...
   LINK /path/to/kernel/buildperf/plugin_mac80211.so
   LINK /path/to/kernel/buildperf/plugin_kmem.so
   LINK /path/to/kernel/buildperf/plugin_xen.so
   LINK /path/to/kernel/buildperf/plugin_hrtimer.so
In file included from util/intel-pt-decoder/intel-pt-insn-decoder.c:25:0:
util/intel-pt-decoder/inat.c:24:25: fatal error: inat-tables.c: No such file or 
directory
  #include "inat-tables.c"
  ^
compilation terminated.
make[4]: *** 
[/path/to/kernel/buildperf/util/intel-pt-decoder/intel-pt-insn-decoder.o] Error 
1
make[4]: *** Waiting for unfinished jobs
   LINK /path/to/kernel/buildperf/plugin_function.so

This is caused by tools/perf/util/intel-pt-decoder/Build: it tries to
generate $(OUTPUT)util/intel-pt-decoder/inat-tables.c automatically but
forgets to ensure the existence of the $(OUTPUT)util/intel-pt-decoder directory.

This patch fixes it by adding $(call rule_mkdir) like other similar rules.

Signed-off-by: Wang Nan 
Cc: Arnaldo Carvalho de Melo 
Cc: Adrian Hunter 

Looks ok to me. Jiri?


---
  tools/perf/util/intel-pt-decoder/Build | 1 +
  1 file changed, 1 insertion(+)

diff --git a/tools/perf/util/intel-pt-decoder/Build 
b/tools/perf/util/intel-pt-decoder/Build
index 240730d..2386322 100644
--- a/tools/perf/util/intel-pt-decoder/Build
+++ b/tools/perf/util/intel-pt-decoder/Build
@@ -4,6 +4,7 @@ inat_tables_script = util/intel-pt-decoder/gen-insn-attr-x86.awk
  inat_tables_maps = util/intel-pt-decoder/x86-opcode-map.txt
  
  $(OUTPUT)util/intel-pt-decoder/inat-tables.c: $(inat_tables_script) $(inat_tables_maps)

+   $(call rule_mkdir)
@$(call echo-cmd,gen)$(AWK) -f $(inat_tables_script) $(inat_tables_maps) 
> $@ || rm -f $@
  
  $(OUTPUT)util/intel-pt-decoder/intel-pt-insn-decoder.o: util/intel-pt-decoder/inat.c $(OUTPUT)util/intel-pt-decoder/inat-tables.c







Re: [PATCH 02/31] perf tools: Don't set cmdline_group_boundary if no evsel is collected

2015-09-01 Thread Wangnan (F)



On 2015/9/1 3:20, Arnaldo Carvalho de Melo wrote:

Em Sat, Aug 29, 2015 at 04:21:36AM +, Wang Nan escreveu:

If parse_events__scanner() collects no entry, perf_evlist__last(evlist)
is invalid, and the setting of cmdline_group_boundary touches invalid
memory.

It can happen in the current BPF implementation. See [1]. Although it
can be fixed, for safety reasons it would be better to introduce this
check.

Instead of checking the number of entries, check data.list instead, so we
can add a dummy evsel here.

Event parsing fixes should have Jiri Olsa on the CC list, Jiri, is this
ok?

 From what I can see it looks Ok, my question, just from looking at this
patch, is if it is valid to get to this point with an empty data.list,
i.e. was it ever possible and this is a bug irrespective of eBPF?


It should not be an existing bug in perf. There are other places that rely
on the list being non-empty, for example parse_events__set_leader().
Fortunately, it won't trigger a problem, because we don't allow a BPF
object to be wrapped with "{}" lexically ("{./aaa.o}" will be interpreted
as the file '{./aaa.o' and an extra '}').




- Arnaldo
  

[1]: 
http://lkml.kernel.org/n/1436445342-1402-19-git-send-email-wangn...@huawei.com

Signed-off-by: Wang Nan 
Cc: Alexei Starovoitov 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Zefan Li 
Cc: pi3or...@163.com
Cc: Arnaldo Carvalho de Melo 
Link: 
http://lkml.kernel.org/r/1440742821-44548-3-git-send-email-wangn...@huawei.com
---
  tools/perf/util/parse-events.c | 8 ++++++--
  1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index d826e6f..14cd7e3 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -1143,10 +1143,14 @@ int parse_events(struct perf_evlist *evlist, const char 
*str,
int entries = data.idx - evlist->nr_entries;
struct perf_evsel *last;
  
+		if (!list_empty(&data.list)) {

+   last = list_entry(data.list.prev,
+ struct perf_evsel, node);
+   last->cmdline_group_boundary = true;
+   }
+
perf_evlist__splice_list_tail(evlist, &data.list, entries);
evlist->nr_groups += data.nr_groups;
-   last = perf_evlist__last(evlist);
-   last->cmdline_group_boundary = true;
  
  		return 0;

}
--
2.1.0





Re: [PATCH 02/31] perf tools: Don't set cmdline_group_boundary if no evsel is collected

2015-09-01 Thread Wangnan (F)



On 2015/9/1 18:38, Jiri Olsa wrote:

On Mon, Aug 31, 2015 at 04:20:03PM -0300, Arnaldo Carvalho de Melo wrote:

Em Sat, Aug 29, 2015 at 04:21:36AM +, Wang Nan escreveu:

If parse_events__scanner() collects no entry, perf_evlist__last(evlist)
is invalid. Then setting cmdline_group_boundary touches invalid memory.

It could happen in the current BPF implementation. See [1]. Although it
can be fixed, for safety reasons it would be better to introduce this
check.

Instead of checking the number of entries, check data.list instead, so we
can add a dummy evsel here.

Event parsing fixes should have Jiri Olsa on the CC list, Jiri, is this
ok?

 From what I can see it looks Ok, my question, just from looking at this
patch, is if it is valid to get to this point with an empty data.list,
i.e. was it ever possible and this is a bug irrespective of eBPF?

good point, I believe it's either fail or event(s) added to the list
I haven't checked how's eBPF connected with event parsing, is there a
git tree I could check?


Please check:

https://git.kernel.org/cgit/linux/kernel/git/pi3orama/linux.git/log/?h=perf/ebpf

commit d7d91228cad0a78eae5ea9526a8a78debf3cf584
commit 2606fe61219899cb386823eddc1bc231ff5067a6

related to parsing.

Thank you.


thanks,
jirka





Re: [PATCH 23/31] perf tools: Introduce regs_query_register_offset() for x86

2015-09-01 Thread Wangnan (F)



On 2015/9/1 19:47, 平松雅巳 / HIRAMATU,MASAMI wrote:

From: Wang Nan [mailto:wangn...@huawei.com]

regs_query_register_offset() is a helper function which converts
register name like "%rax" to offset of a register in 'struct pt_regs',
which is required by BPF prologue generator. Since the function is
identical, try to reuse the code in arch/x86/kernel/ptrace.c.

Comment inside dwarf-regs.c list the differences between this
implementation and kernel code.

Hmm, this also introduces a duplication of the code...
It might be a good time to move them into arch/x86/lib/ and
reuse it directly from perf code.


So you want to move it from ./arch/x86/kernel/ptrace.c to arch/x86/lib and
let perf link against arch/x86/lib/lib.a to use it?

I think that is worth a separate effort. Currently we lack the scaffolding
to compile and link against the kernel-side library. Moreover, we should
also consider other archs. It seems not very easy.

This is not the only file which could benefit from the kernel's arch/x86/lib:
the newly introduced tools/perf/util/intel-pt-decoder/insn.c is another, and
I believe there are more. Therefore I think it should be separate work from
the perf BPF patches. So how about keeping this patch for now? Or do you have
another idea?

Thank you.


Thank you,


get_arch_regstr() switches to regoffset_table and the old string table
is dropped.

Signed-off-by: Wang Nan 
Signed-off-by: He Kuang 
Cc: Alexei Starovoitov 
Cc: Brendan Gregg 
Cc: Daniel Borkmann 
Cc: David Ahern 
Cc: He Kuang 
Cc: Jiri Olsa 
Cc: Kaixu Xia 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Paul Mackerras 
Cc: Peter Zijlstra 
Cc: Zefan Li 
Cc: pi3or...@163.com
Cc: Arnaldo Carvalho de Melo 
---
  tools/perf/arch/x86/Makefile  |   1 +
  tools/perf/arch/x86/util/Build|   1 +
  tools/perf/arch/x86/util/dwarf-regs.c | 122 --
  3 files changed, 90 insertions(+), 34 deletions(-)

diff --git a/tools/perf/arch/x86/Makefile b/tools/perf/arch/x86/Makefile
index 21322e0..09ba923 100644
--- a/tools/perf/arch/x86/Makefile
+++ b/tools/perf/arch/x86/Makefile
@@ -2,3 +2,4 @@ ifndef NO_DWARF
  PERF_HAVE_DWARF_REGS := 1
  endif
  HAVE_KVM_STAT_SUPPORT := 1
+PERF_HAVE_ARCH_REGS_QUERY_REGISTER_OFFSET := 1
diff --git a/tools/perf/arch/x86/util/Build b/tools/perf/arch/x86/util/Build
index 2c55e1b..d4d1f23 100644
--- a/tools/perf/arch/x86/util/Build
+++ b/tools/perf/arch/x86/util/Build
@@ -4,6 +4,7 @@ libperf-y += pmu.o
  libperf-y += kvm-stat.o

  libperf-$(CONFIG_DWARF) += dwarf-regs.o
+libperf-$(CONFIG_BPF_PROLOGUE) += dwarf-regs.o

  libperf-$(CONFIG_LIBUNWIND)  += unwind-libunwind.o
  libperf-$(CONFIG_LIBDW_DWARF_UNWIND) += unwind-libdw.o
diff --git a/tools/perf/arch/x86/util/dwarf-regs.c 
b/tools/perf/arch/x86/util/dwarf-regs.c
index a08de0a..de5b936 100644
--- a/tools/perf/arch/x86/util/dwarf-regs.c
+++ b/tools/perf/arch/x86/util/dwarf-regs.c
@@ -21,55 +21,109 @@
   */

  #include 
+#include  /* for EINVAL */
+#include  /* for strcmp */
+#include  /* for struct pt_regs */
+#include  /* for offsetof */
  #include 

  /*
- * Generic dwarf analysis helpers
+ * See arch/x86/kernel/ptrace.c.
+ * Different from it:
+ *
+ *  - Since struct pt_regs is defined differently for user and kernel,
+ *but we want to use 'ax, bx' instead of 'rax, rbx' (which is struct
+ *field name of user's pt_regs), we make REG_OFFSET_NAME to accept
+ *both string name and reg field name.
+ *
+ *  - Since accessing x86_32's pt_regs from x86_64 building is difficult
+ *and vice versa, we simply fill offset with -1, so
+ *get_arch_regstr() still works but regs_query_register_offset()
+ *returns error.
+ *The only inconvenience caused by it now is that we are not allowed
+ *to generate BPF prologue for a x86_64 kernel if perf is built for
+ *x86_32. This is really a rare usecase.
+ *
+ *  - Order is different from kernel's ptrace.c for get_arch_regstr(), which
+ *is defined by dwarf.
   */

-#define X86_32_MAX_REGS 8
-const char *x86_32_regs_table[X86_32_MAX_REGS] = {
-   "%ax",
-   "%cx",
-   "%dx",
-   "%bx",
-   "$stack", /* Stack address instead of %sp */
-   "%bp",
-   "%si",
-   "%di",
+struct pt_regs_offset {
+   const char *name;
+   int offset;
+};
+
+#define REG_OFFSET_END {.name = NULL, .offset = 0}
+
+#ifdef __x86_64__
+# define REG_OFFSET_NAME_64(n, r) {.name = n, .offset = offsetof(struct 
pt_regs, r)}
+# define REG_OFFSET_NAME_32(n, r) {.name = n, .offset = -1}
+#else
+# define REG_OFFSET_NAME_64(n, r) {.name = n, .offset = -1}
+# define REG_OFFSET_NAME_32(n, r) {.name = n, .offset = offsetof(struct 
pt_regs, r)}
+#endif
+
+static const struct pt_regs_offset x86_32_regoffset_table[] = {
+   REG_OFFSET_NAME_32("%ax", eax),
+   REG_OFFSET_NAME_32("%cx", ecx),
+   REG_OFFSET_NAME_32("%dx", edx),
+   REG_OFFSET_NAME_32("%bx", ebx),
+   REG_OFFSET_NAME_32("$stack",  esp),   /* Stack address instead of %sp */
+   REG_OFFSET_

Re: [RFC PATCH 1/2] perf: Add the flag sample_disable not to output data on samples

2015-10-12 Thread Wangnan (F)



On 2015/10/12 20:02, Peter Zijlstra wrote:

On Mon, Oct 12, 2015 at 09:02:42AM +, Kaixu Xia wrote:

--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -483,6 +483,8 @@ struct perf_event {
perf_overflow_handler_t overflow_handler;
void*overflow_handler_context;
  
+	atomic_t			*sample_disable;

+
  #ifdef CONFIG_EVENT_TRACING
struct trace_event_call *tp_event;
struct event_filter *filter;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index b11756f..f6ef45c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6337,6 +6337,9 @@ static int __perf_event_overflow(struct perf_event *event,
irq_work_queue(&event->pending);
}
  
+	if ((event->sample_disable) && atomic_read(event->sample_disable))

+   return ret;
+
if (event->overflow_handler)
event->overflow_handler(event, data, regs);
else

Try and guarantee sample_disable lives in the same cacheline as
overflow_handler.


Could you please explain why we need them to be in the same cacheline?

Thank you.


I think we should at the very least replace the kzalloc() currently used
with a cacheline aligned alloc, and check the structure layout to verify
these two do in fact share a cacheline.





Re: [RFC PATCH 2/2] bpf: Implement bpf_perf_event_sample_enable/disable() helpers

2015-10-12 Thread Wangnan (F)



On 2015/10/13 3:29, Alexei Starovoitov wrote:

On 10/12/15 2:02 AM, Kaixu Xia wrote:

+extern const struct bpf_func_proto bpf_perf_event_sample_enable_proto;
+extern const struct bpf_func_proto bpf_perf_event_sample_disable_proto;


externs are unnecessary. Just make them static.
Also I prefer single helper that takes a flag, so we can extend it
instead of adding func_id for every little operation.

To avoid conflicts if you touch kernel/bpf/* or bpf.h please always
base your patches of net-next.

> +atomic_set(&map->perf_sample_disable, 0);

global flag per map is no go.
events are independent and should be treated as such.



Then how do we avoid racing? For example, one core could be disabling all
events in a map while another core is enabling all of them. This race may
cause several perf events in a map to dump samples while other events do
not. To avoid such a race I think some locking must be introduced, and then
the cost is even higher.

The reason why we introduce an atomic pointer is that each operation should
control a set of events, not one event, due to the per-cpu nature of perf
events.


Thank you.


Please squash these two patches, since they're part of one logical
feature. Splitting them like this only makes review harder.






Re: [RFC PATCH 2/2] bpf: Implement bpf_perf_event_sample_enable/disable() helpers

2015-10-12 Thread Wangnan (F)



On 2015/10/13 11:39, Alexei Starovoitov wrote:

On 10/12/15 8:27 PM, Wangnan (F) wrote:
Then how do we avoid racing? For example, one core could be disabling all
events in a map while another core is enabling all of them. This race may
cause several perf events in a map to dump samples while other events do
not. To avoid such a race I think some locking must be introduced, and then
the cost is even higher.

The reason why we introduce an atomic pointer is that each operation should
control a set of events, not one event, due to the per-cpu nature of perf
events.


why 'set disable' is needed ?
the example given in cover letter shows the use case where you want
to receive samples only within sys_write() syscall.
The example makes sense, but sys_write() is running on this cpu, so just
disabling it on the current one is enough.



Our real use case is control of system-wide sampling. For example,
we need to sample all CPUs when the smartphone starts refreshing its display.
We need all CPUs because in an Android system plenty of threads
get involved in this behavior. We can't achieve this by controlling
sampling on only one CPU. This is the reason we need 'set enable'
and 'set disable'.

Thank you.



Re: [RFC PATCH 2/2] bpf: Implement bpf_perf_event_sample_enable/disable() helpers

2015-10-12 Thread Wangnan (F)



On 2015/10/13 12:16, Alexei Starovoitov wrote:

On 10/12/15 8:51 PM, Wangnan (F) wrote:

why 'set disable' is needed ?
the example given in cover letter shows the use case where you want
to receive samples only within sys_write() syscall.
The example makes sense, but sys_write() is running on this cpu, so 
just

disabling it on the current one is enough.



Our real use case is control of system-wide sampling. For example,
we need to sample all CPUs when the smartphone starts refreshing its display.
We need all CPUs because in an Android system plenty of threads
get involved in this behavior. We can't achieve this by controlling
sampling on only one CPU. This is the reason we need 'set enable'
and 'set disable'.


ok, but that use case may have different enable/disable pattern.
In sys_write example ultra-fast enable/disable is must have, since
the whole syscall is fast and overhead should be minimal.
but for display refresh? we're talking milliseconds, no?
Can you just ioctl() it from user space?
If cost of enable/disable is high or the time range between toggling is
long, then doing it from the bpf program doesn't make sense. Instead
the program can do bpf_perf_event_output() to send a notification to
user space that condition is met and the user space can ioctl() events.



OK. I think I understand your design principle that everything inside BPF
should be as fast as possible.

Making userspace control events using ioctl makes things harder. 'perf record'
itself doesn't care much about the events it receives; it only copies data to
perf.data. What we want is to use perf record simply like this:

 # perf record -e evt=cycles -e control.o/pmu=evt/ -a sleep 100

And in control.o we create uprobe points to mark the start and finish of
a frame:


 SEC("target=/a/b/c.o\nstartFrame=0x123456")
 int startFrame(void *ctx) {
   bpf_pmu_enable(pmu);
   return 1;
 }

 SEC("target=/a/b/c.o\nfinishFrame=0x234568")
 int finishFrame(void *ctx) {
   bpf_pmu_disable(pmu);
   return 1;
 }

I think it makes sense too.

I still think perf events are not necessarily independent of each other. We
already have PERF_EVENT_IOC_SET_OUTPUT, which can direct multiple events'
output through one ring buffer. In this way perf events are connected.

I think the 'set disable/enable' design in this patchset satisfies the
design goal that in a BPF program we only do simple and fast things. The
only inconvenience is that we add something to the map, which is ugly.
What about a similar implementation to PERF_EVENT_IOC_SET_OUTPUT: create a
new ioctl like PERF_EVENT_IOC_SET_ENABLER, let perf select an event as the
'enabler', and then BPF can still control one atomic variable to
enable/disable a set of events.

Thank you.



Re: [RFC PATCH 2/2] bpf: Implement bpf_perf_event_sample_enable/disable() helpers

2015-10-13 Thread Wangnan (F)



On 2015/10/13 13:15, Alexei Starovoitov wrote:

On 10/12/15 9:34 PM, Wangnan (F) wrote:



On 2015/10/13 12:16, Alexei Starovoitov wrote:

On 10/12/15 8:51 PM, Wangnan (F) wrote:

why 'set disable' is needed ?
the example given in cover letter shows the use case where you want
to receive samples only within sys_write() syscall.
The example makes sense, but sys_write() is running on this cpu, so
just
disabling it on the current one is enough.



Our real use case is control of system-wide sampling. For example,
we need to sample all CPUs when the smartphone starts refreshing its
display.
We need all CPUs because in an Android system plenty of threads
get involved in this behavior. We can't achieve this by controlling
sampling on only one CPU. This is the reason we need 'set enable'
and 'set disable'.


ok, but that use case may have different enable/disable pattern.
In sys_write example ultra-fast enable/disable is must have, since
the whole syscall is fast and overhead should be minimal.
but for display refresh? we're talking milliseconds, no?
Can you just ioctl() it from user space?
If cost of enable/disable is high or the time range between toggling is
long, then doing it from the bpf program doesn't make sense. Instead
the program can do bpf_perf_event_output() to send a notification to
user space that condition is met and the user space can ioctl() events.



OK. I think I understand your design principle that everything inside BPF
should be as fast as possible.

Making userspace control events using ioctl makes things harder. 'perf
record' itself doesn't care much about the events it receives; it only
copies data to perf.data. What we want is to use perf record simply like
this:

  # perf record -e evt=cycles -e control.o/pmu=evt/ -a sleep 100

And in control.o we create uprobe points to mark the start and finish of
a frame:

  SEC("target=/a/b/c.o\nstartFrame=0x123456")
  int startFrame(void *ctx) {
bpf_pmu_enable(pmu);
return 1;
  }

  SEC("target=/a/b/c.o\nfinishFrame=0x234568")
  int finishFrame(void *ctx) {
bpf_pmu_disable(pmu);
return 1;
  }

I think it makes sense too.


yes. that looks quite useful,
but did you consider re-entrant startFrame() ?
start << here sampling starts
  start
  finish << here all samples disabled?!
finish
and startFrame()/finishFrame() running on all cpus of that user app ?
One cpu entering into startFrame() while another cpu doing finishFrame
what behavior should be? sampling is still enabled on all cpus? or off?
Either case doesn't seem to work with simple enable/disable.
Few emails in this thread back, I mentioned inc/dec of a flag
to solve that.


Correct.




What about a similar implementation to PERF_EVENT_IOC_SET_OUTPUT: create a
new ioctl like PERF_EVENT_IOC_SET_ENABLER, let perf select an event as the
'enabler', and then BPF can still control one atomic variable to
enable/disable a set of events.


you lost me on that last sentence. How this 'enabler' will work?


Like what we did in this patchset: add an atomic flag to perf_event, and
connect all perf_events to this enabler via PERF_EVENT_IOC_SET_ENABLER.
At runtime, check the enabler's atomic flag. That way we use one atomic
variable to control a set of perf_events. Finally, create a BPF helper
function to control that atomic variable.


Also I'm still missing what's wrong with perf doing ioctl() on
events on all cpus manually when bpf program tells it to do so.
Is it speed you concerned about or extra work in perf ?



I think both speed and extra work need to be considered.

Say we use perf to enable/disable sampling. Using the above example: when
the smartphone starts refreshing the display, we write something into the
ring buffer, then display refreshing starts. We have to wait for perf to be
scheduled in, parse the events it gets (perf record doesn't do this
currently), discover the trigger event, and then enable the sampling perf
events on all CPUs. That makes trigger and action asynchronous. I'm not
sure how many ns or ms it needs, and I believe the asynchrony itself
introduces complexity, which I think should be avoided unless we can
explain the advantages it brings.

But yes, a perf-based implementation can shut down the PMU completely,
which is better than the current lightweight implementation.

In summary:

 - In the next version we will use a counter-based flag instead of the
   current 0/1 switch, to handle the reentrancy problem.

 - I think we both agree we need a lightweight solution in which we can
   enable/disable sampling at function level. This lightweight solution
   can be applied to only one perf event.

 - Our disagreement is whether to introduce a heavyweight solution based
   on perf to enable/disable a group of perf events. For me, the perf-based
   solution can shut down the PMU completely, which is good. However, it

Re: [RFC PATCH 2/2] bpf: Implement bpf_perf_event_sample_enable/disable() helpers

2015-10-13 Thread Wangnan (F)



On 2015/10/13 18:54, He Kuang wrote:

hi, Alexei


What about a similar implementation to PERF_EVENT_IOC_SET_OUTPUT: create a
new ioctl like PERF_EVENT_IOC_SET_ENABLER, let perf select an event as the
'enabler', and then BPF can still control one atomic variable to
enable/disable a set of events.


you lost me on that last sentence. How this 'enabler' will work?
Also I'm still missing what's wrong with perf doing ioctl() on
events on all cpus manually when bpf program tells it to do so.
Is it speed you concerned about or extra work in perf ?




To avoid too many wakeups, the perf ring buffer has a watermark limit that
batches events and reduces wakeups, which means the perf userspace tool
cannot receive perf events immediately.

Here's a simple demo example to prove it: 'sleep_exec' does some
writes and prints a timestamp every second, and a label is
printed when perf poll gets events.

  $ perf record -m 2 -e syscalls:sys_enter_write sleep_exec 1000
  userspace sleep time: 0 seconds
  userspace sleep time: 1 seconds
  userspace sleep time: 2 seconds
  userspace sleep time: 3 seconds
  perf record wakeup onetime 0
  userspace sleep time: 4 seconds
  userspace sleep time: 5 seconds
  userspace sleep time: 6 seconds
  userspace sleep time: 7 seconds
  perf record wakeup onetime 1
  userspace sleep time: 8 seconds
  perf record wakeup onetime 2
  ..

  $ perf record -m 1 -e syscalls:sys_enter_write sleep_exec 1000
  userspace sleep time: 0 seconds
  userspace sleep time: 1 seconds
  perf record wakeup onetime 0
  userspace sleep time: 2 seconds
  userspace sleep time: 3 seconds
  perf record wakeup onetime 1
  userspace sleep time: 4 seconds
  userspace sleep time: 5 seconds
  ..

By default, if no mmap_pages is specified, perf tools wake up only
when the target executable has finished:

  $ perf record -e syscalls:sys_enter_write sleep_exec 5
  userspace sleep time: 0 seconds
  userspace sleep time: 1 seconds
  userspace sleep time: 2 seconds
  userspace sleep time: 3 seconds
  userspace sleep time: 4 seconds
  perf record wakeup onetime 0
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.006 MB perf.data (54 samples) ]

If we want perf to react as soon as our sample event is generated,
--no-buffering should be used, but this option has a greater
impact on performance.

  $ perf record --no-buffering -e syscalls:sys_enter_write sleep_exec 
1000

  userspace sleep time: 0 seconds
  perf record wakeup onetime 0
  perf record wakeup onetime 1
  perf record wakeup onetime 2
  perf record wakeup onetime 3
  perf record wakeup onetime 4
  perf record wakeup onetime 5
  perf record wakeup onetime 6
  ..



Hi Alexei,

Based on He Kuang's test results, if we choose to use perf to control perf
events and output the trigger event through bpf_output_data, with the
default setting we have to wait several seconds until perf gets the first
trigger event if the trigger event's frequency is low. In my display
refreshing example, this causes loss of event triggering. From the user's
view, random frames would be missed.

With --no-buffering things become faster, but --no-buffering causes perf
to be scheduled in more often than normal, which conflicts with the goal
of event disabling: reducing recording overhead as much as possible.

Thank you.


Thank you






Re: [PATCH 23/31] perf tools: Introduce regs_query_register_offset() for x86

2015-09-05 Thread Wangnan (F)



On 2015/9/1 23:54, 平松雅巳 / HIRAMATU,MASAMI wrote:

From: Arnaldo Carvalho de Melo [mailto:a...@redhat.com]

Em Tue, Sep 01, 2015 at 11:47:41AM +, 平松雅巳 / HIRAMATU,MASAMI escreveu:

From: Wang Nan [mailto:wangn...@huawei.com]

regs_query_register_offset() is a helper function which converts
register name like "%rax" to offset of a register in 'struct pt_regs',
which is required by BPF prologue generator. Since the function is
identical, try to reuse the code in arch/x86/kernel/ptrace.c.

Comment inside dwarf-regs.c list the differences between this
implementation and kernel code.

Hmm, this also introduces a duplication of the code...
It might be a good time to move them into arch/x86/lib/ and
reuse it directly from perf code.

It is strange to, having tried sharing stuff directly from the kernel,
to be now in a position where I advocate against it...

Copy'n'pasting what I said in another message:

-
We would go back to sharing stuff with the kernel, but this time around
we would be using something that everybody knows is being shared, which
doesn't eliminate the possibility that at some point changes made with
the kernel in mind would break the tools/ using code.

Perhaps it is better to keep copying what we want and introduce
infrastructure to check for differences and warn us as soon as possible
so that we would do the copy, test if it doesn't break what we use, etc.

I.e. we wouldn't be putting any new burden on the "kernel people", i.e.
the burden of having to check that changes they make don't break tools/
living code, nor any out of the blue breakage on tools/ for tools/
developers to fix when changes are made on the kernel "side" -
---

The "stop sharing directly stuff with the kernel" stance was taken after
a report from Linus about breakage due to tools/ using kernel files
directly and then a change made in some RCU files broke the tools/perf/
build, even with tools/perf/ not using anything RCU related so far.

Looking at tools/perf/MANIFEST, the file used to create a detached
tarball so that perf can be built outside the kernel sources there are
still some kernel source files listed, but those probably need to be
copied too...

OK, so let's apply this.

Acked-by: Masami Hiramatsu 

And also we'll need a testcase for this.


I created a testcase for the whole BPF prologue, so I think this can be 
covered?


I'll post them by replying to this mail.

Thank you.


Thank you,


- Arnaldo


Thank you,


get_arch_regstr() switches to regoffset_table and the old string table
is dropped.

Signed-off-by: Wang Nan 
Signed-off-by: He Kuang 
Cc: Alexei Starovoitov 
Cc: Brendan Gregg 
Cc: Daniel Borkmann 
Cc: David Ahern 
Cc: He Kuang 
Cc: Jiri Olsa 
Cc: Kaixu Xia 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Paul Mackerras 
Cc: Peter Zijlstra 
Cc: Zefan Li 
Cc: pi3or...@163.com
Cc: Arnaldo Carvalho de Melo 
---
  tools/perf/arch/x86/Makefile  |   1 +
  tools/perf/arch/x86/util/Build|   1 +
  tools/perf/arch/x86/util/dwarf-regs.c | 122 --
  3 files changed, 90 insertions(+), 34 deletions(-)

diff --git a/tools/perf/arch/x86/Makefile b/tools/perf/arch/x86/Makefile
index 21322e0..09ba923 100644
--- a/tools/perf/arch/x86/Makefile
+++ b/tools/perf/arch/x86/Makefile
@@ -2,3 +2,4 @@ ifndef NO_DWARF
  PERF_HAVE_DWARF_REGS := 1
  endif
  HAVE_KVM_STAT_SUPPORT := 1
+PERF_HAVE_ARCH_REGS_QUERY_REGISTER_OFFSET := 1
diff --git a/tools/perf/arch/x86/util/Build b/tools/perf/arch/x86/util/Build
index 2c55e1b..d4d1f23 100644
--- a/tools/perf/arch/x86/util/Build
+++ b/tools/perf/arch/x86/util/Build
@@ -4,6 +4,7 @@ libperf-y += pmu.o
  libperf-y += kvm-stat.o

  libperf-$(CONFIG_DWARF) += dwarf-regs.o
+libperf-$(CONFIG_BPF_PROLOGUE) += dwarf-regs.o

  libperf-$(CONFIG_LIBUNWIND)  += unwind-libunwind.o
  libperf-$(CONFIG_LIBDW_DWARF_UNWIND) += unwind-libdw.o
diff --git a/tools/perf/arch/x86/util/dwarf-regs.c 
b/tools/perf/arch/x86/util/dwarf-regs.c
index a08de0a..de5b936 100644
--- a/tools/perf/arch/x86/util/dwarf-regs.c
+++ b/tools/perf/arch/x86/util/dwarf-regs.c
@@ -21,55 +21,109 @@
   */

  #include 
+#include  /* for EINVAL */
+#include  /* for strcmp */
+#include  /* for struct pt_regs */
+#include  /* for offsetof */
  #include 

  /*
- * Generic dwarf analysis helpers
+ * See arch/x86/kernel/ptrace.c.
+ * Different from it:
+ *
+ *  - Since struct pt_regs is defined differently for user and kernel,
+ *but we want to use 'ax, bx' instead of 'rax, rbx' (which is struct
+ *field name of user's pt_regs), we make REG_OFFSET_NAME to accept
+ *both string name and reg field name.
+ *
+ *  - Since accessing x86_32's pt_regs from x86_64 building is difficult
+ *and vice versa, we simply fill offset with -1, so
+ *get_arch_regstr() still works but regs_query_register_offset()
+ *returns error.
+ *The only inconvenience caused by it now is that we are not allowed
+ *to generate BPF prologue for a x86_64 kernel if perf is built for
+ *

Re: [PATCH v2 1/5] perf probe: Split add_perf_probe_events()

2015-09-06 Thread Wangnan (F)

Hi Namhyung,

Thanks for this patchset.

Could you please have a look at patches 5/27 and 6/27 in my newest pull
request?

These 2 patches utilize the new probing API to create probe points and
collect probe_trace_events. I'm not very sure I fully understand your
design principle, especially the cleanup part, because I can see different
functions dealing with cleanup:

cleanup_perf_probe_events
del_perf_probe_events
clear_perf_probe_event
clear_probe_trace_event

But none of them works perfectly for me.

In the bpf_prog_priv__clear() function of 6/27, I copied some code from
cleanup_perf_probe_events(), because I think when destroying BPF programs
the probe_trace_events should also be cleaned up, but we don't need to call
exit_symbol_maps() many times, because we are in 'perf record' and I'm not
sure whether other parts of perf need the symbol maps. Otherwise I think
directly calling cleanup_perf_probe_events() would be better.

You can find patch from:

http://lkml.kernel.org/n/1441523623-152703-6-git-send-email-wangn...@huawei.com

http://lkml.kernel.org/n/1441523623-152703-7-git-send-email-wangn...@huawei.com

Thank you.

On 2015/9/4 20:15, Namhyung Kim wrote:

The add_perf_probe_events() does 3 things:

  1. convert all perf events to trace events
  2. add all trace events to kernel
  3. cleanup all trace events

But sometimes we need to do something with the trace events.  So split
the function into three, so that it can access intermediate trace events
via struct __event_package if needed.

Acked-by: Masami Hiramatsu 
Signed-off-by: Namhyung Kim 
---
  tools/perf/util/probe-event.c | 39 +++
  1 file changed, 35 insertions(+), 4 deletions(-)

diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c
index eb5f18b75402..2c762f41e7a5 100644
--- a/tools/perf/util/probe-event.c
+++ b/tools/perf/util/probe-event.c
@@ -2765,9 +2765,10 @@ struct __event_package {
int ntevs;
  };
  
-int add_perf_probe_events(struct perf_probe_event *pevs, int npevs)

+static int convert_perf_probe_events(struct perf_probe_event *pevs, int npevs,
+struct __event_package **ppkgs)
  {
-   int i, j, ret;
+   int i, ret;
struct __event_package *pkgs;
  
  	ret = 0;

@@ -2792,12 +2793,21 @@ int add_perf_probe_events(struct perf_probe_event 
*pevs, int npevs)
ret  = convert_to_probe_trace_events(pkgs[i].pev,
 &pkgs[i].tevs);
if (ret < 0)
-   goto end;
+   return ret;
pkgs[i].ntevs = ret;
}
/* This just release blacklist only if allocated */
kprobe_blacklist__release();
  
+	*ppkgs = pkgs;

+
+   return 0;
+}
+
+static int apply_perf_probe_events(struct __event_package *pkgs, int npevs)
+{
+   int i, ret = 0;
+
/* Loop 2: add all events */
for (i = 0; i < npevs; i++) {
ret = __add_probe_trace_events(pkgs[i].pev, pkgs[i].tevs,
@@ -2806,7 +2816,16 @@ int add_perf_probe_events(struct perf_probe_event *pevs, 
int npevs)
if (ret < 0)
break;
}
-end:
+   return ret;
+}
+
+static void cleanup_perf_probe_events(struct __event_package *pkgs, int npevs)
+{
+   int i, j;
+
+   if (pkgs == NULL)
+   return;
+
/* Loop 3: cleanup and free trace events  */
for (i = 0; i < npevs; i++) {
for (j = 0; j < pkgs[i].ntevs; j++)
@@ -2815,6 +2834,18 @@ end:
}
free(pkgs);
exit_symbol_maps();
+}
+
+int add_perf_probe_events(struct perf_probe_event *pevs, int npevs)
+{
+   int ret;
+   struct __event_package *pkgs = NULL;
+
+   ret = convert_perf_probe_events(pevs, npevs, &pkgs);
+   if (ret == 0)
+   ret = apply_perf_probe_events(pkgs, npevs);
+
+   cleanup_perf_probe_events(pkgs, npevs);
  
  	return ret;

  }



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] perf report: Fix invalid memory accessing

2015-09-07 Thread Wangnan (F)



On 2015/9/7 21:03, Jiri Olsa wrote:

On Mon, Sep 07, 2015 at 12:51:55PM +, Wang Nan wrote:

Commit e1e499aba570a2ea84d29822b7ea637ac41d9a51 (perf tools: Add
processor socket info to hist_entry and addr_location) reads env->cpu
array for each sample using index al.cpu. However, al.cpu can be -1 if
sample doesn't select PERF_SAMPLE_CPU. Also, env->cpu can be invalid if
feature CPU_TOPOLOGY not selected. We should validate env->cpu and al.cpu
before setting al.socket.

Signed-off-by: Wang Nan 
Cc: Kan Liang 
Cc: Adrian Hunter 
Cc: Andi Kleen 
Cc: Jiri Olsa 
Cc: Namhyung Kim 
Cc: Stephane Eranian 
Cc: Arnaldo Carvalho de Melo 
---

Although theoretically CPU_TOPOLOGY feature should always be selected by
'perf record', I did generate a perf.data without that feature. It has
header like this:

  # perf report -i ./bad.perf.data  --header-only
  # 
  # captured on: Thu Jan  8 09:30:15 2009
  # hostname : localhost
  # os release : 3.10.49-gd672fc4
  # perf version : 4.2.gc9df
  # arch : aarch64
  # nrcpus online : 8
  # nrcpus avail : 8
  # total memory : 1850768 kB
  # cmdline : /system/bin/perf record -e sync:sync_timeline -e 
kgsl:kgsl_register_event -g -a sleep 5
  # event : name = sync:sync_timeline, , id = { 1107, 1108, 1109, 1110, , 
1112 }, type = 2, size = 112, config = 0x3e7, { sample_period, sample_freq } = 
1, sample_type = IP|TID|TIME|CALLCHAIN|ID|CPU|PERIOD|RAW, read_format = ID, 
disabled = 1, inherit = 1, mmap = 1, comm = 1, task = 1, sample_id_all = 1, 
exclude_guest = 1, mmap2 = 1, comm_exec = 1
  # event : name = kgsl:kgsl_register_event, , id = { 1113, 1114, 1115, 1116, 
1117, 1118 }, type = 2, size = 112, config = 0x350, { sample_period, 
sample_freq } = 1, sample_type = IP|TID|TIME|CALLCHAIN|ID|CPU|PERIOD|RAW, 
read_format = ID, disabled = 1, inherit = 1, sample_id_all = 1, exclude_guest = 
1
  # pmu mappings: cpu = 4, software = 1, tracepoint = 2
  # 
  #

It should be:

  # 
  # captured on: Thu Jan  8 11:26:41 2009
  ...
  # HEADER_CPU_TOPOLOGY info available, use -I to display
  # pmu mappings: cpu = 4, software = 1, tracepoint = 2
  # 

However, the bad perf.data appears randomly. I can't reliably reproduce it, so I
guess there might be another invalid memory access.

---
  tools/perf/builtin-report.c | 12 ++--
  1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 4b43245..16d097d 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -158,8 +158,16 @@ static int process_sample_event(struct perf_tool *tool,
return -1;
}
  
-	/* read socket id from perf.data for perf report */

-   al.socket = env->cpu[al.cpu].socket_id;
+   /*
+* read socket id from perf.data for perf report
+* al.cpu is invalid if PERF_SAMPLE_CPU is not selected by this
+* sample.
+* env->cpu is invalid if CPU_TOPOLOGY feature is not set in
+* header.
+*/
+   al.socket = -1;
+   if (env->cpu && al.cpu >= 0)
+   al.socket = env->cpu[al.cpu].socket_id;

perf_event__preprocess_sample initializes al.socket from current system


No. For 'perf report' it initializes al.cpu from sample.


Commit message of e1e499aba570a2ea84d29822b7ea637ac41d9a51:

For 'perf report', the socket id info is from perf.data.

For others, the socket id info is from current system.

And at least the check of env->cpu is essential. I'm looking into the
problem I reported.

Looks like build_cpu_topology() can fail.

Thank you.

do we want to move this over there?

also this change is just report specific, and we could need
this in at least perf top

jirka





Re: [PATCH] perf report: Fix invalid memory accessing

2015-09-07 Thread Wangnan (F)



On 2015/9/7 20:51, Wang Nan wrote:


[SNIP]


Although theoretically CPU_TOPOLOGY feature should always be selected by
'perf record', I did generate a perf.data without that feature. It has
header like this:

  # perf report -i ./bad.perf.data  --header-only
  # 
  # captured on: Thu Jan  8 09:30:15 2009
  # hostname : localhost
  # os release : 3.10.49-gd672fc4
  # perf version : 4.2.gc9df
  # arch : aarch64
  # nrcpus online : 8
  # nrcpus avail : 8
  # total memory : 1850768 kB
  # cmdline : /system/bin/perf record -e sync:sync_timeline -e 
kgsl:kgsl_register_event -g -a sleep 5
  # event : name = sync:sync_timeline, , id = { 1107, 1108, 1109, 1110, , 
1112 }, type = 2, size = 112, config = 0x3e7, { sample_period, sample_freq } = 
1, sample_type = IP|TID|TIME|CALLCHAIN|ID|CPU|PERIOD|RAW, read_format = ID, 
disabled = 1, inherit = 1, mmap = 1, comm = 1, task = 1, sample_id_all = 1, 
exclude_guest = 1, mmap2 = 1, comm_exec = 1
  # event : name = kgsl:kgsl_register_event, , id = { 1113, 1114, 1115, 1116, 
1117, 1118 }, type = 2, size = 112, config = 0x350, { sample_period, 
sample_freq } = 1, sample_type = IP|TID|TIME|CALLCHAIN|ID|CPU|PERIOD|RAW, 
read_format = ID, disabled = 1, inherit = 1, sample_id_all = 1, exclude_guest = 
1
  # pmu mappings: cpu = 4, software = 1, tracepoint = 2
  # 
  #

It should be:

  # 
  # captured on: Thu Jan  8 11:26:41 2009
  ...
  # HEADER_CPU_TOPOLOGY info available, use -I to display
  # pmu mappings: cpu = 4, software = 1, tracepoint = 2
  # 

However, the bad perf.data appears randomly. I can't reliably reproduce it, so I
guess there might be another invalid memory access.




I found the problem.

perf relies on build_cpu_topology() to fetch CPU_TOPOLOGY from sysfs. It
depends on the existence of

/sys/devices/system/cpu/cpu%d/topology/core_siblings_list

However, a CPU can be taken offline by the CPU hotplug subsystem. After
that, the directory

/sys/devices/system/cpu/cpu%d/topology is gone, which causes perf's
write_cpu_topology() --> build_cpu_topology() to fail, resulting in the
above perf.data.
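
The sysfs dependency described above can be probed directly from user space.
A minimal sketch (cpu_topology_available() is a hypothetical helper name, not
a perf API; it only checks readability of the usual sysfs path):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Return true if the sysfs topology file for @cpu is readable.
 * The file vanishes when a CPU is taken offline, which is exactly
 * the failure mode that makes build_cpu_topology() fail.
 */
static bool cpu_topology_available(int cpu)
{
	char path[128];

	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpu%d/topology/core_siblings_list",
		 cpu);
	return access(path, R_OK) == 0;
}
```

With such a check, write_cpu_topology() could skip (or report) offline CPUs
instead of silently emitting a header without the CPU_TOPOLOGY feature.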


So I think my patch is required.

Thank you.




Re: [PATCH] perf report: Fix invalid memory accessing

2015-09-08 Thread Wangnan (F)



On 2015/9/8 15:37, Jiri Olsa wrote:

On Mon, Sep 07, 2015 at 09:27:26PM +0800, Wangnan (F) wrote:

SNIP


I found the problem.

perf relies on build_cpu_topology() to fetch CPU_TOPOLOGY from sysfs. It
depends on the existence of

/sys/devices/system/cpu/cpu%d/topology/core_siblings_list

However, a CPU can be taken offline by the CPU hotplug subsystem. After
that, the directory
/sys/devices/system/cpu/cpu%d/topology is gone, which causes perf's
write_cpu_topology() --> build_cpu_topology() to fail, resulting in the
above perf.data.

So I think my patch is required.

no question there.. I just meant it should be placed in
perf_event__preprocess_sample function with the rest of
the 'al' initialization, like in the patch below?

it does not compile, because there're many places calling
it and it'd need changing all callers to pass env, which
seems to require more changes..

also I'm not sure about removing:
-   al->socket = cpu_map__get_socket_id(al->cpu);


Does any command actually need this initialized from current system?

thanks,
jirka


---
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index 0bf8c9889fc0..3339d2579bfc 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -990,7 +990,8 @@ void thread__find_addr_location(struct thread *thread,
  int perf_event__preprocess_sample(const union perf_event *event,
  struct machine *machine,
  struct addr_location *al,
- struct perf_sample *sample)
+ struct perf_sample *sample,
+ struct perf_env *env)
  {
u8 cpumode = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
struct thread *thread = machine__findnew_thread(machine, sample->pid,
@@ -1021,7 +1022,10 @@ int perf_event__preprocess_sample(const union perf_event 
*event,
  
  	al->sym = NULL;

al->cpu = sample->cpu;
-   al->socket = cpu_map__get_socket_id(al->cpu);
+
+   al->socket = -1;
+   if (env->cpu && al->cpu >= 0)
+   al->socket = env->cpu[al->cpu].socket_id;
  
  	if (al->map) {

struct dso *dso = al->map->dso;


Now I understand your suggestion. You mean we can build env->cpu before
processing the first sample, then init al.socket using that map instead
of calling cpu_map__get_socket_id() unconditionally in an ad-hoc way.

And I have another question: since build_cpu_topo() and
perf_event__preprocess_sample() are more or less doing similar things,
why do we need both of them?

Then we need more code for this bug...

Kan Liang, do you have any suggestion?

Thank you.




Re: [PATCH 09/31] perf tools: Compile scriptlets to BPF objects when passing '.c' to --event

2015-10-14 Thread Wangnan (F)



On 2015/10/14 23:45, Namhyung Kim wrote:

On Wed, Oct 14, 2015 at 12:41:20PM +, Wang Nan wrote:

This patch provides infrastructure for passing source files to --event
directly using:

  # perf record --event bpf-file.c command

This patch does the following:

  1) Allow passing a '.c' file to '--event'. parse_events_load_bpf() is
     expanded to allow the caller to tell it whether the passed file is
     a source file or an object file.

  2) llvm__compile_bpf() is called to compile the '.c' file, the result
 is saved into memory. Use bpf_object__open_buffer() to load the
 in-memory object.

Introduces a bpf-script-example.c so we can manually test it:

  # perf record --clang-opt "-DLINUX_VERSION_CODE=0x40200" --event 
./bpf-script-example.c sleep 1

Note that '--clang-opt' must be put before '--event'.

Further patches will merge it into a testcase so it can be tested automatically.

Signed-off-by: Wang Nan 
Signed-off-by: He Kuang 
Acked-by: Alexei Starovoitov 
Cc: Brendan Gregg 
Cc: Daniel Borkmann 
Cc: David Ahern 
Cc: Jiri Olsa 
Cc: Kaixu Xia 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Peter Zijlstra 
Cc: Zefan Li 
Cc: pi3or...@163.com
Signed-off-by: Arnaldo Carvalho de Melo 
Link: http://lkml.kernel.org/n/ebpf-6yw9eg0ej3l4jnqhinngk...@git.kernel.org
---
  tools/perf/tests/bpf-script-example.c | 44 +++
  tools/perf/util/bpf-loader.c  | 17 --
  tools/perf/util/bpf-loader.h  |  5 ++--
  tools/perf/util/parse-events.c|  5 ++--
  tools/perf/util/parse-events.h|  3 ++-
  tools/perf/util/parse-events.l|  3 +++
  tools/perf/util/parse-events.y| 15 ++--
  7 files changed, 83 insertions(+), 9 deletions(-)
  create mode 100644 tools/perf/tests/bpf-script-example.c

diff --git a/tools/perf/tests/bpf-script-example.c 
b/tools/perf/tests/bpf-script-example.c
new file mode 100644
index 000..410a70b
--- /dev/null
+++ b/tools/perf/tests/bpf-script-example.c

Shouldn't it be a part of the next patch?


I think putting the sample file in this patch is better.

In the commit message I show a cmdline utilizing scriptlet compiling. If
we put the sample code into the next patch, then people who get to this
point have to write their own script to test scriptlet compiling manually.

Thank you.


Thanks,
Namhyung



@@ -0,0 +1,44 @@
+#ifndef LINUX_VERSION_CODE
+# error Need LINUX_VERSION_CODE
+# error Example: for 4.2 kernel, put 'clang-opt="-DLINUX_VERSION_CODE=0x40200" 
into llvm section of ~/.perfconfig'
+#endif
+#define BPF_ANY 0
+#define BPF_MAP_TYPE_ARRAY 2
+#define BPF_FUNC_map_lookup_elem 1
+#define BPF_FUNC_map_update_elem 2
+
+static void *(*bpf_map_lookup_elem)(void *map, void *key) =
+   (void *) BPF_FUNC_map_lookup_elem;
+static void *(*bpf_map_update_elem)(void *map, void *key, void *value, int 
flags) =
+   (void *) BPF_FUNC_map_update_elem;
+
+struct bpf_map_def {
+   unsigned int type;
+   unsigned int key_size;
+   unsigned int value_size;
+   unsigned int max_entries;
+};
+
+#define SEC(NAME) __attribute__((section(NAME), used))
+struct bpf_map_def SEC("maps") flip_table = {
+   .type = BPF_MAP_TYPE_ARRAY,
+   .key_size = sizeof(int),
+   .value_size = sizeof(int),
+   .max_entries = 1,
+};
+
+SEC("func=sys_epoll_pwait")
+int bpf_func__sys_epoll_pwait(void *ctx)
+{
+   int ind = 0;
+   int *flag = bpf_map_lookup_elem(&flip_table, &ind);
+   int new_flag;
+   if (!flag)
+   return 0;
+   /* flip flag and store back */
+   new_flag = !*flag;
+   bpf_map_update_elem(&flip_table, &ind, &new_flag, BPF_ANY);
+   return new_flag;
+}
+char _license[] SEC("license") = "GPL";
+int _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/tools/perf/util/bpf-loader.c b/tools/perf/util/bpf-loader.c
index aa784a4..ba6f752 100644
--- a/tools/perf/util/bpf-loader.c
+++ b/tools/perf/util/bpf-loader.c
@@ -12,6 +12,7 @@
  #include "bpf-loader.h"
  #include "probe-event.h"
  #include "probe-finder.h" // for MAX_PROBES
+#include "llvm-utils.h"
  
  #define DEFINE_PRINT_FN(name, level) \

  static int libbpf_##name(const char *fmt, ...)\
@@ -33,7 +34,7 @@ struct bpf_prog_priv {
struct perf_probe_event pev;
  };
  
-struct bpf_object *bpf__prepare_load(const char *filename)

+struct bpf_object *bpf__prepare_load(const char *filename, bool source)
  {
struct bpf_object *obj;
static bool libbpf_initialized;
@@ -45,7 +46,19 @@ struct bpf_object *bpf__prepare_load(const char *filename)
libbpf_initialized = true;
}
  
-	obj = bpf_object__open(filename);

+   if (source) {
+   int err;
+   void *obj_buf;
+   size_t obj_buf_sz;
+
+   err = llvm__compile_bpf(filename, &obj_buf, &obj_buf_sz);
+   if (err)
+   return ERR_PTR(err);
+   obj = bpf_object__open_buffer(obj_buf, obj_buf_sz, filename);
+   

Re: [PATCH 16/31] perf tools: Add prologue for BPF programs for fetching arguments

2015-10-15 Thread Wangnan (F)



On 2015/10/15 13:26, Namhyung Kim wrote:

On Wed, Oct 14, 2015 at 12:41:27PM +, Wang Nan wrote:

From: He Kuang 

This patch generates a prologue for a BPF program which fetches arguments
for it. With this patch, the program can have arguments as follows:

  SEC("lock_page=__lock_page page->flags")
  int lock_page(struct pt_regs *ctx, int err, unsigned long flags)
  {
 return 1;
  }

This patch passes at most 3 arguments from r3, r4 and r5. r1 is still
the ctx pointer. r2 is used to indicate whether dereferencing
succeeded.

This patch uses r6 to hold ctx (struct pt_regs) and r7 to hold a stack
pointer for results. The result of each argument is first stored on the stack:

  low address
  BPF_REG_FP - 24  ARG3
  BPF_REG_FP - 16  ARG2
  BPF_REG_FP - 8   ARG1
  BPF_REG_FP
  high address

Then they are loaded into r3, r4 and r5.

The output prologue for offn(...off2(off1(reg))...) should be:

  r6 <- r1   // save ctx into a callee saved register
  r7 <- fp
  r7 <- r7 - stack_offset// pointer to result slot
  /* load r3 with the offset in pt_regs of 'reg' */
  (r7) <- r3 // make slot valid
  r3 <- r3 + off1// prepare to read unsafe pointer
  r2 <- 8
  r1 <- r7   // result put onto stack
  call probe_read   // read unsafe pointer
  jnei r0, 0, err   // error checking
  r3 <- (r7) // read result
  r3 <- r3 + off2// prepare to read unsafe pointer
  r2 <- 8
  r1 <- r7
  call probe_read
  jnei r0, 0, err
  ...
  /* load r2, r3, r4 from stack */
  goto success
err:
  r2 <- 1
  /* load r3, r4, r5 with 0 */
  goto usercode
success:
  r2 <- 0
usercode:
  r1 <- r6   // restore ctx
  // original user code

If all arguments reside in registers (dereferencing is not
required), gen_prologue_fastpath() will be used to create a
fast prologue:

  r3 <- (r1 + offset of reg1)
  r4 <- (r1 + offset of reg2)
  r5 <- (r1 + offset of reg3)
  r2 <- 0

P.S.

eBPF calling convention is defined as:

* r0        - return value from in-kernel function, and exit value
              for eBPF program
* r1 - r5   - arguments from eBPF program to in-kernel function
* r6 - r9   - callee saved registers that in-kernel function will
              preserve
* r10       - read-only frame pointer to access stack

Signed-off-by: He Kuang 
Signed-off-by: Wang Nan 
Cc: Alexei Starovoitov 
Cc: Brendan Gregg 
Cc: Daniel Borkmann 
Cc: David Ahern 
Cc: He Kuang 
Cc: Jiri Olsa 
Cc: Kaixu Xia 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Paul Mackerras 
Cc: Peter Zijlstra 
Cc: Zefan Li 
Cc: pi3or...@163.com
Cc: Arnaldo Carvalho de Melo 
Link: http://lkml.kernel.org/n/ebpf-6yw9eg0ej3l4jnqhinngk...@git.kernel.org
---

[SNIP]

+int bpf__gen_prologue(struct probe_trace_arg *args, int nargs,
+ struct bpf_insn *new_prog, size_t *new_cnt,
+ size_t cnt_space)
+{
+   struct bpf_insn *success_code = NULL;
+   struct bpf_insn *error_code = NULL;
+   struct bpf_insn *user_code = NULL;
+   struct bpf_insn_pos pos;
+   bool fastpath = true;
+   int i;
+
+   if (!new_prog || !new_cnt)
+   return -EINVAL;
+
+   pos.begin = new_prog;
+   pos.end = new_prog + cnt_space;
+   pos.pos = new_prog;
+
+   if (!nargs) {
+   ins(BPF_ALU64_IMM(BPF_MOV, BPF_PROLOGUE_FETCH_RESULT_REG, 0),
+   &pos);
+
+   if (check_pos(&pos))
+   goto errout;
+
+   *new_cnt = pos_get_cnt(&pos);
+   return 0;
+   }
+
+   if (nargs > BPF_PROLOGUE_MAX_ARGS)
+   nargs = BPF_PROLOGUE_MAX_ARGS;

Wouldn't it be better to inform the user if it ignored some arguments?


Correct. I'd like to add a warning in the next version:

diff --git a/tools/perf/util/bpf-prologue.c b/tools/perf/util/bpf-prologue.c
index e4adb18..36093d9 100644
--- a/tools/perf/util/bpf-prologue.c
+++ b/tools/perf/util/bpf-prologue.c
@@ -337,8 +337,10 @@ int bpf__gen_prologue(struct probe_trace_arg *args, 
int nargs,

return 0;
}

-   if (nargs > BPF_PROLOGUE_MAX_ARGS)
+   if (nargs > BPF_PROLOGUE_MAX_ARGS) {
+   pr_warning("bpf: prologue: too many arguments\n");
nargs = BPF_PROLOGUE_MAX_ARGS;
+   }
if (cnt_space > BPF_MAXINSNS)
cnt_space = BPF_MAXINSNS;


Thank you.



Thanks,
Namhyung





Re: [PATCH 10/31] perf test: Enforce LLVM test for BPF test

2015-10-15 Thread Wangnan (F)



On 2015/10/14 23:48, Namhyung Kim wrote:

On Wed, Oct 14, 2015 at 12:41:21PM +, Wang Nan wrote:

This patch replaces the original toy BPF program with the previously
introduced bpf-script-example.c, dynamically embedding it into 'llvm-src.c'.

The newly introduced BPF program attaches at 'sys_epoll_pwait()', and
collects half of the samples from it. perf itself never uses that syscall,
so further tests can verify their results with it.

Since a BPF program requires the LINUX_VERSION_CODE of the running kernel,
this patch computes that code from uname.
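
The computation can be sketched as follows. The helper names
(release_to_version_code(), running_kernel_version_code()) are hypothetical,
not the identifiers used in the patch; the idea is just to parse
utsname.release and pack it in the (maj << 16) | (min << 8) | patch layout
of LINUX_VERSION_CODE:

```c
#include <assert.h>
#include <stdio.h>
#include <sys/utsname.h>

/* Pack "maj.min.patch[-extra]" into the KERNEL_VERSION() layout. */
static unsigned int release_to_version_code(const char *release)
{
	unsigned int maj = 0, min = 0, patch = 0;

	/* %u stops at the first non-digit, so "-gd672fc4" suffixes are ignored */
	sscanf(release, "%u.%u.%u", &maj, &min, &patch);
	return (maj << 16) | (min << 8) | patch;
}

/* Version code of the running kernel, via uname(2). */
static unsigned int running_kernel_version_code(void)
{
	struct utsname uts;

	if (uname(&uts) < 0)
		return 0;
	return release_to_version_code(uts.release);
}
```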

Since the resulting BPF object is useful for further testcases, this patch
introduces 'prepare' and 'cleanup' methods to tests, and makes test__llvm()
create a MAP_SHARED memory array to hold the resulting object.

Signed-off-by: He Kuang 
Signed-off-by: Wang Nan 
Cc: Arnaldo Carvalho de Melo 
Cc: Alexei Starovoitov 
Cc: Brendan Gregg 
Cc: Daniel Borkmann 
Cc: David Ahern 
Cc: Jiri Olsa 
Cc: Kaixu Xia 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Peter Zijlstra 
Cc: Zefan Li 
Cc: pi3or...@163.com
Link: http://lkml.kernel.org/n/ebpf-6yw9eg0ej3l4jnqhinngk...@git.kernel.org
---

[SNIP]


+void test__llvm_prepare(void)
+{
+   p_test_llvm__bpf_result = mmap(NULL, SHARED_BUF_INIT_SIZE,
+  PROT_READ | PROT_WRITE,
+  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
+   if (!p_test_llvm__bpf_result)

It should check MAP_FAILED instead.



Fixed this way:

diff --git a/tools/perf/tests/llvm.c b/tools/perf/tests/llvm.c
index e722e8a..25ddeaf 100644
--- a/tools/perf/tests/llvm.c
+++ b/tools/perf/tests/llvm.c
@@ -199,12 +199,15 @@ void test__llvm_prepare(void)

for (i = 0; llvm_testcases[i].source; i++) {
struct test_llvm__bpf_result *result;
+   void *p;

-   result = mmap(NULL, SHARED_BUF_INIT_SIZE,
- PROT_READ | PROT_WRITE,
- MAP_SHARED | MAP_ANONYMOUS, -1, 0);
-   if (!result)
+   p = mmap(NULL, SHARED_BUF_INIT_SIZE,
+PROT_READ | PROT_WRITE,
+MAP_SHARED | MAP_ANONYMOUS, -1, 0);
+   if (p == MAP_FAILED)
return;
+
+   result = p;
memset((void *)result, '\0', SHARED_BUF_INIT_SIZE);

llvm_testcases[i].result = result;

Thank you.



Re: [RFC PATCH 0/7] perf tools: Config BPF maps through perf cmdline

2015-10-17 Thread Wangnan (F)



On 2015/10/18 4:35, Alexei Starovoitov wrote:

On 10/17/15 3:48 AM, Wang Nan wrote:

In these 7 patches:

  1. perf is able to put values into map:
   # perf record -e mybpf.c/maps.values.value=1234/ ...

  2. perf is able to control different slots in a map separately:
   # perf record -e 
mybpf.c/maps.values.value[1,4-6]=1234,maps.values.value[0,2-3]=5678/ ...


  3. The second syntax can be applied to perf event also:
   # perf record -v -a -e evt=cycles -e 
mybpf.c/maps.pmu_map.event[0]=evt/ ...


  4. Compatible with the old syntax:
   # perf record -v -a -e evt=cycles -e 
mybpf.c/maps.pmu_map.event=evt/ ...


The concept looks good and solves a real need.
No opinion on implementation.


Can I translate these words into an acked-by?

And how is your PERF_COUNT_SW_BPF_OUTPUT work going? I think, based on this
patchset, you can test it with perf now.

Thank you.
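
The bracketed index lists quoted above ('1,4-6') are easy to expand; a
hypothetical sketch of the accepted syntax (expand_index_list() is
illustrative only — the real parsing lives in tools/perf/util/parse-events.[ly]):

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Expand an index list such as "1,4-6" into @out (capacity @max).
 * Returns the number of indices written, or -1 on malformed input.
 */
static int expand_index_list(const char *s, int *out, int max)
{
	int n = 0;
	char *end;

	while (*s) {
		long lo = strtol(s, &end, 10);
		long hi = lo;

		if (end == s)		/* no digits where expected */
			return -1;
		s = end;
		if (*s == '-') {	/* range like "4-6" */
			hi = strtol(s + 1, &end, 10);
			if (end == s + 1)
				return -1;
			s = end;
		}
		for (long i = lo; i <= hi && n < max; i++)
			out[n++] = (int)i;
		if (*s == ',')
			s++;
		else if (*s)		/* trailing garbage */
			return -1;
	}
	return n;
}
```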



Re: [PATCH 03/31] perf tools: Enable passing bpf object file to --event

2015-10-20 Thread Wangnan (F)



On 2015/10/20 23:15, Arnaldo Carvalho de Melo wrote:

Em Tue, Oct 20, 2015 at 12:12:55PM -0300, Arnaldo Carvalho de Melo escreveu:

Em Wed, Oct 14, 2015 at 12:41:14PM +, Wang Nan escreveu:

By introducing new rules in tools/perf/util/parse-events.[ly], this
patch enables 'perf record --event bpf_file.o' to select events by an
eBPF object file. It calls parse_events_load_bpf() to load that file,
which uses bpf__prepare_load() and finally calls bpf_object__open() for
the object files.

After applying this patch, commands like:

  # perf record --event foo.o sleep

become possible.

So, trying the above command I get almost perfect output:

   [root@felicio ~]# perf record --event foo.o sleep
   libbpf: failed to open foo.o: No such file or directory
   event syntax error: 'foo.o'
\___ BPF object file 'foo.o' is invalid

   (add -v to see detail)
   Run 'perf list' for a list of valid events

Usage: perf record [] []
   or: perf record [] --  []

   -e, --eventevent selector. use 'perf list' to list available 
events
   [root@felicio ~]#


Good thing would be to not have any message from libbpf and the right error
message from the parser, i.e. the first three lines become these two:


   event syntax error: 'foo.o'
\___ BPF object file 'foo.o' not found

But that can be fixed up in an upcoming patch, so I am applying this one now in
my new attempt at processing this patchkit.


I think fixing that is not very hard. When designing libbpf, I considered
this problem, so libbpf's output is controlled by the caller through
'libbpf_set_print()'. What we need to do is pass different output
functions to libbpf. We can easily disable all output from libbpf if
verbose is set to 0. The only question is: do you want me to provide a new
version of the patch 'perf ebpf: Add the libbpf glue', or would you prefer
another patch to adjust the output functions?

Thank you.


Ditto for:

   [acme@felicio linux]$ perf record --event /tmp/build/perf/perf.o sleep
   libbpf: /tmp/build/perf/perf.o is not an eBPF object file
   event syntax error: '/tmp/build/perf/perf.o'
\___ BPF object file '/tmp/build/perf/perf.o' is invalid
   
   (add -v to see detail)

   Run 'perf list' for a list of valid events

Usage: perf record [] []
   or: perf record [] --  []

   -e, --eventevent selector. use 'perf list' to list available 
events
   [acme@felicio linux]$

Now trying to find a _valid_ ebpf object file to test with.

- Arnaldo





Re: [PATCH 03/31] perf tools: Enable passing bpf object file to --event

2015-10-20 Thread Wangnan (F)



On 2015/10/20 23:42, Arnaldo Carvalho de Melo wrote:

Em Tue, Oct 20, 2015 at 12:15:58PM -0300, Arnaldo Carvalho de Melo escreveu:

Em Tue, Oct 20, 2015 at 12:12:55PM -0300, Arnaldo Carvalho de Melo escreveu:

Em Wed, Oct 14, 2015 at 12:41:14PM +, Wang Nan escreveu:


Managed after running:

  perf test LLVM

copy'n'pasting the output of those "set env" lines, replacing it with
export, etc to get to:

[acme@felicio linux]$ echo '__attribute__((section("do_fork"), used)) int fork(void *ctx) {return 0;} char _license[] 
__attribute__((section("license"), used)) = "GPL";int _version __attribute__((section("version"), 
used)) = 0x40100;' | $CLANG_EXEC -D__KERNEL__ $CLANG_OPTIONS $KERNEL_INC_OPTIONS -Wno-unused-value -Wno-pointer-sign 
-working-directory $WORKING_DIR -c "$CLANG_SOURCE" -target bpf -O2 -o /tmp/foo.o
[acme@felicio linux]$ file /tmp/foo.o
/tmp/foo.o: ELF 64-bit LSB relocatable, no machine, version 1 (SYSV), not 
stripped
[acme@felicio linux]$

And finally:


[acme@felicio linux]$ file /tmp/foo.o
/tmp/foo.o: ELF 64-bit LSB relocatable, no machine, version 1 (SYSV), not 
stripped
[acme@felicio linux]$ perf record --event /tmp/foo.o sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.013 MB perf.data ]
[acme@felicio linux]$ perf evlist  -v
/tmp/foo.o: type: 1, size: 112, config: 0x9, { sample_period, sample_freq }: 
4000, sample_type: IP|TID|TIME|PERIOD, disabled: 1, inherit: 1, mmap: 1, comm: 
1, freq: 1, enable_on_exec: 1, task: 1, sample_id_all: 1, exclude_guest: 1, 
mmap2: 1, comm_exec: 1
[acme@felicio linux]$ perf evlist
/tmp/foo.o
[acme@felicio linux]$

So, type 1 is PERF_TYPE_SOFTWARE, config 0x9 is PERF_COUNT_SW_DUMMY, ok.


You won't see this dummy event after the patch 'perf tools: Collect perf_evsel
in BPF object files'. I use a dummy event here so we can test basic parsing
and loading before that enabler patch, to avoid enabling too much code in
one patch.

Thank you, and glad to see you start working on this patch set again!


And it behaves accordingly, no samples, etc.

Added the sequence of testing with a non-existing file, with a normal ELF file
and then with a valid eBPF ELF file as committer notes.

Continuing.

- Arnaldo





Re: [PATCH 04/31] perf record, bpf: Create probe points for BPF programs

2015-10-20 Thread Wangnan (F)



On 2015/10/21 3:12, Arnaldo Carvalho de Melo wrote:

Em Wed, Oct 14, 2015 at 12:41:15PM +, Wang Nan escreveu:

This patch introduces bpf__{un,}probe() functions to enable callers to
create kprobe points based on the section names of a BPF program. It parses
the section names in the program and creates corresponding 'struct
perf_probe_event' structures. The parse_perf_probe_command() function is
used to do the main parsing work. The resulting 'struct perf_probe_event'
is stored into program private data for later use.

By utilizing the new probing API, this patch creates probe points during
event parsing.

To ensure probe points are removed correctly, register an atexit hook
so that even if perf quits through exit(), bpf__clear() is still called and
the probe points are cleared. Note that bpf__clear() should be registered
before bpf__probe() is called, so a failure of bpf__probe() can still
trigger bpf__clear() to remove probe points which are already probed.

A strerror-style error reporting scaffold is created by this patch.
bpf__strerror_probe() is the first error reporting function in bpf-loader.c.

So, this one, for a non-root user gives me:

[acme@felicio linux]$ perf record --event /tmp/foo.o sleep 1
event syntax error: '/tmp/foo.o'
  \___ Invalid argument

(add -v to see detail)
Run 'perf list' for a list of valid events

  Usage: perf record [] []
 or: perf record [] --  []

 -e, --eventevent selector. use 'perf list' to list available 
events
[acme@felicio linux]$


It should be:

 # perf record -e ./test_config_map.o ls
 Failed to init vmlinux path.
 event syntax error: './test_config_map.o'
  \___ You need to be root, and 
/proc/sys/kernel/kptr_restrict should be 0



 (add -v to see detail)
 Run 'perf list' for a list of valid events

  usage: perf record [] []
 or: perf record [] --  []

 -e, --eventevent selector. use 'perf list' to list 
available events


So this is another problem related to probing. I'll have a look at your lot.

Thank you.




I.e. no libbpf error (good!), but then just an -EINVAL as the "event syntax
error", which clearly isn't a syntax error; we need to tell the user that he
or she needs special permissions for using sys_bpf() :-)

As root:

[root@felicio ~]# perf record --event /tmp/foo.o sleep
event syntax error: '/tmp/foo.o'
  \___ Invalid argument

(add -v to see detail)
Run 'perf list' for a list of valid events

  Usage: perf record [] []
 or: perf record [] --  []

 -e, --eventevent selector. use 'perf list' to list available 
events
[root@felicio ~]# ls -la /tmp/foo.o
-rw-rw-r--. 1 acme acme 824 Oct 20 12:35 /tmp/foo.o
[root@felicio ~]# file /tmp/foo.o
/tmp/foo.o: ELF 64-bit LSB relocatable, no machine, version 1 (SYSV), not 
stripped


Humm, it's something else, this is an ancient kernel, 4.2.0, probably without
eBPF support? Nope, it's there:

[root@felicio ~]# grep -i sys_bpf /proc/kallsyms
811829d0 T SyS_bpf
811829d0 T sys_bpf
[root@felicio ~]#

It's something else, we need to improve this error reporting:

[root@felicio ~]# perf record -v --event /tmp/foo.o sleep 1
libbpf: loading /tmp/foo.o
libbpf: section .strtab, size 60, link 0, flags 0, type=3
libbpf: section .text, size 0, link 0, flags 6, type=1
libbpf: section .data, size 0, link 0, flags 3, type=1
libbpf: section .bss, size 0, link 0, flags 3, type=8
libbpf: section do_fork, size 16, link 0, flags 6, type=1
libbpf: found program do_fork
libbpf: section license, size 4, link 0, flags 3, type=1
libbpf: license of /tmp/foo.o is GPL
libbpf: section version, size 4, link 0, flags 3, type=1
libbpf: kernel version of /tmp/foo.o is 40100
libbpf: section .symtab, size 96, link 1, flags 0, type=2
bpf: config program 'do_fork'
symbol:do_fork file:(null) line:0 offset:0 return:0 lazy:(null)
bpf: 'do_fork': event name is missing
event syntax error: '/tmp/foo.o'
  \___ Invalid argument

(add -v to see detail)
Run 'perf list' for a list of valid events

  Usage: perf record [] []
 or: perf record [] --  []

 -e, --eventevent selector. use 'perf list' to list available 
events
[root@felicio ~]#

[root@felicio ~]# grep do_fork /proc/kallsyms
81099ab0 T _do_fork
81ccc800 d do_fork_test
[root@felicio ~]#

$ echo '__attribute__((section("_do_fork"), used)) int fork(void *ctx) {return 0;} char _license[] 
__attribute__((section("license"), used)) = "GPL";int _version 
__attribute__((section("version"), used)) = 0x40100;' | clang -D__KERNEL__ $CLANG_OPTIONS $KERNEL_INC_OPTIONS 
-Wno-unused-value -Wno-pointer-sign -working-directory $WORKING_DIR -c - -target bpf -O2 -o /tmp/foo.o

[root@felicio ~]# perf record  -v --event /tmp/foo.o sleep 1
libbpf: loading /tmp/foo.o
libbpf: section .strtab, size 61, link 0, flags 0, type=3
libbpf: section .text, size 0, link 0, flags 6, type=1
libbpf: section .data, size 0, link 0, flags 3, type=1
libbpf: section .bss, size 

Re: [PATCH 04/31] perf record, bpf: Create probe points for BPF programs

2015-10-20 Thread Wangnan (F)



On 2015/10/21 3:12, Arnaldo Carvalho de Melo wrote:

Em Wed, Oct 14, 2015 at 12:41:15PM +, Wang Nan escreveu:

This patch introduces bpf__{un,}probe() functions to enable callers to
create kprobe points based on the section names of a BPF program. It parses
the section names in the program and creates corresponding 'struct
perf_probe_event' structures. The parse_perf_probe_command() function is
used to do the main parsing work. The resulting 'struct perf_probe_event'
is stored into program private data for further use.

By utilizing the new probing API, this patch creates probe points during
event parsing.

To ensure probe points are removed correctly, register an atexit hook
so that even if perf quits through exit(), bpf__clear() is still called and
probe points are cleared. Note that bpf__clear() should be registered before
bpf__probe() is called, so a failure of bpf__probe() can still trigger
bpf__clear() to remove probe points which are already probed.

A strerror-style error reporting scaffold is created by this patch.
bpf__strerror_probe() is the first error reporting function in bpf-loader.c.

So, this one, for a non-root user gives me:

[acme@felicio linux]$ perf record --event /tmp/foo.o sleep 1
event syntax error: '/tmp/foo.o'
  \___ Invalid argument

(add -v to see detail)
Run 'perf list' for a list of valid events

  Usage: perf record [<options>] [<command>]
     or: perf record [<options>] -- <command> [<options>]

 -e, --event <event>   event selector. use 'perf list' to list available events
[acme@felicio linux]$



I.e. no libbpf error (good!) but then, just an -EINVAL as the "event syntax
error", which clearly isn't a syntax error; we need to tell the user that he or
she needs special permissions for using sys_bpf() :-)

As root:

[root@felicio ~]# perf record --event /tmp/foo.o sleep
event syntax error: '/tmp/foo.o'
  \___ Invalid argument

(add -v to see detail)
Run 'perf list' for a list of valid events

  Usage: perf record [<options>] [<command>]
     or: perf record [<options>] -- <command> [<options>]

 -e, --event <event>   event selector. use 'perf list' to list available events
[root@felicio ~]# ls -la /tmp/foo.o
-rw-rw-r--. 1 acme acme 824 Oct 20 12:35 /tmp/foo.o
[root@felicio ~]# file /tmp/foo.o
/tmp/foo.o: ELF 64-bit LSB relocatable, no machine, version 1 (SYSV), not 
stripped


Humm, it's something else, this is an ancient kernel, 4.2.0, probably without
eBPF support? Nope, it's there:

[root@felicio ~]# grep -i sys_bpf /proc/kallsyms
811829d0 T SyS_bpf
811829d0 T sys_bpf
[root@felicio ~]#

It's something else, we need to improve this error reporting:

[root@felicio ~]# perf record -v --event /tmp/foo.o sleep 1
libbpf: loading /tmp/foo.o
libbpf: section .strtab, size 60, link 0, flags 0, type=3
libbpf: section .text, size 0, link 0, flags 6, type=1
libbpf: section .data, size 0, link 0, flags 3, type=1
libbpf: section .bss, size 0, link 0, flags 3, type=8
libbpf: section do_fork, size 16, link 0, flags 6, type=1
libbpf: found program do_fork
libbpf: section license, size 4, link 0, flags 3, type=1
libbpf: license of /tmp/foo.o is GPL
libbpf: section version, size 4, link 0, flags 3, type=1
libbpf: kernel version of /tmp/foo.o is 40100
libbpf: section .symtab, size 96, link 1, flags 0, type=2
bpf: config program 'do_fork'
symbol:do_fork file:(null) line:0 offset:0 return:0 lazy:(null)
bpf: 'do_fork': event name is missing


BPF reports the problem, but it is a little bit hard to understand...


event syntax error: '/tmp/foo.o'
  \___ Invalid argument

(add -v to see detail)
Run 'perf list' for a list of valid events

  Usage: perf record [<options>] [<command>]
     or: perf record [<options>] -- <command> [<options>]

 -e, --event <event>   event selector. use 'perf list' to list available events
[root@felicio ~]#

[root@felicio ~]# grep do_fork /proc/kallsyms
81099ab0 T _do_fork
81ccc800 d do_fork_test
[root@felicio ~]#

$ echo '__attribute__((section("_do_fork"), used)) int fork(void *ctx) {return 0;} char _license[] 
__attribute__((section("license"), used)) = "GPL";int _version 
__attribute__((section("version"), used)) = 0x40100;' | clang -D__KERNEL__ $CLANG_OPTIONS $KERNEL_INC_OPTIONS 
-Wno-unused-value -Wno-pointer-sign -working-directory $WORKING_DIR -c - -target bpf -O2 -o /tmp/foo.o


In your program you only provide "do_fork", but we need the "key=value" syntax.
"key" will become the name of the created kprobe. Please try
"__attribute__((section("func=do_fork"), used))"
instead.

I think when the event name is missing we'd better construct one for it,
like perf probe does, but then we need to deal with perf probe code again. It
may require another patch.

For this patch, I think we can assign a new errno so bpf__strerror_probe()
can give more information to let the user know whether the problem resides
in the BPF program or in the perf configuration. Do you think ENOEXEC is a
good choice?

Thank you.



[root@felicio ~]# perf record  -v --event /tmp/foo.o sleep 1
libbpf: loading /tmp/foo.o
libbpf: section .strtab, size 61, li

Re: [RFC PATCH] bpf: Add new bpf map type for timer

2015-10-21 Thread Wangnan (F)



On 2015/10/21 18:20, Ingo Molnar wrote:

* He Kuang  wrote:


ping and add a...@plumgrid.com, what's your opinion on this?

Firstly, two days isn't nearly enough for a 'review timeout', secondly, have you
seen the kbuild test reports?

Thirdly, I suspect others will do a deeper review, but even stylistically the
patch is a bit weird, for example these kinds of unstructured struct 
initializers
are annoying:


   struct bpf_map_def SEC("maps") timer_map = {
.type = BPF_MAP_TYPE_TIMER_ARRAY,
.key_size = sizeof(int),
.value_size = sizeof(unsigned long long),
.max_entries = 4,
   };
.map_alloc = fd_array_map_alloc,
.map_free = fd_array_map_free,
.map_get_next_key = array_map_get_next_key,
-   .map_lookup_elem = fd_array_map_lookup_elem,
+   .map_lookup_elem = empty_array_map_lookup_elem,
.map_update_elem = fd_array_map_update_elem,
.map_delete_elem = fd_array_map_delete_elem,
.map_fd_get_ptr = prog_fd_array_get_ptr,
@@ -312,7 +318,7 @@ static const struct bpf_map_ops perf_event_array_ops = {
.map_alloc = fd_array_map_alloc,
.map_free = perf_event_array_map_free,
.map_get_next_key = array_map_get_next_key,
-   .map_lookup_elem = fd_array_map_lookup_elem,
+   .map_lookup_elem = empty_array_map_lookup_elem,
.map_update_elem = fd_array_map_update_elem,
.map_delete_elem = fd_array_map_delete_elem,
.map_fd_get_ptr = perf_event_fd_array_get_ptr,
+static const struct bpf_map_ops timer_array_ops = {
+   .map_alloc = timer_array_map_alloc,
+   .map_free = timer_array_map_free,
+   .map_get_next_key = array_map_get_next_key,
+   .map_lookup_elem = empty_array_map_lookup_elem,
+   .map_update_elem = timer_array_map_update_elem,
+   .map_delete_elem = timer_array_map_delete_elem,
+};
+
+static struct bpf_map_type_list timer_array_type __read_mostly = {
+   .ops = &timer_array_ops,
+   .type = BPF_MAP_TYPE_TIMER_ARRAY,
+};

Please align initializations vertically, so the second column becomes readable,
patterns in them become easy to see and individual entries become easier to
compare.

See for example kernel/sched/core.c:

struct cgroup_subsys cpu_cgrp_subsys = {
 .css_alloc  = cpu_cgroup_css_alloc,
 .css_free   = cpu_cgroup_css_free,
 .css_online = cpu_cgroup_css_online,
 .css_offline= cpu_cgroup_css_offline,
 .fork   = cpu_cgroup_fork,
 .can_attach = cpu_cgroup_can_attach,
 .attach = cpu_cgroup_attach,
 .exit   = cpu_cgroup_exit,
 .legacy_cftypes = cpu_files,
 .early_init = 1,
};

That's a _lot_ more readable than:

struct cgroup_subsys cpu_cgrp_subsys = {
 .css_alloc = cpu_cgroup_css_alloc,
 .css_free = cpu_cgroup_css_free,
 .css_online = cpu_cgroup_css_online,
 .css_offline = cpu_cgroup_css_offline,
 .fork = cpu_cgroup_fork,
 .can_attach = cpu_cgroup_attach,


Here :)


 .attach = cpu_cgroup_attach,
 .exit = cpu_cgroup_exit,
 .legacy_cftypes = cpu_files,
 .early_init = 1,
};

right? For example I've hidden a small initialization bug into the second 
variant,
how much time does it take for you to notice it?

Thanks,

Ingo



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH net-next 2/3] bpf: introduce bpf_perf_event_output() helper

2015-10-21 Thread Wangnan (F)



On 2015/10/21 18:01, He Kuang wrote:

hi, Alexei

I've tested the sample in the next patch and it works well. I think more
work on the perf side needs to be done for parsing the
PERF_COUNT_SW_BPF_OUTPUT event type; are you working on that?

Thank you.


We need to add something into parse-events.y/l to let it know the new
software event. We can do this simple task. However, as I guessed before,
perf is unable to print this new event, so some work needs to be done to
let 'perf report' and 'perf script' know it.

One thing I can still remember is finding a way to inject format
information through that output so 'perf script' and 'perf data convert
--to-ctf' can decode the raw data without extra work. Do you remember that
we discussed a solution which connects debuginfo and output raw data using
a builtin function in LLVM/clang? We already have a clang/LLVM patch for
it, but on the perf side I haven't started working on it. I think we can
make perf output raw data hexadecimally at first.

I'll do the above work and put the related patches at the end of my eBPF
patchset because they shouldn't be merged until this patchset is
upstreamed.

Thank you.



On 2015/10/21 11:02, Alexei Starovoitov wrote:

This helper is used to send raw data from eBPF program into
special PERF_TYPE_SOFTWARE/PERF_COUNT_SW_BPF_OUTPUT perf_event.
User space needs to perf_event_open() it (either for one or all cpus) 
and

store FD into perf_event_array (similar to bpf_perf_event_read() helper)
before eBPF program can send data into it.

Today the programs triggered by kprobe collect the data and either store
it into the maps or print it via bpf_trace_printk(), where the latter is a
debug facility and not suitable to stream the data. This new helper replaces
such bpf_trace_printk() usage and allows programs to have a dedicated
channel into user space for post-processing of the raw data collected.

Signed-off-by: Alexei Starovoitov 
---
  include/uapi/linux/bpf.h|   11 ++
  include/uapi/linux/perf_event.h |1 +
  kernel/bpf/arraymap.c   |2 ++
  kernel/bpf/verifier.c   |3 ++-
  kernel/trace/bpf_trace.c|   46 
+++

  5 files changed, 62 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 564f1f091991..2e032426cfb7 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -287,6 +287,17 @@ enum bpf_func_id {
   * Return: realm if != 0
   */
  BPF_FUNC_get_route_realm,
+
+/**
+ * bpf_perf_event_output(ctx, map, index, data, size) - output 
perf raw sample

+ * @ctx: struct pt_regs*
+ * @map: pointer to perf_event_array map
+ * @index: index of event in the map
+ * @data: data on stack to be output as raw data
+ * @size: size of data
+ * Return: 0 on success
+ */
+BPF_FUNC_perf_event_output,
  __BPF_FUNC_MAX_ID,
  };

diff --git a/include/uapi/linux/perf_event.h 
b/include/uapi/linux/perf_event.h

index 2881145cda86..d3c417615361 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -110,6 +110,7 @@ enum perf_sw_ids {
  PERF_COUNT_SW_ALIGNMENT_FAULTS= 7,
  PERF_COUNT_SW_EMULATION_FAULTS= 8,
  PERF_COUNT_SW_DUMMY= 9,
+PERF_COUNT_SW_BPF_OUTPUT= 10,

  PERF_COUNT_SW_MAX,/* non-ABI */
  };
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index f2d9e698c753..e3cfe46b074f 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -295,6 +295,8 @@ static void *perf_event_fd_array_get_ptr(struct 
bpf_map *map, int fd)

  return (void *)attr;

  if (attr->type != PERF_TYPE_RAW &&
+!(attr->type == PERF_TYPE_SOFTWARE &&
+  attr->config == PERF_COUNT_SW_BPF_OUTPUT) &&
  attr->type != PERF_TYPE_HARDWARE) {
  perf_event_release_kernel(event);
  return ERR_PTR(-EINVAL);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 1d6b97be79e1..b56cf51f8d42 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -245,6 +245,7 @@ static const struct {
  } func_limit[] = {
  {BPF_MAP_TYPE_PROG_ARRAY, BPF_FUNC_tail_call},
  {BPF_MAP_TYPE_PERF_EVENT_ARRAY, BPF_FUNC_perf_event_read},
+{BPF_MAP_TYPE_PERF_EVENT_ARRAY, BPF_FUNC_perf_event_output},
  };

  static void print_verifier_state(struct verifier_env *env)
@@ -910,7 +911,7 @@ static int check_map_func_compatibility(struct 
bpf_map *map, int func_id)

   * don't allow any other map type to be passed into
   * the special func;
   */
-if (bool_map != bool_func)
+if (bool_func && bool_map != bool_func)
  return -EINVAL;
  }

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 0fe96c7c8803..47febbe7998e 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -215,6 +215,50 @@ const struct bpf_func_proto 
bpf_perf_event_read_prot

Re: [PATCH V5 1/1] bpf: control events stored in PERF_EVENT_ARRAY maps trace data output when perf sampling

2015-10-21 Thread Wangnan (F)



On 2015/10/21 17:12, Peter Zijlstra wrote:

On Tue, Oct 20, 2015 at 03:53:02PM -0700, Alexei Starovoitov wrote:

On 10/20/15 12:22 AM, Kaixu Xia wrote:

diff --git a/kernel/events/core.c b/kernel/events/core.c
index b11756f..5219635 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6337,6 +6337,9 @@ static int __perf_event_overflow(struct perf_event *event,
irq_work_queue(&event->pending);
}

+   if (unlikely(!atomic_read(&event->soft_enable)))
+   return 0;
+
if (event->overflow_handler)
event->overflow_handler(event, data, regs);
else

Peter,
does this part look right or it should be moved right after
if (unlikely(!is_sampling_event(event)))
 return 0;
or even to other function?

It feels to me that it should be moved, since we probably don't
want to activate throttling, period adjustment and event_limit for events
that are in the soft_disabled state.

Depends on what its meant to do. As long as you let the interrupt
happen, I think we should in fact do those things (maybe not the
event_limit), but period adjustment and esp. throttling are important
when the event is enabled.

If you want to actually disable the event: pmu->stop() will make it
stop, and you can restart using pmu->start().


I also prefer totally disabling the event because our goal is to reduce
sampling overhead as much as possible. However, events in perf are
CPU-bound; one event on the perf cmdline becomes multiple 'perf_event'
structures in the kernel on a multi-core system. Disabling/enabling events
on all CPUs from a BPF program is a hard task due to racing, NMI, ...

Think about an example scenario: we want to sample cycles system-wide to
see what the whole system does while a smart phone is refreshing its
display, and don't want other samples while the display is frozen. We probe
at the entry and exit points of Display.refresh() (a fictional user
function), then let two BPF programs enable 'cycles' sampling at the entry
point and disable it at the exit point.

In this task, we need to start all 'cycles' perf_events when the display
starts refreshing, and disable all of those events when refreshing is
finished. Enabling the event only on the core which executes the entry
point of Display.refresh() is not enough because the real workers are
running on other cores; we need them to do the computation cooperatively.
Also, the scheduler may schedule the exit point of Display.refresh()
on another core, so we can't simply disable the perf_event on that core and
let other cores keep sampling after refreshing finishes.

I have thought of a method which can disable sampling in a safe way:
we can call pmu->stop() inside the PMU IRQ handler, so we can ensure
that pmu->stop() is always called by the core its event resides on.
However, I don't know how to safely re-enable them. Maybe it needs
something in the scheduler?

Thank you.


And I suppose you can wrap that with a counter if you need nesting.

I'm not sure if any of that is a viable solution, because the patch
description is somewhat short on the problem statement.

As is, I'm not too charmed with the patch, but lacking a better
understanding of what exactly we're trying to achieve I'm struggling
with proposing alternatives.





Re: [PATCH V5 1/1] bpf: control events stored in PERF_EVENT_ARRAY maps trace data output when perf sampling

2015-10-21 Thread Wangnan (F)



On 2015/10/21 19:33, Peter Zijlstra wrote:

On Wed, Oct 21, 2015 at 06:31:04PM +0800, xiakaixu wrote:


The RFC patch set contains the necessary commit log [1].

That's of course the wrong place, this should be in the patch's
Changelog. It doesn't become less relevant.


In some scenarios we don't want to output trace data when perf sampling
in order to reduce overhead. For example, perf can be run as daemon to
dump trace data when necessary, such as the system performance goes down.
Just like the example given in the cover letter, we only receive the
samples within sys_write() syscall.

The helper bpf_perf_event_control() in this patch set can control the
data output process and get the samples we are most interested in.
The cpu_function_call is probably too much to do from a BPF program, so
I chose the current design, which is like 'soft_disable'.

So, IIRC, we already require eBPF perf events to be CPU-local, which
obviates the entire need for IPIs.


But soft-disable/enable doesn't require an IPI because it is only
a memory store operation.


So calling pmu->stop() seems entirely possible (its even NMI safe).


But we need to turn off sampling across CPUs. Please have a look
at my other email.


This, however, does not explain if you need nesting, your patch seemed
to have a counter, which suggest you do.

To avoid racing.

If our task is sampling cycle events while a function is running,
and two cores' executions of that function overlap:

Time:   ...A
Core 0: sys_write\
  \
   \
Core 1: sys_write%return
Core 2: sys_write

Then without a counter, at time A it is highly possible that the
BPF programs on core 1 and core 2 conflict with each other.
The final result is that some of those events are turned on
and others turned off. Using an atomic counter can avoid this
problem.

Thank you.



In any case, you could add perf_event_{stop,start}_local() to mirror the
existing perf_event_read_local(), no? That would stop the entire thing
and reduce even more overhead than simply skipping the overflow handler.






Re: [PATCH 13/16] perf callchain: Switch default to 'graph,0.5,caller'

2015-10-21 Thread Wangnan (F)



On 2015/10/21 16:09, Namhyung Kim wrote:

Hi Frederic,

On Tue, Oct 20, 2015 at 07:21:16PM +0200, Frederic Weisbecker wrote:

On Tue, Oct 20, 2015 at 10:06:51AM -0300, Arnaldo Carvalho de Melo wrote:

Em Tue, Oct 20, 2015 at 02:19:50PM +0200, Frederic Weisbecker escreveu:

On Tue, Oct 20, 2015 at 09:00:34AM -0300, Arnaldo Carvalho de Melo wrote:

Em Mon, Oct 19, 2015 at 05:16:53PM -0700, Brendan Gregg escreveu:
So are you advocating different defaults, one for --stdio (callee),
another for --tui, --gtk (caller)?
This is all configurable via ~/.perfconfig :-\
Indeed, finding a default that is deemed adequate for most people is,
ho-hum, difficult 8-)
  

Most uses I've seen on LKML by the past involved callee because people
mostly look at the precise point where a performance issue is.

A good chunk of that was because that was the default?

I doubt it. When you need to find the culprit of a syscall or IRQ performance
issue, you don't care much to see __libc_start_main() / main() on the top of
your callchain.
  

IMHO changing that order is not a good idea. Unless many users complained
about it.

Perhaps there are not that many users of callchains because the default
is not what they're used to see?

Motivation for the change came from a video from Chandler, that
resurfaced the callchain default issue, Chandler?

Anecdotally, he tweeted about it and people seemed to like it.

Well, I would prefer to hear from regular users than random twitter followers.
I could be wrong so lets ask some users first.

Just a question.  Do you often use --children and/or '--g caller' options?


For me, I always use --no-children. However, I think it is because
I have gotten used to --no-children and no one taught me how to utilize the
additional information --children provides. In cases where the result of
--no-children is hard to explain, I use Brendan's flame graph tool.

Thank you.


I guess that for most kernel developers, --children is not that useful
as you said.  But I think it can be useful for many userspace
developers and with '-g caller' it can be even more useful. :)

When '-g caller' is used, the callchains shown in a (self) entry are
less important IMHO.  However callchains in entries generated by
--children will show which functions are called by the entry (since
it's reversed!) and will be more important.

Thanks,
Namhyung





Re: [PATCH V5 1/1] bpf: control events stored in PERF_EVENT_ARRAY maps trace data output when perf sampling

2015-10-21 Thread Wangnan (F)



On 2015/10/21 19:56, Peter Zijlstra wrote:

On Wed, Oct 21, 2015 at 07:34:28PM +0800, Wangnan (F) wrote:

If you want to actually disable the event: pmu->stop() will make it
stop, and you can restart using pmu->start().

I also prefer totally disabling the event because our goal is to reduce
sampling overhead as much as possible. However, events in perf are
CPU-bound; one event on the perf cmdline becomes multiple 'perf_event'
structures in the kernel on a multi-core system. Disabling/enabling events
on all CPUs from a BPF program is a hard task due to racing, NMI, ...

But eBPF perf events must already be local afaik. Look at the
constraints perf_event_read_local() places on the events.


I think soft disabling/enabling is free of this constraint,
because technically speaking a soft-disabled perf event is still
running.

What we want to disable is only the sampling action, to avoid
being overwhelmed by samples and to reduce the overhead of
outputting unneeded sampling data to perf.data. I don't care
too much whether the PMU counter is stopped or still running.
Even if it still generates interrupts I think it should be
acceptable because interrupt handling can be fast.

Thank you.



Re: [PATCH 26/31] perf tools: Support perf event alias name

2015-10-21 Thread Wangnan (F)



On 2015/10/21 16:53, Namhyung Kim wrote:

Hi,

On Wed, Oct 14, 2015 at 12:41:37PM +, Wang Nan wrote:

From: He Kuang 

This patch adds new bison rules for specifying an alias name to a perf
event, which allows cmdline refer to previous defined perf event through
its name. With this patch user can give alias name to a perf event using
following cmdline:

  # perf record -e mypmu=cycles ...

To allow parser refer to existing event selecter, pass event list to
'struct parse_events_evlist'. perf_evlist__find_evsel_by_alias() is
introduced to get evsel through its alias.

What about using the event name directly?  I guess the alias name is used
only to refer to an event, so it'd be better to use the event name.
Anyway we need an alias as well when the event has no name or the name is
complex.


It is possible to trigger two perf events with the same PMU but
different config options:

 # perf record -e cycles/period=/ -e cycles/period=9/ -a sleep 1

In this case the names of the events are:

cycles/period=/
cycles/period=9/

Using them in the perf cmdline is painful:

 # perf record -e cycles/period=/ -e cycles/period=9/ -e bpf_program.c/myevent=cycles/period=//...


Thank you.


Thanks
Namhyung



Signed-off-by: He Kuang 
Signed-off-by: Wang Nan 
Cc: Arnaldo Carvalho de Melo 
Cc: Alexei Starovoitov 
Cc: Brendan Gregg 
Cc: Daniel Borkmann 
Cc: David Ahern 
Cc: He Kuang 
Cc: Jiri Olsa 
Cc: Kaixu Xia 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Paul Mackerras 
Cc: Peter Zijlstra 
Cc: Zefan Li 
Cc: pi3or...@163.com
Link: http://lkml.kernel.org/n/ebpf-7w1s62o0s6ovqlaqwrmx2...@git.kernel.org
---
  tools/perf/util/evlist.c   | 16 
  tools/perf/util/evlist.h   |  4 
  tools/perf/util/evsel.h|  1 +
  tools/perf/util/parse-events.c | 31 ---
  tools/perf/util/parse-events.h |  5 +
  tools/perf/util/parse-events.y | 15 ++-
  6 files changed, 68 insertions(+), 4 deletions(-)

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index d139219..8dd59aa 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -1753,3 +1753,19 @@ void perf_evlist__set_tracking_event(struct perf_evlist 
*evlist,
  
  	tracking_evsel->tracking = true;

  }
+
+struct perf_evsel *
+perf_evlist__find_evsel_by_alias(struct perf_evlist *evlist,
+const char *alias)
+{
+   struct perf_evsel *evsel;
+
+   evlist__for_each(evlist, evsel) {
+   if (!evsel->alias)
+   continue;
+   if (strcmp(alias, evsel->alias) == 0)
+   return evsel;
+   }
+
+   return NULL;
+}
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index a459fe7..4e25342 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -292,4 +292,8 @@ void perf_evlist__set_tracking_event(struct perf_evlist 
*evlist,
 struct perf_evsel *tracking_evsel);
  
  void perf_event_attr__set_max_precise_ip(struct perf_event_attr *attr);

+
+struct perf_evsel *
+perf_evlist__find_evsel_by_alias(struct perf_evlist *evlist, const char 
*alias);
+
  #endif /* __PERF_EVLIST_H */
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index a60b5d5..9a95e73 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -87,6 +87,7 @@ struct perf_evsel {
int idx;
u32 ids;
char*name;
+   char*alias;
double  scale;
const char  *unit;
struct event_format *tp_format;
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 4849dbd..06ba5a6 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -1020,6 +1020,30 @@ int parse_events__modifier_group(struct list_head *list,
return parse_events__modifier_event(list, event_mod, true);
  }
  
+int parse_events__set_event_alias(struct parse_events_evlist *data,

+ struct list_head *list,
+ const char *str,
+ void *loc_alias_)
+{
+   struct perf_evsel *evsel;
+   YYLTYPE *loc_alias = loc_alias_;
+
+   if (!str)
+   return 0;
+
+   if (!list_is_singular(list)) {
+   struct parse_events_error *err = data->error;
+
+   err->idx = loc_alias->first_column;
+   err->str = strdup("One alias can be applied to one event only");
+   return -EINVAL;
+   }
+
+   evsel = list_first_entry(list, struct perf_evsel, node);
+   evsel->alias = strdup(str);
+   return evsel->alias ? 0 : -ENOMEM;
+}
+
  void parse_events__set_leader(char *name, struct list_head *list)
  {
struct perf_evsel *leader;
@@ -1373,9 +1397,10 @@ int parse_events(struct perf_evlist *evlist, const char 
*str,
 

Re: [PATCH V5 1/1] bpf: control events stored in PERF_EVENT_ARRAY maps trace data output when perf sampling

2015-10-21 Thread Wangnan (F)



On 2015/10/21 20:17, Peter Zijlstra wrote:

On Wed, Oct 21, 2015 at 07:49:34PM +0800, Wangnan (F) wrote:

If our task is sampling cycle events while a function is running,
and two cores' executions of that function overlap:

Time:   ...A
Core 0: sys_write\
   \
\
Core 1: sys_write%return
Core 2: sys_write

Then without a counter, at time A it is highly possible that the
BPF programs on core 1 and core 2 conflict with each other.
The final result is that some of those events are turned on
and others turned off. Using an atomic counter can avoid this
problem.

But but, how and why can an eBPF program access a !local event? I
thought we had hard restrictions on that.


How can an eBPF program access a !local event:

When creating a perf event array we don't care which perf event
is for which CPU, so a BPF program can access any perf event in
that array, which is straightforward. And in soft
disabling/enabling, what needs to be done is an atomic
operation on something in the 'perf_event' structure, which is safe
enough.

Why do we need an eBPF program to access a !local event:

I think I have explained why we need an eBPF program to access
a !local event. In summary, without this ability we can't
speak the user's language, because users focus on higher-level
concepts (display refreshing, application booting, http
processing...) and 'on which CPU' is not in their dictionaries most
of the time. Without cross-core soft-enabling/disabling it is hard
to translate requirements like "start sampling when display refreshing
begins" and "stop sampling when the application has booted" into eBPF
programs and a perf cmdline. Don't you think it is useful for reducing
sampling data and needs to be considered?

One alternative solution I can imagine is to attach a BPF program
at sampling time, like a kprobe, and return 0 if we don't want the
sample to be taken. Thoughts? Frankly speaking I don't like it very much
because the principle of soft-disable is much simpler and safer, but
if you really like it I think we can try.

Do you think soft-disabling/enabling perf events on other cores causes
any real problem?

Thank you.



Re: [PATCH V5 1/1] bpf: control events stored in PERF_EVENT_ARRAY maps trace data output when perf sampling

2015-10-21 Thread Wangnan (F)

Hi Alexei,

On 2015/10/21 21:42, Wangnan (F) wrote:



One alternative solution I can imagine is to attach a BPF program
at sampling time, like a kprobe, and return 0 if we don't want the
sample to be taken. Thoughts?


Do you think attaching BPF programs to sampling is an acceptable idea?

Thank you.


Frankly speaking I don't like it very much
because the principle of soft-disable is much simpler and safer, but
if you really like it I think we can try.





Re: [PATCH V5 1/1] bpf: control events stored in PERF_EVENT_ARRAY maps trace data output when perf sampling

2015-10-21 Thread Wangnan (F)



On 2015/10/22 0:57, Peter Zijlstra wrote:

On Wed, Oct 21, 2015 at 11:06:47PM +0800, pi3orama wrote:

So explain; how does this eBPF stuff work.

I think I get your point this time, and let me explain the eBPF stuff to you.

You are worried that a BPF programmer could break the system in this way:

A=get_non_local_perf_event()
perf_event_read_local(A)
BOOM!

However, the above sequence is impossible, because BPF programs can't work
this way.

First of all, it is impossible for a BPF program to directly invoke a
kernel function. Unlike kernel modules, BPF programs can only
invoke functions designed for them, like the ones this patch adds. So the
abilities of BPF programs are strictly restricted by the kernel. If we don't
allow BPF programs to call perf_event_read_local() across cores, we can
check this and return an error from the functions we provide for them.

Second: there's no way for a BPF program to directly access a perf event.
All perf events have to be wrapped in a map and accessed through the BPF
functions described above. We don't allow BPF programs to fetch array
elements from that map, so perf event pointers are safely protected
from BPF programs.

In summary, your either-or logic doesn't hold in the BPF world. A BPF
program can only access perf events in a highly restricted way. We
don't allow it to call perf_event_read_local() across cores, so it
can't.

Urgh, that's still horridly inconsistent. Can we please come up with a
consistent interface to perf?


BPF programs and kernel modules are two different worlds, as I said before.

I don't think making them share a common interface is a good idea,
because such sharing would give BPF programs more freedom than they really
need, and then it would be hard to prevent them from doing something bad.
If we really need a kernel interface, I think what we need is a kernel
module, not a BPF program.

Thank you.



Re: [PATCH V5 1/1] bpf: control events stored in PERF_EVENT_ARRAY maps trace data output when perf sampling

2015-10-21 Thread Wangnan (F)



On 2015/10/22 11:09, Alexei Starovoitov wrote:

On 10/21/15 6:56 PM, Wangnan (F) wrote:

One alternative solution I can imagine is to attach a BPF program
at sampling time, like a kprobe, and return 0 if we don't want the
sample to be taken. Thoughts?


Do you think attaching BPF programs to sampling is an acceptable idea?


If you mean to extend 'filter' concept to sampling events?
So instead of soft_disable of non-local events, you'll attach bpf
program to sampling events and use map lookup to decide whether
to filter out or not such sampling event?


Yes.


What pt_regs would be in such case?



Sampling is based on interrupts. We can use the pt_regs captured by the IRQ
handler, or we can simply pass NULL to those BPF programs.

Thank you.



Re: [PATCH v2 net-next] bpf: fix bpf_perf_event_read() helper

2015-10-21 Thread Wangnan (F)
After applying this patch I'm unable to make perf pass a perf_event
again, like this:


 # perf record -a -e evt=cycles -e 
./test_config_map.c/maps.pmu_map.event=evt/ --exclude-perf ls


With -v it output:

...
adding perf_bpf_probe:func_write
adding perf_bpf_probe:func_write to 0x367d6a0
add bpf event perf_bpf_probe:func_write_return and attach bpf program 6
adding perf_bpf_probe:func_write_return
adding perf_bpf_probe:func_write_return to 0x3a7fc40
mmap size 528384B
ERROR: failed to insert value to pmu_map[0]
ERROR: Apply config to BPF failed: Invalid option for map, add -v to see 
detail

Opening /sys/kernel/debug/tracing//kprobe_events write=
...

Looks like perf sets attr.inherit for cycles? I'll look into this problem.

Thank you.

On 2015/10/22 6:58, Alexei Starovoitov wrote:

Fix safety checks for bpf_perf_event_read():
- only non-inherited events can be added to perf_event_array map
   (do this check statically at map insertion time)
- dynamically check that event is local and !pmu->count
Otherwise buggy bpf program can cause kernel splat.

Fixes: 35578d798400 ("bpf: Implement function bpf_perf_event_read() that get the 
selected hardware PMU conuter")
Signed-off-by: Alexei Starovoitov 
---
v1->v2: fix compile in case of !CONFIG_PERF_EVENTS

This patch is on top of
http://patchwork.ozlabs.org/patch/533585/
to avoid conflicts.
Even in the worst case the crash is not possible.
Only warn_on_once, so imo net-next is ok.

  kernel/bpf/arraymap.c |9 +
  kernel/events/core.c  |   16 ++--
  2 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index e3cfe46b074f..75529cc94304 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -294,10 +294,11 @@ static void *perf_event_fd_array_get_ptr(struct bpf_map 
*map, int fd)
if (IS_ERR(attr))
return (void *)attr;
  
-	if (attr->type != PERF_TYPE_RAW &&

-   !(attr->type == PERF_TYPE_SOFTWARE &&
- attr->config == PERF_COUNT_SW_BPF_OUTPUT) &&
-   attr->type != PERF_TYPE_HARDWARE) {
+   if ((attr->type != PERF_TYPE_RAW &&
+!(attr->type == PERF_TYPE_SOFTWARE &&
+  attr->config == PERF_COUNT_SW_BPF_OUTPUT) &&
+attr->type != PERF_TYPE_HARDWARE) ||
+   attr->inherit) {
perf_event_release_kernel(event);
return ERR_PTR(-EINVAL);
}
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 64754bfecd70..0b6333265872 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3258,7 +3258,7 @@ static inline u64 perf_event_count(struct perf_event 
*event)
  u64 perf_event_read_local(struct perf_event *event)
  {
unsigned long flags;
-   u64 val;
+   u64 val = -EINVAL;
  
  	/*

 * Disabling interrupts avoids all counter scheduling (context
@@ -3267,12 +3267,14 @@ u64 perf_event_read_local(struct perf_event *event)
local_irq_save(flags);
  
  	/* If this is a per-task event, it must be for current */

-   WARN_ON_ONCE((event->attach_state & PERF_ATTACH_TASK) &&
-event->hw.target != current);
+   if ((event->attach_state & PERF_ATTACH_TASK) &&
+   event->hw.target != current)
+   goto out;
  
  	/* If this is a per-CPU event, it must be for this CPU */

-   WARN_ON_ONCE(!(event->attach_state & PERF_ATTACH_TASK) &&
-event->cpu != smp_processor_id());
+   if (!(event->attach_state & PERF_ATTACH_TASK) &&
+   event->cpu != smp_processor_id())
+   goto out;
  
  	/*

 * It must not be an event with inherit set, we cannot read
@@ -3284,7 +3286,8 @@ u64 perf_event_read_local(struct perf_event *event)
 * It must not have a pmu::count method, those are not
 * NMI safe.
 */
-   WARN_ON_ONCE(event->pmu->count);
+   if (event->pmu->count)
+   goto out;
  
  	/*

 * If the event is currently on this CPU, its either a per-task event,
@@ -3295,6 +3298,7 @@ u64 perf_event_read_local(struct perf_event *event)
event->pmu->read(event);
  
  	val = local64_read(&event->count);

+out:
local_irq_restore(flags);
  
  	return val;





Re: [PATCH v2 net-next] bpf: fix bpf_perf_event_read() helper

2015-10-21 Thread Wangnan (F)



On 2015/10/22 13:00, Alexei Starovoitov wrote:

On 10/21/15 9:49 PM, Wangnan (F) wrote:

After applying this patch I'm unable to make perf pass a perf_event
again, like this:


please do not top post and trim your replies.


  # perf record -a -e evt=cycles -e
./test_config_map.c/maps.pmu_map.event=evt/ --exclude-perf ls

With -v it output:

...
adding perf_bpf_probe:func_write
adding perf_bpf_probe:func_write to 0x367d6a0
add bpf event perf_bpf_probe:func_write_return and attach bpf program 6
adding perf_bpf_probe:func_write_return
adding perf_bpf_probe:func_write_return to 0x3a7fc40
mmap size 528384B
ERROR: failed to insert value to pmu_map[0]
ERROR: Apply config to BPF failed: Invalid option for map, add -v to see
detail
Opening /sys/kernel/debug/tracing//kprobe_events write=
...

Looks like perf sets attr.inherit for cycles? I'll look into this 
problem.


yes. that's perf default.
How did it even work before?!
I was testing with your samples/bpf/tracex6 that sets inherit to zero.



Tested perf record -i option and it works for me:

# echo "" > /sys/kernel/debug/tracing/trace
# perf record -i -a -e evt=cycles -e 
./test_config_map.c/maps.pmu_map.event=evt/ --exclude-perf ls

# cat /sys/kernel/debug/tracing/trace  | grep ls
  ls-8227  [001] dN..  2526.184611: : pmu inc: 82270
  ls-8227  [001] dN..  2526.184626: : pmu inc: 40951
  ls-8227  [001] dN..  2526.184642: : pmu inc: 50659
  ls-8227  [001] dN..  2526.184657: : pmu inc: 43511
  ls-8227  [001] dN..  2526.184675: : pmu inc: 56921
...
And no warning message found in dmesg.

So I think your fix is good, we should improve perf.

Thank you.






Re: [PATCH v2 net-next] bpf: fix bpf_perf_event_read() helper

2015-10-21 Thread Wangnan (F)



On 2015/10/22 6:58, Alexei Starovoitov wrote:

Fix safety checks for bpf_perf_event_read():
- only non-inherited events can be added to perf_event_array map
   (do this check statically at map insertion time)
- dynamically check that event is local and !pmu->count
Otherwise buggy bpf program can cause kernel splat.

Fixes: 35578d798400 ("bpf: Implement function bpf_perf_event_read() that get the 
selected hardware PMU conuter")
Signed-off-by: Alexei Starovoitov 
---
v1->v2: fix compile in case of !CONFIG_PERF_EVENTS

This patch is on top of
http://patchwork.ozlabs.org/patch/533585/
to avoid conflicts.
Even in the worst case the crash is not possible.
Only warn_on_once, so imo net-next is ok.

  kernel/bpf/arraymap.c |9 +
  kernel/events/core.c  |   16 ++--
  2 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index e3cfe46b074f..75529cc94304 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -294,10 +294,11 @@ static void *perf_event_fd_array_get_ptr(struct bpf_map 
*map, int fd)
if (IS_ERR(attr))
return (void *)attr;
  
-	if (attr->type != PERF_TYPE_RAW &&

-   !(attr->type == PERF_TYPE_SOFTWARE &&
- attr->config == PERF_COUNT_SW_BPF_OUTPUT) &&
-   attr->type != PERF_TYPE_HARDWARE) {
+   if ((attr->type != PERF_TYPE_RAW &&
+!(attr->type == PERF_TYPE_SOFTWARE &&
+  attr->config == PERF_COUNT_SW_BPF_OUTPUT) &&
+attr->type != PERF_TYPE_HARDWARE) ||
+   attr->inherit) {


This 'if' statement is so complex. What about using an inline function
instead?


Thank you.



Re: [PATCH 26/31] perf tools: Support perf event alias name

2015-10-22 Thread Wangnan (F)



On 2015/10/22 15:16, Namhyung Kim wrote:

Hi,

On Wed, Oct 21, 2015 at 10:00 PM, Wangnan (F)  wrote:


On 2015/10/21 16:53, Namhyung Kim wrote:

Hi,

On Wed, Oct 14, 2015 at 12:41:37PM +, Wang Nan wrote:

From: He Kuang 

This patch adds new bison rules for specifying an alias name to a perf
event, which allows cmdline refer to previous defined perf event through
its name. With this patch user can give alias name to a perf event using
following cmdline:

   # perf record -e mypmu=cycles ...

To allow parser refer to existing event selecter, pass event list to
'struct parse_events_evlist'. perf_evlist__find_evsel_by_alias() is
introduced to get evsel through its alias.

What about using event name directly?  I guess the alias name is used
only to refer an event so it'd be better to use the event name.
Anyway we need alias as well when event has no name or name is complex.


It is possible to trigger two perf events with the same PMU but
different config options:

  # perf record -e cycles/period=/ -e cycles/period=9/ -a sleep 1

In this case the names of the events are:

cycles/period=/
cycles/period=9/

Using them in the perf cmdline is painful:

  # perf record -e cycles/period=/ -e cycles/period=9/ -e
bpf_program.c/myevent=cycles/period=//...

I understand the need of using aliases but I think it's more natural
to use event name for simple cases..


I will consider this. However, if we allow using the event name directly
like this:

 # perf record -e cycles -e test_pmu.c/myevent=cycles/ ...

then the two '-e' options seem redundant, right? Why not directly use:

 # perf record -e test_pmu.c/myevent=cycles/ ...

and make perf create the cycles event for test_pmu.c?

We can treat a syntax like

 # perf record -e test_pmu.c/myevent=cycles/ ...

as syntactic sugar for

 # perf record -e randomname=cycles -e test_pmu.c/myevent=randomname/ ...

and then we don't need to find evsels through their names if no alias exists.

So this is a new feature, and worth another patch.

Thought?


Thanks,
Namhyung





Re: [PATCH v2 net-next] bpf: fix bpf_perf_event_read() helper

2015-10-22 Thread Wangnan (F)



On 2015/10/22 14:21, Alexei Starovoitov wrote:

On 10/21/15 10:31 PM, Wangnan (F) wrote:

+if ((attr->type != PERF_TYPE_RAW &&
+ !(attr->type == PERF_TYPE_SOFTWARE &&
+   attr->config == PERF_COUNT_SW_BPF_OUTPUT) &&
+ attr->type != PERF_TYPE_HARDWARE) ||
+attr->inherit) {


This 'if' statement is so complex. What about using an inline function
instead?


hmm. don't see how inline function will help readability.



For example (not tested):

 static inline bool perf_event_can_insert_to_map(struct perf_event_attr *attr)
 {
	/* is inherit? */
	if (attr->inherit)
		return false;

	/* is software event? */
	if (attr->type == PERF_TYPE_SOFTWARE)
		return attr->config == PERF_COUNT_SW_BPF_OUTPUT;

	/* Comment... */
	if (attr->type == PERF_TYPE_RAW)
		return true;
	if (attr->type == PERF_TYPE_HARDWARE)
		return true;
	return false;
 }

 ...
 if (!perf_event_can_insert_to_map(attr))


Do you think readability is improved?

Thank you.



Re: [PATCH V5 1/1] bpf: control events stored in PERF_EVENT_ARRAY maps trace data output when perf sampling

2015-10-22 Thread Wangnan (F)



On 2015/10/22 15:39, Ingo Molnar wrote:

* Wangnan (F)  wrote:


[SNIP]


In summary, your either-or logic doesn't hold in BPF world. A BPF
program can only access perf event in a highly restricted way. We
don't allow it calling perf_event_read_local() across core, so it
can't.

Urgh, that's still horridly inconsistent. Can we please come up with a
consistent interface to perf?

BPF program and kernel module are two different worlds as I said before.

I don't think making them to share a common interface is a good idea because
such sharing will give BPF programs too much freedom than it really need, then
it will be hard prevent them to do something bad. If we really need kernel
interface, I think what we need is kernel module, not BPF program.

What do you mean, as this does not parse for me.


Because I'm not very sure about the meaning of "inconsistent" in
Peter's words...

I think what Peter wants us to do is to provide a similar (consistent)
interface between the kernel and eBPF: if the kernel reads from a perf_event
through perf_event_read_local(struct perf_event *), a BPF program should
do this work with similar code, or at least similar logic, so
we would need to create a handle for a perf event, provide a BPF function
called BPF_FUNC_perf_event_read_local, and pass such a handle to it.

I don't think so, because if we want a kernel interface we'd better use a
kernel module, not eBPF; that is why I mentioned kernel modules here.

Ingo, do you think the BPF interface should be *consistent* with anything?

Thank you.


We obviously can (and very likely should) make certain perf functionality
available to BPF programs.

It should still be a well defined yet flexible iterface, with safe behavior,
obviously - all in line with existing BPF sandboxing principles.

'Kernel modules' don't enter this consideration at all, not sure why you mention
them - all this functionality is also available if CONFIG_MODULES is turned off
completely.

Thanks,

Ingo






Re: [PATCH 26/31] perf tools: Support perf event alias name

2015-10-22 Thread Wangnan (F)



On 2015/10/22 15:53, Namhyung Kim wrote:

On Thu, Oct 22, 2015 at 4:29 PM, Wangnan (F)  wrote:




[SNIP]

I understand the need of using aliases but I think it's more natural
to use event name for simple cases..


I will consider this. However, if we allow using the event name directly
like this:

  # perf record -e cycles -e test_pmu.c/myevent=cycles/ ...

then the two '-e' options seem redundant, right? Why not directly use:

  # perf record -e test_pmu.c/myevent=cycles/ ...

and make perf create the cycles event for test_pmu.c?

We can treat a syntax like

  # perf record -e test_pmu.c/myevent=cycles/ ...

as syntactic sugar for

  # perf record -e randomname=cycles -e test_pmu.c/myevent=randomname/ ...

and then we don't need to find evsels through their names if no alias exists.

So this is a new feature, and worth another patch.

Thought?

Not sure it's worth it.  It can confuse users IMHO.

Isn't it enough to give them in a single argument?

   # perf record -e cycles,test_pmu.c/myevent=cycles/


OK. I have put it on my todo-list.

Thank you.

Thanks,
Namhyung





Re: [PATCH V5 1/1] bpf: control events stored in PERF_EVENT_ARRAY maps trace data output when perf sampling

2015-10-22 Thread Wangnan (F)



On 2015/10/22 17:06, Peter Zijlstra wrote:

On Wed, Oct 21, 2015 at 02:19:49PM -0700, Alexei Starovoitov wrote:


Urgh, that's still horridly inconsistent. Can we please come up with a
consistent interface to perf?

My suggestion was to do ioctl(enable/disable) of events from userspace
after receiving notification from kernel via my bpf_perf_event_output()
helper.
Wangnan's argument was that display refresh happens often and it's fast,
so the time taken by user space to enable events on all cpus is too
slow and ioctl does ipi and disturbs other cpus even more.
So soft_disable done by the program to enable/disable particular events
on all cpus kinda makes sense.

And this all makes me think I still have no clue what you're all trying
to do here.

Who cares about display updates and why. And why should there be an
active userspace part to eBPF programs?


So you want the background story? OK, let me describe it. This mail is not
short, so please be patient.

On a smartphone, if the time between two frames is longer than 16ms, the user
can notice it. This is a display glitch. We want to check those glitches
with perf to find what causes them. The basic idea is: use 'perf record'
to collect enough information, find the samples generated just before
the glitch from perf.data, then analyze them offline.

A lot of work needs to be done before perf can support such
analysis. One important thing is to reduce the overhead of perf, to
avoid perf itself becoming the cause of glitches. We can do this by reducing
the number of events perf collects, but then we won't have enough
information to analyze when a glitch happens. Another way we are trying to
implement now is to dynamically turn events on and off, or at least
enable/disable sampling dynamically, because the overhead of copying those
samples is a big part of perf's total overhead. After that we can trace as
many events as possible, but only fetch data from them when we detect a
glitch.

A BPF program is the best choice for describing our relatively complex glitch
detection model. For simplicity, you can think of our glitch detection model
as containing 3 points in user programs: point A means display refreshing
begins, point B means display refreshing has lasted for 16ms, point C means
display refreshing has finished. They can be found in user programs and
probed through uprobes, to which we can attach BPF programs.

Then we want to use perf this way:

Sample the 'cycles' event to collect callgraphs so we know what all CPUs are
doing during the refresh, but only start sampling when we detect a
frame refresh lasting 16ms.

We can translate the above logic into BPF programs: at point B, enable
the 'cycles' event to generate samples; at point C, disable the 'cycles'
event to avoid generating useless samples.

Then, make 'perf record -e cycles' collect call graphs and other
information through the cycles event. From perf.data, we can use 'perf
script' and other tools to analyze the resulting data.

We have considered some potential solutions and found them inappropriate or
requiring too much work:

 1. As you may prefer, create BPF functions that call pmu->stop() /
    pmu->start() for a perf event on the CPU on which the BPF program gets
    triggered.

    The shortcoming of this method is that we can only turn on the perf
    event on the CPU executing point B. We are unable to know what the other
    CPUs are doing during the glitch, but what we want is system-wide
    information. In addition, point B and point C are not necessarily
    executed on the same core, so we may shut down the wrong event if the
    scheduler decides to run point C on another core.


 2. As Alexei suggests, output something through his bpf_perf_event_output()
    helper, and let perf disable and enable those events using ioctl in
    userspace.

    This is a good idea, but it introduces an asynchronization problem.
    bpf_perf_event_output() outputs something to perf's ring buffer, but
    perf only notices the message when epoll_wait() returns. We tested a
    tracepoint event and found that an event may be seen by perf several
    seconds after it is generated. One solution is to use --no-buffer, but
    that still needs perf to be scheduled on time and to parse its ring
    buffer quickly. Also, please note that point C may appear very shortly
    after point B, because some apps are optimized to bring their refresh
    time very close to 16ms.

This is the background story. It looks like either we implement some
cross-CPU control, or we settle for coarse-grained control. The best method
we can think of is to use atomic operations to soft-enable/disable perf
events. We believe this is simple enough and won't cause problems.

Thank you.



Re: [PATCH v2 net-next] bpf: fix bpf_perf_event_read() helper

2015-10-22 Thread Wangnan (F)



On 2015/10/22 6:58, Alexei Starovoitov wrote:

[SNIP]

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index e3cfe46b074f..75529cc94304 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -294,10 +294,11 @@ static void *perf_event_fd_array_get_ptr(struct bpf_map 
*map, int fd)
if (IS_ERR(attr))
return (void *)attr;
  
-	if (attr->type != PERF_TYPE_RAW &&

-   !(attr->type == PERF_TYPE_SOFTWARE &&
- attr->config == PERF_COUNT_SW_BPF_OUTPUT) &&
-   attr->type != PERF_TYPE_HARDWARE) {
+   if ((attr->type != PERF_TYPE_RAW &&
+!(attr->type == PERF_TYPE_SOFTWARE &&
+  attr->config == PERF_COUNT_SW_BPF_OUTPUT) &&
+attr->type != PERF_TYPE_HARDWARE) ||
+   attr->inherit) {
perf_event_release_kernel(event);
return ERR_PTR(-EINVAL);
}


I have a question on inherit, not related to this patch:
Is it safe for perf to disable attr->inherit if the event is system wide?
I haven't read the related code completely. To my current knowledge, the
behavior of a system-wide perf event should be the same whether inherit is
set or not. Is that true?

Thank you.





Re: [PATCH 28/31] perf probe: Init symbol as kprobe

2015-09-01 Thread Wangnan (F)



On 2015/9/2 4:11, Arnaldo Carvalho de Melo wrote:

Em Sat, Aug 29, 2015 at 04:22:02AM +, Wang Nan escreveu:

Before this patch, add_perf_probe_events() inits symbol maps only for
uprobes if the first 'struct perf_probe_event' passed to it is a uprobe
event. This is a trick, because 'perf probe''s command line syntax
constrains the first elements of the probe_event array to be kprobes
if there is any kprobe in it.

However, with the incoming BPF uprobe support, that constraint no longer
holds, since 'perf record' will also probe k/u probes through a BPF
object, and it is possible to pass an array that contains a kprobe but
whose first element is a uprobe.

This patch inits symbol maps for kprobes even if all of the events are
uprobes, because the extra cost should be small enough.

Masami, are you Ok with this one?


I think he would be okay with it because it is his idea :)

Please refer to: http://lkml.kernel.org/n/558e5f42.1060...@hitachi.com

Thank you.

- Arnaldo
  

Signed-off-by: Wang Nan 
Cc: Alexei Starovoitov 
Cc: Brendan Gregg 
Cc: Daniel Borkmann 
Cc: David Ahern 
Cc: He Kuang 
Cc: Jiri Olsa 
Cc: Kaixu Xia 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Paul Mackerras 
Cc: Peter Zijlstra 
Cc: Zefan Li 
Cc: pi3or...@163.com
Cc: Arnaldo Carvalho de Melo 
Link: 
http://lkml.kernel.org/n/1436445342-1402-39-git-send-email-wangn...@huawei.com
---
  tools/perf/util/probe-event.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c
index e720913..b94a8d7 100644
--- a/tools/perf/util/probe-event.c
+++ b/tools/perf/util/probe-event.c
@@ -2789,7 +2789,7 @@ int add_perf_probe_events(struct perf_probe_event *pevs, 
int npevs,
  {
int i, ret;
  
-	ret = init_symbol_maps(pevs->uprobes);

+   ret = init_symbol_maps(false);
if (ret < 0)
return ret;
  
--

2.1.0





Re: [PATCH] perf tools: Don't set leader if parser doesn't collect an evsel

2015-09-01 Thread Wangnan (F)



On 2015/9/2 10:53, Wang Nan wrote:

Similar to the patch 'perf tools: Don't set cmdline_group_boundary if no
evsel is collected': in case the parser collects no evsel (at this
point it shouldn't happen), parse_events__set_leader() is not safe.

This patch checks list_empty() before calling __perf_evlist__set_leader()
for safety reasons.

Signed-off-by: Wang Nan 
Cc: Arnaldo Carvalho de Melo 
Cc: Alexei Starovoitov 
Cc: Jiri Olsa 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Zefan Li 
Cc: pi3or...@163.com
---

I'd like to queue this patch into my next pull request. Since it is not
a real bug, it may be dropped.


I think merging this into patch 2/31 would be better. If we decide to drop
it, then only one patch needs to be considered.


---
  tools/perf/util/parse-events.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index f2c0317..836d226 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -793,6 +793,9 @@ void parse_events__set_leader(char *name, struct list_head 
*list)
  {
struct perf_evsel *leader;
  
+	if (list_empty(list))

+   return;
+
__perf_evlist__set_leader(list);
leader = list_entry(list->next, struct perf_evsel, node);
leader->group_name = name ? strdup(name) : NULL;





Re: [PATCH 07/31] perf probe: Attach trace_probe_event with perf_probe_event

2015-09-01 Thread Wangnan (F)



On 2015/9/2 12:32, Namhyung Kim wrote:

Hi,

On Sat, Aug 29, 2015 at 04:21:41AM +, Wang Nan wrote:

This patch drops the struct __event_package structure. Instead, it adds
trace_probe_event to 'struct perf_probe_event'.

The trace_probe_event information gives further patches a chance to access
actual probe points and actual arguments. Using them, bpf_loader will
be able to attach one BPF program to different probing points of an
inline function (which has multiple probing points) and of glob
functions. Moreover, by reading the argument information, BPF code for
reading those arguments can be generated.

Signed-off-by: Wang Nan 
Cc: Alexei Starovoitov 
Cc: Brendan Gregg 
Cc: Daniel Borkmann 
Cc: David Ahern 
Cc: He Kuang 
Cc: Jiri Olsa 
Cc: Kaixu Xia 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Paul Mackerras 
Cc: Peter Zijlstra 
Cc: Zefan Li 
Cc: pi3or...@163.com
Cc: Arnaldo Carvalho de Melo 
Link: 
http://lkml.kernel.org/n/1436445342-1402-22-git-send-email-wangn...@huawei.com
---

[SNIP]


+int add_perf_probe_events(struct perf_probe_event *pevs, int npevs,
+ bool cleanup)
+{
+   int i, ret;
  
  	ret = init_symbol_maps(pevs->uprobes);

-   if (ret < 0) {
-   free(pkgs);
+   if (ret < 0)
return ret;
-   }
  
  	/* Loop 1: convert all events */

for (i = 0; i < npevs; i++) {
-   pkgs[i].pev = &pevs[i];
/* Init kprobe blacklist if needed */
-   if (!pkgs[i].pev->uprobes)
+   if (pevs[i].uprobes)

Missing '!'.


It's my fault. Already fixed in my local tree.

Thank you for your review!


Thanks,
Namhyung



kprobe_blacklist__init();
/* Convert with or without debuginfo */
-   ret  = convert_to_probe_trace_events(pkgs[i].pev,
-&pkgs[i].tevs);
-   if (ret < 0)
+   ret  = convert_to_probe_trace_events(&pevs[i], &pevs[i].tevs);
+   if (ret < 0) {
+   cleanup = true;
goto end;
-   pkgs[i].ntevs = ret;
+   }
+   pevs[i].ntevs = ret;
}
/* This just release blacklist only if allocated */
kprobe_blacklist__release();
  
  	/* Loop 2: add all events */

for (i = 0; i < npevs; i++) {
-   ret = __add_probe_trace_events(pkgs[i].pev, pkgs[i].tevs,
-  pkgs[i].ntevs,
+   ret = __add_probe_trace_events(&pevs[i], pevs[i].tevs,
+  pevs[i].ntevs,
   probe_conf.force_add);
if (ret < 0)
break;
}
  end:
/* Loop 3: cleanup and free trace events  */
-   for (i = 0; i < npevs; i++) {
-   for (j = 0; j < pkgs[i].ntevs; j++)
-   clear_probe_trace_event(&pkgs[i].tevs[j]);
-   zfree(&pkgs[i].tevs);
-   }
-   free(pkgs);
+   for (i = 0; cleanup && (i < npevs); i++)
+   cleanup_perf_probe_event(&pevs[i]);
exit_symbol_maps();
  
  	return ret;





Re: [PATCH] perf tools: Don't set leader if parser doesn't collect an evsel

2015-09-01 Thread Wangnan (F)



On 2015/9/2 13:57, Masami Hiramatsu (平松雅巳) wrote:

From: Wang Nan [mailto:wangn...@huawei.com]

Similar to patch 'perf tools: Don't set cmdline_group_boundary if no
evsel is collected', in the case where the parser collects no evsel
(which shouldn't happen at this point), parse_events__set_leader() is
not safe.

This patch checks list_empty() before calling __perf_evlist__set_leader(),
for safety reasons.

Signed-off-by: Wang Nan 
Cc: Arnaldo Carvalho de Melo 
Cc: Alexei Starovoitov 
Cc: Jiri Olsa 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Zefan Li 
Cc: pi3or...@163.com
---

I'd like to queue this patch into my next pull request. Since it is not
a real bug, it may be dropped.

---
  tools/perf/util/parse-events.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index f2c0317..836d226 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -793,6 +793,9 @@ void parse_events__set_leader(char *name, struct list_head 
*list)
  {
struct perf_evsel *leader;

+   if (list_empty(list))

Would we need to warn/debug something here?


OK, let's add a WARN message here and in the other 2 places.

Thank you.


Thank you,


+   return;
+
__perf_evlist__set_leader(list);
leader = list_entry(list->next, struct perf_evsel, node);
leader->group_name = name ? strdup(name) : NULL;
--
1.8.3.4





Re: [PATCH] perf tools: Don't write to evsel if parser doesn't collect evsel

2015-09-01 Thread Wangnan (F)

Sorry, forget to CC kernel mailing list...

On 2015/9/2 14:49, Wang Nan wrote:

If parse_events__scanner() collects no entry, perf_evlist__last(evlist)
is invalid.

Although it shouldn't happen at this point, before calling
perf_evlist__last(), we should ensure the list is not empty, for safety
reasons.

There are 3 places that need this check:

  1. Before setting cmdline_group_boundary;
  2. Before __perf_evlist__set_leader();
  3. In foreach_evsel_in_last_glob.

Signed-off-by: Wang Nan 
Cc: Arnaldo Carvalho de Melo 
Cc: Alexei Starovoitov 
Cc: Jiri Olsa 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Zefan Li 
Cc: pi3or...@163.com
---

Merge all 3 list_empty() test together into one patch.

Add warning messages.

Improve commit message.

---
  tools/perf/util/parse-events.c | 22 +++---
  1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index d826e6f..069848d 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -793,6 +793,11 @@ void parse_events__set_leader(char *name, struct list_head 
*list)
  {
struct perf_evsel *leader;
  
+	if (list_empty(list)) {

+   __WARN_printf("WARNING: failed to set leader: empty list");
+   return;
+   }
+
__perf_evlist__set_leader(list);
leader = list_entry(list->next, struct perf_evsel, node);
leader->group_name = name ? strdup(name) : NULL;
@@ -1143,10 +1148,15 @@ int parse_events(struct perf_evlist *evlist, const char 
*str,
int entries = data.idx - evlist->nr_entries;
struct perf_evsel *last;
  
+		if (!list_empty(&data.list)) {

+   last = list_entry(data.list.prev,
+ struct perf_evsel, node);
+   last->cmdline_group_boundary = true;
+   } else
+   __WARN_printf("WARNING: event parser found nothing");
+
perf_evlist__splice_list_tail(evlist, &data.list, entries);
evlist->nr_groups += data.nr_groups;
-   last = perf_evlist__last(evlist);
-   last->cmdline_group_boundary = true;
  
  		return 0;

}
@@ -1252,7 +1262,13 @@ foreach_evsel_in_last_glob(struct perf_evlist *evlist,
struct perf_evsel *last = NULL;
int err;
  
-	if (evlist->nr_entries > 0)

+   /*
+* Don't return when list_empty, give func a chance to report
+* error when it found last == NULL.
+*
+* So no need to WARN here, let *func do this.
+*/
+   if (!list_empty(&evlist->entries))
last = perf_evlist__last(evlist);
  
  	do {
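
The hazard the commit message describes can be demonstrated with a minimal kernel-style intrusive list (a toy model; these are not perf's real types): on an empty list, head->prev points back at the head itself, so the container_of() arithmetic behind perf_evlist__last() yields a pointer that does not address a real element.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy kernel-style intrusive list -- just enough to show the hazard. */
struct list_head { struct list_head *next, *prev; };
#define LIST_HEAD_INIT(name) { &(name), &(name) }

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct evsel_sketch {
	int idx;
	struct list_head node;
};

static bool list_empty(const struct list_head *head)
{
	return head->next == head;
}

/*
 * Unguarded: mirrors perf_evlist__last(). On an empty list,
 * head->prev == head, so the computed pointer aliases the list head
 * itself -- not a valid evsel_sketch.
 */
static struct evsel_sketch *last_sketch(struct list_head *head)
{
	return container_of(head->prev, struct evsel_sketch, node);
}

/* Guarded variant, as the patch argues for: NULL on an empty list. */
static struct evsel_sketch *last_sketch_safe(struct list_head *head)
{
	if (list_empty(head))
		return NULL;
	return last_sketch(head);
}
```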





Re: [PATCH] perf tools: Support bpf prologue for arm64

2015-09-02 Thread Wangnan (F)



On 2015/9/2 18:34, Will Deacon wrote:

On Mon, Aug 31, 2015 at 09:16:28PM +0100, Arnaldo Carvalho de Melo wrote:

Em Sat, Aug 29, 2015 at 03:16:52AM +, He Kuang escreveu:

This patch implements arch_get_reg_info() for arm64 to enable
HAVE_BPF_PROLOGUE feature. For arm64, structure pt_regs is not composed
by fields of register names but an array of regs, so here we simply
multiply fixed register size by index number to get the byte offset.

Hi Jean, Will, are you ok with this? Can I have Acked-by or Reviewed-by
tags from you?

Any idea what this applies against? It's difficult to review it without
knowing what PERF_HAVE_ARCH_GET_REG_INFO or HAVE_BPF_PROLOGUE are.

Anyway, a couple of small comments below.


He, please try to add the authors of the files you change in the CC
list.

- Arnaldo
  

Signed-off-by: He Kuang 
---
  tools/perf/arch/arm64/Makefile  |  1 +
  tools/perf/arch/arm64/util/dwarf-regs.c | 26 ++
  2 files changed, 27 insertions(+)

Any plan to do arch/arm/ too?


diff --git a/tools/perf/arch/arm64/Makefile b/tools/perf/arch/arm64/Makefile
index 7fbca17..1256e6e 100644
--- a/tools/perf/arch/arm64/Makefile
+++ b/tools/perf/arch/arm64/Makefile
@@ -1,3 +1,4 @@
  ifndef NO_DWARF
  PERF_HAVE_DWARF_REGS := 1
  endif
+PERF_HAVE_ARCH_GET_REG_INFO := 1
diff --git a/tools/perf/arch/arm64/util/dwarf-regs.c 
b/tools/perf/arch/arm64/util/dwarf-regs.c
index d49efeb..cb2c50a 100644
--- a/tools/perf/arch/arm64/util/dwarf-regs.c
+++ b/tools/perf/arch/arm64/util/dwarf-regs.c
@@ -10,6 +10,10 @@
  
  #include 

  #include 
+#include 
+#include 
+
+#define PT_REG_SIZE (sizeof(((struct user_pt_regs *)0)->regs[0]))

Why not add an "offset" field to pt_regs_dwarfnum and just calculate
that using offsetof in the GPR_DWARFNUM_NAME macro?

  
  struct pt_regs_dwarfnum {

const char *name;
@@ -78,3 +82,25 @@ const char *get_arch_regstr(unsigned int n)
return roff->name;
return NULL;
  }
+
+#ifdef HAVE_BPF_PROLOGUE
+int arch_get_reg_info(const char *name, int *offset)
+{
+   const struct pt_regs_dwarfnum *roff;
+
+   if (!name || !offset)
+   return -1;
+
+   for (roff = regdwarfnum_table; roff->name != NULL; roff++) {
+   if (!strcmp(roff->name, name)) {
+   if (roff->dwarfnum < 0)

When is this ever true? Do we need to update REG_DWARFNUM_END to use
a negative index?

Will


Thank you for your review.

Yesterday Arnaldo gave some comment on corresponding function on x86_64,
please see discussion in following two threads:

http://lkml.kernel.org/n/20150831204328.ge2...@redhat.com
http://lkml.kernel.org/n/1441090789-108382-2-git-send-email-wangn...@huawei.com

The conclusion is that we should try our best to reuse the kernel's
regs_query_register_offset() function, providing the same name and API
and reusing its code. However, we should avoid using it directly from
the kernel directory; instead, we should copy the code into perf's
directory.

Therefore, He Kuang needs to redo his work after a while.

Any suggestions about code reuse?

For ARM: I don't think this is hard work. In fact, we have tested our
perf BPF code on an ARM board. We haven't submitted it because we
haven't met a real profiling task on ARM that requires fetching
arguments from a BPF program.

Thank you.
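
As a concrete illustration of the offsetof()-based table Will suggested, here is a self-contained userspace sketch; fake_user_pt_regs and the table contents are illustrative stand-ins, not the kernel's real user_pt_regs or perf's actual dwarfnum table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative stand-in for the arm64 user_pt_regs layout. */
struct fake_user_pt_regs {
	unsigned long long regs[31];
	unsigned long long sp;
	unsigned long long pc;
};

struct reg_offset {
	const char *name;
	int offset;	/* byte offset into struct fake_user_pt_regs */
};

/*
 * Will's suggestion: record offsetof() per register in the table
 * instead of multiplying a fixed slot size by the DWARF index.
 */
#define GPR(n) { "x" #n, offsetof(struct fake_user_pt_regs, regs[n]) }

static const struct reg_offset reg_table[] = {
	GPR(0), GPR(1), GPR(2), GPR(29), GPR(30),
	{ "sp", offsetof(struct fake_user_pt_regs, sp) },
	{ "pc", offsetof(struct fake_user_pt_regs, pc) },
	{ NULL, -1 },
};

/* Same shape as the proposed arch_get_reg_info(): 0 on success. */
static int arch_get_reg_info_sketch(const char *name, int *offset)
{
	const struct reg_offset *r;

	if (!name || !offset)
		return -1;
	for (r = reg_table; r->name; r++) {
		if (!strcmp(r->name, name)) {
			*offset = r->offset;
			return 0;
		}
	}
	return -1;
}
```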



Re: [PATCH 21/31] perf test: Enforce LLVM test, add kbuild test

2015-10-20 Thread Wangnan (F)



On 2015/10/19 22:53, Arnaldo Carvalho de Melo wrote:

Em Mon, Oct 19, 2015 at 11:42:31PM +0900, Namhyung Kim escreveu:

Hi,

On Wed, Oct 14, 2015 at 12:41:32PM +, Wang Nan wrote:

This patch enforces existing LLVM test, makes it compile more than one
BPF source file. The compiled results are stored, can be used for other
testcases. Except the first testcase (named LLVM_TESTCASE_BASE), failures
of other test cases are not considered as failure of the whole test.

Adds a kbuild testcase to check whether kernel headers can be correctly
found.

For example:

  # perf test LLVM

38: Test LLVM searching and compiling: 
(llvm.kbuild-dir can be fixed) Ok

IMHO it'd be better to keep the test result simply as either "Ok" or
"FAILED" and provide details with -v option.


I have to say I'm not the first one to output things like this:

 #perf test dummy attr
 14: struct perf_event_attr setup : 
(omitted) Ok
 23: Test using a dummy software event to keep tracking   : (not 
supported) Ok


Do you think they also need to be fixed?


"Skipped" looks more natural here, with the reason for it skipping being shown
only with -v.


For this specific case, I will change the logic:

 1. When the basic test fails (no clang, compile error, ...), report failure;

 2. When the basic test passes but the kbuild test or later test cases
fail, report Skip.

 3. Don't output any other information.

Thank you.
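
The reporting policy above reduces to a small decision function; a sketch with invented names:

```c
#include <assert.h>
#include <stdbool.h>

enum sk_result { SK_OK = 0, SK_FAIL = -1, SK_SKIP = -2 };

/*
 * A failing basic test (no clang, compile error, ...) is a hard
 * failure; once the basic test passes, failures in the kbuild test or
 * later cases only downgrade the result to Skip.
 */
static enum sk_result llvm_test_result(bool basic_ok, bool later_ok)
{
	if (!basic_ok)
		return SK_FAIL;
	if (!later_ok)
		return SK_SKIP;
	return SK_OK;
}
```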


Ingo made some comments about 'perf test' output recently; have you read them?


Anyway I found the tracepoint error message is annoying in the current

You mean all these warnings:

[root@zoo ~]# perf test "parse event"
  5: parse events tests   :  Warning: 
[sunrpc:xprt_lookup_rqst] function __builtin_constant_p not defined
   Warning: [sunrpc:xprt_transmit] function __builtin_constant_p not defined
   Warning: [sunrpc:xprt_complete_rqst] function __builtin_constant_p not 
defined

   Warning: [xen:xen_mmu_set_pud] function sizeof not defined
   Warning: [xen:xen_mmu_set_pgd] function sizeof not defined
   Warning: [xen:xen_mmu_ptep_modify_prot_start] function sizeof not defined
   Warning: [xen:xen_mmu_ptep_modify_prot_commit] function sizeof not defined
  Ok
[root@zoo ~]#

i.e. making those warnings appear only under 'perf test -v'? Cool!


perf test output.  Will send a fix soon.

Thanks,
Namhyung





Re: [PATCH 21/31] perf test: Enforce LLVM test, add kbuild test

2015-10-20 Thread Wangnan (F)

Hi Namhyung and Arnaldo,

I changed my testing-related patches in my local git tree. All the
changes combined are at the end of this mail. I removed all stderr
output and improved the debug messages. Please have a look.

Thank you.


diff --git a/tools/perf/tests/bpf.c b/tools/perf/tests/bpf.c
index 5a6290a..453eff8 100644
--- a/tools/perf/tests/bpf.c
+++ b/tools/perf/tests/bpf.c
@@ -49,7 +49,7 @@ static struct bpf_object *prepare_bpf(const char 
*name, void *obj_buf,


 obj = bpf__prepare_load_buffer(obj_buf, obj_buf_sz, name);
 if (IS_ERR(obj)) {
-fprintf(stderr, " (compile failed)");
+pr_debug("Compile BPF program failed.\n");
 return NULL;
 }
 return obj;
@@ -82,7 +82,7 @@ static int do_test(struct bpf_object *obj, int 
(*func)(void), int expect)


 err = parse_events_load_bpf_obj(&parse_evlist, &parse_evlist.list, 
obj);

 if (err || list_empty(&parse_evlist.list)) {
-fprintf(stderr, " (Failed to add events selected by BPF)");
+pr_debug("Failed to add events selected by BPF\n");
 if (!err)
 err = -EINVAL;
 goto out;
@@ -140,7 +140,7 @@ static int do_test(struct bpf_object *obj, int 
(*func)(void), int expect)

 }

 if (count != expect) {
-fprintf(stderr, " (filter result incorrect: %d != %d)", count, 
expect);

+pr_debug("BPF filter result incorrect: %d != %d\n", count, expect);
 err = -EBADF;
 }

@@ -164,16 +164,14 @@ static int __test__bpf(int index, const char *name,

 test_llvm__fetch_bpf_obj(&obj_buf, &obj_buf_sz, index);
 if (!obj_buf || !obj_buf_sz) {
-if (verbose == 0)
-fprintf(stderr, " (%s)", message_compile);
+pr_debug("Failed to compile: %s\n", message_compile);
 return TEST_SKIP;
 }

 obj = prepare_bpf(name, obj_buf, obj_buf_sz);
 if (!obj) {
 err = -EINVAL;
-if ((verbose == 0) && (message_load[0] != '\0'))
-fprintf(stderr, " (%s)", message_load);
+pr_debug("Failed to load: %s", message_load);
 goto out;
 }

@@ -192,7 +190,7 @@ int test__bpf(void)
 int err;

 if (geteuid() != 0) {
-fprintf(stderr, " (try run as root)");
+pr_debug("Only root can run BPF test\n");
 return TEST_SKIP;
 }

@@ -214,7 +212,7 @@ int test__bpf(void)
   (NR_ITERS + 1) / 4);
 return err;
 #else
-fprintf(stderr, " (skip BPF prologue test)");
+pr_debug("BPF prologue is disabled when compiling, skip this test\n");
 return TEST_OK;
 #endif
 }
diff --git a/tools/perf/tests/llvm.c b/tools/perf/tests/llvm.c
index e722e8a..7c3b2c3 100644
--- a/tools/perf/tests/llvm.c
+++ b/tools/perf/tests/llvm.c
@@ -13,19 +13,23 @@
 struct llvm_testcase {
 const char *source;
 const char *errmsg;
+const char *hintmsg;
 struct test_llvm__bpf_result *result;
 bool tried;
 } llvm_testcases[NR_LLVM_TESTCASES + 1] = {
 [LLVM_TESTCASE_BASE]= {.source = test_llvm__bpf_prog,
.errmsg = "Basic LLVM compiling failed",
+   .hintmsg = "Check llvm.clang-path option in 
~/.perfconfig",

.tried = false},
 [LLVM_TESTCASE_KBUILD]= {.source = 
test_llvm__bpf_test_kbuild_prog,

-   .errmsg = "llvm.kbuild-dir can be fixed",
+   .errmsg = "Unable to find usable kbuild dir",
+   .hintmsg = "Check llvm.kbuild-dir option in 
~/.perfconfig",

.tried = false},
 /* Don't output if this one fail. */
 [LLVM_TESTCASE_BPF_PROLOGUE]= {
.source = test_llvm__bpf_test_prologue_prog,
-   .errmsg = "failed for unknown reason",
+   .errmsg = "Unable to compile BPF prologue testing 
program",
+   .hintmsg = "This is an internal error, please report 
it",

.tried = false},
 {.source = NULL}
 };
@@ -43,16 +47,16 @@ static int test__bpf_parsing(void *obj_buf, size_t 
obj_buf_sz)


 obj = bpf_object__open_buffer(obj_buf, obj_buf_sz, NULL);
 if (!obj)
-return -1;
+return TEST_FAIL;
 bpf_object__close(obj);
-return 0;
+return TEST_OK;
 }
 #else
 static int test__bpf_parsing(void *obj_buf __maybe_unused,
  size_t obj_buf_sz __maybe_unused)
 {
-fprintf(stderr, " (skip bpf parsing)");
-return 0;
+pr_debug("BPF support is not compiled, skip BPF parsing\n");
+return TEST_SKIP;
 }
 #endif

@@ -70,8 +74,8 @@ compose_source(const char *raw_source)
 err = sscanf(utsname.release, "%d.%d.%d",
  &version, &patchlevel, &sublevel);
 if (err != 3) {
-fprintf(stderr, " (Can't get kernel version from uname '%s')",
-utsname.release);
+pr_debug("Unable to get kernel version from uname '%s'\n",
+ utsname.release);
 return NULL;
 }

@@ -104,7 +108,7 @@ static int __test__llvm(int i)
  * and clang is not found 

Re: [PATCH 13/16] perf callchain: Switch default to 'graph,0.5,caller'

2015-10-20 Thread Wangnan (F)

Hi Arnaldo,

On 2015/10/6 5:03, Arnaldo Carvalho de Melo wrote:

From: Arnaldo Carvalho de Melo 

Which is the most common default found in other similar tools.


Could you please show me some examples of "other similar tools"?
For me, in most cases I prefer callee order, because most of my
work is explaining why some code gets executed more times than
expected.

Also, I think changing default settings should be done carefully.

This is my story: after switching to a new version of perf, for a period
of time plenty of perf users in my company were confused by the first
column of 'perf report', because the sum of the percentages listed there
was much higher than 100%. They came to me because they thought it was a
bug in perf that broke their routine profiling work. The "problem" was
caused by the addition of "--children". New perf made '--children' the
default behavior as soon as it supported that option, but old perf showed
output similar to '--no-children'. It is hard to explain the principle of
call stack accumulation and why we need '--children' to those perf users
(they learned perf's command line from others, and don't have enough time
to read perf's documentation or even the help output. Although the title
of the first column was changed to 'Children', I don't think they could
understand its meaning. I think some of them didn't even notice there was
an additional column in their output. They were just confused and angry).
Also, as you can expect, this change broke some scripts. In those days I
had to make our IM tool respond with information about "--no-children"
automatically.

This patch changes the default output again. A similar thing will happen
again. I think this time I can make some preparation, for example,
prepare a new script to restore the old behavior?

Thank you.


Requested-by: Ingo Molnar 
Cc: Adrian Hunter 
Cc: Borislav Petkov 
Cc: Chandler Carruth 
Cc: David Ahern 
Cc: Frederic Weisbecker 
Cc: Jiri Olsa 
Cc: Namhyung Kim 
Cc: Stephane Eranian 
Cc: Wang Nan 
Link: https://www.youtube.com/watch?v=nXaxk27zwlk
Link: http://lkml.kernel.org/n/tip-v8lq36aispvdwgxdmt9p9...@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo 
---
  tools/perf/Documentation/perf-report.txt | 2 +-
  tools/perf/builtin-report.c  | 4 ++--
  tools/perf/util/util.c   | 4 ++--
  3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/tools/perf/Documentation/perf-report.txt 
b/tools/perf/Documentation/perf-report.txt
index ce499035e6d8..e4fdeeb51123 100644
--- a/tools/perf/Documentation/perf-report.txt
+++ b/tools/perf/Documentation/perf-report.txt
@@ -192,7 +192,7 @@ OPTIONS
when available. Usually more convenient to use --branch-history
for this.
  
-	Default: fractal,0.5,callee,function.

+   Default: graph,0.5,caller
  
  --children::

Accumulate callchain of children to parent entry so that then can
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index b5623639f67d..3b23b25d1589 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -633,7 +633,7 @@ int cmd_report(int argc, const char **argv, const char 
*prefix __maybe_unused)
bool has_br_stack = false;
int branch_mode = -1;
bool branch_call_mode = false;
-   char callchain_default_opt[] = "fractal,0.5,callee";
+   char callchain_default_opt[] = "graph,0.5,caller";
const char * const report_usage[] = {
"perf report []",
NULL
@@ -701,7 +701,7 @@ int cmd_report(int argc, const char **argv, const char 
*prefix __maybe_unused)
"Only display entries with parent-match"),
OPT_CALLBACK_DEFAULT('g', "call-graph", &report, 
"output_type,min_percent[,print_limit],call_order[,branch]",
 "Display callchains using output_type (graph, flat, fractal, or 
none) , min percent threshold, optional print limit, callchain order, key (function or 
address), add branches. "
-"Default: fractal,0.5,callee,function", 
&report_parse_callchain_opt, callchain_default_opt),
+"Default: graph,0.5,caller", &report_parse_callchain_opt, 
callchain_default_opt),
OPT_BOOLEAN(0, "children", &symbol_conf.cumulate_callchain,
"Accumulate callchains of children and show total overhead as 
well"),
OPT_INTEGER(0, "max-stack", &report.max_stack,
diff --git a/tools/perf/util/util.c b/tools/perf/util/util.c
index ce465b259e52..c1bf9ff210b0 100644
--- a/tools/perf/util/util.c
+++ b/tools/perf/util/util.c
@@ -17,9 +17,9 @@
  #include "callchain.h"
  
  struct callchain_param	callchain_param = {

-   .mode   = CHAIN_GRAPH_REL,
+   .mode   = CHAIN_GRAPH_ABS,
.min_percent = 0.5,
-   .order  = ORDER_CALLEE,
+   .order  = ORDER_CALLER,
.key= CCKEY_FUNCTION
  };
  




Re: [PATCH v3 net-next] bpf: fix bpf_perf_event_read() helper

2015-10-22 Thread Wangnan (F)



On 2015/10/23 8:10, Alexei Starovoitov wrote:

Fix safety checks for bpf_perf_event_read():
- only non-inherited events can be added to perf_event_array map
   (do this check statically at map insertion time)
- dynamically check that event is local and !pmu->count
Otherwise buggy bpf program can cause kernel splat.

Also fix error path after perf_event_attrs()
and remove redundant 'extern'.

Fixes: 35578d798400 ("bpf: Implement function bpf_perf_event_read() that get the 
selected hardware PMU conuter")
Signed-off-by: Alexei Starovoitov 
---
v2->v3:
. refactor checks based on Wangnan's and Peter's feedback
while refactoring realized that these two issues need fixes as well:
. fix perf_event_attrs() error path
. remove redundant extern

v1->v2: fix compile in case of !CONFIG_PERF_EVENTS

Even in the worst case the crash is not possible.
Only warn_on_once, so imo net-next is ok.

  include/linux/bpf.h  |1 -
  kernel/bpf/arraymap.c|   25 -
  kernel/trace/bpf_trace.c |7 ++-
  3 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index e3a51b74e275..75718fa28260 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -194,7 +194,6 @@ extern const struct bpf_func_proto 
bpf_map_lookup_elem_proto;
  extern const struct bpf_func_proto bpf_map_update_elem_proto;
  extern const struct bpf_func_proto bpf_map_delete_elem_proto;
  
-extern const struct bpf_func_proto bpf_perf_event_read_proto;

  extern const struct bpf_func_proto bpf_get_prandom_u32_proto;
  extern const struct bpf_func_proto bpf_get_smp_processor_id_proto;
  extern const struct bpf_func_proto bpf_tail_call_proto;
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index e3cfe46b074f..3f4c99e06c6b 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -292,16 +292,23 @@ static void *perf_event_fd_array_get_ptr(struct bpf_map 
*map, int fd)
  
  	attr = perf_event_attrs(event);

if (IS_ERR(attr))
-   return (void *)attr;
+   goto err;
  
-	if (attr->type != PERF_TYPE_RAW &&

-   !(attr->type == PERF_TYPE_SOFTWARE &&
- attr->config == PERF_COUNT_SW_BPF_OUTPUT) &&
-   attr->type != PERF_TYPE_HARDWARE) {
-   perf_event_release_kernel(event);
-   return ERR_PTR(-EINVAL);
-   }
-   return event;
+   if (attr->inherit)
+   goto err;
+


Since Peter suggested it is pointless for a system-wide perf_event to
have the inherit bit set [1], I think it should be safe to let a
system-wide perf_event pass this check?

I'll check the code to make sure.

[1] 
http://lkml.kernel.org/r/20151022124142.gq17...@twins.programming.kicks-ass.net



+   if (attr->type == PERF_TYPE_RAW)
+   return event;
+
+   if (attr->type == PERF_TYPE_HARDWARE)
+   return event;
+
+   if (attr->type == PERF_TYPE_SOFTWARE &&
+   attr->config == PERF_COUNT_SW_BPF_OUTPUT)
+   return event;
+err:
+   perf_event_release_kernel(event);
+   return ERR_PTR(-EINVAL);
  }
  
  static void perf_event_fd_array_put_ptr(void *ptr)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 47febbe7998e..003df3887287 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -199,6 +199,11 @@ static u64 bpf_perf_event_read(u64 r1, u64 index, u64 r3, 
u64 r4, u64 r5)
if (!event)
return -ENOENT;
  
+	/* make sure event is local and doesn't have pmu::count */

+   if (event->oncpu != smp_processor_id() ||
+   event->pmu->count)
+   return -EINVAL;
+
/*
 * we don't know if the function is run successfully by the
 * return value. It can be judged in other places, such as
@@ -207,7 +212,7 @@ static u64 bpf_perf_event_read(u64 r1, u64 index, u64 r3, 
u64 r4, u64 r5)
return perf_event_read_local(event);
  }
  
-const struct bpf_func_proto bpf_perf_event_read_proto = {

+static const struct bpf_func_proto bpf_perf_event_read_proto = {
.func   = bpf_perf_event_read,
.gpl_only   = false,
.ret_type   = RET_INTEGER,
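
The map-insertion checks in this patch can be restated as a standalone predicate; the constants below are local redefinitions whose values mirror PERF_TYPE_RAW/HARDWARE/SOFTWARE and PERF_COUNT_SW_BPF_OUTPUT from uapi/linux/perf_event.h, so the sketch compiles on its own:

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-ins; values mirror uapi/linux/perf_event.h. */
enum {
	SK_PERF_TYPE_HARDWARE = 0,
	SK_PERF_TYPE_SOFTWARE = 1,
	SK_PERF_TYPE_RAW      = 4,
};
#define SK_PERF_COUNT_SW_BPF_OUTPUT 10

/*
 * Restates perf_event_fd_array_get_ptr()'s static check: inherited
 * events are rejected outright; only raw, hardware, and the
 * BPF-output software event may live in a perf_event_array.
 */
static bool event_ok_for_array(unsigned int type,
			       unsigned long long config, bool inherit)
{
	if (inherit)
		return false;
	if (type == SK_PERF_TYPE_RAW || type == SK_PERF_TYPE_HARDWARE)
		return true;
	return type == SK_PERF_TYPE_SOFTWARE &&
	       config == SK_PERF_COUNT_SW_BPF_OUTPUT;
}
```

Doing this check once, statically at map insertion time, keeps the per-call bpf_perf_event_read() path down to the dynamic "local and !pmu->count" test.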





Re: [PATCH v3 net-next] bpf: fix bpf_perf_event_read() helper

2015-10-22 Thread Wangnan (F)



On 2015/10/23 8:10, Alexei Starovoitov wrote:

Fix safety checks for bpf_perf_event_read():
- only non-inherited events can be added to perf_event_array map
   (do this check statically at map insertion time)
- dynamically check that event is local and !pmu->count
Otherwise buggy bpf program can cause kernel splat.

Also fix error path after perf_event_attrs()
and remove redundant 'extern'.

Fixes: 35578d798400 ("bpf: Implement function bpf_perf_event_read() that get the 
selected hardware PMU conuter")
Signed-off-by: Alexei Starovoitov 


Tested-by: Wang Nan 




Re: [RFC PATCH 4/7] perf tools: Enable BPF object configure syntax

2015-10-22 Thread Wangnan (F)



On 2015/10/17 18:48, Wang Nan wrote:

[SNIP]
  
  event_bpf_file:

-PE_BPF_OBJECT
+PE_BPF_OBJECT event_bpf_config
  {
struct parse_events_evlist *data = _data;
struct parse_events_error *error = data->error;
struct list_head *list;
  
  	ALLOC_LIST(list);

-   ABORT_ON(parse_events_load_bpf(data, list, $1, false));
+   ABORT_ON(parse_events_load_bpf(data, list, $1, false, $2));
+   parse_events__free_terms($2);


Here I found a bug: $2 can be NULL, but parse_events__free_terms()
can't accept a NULL parameter.

This will be fixed in the next version.

Thank you.


$$ = list;
  }
  |
-PE_BPF_SOURCE
+PE_BPF_SOURCE event_bpf_config
  {
struct parse_events_evlist *data = _data;
struct list_head *list;
  
  	ALLOC_LIST(list);

-   ABORT_ON(parse_events_load_bpf(data, list, $1, true));
+   ABORT_ON(parse_events_load_bpf(data, list, $1, true, $2));
+   parse_events__free_terms($2);
$$ = list;
  }
  
+event_bpf_config:

+'/' event_config '/'
+{
+   $$ = $2;
+}
+|
+{
+   $$ = NULL;
+}
+
  start_terms: event_config
  {
struct parse_events_terms *data = _data;





Re: [PATCH net-next 2/3] bpf: introduce bpf_perf_event_output() helper

2015-10-25 Thread Wangnan (F)



On 2015/10/24 1:25, Alexei Starovoitov wrote:

On 10/23/15 9:42 AM, Peter Zijlstra wrote:

On Fri, Oct 23, 2015 at 08:02:00AM -0700, Alexei Starovoitov wrote:

On 10/23/15 7:39 AM, Peter Zijlstra wrote:

On Tue, Oct 20, 2015 at 08:02:34PM -0700, Alexei Starovoitov wrote:

+static const struct bpf_func_proto bpf_perf_event_output_proto = {
+.func= bpf_perf_event_output,
+.gpl_only= false,

Oh ?


no particular reason. key helper bpf_probe_read() is gpl, so all
bpf for tracing progs have to be gpl.
If you feel strongly about it, I can change it.


All the perf symbols are export GPL, so I suppose this should be true.


ok. will send a patch.



Can we set up (or have we already set up) some rules for licensing?
Which parts should be GPL? Who is responsible for deciding?

Thank you.




Re: [net-next PATCH] bpf: Output error message to logbuf when loading

2015-10-25 Thread Wangnan (F)



On 2015/10/26 14:36, Wang Nan wrote:

Many reasons can make bpf_prog_load() return EINVAL. This patch utilizes
the log buffer passed from userspace to deliver the actual reason for
the failure.

Without this patch, it is easy for people to forget to fix the "version"
section in their BPF objects.

Signed-off-by: Wang Nan 
Cc: Alexei Starovoitov 
Cc: Arnaldo Carvalho de Melo 
Cc: David S. Miller 
---
  kernel/bpf/syscall.c | 41 ++---
  1 file changed, 38 insertions(+), 3 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 687dd6c..3a0e4e7 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -574,6 +574,32 @@ struct bpf_prog *bpf_prog_get(u32 ufd)
  }
  EXPORT_SYMBOL_GPL(bpf_prog_get);
  
+static void

+bpf_prog_load_note(union bpf_attr *attr, const char *fmt, ...)
+{
+   u32 log_level, log_size, log_len;
+   char __user *log_ubuf = NULL;
+   /* 64 chars should be long enough for a one line note. */
+   char log_buf[64];
+   va_list args;
+
+   log_ubuf = (char __user *) (unsigned long) attr->log_buf;
+   log_level = attr->log_level;
+   log_size = sizeof(log_buf);
+   if (attr->log_size < log_size)
+   log_size = attr->log_size;
+
+   if (log_level == 0 || !log_size || !log_ubuf)
+   return;
+
+   va_start(args, fmt);
+   log_len = vscnprintf(log_buf, log_size, fmt, args);


Actually log_len isn't needed here. I will send a v2.

Thank you.
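
For reference, the userspace analogue of the bounded formatting in the patch (vscnprintf() into a small stack buffer before copying to the user's log) is vsnprintf() plus clamping the returned would-be length; a sketch:

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/*
 * Bounded one-line note formatting: vsnprintf() truncates instead of
 * overflowing, and the return value is clamped to the number of
 * characters actually stored, mimicking the kernel's vscnprintf().
 */
static int format_note(char *dst, size_t dst_size, const char *fmt, ...)
{
	va_list args;
	int n;

	if (!dst || !dst_size)
		return 0;
	va_start(args, fmt);
	n = vsnprintf(dst, dst_size, fmt, args);
	va_end(args);
	if (n < 0)
		return n;
	if ((size_t)n >= dst_size)	/* output was truncated */
		n = (int)dst_size - 1;
	return n;
}
```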




Re: [PATCH 05/31] perf record: Load eBPF object into kernel

2015-10-26 Thread Wangnan (F)

Hi Arnaldo,

On 2015/10/24 8:27, Arnaldo Carvalho de Melo wrote:

libbpf: load bpf program failed: Invalid argument
libbpf: -- BEGIN DUMP LOG ---
libbpf:

libbpf: -- END LOG --
libbpf: failed to load program 'fork=_do_fork'
libbpf: failed to load object '/tmp/foo.o'
bpf: load objects failed
event syntax error: '/tmp/foo.o'
  \___ Invalid argument: Are you root and runing a 
CONFIG_BPF_SYSCALL kernel?

(add -v to see detail)
Run 'perf list' for a list of valid events

  Usage: perf record [] []
 or: perf record [] --  []

 -e, --eventevent selector. use 'perf list' to list available 
events
Opening /sys/kernel/debug/tracing//kprobe_events write=1
Opening /sys/kernel/debug/tracing//uprobe_events write=1
Parsing probe_events: p:perf_bpf_probe/fork _text+638512
Group:perf_bpf_probe Event:fork probe:p
Writing event: -:perf_bpf_probe/fork
[acme@felicio linux]$ uname -a
Linux felicio.ghostprotocols.net 4.3.0-rc6+ #1 SMP Fri Oct 23 16:40:25 BRT 2015 
x86_64 x86_64 x86_64 GNU/Linux
[acme@felicio linux]$

Please have a look at patch [1]. With that patch the kernel will dump
something into the log buffer. In your case, you will see:

root@sandybridge:~# /tmp/perf record -e /tmp/test_config_base.o ls
libbpf: load bpf program failed: Invalid argument
libbpf: -- BEGIN DUMP LOG ---
libbpf:
Kernel version mismatch: 0x40400 != 0x40300

libbpf: -- END LOG --
libbpf: failed to load program 'func_write=sys_write'
libbpf: failed to load object '/tmp/test_config_base.o'
event syntax error: '/tmp/test_config_base.o'
 \___ Invalid argument: Are you root and runing a 
CONFIG_BPF_SYSCALL kernel?


(add -v to see detail)
Run 'perf list' for a list of valid events
...

And I'd like to improve the error message output by perf.

Do you want me to send a patch to fix this problem, or should I modify
the existing patches in my git tree?

Thank you.

[1] 
http://lkml.kernel.org/r/1445843588-143137-1-git-send-email-wangn...@huawei.com




Re: [RFC PATCH] perf tools: Don't set inherit bit for system wide evsel

2015-10-26 Thread Wangnan (F)



On 2015/10/24 0:17, Arnaldo Carvalho de Melo wrote:

Em Fri, Oct 23, 2015 at 09:58:20PM +0800, pi3orama escreveu:


Sent from my iPhone


On 2015/10/23 21:51, Arnaldo Carvalho de Melo wrote:

Em Fri, Oct 23, 2015 at 10:43:49AM +, Wang Nan escreveu:

The inherit bit is useless for a system wide evsel [1]. Further kernel
improvements are putting more constraints [2] on inherit events. This
patch sets the inherit bit to 0 to avoid potential constraints.

[1] 
http://lkml.kernel.org/r/20151022124142.gq17...@twins.programming.kicks-ass.net
[2] http://lkml.kernel.org/r/1445559014-4667-1-git-send-email-...@kernel.org

Signed-off-by: Wang Nan 
Cc: Arnaldo Carvalho de Melo 
Cc: Alexei Starovoitov 
Cc: Peter Zijlstra 
Cc: Li Zefan 
Cc: pi3or...@163.com
Link: http://lkml.kernel.org/n/ebpf-0tgilipxoo6fiebcxu3ft...@git.kernel.org
---

evsel->system_wide doesn't correctly reflect whether this evsel is system
wide or not, so this patch checks the pid when invoking perf_event_open(),
which is always correct.

Can't we do this at perf_evlist__config() or perf_evsel__config() time?

perf_evlist__config() is excluded because perf record doesn't use it.

Yeah, we need to make it use it :-\


It's my fault: perf record *does* use perf_evlist__config(), but 'perf stat'
doesn't.

  

We have record_opts at perf_evsel__config() time and I think we should
leave changing the attr at perf_evsel__open() time for feature
fallbacks, i.e. something we will only know when trying to use, which is
different from this inherit-on-syswide case, that we know far in advance
we will not need.

I tried to set this bit based on evsel->system_wide, but it doesn't seem as
reliable as it should be, so I was wondering whether it is designed for some
other use. I will look into this next week.


evsel->system_wide was introduced by commit
bf8e8f4b832972c76d64ab2e2837a48397144887 (perf evlist: Add 'system_wide'
option), but Adrian only introduced a new field into perf without really
making it active. Until now its only user is arch/x86/util/intel-pt.c, and
I'm not very sure why IPT uses that field.

If I understand correctly, it should be okay for a normal system wide evsel
to have this var set. I'll try another RFC for it.

Thank you.


Ok, thanks in advance, lemme go back looking at eBPF :-)

- Arnaldo





Re: [RFC PATCH] perf tools: Don't set inherit bit for system wide evsel

2015-10-26 Thread Wangnan (F)



On 2015/10/26 17:25, Adrian Hunter wrote:

On 26/10/15 11:08, Wangnan (F) wrote:


evsel->system_wide was introduced by commit
bf8e8f4b832972c76d64ab2e2837a48397144887
(perf evlist: Add 'system_wide' option), but Adrian only introduced a new field
into perf without really making it active. Until now its only user is
arch/x86/util/intel-pt.c, and I'm not very sure why IPT uses that
field.

If I understand correctly, it should be okay for a normal system wide evsel
to have
this var set. I'll try another RFC for it.

evsel->system_wide is for mixing evsels that aren't system-wide with ones
that are.

It might work to set it for all system-wide evsels but you will have to
check the code and test it, because that would be using it in a new way
that has never been tested.


I have checked every occurrence of system_wide I could find and found
only one behavior change, which I believe should be okay. Please
have a look at [1].

Thank you.

[1] 
http://lkml.kernel.org/g/1445859720-146146-1-git-send-email-wangn...@huawei.com




Re: [RFC PATCH v2 1/2] perf tools: Set evsel->system_wide field for global system wide recording

2015-10-26 Thread Wangnan (F)



On 2015/10/26 20:24, Adrian Hunter wrote:

On 26/10/15 13:41, Wang Nan wrote:

evsel->system_wide was introduced by commit bf8e8f4b832972c76d64ab2e2837
(perf evlist: Add 'system_wide' option), which is used for mixing evsels
that aren't system-wide with ones that are [1]. However, for global
system wide recording (perf record -a ...), evsel->system_wide is set
to false, which is confusing.

This patch sets evsel->system_wide to true if target.system_wide is
set, which makes evsel->system_wide a reliable way to describe whether
the evsel itself is system wide or not.

[1] http://lkml.kernel.org/r/562df19b.2080...@intel.com

Signed-off-by: Wang Nan 
Cc: Adrian Hunter 
Cc: Arnaldo Carvalho de Melo 
Cc: Alexei Starovoitov 
Cc: Peter Zijlstra 
Cc: Li Zefan 
Cc: pi3or...@163.com
Link: http://lkml.kernel.org/n/ebpf-qm3gtwidc1o5ktjd9tgje...@git.kernel.org
---
  tools/perf/util/evsel.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 3ac4ee9c..36ecf0e 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -734,6 +734,7 @@ void perf_evsel__config(struct perf_evsel *evsel, struct record_opts *opts)
int track = evsel->tracking;
bool per_cpu = opts->target.default_per_cpu && !opts->target.per_thread;
  
+	evsel->system_wide = opts->target.system_wide;

Well, that breaks the way evsel->system_wide is used, i.e. it is a parameter
to the evsel and here you just overwrote it.


Currently the only user of evsel->system_wide is IPT:

auxtrace_record__options -> intel_pt_recording_options

and it only sets it to true.

So I think changing it to this should make it safe:

evsel->system_wide = (evsel->system_wide || opts->target.system_wide);

Thoughts?

If we want to add further config terms we can put this into
apply_config_terms(), where we can implement something like:

 # perf record -e cycles/system-wide/ -e instruction/no-system-wide/ ...

But currently I don't have such a requirement.

Thank you.



Re: [PATCH v3 net-next] bpf: fix bpf_perf_event_read() helper

2015-10-26 Thread Wangnan (F)



On 2015/10/26 20:32, Peter Zijlstra wrote:

On Sun, Oct 25, 2015 at 09:23:36AM -0700, Alexei Starovoitov wrote:

bpf_perf_event_read() muxes -EINVAL into the return value, but it's
unambiguous to the program whether it got an error or a real counter value.

How can that be, the (u64)-EINVAL value is a valid counter value..
unlikely maybe, but still quite possible.
In our real use case we simply treat a return value larger than 0x7fff
as an error result. We can make it even larger, for example, to 0x.
Resulting values can be pre-processed by a script to filter potential
error results out, so it is not a very big problem for our real use cases.

For a better interface, I suggest:

 u64 bpf_perf_event_read(bool *perror);

which still returns the counter value through its return value but puts the
error code on the stack. Then a BPF program can pass NULL to the function if
it doesn't want to deal with the error code.

Thank you.



Re: [PATCH 20/31] perf tools: Allow BPF program attach to uprobe events

2015-10-26 Thread Wangnan (F)



On 2015/10/14 20:41, Wang Nan wrote:

This patch appends new syntax to the BPF object section name to support
probing at a uprobe event. Now we can use a BPF program like this:

  SEC(
  "target=/lib64/libc.so.6\n"
  "libcwrite=__write"
  )
  int libcwrite(void *ctx)
  {
  return 1;
  }

Where, in the section name of a program, before the main config string,
we can use 'key=value' style options. Currently the only option key is
"target", which is for uprobe probing.

Signed-off-by: Wang Nan 
Cc: Alexei Starovoitov 
Cc: Brendan Gregg 
Cc: Daniel Borkmann 
Cc: David Ahern 
Cc: He Kuang 
Cc: Jiri Olsa 
Cc: Kaixu Xia 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Paul Mackerras 
Cc: Peter Zijlstra 
Cc: Zefan Li 
Cc: pi3or...@163.com
Cc: Arnaldo Carvalho de Melo 
Link: http://lkml.kernel.org/n/ebpf-6yw9eg0ej3l4jnqhinngk...@git.kernel.org
---
  tools/perf/util/bpf-loader.c | 86 
  1 file changed, 80 insertions(+), 6 deletions(-)

diff --git a/tools/perf/util/bpf-loader.c b/tools/perf/util/bpf-loader.c
index af549ea..73ff9a9 100644
--- a/tools/perf/util/bpf-loader.c
+++ b/tools/perf/util/bpf-loader.c
@@ -111,6 +111,84 @@ bpf_prog_priv__clear(struct bpf_program *prog __maybe_unused,
  }
  
  static int
+do_config(const char *key, const char *value,
+ struct perf_probe_event *pev)
+{
+   pr_debug("config bpf program: %s=%s\n", key, value);
+   if (strcmp(key, "target") == 0) {
+   pev->uprobes = true;
+   pev->target = strdup(value);
+   return 0;
+   }
+
+   pr_warning("BPF: WARNING: invalid config option in object: %s=%s\n",
+  key, value);
+   pr_warning("\tHint: Currently only valid option is 'target='\n");
+   return 0;
+}



This part is very easy to extend to support probing in modules.
I'd like to change "target" to "exec" to unify it with perf probe,
then add another patch to support module probing in the next pull request.

Thank you.


