Re: [PATCH v3 linux-trace 4/8] samples: bpf: simple tracing example in C

2015-02-10 Thread Steven Rostedt
On Mon, 9 Feb 2015 21:47:42 -0800
Alexei Starovoitov  wrote:

> On Mon, Feb 9, 2015 at 9:45 PM, Alexei Starovoitov  wrote:
> > I thought we already stated that.
> > Here is the quote from perf_event.h:
> >  *  # The RAW record below is opaque data wrt the ABI
> >  *  #
> >  *  # That is, the ABI doesn't make any promises wrt to
> >  *  # the stability of its content, it may vary depending
> >  *  # on event, hardware, kernel version and phase of
> >  *  # the moon.
> >  *  #
> >  *  # In other words, PERF_SAMPLE_RAW contents are not an ABI.
> >
> > and this example is reading PERF_SAMPLE_RAW events and
> > uses locally defined structs to print them for simplicity.
> 
> to underline my point once more:
> addition of bpf doesn't change at all what PERF_SAMPLE_RAW already
> delivers to user space.
> so no new ABIs anywhere.

Again, it we give an example of how to hard code the data, definitely
expect this to show up in user space. Users are going to look at this
code to learn how to use eBPF. I really want it to do it the correct
way instead of the 'easy' way. Because whatever way we have it here,
will be the way we will see it out in the wild.

-- Steve
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 4/8] samples: bpf: simple tracing example in C

2015-02-10 Thread Steven Rostedt
Added Linus because he's the one that would revert changes on breakage.

On Mon, 9 Feb 2015 21:45:21 -0800
Alexei Starovoitov  wrote:

> On Mon, Feb 9, 2015 at 9:16 PM, Steven Rostedt  wrote:
> > On Mon, 9 Feb 2015 23:08:36 -0500
> > Steven Rostedt  wrote:
> >
> >> I don't want to get stuck with pinned kernel data structures again. We
> >> had 4 blank bytes of data for every event, because latency top hard
> >> coded the field. Luckily, the 64 bit / 32 bit interface caused latency
> >> top to have to use the event_parse code to work, and we were able to
> >> remove that field after it was converted.
> 
> I think your main point boils down to:
> 
> > But I still do not want any hard coded event structures. All access to
> > data from the binary code must be parsed by looking at the event/format
> > files. Otherwise you will lock internals of the kernel as userspace
> > ABI, because eBPF programs will break if those internals change, and
> > that could severely limit progress in the future.
> 
> and I completely agree.
> 
> the patch 4 is an example. It doesn't mean in any way
> that structs defined here is an ABI.
> To be compatible across kernels the user space must read
> format file as you mentioned in your other reply.

The thing is, this is a sample. Which means it will be cut and pasted
into other programs. If the sample does not follow the way we want
users to use this, then how can we complain if they hard code it as
well?

> 
> > I'm wondering if we should label eBPF programs as "modules". That is,
> > they have no guarantee of working from one kernel to the next. They
> > execute in the kernel, thus they are very similar to modules.
> >
> > If we can get Linus to say that eBPF programs are not user space, and
> > that they are treated the same as modules (no internal ABI), then I
> > think we can be a bit more free at what we allow.
> 
> I thought we already stated that.
> Here is the quote from perf_event.h:
>  *  # The RAW record below is opaque data wrt the ABI
>  *  #
>  *  # That is, the ABI doesn't make any promises wrt to
>  *  # the stability of its content, it may vary depending
>  *  # on event, hardware, kernel version and phase of
>  *  # the moon.
>  *  #
>  *  # In other words, PERF_SAMPLE_RAW contents are not an ABI.
> 
> and this example is reading PERF_SAMPLE_RAW events and
> uses locally defined structs to print them for simplicity.

As we found out the hard way with latencytop, comments like this does
not matter. If an application does something like this, it's our fault
if it breaks later. We can't say "hey you were suppose to do it this
way". That argument breaks down even more if our own examples do not
follow the way we want others to do things.

-- Steve

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 4/8] samples: bpf: simple tracing example in C

2015-02-10 Thread Steven Rostedt
Added Linus because he's the one that would revert changes on breakage.

On Mon, 9 Feb 2015 21:45:21 -0800
Alexei Starovoitov a...@plumgrid.com wrote:

 On Mon, Feb 9, 2015 at 9:16 PM, Steven Rostedt rost...@goodmis.org wrote:
  On Mon, 9 Feb 2015 23:08:36 -0500
  Steven Rostedt rost...@goodmis.org wrote:
 
  I don't want to get stuck with pinned kernel data structures again. We
  had 4 blank bytes of data for every event, because latency top hard
  coded the field. Luckily, the 64 bit / 32 bit interface caused latency
  top to have to use the event_parse code to work, and we were able to
  remove that field after it was converted.
 
 I think your main point boils down to:
 
  But I still do not want any hard coded event structures. All access to
  data from the binary code must be parsed by looking at the event/format
  files. Otherwise you will lock internals of the kernel as userspace
  ABI, because eBPF programs will break if those internals change, and
  that could severely limit progress in the future.
 
 and I completely agree.
 
 the patch 4 is an example. It doesn't mean in any way
 that structs defined here is an ABI.
 To be compatible across kernels the user space must read
 format file as you mentioned in your other reply.

The thing is, this is a sample. Which means it will be cut and pasted
into other programs. If the sample does not follow the way we want
users to use this, then how can we complain if they hard code it as
well?

 
  I'm wondering if we should label eBPF programs as modules. That is,
  they have no guarantee of working from one kernel to the next. They
  execute in the kernel, thus they are very similar to modules.
 
  If we can get Linus to say that eBPF programs are not user space, and
  that they are treated the same as modules (no internal ABI), then I
  think we can be a bit more free at what we allow.
 
 I thought we already stated that.
 Here is the quote from perf_event.h:
  *  # The RAW record below is opaque data wrt the ABI
  *  #
  *  # That is, the ABI doesn't make any promises wrt to
  *  # the stability of its content, it may vary depending
  *  # on event, hardware, kernel version and phase of
  *  # the moon.
  *  #
  *  # In other words, PERF_SAMPLE_RAW contents are not an ABI.
 
 and this example is reading PERF_SAMPLE_RAW events and
 uses locally defined structs to print them for simplicity.

As we found out the hard way with latencytop, comments like this does
not matter. If an application does something like this, it's our fault
if it breaks later. We can't say hey you were suppose to do it this
way. That argument breaks down even more if our own examples do not
follow the way we want others to do things.

-- Steve

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 4/8] samples: bpf: simple tracing example in C

2015-02-10 Thread Steven Rostedt
On Mon, 9 Feb 2015 21:47:42 -0800
Alexei Starovoitov a...@plumgrid.com wrote:

 On Mon, Feb 9, 2015 at 9:45 PM, Alexei Starovoitov a...@plumgrid.com wrote:
  I thought we already stated that.
  Here is the quote from perf_event.h:
   *  # The RAW record below is opaque data wrt the ABI
   *  #
   *  # That is, the ABI doesn't make any promises wrt to
   *  # the stability of its content, it may vary depending
   *  # on event, hardware, kernel version and phase of
   *  # the moon.
   *  #
   *  # In other words, PERF_SAMPLE_RAW contents are not an ABI.
 
  and this example is reading PERF_SAMPLE_RAW events and
  uses locally defined structs to print them for simplicity.
 
 to underline my point once more:
 addition of bpf doesn't change at all what PERF_SAMPLE_RAW already
 delivers to user space.
 so no new ABIs anywhere.

Again, it we give an example of how to hard code the data, definitely
expect this to show up in user space. Users are going to look at this
code to learn how to use eBPF. I really want it to do it the correct
way instead of the 'easy' way. Because whatever way we have it here,
will be the way we will see it out in the wild.

-- Steve
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 4/8] samples: bpf: simple tracing example in C

2015-02-09 Thread Alexei Starovoitov
On Mon, Feb 9, 2015 at 9:45 PM, Alexei Starovoitov  wrote:
> I thought we already stated that.
> Here is the quote from perf_event.h:
>  *  # The RAW record below is opaque data wrt the ABI
>  *  #
>  *  # That is, the ABI doesn't make any promises wrt to
>  *  # the stability of its content, it may vary depending
>  *  # on event, hardware, kernel version and phase of
>  *  # the moon.
>  *  #
>  *  # In other words, PERF_SAMPLE_RAW contents are not an ABI.
>
> and this example is reading PERF_SAMPLE_RAW events and
> uses locally defined structs to print them for simplicity.

to underline my point once more:
addition of bpf doesn't change at all what PERF_SAMPLE_RAW already
delivers to user space.
so no new ABIs anywhere.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 4/8] samples: bpf: simple tracing example in C

2015-02-09 Thread Alexei Starovoitov
On Mon, Feb 9, 2015 at 9:16 PM, Steven Rostedt  wrote:
> On Mon, 9 Feb 2015 23:08:36 -0500
> Steven Rostedt  wrote:
>
>> I don't want to get stuck with pinned kernel data structures again. We
>> had 4 blank bytes of data for every event, because latency top hard
>> coded the field. Luckily, the 64 bit / 32 bit interface caused latency
>> top to have to use the event_parse code to work, and we were able to
>> remove that field after it was converted.

I think your main point boils down to:

> But I still do not want any hard coded event structures. All access to
> data from the binary code must be parsed by looking at the event/format
> files. Otherwise you will lock internals of the kernel as userspace
> ABI, because eBPF programs will break if those internals change, and
> that could severely limit progress in the future.

and I completely agree.

the patch 4 is an example. It doesn't mean in any way
that structs defined here is an ABI.
To be compatible across kernels the user space must read
format file as you mentioned in your other reply.

> I'm wondering if we should label eBPF programs as "modules". That is,
> they have no guarantee of working from one kernel to the next. They
> execute in the kernel, thus they are very similar to modules.
>
> If we can get Linus to say that eBPF programs are not user space, and
> that they are treated the same as modules (no internal ABI), then I
> think we can be a bit more free at what we allow.

I thought we already stated that.
Here is the quote from perf_event.h:
 *  # The RAW record below is opaque data wrt the ABI
 *  #
 *  # That is, the ABI doesn't make any promises wrt to
 *  # the stability of its content, it may vary depending
 *  # on event, hardware, kernel version and phase of
 *  # the moon.
 *  #
 *  # In other words, PERF_SAMPLE_RAW contents are not an ABI.

and this example is reading PERF_SAMPLE_RAW events and
uses locally defined structs to print them for simplicity.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 4/8] samples: bpf: simple tracing example in C

2015-02-09 Thread Steven Rostedt
On Mon, 9 Feb 2015 23:08:36 -0500
Steven Rostedt  wrote:
 
> I don't want to get stuck with pinned kernel data structures again. We
> had 4 blank bytes of data for every event, because latency top hard
> coded the field. Luckily, the 64 bit / 32 bit interface caused latency
> top to have to use the event_parse code to work, and we were able to
> remove that field after it was converted.

I'm wondering if we should label eBPF programs as "modules". That is,
they have no guarantee of working from one kernel to the next. They
execute in the kernel, thus they are very similar to modules.

If we can get Linus to say that eBPF programs are not user space, and
that they are treated the same as modules (no internal ABI), then I
think we can be a bit more free at what we allow.

-- Steve
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 4/8] samples: bpf: simple tracing example in C

2015-02-09 Thread Steven Rostedt
On Mon,  9 Feb 2015 19:45:57 -0800
Alexei Starovoitov  wrote:

> +static void print_netif_receive_skb(void *data, int size)
> +{
> + struct ftrace_raw_netif_receive_skb {
> + struct trace_entry t;
> + void *skb;
> + __u32 len;
> + __u32 name;
> + } *e = data;

Same here, there's no guarantee that this structure will always match.

cat /sys/kernel/debug/tracing/events/net/netif_receive_skb/format 
name: netif_receive_skb
ID: 975
format:
field:unsigned short common_type;   offset:0;   size:2; 
signed:0;
field:unsigned char common_flags;   offset:2;   size:1; 
signed:0;
field:unsigned char common_preempt_count;   offset:3;   size:1; 
signed:0;
field:int common_pid;   offset:4;   size:4; signed:1;

field:void * skbaddr;   offset:8;   size:8; signed:0;
field:unsigned int len; offset:16;  size:4; signed:0;
field:__data_loc char[] name;   offset:20;  size:4; signed:1;

print fmt: "dev=%s skbaddr=%p len=%u", __get_str(name), REC->skbaddr, REC->len

-- Steve
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 4/8] samples: bpf: simple tracing example in C

2015-02-09 Thread Steven Rostedt
On Mon,  9 Feb 2015 19:45:57 -0800
Alexei Starovoitov  wrote:
 
> +int perf_event_mmap(int fd);
> +int perf_event_poll(int fd);
> +typedef void (*print_fn)(void *data, int size);
> +void perf_event_read(print_fn fn);
> +struct trace_entry {
> + unsigned short  type;
> + unsigned char   flags;
> + unsigned char   preempt_count;
> + int pid;
> +};
> +

Please do not hard code any structures. This is not a stable ABI, and
it may not even match if you are running 32 bit userspace on top of a
64 bit kernel.

Please parse the format files. libtraceevent does this for you. If need
be, link to that. But if you look at the event format files you'll see
the offsets and sizes in the binary code:

field:unsigned short common_type;   offset:0;   size:2; 
signed:0;
field:unsigned char common_flags;   offset:2;   size:1; 
signed:0;
field:unsigned char common_preempt_count;   offset:3;   size:1; 
signed:0;
field:int common_pid;   offset:4;   size:4; signed:1;

I don't want to get stuck with pinned kernel data structures again. We
had 4 blank bytes of data for every event, because latency top hard
coded the field. Luckily, the 64 bit / 32 bit interface caused latency
top to have to use the event_parse code to work, and we were able to
remove that field after it was converted.

-- Steve
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v3 linux-trace 4/8] samples: bpf: simple tracing example in C

2015-02-09 Thread Alexei Starovoitov
tracex1_kern.c - C program which will be compiled into eBPF
to filter netif_receive_skb events on skb->dev->name == "lo"
The programs returns 1 to store an event into ring_buffer
and returns 0 - to discard an event.

tracex1_user.c - corresponding user space component that:
- loads bpf program via bpf() syscall
- opens net:netif_receive_skb event via perf_event_open() syscall
- attaches the program to event via ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, 
prog_fd);
- mmaps event_fd
- polls event_fd, walks ring_buffer and prints events

Usage:
$ sudo tracex1
pid 29241 skb 0x88045e58c500 len 84 dev lo
pid 29241 skb 0x88045e58cd00 len 84 dev lo
pid 29241 skb 0x880074c35000 len 84 dev lo
pid 29241 skb 0x880074c35200 len 84 dev lo

Signed-off-by: Alexei Starovoitov 
---
 samples/bpf/Makefile   |4 ++
 samples/bpf/bpf_helpers.h  |   14 +
 samples/bpf/bpf_load.c |  129 +---
 samples/bpf/bpf_load.h |   12 +
 samples/bpf/tracex1_kern.c |   28 ++
 samples/bpf/tracex1_user.c |   50 +
 6 files changed, 231 insertions(+), 6 deletions(-)
 create mode 100644 samples/bpf/tracex1_kern.c
 create mode 100644 samples/bpf/tracex1_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 789691374562..da28e1b6d3a6 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -7,6 +7,7 @@ hostprogs-y += sock_example
 hostprogs-y += sockex1
 hostprogs-y += sockex2
 hostprogs-y += dropmon
+hostprogs-y += tracex1
 
 dropmon-objs := dropmon.o libbpf.o
 test_verifier-objs := test_verifier.o libbpf.o
@@ -14,17 +15,20 @@ test_maps-objs := test_maps.o libbpf.o
 sock_example-objs := sock_example.o libbpf.o
 sockex1-objs := bpf_load.o libbpf.o sockex1_user.o
 sockex2-objs := bpf_load.o libbpf.o sockex2_user.o
+tracex1-objs := bpf_load.o libbpf.o tracex1_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
 always += sockex1_kern.o
 always += sockex2_kern.o
+always += tracex1_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 
 HOSTCFLAGS_bpf_load.o += -I$(objtree)/usr/include -Wno-unused-variable
 HOSTLOADLIBES_sockex1 += -lelf
 HOSTLOADLIBES_sockex2 += -lelf
+HOSTLOADLIBES_tracex1 += -lelf
 
 # point this to your LLVM backend with bpf support
 LLC=$(srctree)/tools/bpf/llvm/bld/Debug+Asserts/bin/llc
diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index ca0333146006..406e9705d99e 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -15,6 +15,20 @@ static int (*bpf_map_update_elem)(void *map, void *key, void 
*value,
(void *) BPF_FUNC_map_update_elem;
 static int (*bpf_map_delete_elem)(void *map, void *key) =
(void *) BPF_FUNC_map_delete_elem;
+static void *(*bpf_fetch_ptr)(void *unsafe_ptr) =
+   (void *) BPF_FUNC_fetch_ptr;
+static unsigned long long (*bpf_fetch_u64)(void *unsafe_ptr) =
+   (void *) BPF_FUNC_fetch_u64;
+static unsigned int (*bpf_fetch_u32)(void *unsafe_ptr) =
+   (void *) BPF_FUNC_fetch_u32;
+static unsigned short (*bpf_fetch_u16)(void *unsafe_ptr) =
+   (void *) BPF_FUNC_fetch_u16;
+static unsigned char (*bpf_fetch_u8)(void *unsafe_ptr) =
+   (void *) BPF_FUNC_fetch_u8;
+static int (*bpf_probe_memcmp)(void *unsafe_ptr, void *safe_ptr, int size) =
+   (void *) BPF_FUNC_probe_memcmp;
+static unsigned long long (*bpf_ktime_get_ns)(void) =
+   (void *) BPF_FUNC_ktime_get_ns;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 1831d236382b..2aece65963e4 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -8,29 +8,47 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
+#include 
+#include 
+#include 
+#include 
+#include 
 #include "libbpf.h"
 #include "bpf_helpers.h"
 #include "bpf_load.h"
 
+#define DEBUGFS "/sys/kernel/debug/tracing/"
+
 static char license[128];
 static bool processed_sec[128];
 int map_fd[MAX_MAPS];
 int prog_fd[MAX_PROGS];
+int event_fd[MAX_PROGS];
 int prog_cnt;
 
 static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 {
-   int fd;
bool is_socket = strncmp(event, "socket", 6) == 0;
+   enum bpf_prog_type prog_type;
+   char path[256] = DEBUGFS;
+   char buf[32];
+   int fd, efd, err, id;
+   struct perf_event_attr attr;
 
-   if (!is_socket)
-   /* tracing events tbd */
-   return -1;
+   attr.type = PERF_TYPE_TRACEPOINT;
+   attr.sample_type = PERF_SAMPLE_RAW;
+   attr.sample_period = 1;
+   attr.wakeup_events = 1;
+
+   if (is_socket)
+   prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
+   else
+   prog_type = BPF_PROG_TYPE_TRACEPOINT;
 
-   fd = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER,
-  prog, size, license);
+   fd = bpf_prog_load(prog_type, prog, size, license);
 
   

Re: [PATCH v3 linux-trace 4/8] samples: bpf: simple tracing example in C

2015-02-09 Thread Steven Rostedt
On Mon,  9 Feb 2015 19:45:57 -0800
Alexei Starovoitov a...@plumgrid.com wrote:
 
 +int perf_event_mmap(int fd);
 +int perf_event_poll(int fd);
 +typedef void (*print_fn)(void *data, int size);
 +void perf_event_read(print_fn fn);
 +struct trace_entry {
 + unsigned short  type;
 + unsigned char   flags;
 + unsigned char   preempt_count;
 + int pid;
 +};
 +

Please do not hard code any structures. This is not a stable ABI, and
it may not even match if you are running 32 bit userspace on top of a
64 bit kernel.

Please parse the format files. libtraceevent does this for you. If need
be, link to that. But if you look at the event format files you'll see
the offsets and sizes in the binary code:

field:unsigned short common_type;   offset:0;   size:2; 
signed:0;
field:unsigned char common_flags;   offset:2;   size:1; 
signed:0;
field:unsigned char common_preempt_count;   offset:3;   size:1; 
signed:0;
field:int common_pid;   offset:4;   size:4; signed:1;

I don't want to get stuck with pinned kernel data structures again. We
had 4 blank bytes of data for every event, because latency top hard
coded the field. Luckily, the 64 bit / 32 bit interface caused latency
top to have to use the event_parse code to work, and we were able to
remove that field after it was converted.

-- Steve
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v3 linux-trace 4/8] samples: bpf: simple tracing example in C

2015-02-09 Thread Alexei Starovoitov
tracex1_kern.c - C program which will be compiled into eBPF
to filter netif_receive_skb events on skb-dev-name == lo
The programs returns 1 to store an event into ring_buffer
and returns 0 - to discard an event.

tracex1_user.c - corresponding user space component that:
- loads bpf program via bpf() syscall
- opens net:netif_receive_skb event via perf_event_open() syscall
- attaches the program to event via ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, 
prog_fd);
- mmaps event_fd
- polls event_fd, walks ring_buffer and prints events

Usage:
$ sudo tracex1
pid 29241 skb 0x88045e58c500 len 84 dev lo
pid 29241 skb 0x88045e58cd00 len 84 dev lo
pid 29241 skb 0x880074c35000 len 84 dev lo
pid 29241 skb 0x880074c35200 len 84 dev lo

Signed-off-by: Alexei Starovoitov a...@plumgrid.com
---
 samples/bpf/Makefile   |4 ++
 samples/bpf/bpf_helpers.h  |   14 +
 samples/bpf/bpf_load.c |  129 +---
 samples/bpf/bpf_load.h |   12 +
 samples/bpf/tracex1_kern.c |   28 ++
 samples/bpf/tracex1_user.c |   50 +
 6 files changed, 231 insertions(+), 6 deletions(-)
 create mode 100644 samples/bpf/tracex1_kern.c
 create mode 100644 samples/bpf/tracex1_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 789691374562..da28e1b6d3a6 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -7,6 +7,7 @@ hostprogs-y += sock_example
 hostprogs-y += sockex1
 hostprogs-y += sockex2
 hostprogs-y += dropmon
+hostprogs-y += tracex1
 
 dropmon-objs := dropmon.o libbpf.o
 test_verifier-objs := test_verifier.o libbpf.o
@@ -14,17 +15,20 @@ test_maps-objs := test_maps.o libbpf.o
 sock_example-objs := sock_example.o libbpf.o
 sockex1-objs := bpf_load.o libbpf.o sockex1_user.o
 sockex2-objs := bpf_load.o libbpf.o sockex2_user.o
+tracex1-objs := bpf_load.o libbpf.o tracex1_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
 always += sockex1_kern.o
 always += sockex2_kern.o
+always += tracex1_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 
 HOSTCFLAGS_bpf_load.o += -I$(objtree)/usr/include -Wno-unused-variable
 HOSTLOADLIBES_sockex1 += -lelf
 HOSTLOADLIBES_sockex2 += -lelf
+HOSTLOADLIBES_tracex1 += -lelf
 
 # point this to your LLVM backend with bpf support
 LLC=$(srctree)/tools/bpf/llvm/bld/Debug+Asserts/bin/llc
diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index ca0333146006..406e9705d99e 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -15,6 +15,20 @@ static int (*bpf_map_update_elem)(void *map, void *key, void 
*value,
(void *) BPF_FUNC_map_update_elem;
 static int (*bpf_map_delete_elem)(void *map, void *key) =
(void *) BPF_FUNC_map_delete_elem;
+static void *(*bpf_fetch_ptr)(void *unsafe_ptr) =
+   (void *) BPF_FUNC_fetch_ptr;
+static unsigned long long (*bpf_fetch_u64)(void *unsafe_ptr) =
+   (void *) BPF_FUNC_fetch_u64;
+static unsigned int (*bpf_fetch_u32)(void *unsafe_ptr) =
+   (void *) BPF_FUNC_fetch_u32;
+static unsigned short (*bpf_fetch_u16)(void *unsafe_ptr) =
+   (void *) BPF_FUNC_fetch_u16;
+static unsigned char (*bpf_fetch_u8)(void *unsafe_ptr) =
+   (void *) BPF_FUNC_fetch_u8;
+static int (*bpf_probe_memcmp)(void *unsafe_ptr, void *safe_ptr, int size) =
+   (void *) BPF_FUNC_probe_memcmp;
+static unsigned long long (*bpf_ktime_get_ns)(void) =
+   (void *) BPF_FUNC_ktime_get_ns;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 1831d236382b..2aece65963e4 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -8,29 +8,47 @@
 #include unistd.h
 #include string.h
 #include stdbool.h
+#include stdlib.h
 #include linux/bpf.h
 #include linux/filter.h
+#include linux/perf_event.h
+#include sys/syscall.h
+#include sys/ioctl.h
+#include sys/mman.h
+#include poll.h
 #include libbpf.h
 #include bpf_helpers.h
 #include bpf_load.h
 
+#define DEBUGFS /sys/kernel/debug/tracing/
+
 static char license[128];
 static bool processed_sec[128];
 int map_fd[MAX_MAPS];
 int prog_fd[MAX_PROGS];
+int event_fd[MAX_PROGS];
 int prog_cnt;
 
 static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 {
-   int fd;
bool is_socket = strncmp(event, socket, 6) == 0;
+   enum bpf_prog_type prog_type;
+   char path[256] = DEBUGFS;
+   char buf[32];
+   int fd, efd, err, id;
+   struct perf_event_attr attr;
 
-   if (!is_socket)
-   /* tracing events tbd */
-   return -1;
+   attr.type = PERF_TYPE_TRACEPOINT;
+   attr.sample_type = PERF_SAMPLE_RAW;
+   attr.sample_period = 1;
+   attr.wakeup_events = 1;
+
+   if (is_socket)
+   prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
+   else
+   prog_type = BPF_PROG_TYPE_TRACEPOINT;
 
-   fd = 

Re: [PATCH v3 linux-trace 4/8] samples: bpf: simple tracing example in C

2015-02-09 Thread Steven Rostedt
On Mon,  9 Feb 2015 19:45:57 -0800
Alexei Starovoitov a...@plumgrid.com wrote:

 +static void print_netif_receive_skb(void *data, int size)
 +{
 + struct ftrace_raw_netif_receive_skb {
 + struct trace_entry t;
 + void *skb;
 + __u32 len;
 + __u32 name;
 + } *e = data;

Same here, there's no guarantee that this structure will always match.

cat /sys/kernel/debug/tracing/events/net/netif_receive_skb/format 
name: netif_receive_skb
ID: 975
format:
field:unsigned short common_type;   offset:0;   size:2; 
signed:0;
field:unsigned char common_flags;   offset:2;   size:1; 
signed:0;
field:unsigned char common_preempt_count;   offset:3;   size:1; 
signed:0;
field:int common_pid;   offset:4;   size:4; signed:1;

field:void * skbaddr;   offset:8;   size:8; signed:0;
field:unsigned int len; offset:16;  size:4; signed:0;
field:__data_loc char[] name;   offset:20;  size:4; signed:1;

print fmt: dev=%s skbaddr=%p len=%u, __get_str(name), REC-skbaddr, REC-len

-- Steve
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 4/8] samples: bpf: simple tracing example in C

2015-02-09 Thread Alexei Starovoitov
On Mon, Feb 9, 2015 at 9:45 PM, Alexei Starovoitov a...@plumgrid.com wrote:
 I thought we already stated that.
 Here is the quote from perf_event.h:
  *  # The RAW record below is opaque data wrt the ABI
  *  #
  *  # That is, the ABI doesn't make any promises wrt to
  *  # the stability of its content, it may vary depending
  *  # on event, hardware, kernel version and phase of
  *  # the moon.
  *  #
  *  # In other words, PERF_SAMPLE_RAW contents are not an ABI.

 and this example is reading PERF_SAMPLE_RAW events and
 uses locally defined structs to print them for simplicity.

to underline my point once more:
addition of bpf doesn't change at all what PERF_SAMPLE_RAW already
delivers to user space.
so no new ABIs anywhere.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 4/8] samples: bpf: simple tracing example in C

2015-02-09 Thread Alexei Starovoitov
On Mon, Feb 9, 2015 at 9:16 PM, Steven Rostedt rost...@goodmis.org wrote:
 On Mon, 9 Feb 2015 23:08:36 -0500
 Steven Rostedt rost...@goodmis.org wrote:

 I don't want to get stuck with pinned kernel data structures again. We
 had 4 blank bytes of data for every event, because latency top hard
 coded the field. Luckily, the 64 bit / 32 bit interface caused latency
 top to have to use the event_parse code to work, and we were able to
 remove that field after it was converted.

I think your main point boils down to:

 But I still do not want any hard coded event structures. All access to
 data from the binary code must be parsed by looking at the event/format
 files. Otherwise you will lock internals of the kernel as userspace
 ABI, because eBPF programs will break if those internals change, and
 that could severely limit progress in the future.

and I completely agree.

the patch 4 is an example. It doesn't mean in any way
that structs defined here is an ABI.
To be compatible across kernels the user space must read
format file as you mentioned in your other reply.

 I'm wondering if we should label eBPF programs as modules. That is,
 they have no guarantee of working from one kernel to the next. They
 execute in the kernel, thus they are very similar to modules.

 If we can get Linus to say that eBPF programs are not user space, and
 that they are treated the same as modules (no internal ABI), then I
 think we can be a bit more free at what we allow.

I thought we already stated that.
Here is the quote from perf_event.h:
 *  # The RAW record below is opaque data wrt the ABI
 *  #
 *  # That is, the ABI doesn't make any promises wrt to
 *  # the stability of its content, it may vary depending
 *  # on event, hardware, kernel version and phase of
 *  # the moon.
 *  #
 *  # In other words, PERF_SAMPLE_RAW contents are not an ABI.

and this example is reading PERF_SAMPLE_RAW events and
uses locally defined structs to print them for simplicity.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 4/8] samples: bpf: simple tracing example in C

2015-02-09 Thread Steven Rostedt
On Mon, 9 Feb 2015 23:08:36 -0500
Steven Rostedt rost...@goodmis.org wrote:
 
 I don't want to get stuck with pinned kernel data structures again. We
 had 4 blank bytes of data for every event, because latency top hard
 coded the field. Luckily, the 64 bit / 32 bit interface caused latency
 top to have to use the event_parse code to work, and we were able to
 remove that field after it was converted.

I'm wondering if we should label eBPF programs as modules. That is,
they have no guarantee of working from one kernel to the next. They
execute in the kernel, thus they are very similar to modules.

If we can get Linus to say that eBPF programs are not user space, and
that they are treated the same as modules (no internal ABI), then I
think we can be a bit more free at what we allow.

-- Steve
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/