Re: [RFC PATCH bpf-next] libbpf: make bpf_object__open default to UNSPEC
On 2018/11/23 5:52, Daniel Borkmann wrote:
> [ +Wang ]
>
> On 11/22/2018 07:03 AM, Nikita V. Shirokov wrote:
>> Currently, libbpf's bpf_object__open by default requires a BPF
>> program to specify a version in its code because of two things:
>> 1) the default prog type is set to KPROBE
>> 2) KPROBE requires (in kernel/bpf/syscall.c) the version to be
>>    specified
>>
>> In this RFC I'm proposing to change the default to UNSPEC and also to
>> change libbpf's logic so that it reflects what we have today in the
>> kernel (i.e. only the KPROBE type requires the version to be
>> explicitly set).
>>
>> Reason for the change:
>> currently only libbpf requires the version to be explicitly set by
>> default. It would be really hard for maintainers of other custom BPF
>> loaders to migrate to libbpf (as they don't control users' code, and
>> migration to the new loader (libbpf) wouldn't be transparent for end
>> users).
>>
>> What is going to be broken after this change:
>> if someone was relying on the default being KPROBE for
>> bpf_object__open, their code will stop working. However, I really
>> doubt that anyone is using this for kprobe-type programs (instead of,
>> say, bcc or other tracing frameworks).
>>
>> Other possible solutions (for discussion; they would require more
>> machinery): add another function like bpf_object__open with a default
>> of UNSPEC.
>>
>> Signed-off-by: Nikita V. Shirokov
>> ---
>>  tools/lib/bpf/libbpf.c | 8 ++++----
>>  1 file changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
>> index 0f14f7c074c2..ed4212a4c5f9 100644
>> --- a/tools/lib/bpf/libbpf.c
>> +++ b/tools/lib/bpf/libbpf.c
>> @@ -333,7 +333,7 @@ bpf_program__init(void *data, size_t size, char *section_name, int idx,
>>  	prog->idx = idx;
>>  	prog->instances.fds = NULL;
>>  	prog->instances.nr = -1;
>> -	prog->type = BPF_PROG_TYPE_KPROBE;
>> +	prog->type = BPF_PROG_TYPE_UNSPEC;
>>  	prog->btf_fd = -1;
>
> Seems this was mostly for historic reasons, but for a generic library
> this would indeed be an odd convention as a default. Wang, given
> 5f44e4c810bf ("tools lib bpf: New API to adjust type of a BPF
> program"), are you in any way relying on this default, or are you
> using things like bpf_program__set_kprobe() instead, which you added
> there? If the latter, I'd say we should change it better now than
> later when there's even more lib usage (and in particular before we
> add official ABI versioning).

OK. I don't rely on that now.

Thank you.
Re: [PATCH v2 net-next 3/6] tools/lib/bpf: expose bpf_program__set_type()
On 2017/4/1 11:18, Alexei Starovoitov wrote:
> On 3/31/17 7:29 PM, Wangnan (F) wrote:
>> On 2017/3/31 12:45, Alexei Starovoitov wrote:
>>> expose bpf_program__set_type() to set program type
>>>
>>> Signed-off-by: Alexei Starovoitov
>>> Acked-by: Daniel Borkmann
>>> Acked-by: Martin KaFai Lau
>>> ---
>>>  tools/lib/bpf/libbpf.c | 3 +--
>>>  tools/lib/bpf/libbpf.h | 2 ++
>>>  2 files changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
>>> index ac6eb863b2a4..1a2c07eb7795 100644
>>> --- a/tools/lib/bpf/libbpf.c
>>> +++ b/tools/lib/bpf/libbpf.c
>>> @@ -1618,8 +1618,7 @@ int bpf_program__nth_fd(struct bpf_program *prog, int n)
>>>  	return fd;
>>>  }
>>>
>>> -static void bpf_program__set_type(struct bpf_program *prog,
>>> -				  enum bpf_prog_type type)
>>> +void bpf_program__set_type(struct bpf_program *prog, enum bpf_prog_type type)
>>>  {
>>>  	prog->type = type;
>>>  }
>>
>> Since it becomes a public interface, we need to check whether prog is
>> a NULL pointer and whether the value of type is okay, and let it
>> return an errno.
>
> I strongly disagree with such defensive programming. It's a cause of
> bugs for users of this library. I think you're trying to mimic C++
> setters/getters, but C++ never checks 'this != null', since passing
> null into a setter is a bug of the user of the library and not of the
> library. The setters also should have a 'void' return type when they
> cannot fail. That is exactly the case here.
>
> If, in the future, we decide that libbpf shouldn't support all BPF
> program types, then you'd need to change the prototype of this
> function to return an error code and change all places where this
> function is called to check for the error code. It may or may not be
> the right approach. For example, today the only user of the
> bpf_program__set*() methods is perf/util/bpf-loader.c, and it calls
> bpf_program__set_kprobe() and bpf_program__set_tracepoint() _without_
> checking the return value, which is the _correct_ thing to do.
> Instead, the current prototype 'int bpf_program__set_tracepoint(struct
> bpf_program *prog);' is not correct, and I suggest you fix it.
>
> You also need to do other cleanup. For instance, in
> bpf_object__elf_finish() you have:
>
> 	if (!obj_elf_valid(obj))
> 		return;
>
> 	if (obj->efile.elf) {
>
> which is redundant. It's another example where mistakes creep in due
> to defensive programming. There is another bug in bpf_object__close(),
> which does:
>
> 	if (!obj)
> 		return;
>
> Again defensive programming strikes, since you're not checking
> IS_ERR(obj), and that's what bpf_object__open() returns, so most users
> of the library (who don't read the source code and just use it based
> on the .h) will do:
>
> 	obj = bpf_object__open(...);
> 	bpf_object__close(obj);
>
> and the current 'if (!obj)' won't help and it will segfault. I hit
> this issue while developing this patch set.
>
>>> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
>>> index b30394f9947a..32c7252f734e 100644
>>> --- a/tools/lib/bpf/libbpf.h
>>> +++ b/tools/lib/bpf/libbpf.h
>>> @@ -25,6 +25,7 @@
>>>  #include
>>>  #include
>>>  #include <stddef.h> // for size_t
>>> +#include <linux/bpf.h>
>>>
>>>  enum libbpf_errno {
>>>  	__LIBBPF_ERRNO__START = 4000,
>>> @@ -185,6 +186,7 @@ int bpf_program__set_sched_cls(struct bpf_program *prog);
>>>  int bpf_program__set_sched_act(struct bpf_program *prog);
>>>  int bpf_program__set_xdp(struct bpf_program *prog);
>>>  int bpf_program__set_perf_event(struct bpf_program *prog);
>>> +void bpf_program__set_type(struct bpf_program *prog, enum bpf_prog_type type);
>>
>> The above bpf_program__set_xxx functions become redundant. They
>> should be generated as static inline functions using a macro.
>>
>>> bool bpf_program__is_socket_filter(struct bpf_program *prog);
>>> bool bpf_program__is_tracepoint(struct bpf_program *prog);
>>
>> bpf_program__is_xxx should be updated like bpf_program__set_xxx,
>> since enum bpf_prog_type is not a problem now.
>
> All of these suggestions are cleanups for your code that you need to
> do yourself. I actually suggest you kill all bpf_program__is*() and
> all but one of the bpf_program__set*() functions. The current user,
> perf/util/bpf-loader.c, should be converted to use
> bpf_program__set_type() with the _void_ return type that I'm
> introducing here.
>
> Overall, I think tools/lib/bpf/ is a nice library and it can be used
> by many projects, but I suggest you stop making excuses based on your
> proprietary usage of it. Also, please cc me in the future on changes
> to the library. It still has my copyrights; a lot has changed since
> the last time I looked at it, and it's my fault for not paying
> attention earlier.

OK. Then let your patch be merged first, and then let me do the code
cleanup.

Thank you.
Re: [PATCH v2 net-next 3/6] tools/lib/bpf: expose bpf_program__set_type()
On 2017/3/31 12:45, Alexei Starovoitov wrote:
> expose bpf_program__set_type() to set program type
>
> Signed-off-by: Alexei Starovoitov
> Acked-by: Daniel Borkmann
> Acked-by: Martin KaFai Lau
> ---
>  tools/lib/bpf/libbpf.c | 3 +--
>  tools/lib/bpf/libbpf.h | 2 ++
>  2 files changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index ac6eb863b2a4..1a2c07eb7795 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -1618,8 +1618,7 @@ int bpf_program__nth_fd(struct bpf_program *prog, int n)
>  	return fd;
>  }
>
> -static void bpf_program__set_type(struct bpf_program *prog,
> -				  enum bpf_prog_type type)
> +void bpf_program__set_type(struct bpf_program *prog, enum bpf_prog_type type)
> {
> 	prog->type = type;
> }

Since it becomes a public interface, we need to check whether prog is a
NULL pointer and whether the value of type is okay, and let it return an
errno.

> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> index b30394f9947a..32c7252f734e 100644
> --- a/tools/lib/bpf/libbpf.h
> +++ b/tools/lib/bpf/libbpf.h
> @@ -25,6 +25,7 @@
>  #include
>  #include
>  #include <stddef.h> // for size_t
> +#include <linux/bpf.h>
>
>  enum libbpf_errno {
>  	__LIBBPF_ERRNO__START = 4000,
> @@ -185,6 +186,7 @@ int bpf_program__set_sched_cls(struct bpf_program *prog);
>  int bpf_program__set_sched_act(struct bpf_program *prog);
>  int bpf_program__set_xdp(struct bpf_program *prog);
>  int bpf_program__set_perf_event(struct bpf_program *prog);
> +void bpf_program__set_type(struct bpf_program *prog, enum bpf_prog_type type);

The above bpf_program__set_xxx functions become redundant. They should be
generated as static inline functions using a macro.

> bool bpf_program__is_socket_filter(struct bpf_program *prog);
> bool bpf_program__is_tracepoint(struct bpf_program *prog);

bpf_program__is_xxx should be updated like bpf_program__set_xxx, since
enum bpf_prog_type is not a problem now.

Thank you.
Re: [PATCH v2 net-next 3/6] tools/lib/bpf: expose bpf_program__set_type()
Hi Alexei,

Please see the patch I sent. Since we export bpf_program__set_type(),
bpf_program__set_xxx() should be built based on it.

Thank you.

On 2017/3/31 12:45, Alexei Starovoitov wrote:
> expose bpf_program__set_type() to set program type
>
> Signed-off-by: Alexei Starovoitov
> Acked-by: Daniel Borkmann
> Acked-by: Martin KaFai Lau
> ---
Re: [PATCH v2 net-next 2/6] tools/lib/bpf: add support for BPF_PROG_TEST_RUN command
On 2017/3/31 12:45, Alexei Starovoitov wrote:
> add support for BPF_PROG_TEST_RUN command to libbpf.a
>
> Signed-off-by: Alexei Starovoitov
> Acked-by: Daniel Borkmann
> Acked-by: Martin KaFai Lau
> Acked-by: Wang Nan
> ---
>  tools/include/uapi/linux/bpf.h | 24 ++++++++++++++++++++++++
>  tools/lib/bpf/bpf.c            | 24 ++++++++++++++++++++++++
>  tools/lib/bpf/bpf.h            |  4 +++-
>  3 files changed, 51 insertions(+), 1 deletion(-)
>
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 1ea08ce35567..a1d95386f562 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -81,6 +81,7 @@ enum bpf_cmd {
>  	BPF_OBJ_GET,
>  	BPF_PROG_ATTACH,
>  	BPF_PROG_DETACH,
> +	BPF_PROG_TEST_RUN,
>  };
>
>  enum bpf_map_type {
> @@ -189,6 +190,17 @@ union bpf_attr {
>  		__u32		attach_type;
>  		__u32		attach_flags;
>  	};
> +
> +	struct { /* anonymous struct used by BPF_PROG_TEST_RUN command */
> +		__u32		prog_fd;
> +		__u32		retval;
> +		__u32		data_size_in;
> +		__u32		data_size_out;
> +		__aligned_u64	data_in;
> +		__aligned_u64	data_out;
> +		__u32		repeat;
> +		__u32		duration;
> +	} test;
>  } __attribute__((aligned(8)));
>
>  /* BPF helper function descriptions:
> @@ -459,6 +471,18 @@ union bpf_attr {
>   * Return:
>   *   > 0 length of the string including the trailing NUL on success
>   *   < 0 error
> + *
> + * u64 bpf_get_socket_cookie(skb)
> + *     Get the cookie for the socket stored inside sk_buff.
> + *     @skb: pointer to skb
> + *     Return: 8 Bytes non-decreasing number on success or 0 if the socket
> + *     field is missing inside sk_buff
> + *
> + * u32 bpf_get_socket_uid(skb)
> + *     Get the owner uid of the socket stored inside sk_buff.
> + *     @skb: pointer to skb
> + *     Return: uid of the socket owner on success or 0 if the socket pointer
> + *     inside sk_buff is NULL
>   */
>  #define __BPF_FUNC_MAPPER(FN)	\
>  	FN(unspec),		\
> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
> index 9b58d20e8c93..f84c398c11f4 100644
> --- a/tools/lib/bpf/bpf.c
> +++ b/tools/lib/bpf/bpf.c
> @@ -209,3 +209,27 @@ int bpf_prog_detach(int target_fd, enum bpf_attach_type type)
>
>  	return sys_bpf(BPF_PROG_DETACH, &attr, sizeof(attr));
>  }
> +
> +int bpf_prog_test_run(int prog_fd, int repeat, void *data, __u32 size,
> +		      void *data_out, __u32 *size_out, __u32 *retval,
> +		      __u32 *duration)
> +{
> +	union bpf_attr attr;
> +	int ret;
> +
> +	bzero(&attr, sizeof(attr));
> +	attr.test.prog_fd = prog_fd;
> +	attr.test.data_in = ptr_to_u64(data);
> +	attr.test.data_out = ptr_to_u64(data_out);
> +	attr.test.data_size_in = size;
> +	attr.test.repeat = repeat;
> +
> +	ret = sys_bpf(BPF_PROG_TEST_RUN, &attr, sizeof(attr));
> +	if (size_out)
> +		*size_out = attr.test.data_size_out;
> +	if (retval)
> +		*retval = attr.test.retval;
> +	if (duration)
> +		*duration = attr.test.duration;
> +	return ret;
> +}
> diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
> index 93f021932623..edb4daeff7a5 100644
> --- a/tools/lib/bpf/bpf.h
> +++ b/tools/lib/bpf/bpf.h
> @@ -47,6 +47,8 @@ int bpf_obj_get(const char *pathname);
>  int bpf_prog_attach(int prog_fd, int attachable_fd, enum bpf_attach_type type,
>  		    unsigned int flags);
>  int bpf_prog_detach(int attachable_fd, enum bpf_attach_type type);
> -
> +int bpf_prog_test_run(int prog_fd, int repeat, void *data, __u32 size,
> +		      void *data_out, __u32 *size_out, __u32 *retval,
> +		      __u32 *duration);
>  #endif
Re: [PATCH net-next 1/6] bpf: introduce BPF_PROG_TEST_RUN command
On 2017/3/31 11:24, Alexei Starovoitov wrote:
> On 3/30/17 8:12 PM, Wangnan (F) wrote:
>> On 2017/3/31 10:57, Alexei Starovoitov wrote:
>>> On 3/30/17 7:53 PM, Wangnan (F) wrote:
>>>> I suggest using a CONFIG option to enable/disable code in
>>>> test_run.o to reduce the attack plane.
>>>
>>> Attack plane? What attack do you see, and how does a config help?
>>
>> I think all testing features are not required to be compiled into a
>> production system. A feature which should never be used looks
>> dangerous to me.
>
> It is required on a production system, since XDP testing and XDP
> production have to use the same kernel. We cannot keep rebooting the
> server back and forth to test and then to run. It's not testing kernel
> features; it's testing BPF programs, which are technically user space
> components.

Okay. Now I understand it is a production feature.

Thank you.
Re: [PATCH net-next 2/6] tools/lib/bpf: add support for BPF_PROG_TEST_RUN command
On 2017/3/31 9:31, Alexei Starovoitov wrote:
> add support for BPF_PROG_TEST_RUN command to libbpf.a
>
> Signed-off-by: Alexei Starovoitov
> Acked-by: Daniel Borkmann
> Acked-by: Martin KaFai Lau
> ---
>  tools/include/uapi/linux/bpf.h | 24 ++++++++++++++++++++++++
>  tools/lib/bpf/bpf.c            | 24 ++++++++++++++++++++++++
>  tools/lib/bpf/bpf.h            |  4 +++-
>  3 files changed, 51 insertions(+), 1 deletion(-)
>
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 1ea08ce35567..a1d95386f562 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -81,6 +81,7 @@ enum bpf_cmd {
>  	BPF_OBJ_GET,
>  	BPF_PROG_ATTACH,
>  	BPF_PROG_DETACH,
> +	BPF_PROG_TEST_RUN,
>  };
>
>  enum bpf_map_type {
> @@ -189,6 +190,17 @@ union bpf_attr {
>  		__u32		attach_type;
>  		__u32		attach_flags;
>  	};
> +
> +	struct { /* anonymous struct used by BPF_PROG_TEST_RUN command */
> +		__u32		prog_fd;
> +		__u32		retval;
> +		__u32		data_size_in;
> +		__u32		data_size_out;
> +		__aligned_u64	data_in;
> +		__aligned_u64	data_out;
> +		__u32		repeat;
> +		__u32		duration;
> +	} test;
>  } __attribute__((aligned(8)));
>
>  /* BPF helper function descriptions:
> @@ -459,6 +471,18 @@ union bpf_attr {
>   * Return:
>   *   > 0 length of the string including the trailing NUL on success
>   *   < 0 error
> + *
> + * u64 bpf_get_socket_cookie(skb)
> + *     Get the cookie for the socket stored inside sk_buff.
> + *     @skb: pointer to skb
> + *     Return: 8 Bytes non-decreasing number on success or 0 if the socket
> + *     field is missing inside sk_buff
> + *
> + * u32 bpf_get_socket_uid(skb)
> + *     Get the owner uid of the socket stored inside sk_buff.
> + *     @skb: pointer to skb
> + *     Return: uid of the socket owner on success or 0 if the socket pointer
> + *     inside sk_buff is NULL
>   */
>  #define __BPF_FUNC_MAPPER(FN)	\
>  	FN(unspec),		\
> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
> index 9b58d20e8c93..b5ca5277e30c 100644
> --- a/tools/lib/bpf/bpf.c
> +++ b/tools/lib/bpf/bpf.c
> @@ -209,3 +209,27 @@ int bpf_prog_detach(int target_fd, enum bpf_attach_type type)
>
>  	return sys_bpf(BPF_PROG_DETACH, &attr, sizeof(attr));
> }
> +
> +int bpf_program_test_run(int prog_fd, int repeat, void *data, __u32 size,
> +			 void *data_out, __u32 *size_out, __u32 *retval,
> +			 __u32 *duration)
> +{
> +	union bpf_attr attr;
> +	int ret;
> +
> +	bzero(&attr, sizeof(attr));
> +	attr.test.prog_fd = prog_fd;
> +	attr.test.data_in = ptr_to_u64(data);
> +	attr.test.data_out = ptr_to_u64(data_out);
> +	attr.test.data_size_in = size;
> +	attr.test.repeat = repeat;
> +
> +	ret = sys_bpf(BPF_PROG_TEST_RUN, &attr, sizeof(attr));
> +	if (size_out)
> +		*size_out = attr.test.data_size_out;
> +	if (retval)
> +		*retval = attr.test.retval;
> +	if (duration)
> +		*duration = attr.test.duration;
> +	return ret;
> +}
> diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
> index 93f021932623..adfb320ff21d 100644
> --- a/tools/lib/bpf/bpf.h
> +++ b/tools/lib/bpf/bpf.h
> @@ -47,6 +47,8 @@ int bpf_obj_get(const char *pathname);
>  int bpf_prog_attach(int prog_fd, int attachable_fd, enum bpf_attach_type type,
>  		    unsigned int flags);
>  int bpf_prog_detach(int attachable_fd, enum bpf_attach_type type);
> -
> +int bpf_program_test_run(int prog_fd, int repeat, void *data, __u32 size,
> +			 void *data_out, __u32 *size_out, __u32 *retval,
> +			 __u32 *duration);

Please call it bpf_prog_test_run() so it looks uniform with the others.

Thank you.
Re: [PATCH net-next 1/6] bpf: introduce BPF_PROG_TEST_RUN command
On 2017/3/31 10:57, Alexei Starovoitov wrote:
> On 3/30/17 7:53 PM, Wangnan (F) wrote:
>> I suggest using a CONFIG option to enable/disable code in test_run.o
>> to reduce the attack plane.
>
> Attack plane? What attack do you see, and how does a config help?

I think all testing features are not required to be compiled into a
production system. A feature which should never be used looks dangerous
to me.

I suggest adding a CONFIG option like CONFIG_BPF_PROGRAM_TEST_RUN to
control whether the kernel should be compiled with this feature or not.
We can enable it by default and give people a chance to turn it off. At
least in my company, people tend to turn all unneeded features off. If
you don't provide a config option, they will make one themselves.

Thank you.
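A sketch of the Kconfig entry Wang proposes might look like the following. This is purely hypothetical: no such symbol was part of the patch set (net/bpf/test_run.o was merged as unconditional obj-y), and the help text paraphrases the thread's argument:

```kconfig
# Hypothetical sketch of Wang's proposed option: default-on, so
# behavior is unchanged unless a site explicitly disables it.
config BPF_PROGRAM_TEST_RUN
	bool "Enable BPF_PROG_TEST_RUN command for testing BPF programs"
	default y
	depends on NET && BPF_SYSCALL
	help
	  Allow user space to run BPF programs against supplied test
	  input via the bpf(2) BPF_PROG_TEST_RUN command. Could be
	  disabled on hardened kernels at the cost of losing the
	  ability to test programs on the production kernel.
```

The net/bpf/Makefile line would then become `obj-$(CONFIG_BPF_PROGRAM_TEST_RUN) := test_run.o` instead of the unconditional `obj-y` in the patch.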
Re: [PATCH net-next 1/6] bpf: introduce BPF_PROG_TEST_RUN command
On 2017/3/31 9:31, Alexei Starovoitov wrote:
> development and testing of networking bpf programs is quite
> cumbersome. Despite the availability of user space bpf interpreters,
> the kernel is the ultimate authority and execution environment.
> Current test frameworks for TC include creation of netns, veth and
> qdiscs, and use of various packet generators just to test the
> functionality of a bpf program. XDP testing is even more complicated,
> since qemu needs to be started with gro/gso disabled and precise queue
> configuration, transferring of the xdp program from host into guest,
> attaching to virtio/eth0 and generating traffic from the host while
> capturing the results from the guest.
>
> Moreover, analyzing performance bottlenecks in an XDP program is
> impossible in a virtio environment, since the cost of running the
> program is tiny compared to the overhead of virtio packet processing,
> so performance testing can only be done on a physical nic with another
> server generating traffic.
>
> Furthermore, ongoing changes to the user space control plane of
> production applications cannot be run on the test servers, leaving bpf
> programs stubbed out for testing. Last but not least, the upstream
> llvm changes are validated by the bpf backend testsuite, which has no
> ability to test the generated code.
>
> To improve this situation, introduce the BPF_PROG_TEST_RUN command to
> test and performance-benchmark bpf programs.
>
> Joint work with Daniel Borkmann.
>
> Signed-off-by: Alexei Starovoitov
> Acked-by: Daniel Borkmann
> Acked-by: Martin KaFai Lau
> ---
>  include/linux/bpf.h      |   7 ++
>  include/uapi/linux/bpf.h |  12
>  kernel/bpf/syscall.c     |  27 +++-
>  net/Makefile             |   2 +-
>  net/bpf/Makefile         |   1 +
>  net/bpf/test_run.c       | 172 +++
>  net/core/filter.c        |   5 ++
>  7 files changed, 223 insertions(+), 3 deletions(-)
>  create mode 100644 net/bpf/Makefile
>  create mode 100644 net/bpf/test_run.c
>
> [SNIP]
>
> diff --git a/net/Makefile b/net/Makefile
> index 9b681550e3a3..9086ffbb5085 100644
> --- a/net/Makefile
> +++ b/net/Makefile
> @@ -12,7 +12,7 @@ obj-$(CONFIG_NET) += $(tmp-y)
>
>  # LLC has to be linked before the files in net/802/
>  obj-$(CONFIG_LLC)		+= llc/
> -obj-$(CONFIG_NET)		+= ethernet/ 802/ sched/ netlink/
> +obj-$(CONFIG_NET)		+= ethernet/ 802/ sched/ netlink/ bpf/
>  obj-$(CONFIG_NETFILTER)	+= netfilter/
>  obj-$(CONFIG_INET)		+= ipv4/
>  obj-$(CONFIG_XFRM)		+= xfrm/
> diff --git a/net/bpf/Makefile b/net/bpf/Makefile
> new file mode 100644
> index ..27b2992a0692
> --- /dev/null
> +++ b/net/bpf/Makefile
> @@ -0,0 +1 @@
> +obj-y := test_run.o

I suggest using a CONFIG option to enable/disable code in test_run.o to
reduce the attack plane.

Thank you.
Re: [PATCH net-next 4/6] tools/lib/bpf: expose bpf_program__set_type()
On 2017/3/31 10:37, Alexei Starovoitov wrote:
> On 3/30/17 7:33 PM, Wangnan (F) wrote:
>>> +void bpf_program__set_type(struct bpf_program *prog, enum bpf_prog_type type);
>>
>> This makes libbpf.h depend on uapi/linux/bpf.h (because of enum
>> bpf_prog_type), which is not always available. What about defining
>> another enum inside libbpf.h?
>
> How about just including bpf.h? Or making it 'int' instead of an enum?

Including either kernel header into libbpf.h makes a lot of trouble,
because the kernel header and uapi have many other things we don't need,
which may conflict with existing code. Making it 'int' looks like a
backdoor: we would still need macros to define each program type.

Thank you.
Re: [PATCH net-next 4/6] tools/lib/bpf: expose bpf_program__set_type()
On 2017/3/31 9:31, Alexei Starovoitov wrote:
> expose bpf_program__set_type() to set program type
>
> Signed-off-by: Alexei Starovoitov
> Acked-by: Daniel Borkmann
> Acked-by: Martin KaFai Lau
> ---
>  tools/lib/bpf/libbpf.c | 3 +--
>  tools/lib/bpf/libbpf.h | 1 +
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index ac6eb863b2a4..1a2c07eb7795 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -1618,8 +1618,7 @@ int bpf_program__nth_fd(struct bpf_program *prog, int n)
>  	return fd;
>  }
>
> -static void bpf_program__set_type(struct bpf_program *prog,
> -				  enum bpf_prog_type type)
> +void bpf_program__set_type(struct bpf_program *prog, enum bpf_prog_type type)
>  {
>  	prog->type = type;
>  }
> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> index b30394f9947a..82adde30b696 100644
> --- a/tools/lib/bpf/libbpf.h
> +++ b/tools/lib/bpf/libbpf.h
> @@ -185,6 +185,7 @@ int bpf_program__set_sched_cls(struct bpf_program *prog);
>  int bpf_program__set_sched_act(struct bpf_program *prog);
>  int bpf_program__set_xdp(struct bpf_program *prog);
>  int bpf_program__set_perf_event(struct bpf_program *prog);
> +void bpf_program__set_type(struct bpf_program *prog, enum bpf_prog_type type);

This makes libbpf.h depend on uapi/linux/bpf.h (because of enum
bpf_prog_type), which is not always available. What about defining
another enum inside libbpf.h?

Thank you.
Re: [PATCH net-next v1] bpf: Remove redundant ifdef
On 2017/2/15 1:07, David Miller wrote:
> From: "Wangnan (F)"
> Date: Mon, 13 Feb 2017 09:53:49 +0800
>
>> On 2017/2/12 3:37, Mickaël Salaün wrote:
>>> Remove a useless ifdef __NR_bpf as requested by Wang Nan. Inline
>>> one-line static functions as they were in the bpf_sys.h file.
>>>
>>> Signed-off-by: Mickaël Salaün
>>> Cc: Alexei Starovoitov
>>> Cc: Daniel Borkmann
>>> Cc: David S. Miller
>>> Cc: Wang Nan
>>> Link: https://lkml.kernel.org/r/828ab1ff-4dcf-53ff-c97b-074adb895...@huawei.com
>>> ---
>>>  tools/lib/bpf/bpf.c | 12 +++---------
>>>  1 file changed, 3 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
>>> index 50e04cc5..2de9c386989a 100644
>>> --- a/tools/lib/bpf/bpf.c
>>> +++ b/tools/lib/bpf/bpf.c
>>> @@ -42,21 +42,15 @@
>>>  # endif
>>>  #endif
>>>
>>> -static __u64 ptr_to_u64(const void *ptr)
>>> +static inline __u64 ptr_to_u64(const void *ptr)
>>>  {
>>>  	return (__u64) (unsigned long) ptr;
>>>  }
>>>
>>> -static int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr,
>>> -		   unsigned int size)
>>> +static inline int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr,
>>> +			  unsigned int size)
>>>  {
>>> -#ifdef __NR_bpf
>>>  	return syscall(__NR_bpf, cmd, attr, size);
>>> -#else
>>> -	fprintf(stderr, "No bpf syscall, kernel headers too old?\n");
>>> -	errno = ENOSYS;
>>> -	return -1;
>>> -#endif
>>>  }
>>>
>>>  int bpf_create_map(enum bpf_map_type map_type, int key_size,
>>
>> Acked-by: Wang Nan
>>
>> However, it is better to merge this patch with commit
>> 702498a1426bc95b6f49f9c5fba616110cbd3947.
>
> I don't know where this commit ID is. Since this patch is targeting
> net-next, I would expect a commit ID with no context to be in that
> tree. Please always specify where the commit ID you mention is.

It is "bpf: Remove bpf_sys.h from selftests" in net-next.

> Furthermore, commits in net-next are permanent, so it is not possible
> afterwards to "merge this patch with commit X".

I understand. Maintainers sometimes reset their head to an early version
and amend a commit to make the history clean, but clearly net-next never
does this.

Thank you.
Re: [PATCH net-next v1] bpf: Remove redundant ifdef
On 2017/2/12 3:37, Mickaël Salaün wrote:
> Remove a useless ifdef __NR_bpf as requested by Wang Nan. Inline
> one-line static functions as they were in the bpf_sys.h file.
>
> Signed-off-by: Mickaël Salaün
> Cc: Alexei Starovoitov
> Cc: Daniel Borkmann
> Cc: David S. Miller
> Cc: Wang Nan
> Link: https://lkml.kernel.org/r/828ab1ff-4dcf-53ff-c97b-074adb895...@huawei.com
> ---
>  tools/lib/bpf/bpf.c | 12 +++---------
>  1 file changed, 3 insertions(+), 9 deletions(-)
>
> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
> index 50e04cc5..2de9c386989a 100644
> --- a/tools/lib/bpf/bpf.c
> +++ b/tools/lib/bpf/bpf.c
> @@ -42,21 +42,15 @@
>  # endif
>  #endif
>
> -static __u64 ptr_to_u64(const void *ptr)
> +static inline __u64 ptr_to_u64(const void *ptr)
>  {
>  	return (__u64) (unsigned long) ptr;
>  }
>
> -static int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr,
> -		   unsigned int size)
> +static inline int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr,
> +			  unsigned int size)
>  {
> -#ifdef __NR_bpf
>  	return syscall(__NR_bpf, cmd, attr, size);
> -#else
> -	fprintf(stderr, "No bpf syscall, kernel headers too old?\n");
> -	errno = ENOSYS;
> -	return -1;
> -#endif
>  }
>
>  int bpf_create_map(enum bpf_map_type map_type, int key_size,

Acked-by: Wang Nan

However, it is better to merge this patch with commit
702498a1426bc95b6f49f9c5fba616110cbd3947.

Thank you.
Re: [PATCH v4 0/3] Miscellaneous fixes for BPF (perf tree)
On 2017/2/9 4:27, Mickaël Salaün wrote:
> This series brings some fixes and small improvements to the BPF
> samples. It is intended for the perf tree and applies on 7a5980f9c006
> ("tools lib bpf: Add missing header to the library").
>
> Changes since v3:
> * remove applied patch 1/5
> * remove patch 2/5 on bpf_load_program() as requested by Wang Nan
>
> Changes since v2:
> * add this cover letter
>
> Changes since v1:
> * exclude patches not intended for the perf tree
>
> Regards,
>
> Mickaël Salaün (3):
>   samples/bpf: Ignore already processed ELF sections
>   samples/bpf: Reset global variables
>   samples/bpf: Add missing header
>
>  samples/bpf/bpf_load.c     | 7 +++++++
>  samples/bpf/tracex5_kern.c | 1 +
>  2 files changed, 8 insertions(+)

Looks good to me.

Thank you.
Re: [PATCH net-next v5 04/11] bpf: Use bpf_load_program() from the library
On 2017/2/10 10:25, Wangnan (F) wrote:
> On 2017/2/10 7:21, Mickaël Salaün wrote:
>> Replace bpf_prog_load() with bpf_load_program() calls.
>>
>> Signed-off-by: Mickaël Salaün
>> Cc: Alexei Starovoitov
>> Cc: Daniel Borkmann
>> Cc: Shuah Khan
>> ---
>>  tools/lib/bpf/bpf.c                         |  6 +++---
>>  tools/lib/bpf/bpf.h                         |  4 ++--
>>  tools/testing/selftests/bpf/Makefile        |  4 +++-
>>  tools/testing/selftests/bpf/bpf_sys.h       | 21 ---------------------
>>  tools/testing/selftests/bpf/test_tag.c      |  6 ++++--
>>  tools/testing/selftests/bpf/test_verifier.c |  8 +++++---
>>  6 files changed, 17 insertions(+), 32 deletions(-)
>>
>> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
>> index 3ddb58a36d3c..58ce252073fa 100644
>> --- a/tools/lib/bpf/bpf.c
>> +++ b/tools/lib/bpf/bpf.c
>> @@ -42,7 +42,7 @@
>>  # endif
>>  #endif
>>
>> -static __u64 ptr_to_u64(void *ptr)
>> +static __u64 ptr_to_u64(const void *ptr)
>>  {
>>  	return (__u64) (unsigned long) ptr;
>>  }
>> @@ -69,8 +69,8 @@ int bpf_create_map(enum bpf_map_type map_type, int key_size,
>>  	return sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
>>  }
>>
>> -int bpf_load_program(enum bpf_prog_type type, struct bpf_insn *insns,
>> -		     size_t insns_cnt, char *license,
>> +int bpf_load_program(enum bpf_prog_type type, const struct bpf_insn *insns,
>> +		     size_t insns_cnt, const char *license,
>>  		     __u32 kern_version, char *log_buf, size_t log_buf_sz)
>>  {
>>  	int fd;
>> diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
>> index a2f9853dd882..bc959a2de023 100644
>> --- a/tools/lib/bpf/bpf.h
>> +++ b/tools/lib/bpf/bpf.h
>> @@ -28,8 +28,8 @@ int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size,
>>  /* Recommend log buffer size */
>>  #define BPF_LOG_BUF_SIZE 65536
>>
>> -int bpf_load_program(enum bpf_prog_type type, struct bpf_insn *insns,
>> -		     size_t insns_cnt, char *license,
>> +int bpf_load_program(enum bpf_prog_type type, const struct bpf_insn *insns,
>> +		     size_t insns_cnt, const char *license,
>>  		     __u32 kern_version, char *log_buf, size_t log_buf_sz);
>
> For libbpf changes, and for similar code in patches 5-8:
>
> Acked-by: Wang Nan
>
> Thank you.

Acked-by: Wang Nan

Thank you.
Re: [PATCH net-next v5 10/11] bpf: Remove bpf_sys.h from selftests
On 2017/2/10 7:21, Mickaël Salaün wrote:
> Add required dependency headers.
>
> Signed-off-by: Mickaël Salaün
> Cc: Alexei Starovoitov
> Cc: Daniel Borkmann
> Cc: Shuah Khan
> ---
>  tools/lib/bpf/bpf.c                         |  6 ++++++
>  tools/testing/selftests/bpf/bpf_sys.h       | 27 ---------------------------
>  tools/testing/selftests/bpf/test_lpm_map.c  |  1 -
>  tools/testing/selftests/bpf/test_lru_map.c  |  1 -
>  tools/testing/selftests/bpf/test_maps.c     |  1 -
>  tools/testing/selftests/bpf/test_tag.c      |  3 +--
>  tools/testing/selftests/bpf/test_verifier.c |  4 ++--
>  7 files changed, 9 insertions(+), 34 deletions(-)
>  delete mode 100644 tools/testing/selftests/bpf/bpf_sys.h
>
> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
> index f8a2b7fa7741..50e04cc5 100644
> --- a/tools/lib/bpf/bpf.c
> +++ b/tools/lib/bpf/bpf.c
> @@ -50,7 +50,13 @@ static __u64 ptr_to_u64(const void *ptr)
>  static int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr,
>  		   unsigned int size)
>  {
> +#ifdef __NR_bpf
>  	return syscall(__NR_bpf, cmd, attr, size);
> +#else
> +	fprintf(stderr, "No bpf syscall, kernel headers too old?\n");
> +	errno = ENOSYS;
> +	return -1;
> +#endif
>  }

We don't need to check __NR_bpf again. It has already been checked at
the top of this file:

	#ifndef __NR_bpf
	# if defined(__i386__)
	#  define __NR_bpf 357
	# elif defined(__x86_64__)
	#  define __NR_bpf 321
	# elif defined(__aarch64__)
	#  define __NR_bpf 280
	# else
	#  error __NR_bpf not defined. libbpf does not support your arch.
	# endif
	#endif

Thank you.
Re: [PATCH net-next v5 04/11] bpf: Use bpf_load_program() from the library
On 2017/2/10 7:21, Mickaël Salaün wrote:
> Replace bpf_prog_load() with bpf_load_program() calls.
>
> Signed-off-by: Mickaël Salaün
> Cc: Alexei Starovoitov
> Cc: Daniel Borkmann
> Cc: Shuah Khan
> ---
>  tools/lib/bpf/bpf.c                         |  6 +++---
>  tools/lib/bpf/bpf.h                         |  4 ++--
>  tools/testing/selftests/bpf/Makefile        |  4 +++-
>  tools/testing/selftests/bpf/bpf_sys.h       | 21 ---------------------
>  tools/testing/selftests/bpf/test_tag.c      |  6 ++++--
>  tools/testing/selftests/bpf/test_verifier.c |  8 +++++---
>  6 files changed, 17 insertions(+), 32 deletions(-)
>
> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
> index 3ddb58a36d3c..58ce252073fa 100644
> --- a/tools/lib/bpf/bpf.c
> +++ b/tools/lib/bpf/bpf.c
> @@ -42,7 +42,7 @@
>  # endif
>  #endif
>
> -static __u64 ptr_to_u64(void *ptr)
> +static __u64 ptr_to_u64(const void *ptr)
>  {
>  	return (__u64) (unsigned long) ptr;
>  }
> @@ -69,8 +69,8 @@ int bpf_create_map(enum bpf_map_type map_type, int key_size,
>  	return sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
>  }
>
> -int bpf_load_program(enum bpf_prog_type type, struct bpf_insn *insns,
> -		     size_t insns_cnt, char *license,
> +int bpf_load_program(enum bpf_prog_type type, const struct bpf_insn *insns,
> +		     size_t insns_cnt, const char *license,
>  		     __u32 kern_version, char *log_buf, size_t log_buf_sz)
>  {
>  	int fd;
> diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
> index a2f9853dd882..bc959a2de023 100644
> --- a/tools/lib/bpf/bpf.h
> +++ b/tools/lib/bpf/bpf.h
> @@ -28,8 +28,8 @@ int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size,
>  /* Recommend log buffer size */
>  #define BPF_LOG_BUF_SIZE 65536
>
> -int bpf_load_program(enum bpf_prog_type type, struct bpf_insn *insns,
> -		     size_t insns_cnt, char *license,
> +int bpf_load_program(enum bpf_prog_type type, const struct bpf_insn *insns,
> +		     size_t insns_cnt, const char *license,
>  		     __u32 kern_version, char *log_buf, size_t log_buf_sz);

For libbpf changes:

Acked-by: Wang Nan

Thank you.
Re: [PATCH v2 1/5] bpf: Add missing header to the library
Please add me to the cc list of all five patches.

Thank you.

On 2017/2/7 4:40, Mickaël Salaün wrote:
> Include stddef.h to define size_t.
>
> Signed-off-by: Mickaël Salaün
> Cc: Alexei Starovoitov
> Cc: Arnaldo Carvalho de Melo
> Cc: Daniel Borkmann
> Cc: Wang Nan
> ---
>  tools/lib/bpf/bpf.h | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
> index a2f9853dd882..df6e186da788 100644
> --- a/tools/lib/bpf/bpf.h
> +++ b/tools/lib/bpf/bpf.h
> @@ -22,6 +22,7 @@
>  #define __BPF_BPF_H
>
>  #include <linux/bpf.h>
> +#include <stddef.h>
>
>  int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size,
>  		   int max_entries, __u32 map_flags);
Re: [PATCH v3 1/5] bpf: Add missing header to the library
On 2017/2/8 4:56, Mickaël Salaün wrote: Include stddef.h to define size_t. Signed-off-by: Mickaël Salaün Cc: Alexei Starovoitov Cc: Arnaldo Carvalho de Melo Cc: Daniel Borkmann Cc: Wang Nan --- tools/lib/bpf/bpf.h | 1 + 1 file changed, 1 insertion(+) diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h index a2f9853dd882..df6e186da788 100644 --- a/tools/lib/bpf/bpf.h +++ b/tools/lib/bpf/bpf.h @@ -22,6 +22,7 @@ #define __BPF_BPF_H #include +#include int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size, int max_entries, __u32 map_flags); Looks good to me. Thank you.
Re: [PATCH v3 2/5] bpf: Simplify bpf_load_program() error handling in the library
On 2017/2/8 4:56, Mickaël Salaün wrote: Do not call bpf(2) a second time when a program load fails. BPF_PROG_LOAD should succeed most of the time. Setting log_level to 0 by default and requiring a log buffer only on failure makes the common case faster. Thank you. Signed-off-by: Mickaël Salaün Cc: Alexei Starovoitov Cc: Arnaldo Carvalho de Melo Cc: Daniel Borkmann Cc: Wang Nan --- tools/lib/bpf/bpf.c | 18 ++ 1 file changed, 6 insertions(+), 12 deletions(-) diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c index 3ddb58a36d3c..fda3f494f1cd 100644 --- a/tools/lib/bpf/bpf.c +++ b/tools/lib/bpf/bpf.c @@ -73,7 +73,6 @@ int bpf_load_program(enum bpf_prog_type type, struct bpf_insn *insns, size_t insns_cnt, char *license, __u32 kern_version, char *log_buf, size_t log_buf_sz) { - int fd; union bpf_attr attr; bzero(&attr, sizeof(attr)); @@ -81,20 +80,15 @@ int bpf_load_program(enum bpf_prog_type type, struct bpf_insn *insns, attr.insn_cnt = (__u32)insns_cnt; attr.insns = ptr_to_u64(insns); attr.license = ptr_to_u64(license); - attr.log_buf = ptr_to_u64(NULL); - attr.log_size = 0; - attr.log_level = 0; + attr.log_buf = ptr_to_u64(log_buf); + attr.log_size = log_buf_sz; attr.kern_version = kern_version; - fd = sys_bpf(BPF_PROG_LOAD, &attr, sizeof(attr)); - if (fd >= 0 || !log_buf || !log_buf_sz) - return fd; + if (log_buf && log_buf_sz > 0) { + attr.log_level = 1; + log_buf[0] = 0; + } - /* Try again with log */ - attr.log_buf = ptr_to_u64(log_buf); - attr.log_size = log_buf_sz; - attr.log_level = 1; - log_buf[0] = 0; return sys_bpf(BPF_PROG_LOAD, &attr, sizeof(attr)); }
Re: [PATCHv2 perf/core 5/7] tools lib bpf: Add bpf_program__pin()
On 2017/1/25 9:16, Joe Stringer wrote: On 24 January 2017 at 17:06, Wangnan (F) wrote: On 2017/1/25 9:04, Wangnan (F) wrote: On 2017/1/23 9:11, Joe Stringer wrote: Add a new API to pin a BPF program to the filesystem. The user can specify the path full path within a BPF filesystem to pin the program. Programs with multiple instances are pinned as 'foo', 'foo_1', 'foo_2', and so on. Signed-off-by: Joe Stringer --- v2: Don't automount BPF filesystem Split program, map, object pinning into separate APIs and separate patches. --- [SNIP] +int bpf_program__pin(struct bpf_program *prog, const char *path) In your next patch please let caller select one instance: int bpf_program__pin(struct bpf_program *prog, int instance, const char *path) (please choose a better name) Then your can wrap it with another function to pin all instances, implement naming schema (%s_%d) there. Then implement naming schema in bpf_object__pin like: %s/objectname_mapname %s/objectname_progname_%d Is it possible to use directory tree instead? %s/object/mapname %s/object/prog/instance I don't think objects have names, so let's assume an object with two program instances named foo, and one map named bar. A call of bpf_object__pin(obj, "/sys/fs/bpf/myobj") would mount with the following files and directories: /sys/fs/bpf/myobj/foo/1 /sys/fs/bpf/myobj/foo/2 /sys/fs/bpf/myobj/bar Alternatively, if you want to control exactly where you want the progs/maps to be pinned, you can call eg bpf_program__pin_instance(prog, "/sys/fs/bpf/wherever", 0) and that instance will be mounted to /sys/fs/bpf/wherever, or alternatively bpf_program__pin(prog, "/sys/fs/bpf/foo"), and you will end up with /sys/fs/bpf/foo/{0,1}. This looks pretty reasonable to me. It looks good to me. Thank you.
Re: [PATCHv2 perf/core 5/7] tools lib bpf: Add bpf_program__pin()
On 2017/1/25 9:04, Wangnan (F) wrote: On 2017/1/23 9:11, Joe Stringer wrote: Add a new API to pin a BPF program to the filesystem. The user can specify the full path within a BPF filesystem to pin the program. Programs with multiple instances are pinned as 'foo', 'foo_1', 'foo_2', and so on. Signed-off-by: Joe Stringer --- v2: Don't automount BPF filesystem Split program, map, object pinning into separate APIs and separate patches. --- [SNIP] +int bpf_program__pin(struct bpf_program *prog, const char *path) In your next patch please let the caller select one instance: int bpf_program__pin(struct bpf_program *prog, int instance, const char *path) (please choose a better name) Then you can wrap it with another function that pins all instances and implements the naming schema (%s_%d) there. Then implement the naming schema in bpf_object__pin like: %s/objectname_mapname %s/objectname_progname_%d Is it possible to use a directory tree instead? %s/object/mapname %s/object/prog/instance Thank you.
Re: [PATCHv2 perf/core 5/7] tools lib bpf: Add bpf_program__pin()
On 2017/1/23 9:11, Joe Stringer wrote: Add a new API to pin a BPF program to the filesystem. The user can specify the path full path within a BPF filesystem to pin the program. Programs with multiple instances are pinned as 'foo', 'foo_1', 'foo_2', and so on. Signed-off-by: Joe Stringer --- v2: Don't automount BPF filesystem Split program, map, object pinning into separate APIs and separate patches. --- tools/lib/bpf/libbpf.c | 76 ++ tools/lib/bpf/libbpf.h | 1 + 2 files changed, 77 insertions(+) diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index e6cd62b1264b..eea5c74808f7 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -4,6 +4,7 @@ * Copyright (C) 2013-2015 Alexei Starovoitov * Copyright (C) 2015 Wang Nan * Copyright (C) 2015 Huawei Inc. + * Copyright (C) 2017 Nicira, Inc. * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU Lesser General Public @@ -22,6 +23,7 @@ #include #include #include +#include #include #include #include @@ -31,7 +33,10 @@ #include #include #include +#include #include +#include +#include #include #include @@ -1237,6 +1242,77 @@ int bpf_object__load(struct bpf_object *obj) return err; } +static int check_path(const char *path) +{ + struct statfs st_fs; + char *dname, *dir; + int err = 0; + + if (path == NULL) + return -EINVAL; + + dname = strdup(path); + dir = dirname(dname); + if (statfs(dir, &st_fs)) { + pr_warning("failed to statfs %s: %s\n", dir, strerror(errno)); + err = -errno; + } + free(dname); + + if (!err && st_fs.f_type != BPF_FS_MAGIC) { + pr_warning("specified path %s is not on BPF FS\n", path); + err = -EINVAL; + } + + return err; +} + +int bpf_program__pin(struct bpf_program *prog, const char *path) In your next patch please let caller select one instance: int bpf_program__pin(struct bpf_program *prog, int instance, const char *path) (please choose a better name) Then your can wrap it with another function to pin all instances, implement 
naming schema (%s_%d) there. Then implement the naming schema in bpf_object__pin like: %s/objectname_mapname %s/objectname_progname_%d and make bpf_{map,program}__pin the raw pinning functions with no naming schema. Thank you.
Re: [PATCHv3 perf/core 5/7] samples/bpf: Switch over to libbpf
On 2016/12/9 13:04, Wangnan (F) wrote: On 2016/12/9 10:46, Joe Stringer wrote: [SNIP] diff --git a/tools/lib/bpf/Makefile b/tools/lib/bpf/Makefile index 62d89d50fcbd..616bd55f3be8 100644 --- a/tools/lib/bpf/Makefile +++ b/tools/lib/bpf/Makefile @@ -149,6 +149,8 @@ CMD_TARGETS = $(LIB_FILE) TARGETS = $(CMD_TARGETS) +libbpf: all + Why do we need this? I tested this patch without it and it seems to work, and this line causes an extra error: $ pwd /home/wn/kernel/tools/lib/bpf $ make libbpf ... gcc -g -Wall -DHAVE_LIBELF_MMAP_SUPPORT -DHAVE_ELF_GETPHDRNUM_SUPPORT -Wbad-function-cast -Wdeclaration-after-statement -Wformat-security -Wformat-y2k -Winit-self -Wmissing-declarations -Wmissing-prototypes -Wnested-externs -Wno-system-headers -Wold-style-definition -Wpacked -Wredundant-decls -Wshadow -Wstrict-aliasing=3 -Wstrict-prototypes -Wswitch-default -Wswitch-enum -Wundef -Wwrite-strings -Wformat -Werror -Wall -fPIC -I. -I/home/wn/kernel-hydrogen/tools/include -I/home/wn/kernel-hydrogen/tools/arch/x86/include/uapi -I/home/wn/kernel-hydrogen/tools/include/uapi libbpf.c all -o libbpf gcc: error: all: No such file or directory make: *** [libbpf] Error 1 Thank you. It is not 'caused' by your patch. 'make libbpf' fails without your change because it tries to build an executable from libbpf.c, but main() is missing. I think libbpf should never be used as a make target. Your new dependency looks strange. Thank you.
Re: [PATCHv3 perf/core 5/7] samples/bpf: Switch over to libbpf
On 2016/12/9 10:46, Joe Stringer wrote: Now that libbpf under tools/lib/bpf/* is synced with the version from samples/bpf, we can get rid most of the libbpf library here. Signed-off-by: Joe Stringer --- v3: First post. --- samples/bpf/Makefile | 60 +- samples/bpf/README.rst | 4 +- samples/bpf/libbpf.c | 111 - samples/bpf/libbpf.h | 19 + tools/lib/bpf/Makefile | 2 + 5 files changed, 38 insertions(+), 158 deletions(-) diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile index 72c58675973e..c8f7ed37b2de 100644 --- a/samples/bpf/Makefile +++ b/samples/bpf/Makefile @@ -29,35 +29,38 @@ hostprogs-y += trace_event hostprogs-y += sampleip hostprogs-y += tc_l2_redirect -test_verifier-objs := test_verifier.o libbpf.o -test_maps-objs := test_maps.o libbpf.o -sock_example-objs := sock_example.o libbpf.o -fds_example-objs := bpf_load.o libbpf.o fds_example.o -sockex1-objs := bpf_load.o libbpf.o sockex1_user.o -sockex2-objs := bpf_load.o libbpf.o sockex2_user.o -sockex3-objs := bpf_load.o libbpf.o sockex3_user.o -tracex1-objs := bpf_load.o libbpf.o tracex1_user.o -tracex2-objs := bpf_load.o libbpf.o tracex2_user.o -tracex3-objs := bpf_load.o libbpf.o tracex3_user.o -tracex4-objs := bpf_load.o libbpf.o tracex4_user.o -tracex5-objs := bpf_load.o libbpf.o tracex5_user.o -tracex6-objs := bpf_load.o libbpf.o tracex6_user.o -test_probe_write_user-objs := bpf_load.o libbpf.o test_probe_write_user_user.o -trace_output-objs := bpf_load.o libbpf.o trace_output_user.o -lathist-objs := bpf_load.o libbpf.o lathist_user.o -offwaketime-objs := bpf_load.o libbpf.o offwaketime_user.o -spintest-objs := bpf_load.o libbpf.o spintest_user.o -map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o -test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o -test_cgrp2_array_pin-objs := libbpf.o test_cgrp2_array_pin.o -xdp1-objs := bpf_load.o libbpf.o xdp1_user.o +# Libbpf dependencies +LIBBPF := libbpf.o ../../tools/lib/bpf/bpf.o + +test_verifier-objs := test_verifier.o $(LIBBPF) 
+test_maps-objs := test_maps.o $(LIBBPF) +sock_example-objs := sock_example.o $(LIBBPF) +fds_example-objs := bpf_load.o $(LIBBPF) fds_example.o +sockex1-objs := bpf_load.o $(LIBBPF) sockex1_user.o +sockex2-objs := bpf_load.o $(LIBBPF) sockex2_user.o +sockex3-objs := bpf_load.o $(LIBBPF) sockex3_user.o +tracex1-objs := bpf_load.o $(LIBBPF) tracex1_user.o +tracex2-objs := bpf_load.o $(LIBBPF) tracex2_user.o +tracex3-objs := bpf_load.o $(LIBBPF) tracex3_user.o +tracex4-objs := bpf_load.o $(LIBBPF) tracex4_user.o +tracex5-objs := bpf_load.o $(LIBBPF) tracex5_user.o +tracex6-objs := bpf_load.o $(LIBBPF) tracex6_user.o +test_probe_write_user-objs := bpf_load.o $(LIBBPF) test_probe_write_user_user.o +trace_output-objs := bpf_load.o $(LIBBPF) trace_output_user.o +lathist-objs := bpf_load.o $(LIBBPF) lathist_user.o +offwaketime-objs := bpf_load.o $(LIBBPF) offwaketime_user.o +spintest-objs := bpf_load.o $(LIBBPF) spintest_user.o +map_perf_test-objs := bpf_load.o $(LIBBPF) map_perf_test_user.o +test_overhead-objs := bpf_load.o $(LIBBPF) test_overhead_user.o +test_cgrp2_array_pin-objs := $(LIBBPF) test_cgrp2_array_pin.o +xdp1-objs := bpf_load.o $(LIBBPF) xdp1_user.o # reuse xdp1 source intentionally -xdp2-objs := bpf_load.o libbpf.o xdp1_user.o -test_current_task_under_cgroup-objs := bpf_load.o libbpf.o \ +xdp2-objs := bpf_load.o $(LIBBPF) xdp1_user.o +test_current_task_under_cgroup-objs := bpf_load.o $(LIBBPF) \ test_current_task_under_cgroup_user.o -trace_event-objs := bpf_load.o libbpf.o trace_event_user.o -sampleip-objs := bpf_load.o libbpf.o sampleip_user.o -tc_l2_redirect-objs := bpf_load.o libbpf.o tc_l2_redirect_user.o +trace_event-objs := bpf_load.o $(LIBBPF) trace_event_user.o +sampleip-objs := bpf_load.o $(LIBBPF) sampleip_user.o +tc_l2_redirect-objs := bpf_load.o $(LIBBPF) tc_l2_redirect_user.o # Tell kbuild to always build the programs always := $(hostprogs-y) @@ -89,7 +92,7 @@ always += test_current_task_under_cgroup_kern.o always += trace_event_kern.o always += 
sampleip_kern.o -HOSTCFLAGS += -I$(objtree)/usr/include +HOSTCFLAGS += -I$(objtree)/usr/include -I$(objtree)/tools/lib/ Should use srctree like this: +HOSTCFLAGS += -I$(objtree)/usr/include -I$(srctree)/tools/lib/ Or you will see following failure when doing off-tree build: $ mkdir buildkernel $ cd buildkernel $ make -C ../ O=`pwd` menuconfig $ make -j64 $ make samples/bpf/ CHK include/config/kernel.release Using .. as source for kernel GEN ./Makefile CHK include/generated/uapi/linux/version.h CHK include/generated/utsrelease.h CHK include/generated/bounds.h CHK include/generated/timeconst.h CHK include/generated/asm-offsets.h CALL../scripts/checksyscalls.sh HOSTCC samples/bpf/test_verifier.o In file included from ../samples/bpf/test_verifier.c:20:0: .
Re: [PATCHv3 perf/core 3/7] tools lib bpf: Add flags to bpf_create_map()
On 2016/12/9 10:46, Joe Stringer wrote: The map_flags argument to bpf_create_map() was previously not exposed. By exposing it, users can access flags such as whether or not to preallocate the map. Signed-off-by: Joe Stringer Please mention commit 6c90598174322b029e40dd84a4eb01f56afe in the commit message: Commit 6c905981743 ("bpf: pre-allocate hash map elements") introduced map_flags in bpf_attr for the BPF_MAP_CREATE command. Expose this new parameter in libbpf. Acked-by: Wang Nan --- v3: Split from "tools lib bpf: Sync with samples/bpf/libbpf". --- tools/lib/bpf/bpf.c| 3 ++- tools/lib/bpf/bpf.h| 2 +- tools/lib/bpf/libbpf.c | 3 ++- 3 files changed, 5 insertions(+), 3 deletions(-) diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c index 89e8e8e5b60e..d0afb26c2e0f 100644 --- a/tools/lib/bpf/bpf.c +++ b/tools/lib/bpf/bpf.c @@ -54,7 +54,7 @@ static int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr, } int bpf_create_map(enum bpf_map_type map_type, int key_size, - int value_size, int max_entries) + int value_size, int max_entries, __u32 map_flags) { union bpf_attr attr; @@ -64,6 +64,7 @@ int bpf_create_map(enum bpf_map_type map_type, int key_size, attr.key_size = key_size; attr.value_size = value_size; attr.max_entries = max_entries; + attr.map_flags = map_flags; return sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr)); } diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h index 61130170a6ad..7fcdce16fd62 100644 --- a/tools/lib/bpf/bpf.h +++ b/tools/lib/bpf/bpf.h @@ -24,7 +24,7 @@ #include int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size, - int max_entries); + int max_entries, __u32 map_flags); /* Recommend log buffer size */ #define BPF_LOG_BUF_SIZE 65536 diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index 2e974593f3e8..84e6b35da4bd 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -854,7 +854,8 @@ bpf_object__create_maps(struct bpf_object *obj) *pfd = bpf_create_map(def->type, def->key_size, def->value_size, - 
def->max_entries); + def->max_entries, + 0); if (*pfd < 0) { size_t j; int err = *pfd;
Re: [PATCHv3 perf/core 2/7] tools lib bpf: use __u32 from linux/types.h
On 2016/12/9 10:46, Joe Stringer wrote: Fixes the following issue when building without access to 'u32' type: ./tools/lib/bpf/bpf.h:27:23: error: unknown type name ‘u32’ Signed-off-by: Joe Stringer --- v3: Split from "tools lib bpf: Sync with samples/bpf/libbpf" --- tools/lib/bpf/bpf.c | 4 ++-- tools/lib/bpf/bpf.h | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) Acked-by: Wang Nan diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c index 8143536b462a..89e8e8e5b60e 100644 --- a/tools/lib/bpf/bpf.c +++ b/tools/lib/bpf/bpf.c @@ -70,7 +70,7 @@ int bpf_create_map(enum bpf_map_type map_type, int key_size, int bpf_load_program(enum bpf_prog_type type, struct bpf_insn *insns, size_t insns_cnt, char *license, -u32 kern_version, char *log_buf, size_t log_buf_sz) +__u32 kern_version, char *log_buf, size_t log_buf_sz) { int fd; union bpf_attr attr; @@ -98,7 +98,7 @@ int bpf_load_program(enum bpf_prog_type type, struct bpf_insn *insns, } int bpf_map_update_elem(int fd, void *key, void *value, - u64 flags) + __u64 flags) { union bpf_attr attr; diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h index 253c3dbb06b4..61130170a6ad 100644 --- a/tools/lib/bpf/bpf.h +++ b/tools/lib/bpf/bpf.h @@ -30,11 +30,11 @@ int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size, #define BPF_LOG_BUF_SIZE 65536 int bpf_load_program(enum bpf_prog_type type, struct bpf_insn *insns, size_t insns_cnt, char *license, -u32 kern_version, char *log_buf, +__u32 kern_version, char *log_buf, size_t log_buf_sz); int bpf_map_update_elem(int fd, void *key, void *value, - u64 flags); + __u64 flags); int bpf_map_lookup_elem(int fd, void *key, void *value); int bpf_map_delete_elem(int fd, void *key);
Re: [PATCHv2 perf/core 2/2] tools lib bpf: Sync with samples/bpf/libbpf
On 2016/11/17 10:46, Joe Stringer wrote: On 16 November 2016 at 18:10, Wangnan (F) wrote: I'm also working on improving bpf.c. Please have a look at: https://lkml.org/lkml/2016/11/14/1078 Since bpf.c is simple, I think we can add more functions and fixes gradually, instead of a full copy. See my inline comment below. Ah, I missed this, my apologies. It looks like it will provide much of what I need, I can reassess this patch with your series in mind. One comment though for your patch (I don't have the original thread to respond to unfortunately): The map_pin and map_get functions in your patch series can be used to pin progs too, so maybe there is a better name? You'll see that this patch uses bpf_obj_{pin,get}() - although I wouldn't want those to be confused with the libbpf.c objects so maybe there's a clearer name that could be used. I also have some patches to rework the samples/bpf/* code to use libbpf instead of the sample code that is there, is it worth me submitting that? It will need to wait for your patch to go in, plus a merge with davem's tree. I'm happy to see you are trying to replace samples/bpf's own libbpf with tools/lib/bpf. I think you can submit your libbpf work and the samples patches together if they are ready. Arnaldo can pick up the patches that are good for him, and you can improve the others based on his newest branch. Thank you.
Re: [PATCHv2 perf/core 2/2] tools lib bpf: Sync with samples/bpf/libbpf
On 2016/11/17 10:46, Joe Stringer wrote: On 16 November 2016 at 18:10, Wangnan (F) wrote: I'm also working on improving bpf.c. Please have a look at: https://lkml.org/lkml/2016/11/14/1078 Since bpf.c is simple, I think we can add more functions and fixes gradually, instead of a full copy. See my inline comment below. Ah, I missed this, my apologies. It looks like it will provide much of what I need, I can reassess this patch with your series in mind. One comment though for your patch (I don't have the original thread to respond to unfortunately): The map_pin and map_get functions in your patch series can be used to pin progs too, so maybe there is a better name? You'll see that this patch uses bpf_obj_{pin,get}() - although I wouldn't want those to be confused with the libbpf.c objects so maybe there's a clearer name that could be used. The full thread can be found: https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1272045.html (lkml.kernel.org is not working for me, sorry) In that patch set, bpf_map_pin/get is linked into perf hooks, so BPF script can pin a map to sysfs. I think this feature would be useful, but I don't have an example to show how to use it. I didn't provide program pinning/get interface because in perf hook they are not useful. After rethinking your suggestion now I think it is okay to provide bpf_obj_{pin,get} in bpf.c and export bpf_map_pin to perf hook. I'll adjust my own patch. I also have some patches to rework the samples/bpf/* code to use libbpf instead of the sample code that is there, is it worth me submitting that? It will need to wait for your patch to go in, plus a merge with davem's tree. 
On 2016/11/17 1:43, Joe Stringer wrote: [SNIP] /* @@ -53,24 +60,71 @@ static int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr, return syscall(__NR_bpf, cmd, attr, size); } -int bpf_create_map(enum bpf_map_type map_type, int key_size, - int value_size, int max_entries) +int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size, + int max_entries, int map_flags) { - union bpf_attr attr; + union bpf_attr attr = { + .map_type = map_type, + .key_size = key_size, + .value_size = value_size, + .max_entries = max_entries, + .map_flags = map_flags, + }; - memset(&attr, '\0', sizeof(attr)); + return sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr)); +} I missed map_flags in the original bpf.c. Thanks for your patch. map_flags is useful when creating a BPF_MAP_TYPE_PERCPU_HASH: BPF_F_NO_PREALLOC is meaningful in this case. Do you want me to resubmit this piece as a separate patch or will you address this? Please send it. Thank you.
Re: [PATCHv2 perf/core 2/2] tools lib bpf: Sync with samples/bpf/libbpf
I'm also working on improving bpf.c. Please have a look at: https://lkml.org/lkml/2016/11/14/1078 Since bpf.c is simple, I think we can add more functions and fixes gradually, instead of a full copy. See my inline comment below. On 2016/11/17 1:43, Joe Stringer wrote: Extend the tools/ version of libbpf to include all of the functionality provided in the samples/bpf version. Signed-off-by: Joe Stringer --- v2: Don't shift non-bpf changes across. Various type cleanups, removal of extraneous declarations --- tools/lib/bpf/bpf.c| 107 -- tools/lib/bpf/bpf.h| 202 +++-- tools/lib/bpf/libbpf.c | 3 +- 3 files changed, 279 insertions(+), 33 deletions(-) diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c index 4212ed62235b..5e061851ac00 100644 --- a/tools/lib/bpf/bpf.c +++ b/tools/lib/bpf/bpf.c @@ -20,10 +20,17 @@ */ #include -#include +#include #include #include +#include +#include #include +#include +#include +#include +#include +#include #include "bpf.h" Why we need these network related headers? /* @@ -53,24 +60,71 @@ static int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr, return syscall(__NR_bpf, cmd, attr, size); } -int bpf_create_map(enum bpf_map_type map_type, int key_size, - int value_size, int max_entries) +int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size, + int max_entries, int map_flags) { - union bpf_attr attr; + union bpf_attr attr = { + .map_type = map_type, + .key_size = key_size, + .value_size = value_size, + .max_entries = max_entries, + .map_flags = map_flags, + }; - memset(&attr, '\0', sizeof(attr)); + return sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr)); +} I lost map_flags in original bpf.c. Thanks to your patch. map_flags is useful when creating BPF_MAP_TYPE_PERCPU_HASH: BPF_F_NO_PREALLOC is meanful in this case. Although it is okay in samples, I still prefer a explicit bzero() or memset(), because kernel checks if unused field in this union is zero. 
However, I'll check the C standard to see how unused fields are initialized. diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h index e8ba54087497..4dba36995771 100644 --- a/tools/lib/bpf/bpf.h +++ b/tools/lib/bpf/bpf.h @@ -23,16 +23,202 @@ #include +struct bpf_insn; + int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size, - int max_entries); + int max_entries, int map_flags); +int bpf_update_elem(int fd, void *key, void *value, unsigned long long flags); +int bpf_lookup_elem(int fd, void *key, void *value); +int bpf_delete_elem(int fd, void *key); +int bpf_get_next_key(int fd, void *key, void *next_key); + +int bpf_load_program(enum bpf_prog_type prog_type, +const struct bpf_insn *insns, int insn_len, +const char *license, int kern_version, +char *log_buf, size_t log_buf_sz); + +int bpf_obj_pin(int fd, const char *pathname); +int bpf_obj_get(const char *pathname); -/* Recommend log buffer size */ #define BPF_LOG_BUF_SIZE 65536 -int bpf_load_program(enum bpf_prog_type type, struct bpf_insn *insns, -size_t insns_cnt, char *license, -u32 kern_version, char *log_buf, -size_t log_buf_sz); -int bpf_map_update_elem(int fd, void *key, void *value, - u64 flags); +/* ALU ops on registers, bpf_add|sub|...: dst_reg += src_reg */ + +#define BPF_ALU64_REG(OP, DST, SRC)\ + ((struct bpf_insn) {\ + .code = BPF_ALU64 | BPF_OP(OP) | BPF_X,\ + .dst_reg = DST, \ + .src_reg = SRC, \ + .off = 0, \ + .imm = 0 }) + Should we define these macros here? They are in include/linux/filter.h and duplicated in tools/include/linux/filter.h. Redefining them here would cause conflicts. Thank you.
Re: [PATCHv2 perf/core 1/2] tools lib bpf: Sync {tools,}/include/uapi/linux/bpf.h
On 2016/11/17 1:43, Joe Stringer wrote: The tools version of this header is out of date; update it to the latest version from the kernel headers. Signed-off-by: Joe Stringer --- v2: No change. --- tools/include/uapi/linux/bpf.h | 51 ++ 1 file changed, 51 insertions(+) Acked-by: Wang Nan diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 9e5fc168c8a3..f09c70b97eca 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -95,6 +95,7 @@ enum bpf_prog_type { BPF_PROG_TYPE_SCHED_ACT, BPF_PROG_TYPE_TRACEPOINT, BPF_PROG_TYPE_XDP, + BPF_PROG_TYPE_PERF_EVENT, }; #define BPF_PSEUDO_MAP_FD 1 @@ -375,6 +376,56 @@ enum bpf_func_id { */ BPF_FUNC_probe_write_user, + /** +* bpf_current_task_under_cgroup(map, index) - Check cgroup2 membership of current task +* @map: pointer to bpf_map in BPF_MAP_TYPE_CGROUP_ARRAY type +* @index: index of the cgroup in the bpf_map +* Return: +* == 0 current failed the cgroup2 descendant test +* == 1 current succeeded the cgroup2 descendant test +*< 0 error +*/ + BPF_FUNC_current_task_under_cgroup, + + /** +* bpf_skb_change_tail(skb, len, flags) +* The helper will resize the skb to the given new size, +* to be used f.e. with control messages. +* @skb: pointer to skb +* @len: new skb length +* @flags: reserved +* Return: 0 on success or negative error +*/ + BPF_FUNC_skb_change_tail, + + /** +* bpf_skb_pull_data(skb, len) +* The helper will pull in non-linear data in case the +* skb is non-linear and not all of len are part of the +* linear section. Only needed for read/write with direct +* packet access. +* @skb: pointer to skb +* @len: len to make read/writeable +* Return: 0 on success or negative error +*/ + BPF_FUNC_skb_pull_data, + + /** +* bpf_csum_update(skb, csum) +* Adds csum into skb->csum in case of CHECKSUM_COMPLETE. 
+* @skb: pointer to skb +* @csum: csum to add +* Return: csum on success or negative error +*/ + BPF_FUNC_csum_update, + + /** +* bpf_set_hash_invalid(skb) +* Invalidate current skb>hash. +* @skb: pointer to skb +*/ + BPF_FUNC_set_hash_invalid, + __BPF_FUNC_MAX_ID, };
Re: [PATCH 7/8] tools lib bpf: fix maps resolution
Hi Eric, While testing this patch I found a segfault, please see the inline comment. In addition, since both the BPF map array and the map names should be set up after the symbol table is collected, merging bpf_object__init_maps and bpf_object__init_maps_name would be good practice and would make the code simpler. So I have prepared a new patch. Please have a look at: http://lkml.kernel.org/g/20161108215734.28905-1-wangn...@huawei.com The new version ensures it does not crash in any case where the user provides a corrupted maps section, including an array of bpf maps, maps with different definition structures, and a very short map definition. Thank you. On 2016/10/16 14:18, Eric Leblond wrote: It is not correct to assimilate the elf data of the maps section to an array of map definitions. In fact the sizes differ. The offset provided in the symbol section has to be used instead. This patch fixes a bug causing an ELF with two maps not to load correctly. Signed-off-by: Eric Leblond --- tools/lib/bpf/libbpf.c | 50 +++--- 1 file changed, 35 insertions(+), 15 deletions(-) diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index 1fe4532..f72628b 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -186,6 +186,7 @@ struct bpf_program { struct bpf_map { int fd; char *name; + size_t offset; struct bpf_map_def def; void *priv; bpf_map_clear_priv_t clear_priv; @@ -529,13 +530,6 @@ bpf_object__init_maps(struct bpf_object *obj, void *data, pr_debug("maps in %s: %zd bytes\n", obj->path, size); - obj->maps = calloc(nr_maps, sizeof(obj->maps[0])); - if (!obj->maps) { - pr_warning("alloc maps for object failed\n"); - return -ENOMEM; - } - obj->nr_maps = nr_maps; - for (i = 0; i < nr_maps; i++) { struct bpf_map_def *def = &obj->maps[i].def; @@ -547,23 +541,42 @@ bpf_object__init_maps(struct bpf_object *obj, void *data, obj->maps[i].fd = -1; /* Save map definition into obj->maps */ - *def = ((struct bpf_map_def *)data)[i]; + *def = *(struct bpf_map_def *)(data + obj->maps[i].offset); } Here, nr_maps is still size / 
sizeof(struct bpf_map_def), so obj->maps[i] can be invalid. return 0; } static int -bpf_object__init_maps_name(struct bpf_object *obj) +bpf_object__init_maps_symbol(struct bpf_object *obj) { int i; + int nr_maps = 0; Elf_Data *symbols = obj->efile.symbols; + size_t map_idx = 0; if (!symbols || obj->efile.maps_shndx < 0) return -EINVAL; + /* get the number of maps */ + for (i = 0; i < symbols->d_size / sizeof(GElf_Sym); i++) { + GElf_Sym sym; + + if (!gelf_getsym(symbols, i, &sym)) + continue; + if (sym.st_shndx != obj->efile.maps_shndx) + continue; + nr_maps++; + } + + obj->maps = calloc(nr_maps, sizeof(obj->maps[0])); + if (!obj->maps) { + pr_warning("alloc maps for object failed\n"); + return -ENOMEM; + } + obj->nr_maps = nr_maps; + for (i = 0; i < symbols->d_size / sizeof(GElf_Sym); i++) { GElf_Sym sym; - size_t map_idx; const char *map_name; if (!gelf_getsym(symbols, i, &sym)) @@ -574,12 +587,12 @@ bpf_object__init_maps_name(struct bpf_object *obj) map_name = elf_strptr(obj->efile.elf, obj->efile.strtabidx, sym.st_name); - map_idx = sym.st_value / sizeof(struct bpf_map_def); if (map_idx >= obj->nr_maps) { pr_warning("index of map \"%s\" is buggy: %zu > %zu\n", map_name, map_idx, obj->nr_maps); continue; } + obj->maps[map_idx].offset = sym.st_value; obj->maps[map_idx].name = strdup(map_name); if (!obj->maps[map_idx].name) { pr_warning("failed to alloc map name\n"); @@ -587,6 +600,7 @@ bpf_object__init_maps_name(struct bpf_object *obj) } pr_debug("map %zu is \"%s\"\n", map_idx, obj->maps[map_idx].name); + map_idx++; } return 0; } @@ -647,8 +661,6 @@ static int bpf_object__elf_collect(struct bpf_object *obj) data->d_buf, data->d_size); else if (strcmp(name, "maps") == 0) { - err = bpf_object__init_maps(obj, data->d_buf, - data->d_size); obj->efile.maps_shndx = idx; } else if (sh.sh_type == S
Re: [PATCH 7/8] tools lib bpf: fix maps resolution
Hi Eric,

Are you still working on this patch set? From a patch set by Joe Stringer I now understand why the maps section is not a simple array:

https://www.mail-archive.com/netdev@vger.kernel.org/msg135088.html

So I think this patch is really useful. Are you going to resend the whole patch set? If not, let me collect this patch 7/8 into my local code base and send it to Arnaldo with my other patches.

Thank you.

On 2016/10/17 5:18, Eric Leblond wrote:
It is not correct to treat the ELF data of the maps section as an array of map definitions; in fact the sizes differ. The offset provided in the symbol section has to be used instead. This patch fixes a bug causing an ELF file with two maps not to load correctly.

Signed-off-by: Eric Leblond
---
 tools/lib/bpf/libbpf.c | 50 +++---
 1 file changed, 35 insertions(+), 15 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 1fe4532..f72628b 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -186,6 +186,7 @@ struct bpf_program {
 struct bpf_map {
 	int fd;
 	char *name;
+	size_t offset;
 	struct bpf_map_def def;
 	void *priv;
 	bpf_map_clear_priv_t clear_priv;
@@ -529,13 +530,6 @@ bpf_object__init_maps(struct bpf_object *obj, void *data,
 	pr_debug("maps in %s: %zd bytes\n", obj->path, size);

-	obj->maps = calloc(nr_maps, sizeof(obj->maps[0]));
-	if (!obj->maps) {
-		pr_warning("alloc maps for object failed\n");
-		return -ENOMEM;
-	}
-	obj->nr_maps = nr_maps;
-
 	for (i = 0; i < nr_maps; i++) {
 		struct bpf_map_def *def = &obj->maps[i].def;
@@ -547,23 +541,42 @@ bpf_object__init_maps(struct bpf_object *obj, void *data,
 		obj->maps[i].fd = -1;

 		/* Save map definition into obj->maps */
-		*def = ((struct bpf_map_def *)data)[i];
+		*def = *(struct bpf_map_def *)(data + obj->maps[i].offset);
 	}
 	return 0;
 }

 static int
-bpf_object__init_maps_name(struct bpf_object *obj)
+bpf_object__init_maps_symbol(struct bpf_object *obj)
 {
 	int i;
+	int nr_maps = 0;
 	Elf_Data *symbols = obj->efile.symbols;
+	size_t map_idx = 0;

 	if (!symbols || obj->efile.maps_shndx < 0)
 		return -EINVAL;

+	/* get the number of maps */
+	for (i = 0; i < symbols->d_size / sizeof(GElf_Sym); i++) {
+		GElf_Sym sym;
+
+		if (!gelf_getsym(symbols, i, &sym))
+			continue;
+		if (sym.st_shndx != obj->efile.maps_shndx)
+			continue;
+		nr_maps++;
+	}
+
+	obj->maps = calloc(nr_maps, sizeof(obj->maps[0]));
+	if (!obj->maps) {
+		pr_warning("alloc maps for object failed\n");
+		return -ENOMEM;
+	}
+	obj->nr_maps = nr_maps;
+
 	for (i = 0; i < symbols->d_size / sizeof(GElf_Sym); i++) {
 		GElf_Sym sym;
-		size_t map_idx;
 		const char *map_name;

 		if (!gelf_getsym(symbols, i, &sym))
@@ -574,12 +587,12 @@ bpf_object__init_maps_name(struct bpf_object *obj)
 		map_name = elf_strptr(obj->efile.elf,
 				      obj->efile.strtabidx,
 				      sym.st_name);
-		map_idx = sym.st_value / sizeof(struct bpf_map_def);
 		if (map_idx >= obj->nr_maps) {
 			pr_warning("index of map \"%s\" is buggy: %zu > %zu\n",
 				   map_name, map_idx, obj->nr_maps);
 			continue;
 		}
+		obj->maps[map_idx].offset = sym.st_value;
 		obj->maps[map_idx].name = strdup(map_name);
 		if (!obj->maps[map_idx].name) {
 			pr_warning("failed to alloc map name\n");
@@ -587,6 +600,7 @@ bpf_object__init_maps_name(struct bpf_object *obj)
 		}
 		pr_debug("map %zu is \"%s\"\n", map_idx,
 			 obj->maps[map_idx].name);
+		map_idx++;
 	}
 	return 0;
 }
@@ -647,8 +661,6 @@ static int bpf_object__elf_collect(struct bpf_object *obj)
 					data->d_buf, data->d_size);
 		else if (strcmp(name, "maps") == 0) {
-			err = bpf_object__init_maps(obj, data->d_buf,
-						    data->d_size);
 			obj->efile.maps_shndx = idx;
 		} else if (sh.sh_type == SHT_SYMTAB) {
 			if (obj->efile.symbols) {
@@ -698,8 +710,16 @@ static int bpf_object__elf_collect(struct bpf_object *obj)
 		pr_warning("Corrupted ELF file: index of strtab invalid\n");
 		return LIBBPF_ERRNO__FORMAT;
 	}
-	if (obj->efile.
Re: [PATCH 1/8] tools lib bpf: add error functions
On 2016/10/19 6:52, Joe Stringer wrote:
On 16 October 2016 at 14:18, Eric Leblond wrote:
The include of err.h is not explicitly needed by the exported functions, and it was causing an include conflict with some existing code due to redefining some macros. To fix this, let's have error handling functions provided by the library. Furthermore this will allow users to have a homogeneous API.

Signed-off-by: Eric Leblond

Does it need to return the error like this, or should we just fix up the bpf_object__open() API to return errors in a simpler form? There's already libbpf_set_print(...) for outputting errors; is it reasonable to just change the library to return NULL in error cases instead?

Returning an error code to the caller lets the caller know what happened. Other subsystems in perf also do this. Perf hides libbpf's error output (it is silent unless -v), so it needs a way to receive libbpf's error code.

I think this patch is good: it decouples libbpf.h from kernel headers.

Thank you.
Re: [PATCH 7/8] tools lib bpf: fix maps resolution
On 2016/10/17 5:18, Eric Leblond wrote:
It is not correct to treat the ELF data of the maps section as an array of map definitions; in fact the sizes differ. The offset provided in the symbol section has to be used instead. This patch fixes a bug causing an ELF file with two maps not to load correctly.

Could you please give an example so we can understand why section 'maps' is not an array?

Thank you.
Re: [PATCH 5/8] tools lib bpf: add missing functions
On 2016/10/17 5:18, Eric Leblond wrote:
Some functions were missing in the library to be able to use it in the case where the userspace is handling the maps in the kernel. The patch also renames functions to have a homogeneous naming convention.

Signed-off-by: Eric Leblond
---
 tools/lib/bpf/bpf.c    | 35 ++-
 tools/lib/bpf/bpf.h    |  2 --
 tools/lib/bpf/libbpf.h |  5 +
 3 files changed, 39 insertions(+), 3 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 4212ed6..c0e07bd 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -25,6 +25,7 @@
 #include
 #include
 #include "bpf.h"
+#include "libbpf.h"

 /*
  * When building perf, unistd.h is overrided. __NR_bpf is
@@ -97,7 +98,7 @@ int bpf_load_program(enum bpf_prog_type type, struct bpf_insn *insns,
 	return sys_bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
 }

-int bpf_map_update_elem(int fd, void *key, void *value,
+int bpf_map__update_elem(int fd, void *key, void *value,
 		u64 flags)

Please don't use the '__' style API here. It is easily confused with the bpf_map__*() functions in libbpf.h. They are APIs at different levels: bpf_map__*() are APIs for 'struct bpf_map' objects, which are introduced by libbpf and defined in libbpf.h; bpf_map_*() APIs operate on an fd, which is an object defined by the kernel, and they are declared in bpf.h.

In libbpf, bpf.h directly operates on kernel objects (fds), so APIs in it are named bpf_map_*(); libbpf.h operates on 'struct bpf_map' objects, so APIs in it are named bpf_map__*(). libbpf.h and bpf.h are independent of each other.

 {
 	union bpf_attr attr;
@@ -110,3 +111,35 @@ int bpf_map_update_elem(int fd, void *key, void *value,
 	return sys_bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
 }
+
+int bpf_map__lookup_elem(int fd, void *key, void *value)
+{
+	union bpf_attr attr = {
+		.map_fd = fd,
+		.key = ptr_to_u64(key),
+		.value = ptr_to_u64(value),
+	};
+
+	return sys_bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
+}
+
+int bpf_map__delete_elem(int fd, void *key)
+{
+	union bpf_attr attr = {
+		.map_fd = fd,
+		.key = ptr_to_u64(key),
+	};
+
+	return sys_bpf(BPF_MAP_DELETE_ELEM, &attr, sizeof(attr));
+}
+
+int bpf_map__get_next_key(int fd, void *key, void *next_key)
+{
+	union bpf_attr attr = {
+		.map_fd = fd,
+		.key = ptr_to_u64(key),
+		.next_key = ptr_to_u64(next_key),
+	};
+
+	return sys_bpf(BPF_MAP_GET_NEXT_KEY, &attr, sizeof(attr));
+}
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index e8ba540..5ca834a 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -33,6 +33,4 @@ int bpf_load_program(enum bpf_prog_type type, struct bpf_insn *insns,
 		     u32 kern_version, char *log_buf,
 		     size_t log_buf_sz);
-int bpf_map_update_elem(int fd, void *key, void *value,
-			u64 flags);
 #endif
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index a18783b..dfb46d0 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -207,6 +207,11 @@ bpf_map__next(struct bpf_map *map, struct bpf_object *obj);
 int bpf_map__fd(struct bpf_map *map);
 const struct bpf_map_def *bpf_map__def(struct bpf_map *map);
 const char *bpf_map__name(struct bpf_map *map);
+int bpf_map__update_elem(int fd, void *key, void *value,
+			 uint64_t flags);
+int bpf_map__lookup_elem(int fd, void *key, void *value);
+int bpf_map__delete_elem(int fd, void *key);
+int bpf_map__get_next_key(int fd, void *key, void *next_key);

As we have discussed, the newly introduced functions should be added in bpf.h.

Thank you.
Re: [PATCH 6/8] tools lib bpf: improve warning
On 2016/10/17 5:18, Eric Leblond wrote:
Signed-off-by: Eric Leblond

Please add a commit message.

Thank you.

---
 tools/lib/bpf/libbpf.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 7cd341e..1fe4532 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -802,7 +802,8 @@ bpf_object__create_maps(struct bpf_object *obj)
 			size_t j;
 			int err = *pfd;

-			pr_warning("failed to create map: %s\n",
+			pr_warning("failed to create map (name: '%s'): %s\n",
+				   obj->maps[i].name,
 				   strerror(errno));
 			for (j = 0; j < i; j++)
 				zclose(obj->maps[j].fd);
Re: [PATCH 4/8] tools lib bpf: export function to set type
On 2016/10/17 5:18, Eric Leblond wrote:
The current API was not allowing the user to set a type like socket filter. To avoid a setter function for each type, the patch simply exports a set function that takes the type as a parameter.

Signed-off-by: Eric Leblond
---
 tools/lib/bpf/libbpf.c | 19 +--
 tools/lib/bpf/libbpf.h |  3 +++
 2 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 90932f1..7cd341e 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -1336,26 +1336,25 @@ int bpf_program__nth_fd(struct bpf_program *prog, int n)
 	return fd;
 }

-static void bpf_program__set_type(struct bpf_program *prog,
-				  enum bpf_prog_type type)
+int bpf_program__set_type(struct bpf_program *prog, unsigned int type)
 {
+	if (!prog)
+		return -EINVAL;
+	if (type >= __MAX_BPF_PROG_TYPE)
+		return -EINVAL;
 	prog->type = type;
+	return 0;
 }

 int bpf_program__set_tracepoint(struct bpf_program *prog)
 {
-	if (!prog)
-		return -EINVAL;
-	bpf_program__set_type(prog, BPF_PROG_TYPE_TRACEPOINT);
-	return 0;
+	return bpf_program__set_type(prog, BPF_PROG_TYPE_TRACEPOINT);
 }

 int bpf_program__set_kprobe(struct bpf_program *prog)
 {
-	if (!prog)
-		return -EINVAL;
-	bpf_program__set_type(prog, BPF_PROG_TYPE_KPROBE);
-	return 0;
+	return bpf_program__set_type(prog, BPF_PROG_TYPE_KPROBE);
 }

 static bool bpf_program__is_type(struct bpf_program *prog,
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index e40c8d3..a18783b 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -173,6 +173,9 @@ int bpf_program__set_kprobe(struct bpf_program *prog);
 bool bpf_program__is_tracepoint(struct bpf_program *prog);
 bool bpf_program__is_kprobe(struct bpf_program *prog);

+int bpf_program__set_type(struct bpf_program *prog,
+			  unsigned int type);
+

Although you don't include uapi/linux/bpf.h in this patch, logically you add this dependency.

Please also add bpf_program__set_socket_filter() and bpf_program__is_socket_filter(), like what we do for tracepoint. This way libbpf.h stays independent of the kernel header. We can use a macro in both the .h and the .c.

Thank you.
Re: [PATCH 3/8] tools: Sync tools/include/uapi/linux/bpf.h with the kernel
On 2016/10/17 5:18, Eric Leblond wrote:
Signed-off-by: Eric Leblond

A commit message is required.

Thank you.

---
 tools/include/uapi/linux/bpf.h | 52 ++
 1 file changed, 52 insertions(+)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 9e5fc16..570287f 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -95,6 +95,8 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_SCHED_ACT,
 	BPF_PROG_TYPE_TRACEPOINT,
 	BPF_PROG_TYPE_XDP,
+	BPF_PROG_TYPE_PERF_EVENT,
+	__MAX_BPF_PROG_TYPE,
 };

 #define BPF_PSEUDO_MAP_FD	1
@@ -375,6 +377,56 @@ enum bpf_func_id {
 	 */
 	BPF_FUNC_probe_write_user,

+	/**
+	 * bpf_current_task_under_cgroup(map, index) - Check cgroup2 membership of current task
+	 * @map: pointer to bpf_map in BPF_MAP_TYPE_CGROUP_ARRAY type
+	 * @index: index of the cgroup in the bpf_map
+	 * Return:
+	 *   == 0 current failed the cgroup2 descendant test
+	 *   == 1 current succeeded the cgroup2 descendant test
+	 *    < 0 error
+	 */
+	BPF_FUNC_current_task_under_cgroup,
+
+	/**
+	 * bpf_skb_change_tail(skb, len, flags)
+	 * The helper will resize the skb to the given new size,
+	 * to be used f.e. with control messages.
+	 * @skb: pointer to skb
+	 * @len: new skb length
+	 * @flags: reserved
+	 * Return: 0 on success or negative error
+	 */
+	BPF_FUNC_skb_change_tail,
+
+	/**
+	 * bpf_skb_pull_data(skb, len)
+	 * The helper will pull in non-linear data in case the
+	 * skb is non-linear and not all of len are part of the
+	 * linear section. Only needed for read/write with direct
+	 * packet access.
+	 * @skb: pointer to skb
+	 * @len: len to make read/writeable
+	 * Return: 0 on success or negative error
+	 */
+	BPF_FUNC_skb_pull_data,
+
+	/**
+	 * bpf_csum_update(skb, csum)
+	 * Adds csum into skb->csum in case of CHECKSUM_COMPLETE.
+	 * @skb: pointer to skb
+	 * @csum: csum to add
+	 * Return: csum on success or negative error
+	 */
+	BPF_FUNC_csum_update,
+
+	/**
+	 * bpf_set_hash_invalid(skb)
+	 * Invalidate current skb->hash.
+	 * @skb: pointer to skb
+	 */
+	BPF_FUNC_set_hash_invalid,
+
 	__BPF_FUNC_MAX_ID,
 };
Re: [PATCH 1/8] tools lib bpf: add error functions
On 2016/10/17 5:18, Eric Leblond wrote:
The include of err.h is not explicitly needed by the exported functions, and it was causing an include conflict with some existing code due to redefining some macros. To fix this, let's have error handling functions provided by the library. Furthermore this will allow users to have a homogeneous API.

Signed-off-by: Eric Leblond
---
 tools/lib/bpf/libbpf.c | 11 +++
 tools/lib/bpf/libbpf.h |  4 +++-
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index b699aea..90932f1 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -31,6 +31,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -1447,3 +1448,13 @@ bpf_object__find_map_by_name(struct bpf_object *obj, const char *name)
 	}
 	return NULL;
 }
+
+bool bpf__is_error(const void *ptr)

Please name it libbpf_is_error(), like libbpf_set_print(). We use '__' because we want to use the OO concept, and this utility is not OO.

+{
+	return IS_ERR(ptr);
+}
+
+long bpf__get_error(const void *ptr)

Same here, please call it libbpf_get_error().

Thank you.
Re: [PATCH net-next 2/2] tools lib bpf: export function to set type
On 2016/8/13 20:17, Eric Leblond wrote:
The current API was not allowing the user to set a type like socket filter. To avoid a setter function for each type, the patch simply exports a set function that takes the type as a parameter.

Signed-off-by: Eric Leblond
---
 tools/lib/bpf/libbpf.c | 15 ++-
 tools/lib/bpf/libbpf.h |  3 +++
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 7872ff6..ff2a8c6 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -1336,26 +1336,23 @@ int bpf_program__nth_fd(struct bpf_program *prog, int n)
 	return fd;
 }

-static void bpf_program__set_type(struct bpf_program *prog,
+int bpf_program__set_type(struct bpf_program *prog,
 			  enum bpf_prog_type type)
 {
+	if (!prog)
+		return -EINVAL;
 	prog->type = type;
+	return 0;
 }

We'd better add a sanity check here to prevent setting an invalid program type. Please have a look at https://lkml.org/lkml/2015/12/17/13

 int bpf_program__set_tracepoint(struct bpf_program *prog)
 {
-	if (!prog)
-		return -EINVAL;
-	bpf_program__set_type(prog, BPF_PROG_TYPE_TRACEPOINT);
-	return 0;
+	return bpf_program__set_type(prog, BPF_PROG_TYPE_TRACEPOINT);
 }

 int bpf_program__set_kprobe(struct bpf_program *prog)
 {
-	if (!prog)
-		return -EINVAL;
-	bpf_program__set_type(prog, BPF_PROG_TYPE_KPROBE);
-	return 0;
+	return bpf_program__set_type(prog, BPF_PROG_TYPE_KPROBE);
 }

 static bool bpf_program__is_type(struct bpf_program *prog,
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index a6c5cde..6a84d7a 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -173,6 +173,9 @@ int bpf_program__set_kprobe(struct bpf_program *prog);
 bool bpf_program__is_tracepoint(struct bpf_program *prog);
 bool bpf_program__is_kprobe(struct bpf_program *prog);

+int bpf_program__set_type(struct bpf_program *prog,
+			  enum bpf_prog_type type);
+

This makes libbpf.h depend on the kernel's uapi/linux/bpf.h. Although I did this in the above patch, I don't like this dependency. This is the reason why I introduced

 int bpf_program__set_tracepoint(struct bpf_program *prog);
 int bpf_program__set_kprobe(struct bpf_program *prog);
 bool bpf_program__is_tracepoint(struct bpf_program *prog);
 bool bpf_program__is_kprobe(struct bpf_program *prog);

and hid bpf_program__set_type() inside the .c file. Please add type setting functions, or pass an int value to bpf_program__set_type() and add sanity checking.

Thank you.
Re: [PATCH net-next 1/2] tools lib bpf: suppress useless include
On 2016/8/13 20:17, Eric Leblond wrote:
The include of err.h is not explicitly needed by the exported functions, and it was causing an include conflict with some existing code due to redefining some macros.

Signed-off-by: Eric Leblond
---
 tools/lib/bpf/libbpf.c | 1 +
 tools/lib/bpf/libbpf.h | 1 -
 2 files changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index b699aea..7872ff6 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -31,6 +31,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index dd7a513..a6c5cde 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -23,7 +23,6 @@
 #include
 #include
-#include

Functions declared in this header require PTR_ERR to decode the error number. Logically speaking, this header depends on linux/err.h, so this include is required. If not, there must be another way for callers to decode error numbers.

I know the problem you are trying to solve. Please have a look at this thread: https://lkml.org/lkml/2015/12/17/20

At that time I thought we should create a wrapper in libbpf.h like this:

 #ifdef USE_LINUX_ERR
 #include
 #endif
 ...
 int libbpf_ptr_err(void *ptr);
 ...

It is acceptable but ugly. Can you find a better method?

Thank you.

 enum libbpf_errno {
 	__LIBBPF_ERRNO__START = 4000,
Re: [PATCH net-next 0/2] libbpf: minor fix and API update
I think you should cc linux-ker...@vger.kernel.org for code in tools/.

Thank you.

On 2016/8/13 20:17, Eric Leblond wrote:
Hello,

Here's a small patchset on libbpf fixing two issues I've encountered when adding some eBPF related features to Suricata.

Patchset statistics:
 tools/lib/bpf/libbpf.c | 16 +++-
 tools/lib/bpf/libbpf.h |  4 +++-
 2 files changed, 10 insertions(+), 10 deletions(-)

BR,
--
Eric Leblond
Re: [RFC PATCH net-next 0/4] perf tools: Support receiving output through BPF programs
On 2015/10/28 18:55, Wang Nan wrote:
Alexei provided a patchset to allow BPF programs to output data to the ring buffer using the helper bpf_perf_event_output() [1]; it has been merged into net-next as commit a43eec304259a6c637f4014a6d4767159b6a3aa3 ("bpf: introduce bpf_perf_event_output() helper"). This patchset introduces the perf-side code to utilize that helper.

This patchset only supports outputting data to CTF. That is enough for me, because my workflow is 'perf record' -> 'convert to CTF' -> 'python'. However, I know some people rely heavily on 'perf script' to parse traces in perf.data, so here I'd like to discuss the way to show output data using 'perf script'. Currently the only way to watch the output data using perf script is to use '-D'.

[1] http://lkml.kernel.org/r/1445396556-4854-1-git-send-email-...@kernel.org

This patchset is based on my perf eBPF support patches. Workable code can be found on github:

https://github.com/WangNan0/linux/commits/ebpf

It should be merged after net-next is merged into mainline. I post the patches now for demonstration and discussion.

Thank you.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 net-next] bpf: fix bpf_perf_event_read() helper
On 2015/10/26 20:32, Peter Zijlstra wrote:
On Sun, Oct 25, 2015 at 09:23:36AM -0700, Alexei Starovoitov wrote:
bpf_perf_event_read() muxes -EINVAL into the return value, but it's non-ambiguous to the program whether it got an error or a real counter value.

How can that be? The (u64)-EINVAL value is a valid counter value... unlikely maybe, but still quite possible.

In our real use case we simply treat return values larger than 0x7fff as error results. We can make it even larger, for example, to 0x. Resulting values can be pre-processed by a script to filter potential error results out, so it is not a very big problem for our real use cases.

For a better interface, I suggest:

 u64 bpf_perf_event_read(bool *perror);

which still returns the counter value through its return value but puts the error code on the stack. A BPF program can then pass NULL to the function if it doesn't want to deal with the error code.

Thank you.
Re: [PATCH net-next 2/3] bpf: introduce bpf_perf_event_output() helper
On 2015/10/24 1:25, Alexei Starovoitov wrote:
On 10/23/15 9:42 AM, Peter Zijlstra wrote:
On Fri, Oct 23, 2015 at 08:02:00AM -0700, Alexei Starovoitov wrote:
On 10/23/15 7:39 AM, Peter Zijlstra wrote:
On Tue, Oct 20, 2015 at 08:02:34PM -0700, Alexei Starovoitov wrote:

+static const struct bpf_func_proto bpf_perf_event_output_proto = {
+	.func		= bpf_perf_event_output,
+	.gpl_only	= false,

Oh?

no particular reason. The key helper bpf_probe_read() is GPL, so all bpf tracing progs have to be GPL. If you feel strongly about it, I can change it.

All the perf symbols are export GPL, so I suppose this should be true.

ok. will send a patch.

Can we (or have we already) set up some rules for licensing? Which parts should be GPL? Who has the responsibility to decide?

Thank you.
Re: [PATCH v3 net-next] bpf: fix bpf_perf_event_read() helper
On 2015/10/23 8:10, Alexei Starovoitov wrote:
Fix safety checks for bpf_perf_event_read():
- only non-inherited events can be added to perf_event_array map (do this check statically at map insertion time)
- dynamically check that event is local and !pmu->count
Otherwise a buggy bpf program can cause a kernel splat. Also fix the error path after perf_event_attrs() and remove a redundant 'extern'.

Fixes: 35578d798400 ("bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter")
Signed-off-by: Alexei Starovoitov

Tested-by: Wang Nan
Re: [PATCH v3 net-next] bpf: fix bpf_perf_event_read() helper
On 2015/10/23 8:10, Alexei Starovoitov wrote:
Fix safety checks for bpf_perf_event_read():
- only non-inherited events can be added to perf_event_array map (do this check statically at map insertion time)
- dynamically check that event is local and !pmu->count
Otherwise a buggy bpf program can cause a kernel splat. Also fix the error path after perf_event_attrs() and remove a redundant 'extern'.

Fixes: 35578d798400 ("bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter")
Signed-off-by: Alexei Starovoitov
---
v2->v3:
. refactor checks based on Wangnan's and Peter's feedback
while refactoring realized that these two issues need fixes as well:
. fix perf_event_attrs() error path
. remove redundant extern
v1->v2: fix compile in case of !CONFIG_PERF_EVENTS

Even in the worst case the crash is not possible. Only warn_on_once, so imo net-next is ok.

 include/linux/bpf.h      |  1 -
 kernel/bpf/arraymap.c    | 25 -
 kernel/trace/bpf_trace.c |  7 ++-
 3 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index e3a51b74e275..75718fa28260 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -194,7 +194,6 @@
 extern const struct bpf_func_proto bpf_map_lookup_elem_proto;
 extern const struct bpf_func_proto bpf_map_update_elem_proto;
 extern const struct bpf_func_proto bpf_map_delete_elem_proto;
-extern const struct bpf_func_proto bpf_perf_event_read_proto;
 extern const struct bpf_func_proto bpf_get_prandom_u32_proto;
 extern const struct bpf_func_proto bpf_get_smp_processor_id_proto;
 extern const struct bpf_func_proto bpf_tail_call_proto;
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index e3cfe46b074f..3f4c99e06c6b 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -292,16 +292,23 @@ static void *perf_event_fd_array_get_ptr(struct bpf_map *map, int fd)
 	attr = perf_event_attrs(event);
 	if (IS_ERR(attr))
-		return (void *)attr;
+		goto err;

-	if (attr->type != PERF_TYPE_RAW &&
-	    !(attr->type == PERF_TYPE_SOFTWARE &&
-	      attr->config == PERF_COUNT_SW_BPF_OUTPUT) &&
-	    attr->type != PERF_TYPE_HARDWARE) {
-		perf_event_release_kernel(event);
-		return ERR_PTR(-EINVAL);
-	}
-	return event;
+	if (attr->inherit)
+		goto err;
+

Since Peter suggested it is pointless for a system-wide perf_event to have the inherit bit set [1], I think it should be safe to let a system-wide perf_event pass this check? I'll check the code to make sure.

[1] http://lkml.kernel.org/r/20151022124142.gq17...@twins.programming.kicks-ass.net

+	if (attr->type == PERF_TYPE_RAW)
+		return event;
+
+	if (attr->type == PERF_TYPE_HARDWARE)
+		return event;
+
+	if (attr->type == PERF_TYPE_SOFTWARE &&
+	    attr->config == PERF_COUNT_SW_BPF_OUTPUT)
+		return event;
+err:
+	perf_event_release_kernel(event);
+	return ERR_PTR(-EINVAL);
 }

 static void perf_event_fd_array_put_ptr(void *ptr)
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 47febbe7998e..003df3887287 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -199,6 +199,11 @@ static u64 bpf_perf_event_read(u64 r1, u64 index, u64 r3, u64 r4, u64 r5)
 	if (!event)
 		return -ENOENT;

+	/* make sure event is local and doesn't have pmu::count */
+	if (event->oncpu != smp_processor_id() ||
+	    event->pmu->count)
+		return -EINVAL;
+
 	/*
 	 * we don't know if the function is run successfully by the
 	 * return value. It can be judged in other places, such as
@@ -207,7 +212,7 @@ static u64 bpf_perf_event_read(u64 r1, u64 index, u64 r3, u64 r4, u64 r5)
 	return perf_event_read_local(event);
 }

-const struct bpf_func_proto bpf_perf_event_read_proto = {
+static const struct bpf_func_proto bpf_perf_event_read_proto = {
 	.func		= bpf_perf_event_read,
 	.gpl_only	= false,
 	.ret_type	= RET_INTEGER,
Re: [PATCH v2 net-next] bpf: fix bpf_perf_event_read() helper
On 2015/10/22 6:58, Alexei Starovoitov wrote:
[SNIP]
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index e3cfe46b074f..75529cc94304 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -294,10 +294,11 @@ static void *perf_event_fd_array_get_ptr(struct bpf_map *map, int fd)
 	if (IS_ERR(attr))
 		return (void *)attr;

-	if (attr->type != PERF_TYPE_RAW &&
-	    !(attr->type == PERF_TYPE_SOFTWARE &&
-	      attr->config == PERF_COUNT_SW_BPF_OUTPUT) &&
-	    attr->type != PERF_TYPE_HARDWARE) {
+	if ((attr->type != PERF_TYPE_RAW &&
+	     !(attr->type == PERF_TYPE_SOFTWARE &&
+	       attr->config == PERF_COUNT_SW_BPF_OUTPUT) &&
+	     attr->type != PERF_TYPE_HARDWARE) ||
+	    attr->inherit) {
 		perf_event_release_kernel(event);
 		return ERR_PTR(-EINVAL);
 	}

I have a question on inherit, not related to this patch: is it safe for perf to disable attr->inherit if the event is system wide? I haven't read the related code completely. To my current knowledge, the behavior of a system-wide perf event should be the same whether inherit is set or not. Is that true?

Thank you.
Re: [PATCH V5 1/1] bpf: control events stored in PERF_EVENT_ARRAY maps trace data output when perf sampling
On 2015/10/22 17:06, Peter Zijlstra wrote:
On Wed, Oct 21, 2015 at 02:19:49PM -0700, Alexei Starovoitov wrote:
Urgh, that's still horridly inconsistent. Can we please come up with a consistent interface to perf?

My suggestion was to do ioctl(enable/disable) of events from userspace after receiving a notification from the kernel via my bpf_perf_event_output() helper. Wangnan's argument was that display refresh happens often and it's fast, so the time taken by user space to enable events on all cpus is too slow, and the ioctl does an IPI and disturbs other cpus even more. So soft_disable done by the program to enable/disable particular events on all cpus kinda makes sense.

And this all makes me think I still have no clue what you're all trying to do here. Who cares about display updates and why? And why should there be an active userspace part to eBPF programs?

So you want the background story? OK, let me describe it. This mail is not short, so please be patient.

On a smartphone, if the time between two frames is longer than 16ms, the user can notice it. This is a display glitch. We want to check those glitches with perf to find what causes them. The basic idea is: use 'perf record' to collect enough information, find the samples generated just before the glitch from perf.data, then analyze them offline.

Much work needs to be done before perf can support such analysis. One important thing is to reduce the overhead of perf, to avoid perf itself becoming the cause of glitches. We could do this by reducing the number of events perf collects, but then we won't have enough information to analyze when a glitch happens. Another way, which we are trying to implement now, is to dynamically turn events on and off, or at least enable/disable sampling dynamically, because the overhead of copying samples is a big part of perf's total overhead. After that we can trace as many events as possible, but only fetch data from them when we detect a glitch.
A BPF program is the best choice to describe our relatively complex glitch detection model. For simplicity, assume our glitch detection model contains three points in user programs: point A means display refreshing begins, point B means display refreshing has lasted for 16ms, and point C means display refreshing finished. They can be found in user programs and probed through uprobes, to which we can attach BPF programs.

Then we want to use perf this way: sample the 'cycles' event to collect callgraphs, so we know what all CPUs are doing during the refreshing, but only start sampling when we detect a frame refresh lasting 16ms. We can translate the above logic into BPF programs: at point B, enable the 'cycles' event to generate samples; at point C, disable the 'cycles' event to avoid generating useless samples. Then make 'perf record -e cycles' collect call graphs and other information through the cycles event. From perf.data, we can use 'perf script' and other tools to analyze the resulting data.

We have considered some potential solutions and found them inappropriate, or requiring too much work:

1. As you may prefer, create BPF functions to call pmu->stop()/pmu->start() for the perf event on the CPU on which the BPF program is triggered. The shortcoming of this method is that we can only turn on the perf event on the CPU that executes point B; we are unable to know what the other CPUs are doing during the glitch, but what we want is system-wide information. In addition, point C and point B are not necessarily executed on the same core, so we may shut down the wrong event if the scheduler decides to run point C on another core.

2. As Alexei suggests, output something through his bpf_perf_event_output() and let perf disable and enable those events using ioctl from userspace. This is a good idea, but it introduces a synchronization problem: bpf_perf_event_output() writes to perf's ring buffer, but perf only notices the message when epoll_wait() returns.
We tested a tracepoint event and found that an event may be seen by perf several seconds after it is generated. One solution is to use --no-buffer, but that still requires perf to be scheduled on time and to parse its ring buffer quickly. Also, please note that point C can appear very shortly after point B, because some apps are optimized to make their refresh time very close to 16ms.

This is the background story. It looks like either we implement some cross-CPU controlling, or we settle for coarse-grained controlling. The best method we can think of is to use an atomic operation to soft enable/disable perf events. We believe it is simple enough and won't cause problems.

Thank you.
Re: [PATCH V5 1/1] bpf: control events stored in PERF_EVENT_ARRAY maps trace data output when perf sampling
On 2015/10/22 15:39, Ingo Molnar wrote:
* Wangnan (F) wrote:
[SNIP]
In summary, your either-or logic doesn't hold in the BPF world. A BPF program can only access a perf event in a highly restricted way. We don't allow it to call perf_event_read_local() across cores, so it can't.

Urgh, that's still horridly inconsistent. Can we please come up with a consistent interface to perf?

BPF programs and kernel modules are two different worlds, as I said before. I don't think making them share a common interface is a good idea, because such sharing would give BPF programs more freedom than they really need, and then it would be hard to prevent them from doing something bad. If we really need a kernel interface, I think what we need is a kernel module, not a BPF program.

What do you mean, as this does not parse for me.

Because I'm not very sure what "inconsistent" means in Peter's words... I think what Peter wants us to do is provide a similar (consistent) interface between the kernel and eBPF: if the kernel reads from a perf_event through perf_event_read_local(struct perf_event *), a BPF program should do this work with similar code, or at least similar logic, so we would need to create a handle for a perf event, provide a BPF function called BPF_FUNC_perf_event_read_local, and pass such a handle to it.

I don't think so, because if we want a kernel interface we'd better use a kernel module, not eBPF; that is why I mentioned kernel modules. Ingo, do you think the BPF interface should be *consistent* with anything?

Thank you.

We obviously can (and very likely should) make certain perf functionality available to BPF programs. It should still be a well defined yet flexible interface, with safe behavior, obviously - all in line with existing BPF sandboxing principles. 'Kernel modules' don't enter this consideration at all, not sure why you mention them - all this functionality is also available if CONFIG_MODULES is turned off completely.
Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 net-next] bpf: fix bpf_perf_event_read() helper
On 2015/10/22 14:21, Alexei Starovoitov wrote: On 10/21/15 10:31 PM, Wangnan (F) wrote: +if ((attr->type != PERF_TYPE_RAW && + !(attr->type == PERF_TYPE_SOFTWARE && + attr->config == PERF_COUNT_SW_BPF_OUTPUT) && + attr->type != PERF_TYPE_HARDWARE) || +attr->inherit) { This 'if' statement is so complex. What about using an inline function instead? hmm. don't see how inline function will help readability. For example (not tested):

static inline bool perf_event_can_insert_to_map(struct perf_event_attr *attr)
{
	/* is inherit? */
	if (attr->inherit)
		return false;

	/* is software event? */
	if (attr->type == PERF_TYPE_SOFTWARE) {
		if (attr->config == PERF_COUNT_SW_BPF_OUTPUT)
			return true;
		else
			return false;
	}

	/* raw and hardware events are allowed */
	if (attr->type == PERF_TYPE_RAW)
		return true;
	if (attr->type == PERF_TYPE_HARDWARE)
		return true;
	return false;
}

...
	if (!perf_event_can_insert_to_map(attr))

Do you think readability is improved? Thank you.
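For what it's worth, the proposed helper is easy to sanity-check in userspace. Below is a compilable sketch of the same truth table; note this is not kernel code: struct mock_event_attr and the constant values are mock stand-ins for the kernel's struct perf_event_attr and uapi constants (PERF_COUNT_SW_BPF_OUTPUT == 10 per the patch in this thread).

```c
#include <stdbool.h>

/* Mock stand-ins for the kernel uapi definitions, for illustration only. */
#define PERF_TYPE_HARDWARE	0
#define PERF_TYPE_SOFTWARE	1
#define PERF_TYPE_RAW		4
#define PERF_COUNT_SW_BPF_OUTPUT 10

struct mock_event_attr {
	unsigned int type;
	unsigned long long config;
	unsigned int inherit;
};

static bool perf_event_can_insert_to_map(const struct mock_event_attr *attr)
{
	/* inherited events cannot be read locally, reject them */
	if (attr->inherit)
		return false;

	/* of the software events, only BPF output is allowed */
	if (attr->type == PERF_TYPE_SOFTWARE)
		return attr->config == PERF_COUNT_SW_BPF_OUTPUT;

	/* raw and hardware events are allowed */
	return attr->type == PERF_TYPE_RAW ||
	       attr->type == PERF_TYPE_HARDWARE;
}
```

Each branch maps one-to-one onto a clause of the patched compound condition, which is the readability argument in a nutshell.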
Re: [PATCH v2 net-next] bpf: fix bpf_perf_event_read() helper
On 2015/10/22 6:58, Alexei Starovoitov wrote:

Fix safety checks for bpf_perf_event_read():
- only non-inherited events can be added to perf_event_array map (do this check statically at map insertion time)
- dynamically check that event is local and !pmu->count
Otherwise buggy bpf program can cause kernel splat.

Fixes: 35578d798400 ("bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter")
Signed-off-by: Alexei Starovoitov
---
v1->v2: fix compile in case of !CONFIG_PERF_EVENTS
This patch is on top of http://patchwork.ozlabs.org/patch/533585/ to avoid conflicts. Even in the worst case the crash is not possible. Only warn_on_once, so imo net-next is ok.

 kernel/bpf/arraymap.c |  9 +
 kernel/events/core.c  | 16 ++--
 2 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index e3cfe46b074f..75529cc94304 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -294,10 +294,11 @@ static void *perf_event_fd_array_get_ptr(struct bpf_map *map, int fd)
 	if (IS_ERR(attr))
 		return (void *)attr;

-	if (attr->type != PERF_TYPE_RAW &&
-	    !(attr->type == PERF_TYPE_SOFTWARE &&
-	      attr->config == PERF_COUNT_SW_BPF_OUTPUT) &&
-	    attr->type != PERF_TYPE_HARDWARE) {
+	if ((attr->type != PERF_TYPE_RAW &&
+	     !(attr->type == PERF_TYPE_SOFTWARE &&
+	       attr->config == PERF_COUNT_SW_BPF_OUTPUT) &&
+	     attr->type != PERF_TYPE_HARDWARE) ||
+	    attr->inherit) {

This 'if' statement is so complex. What about using an inline function instead? Thank you.
Re: [PATCH v2 net-next] bpf: fix bpf_perf_event_read() helper
On 2015/10/22 13:00, Alexei Starovoitov wrote: On 10/21/15 9:49 PM, Wangnan (F) wrote: After applying this patch I'm unable to use perf passing a perf_event again like this: please do not top post and trim your replies.

# perf record -a -e evt=cycles -e ./test_config_map.c/maps.pmu_map.event=evt/ --exclude-perf ls

With -v it outputs:
...
adding perf_bpf_probe:func_write
adding perf_bpf_probe:func_write to 0x367d6a0
add bpf event perf_bpf_probe:func_write_return and attach bpf program 6
adding perf_bpf_probe:func_write_return
adding perf_bpf_probe:func_write_return to 0x3a7fc40
mmap size 528384B
ERROR: failed to insert value to pmu_map[0]
ERROR: Apply config to BPF failed: Invalid option for map, add -v to see detail
Opening /sys/kernel/debug/tracing//kprobe_events write=
...

Looks like perf sets attr.inherit for cycles? I'll look into this problem. yes. that's perf default. How did it even work before?! I was testing with your samples/bpf/tracex6 that sets inherit to zero. Tested the perf record -i option and it works for me:

# echo "" > /sys/kernel/debug/tracing/trace
# perf record -i -a -e evt=cycles -e ./test_config_map.c/maps.pmu_map.event=evt/ --exclude-perf ls
# cat /sys/kernel/debug/tracing/trace | grep ls
ls-8227 [001] dN.. 2526.184611: : pmu inc: 82270
ls-8227 [001] dN.. 2526.184626: : pmu inc: 40951
ls-8227 [001] dN.. 2526.184642: : pmu inc: 50659
ls-8227 [001] dN.. 2526.184657: : pmu inc: 43511
ls-8227 [001] dN.. 2526.184675: : pmu inc: 56921
...

And no warning message is found in dmesg. So I think your fix is good; we should improve perf. Thank you.
Re: [PATCH v2 net-next] bpf: fix bpf_perf_event_read() helper
After applying this patch I'm unable to use perf passing a perf_event again like this:

# perf record -a -e evt=cycles -e ./test_config_map.c/maps.pmu_map.event=evt/ --exclude-perf ls

With -v it outputs:
...
adding perf_bpf_probe:func_write
adding perf_bpf_probe:func_write to 0x367d6a0
add bpf event perf_bpf_probe:func_write_return and attach bpf program 6
adding perf_bpf_probe:func_write_return
adding perf_bpf_probe:func_write_return to 0x3a7fc40
mmap size 528384B
ERROR: failed to insert value to pmu_map[0]
ERROR: Apply config to BPF failed: Invalid option for map, add -v to see detail
Opening /sys/kernel/debug/tracing//kprobe_events write=
...

Looks like perf sets attr.inherit for cycles? I'll look into this problem. Thank you.

On 2015/10/22 6:58, Alexei Starovoitov wrote: Fix safety checks for bpf_perf_event_read(): - only non-inherited events can be added to perf_event_array map (do this check statically at map insertion time) - dynamically check that event is local and !pmu->count Otherwise buggy bpf program can cause kernel splat. Fixes: 35578d798400 ("bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter") Signed-off-by: Alexei Starovoitov --- v1->v2: fix compile in case of !CONFIG_PERF_EVENTS This patch is on top of http://patchwork.ozlabs.org/patch/533585/ to avoid conflicts. Even in the worst case the crash is not possible. Only warn_on_once, so imo net-next is ok.
 kernel/bpf/arraymap.c |  9 +
 kernel/events/core.c  | 16 ++--
 2 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index e3cfe46b074f..75529cc94304 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -294,10 +294,11 @@ static void *perf_event_fd_array_get_ptr(struct bpf_map *map, int fd)
 	if (IS_ERR(attr))
 		return (void *)attr;

-	if (attr->type != PERF_TYPE_RAW &&
-	    !(attr->type == PERF_TYPE_SOFTWARE &&
-	      attr->config == PERF_COUNT_SW_BPF_OUTPUT) &&
-	    attr->type != PERF_TYPE_HARDWARE) {
+	if ((attr->type != PERF_TYPE_RAW &&
+	     !(attr->type == PERF_TYPE_SOFTWARE &&
+	       attr->config == PERF_COUNT_SW_BPF_OUTPUT) &&
+	     attr->type != PERF_TYPE_HARDWARE) ||
+	    attr->inherit) {
 		perf_event_release_kernel(event);
 		return ERR_PTR(-EINVAL);
 	}
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 64754bfecd70..0b6333265872 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3258,7 +3258,7 @@ static inline u64 perf_event_count(struct perf_event *event)
 u64 perf_event_read_local(struct perf_event *event)
 {
 	unsigned long flags;
-	u64 val;
+	u64 val = -EINVAL;

 	/*
 	 * Disabling interrupts avoids all counter scheduling (context
@@ -3267,12 +3267,14 @@ u64 perf_event_read_local(struct perf_event *event)
 	local_irq_save(flags);

 	/* If this is a per-task event, it must be for current */
-	WARN_ON_ONCE((event->attach_state & PERF_ATTACH_TASK) &&
-		     event->hw.target != current);
+	if ((event->attach_state & PERF_ATTACH_TASK) &&
+	    event->hw.target != current)
+		goto out;

 	/* If this is a per-CPU event, it must be for this CPU */
-	WARN_ON_ONCE(!(event->attach_state & PERF_ATTACH_TASK) &&
-		     event->cpu != smp_processor_id());
+	if (!(event->attach_state & PERF_ATTACH_TASK) &&
+	    event->cpu != smp_processor_id())
+		goto out;

 	/*
 	 * It must not be an event with inherit set, we cannot read
@@ -3284,7 +3286,8 @@ u64 perf_event_read_local(struct perf_event *event)
 	 * It must not have a pmu::count method, those are not
 	 * NMI safe.
 	 */
-	WARN_ON_ONCE(event->pmu->count);
+	if (event->pmu->count)
+		goto out;

 	/*
 	 * If the event is currently on this CPU, its either a per-task event,
@@ -3295,6 +3298,7 @@ u64 perf_event_read_local(struct perf_event *event)
 		event->pmu->read(event);

 	val = local64_read(&event->count);
+out:
 	local_irq_restore(flags);

 	return val;
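To see what the kernel/events/core.c change buys, here is a compilable userspace sketch of the same control flow. This is not the kernel's real code: struct mock_event, mock_read_local() and the field names are simplified stand-ins. The point is that each safety check now bails out gracefully via goto out and the caller reads back -EINVAL as the count, instead of triggering a WARN_ON_ONCE splat.

```c
#include <stdint.h>
#include <stdbool.h>

/* Simplified stand-ins for the kernel structures, for illustration only. */
struct mock_event {
	bool per_task;        /* PERF_ATTACH_TASK set?             */
	int  target_cpu;      /* event->cpu for per-CPU events     */
	bool has_pmu_count;   /* event->pmu->count method present? */
	uint64_t count;       /* local64_read(&event->count)       */
};

static uint64_t mock_read_local(const struct mock_event *ev,
				int this_cpu, bool on_current_task)
{
	uint64_t val = (uint64_t)-22;	/* -EINVAL, as in the patch */

	/* per-task event: must be reading on behalf of current */
	if (ev->per_task && !on_current_task)
		goto out;
	/* per-CPU event: must be reading on the owning CPU */
	if (!ev->per_task && ev->target_cpu != this_cpu)
		goto out;
	/* pmu->count methods are not NMI safe, refuse them */
	if (ev->has_pmu_count)
		goto out;

	val = ev->count;	/* all checks passed: read the counter */
out:
	return val;
}
```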
Re: [PATCH V5 1/1] bpf: control events stored in PERF_EVENT_ARRAY maps trace data output when perf sampling
On 2015/10/22 11:09, Alexei Starovoitov wrote: On 10/21/15 6:56 PM, Wangnan (F) wrote: One alternative solution I can imagine is to attach a BPF program at sampling like a kprobe, and return 0 if we don't want the sampling to take action. Thought? Do you think attaching BPF programs to sampling is an acceptable idea? If you mean to extend the 'filter' concept to sampling events? So instead of soft_disable of non-local events, you'll attach a bpf program to sampling events and use a map lookup to decide whether to filter out such a sampling event or not? Yes. What pt_regs would be in such case? Sampling is based on interrupts. We can use the pt_regs captured by the IRQ handler, or we can simply pass NULL to those BPF programs. Thank you.
Re: [PATCH V5 1/1] bpf: control events stored in PERF_EVENT_ARRAY maps trace data output when perf sampling
On 2015/10/22 0:57, Peter Zijlstra wrote: On Wed, Oct 21, 2015 at 11:06:47PM +0800, pi3orama wrote: So explain; how does this eBPF stuff work. I think I get your point this time, and let me explain the eBPF stuff to you. You are aware that a BPF programmer can break the system in this way:

A = get_non_local_perf_event()
perf_event_read_local(A)
BOOM!

However the above logic is impossible because a BPF program can't work this way. First of all, it is impossible for a BPF program to directly invoke a kernel function. Unlike a kernel module, a BPF program can only invoke functions designed for it, like what this patch does. So the ability of BPF programs is strictly restricted by the kernel. If we don't allow BPF programs to call perf_event_read_local() across cores, we can check this and return an error from the function we provide for them. Second: there's no way for a BPF program to directly access a perf event. All perf events have to be wrapped by a map and be accessed through the BPF functions described above. We don't allow a BPF program to fetch array elements from that map. So pointers to perf events are safely protected from BPF programs. In summary, your either-or logic doesn't hold in the BPF world. A BPF program can only access perf events in a highly restricted way. We don't allow it to call perf_event_read_local() across cores, so it can't. Urgh, that's still horridly inconsistent. Can we please come up with a consistent interface to perf? BPF programs and kernel modules are two different worlds as I said before. I don't think making them share a common interface is a good idea, because such sharing would give BPF programs more freedom than they really need, and it would then be hard to prevent them from doing something bad. If we really need a kernel interface, I think what we need is a kernel module, not a BPF program. Thank you.
Re: [PATCH V5 1/1] bpf: control events stored in PERF_EVENT_ARRAY maps trace data output when perf sampling
Hi Alexei, On 2015/10/21 21:42, Wangnan (F) wrote: One alternative solution I can imagine is to attach a BPF program at sampling like a kprobe, and return 0 if we don't want the sampling to take action. Thought? Do you think attaching BPF programs to sampling is an acceptable idea? Actually speaking, I don't like it very much because the principle of soft-disable is much simpler and safer, but if you really like it I think we can try. Thank you.
Re: [PATCH V5 1/1] bpf: control events stored in PERF_EVENT_ARRAY maps trace data output when perf sampling
On 2015/10/21 20:17, Peter Zijlstra wrote: On Wed, Oct 21, 2015 at 07:49:34PM +0800, Wangnan (F) wrote: If our task is sampling cycle events while a function is running, and two cores start that function so the calls overlap:

Time:   ... A
Core 0: sys_write
Core 1:    sys_write%return
Core 2:      sys_write

Then without a counter, at time A it is highly possible that the BPF programs on core 1 and core 2 conflict with each other. The final result is that we turn some of those events on and others off. Using an atomic counter can avoid this problem. But but, how and why can an eBPF program access a !local event? I thought we had hard restrictions on that. How can an eBPF program access a !local event: when creating a perf event array we don't care which perf event is for which CPU, so a BPF program can access any perf event in that array. Which is straightforward. And in soft disabling/enabling, what needs to be done is an atomic operation on something in the 'perf_event' structure, which is safe enough. Why we need an eBPF program to access a !local event: I think I have explained why we need an eBPF program to access a !local event. In summary, without this ability we can't speak in the user's language, because users focus on higher level principles (display refreshing, application booting, http processing...) and 'on which CPU' is not in their dictionary most of the time. Without cross-core soft-enabling/disabling it is hard to translate requirements like "start sampling when display refreshing begins" and "stop sampling when the application has booted" into eBPF programs and a perf cmdline. Don't you think it is useful for reducing sampling data and needs to be considered? One alternative solution I can imagine is to attach a BPF program at sampling like a kprobe, and return 0 if we don't want the sampling to take action. Thought? Actually speaking, I don't like it very much because the principle of soft-disable is much simpler and safer, but if you really like it I think we can try.

Do you think soft-disabling/enabling perf events on other cores causes any real problem? Thank you.
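The counter-based soft-enable Wang argues for can be sketched with C11 atomics in userspace (the kernel side would use atomic_t; soft_enable_cnt and the function names here are illustrative, not from the patch set). A plain 0/1 flag breaks when overlapping entry/exit probes on different CPUs toggle it; a counter nests correctly, since sampling stays on whenever the count is positive.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* One shared counter stands in for the per-event soft_enable state. */
static atomic_int soft_enable_cnt;

static void soft_enable(void)  { atomic_fetch_add(&soft_enable_cnt, 1); }
static void soft_disable(void) { atomic_fetch_sub(&soft_enable_cnt, 1); }

/* Called from the overflow path: should this sample be emitted? */
static bool sampling_enabled(void)
{
	return atomic_load(&soft_enable_cnt) > 0;
}
```

With the Core 1/Core 2 overlap from the diagram above, the second entry raises the count to 2 and the first exit only drops it back to 1, so sampling stays on until the last in-flight sys_write returns.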
Re: [PATCH V5 1/1] bpf: control events stored in PERF_EVENT_ARRAY maps trace data output when perf sampling
On 2015/10/21 19:56, Peter Zijlstra wrote: On Wed, Oct 21, 2015 at 07:34:28PM +0800, Wangnan (F) wrote: If you want to actually disable the event: pmu->stop() will make it stop, and you can restart using pmu->start(). I also prefer totally disabling the event because our goal is to reduce sampling overhead as much as possible. However, events in perf are CPU bound; one event on the perf cmdline becomes multiple 'perf_event' structures in the kernel on a multi-core system. Disabling/enabling events on all CPUs from a BPF program is a hard task due to racing, NMI, ... But eBPF perf events must already be local afaik. Look at the constraints perf_event_read_local() places on the events. I think soft disabling/enabling is free of this constraint, because technically speaking a soft-disabled perf event is still running. What we want to disable is only the sampling action, to avoid being overwhelmed by sampling and to reduce the overhead of outputting unneeded sampling data to perf.data. I don't care too much whether the PMU counter is stopped or still running. Even if it still generates interrupts I think it should be acceptable, because interrupt handling can be fast. Thank you.
Re: [PATCH V5 1/1] bpf: control events stored in PERF_EVENT_ARRAY maps trace data output when perf sampling
On 2015/10/21 19:33, Peter Zijlstra wrote: On Wed, Oct 21, 2015 at 06:31:04PM +0800, xiakaixu wrote: The RFC patch set contains the necessary commit log [1]. That's of course the wrong place, this should be in the patch's Changelog. It doesn't become less relevant. In some scenarios we don't want to output trace data when perf sampling, in order to reduce overhead. For example, perf can be run as a daemon to dump trace data when necessary, such as when system performance goes down. Just like the example given in the cover letter, we only receive the samples within the sys_write() syscall. The helper bpf_perf_event_control() in this patch set can control the data output process and get the samples we are most interested in. cpu_function_call is probably too much to do from a bpf program, so I chose the current design that is like 'soft_disable'. So, IIRC, we already require eBPF perf events to be CPU-local, which obviates the entire need for IPIs. But soft-disable/enable doesn't require an IPI, because it is only a memory store operation. So calling pmu->stop() seems entirely possible (its even NMI safe). But we need to turn off sampling across CPUs. Please have a look at my other email. This, however, does not explain if you need nesting, your patch seemed to have a counter, which suggests you do. To avoid racing. If our task is sampling cycle events while a function is running, and two cores start that function so the calls overlap:

Time:   ... A
Core 0: sys_write
Core 1:    sys_write%return
Core 2:      sys_write

Then without a counter, at time A it is highly possible that the BPF programs on core 1 and core 2 conflict with each other. The final result is that we turn some of those events on and others off. Using an atomic counter can avoid this problem. Thank you. In any case, you could add perf_event_{stop,start}_local() to mirror the existing perf_event_read_local(), no? That would stop the entire thing and reduce even more overhead than simply skipping the overflow handler.
Re: [PATCH V5 1/1] bpf: control events stored in PERF_EVENT_ARRAY maps trace data output when perf sampling
On 2015/10/21 17:12, Peter Zijlstra wrote: On Tue, Oct 20, 2015 at 03:53:02PM -0700, Alexei Starovoitov wrote: On 10/20/15 12:22 AM, Kaixu Xia wrote:

diff --git a/kernel/events/core.c b/kernel/events/core.c
index b11756f..5219635 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6337,6 +6337,9 @@ static int __perf_event_overflow(struct perf_event *event,
 		irq_work_queue(&event->pending);
 	}

+	if (unlikely(!atomic_read(&event->soft_enable)))
+		return 0;
+
 	if (event->overflow_handler)
 		event->overflow_handler(event, data, regs);
 	else

Peter, does this part look right or should it be moved right after

	if (unlikely(!is_sampling_event(event)))
		return 0;

or even to another function? It feels to me that it should be moved, since we probably don't want to activate throttling, period adjustment and event_limit for events that are in soft_disabled state. Depends on what it's meant to do. As long as you let the interrupt happen, I think we should in fact do those things (maybe not the event_limit), but period adjustment and esp. throttling are important when the event is enabled. If you want to actually disable the event: pmu->stop() will make it stop, and you can restart using pmu->start(). I also prefer totally disabling the event because our goal is to reduce sampling overhead as much as possible. However, events in perf are CPU bound; one event on the perf cmdline becomes multiple 'perf_event' structures in the kernel on a multi-core system. Disabling/enabling events on all CPUs from a BPF program is a hard task due to racing, NMI, ... Think about an example scenario: we want to sample cycles in a system-wide way to see what the whole system does while a smartphone is refreshing its display, and we don't want other samples when the display is frozen. We probe at the entry and exit points of Display.refresh() (a fictional user function), then let two BPF programs enable 'cycles' sampling at the entry point and disable it at the exit point.

In this task, we need to start all 'cycles' perf_events when the display starts refreshing, and disable all of those events when refreshing is finished. Only enabling the event on the core which executes the entry point of Display.refresh() is not enough, because the real workers are running on other cores; we need them to do the computation cooperatively. Also, the scheduler may schedule the exit point of Display.refresh() onto another core, so we can't simply disable the perf_event on that core and let the other cores keep sampling after refreshing finishes. I have thought of a method which can disable sampling in a safe way: we can call pmu->stop() inside the PMU IRQ handler, so we can ensure that pmu->stop() is always called on the core its event resides on. However, I don't know how to re-enable them safely. Maybe we need something in the scheduler? Thank you. And I suppose you can wrap that with a counter if you need nesting. I'm not sure if any of that is a viable solution, because the patch description is somewhat short on the problem statement. As is, I'm not too charmed with the patch, but lacking a better understanding of what exactly we're trying to achieve I'm struggling with proposing alternatives.
Re: [PATCH net-next 2/3] bpf: introduce bpf_perf_event_output() helper
On 2015/10/21 18:01, He Kuang wrote: hi, Alexei I've tested the sample in the next patch and it works well. I think more work on the perf side needs to be done for parsing the PERF_COUNT_SW_BPF_OUTPUT event type; are you working on that? Thank you. We need to add something into parse-event.y/l to let it know the new software event. We can do this simple task. However, as I guessed before, perf is unable to print this new event, so some work needs to be done to let 'perf report' and 'perf script' know it. One thing I can still remember is finding a way to inject format information through this output so 'perf script' and 'perf data convert --to-ctf' can decode the raw data without extra work. Do you remember that we discussed a solution which connects debuginfo and output raw data using a builtin function in LLVM/clang? We already have the clang/LLVM patch for it, but on the perf side I haven't started working on it. I think we can make perf output raw data hexadecimally at first. I'll do the above work and put the related patches at the end of my eBPF patchset, because they shouldn't be merged until this patchset is upstreamed. Thank you. On 2015/10/21 11:02, Alexei Starovoitov wrote: This helper is used to send raw data from eBPF program into special PERF_TYPE_SOFTWARE/PERF_COUNT_SW_BPF_OUTPUT perf_event. User space needs to perf_event_open() it (either for one or all cpus) and store FD into perf_event_array (similar to bpf_perf_event_read() helper) before eBPF program can send data into it. Today the programs triggered by kprobe collect the data and either store it into the maps or print it via bpf_trace_printk() where latter is the debug facility and not suitable to stream the data. This new helper replaces such bpf_trace_printk() usage and allows programs to have dedicated channel into user space for post-processing of the raw data collected.
Signed-off-by: Alexei Starovoitov
---
 include/uapi/linux/bpf.h        | 11 ++
 include/uapi/linux/perf_event.h |  1 +
 kernel/bpf/arraymap.c           |  2 ++
 kernel/bpf/verifier.c           |  3 ++-
 kernel/trace/bpf_trace.c        | 46 +++
 5 files changed, 62 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 564f1f091991..2e032426cfb7 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -287,6 +287,17 @@ enum bpf_func_id {
 	 * Return: realm if != 0
 	 */
 	BPF_FUNC_get_route_realm,
+
+	/**
+	 * bpf_perf_event_output(ctx, map, index, data, size) - output perf raw sample
+	 * @ctx: struct pt_regs*
+	 * @map: pointer to perf_event_array map
+	 * @index: index of event in the map
+	 * @data: data on stack to be output as raw data
+	 * @size: size of data
+	 * Return: 0 on success
+	 */
+	BPF_FUNC_perf_event_output,
 	__BPF_FUNC_MAX_ID,
 };

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 2881145cda86..d3c417615361 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -110,6 +110,7 @@ enum perf_sw_ids {
 	PERF_COUNT_SW_ALIGNMENT_FAULTS	= 7,
 	PERF_COUNT_SW_EMULATION_FAULTS	= 8,
 	PERF_COUNT_SW_DUMMY		= 9,
+	PERF_COUNT_SW_BPF_OUTPUT	= 10,
 	PERF_COUNT_SW_MAX,	/* non-ABI */
 };

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index f2d9e698c753..e3cfe46b074f 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -295,6 +295,8 @@ static void *perf_event_fd_array_get_ptr(struct bpf_map *map, int fd)
 		return (void *)attr;

 	if (attr->type != PERF_TYPE_RAW &&
+	    !(attr->type == PERF_TYPE_SOFTWARE &&
+	      attr->config == PERF_COUNT_SW_BPF_OUTPUT) &&
 	    attr->type != PERF_TYPE_HARDWARE) {
 		perf_event_release_kernel(event);
 		return ERR_PTR(-EINVAL);

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 1d6b97be79e1..b56cf51f8d42 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -245,6 +245,7 @@ static const struct {
 } func_limit[] = {
 	{BPF_MAP_TYPE_PROG_ARRAY, BPF_FUNC_tail_call},
 	{BPF_MAP_TYPE_PERF_EVENT_ARRAY, BPF_FUNC_perf_event_read},
+	{BPF_MAP_TYPE_PERF_EVENT_ARRAY, BPF_FUNC_perf_event_output},
 };

 static void print_verifier_state(struct verifier_env *env)
@@ -910,7 +911,7 @@ static int check_map_func_compatibility(struct bpf_map *map, int func_id)
 		 * don't allow any other map type to be passed into
 		 * the special func;
 		 */
-		if (bool_map != bool_func)
+		if (bool_func && bool_map != bool_func)
 			return -EINVAL;
 	}

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 0fe96c7c8803..47febbe7998e 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -215,6 +215,50 @@ const struct bpf_func_proto bpf_perf_event_read_prot
Re: [RFC PATCH 2/2] bpf: Implement bpf_perf_event_sample_enable/disable() helpers
On 2015/10/13 18:54, He Kuang wrote: hi, Alexei What about using a similar implementation like PERF_EVENT_IOC_SET_OUTPUT, creating a new ioctl like PERF_EVENT_IOC_SET_ENABLER, then letting perf select an event as 'enabler', so BPF can still control one atomic variable to enable/disable a set of events. you lost me on that last sentence. How will this 'enabler' work? Also I'm still missing what's wrong with perf doing ioctl() on events on all cpus manually when the bpf program tells it to do so. Is it speed you are concerned about or extra work in perf? To avoid having too many wakeups, the perf ringbuffer has a watermark limit to cache events and reduce wakeups, which means the perf userspace tool cannot receive perf events immediately. Here's a simple demo example to prove it: 'sleep_exec' does some writes and prints a timestamp every second, and a label is printed when perf poll gets events.

$ perf record -m 2 -e syscalls:sys_enter_write sleep_exec 1000
userspace sleep time: 0 seconds
userspace sleep time: 1 seconds
userspace sleep time: 2 seconds
userspace sleep time: 3 seconds
perf record wakeup onetime 0
userspace sleep time: 4 seconds
userspace sleep time: 5 seconds
userspace sleep time: 6 seconds
userspace sleep time: 7 seconds
perf record wakeup onetime 1
userspace sleep time: 8 seconds
perf record wakeup onetime 2
..

$ perf record -m 1 -e syscalls:sys_enter_write sleep_exec 1000
userspace sleep time: 0 seconds
userspace sleep time: 1 seconds
perf record wakeup onetime 0
userspace sleep time: 2 seconds
userspace sleep time: 3 seconds
perf record wakeup onetime 1
userspace sleep time: 4 seconds
userspace sleep time: 5 seconds
..
By default, if no mmap_pages is specified, perf tools wake up only when the target executable has finished:

$ perf record -e syscalls:sys_enter_write sleep_exec 5
userspace sleep time: 0 seconds
userspace sleep time: 1 seconds
userspace sleep time: 2 seconds
userspace sleep time: 3 seconds
userspace sleep time: 4 seconds
perf record wakeup onetime 0
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.006 MB perf.data (54 samples) ]

If we want perf to react as soon as our sample event is generated, --no-buffering should be used, but this option has a greater impact on performance.

$ perf record --no-buffering -e syscalls:sys_enter_write sleep_exec 1000
userspace sleep time: 0 seconds
perf record wakeup onetime 0
perf record wakeup onetime 1
perf record wakeup onetime 2
perf record wakeup onetime 3
perf record wakeup onetime 4
perf record wakeup onetime 5
perf record wakeup onetime 6
..

Hi Alexei, Based on He Kuang's test result, if we choose to use perf to control perf events and output the trigger event through bpf_perf_event_output, with the default setting we have to wait several seconds until perf gets the first trigger event if the trigger event's frequency is low. In my display refreshing example, this causes loss of trigger events. From the user's view, random frames would be missed. With --no-buffering, things become faster, but --no-buffering causes perf to be scheduled in more often than normal, which conflicts with the goal of event disabling: reducing recording overhead as much as possible. Thank you.
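The delays He Kuang measured follow from the watermark logic alone, which a toy model makes easy to see (struct ring_model and its numbers are illustrative, not perf's real accounting): with a low-frequency event, many seconds can pass before enough bytes accumulate to cross the watermark and wake the consumer.

```c
/* Toy model of watermark-based wakeups: the consumer is only woken
 * once the bytes pending in the ring buffer cross the threshold. */
struct ring_model {
	unsigned int pending;	/* bytes buffered, not yet consumed */
	unsigned int watermark;	/* wakeup threshold in bytes        */
	unsigned int wakeups;	/* how often the consumer was woken */
};

static void model_emit(struct ring_model *rb, unsigned int bytes)
{
	rb->pending += bytes;
	if (rb->pending >= rb->watermark) {	/* crossed the watermark... */
		rb->wakeups++;			/* ...wake the consumer     */
		rb->pending = 0;		/* consumer drains buffer   */
	}
}
```

A hundred 16-byte records against a 4 KiB watermark produce zero wakeups: exactly the "trigger event sits in the buffer for seconds" behavior above.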
Re: [RFC PATCH 2/2] bpf: Implement bpf_perf_event_sample_enable/disable() helpers
On 2015/10/13 13:15, Alexei Starovoitov wrote: On 10/12/15 9:34 PM, Wangnan (F) wrote: On 2015/10/13 12:16, Alexei Starovoitov wrote: On 10/12/15 8:51 PM, Wangnan (F) wrote: why 'set disable' is needed ? the example given in cover letter shows the use case where you want to receive samples only within sys_write() syscall. The example makes sense, but sys_write() is running on this cpu, so just disabling it on the current one is enough. Our real use case is control of system-wide sampling. For example, we need to sample all CPUs when a smartphone starts refreshing its display. We need all CPUs because in the Android system there are plenty of threads involved in this behavior. We can't achieve this by controlling sampling on only one CPU. This is the reason we need 'set enable' and 'set disable'. ok, but that use case may have different enable/disable pattern. In sys_write example ultra-fast enable/disable is must have, since the whole syscall is fast and overhead should be minimal. but for display refresh? we're talking milliseconds, no? Can you just ioctl() it from user space? If cost of enable/disable is high or the time range between toggling is long, then doing it from the bpf program doesn't make sense. Instead the program can do bpf_perf_event_output() to send a notification to user space that condition is met and the user space can ioctl() events. OK. I think I understand your design principle: everything inside BPF should be as fast as possible. Making userspace control events using ioctl makes things harder. You know that 'perf record' itself doesn't care too much about the events it received.
It only copies data to perf.data, but what we want is to use perf record simply like this:

  # perf record -e evt=cycles -e control.o/pmu=evt/ -a sleep 100

And in control.o we create uprobe points to mark the start and finish of a frame:

  SEC("target=/a/b/c.o\nstartFrame=0x123456")
  int startFrame(void *) {
      bpf_pmu_enable(pmu);
      return 1;
  }

  SEC("target=/a/b/c.o\nfinishFrame=0x234568")
  int finishFrame(void *) {
      bpf_pmu_disable(pmu);
      return 1;
  }

I think it makes sense too.

yes. that looks quite useful, but did you consider re-entrant startFrame()?

  start   << here sampling starts
  start
  finish  << here all samples disabled?!
  finish

and startFrame()/finishFrame() running on all cpus of that user app? If one cpu is entering startFrame() while another cpu is doing finishFrame(), what should the behavior be? Is sampling still enabled on all cpus, or off? Either case doesn't seem to work with a simple enable/disable. A few emails back in this thread I mentioned inc/dec of a flag to solve that.

Correct. What about using a similar implementation to PERF_EVENT_IOC_SET_OUTPUT: create a new ioctl like PERF_EVENT_IOC_SET_ENABLER, then let perf select an event as the 'enabler', so BPF can still control one atomic variable to enable/disable a set of events.

you lost me on that last sentence. How would this 'enabler' work?

Like what we did in this patchset: add an atomic flag to perf_event, and make other perf_events connect to this enabler via PERF_EVENT_IOC_SET_ENABLER. During running, check the enabler's atomic flag. So we use one atomic variable to control a set of perf_events, and finally create a BPF helper function to control that atomic variable.

Also I'm still missing what's wrong with perf doing ioctl() on events on all cpus manually when the bpf program tells it to do so. Is it speed you're concerned about, or extra work in perf?

I think both speed and extra work need to be considered. Say we use perf to enable/disable sampling.
Using the above example: when the smartphone starts refreshing the display, we write something into the ringbuffer, then display refreshing starts. We have to wait for perf to be scheduled in, parse the events it gets (perf record doesn't do this currently), discover the trigger event, and then enable the sampling perf events on all cpus. We make trigger and action asynchronous. I'm not sure how many ns or ms that needs, and I believe the asynchrony itself introduces complexity, which I think should be avoided unless we can explain the advantages it brings. But yes, a perf-based implementation can shut down the PMU completely, which is better than the current lightweight implementation.

In summary:

- In the next version we will use a counter-based flag instead of the current 0/1 switcher, in consideration of the re-entry problem.
- I think we both agree we need a lightweight solution with which we can enable/disable sampling at function level. This lightweight solution can be applied to only one perf event.
- Our disagreement is whether to introduce a heavyweight solution based on perf to enable/disable a group of perf events. For me, the perf-based solution can shut down the PMU completely, which is good. However, it
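The counter-based flag from the first summary point can be sketched as follows. This is a minimal userspace model with illustrative names (not the patchset's code): nested or concurrent startFrame/finishFrame pairs keep sampling enabled until every enable has been matched by a disable, which is exactly what a plain 0/1 switch gets wrong in the re-entrant trace Alexei showed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Counter-based enable flag: with a 0/1 switch, an inner
 * startFrame/finishFrame pair turns sampling off while the outer frame
 * is still active.  Counting nested enables fixes that. */
static atomic_int sample_enable_cnt;

static void frame_enable(void)
{
	atomic_fetch_add(&sample_enable_cnt, 1);
}

static void frame_disable(void)
{
	atomic_fetch_sub(&sample_enable_cnt, 1);
}

/* Sampling stays on as long as any enabler is still active. */
static bool sampling_enabled(void)
{
	return atomic_load(&sample_enable_cnt) > 0;
}
```

The atomic operations also make the cross-CPU case well defined: one CPU entering startFrame() while another runs finishFrame() leaves the count, and therefore the sampling state, consistent.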
Re: [RFC PATCH 2/2] bpf: Implement bpf_perf_event_sample_enable/disable() helpers
On 2015/10/13 12:16, Alexei Starovoitov wrote:

On 10/12/15 8:51 PM, Wangnan (F) wrote:

why 'set disable' is needed? the example given in the cover letter shows the use case where you want to receive samples only within the sys_write() syscall. The example makes sense, but sys_write() is running on this cpu, so just disabling it on the current one is enough.

Our real use case is control of system-wide sampling. For example, we need to sample all CPUs when a smartphone starts refreshing its display. We need all CPUs because in the Android system there are plenty of threads involved in this behavior. We can't achieve this by controlling sampling on only one CPU. This is the reason we need 'set enable' and 'set disable'.

ok, but that use case may have a different enable/disable pattern. In the sys_write example ultra-fast enable/disable is a must-have, since the whole syscall is fast and overhead should be minimal. but for display refresh? we're talking milliseconds, no? Can you just ioctl() it from user space? If the cost of enable/disable is high or the time range between toggling is long, then doing it from the bpf program doesn't make sense. Instead the program can do bpf_perf_event_output() to send a notification to user space that the condition is met, and user space can ioctl() the events.

OK. I think I understand your design principle: everything inside BPF should be as fast as possible. Making userspace control events through ioctl makes things harder. You know that 'perf record' itself doesn't care too much about the events it receives.
It only copies data to perf.data, but what we want is to use perf record simply like this:

  # perf record -e evt=cycles -e control.o/pmu=evt/ -a sleep 100

And in control.o we create uprobe points to mark the start and finish of a frame:

  SEC("target=/a/b/c.o\nstartFrame=0x123456")
  int startFrame(void *) {
      bpf_pmu_enable(pmu);
      return 1;
  }

  SEC("target=/a/b/c.o\nfinishFrame=0x234568")
  int finishFrame(void *) {
      bpf_pmu_disable(pmu);
      return 1;
  }

I think it makes sense too.

I still think perf events are not necessarily independent of each other. You know we have PERF_EVENT_IOC_SET_OUTPUT, which can make multiple events output through one ringbuffer. This way perf events are connected. I think the 'set disable/enable' design in this patchset satisfies the design goal that in a BPF program we only do simple and fast things. The only inconvenience is that we add something into a map, which is ugly.

What about using a similar implementation to PERF_EVENT_IOC_SET_OUTPUT: create a new ioctl like PERF_EVENT_IOC_SET_ENABLER, then let perf select an event as the 'enabler', so BPF can still control one atomic variable to enable/disable a set of events.

Thank you.
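The proposed PERF_EVENT_IOC_SET_ENABLER can be modeled roughly as below. Note this ioctl does not exist; the struct and function names here are illustrative stand-ins for the idea: several events check one atomic flag owned by a designated 'enabler' event, so a single store toggles the whole group.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of an event: each event owns a flag, but the flag it
 * actually checks on overflow may belong to another event (the
 * "enabler"). */
struct model_event {
	atomic_int own_flag;         /* flag this event owns */
	atomic_int *sample_disable;  /* flag checked on overflow, or NULL */
};

/* Model of the proposed SET_ENABLER ioctl: point ev's check at the
 * enabler's flag, so one atomic variable controls a set of events. */
static void set_enabler(struct model_event *ev, struct model_event *enabler)
{
	ev->sample_disable = &enabler->own_flag;
}

/* Model of the overflow path: drop the sample when the shared flag is
 * set; events with no enabler always output. */
static bool should_output(struct model_event *ev)
{
	return !(ev->sample_disable && atomic_load(ev->sample_disable));
}
```

A BPF helper would then only need to store 0 or 1 into the enabler's flag, keeping the in-program work a single atomic write.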
Re: [RFC PATCH 2/2] bpf: Implement bpf_perf_event_sample_enable/disable() helpers
On 2015/10/13 11:39, Alexei Starovoitov wrote:

On 10/12/15 8:27 PM, Wangnan (F) wrote:

Then how to avoid racing? For example, while one core is disabling all events in a map, another core is enabling all of them. This racing may cause several perf events in a map to dump samples while other events don't. To avoid such racing I think some locking must be introduced, and then the cost is even higher. The reason why we introduce an atomic pointer is that each operation should control a set of events, not one event, due to the per-cpu manner of perf events.

why 'set disable' is needed? the example given in the cover letter shows the use case where you want to receive samples only within the sys_write() syscall. The example makes sense, but sys_write() is running on this cpu, so just disabling it on the current one is enough.

Our real use case is control of system-wide sampling. For example, we need to sample all CPUs when a smartphone starts refreshing its display. We need all CPUs because in the Android system there are plenty of threads involved in this behavior. We can't achieve this by controlling sampling on only one CPU. This is the reason we need 'set enable' and 'set disable'.

Thank you.
Re: [RFC PATCH 2/2] bpf: Implement bpf_perf_event_sample_enable/disable() helpers
On 2015/10/13 3:29, Alexei Starovoitov wrote:

On 10/12/15 2:02 AM, Kaixu Xia wrote:

  +extern const struct bpf_func_proto bpf_perf_event_sample_enable_proto;
  +extern const struct bpf_func_proto bpf_perf_event_sample_disable_proto;

externs are unnecessary. Just make them static. Also I prefer a single helper that takes a flag, so we can extend it instead of adding a func_id for every little operation. To avoid conflicts, if you touch kernel/bpf/* or bpf.h please always base your patches on net-next.

  +atomic_set(&map->perf_sample_disable, 0);

a global flag per map is a no-go. events are independent and should be treated as such.

Then how to avoid racing? For example, while one core is disabling all events in a map, another core is enabling all of them. This racing may cause several perf events in a map to dump samples while other events don't. To avoid such racing I think some locking must be introduced, and then the cost is even higher. The reason why we introduce an atomic pointer is that each operation should control a set of events, not one event, due to the per-cpu manner of perf events.

Thank you.

Please squash these two patches, since they're part of one logical feature. Splitting them like this only makes review harder.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH 1/2] perf: Add the flag sample_disable not to output data on samples
On 2015/10/12 20:02, Peter Zijlstra wrote:

On Mon, Oct 12, 2015 at 09:02:42AM +, Kaixu Xia wrote:

  --- a/include/linux/perf_event.h
  +++ b/include/linux/perf_event.h
  @@ -483,6 +483,8 @@ struct perf_event {
   	perf_overflow_handler_t		overflow_handler;
   	void				*overflow_handler_context;

  +	atomic_t			*sample_disable;
  +
   #ifdef CONFIG_EVENT_TRACING
   	struct trace_event_call		*tp_event;
   	struct event_filter		*filter;

  diff --git a/kernel/events/core.c b/kernel/events/core.c
  index b11756f..f6ef45c 100644
  --- a/kernel/events/core.c
  +++ b/kernel/events/core.c
  @@ -6337,6 +6337,9 @@ static int __perf_event_overflow(struct perf_event *event,
   		irq_work_queue(&event->pending);
   	}

  +	if ((event->sample_disable) && atomic_read(event->sample_disable))
  +		return ret;
  +
   	if (event->overflow_handler)
   		event->overflow_handler(event, data, regs);
   	else

Try and guarantee sample_disable lives in the same cacheline as overflow_handler.

Could you please explain why we need them to be in the same cacheline?

Thank you.

I think we should at the very least replace the kzalloc() currently used with a cacheline-aligned alloc, and check the structure layout to verify these two do in fact share a cacheline.
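Peter's cacheline point can be checked mechanically: if sample_disable sits in the same 64-byte cacheline as overflow_handler, the extra read in __perf_event_overflow() costs no additional cache miss, because the hot path loads overflow_handler anyway. A toy layout check sketching the idea (the struct is a stand-in, not the real struct perf_event; only the field names mirror the patch):

```c
#include <assert.h>
#include <stddef.h>

#define CACHELINE 64

/* Stand-in layout: the pad represents the fields of struct perf_event
 * that precede the overflow-handler members. */
struct toy_event {
	char pad[128];                 /* earlier fields */
	void (*overflow_handler)(void);
	void *overflow_handler_context;
	int *sample_disable;           /* keep adjacent to the handler */
};

/* Two offsets share a cacheline iff they fall in the same 64-byte
 * slot; with an aligned allocation this holds for the real object. */
static int same_cacheline(size_t a, size_t b)
{
	return a / CACHELINE == b / CACHELINE;
}
```

This is also why the cacheline-aligned alloc matters: the offset arithmetic only maps to real cachelines when the base of the structure is itself cacheline-aligned.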
Re: [PATCH v2] bpf: fix a bug in verification logic when SUB operation taken on FRAME_PTR
On 2015/6/19 0:00, Alexei Starovoitov wrote:

On Thu, Jun 18, 2015 at 08:31:45AM +, Wang Nan wrote:

The original code has a problem, causing the following code to fail the verifier:

  r1 <- r10
  r1 -= 8
  r2 = 8
  r3 = unsafe pointer
  call BPF_FUNC_probe_read    <-- R1 type=inv expected=fp

However, by replacing 'r1 -= 8' with 'r1 += -8' the above program can be loaded successfully. This is because the verifier allows only the BPF_ADD instruction on a FRAME_PTR register to forge a PTR_TO_STACK register, but makes BPF_SUB on a FRAME_PTR register produce an UNKNOWN_VALUE register. This patch fixes it by adding BPF_SUB to the stack_relative check.

It's not a bug. It's catching ADD only by design. If we let it recognize SUB then one might argue we should let it recognize multiply, shifts and all other arithmetic on pointers. The verifier would keep getting bigger and bigger. Where do we stop? llvm only emits the canonical ADD. If you've seen llvm emitting SUB, let's fix it there. So what piece of code generated this 'r1 -= 8'?

I hit this problem when writing an automatic parameter generator. The instruction was generated by my own code, which I have now corrected. Thank you.
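On the generator side, the correction amounts to emitting the canonical form the verifier recognizes: rewrite 'r1 -= imm' as 'r1 += -imm' before loading. A minimal sketch of that rewrite (the instruction struct is simplified and the function is illustrative, but the opcode values are the real BPF encodings; a real generator would additionally restrict this to frame-pointer arithmetic):

```c
#include <assert.h>
#include <stdint.h>

/* Real BPF opcode encoding constants (from linux/bpf_common.h). */
#define BPF_ALU64 0x07
#define BPF_ADD   0x00
#define BPF_SUB   0x10
#define BPF_K     0x00

/* Simplified instruction: just the opcode byte and the immediate. */
struct toy_insn {
	uint8_t code;
	int32_t imm;
};

/* Canonicalize an ALU64 SUB-immediate into the equivalent ADD of the
 * negated immediate, the only form the verifier's stack_relative check
 * (and LLVM's codegen) produce for stack pointers. */
static void canonicalize_sub(struct toy_insn *insn)
{
	if (insn->code == (BPF_ALU64 | BPF_SUB | BPF_K)) {
		insn->code = BPF_ALU64 | BPF_ADD | BPF_K;
		insn->imm = -insn->imm;
	}
}
```

The two instructions are arithmetically identical, so the rewrite is safe wherever the immediate fits in 32 bits; it simply moves the program into the subset the verifier is designed to track.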