[PATCH bpf-next v2 1/7] perf/core: add perf_get_event() to return perf_event given a struct file
A new extern function, perf_get_event(), is added to return a perf event
given a struct file. This function will be used in later patches.

Signed-off-by: Yonghong Song
---
 include/linux/perf_event.h | 5 +
 kernel/events/core.c       | 8
 2 files changed, 13 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index e71e99e..b5c1ad3 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -868,6 +868,7 @@ extern void perf_event_exit_task(struct task_struct *child);
 extern void perf_event_free_task(struct task_struct *task);
 extern void perf_event_delayed_put(struct task_struct *task);
 extern struct file *perf_event_get(unsigned int fd);
+extern struct perf_event *perf_get_event(struct file *file);
 extern const struct perf_event_attr *perf_event_attrs(struct perf_event *event);
 extern void perf_event_print_debug(void);
 extern void perf_pmu_disable(struct pmu *pmu);
@@ -1289,6 +1290,10 @@ static inline void perf_event_exit_task(struct task_struct *child) { }
 static inline void perf_event_free_task(struct task_struct *task) { }
 static inline void perf_event_delayed_put(struct task_struct *task) { }
 static inline struct file *perf_event_get(unsigned int fd) { return ERR_PTR(-EINVAL); }
+static inline struct perf_event *perf_get_event(struct file *file)
+{
+	return ERR_PTR(-EINVAL);
+}
 static inline const struct perf_event_attr *perf_event_attrs(struct perf_event *event)
 {
 	return ERR_PTR(-EINVAL);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 67612ce..1e3cddb 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -11212,6 +11212,14 @@ struct file *perf_event_get(unsigned int fd)
 	return file;
 }

+struct perf_event *perf_get_event(struct file *file)
+{
+	if (file->f_op != &perf_fops)
+		return ERR_PTR(-EINVAL);
+
+	return file->private_data;
+}
+
 const struct perf_event_attr *perf_event_attrs(struct perf_event *event)
 {
 	if (!event)
--
2.9.5
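
For readers following along, here is a rough userspace sketch of the pattern
perf_get_event() relies on: compare file->f_op against the perf file
operations, and only then trust file->private_data. It is not kernel code.
The struct definitions, other_fops, event_id_of() and the simplified
ERR_PTR helpers below are stand-ins made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Simplified userspace stand-ins for the kernel's ERR_PTR helpers. */
#define MAX_ERRNO 4095
static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical, minimal models of the kernel structures involved. */
struct file_operations { int unused; };
struct perf_event { int id; };
struct file {
	const struct file_operations *f_op;
	void *private_data;
};

static const struct file_operations perf_fops = { 0 };
static const struct file_operations other_fops = { 0 };	/* some non-perf file */

/* Same shape as the helper in the patch: check f_op, then hand back private_data. */
static struct perf_event *perf_get_event(struct file *file)
{
	if (file->f_op != &perf_fops)
		return ERR_PTR(-EINVAL);

	return file->private_data;
}

/* Hypothetical caller: returns the event id, or a negative errno. */
static int event_id_of(struct file *file)
{
	struct perf_event *event = perf_get_event(file);

	if (IS_ERR(event))
		return (int)PTR_ERR(event);

	return event->id;
}
```

A caller therefore has to test the result with IS_ERR() before dereferencing
it, since a file that is not a perf event yields ERR_PTR(-EINVAL) rather than
NULL.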
Re: [PATCH v3 1/2] media: rc: introduce BPF_PROG_RAWIR_EVENT
On Thu, May 17, 2018 at 2:45 PM, Sean Youngwrote: > Hi, > > Again thanks for a thoughtful review. This will definitely will improve > the code. > > On Thu, May 17, 2018 at 10:02:52AM -0700, Y Song wrote: >> On Wed, May 16, 2018 at 2:04 PM, Sean Young wrote: >> > Add support for BPF_PROG_RAWIR_EVENT. This type of BPF program can call >> > rc_keydown() to reported decoded IR scancodes, or rc_repeat() to report >> > that the last key should be repeated. >> > >> > The bpf program can be attached to using the bpf(BPF_PROG_ATTACH) syscall; >> > the target_fd must be the /dev/lircN device. >> > >> > Signed-off-by: Sean Young >> > --- >> > drivers/media/rc/Kconfig | 13 ++ >> > drivers/media/rc/Makefile | 1 + >> > drivers/media/rc/bpf-rawir-event.c | 363 + >> > drivers/media/rc/lirc_dev.c| 24 ++ >> > drivers/media/rc/rc-core-priv.h| 24 ++ >> > drivers/media/rc/rc-ir-raw.c | 14 +- >> > include/linux/bpf_rcdev.h | 30 +++ >> > include/linux/bpf_types.h | 3 + >> > include/uapi/linux/bpf.h | 55 - >> > kernel/bpf/syscall.c | 7 + >> > 10 files changed, 531 insertions(+), 3 deletions(-) >> > create mode 100644 drivers/media/rc/bpf-rawir-event.c >> > create mode 100644 include/linux/bpf_rcdev.h >> > >> > diff --git a/drivers/media/rc/Kconfig b/drivers/media/rc/Kconfig >> > index eb2c3b6eca7f..2172d65b0213 100644 >> > --- a/drivers/media/rc/Kconfig >> > +++ b/drivers/media/rc/Kconfig >> > @@ -25,6 +25,19 @@ config LIRC >> >passes raw IR to and from userspace, which is needed for >> >IR transmitting (aka "blasting") and for the lirc daemon. >> > >> > +config BPF_RAWIR_EVENT >> > + bool "Support for eBPF programs attached to lirc devices" >> > + depends on BPF_SYSCALL >> > + depends on RC_CORE=y >> > + depends on LIRC >> > + help >> > + Allow attaching eBPF programs to a lirc device using the bpf(2) >> > + syscall command BPF_PROG_ATTACH. This is supported for raw IR >> > + receivers. 
>> > + >> > + These eBPF programs can be used to decode IR into scancodes, for >> > + IR protocols not supported by the kernel decoders. >> > + >> > menuconfig RC_DECODERS >> > bool "Remote controller decoders" >> > depends on RC_CORE >> > diff --git a/drivers/media/rc/Makefile b/drivers/media/rc/Makefile >> > index 2e1c87066f6c..74907823bef8 100644 >> > --- a/drivers/media/rc/Makefile >> > +++ b/drivers/media/rc/Makefile >> > @@ -5,6 +5,7 @@ obj-y += keymaps/ >> > obj-$(CONFIG_RC_CORE) += rc-core.o >> > rc-core-y := rc-main.o rc-ir-raw.o >> > rc-core-$(CONFIG_LIRC) += lirc_dev.o >> > +rc-core-$(CONFIG_BPF_RAWIR_EVENT) += bpf-rawir-event.o >> > obj-$(CONFIG_IR_NEC_DECODER) += ir-nec-decoder.o >> > obj-$(CONFIG_IR_RC5_DECODER) += ir-rc5-decoder.o >> > obj-$(CONFIG_IR_RC6_DECODER) += ir-rc6-decoder.o >> > diff --git a/drivers/media/rc/bpf-rawir-event.c >> > b/drivers/media/rc/bpf-rawir-event.c >> > new file mode 100644 >> > index ..7cb48b8d87b5 >> > --- /dev/null >> > +++ b/drivers/media/rc/bpf-rawir-event.c >> > @@ -0,0 +1,363 @@ >> > +// SPDX-License-Identifier: GPL-2.0 >> > +// bpf-rawir-event.c - handles bpf >> > +// >> > +// Copyright (C) 2018 Sean Young >> > + >> > +#include >> > +#include >> > +#include >> > +#include "rc-core-priv.h" >> > + >> > +/* >> > + * BPF interface for raw IR >> > + */ >> > +const struct bpf_prog_ops rawir_event_prog_ops = { >> > +}; >> > + >> > +BPF_CALL_1(bpf_rc_repeat, struct bpf_rawir_event*, event) >> > +{ >> > + struct ir_raw_event_ctrl *ctrl; >> > + >> > + ctrl = container_of(event, struct ir_raw_event_ctrl, >> > bpf_rawir_event); >> > + >> > + rc_repeat(ctrl->dev); >> > + >> > + return 0; >> > +} >> > + >> > +static const struct bpf_func_proto rc_repeat_proto = { >> > + .func = bpf_rc_repeat, >> > + .gpl_only = true, /* rc_repeat is EXPORT_SYMBOL_GPL */ >> > + .ret_type = RET_INTEGER, >> > + .arg1_type = ARG_PTR_TO_CTX, >> > +}; >> > + >> > +BPF_CALL_4(bpf_rc_keydown, struct bpf_rawir_event*, event, u32, protocol, >> > + u32, 
scancode, u32, toggle) >> > +{ >> > + struct ir_raw_event_ctrl *ctrl; >> > + >> > + ctrl = container_of(event, struct ir_raw_event_ctrl, >> > bpf_rawir_event); >> > + >> > + rc_keydown(ctrl->dev, protocol, scancode, toggle != 0); >> > + >> > + return 0; >> > +} >> > + >> > +static const struct bpf_func_proto rc_keydown_proto = { >> > + .func = bpf_rc_keydown, >> > + .gpl_only = true, /* rc_keydown is EXPORT_SYMBOL_GPL */ >> > + .ret_type = RET_INTEGER, >> > + .arg1_type = ARG_PTR_TO_CTX, >> > + .arg2_type = ARG_ANYTHING, >> > + .arg3_type = ARG_ANYTHING, >> > + .arg4_type = ARG_ANYTHING, >> > +}; >> > + >> > +static
[RFC PATCH net-next] tcp: tcp_rack_reo_wnd() can be static
Fixes: 20b654dfe1be ("tcp: support DUPACK threshold in RACK")
Signed-off-by: kbuild test robot
---
 tcp_recovery.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_recovery.c b/net/ipv4/tcp_recovery.c
index 30cbfb6..71593e4 100644
--- a/net/ipv4/tcp_recovery.c
+++ b/net/ipv4/tcp_recovery.c
@@ -21,7 +21,7 @@ static bool tcp_rack_sent_after(u64 t1, u64 t2, u32 seq1, u32 seq2)
 	return t1 > t2 || (t1 == t2 && after(seq1, seq2));
 }

-u32 tcp_rack_reo_wnd(const struct sock *sk)
+static u32 tcp_rack_reo_wnd(const struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
[net-next:master 1200/1233] net/ipv4/tcp_recovery.c:24:5: sparse: symbol 'tcp_rack_reo_wnd' was not declared. Should it be static?
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git master
head:   538e2de104cfb4ef1acb35af42427bff42adbe4d
commit: 20b654dfe1beaca60ab51894ff405a049248433d [1200/1233] tcp: support DUPACK threshold in RACK
reproduce:
        # apt-get install sparse
        git checkout 20b654dfe1beaca60ab51894ff405a049248433d
        make ARCH=x86_64 allmodconfig
        make C=1 CF=-D__CHECK_ENDIAN__

sparse warnings: (new ones prefixed by >>)

   net/ipv4/tcp_recovery.c:46:16: sparse: expression using sizeof(void)
   net/ipv4/tcp_recovery.c:46:16: sparse: expression using sizeof(void)
>> net/ipv4/tcp_recovery.c:24:5: sparse: symbol 'tcp_rack_reo_wnd' was not declared. Should it be static?
   include/net/tcp.h:738:16: sparse: expression using sizeof(void)
   net/ipv4/tcp_recovery.c:102:40: sparse: expression using sizeof(void)
   net/ipv4/tcp_recovery.c:102:40: sparse: expression using sizeof(void)
   include/net/tcp.h:738:16: sparse: expression using sizeof(void)
   net/ipv4/tcp_recovery.c:210:42: sparse: expression using sizeof(void)

Please review and possibly fold the followup patch.

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
[PATCH bpf-next v2 2/7] bpf: introduce bpf subcommand BPF_TASK_FD_QUERY
Currently, suppose a userspace application has loaded a bpf program
and attached it to a tracepoint/kprobe/uprobe, and a bpf
introspection tool, e.g., bpftool, wants to show which bpf program
is attached to which tracepoint/kprobe/uprobe. Such attachment
information will be really useful to understand the overall bpf
deployment in the system.

There is a name field (16 bytes) for each program, which could
be used to encode the attachment point. This approach has some
drawbacks. First, a bpftool user (e.g., an admin) may not really
understand the association between the name and the attachment
point. Second, if one program is attached to multiple places,
encoding a proper name which can imply all these attachments
becomes difficult.

This patch introduces a new bpf subcommand BPF_TASK_FD_QUERY.
Given a pid and fd, if the fd is associated with a
tracepoint/kprobe/uprobe perf event, BPF_TASK_FD_QUERY will return
   . prog_id
   . tracepoint name, or
   . k[ret]probe funcname + offset or kernel addr, or
   . u[ret]probe filename + offset
to the userspace.
The user can use "bpftool prog" to find more information about
the bpf program itself with prog_id.

Signed-off-by: Yonghong Song --- include/linux/trace_events.h | 15 ++ include/uapi/linux/bpf.h | 27 ++ kernel/bpf/syscall.c | 124 +++ kernel/trace/bpf_trace.c | 48 + kernel/trace/trace_kprobe.c | 29 ++ kernel/trace/trace_uprobe.c | 22 6 files changed, 265 insertions(+) diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h index 2bde3ef..bd08e11 100644 --- a/include/linux/trace_events.h +++ b/include/linux/trace_events.h @@ -473,6 +473,9 @@ int perf_event_query_prog_array(struct perf_event *event, void __user *info); int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog *prog); int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_prog *prog); struct bpf_raw_event_map *bpf_find_raw_tracepoint(const char *name); +int bpf_get_perf_event_info(struct perf_event *event, u32 *prog_id, + u32 *attach_info, const char **buf, + u64 *probe_offset, u64 *probe_addr); #else static inline unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx) { @@ -504,6 +507,12 @@ static inline struct bpf_raw_event_map *bpf_find_raw_tracepoint(const char *name { return NULL; } +static inline int bpf_get_perf_event_info(struct file *file, u32 *prog_id, + u32 *attach_info, const char **buf, + u64 *probe_offset, u64 *probe_addr) +{ + return -EOPNOTSUPP; +} #endif enum { @@ -560,10 +569,16 @@ extern void perf_trace_del(struct perf_event *event, int flags); #ifdef CONFIG_KPROBE_EVENTS extern int perf_kprobe_init(struct perf_event *event, bool is_retprobe); extern void perf_kprobe_destroy(struct perf_event *event); +extern int bpf_get_kprobe_info(struct perf_event *event, u32 *attach_info, + const char **symbol, u64 *probe_offset, + u64 *probe_addr, bool perf_type_tracepoint); #endif #ifdef CONFIG_UPROBE_EVENTS extern int perf_uprobe_init(struct perf_event *event, bool is_retprobe); extern void perf_uprobe_destroy(struct perf_event *event); +extern int bpf_get_uprobe_info(struct perf_event *event, u32 *attach_info, + const char **filename, 
u64 *probe_offset, + bool perf_type_tracepoint); #endif extern int ftrace_profile_set_filter(struct perf_event *event, int event_id, char *filter_str); diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index d94d333..6a22ad4 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -97,6 +97,7 @@ enum bpf_cmd { BPF_RAW_TRACEPOINT_OPEN, BPF_BTF_LOAD, BPF_BTF_GET_FD_BY_ID, + BPF_TASK_FD_QUERY, }; enum bpf_map_type { @@ -379,6 +380,22 @@ union bpf_attr { __u32 btf_log_size; __u32 btf_log_level; }; + + struct { + int pid;/* input: pid */ + int fd; /* input: fd */ + __u32 flags; /* input: flags */ + __u32 buf_len;/* input: buf len */ + __aligned_u64 buf;/* input/output: +* tp_name for tracepoint +* symbol for kprobe +* filename for uprobe +*/ + __u32 prog_id;/* output: prod_id */ + __u32 attach_info;
[PATCH bpf-next v2 4/7] tools/bpf: add ksym_get_addr() in trace_helpers
Given a kernel function name, ksym_get_addr() will return the kernel
address for this function, or 0 if it cannot find this function name
in /proc/kallsyms. This function will be used later when a kernel
address is used to initiate a kprobe perf event.

Signed-off-by: Yonghong Song
---
 tools/testing/selftests/bpf/trace_helpers.c | 12
 tools/testing/selftests/bpf/trace_helpers.h | 1 +
 2 files changed, 13 insertions(+)

diff --git a/tools/testing/selftests/bpf/trace_helpers.c b/tools/testing/selftests/bpf/trace_helpers.c
index 8fb4fe8..3868dcb 100644
--- a/tools/testing/selftests/bpf/trace_helpers.c
+++ b/tools/testing/selftests/bpf/trace_helpers.c
@@ -72,6 +72,18 @@ struct ksym *ksym_search(long key)
 	return &syms[0];
 }

+long ksym_get_addr(const char *name)
+{
+	int i;
+
+	for (i = 0; i < sym_cnt; i++) {
+		if (strcmp(syms[i].name, name) == 0)
+			return syms[i].addr;
+	}
+
+	return 0;
+}
+
 static int page_size;
 static int page_cnt = 8;
 static struct perf_event_mmap_page *header;
diff --git a/tools/testing/selftests/bpf/trace_helpers.h b/tools/testing/selftests/bpf/trace_helpers.h
index 36d90e3..3b4bcf7 100644
--- a/tools/testing/selftests/bpf/trace_helpers.h
+++ b/tools/testing/selftests/bpf/trace_helpers.h
@@ -11,6 +11,7 @@ struct ksym {

 int load_kallsyms(void);
 struct ksym *ksym_search(long key);
+long ksym_get_addr(const char *name);

 typedef enum bpf_perf_event_ret (*perf_event_print_fn)(void *data, int size);

--
2.9.5
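
To illustrate the lookup outside the selftest, here is a small standalone
sketch of the same linear search by name. The table contents and addresses
are made up. In the selftest the table is filled from /proc/kallsyms by
load_kallsyms(), which is not reproduced here.

```c
#include <assert.h>
#include <string.h>

struct ksym {
	long addr;
	const char *name;
};

/* Hypothetical sample table; the addresses are invented for illustration. */
static struct ksym syms[] = {
	{ 0x1000, "blk_start_request" },
	{ 0x2000, "do_sys_open" },
	{ 0x3000, "blk_account_io_completion" },
};
static int sym_cnt = sizeof(syms) / sizeof(syms[0]);

/* Linear search by exact name; 0 means "not found", as in the patch. */
long ksym_get_addr(const char *name)
{
	int i;

	for (i = 0; i < sym_cnt; i++) {
		if (strcmp(syms[i].name, name) == 0)
			return syms[i].addr;
	}

	return 0;
}
```

Note that 0 doubles as the "not found" value, so a caller should treat a
zero return as an error before using the result as a probe address.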
[PATCH bpf-next v2 0/7] bpf: implement BPF_TASK_FD_QUERY
Currently, suppose a userspace application has loaded a bpf program
and attached it to a tracepoint/kprobe/uprobe, and a bpf
introspection tool, e.g., bpftool, wants to show which bpf program
is attached to which tracepoint/kprobe/uprobe. Such attachment
information will be really useful to understand the overall bpf
deployment in the system.

There is a name field (16 bytes) for each program, which could
be used to encode the attachment point. This approach has some
drawbacks. First, a bpftool user (e.g., an admin) may not really
understand the association between the name and the attachment
point. Second, if one program is attached to multiple places,
encoding a proper name which can imply all these attachments
becomes difficult.

This patch set introduces a new bpf subcommand BPF_TASK_FD_QUERY.
Given a pid and fd, this command will return bpf related information
to user space. Right now it only supports tracepoint/kprobe/uprobe
perf event fd's. For such a fd, BPF_TASK_FD_QUERY will return
   . prog_id
   . tracepoint name, or
   . k[ret]probe funcname + offset or kernel addr, or
   . u[ret]probe filename + offset
to the userspace.
The user can use "bpftool prog" to find more information about
the bpf program itself with prog_id.

Patch #1 adds function perf_get_event() in kernel/events/core.c.
Patch #2 implements the bpf subcommand BPF_TASK_FD_QUERY.
Patch #3 syncs tools bpf.h header and also adds bpf_task_fd_query()
in the libbpf library for samples/selftests/bpftool to use.
Patch #4 adds ksym_get_addr() utility function.
Patch #5 adds a test in samples/bpf for querying k[ret]probes and
u[ret]probes.
Patch #6 adds a test in tools/testing/selftests/bpf for querying
raw_tracepoint and tracepoint.
Patch #7 adds a new subcommand "perf" to bpftool.

Changelogs:
  v1 -> v2:
     . changed bpf subcommand name from BPF_PERF_EVENT_QUERY
       to BPF_TASK_FD_QUERY.
     . fixed various "bpftool perf" issues and added documentation
       and auto-completion.

Yonghong Song (7):
  perf/core: add perf_get_event() to return perf_event given a struct file
  bpf: introduce bpf subcommand BPF_TASK_FD_QUERY
  tools/bpf: sync kernel header bpf.h and add bpf_trace_event_query in libbpf
  tools/bpf: add ksym_get_addr() in trace_helpers
  samples/bpf: add a samples/bpf test for BPF_TASK_FD_QUERY
  tools/bpf: add two BPF_TASK_FD_QUERY tests in test_progs
  tools/bpftool: add perf subcommand

 include/linux/perf_event.h                       | 5 +
 include/linux/trace_events.h                     | 15 +
 include/uapi/linux/bpf.h                         | 27 ++
 kernel/bpf/syscall.c                             | 124
 kernel/events/core.c                             | 8 +
 kernel/trace/bpf_trace.c                         | 48 +++
 kernel/trace/trace_kprobe.c                      | 29 ++
 kernel/trace/trace_uprobe.c                      | 22 ++
 samples/bpf/Makefile                             | 4 +
 samples/bpf/task_fd_query_kern.c                 | 19 ++
 samples/bpf/task_fd_query_user.c                 | 379 +++
 tools/bpf/bpftool/Documentation/bpftool-perf.rst | 81 +
 tools/bpf/bpftool/Documentation/bpftool.rst      | 5 +-
 tools/bpf/bpftool/bash-completion/bpftool        | 9 +
 tools/bpf/bpftool/main.c                         | 3 +-
 tools/bpf/bpftool/main.h                         | 1 +
 tools/bpf/bpftool/perf.c                         | 200
 tools/include/uapi/linux/bpf.h                   | 27 ++
 tools/lib/bpf/bpf.c                              | 24 ++
 tools/lib/bpf/bpf.h                              | 3 +
 tools/testing/selftests/bpf/test_progs.c         | 133
 tools/testing/selftests/bpf/trace_helpers.c      | 12 +
 tools/testing/selftests/bpf/trace_helpers.h      | 1 +
 23 files changed, 1177 insertions(+), 2 deletions(-)
 create mode 100644 samples/bpf/task_fd_query_kern.c
 create mode 100644 samples/bpf/task_fd_query_user.c
 create mode 100644 tools/bpf/bpftool/Documentation/bpftool-perf.rst
 create mode 100644 tools/bpf/bpftool/perf.c

--
2.9.5
[PATCH bpf-next v2 7/7] tools/bpftool: add perf subcommand
The new command "bpftool perf [show | list]" will traverse all processes under /proc, and if any fd is associated with a perf event, it will print out related perf event information. Documentation is also added. Below is an example to show the results using bcc commands. Running the following 4 bcc commands: kprobe: trace.py '__x64_sys_nanosleep' kretprobe: trace.py 'r::__x64_sys_nanosleep' tracepoint: trace.py 't:syscalls:sys_enter_nanosleep' uprobe: trace.py 'p:/home/yhs/a.out:main' The bpftool command line and result: $ bpftool perf pid 21711 fd 5: prog_id 5 kprobe func __x64_sys_write offset 0 pid 21765 fd 5: prog_id 7 kretprobe func __x64_sys_nanosleep offset 0 pid 21767 fd 5: prog_id 8 tracepoint sys_enter_nanosleep pid 21800 fd 5: prog_id 9 uprobe filename /home/yhs/a.out offset 1159 $ bpftool -j perf {"pid":21711,"fd":5,"prog_id":5,"attach_info":"kprobe","func":"__x64_sys_write","offset":0}, \ {"pid":21765,"fd":5,"prog_id":7,"attach_info":"kretprobe","func":"__x64_sys_nanosleep","offset":0}, \ {"pid":21767,"fd":5,"prog_id":8,"attach_info":"tracepoint","tracepoint":"sys_enter_nanosleep"}, \ {"pid":21800,"fd":5,"prog_id":9,"attach_info":"uprobe","filename":"/home/yhs/a.out","offset":1159} $ bpftool prog 5: kprobe name probe___x64_sys tag e495a0c82f2c7a8d gpl loaded_at 2018-05-15T04:46:37-0700 uid 0 xlated 200B not jited memlock 4096B map_ids 4 7: kprobe name probe___x64_sys tag f2fdee479a503abf gpl loaded_at 2018-05-15T04:48:32-0700 uid 0 xlated 200B not jited memlock 4096B map_ids 7 8: tracepoint name tracepoint__sys tag 5390badef2395fcf gpl loaded_at 2018-05-15T04:48:48-0700 uid 0 xlated 200B not jited memlock 4096B map_ids 8 9: kprobe name probe_main_1 tag 0a87bdc2e2953b6d gpl loaded_at 2018-05-15T04:49:52-0700 uid 0 xlated 200B not jited memlock 4096B map_ids 9 $ ps ax | grep "python ./trace.py" 21711 pts/0T 0:03 python ./trace.py __x64_sys_write 21765 pts/0S+ 0:00 python ./trace.py r::__x64_sys_nanosleep 21767 pts/2S+ 0:00 python ./trace.py 
t:syscalls:sys_enter_nanosleep 21800 pts/3S+ 0:00 python ./trace.py p:/home/yhs/a.out:main 22374 pts/1S+ 0:00 grep --color=auto python ./trace.py Signed-off-by: Yonghong Song--- tools/bpf/bpftool/Documentation/bpftool-perf.rst | 81 + tools/bpf/bpftool/Documentation/bpftool.rst | 5 +- tools/bpf/bpftool/bash-completion/bpftool| 9 + tools/bpf/bpftool/main.c | 3 +- tools/bpf/bpftool/main.h | 1 + tools/bpf/bpftool/perf.c | 200 +++ 6 files changed, 297 insertions(+), 2 deletions(-) create mode 100644 tools/bpf/bpftool/Documentation/bpftool-perf.rst create mode 100644 tools/bpf/bpftool/perf.c diff --git a/tools/bpf/bpftool/Documentation/bpftool-perf.rst b/tools/bpf/bpftool/Documentation/bpftool-perf.rst new file mode 100644 index 000..3e65375 --- /dev/null +++ b/tools/bpf/bpftool/Documentation/bpftool-perf.rst @@ -0,0 +1,81 @@ + +bpftool-perf + +--- +tool for inspection of perf related bpf prog attachments +--- + +:Manual section: 8 + +SYNOPSIS + + + **bpftool** [*OPTIONS*] **perf** *COMMAND* + + *OPTIONS* := { [{ **-j** | **--json** }] [{ **-p** | **--pretty** }] } + + *COMMANDS* := + { **show** | **list** | **help** } + +PERF COMMANDS += + +| **bpftool** **perf { show | list }** +| **bpftool** **perf help** + +DESCRIPTION +=== + **bpftool perf { show | list }** + List all raw_tracepoint, tracepoint, kprobe attachment in the system. + + Output will start with process id and file descriptor in that process, + followed by bpf program id, attachment information, and attachment point. + The attachment point for raw_tracepoint/tracepoint is the trace probe name. + The attachment point for k[ret]probe is either symbol name and offset, + or a kernel virtual address. + The attachment point for u[ret]probe is the file name and the file offset. + + **bpftool perf help** + Print short help message. + +OPTIONS +=== + -h, --help + Print short generic help message (similar to **bpftool help**). + + -v, --version + Print version number (similar to **bpftool version**). 
+ + -j, --json + Generate JSON output. For commands that cannot produce JSON, this + option has no effect. + + -p, --pretty + Generate
[PATCH bpf-next v2 3/7] tools/bpf: sync kernel header bpf.h and add bpf_trace_event_query in libbpf
Sync kernel header bpf.h to tools/include/uapi/linux/bpf.h and
implement bpf_trace_event_query() in libbpf. The test programs
in samples/bpf and tools/testing/selftests/bpf, and later bpftool
will use this libbpf function to query kernel.

Signed-off-by: Yonghong Song
---
 tools/include/uapi/linux/bpf.h | 27 +++
 tools/lib/bpf/bpf.c            | 24
 tools/lib/bpf/bpf.h            | 3 +++
 3 files changed, 54 insertions(+)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index d94d333..6a22ad4 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -97,6 +97,7 @@ enum bpf_cmd {
 	BPF_RAW_TRACEPOINT_OPEN,
 	BPF_BTF_LOAD,
 	BPF_BTF_GET_FD_BY_ID,
+	BPF_TASK_FD_QUERY,
 };

 enum bpf_map_type {
@@ -379,6 +380,22 @@ union bpf_attr {
 		__u32		btf_log_size;
 		__u32		btf_log_level;
 	};
+
+	struct {
+		int		pid;		/* input: pid */
+		int		fd;		/* input: fd */
+		__u32		flags;		/* input: flags */
+		__u32		buf_len;	/* input: buf len */
+		__aligned_u64	buf;		/* input/output:
+						 * tp_name for tracepoint
+						 * symbol for kprobe
+						 * filename for uprobe
+						 */
+		__u32		prog_id;	/* output: prod_id */
+		__u32		attach_info;	/* output: BPF_ATTACH_* */
+		__u64		probe_offset;	/* output: probe_offset */
+		__u64		probe_addr;	/* output: probe_addr */
+	} task_fd_query;
 } __attribute__((aligned(8)));

 /* The description below is an attempt at providing documentation to eBPF
@@ -2450,4 +2467,14 @@ struct bpf_fib_lookup {
 	__u8	dmac[6];     /* ETH_ALEN */
 };

+/* used by based query */
+enum {
+	BPF_ATTACH_RAW_TRACEPOINT,	/* tp name */
+	BPF_ATTACH_TRACEPOINT,		/* tp name */
+	BPF_ATTACH_KPROBE,		/* (symbol + offset) or addr */
+	BPF_ATTACH_KRETPROBE,		/* (symbol + offset) or addr */
+	BPF_ATTACH_UPROBE,		/* filename + offset */
+	BPF_ATTACH_URETPROBE,		/* filename + offset */
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 6a8a000..da3f336 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -643,3 +643,27 @@ int bpf_load_btf(void *btf, __u32 btf_size, char *log_buf, __u32 log_buf_size,

 	return fd;
 }
+
+int bpf_task_fd_query(int pid, int fd, __u32 flags, char *buf, __u32 buf_len,
+		      __u32 *prog_id, __u32 *attach_info,
+		      __u64 *probe_offset, __u64 *probe_addr)
+{
+	union bpf_attr attr = {};
+	int err;
+
+	attr.task_fd_query.pid = pid;
+	attr.task_fd_query.fd = fd;
+	attr.task_fd_query.flags = flags;
+	attr.task_fd_query.buf = ptr_to_u64(buf);
+	attr.task_fd_query.buf_len = buf_len;
+
+	err = sys_bpf(BPF_TASK_FD_QUERY, &attr, sizeof(attr));
+	if (!err) {
+		*prog_id = attr.task_fd_query.prog_id;
+		*attach_info = attr.task_fd_query.attach_info;
+		*probe_offset = attr.task_fd_query.probe_offset;
+		*probe_addr = attr.task_fd_query.probe_addr;
+	}
+
+	return err;
+}
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 15bff77..9adfde6 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -107,4 +107,7 @@ int bpf_prog_query(int target_fd, enum bpf_attach_type type, __u32 query_flags,
 int bpf_raw_tracepoint_open(const char *name, int prog_fd);
 int bpf_load_btf(void *btf, __u32 btf_size, char *log_buf, __u32 log_buf_size,
 		 bool do_log);
+int bpf_task_fd_query(int pid, int fd, __u32 flags, char *buf, __u32 buf_len,
+		      __u32 *prog_id, __u32 *prog_info,
+		      __u64 *probe_offset, __u64 *probe_addr);
 #endif
--
2.9.5
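
As a rough illustration of how a caller might interpret the values returned
by bpf_task_fd_query(), the sketch below turns (attach_info, buf,
probe_offset, probe_addr) into a one-line description. The enum is copied
from this v2 patch and may not match what is finally merged. The helper
name describe_attach(), the output wording, and the rule "an empty buf means
the kprobe was specified by address" are assumptions made for this example,
not something the patch defines.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Mirrors the BPF_ATTACH_* enum proposed in this v2 patch. */
enum {
	BPF_ATTACH_RAW_TRACEPOINT,	/* tp name */
	BPF_ATTACH_TRACEPOINT,		/* tp name */
	BPF_ATTACH_KPROBE,		/* (symbol + offset) or addr */
	BPF_ATTACH_KRETPROBE,		/* (symbol + offset) or addr */
	BPF_ATTACH_UPROBE,		/* filename + offset */
	BPF_ATTACH_URETPROBE,		/* filename + offset */
};

/*
 * Hypothetical helper: format one query result into out.
 * Returns the snprintf() result, or -1 for an unknown attach_info.
 */
static int describe_attach(char *out, size_t len, unsigned int attach_info,
			   const char *buf, unsigned long long probe_offset,
			   unsigned long long probe_addr)
{
	const char *kind;

	switch (attach_info) {
	case BPF_ATTACH_RAW_TRACEPOINT:
		return snprintf(out, len, "raw_tracepoint %s", buf);
	case BPF_ATTACH_TRACEPOINT:
		return snprintf(out, len, "tracepoint %s", buf);
	case BPF_ATTACH_KPROBE:
	case BPF_ATTACH_KRETPROBE:
		kind = attach_info == BPF_ATTACH_KPROBE ? "kprobe" : "kretprobe";
		/* Assumption: an empty symbol means the probe was set by address. */
		if (buf[0] != '\0')
			return snprintf(out, len, "%s func %s offset %llu",
					kind, buf, probe_offset);
		return snprintf(out, len, "%s addr %llu", kind, probe_addr);
	case BPF_ATTACH_UPROBE:
	case BPF_ATTACH_URETPROBE:
		kind = attach_info == BPF_ATTACH_UPROBE ? "uprobe" : "uretprobe";
		return snprintf(out, len, "%s filename %s offset %llu",
				kind, buf, probe_offset);
	default:
		return -1;
	}
}
```

A caller would pass the outputs of a successful bpf_task_fd_query() call
straight into such a helper; on failure the output parameters are left
untouched by the libbpf wrapper, so they should not be formatted.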
[PATCH bpf-next v2 5/7] samples/bpf: add a samples/bpf test for BPF_TASK_FD_QUERY
This is mostly to test kprobe/uprobe which needs kernel headers. Signed-off-by: Yonghong Song--- samples/bpf/Makefile | 4 + samples/bpf/task_fd_query_kern.c | 19 ++ samples/bpf/task_fd_query_user.c | 379 +++ 3 files changed, 402 insertions(+) create mode 100644 samples/bpf/task_fd_query_kern.c create mode 100644 samples/bpf/task_fd_query_user.c diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile index 62d1aa1..7dc85ed 100644 --- a/samples/bpf/Makefile +++ b/samples/bpf/Makefile @@ -51,6 +51,7 @@ hostprogs-y += cpustat hostprogs-y += xdp_adjust_tail hostprogs-y += xdpsock hostprogs-y += xdp_fwd +hostprogs-y += task_fd_query # Libbpf dependencies LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a @@ -105,6 +106,7 @@ cpustat-objs := bpf_load.o cpustat_user.o xdp_adjust_tail-objs := xdp_adjust_tail_user.o xdpsock-objs := bpf_load.o xdpsock_user.o xdp_fwd-objs := bpf_load.o xdp_fwd_user.o +task_fd_query-objs := bpf_load.o task_fd_query_user.o $(TRACE_HELPERS) # Tell kbuild to always build the programs always := $(hostprogs-y) @@ -160,6 +162,7 @@ always += cpustat_kern.o always += xdp_adjust_tail_kern.o always += xdpsock_kern.o always += xdp_fwd_kern.o +always += task_fd_query_kern.o HOSTCFLAGS += -I$(objtree)/usr/include HOSTCFLAGS += -I$(srctree)/tools/lib/ @@ -175,6 +178,7 @@ HOSTCFLAGS_offwaketime_user.o += -I$(srctree)/tools/lib/bpf/ HOSTCFLAGS_spintest_user.o += -I$(srctree)/tools/lib/bpf/ HOSTCFLAGS_trace_event_user.o += -I$(srctree)/tools/lib/bpf/ HOSTCFLAGS_sampleip_user.o += -I$(srctree)/tools/lib/bpf/ +HOSTCFLAGS_task_fd_query_user.o += -I$(srctree)/tools/lib/bpf/ HOST_LOADLIBES += $(LIBBPF) -lelf HOSTLOADLIBES_tracex4 += -lrt diff --git a/samples/bpf/task_fd_query_kern.c b/samples/bpf/task_fd_query_kern.c new file mode 100644 index 000..f4b0a9e --- /dev/null +++ b/samples/bpf/task_fd_query_kern.c @@ -0,0 +1,19 @@ +// SPDX-License-Identifier: GPL-2.0 +#include +#include +#include +#include "bpf_helpers.h" + +SEC("kprobe/blk_start_request") +int bpf_prog1(struct 
pt_regs *ctx) +{ + return 0; +} + +SEC("kretprobe/blk_account_io_completion") +int bpf_prog2(struct pt_regs *ctx) +{ + return 0; +} +char _license[] SEC("license") = "GPL"; +u32 _version SEC("version") = LINUX_VERSION_CODE; diff --git a/samples/bpf/task_fd_query_user.c b/samples/bpf/task_fd_query_user.c new file mode 100644 index 000..792ef24 --- /dev/null +++ b/samples/bpf/task_fd_query_user.c @@ -0,0 +1,379 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "libbpf.h" +#include "bpf_load.h" +#include "bpf_util.h" +#include "perf-sys.h" +#include "trace_helpers.h" + +#define CHECK_PERROR_RET(condition) ({ \ + int __ret = !!(condition); \ + if (__ret) {\ + printf("FAIL: %s:\n", __func__);\ + perror(""); \ + return -1; \ + } \ +}) + +#define CHECK_AND_RET(condition) ({\ + int __ret = !!(condition); \ + if (__ret) \ + return -1; \ +}) + +static __u64 ptr_to_u64(void *ptr) +{ + return (__u64) (unsigned long) ptr; +} + +#define PMU_TYPE_FILE "/sys/bus/event_source/devices/%s/type" +static int bpf_find_probe_type(const char *event_type) +{ + char buf[256]; + int fd, ret; + + ret = snprintf(buf, sizeof(buf), PMU_TYPE_FILE, event_type); + CHECK_PERROR_RET(ret < 0 || ret >= sizeof(buf)); + + fd = open(buf, O_RDONLY); + CHECK_PERROR_RET(fd < 0); + + ret = read(fd, buf, sizeof(buf)); + close(fd); + CHECK_PERROR_RET(ret < 0 || ret >= sizeof(buf)); + + errno = 0; + ret = (int)strtol(buf, NULL, 10); + CHECK_PERROR_RET(errno); + return ret; +} + +#define PMU_RETPROBE_FILE "/sys/bus/event_source/devices/%s/format/retprobe" +static int bpf_get_retprobe_bit(const char *event_type) +{ + char buf[256]; + int fd, ret; + + ret = snprintf(buf, sizeof(buf), PMU_RETPROBE_FILE, event_type); + CHECK_PERROR_RET(ret < 0 || ret >= sizeof(buf)); + + fd = open(buf, O_RDONLY); + CHECK_PERROR_RET(fd < 0); + + ret = read(fd, buf, sizeof(buf)); + close(fd); + 
CHECK_PERROR_RET(ret < 0 || ret >= sizeof(buf)); + CHECK_PERROR_RET(strlen(buf) < strlen("config:")); + + errno = 0; + ret = (int)strtol(buf + strlen("config:"), NULL, 10); + CHECK_PERROR_RET(errno); + return ret; +} + +static int test_debug_fs_kprobe(int fd_idx, const char *fn_name, +
[PATCH bpf-next v2 6/7] tools/bpf: add two BPF_TASK_FD_QUERY tests in test_progs
New tests are added to query perf_event information for raw_tracepoint
and tracepoint attachments. For tracepoints, both syscall and
non-syscall tracepoints are queried, as they are treated slightly
differently inside the kernel.

Signed-off-by: Yonghong Song
---
 tools/testing/selftests/bpf/test_progs.c | 133 +++
 1 file changed, 133 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
index 3ecf733..f7ede03 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -1542,6 +1542,137 @@ static void test_get_stack_raw_tp(void)
 	bpf_object__close(obj);
 }
 
+static void test_task_fd_query_rawtp(void)
+{
+	const char *file = "./test_get_stack_rawtp.o";
+	struct perf_event_attr attr = {};
+	__u64 probe_offset, probe_addr;
+	int efd, err, prog_fd, pmu_fd;
+	__u32 prog_id, attach_info;
+	struct bpf_object *obj;
+	__u32 duration = 0;
+	char buf[256];
+
+	err = bpf_prog_load(file, BPF_PROG_TYPE_RAW_TRACEPOINT, &obj, &prog_fd);
+	if (CHECK(err, "prog_load raw tp", "err %d errno %d\n", err, errno))
+		return;
+
+	efd = bpf_raw_tracepoint_open("sys_enter", prog_fd);
+	if (CHECK(efd < 0, "raw_tp_open", "err %d errno %d\n", efd, errno))
+		goto close_prog;
+
+	attr.sample_type = PERF_SAMPLE_RAW;
+	attr.type = PERF_TYPE_SOFTWARE;
+	attr.config = PERF_COUNT_SW_BPF_OUTPUT;
+	pmu_fd = syscall(__NR_perf_event_open, &attr, getpid(), -1, -1, 0);
+	if (CHECK(pmu_fd < 0, "perf_event_open", "err %d errno %d\n", pmu_fd,
+		  errno))
+		goto close_prog;
+
+	err = ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0);
+	if (CHECK(err < 0, "ioctl PERF_EVENT_IOC_ENABLE", "err %d errno %d\n",
+		  err, errno))
+		goto close_prog;
+
+	/* query (getpid(), efd) */
+	err = bpf_task_fd_query(getpid(), efd, 0, buf, 256, &prog_id,
+				&attach_info, &probe_offset, &probe_addr);
+	if (CHECK(err < 0, "bpf_trace_event_query", "err %d errno %d\n", err,
+		  errno))
+		goto close_prog;
+
+	err = (attach_info == BPF_ATTACH_RAW_TRACEPOINT) &&
+	      (strcmp(buf, "sys_enter") == 0);
+	if (CHECK(!err, "check_results", "attach_info %d tp_name %s\n",
+		  attach_info, buf))
+		goto close_prog;
+
+	goto close_prog_noerr;
+close_prog:
+	error_cnt++;
+close_prog_noerr:
+	bpf_object__close(obj);
+}
+
+static void test_task_fd_query_tp_core(const char *probe_name,
+				       const char *tp_name)
+{
+	const char *file = "./test_tracepoint.o";
+	int err, bytes, efd, prog_fd, pmu_fd;
+	struct perf_event_attr attr = {};
+	__u64 probe_offset, probe_addr;
+	__u32 prog_id, attach_info;
+	struct bpf_object *obj;
+	__u32 duration = 0;
+	char buf[256];
+
+	err = bpf_prog_load(file, BPF_PROG_TYPE_TRACEPOINT, &obj, &prog_fd);
+	if (CHECK(err, "bpf_prog_load", "err %d errno %d\n", err, errno))
+		goto close_prog;
+
+	snprintf(buf, sizeof(buf),
+		 "/sys/kernel/debug/tracing/events/%s/id", probe_name);
+	efd = open(buf, O_RDONLY, 0);
+	if (CHECK(efd < 0, "open", "err %d errno %d\n", efd, errno))
+		goto close_prog;
+	bytes = read(efd, buf, sizeof(buf));
+	close(efd);
+	if (CHECK(bytes <= 0 || bytes >= sizeof(buf), "read",
+		  "bytes %d errno %d\n", bytes, errno))
+		goto close_prog;
+
+	attr.config = strtol(buf, NULL, 0);
+	attr.type = PERF_TYPE_TRACEPOINT;
+	attr.sample_type = PERF_SAMPLE_RAW;
+	attr.sample_period = 1;
+	attr.wakeup_events = 1;
+	pmu_fd = syscall(__NR_perf_event_open, &attr, -1 /* pid */,
+			 0 /* cpu 0 */, -1 /* group id */,
+			 0 /* flags */);
+	if (CHECK(err, "perf_event_open", "err %d errno %d\n", err, errno))
+		goto close_pmu;
+
+	err = ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0);
+	if (CHECK(err, "perf_event_ioc_enable", "err %d errno %d\n", err,
+		  errno))
+		goto close_pmu;
+
+	err = ioctl(pmu_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
+	if (CHECK(err, "perf_event_ioc_set_bpf", "err %d errno %d\n", err,
+		  errno))
+		goto close_pmu;
+
+	/* query (getpid(), pmu_fd) */
+	err = bpf_task_fd_query(getpid(), pmu_fd, 0, buf, 256, &prog_id,
+				&attach_info, &probe_offset, &probe_addr);
+	if (CHECK(err < 0, "bpf_trace_event_query", "err %d errno %d\n", err,
+		  errno))
+		goto close_pmu;
+
+	err = (attach_info == BPF_ATTACH_TRACEPOINT) && !strcmp(buf, tp_name);
+	if (CHECK(!err, "check_results", "attach_info %d tp_name %s\n",
Re: [PATCH bpf 5/6] tools: bpftool: resolve calls without using imm field
Hi Jakub, On 05/18/2018 12:21 AM, Jakub Kicinski wrote: > On Thu, 17 May 2018 12:05:47 +0530, Sandipan Das wrote: >> Currently, we resolve the callee's address for a JITed function >> call by using the imm field of the call instruction as an offset >> from __bpf_call_base. If bpf_jit_kallsyms is enabled, we further >> use this address to get the callee's kernel symbol's name. >> >> For some architectures, such as powerpc64, the imm field is not >> large enough to hold this offset. So, instead of assigning this >> offset to the imm field, the verifier now assigns the subprog >> id. Also, a list of kernel symbol addresses for all the JITed >> functions is provided in the program info. We now use the imm >> field as an index for this list to lookup a callee's symbol's >> address and resolve its name. >> >> Suggested-by: Daniel Borkmann>> Signed-off-by: Sandipan Das > > A few nit-picks below, thank you for the patch! > >> tools/bpf/bpftool/prog.c | 31 +++ >> tools/bpf/bpftool/xlated_dumper.c | 24 +--- >> tools/bpf/bpftool/xlated_dumper.h | 2 ++ >> 3 files changed, 50 insertions(+), 7 deletions(-) >> >> diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c >> index 9bdfdf2d3fbe..ac2f62a97e84 100644 >> --- a/tools/bpf/bpftool/prog.c >> +++ b/tools/bpf/bpftool/prog.c >> @@ -430,6 +430,10 @@ static int do_dump(int argc, char **argv) >> unsigned char *buf; >> __u32 *member_len; >> __u64 *member_ptr; >> +unsigned int nr_addrs; >> +unsigned long *addrs = NULL; >> +__u32 *ksyms_len; >> +__u64 *ksyms_ptr; > > nit: please try to keep the variables ordered longest to shortest like > we do in networking code (please do it in all functions). 
> >> ssize_t n; >> int err; >> int fd; >> @@ -437,6 +441,8 @@ static int do_dump(int argc, char **argv) >> if (is_prefix(*argv, "jited")) { >> member_len = _prog_len; >> member_ptr = _prog_insns; >> +ksyms_len = _jited_ksyms; >> +ksyms_ptr = _ksyms; >> } else if (is_prefix(*argv, "xlated")) { >> member_len = _prog_len; >> member_ptr = _prog_insns; >> @@ -496,10 +502,23 @@ static int do_dump(int argc, char **argv) >> return -1; >> } >> >> +nr_addrs = *ksyms_len; > > Here and ... > >> +if (nr_addrs) { >> +addrs = malloc(nr_addrs * sizeof(__u64)); >> +if (!addrs) { >> +p_err("mem alloc failed"); >> +free(buf); >> +close(fd); >> +return -1; > > You can just jump to err_free here. > >> +} >> +} >> + >> memset(, 0, sizeof(info)); >> >> *member_ptr = ptr_to_u64(buf); >> *member_len = buf_size; >> +*ksyms_ptr = ptr_to_u64(addrs); >> +*ksyms_len = nr_addrs; > > ... here - this function is getting long, so maybe I'm not seeing > something, but are ksyms_ptr and ksyms_len guaranteed to be initialized? > >> err = bpf_obj_get_info_by_fd(fd, , ); >> close(fd); >> @@ -513,6 +532,11 @@ static int do_dump(int argc, char **argv) >> goto err_free; >> } >> >> +if (*ksyms_len > nr_addrs) { >> +p_err("too many addresses returned"); >> +goto err_free; >> +} >> + >> if ((member_len == _prog_len && >> info.jited_prog_insns == 0) || >> (member_len == _prog_len && >> @@ -558,6 +582,9 @@ static int do_dump(int argc, char **argv) >> dump_xlated_cfg(buf, *member_len); >> } else { >> kernel_syms_load(); >> +dd.jited_ksyms = ksyms_ptr; >> +dd.nr_jited_ksyms = *ksyms_len; >> + >> if (json_output) >> dump_xlated_json(, buf, *member_len, opcodes); >> else >> @@ -566,10 +593,14 @@ static int do_dump(int argc, char **argv) >> } >> >> free(buf); >> +if (addrs) >> +free(addrs); > > Free can deal with NULL pointers, no need for an if. 
> >> return 0; >> >> err_free: >> free(buf); >> +if (addrs) >> +free(addrs); >> return -1; >> } >> >> diff --git a/tools/bpf/bpftool/xlated_dumper.c >> b/tools/bpf/bpftool/xlated_dumper.c >> index 7a3173b76c16..dc8e4eca0387 100644 >> --- a/tools/bpf/bpftool/xlated_dumper.c >> +++ b/tools/bpf/bpftool/xlated_dumper.c >> @@ -178,8 +178,12 @@ static const char *print_call_pcrel(struct dump_data >> *dd, >> snprintf(dd->scratch_buff, sizeof(dd->scratch_buff), >> "%+d#%s", insn->off, sym->name); >> else > > else if (address) > > saves us the indentation. > >> -snprintf(dd->scratch_buff, sizeof(dd->scratch_buff), >> - "%+d#0x%lx", insn->off, address); >> +if (address) >> +snprintf(dd->scratch_buff,
Re: [PATCH net-next v12 3/7] sch_cake: Add optional ACK filter
On Thu, May 17, 2018 at 4:23 AM, Toke Høiland-Jørgensen wrote:
> Eric Dumazet writes:
>
>> On 05/16/2018 01:29 PM, Toke Høiland-Jørgensen wrote:
>>> The ACK filter is an optional feature of CAKE which is designed to improve
>>> performance on links with very asymmetrical rate limits. On such links
>>> (which are unfortunately quite prevalent, especially for DSL and cable
>>> subscribers), the downstream throughput can be limited by the number of
>>> ACKs capable of being transmitted in the *upstream* direction.
>>>
>>
>> ...
>>
>>>
>>> Signed-off-by: Toke Høiland-Jørgensen
>>> ---
>>>  net/sched/sch_cake.c | 260 ++
>>>  1 file changed, 258 insertions(+), 2 deletions(-)
>>>
>>>
>>
>> I have decided to implement ACK compression in TCP stack itself.
>
> Awesome! Will look forward to seeing that!

+1

It is really odd to put this into a TC qdisc; the TCP stack is a much
better place.
Re: [RFC PATCH bpf-next 12/12] i40e: implement Tx zero-copy
2018-05-17 23:31 GMT+02:00 Jesper Dangaard Brouer:
>
> On Tue, 15 May 2018 21:06:15 +0200 Björn Töpel wrote:
>
>> From: Magnus Karlsson
>>
>> Here, the zero-copy ndo is implemented. As a shortcut, the existing
>> XDP Tx rings are used for zero-copy. This means that an XDP program
>> cannot redirect to an AF_XDP enabled XDP Tx ring.
>
> This "shortcut" is not acceptable, and completely broken. The
> XDP_REDIRECT queue_index is based on smp_processor_id(), and can easily
> clash with the configured XSK queue_index. Provided a bit more code
> context below...
>

Yes, and this is the reason we need to go for a solution with
dedicated Tx rings. Again, we chose not to, and simply drop
XDP_REDIRECT where the AF_XDP queue id clashes with the processor
id. The queue id is hijacked by AF_XDP's egress side.

> On Tue, 15 May 2018 21:06:15 +0200
> Björn Töpel wrote:
>
> int i40e_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf)
> {
>         struct i40e_netdev_priv *np = netdev_priv(dev);
>         unsigned int queue_index = smp_processor_id();
>         struct i40e_vsi *vsi = np->vsi;
>         int err;
>
>         if (test_bit(__I40E_VSI_DOWN, vsi->state))
>                 return -ENETDOWN;
>
>> @@ -4025,6 +4158,9 @@ int i40e_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf)
>>       if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
>>               return -ENXIO;
>>
>> +     if (vsi->xdp_rings[queue_index]->xsk_umem)
>> +             return -ENXIO;
>> +
>
> Using the same errno makes this impossible to debug (via the tracepoints).
>

The rationale was that the situation was similar to an incorrectly
configured receiving (from an XDP_REDIRECT perspective) interface.
We'll rework this!

Thanks for looking into this, Jesper!

Björn

>>       err = i40e_xmit_xdp_ring(xdpf, vsi->xdp_rings[queue_index]);
>>       if (err != I40E_XDP_TX)
>>               return -ENOSPC;
>> @@ -4048,5 +4184,34 @@ void i40e_xdp_flush(struct net_device *dev)
>>       if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
>>               return;
>>
>> +     if (vsi->xdp_rings[queue_index]->xsk_umem)
>> +             return;
>> +
>>       i40e_xdp_ring_update_tail(vsi->xdp_rings[queue_index]);
>>  }
>
>
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
Re: [net-next PATCH v2 0/4] Symmetric queue selection using XPS for Rx queues
On Tue, May 15, 2018 at 6:26 PM, Amritha Nambiar wrote:
> This patch series implements support for Tx queue selection based on
> Rx queue(s) map. This is done by configuring Rx queue(s) map per Tx-queue
> using sysfs attribute. If the user configuration for Rx queues does
> not apply, then the Tx queue selection falls back to XPS using CPUs and
> finally to hashing.
>
> XPS is refactored to support Tx queue selection based on either the
> CPUs map or the Rx-queues map. The config option CONFIG_XPS needs to be
> enabled. By default no receive queues are configured for the Tx queue.
>
> - /sys/class/net//queues/tx-*/xps_rxqs
>
> This is to enable sending packets on the same Tx-Rx queue pair as this

If I'm reading the patch correctly, isn't this mapping an rxq to a set
of txqs (in other words, not strictly a queue pair, which has other
connotations in NIC HW)? It is important to make it clear that this
feature is not HW dependent.

> is useful for busy polling multi-threaded workloads where it is not
> possible to pin the threads to a CPU. This is a rework of Sridhar's
> patch for symmetric queueing via socket option:
> https://www.spinics.net/lists/netdev/msg453106.html
>

Please add something about how this was tested and what the
performance gain is to justify the feature.

> v2:
> - Added documentation in networking/scaling.txt
> - Added a simple routine to replace multiple ifdef blocks.
>
> ---
>
> Amritha Nambiar (4):
>       net: Refactor XPS for CPUs and Rx queues
>       net: Enable Tx queue selection based on Rx queues
>       net-sysfs: Add interface for Rx queue map per Tx queue
>       Documentation: Add explanation for XPS using Rx-queue map
>
>
>  Documentation/networking/scaling.txt |   58 +++-
>  include/linux/cpumask.h              |   11 +-
>  include/linux/netdevice.h            |   72 ++
>  include/net/sock.h                   |   18 +++
>  net/core/dev.c                       |  242 +++---
>  net/core/net-sysfs.c                 |   85
>  net/core/sock.c                      |    5 +
>  net/ipv4/tcp_input.c                 |    7 +
>  net/ipv4/tcp_ipv4.c                  |    1
>  net/ipv4/tcp_minisocks.c             |    1
>  10 files changed, 404 insertions(+), 96 deletions(-)
>
> --
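The fallback order described in the cover letter (Rx-queue map first, then the CPU map, then hashing) can be sketched in plain userspace C. This is only an illustration under stated assumptions, not the code from the series: struct txq_map, map_lookup() and pick_tx_queue() are hypothetical names, and each map entry holds a single Tx queue, whereas the real XPS maps associate each key with a set of Tx queues.

```c
#include <assert.h>
#include <stddef.h>

#define NO_QUEUE (-1)

/* Hypothetical flat map: entry i is the Tx queue for key i, or NO_QUEUE. */
struct txq_map {
	const int *txq;
	size_t len;
};

/* Return the configured Tx queue for key, or NO_QUEUE if nothing applies. */
static int map_lookup(const struct txq_map *map, int key)
{
	if (!map || key < 0 || (size_t)key >= map->len)
		return NO_QUEUE;
	return map->txq[key];
}

/*
 * Selection order from the cover letter: Rx-queue map, then CPU map,
 * then a hash over the available Tx queues.
 */
static int pick_tx_queue(const struct txq_map *rxq_map,
			 const struct txq_map *cpu_map,
			 int rx_queue, int cpu,
			 unsigned int flow_hash, unsigned int num_txq)
{
	int q;

	assert(num_txq > 0);

	q = map_lookup(rxq_map, rx_queue);
	if (q == NO_QUEUE)
		q = map_lookup(cpu_map, cpu);
	if (q == NO_QUEUE)
		q = (int)(flow_hash % num_txq);
	return q;
}
```

A caller would pass the Rx queue recorded for the socket and the current CPU; a NULL map simply means that stage is not configured.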
Re: [Cake] [PATCH net-next v12 3/7] sch_cake: Add optional ACK filter
On 05/17/2018 07:36 PM, Ryan Mounce wrote:
> On 17 May 2018 at 22:41, Toke Høiland-Jørgensen wrote:
>> Eric Dumazet writes:
>>
>>> On 05/17/2018 04:23 AM, Toke Høiland-Jørgensen wrote:
>>>
>>>> We don't do full parsing of SACKs, no; we were trying to keep things
>>>> simple... We do detect the presence of SACK options, though, and the
>>>> presence of SACK options on an ACK will make previous ACKs be considered
>>>> redundant.
>>>
>>> But they are not redundant in some cases, particularly when reorders
>>> happen in the network.
>>
>> Huh. I was under the impression that SACKs were basically cumulative
>> until cleared.
>>
>> I.e., in packet sequence ABCDE where B and D are lost, C would have
>> SACK(B) and E would have SACK(B,D). Are you saying that E would only
>> have SACK(D)?
>
> SACK works by acknowledging additional ranges above those that have
> been ACKed, rather than ACKing up to the largest seen sequence number
> and reporting missing ranges before that.
>
> A - ACK(A)
> B - lost
> C - ACK(A) + SACK(C)
> D - lost
> E - ACK(A) + SACK(C, E)
>
> Cake does check that the ACK sequence number is greater, or if it is
> equal and the 'newer' ACK has the SACK option present. It doesn't
> compare the sequence numbers inside two SACKs. If the two SACKs in the
> above example had been reordered before reaching cake's ACK filter in
> aggressive mode, the wrong one will be filtered.
>
> This is a limitation of my naive SACK handling in cake. The default
> 'conservative' mode happens to mitigate the problem in the above
> scenario, but the issue could still present itself in more
> pathological cases. It's fixable, however I'm not sure this corner
> case is sufficiently common or severe to warrant the extra complexity.

The extra complexity is absolutely required for inclusion in upstream
Linux.

I recommend reading RFC 2018, the whole of section 4 (Generating Sack
Options: Data Receiver Behavior).

The proposed ACK filter in Cake breaks the protocol, since the first
rule is not respected:

* The first SACK block (i.e., the one immediately following the
  kind and length fields in the option) MUST specify the contiguous
  block of data containing the segment which triggered this ACK,
  unless that segment advanced the Acknowledgment Number field in
  the header. This assures that the ACK with the SACK option
  reflects the most recent change in the data receiver's buffer
  queue.

An ACK filter must either:

- Not merge ACKs if they contain different SACK blocks, or
- Make a precise analysis of the SACK blocks to determine whether the
  merge is allowed, i.e. no useful information is lost.

The sender should get all the information about which segments were
received correctly, assuming no ACKs are dropped because of congestion
on the return path.
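The first of the two options above (never merge ACKs whose SACK blocks differ) could look roughly like the following userspace sketch. It is not the sch_cake code; struct ack_info, same_sack_blocks() and ack_is_redundant() are hypothetical names, and the SACK blocks are assumed to have been parsed out of the TCP options already.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SACK_BLOCKS 4

struct sack_block {
	uint32_t left;   /* first sequence number of the block */
	uint32_t right;  /* sequence number just past the block */
};

/* Hypothetical parsed view of a pure ACK. */
struct ack_info {
	uint32_t ack_seq;
	size_t n_sack;
	struct sack_block sack[MAX_SACK_BLOCKS];
};

/* Wraparound-safe "a is at or after b" for 32-bit sequence numbers. */
static bool seq_after_eq(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) >= 0;
}

/* Blocks are compared in order, because the first block carries meaning. */
static bool same_sack_blocks(const struct ack_info *a, const struct ack_info *b)
{
	size_t i;

	if (a->n_sack != b->n_sack)
		return false;
	for (i = 0; i < a->n_sack; i++) {
		if (a->sack[i].left != b->sack[i].left ||
		    a->sack[i].right != b->sack[i].right)
			return false;
	}
	return true;
}

/*
 * Conservative rule: the older ACK may be dropped only if the newer one
 * acknowledges at least as much and carries identical SACK blocks.
 */
static bool ack_is_redundant(const struct ack_info *older,
			     const struct ack_info *newer)
{
	assert(older->n_sack <= MAX_SACK_BLOCKS);
	assert(newer->n_sack <= MAX_SACK_BLOCKS);

	if (!seq_after_eq(newer->ack_seq, older->ack_seq))
		return false;
	return same_sack_blocks(older, newer);
}
```

This trades filtering opportunities for safety: any change in the SACK blocks keeps both ACKs, so the sender never loses SACK information to the filter.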
Re: [net-next PATCH v2 1/4] net: Refactor XPS for CPUs and Rx queues
On Tue, May 15, 2018 at 6:26 PM, Amritha Nambiarwrote: > Refactor XPS code to support Tx queue selection based on > CPU map or Rx queue map. > > Signed-off-by: Amritha Nambiar > --- > include/linux/cpumask.h | 11 ++ > include/linux/netdevice.h | 72 +++- > net/core/dev.c| 208 > + > net/core/net-sysfs.c |4 - > 4 files changed, 215 insertions(+), 80 deletions(-) > > diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h > index bf53d89..57f20a0 100644 > --- a/include/linux/cpumask.h > +++ b/include/linux/cpumask.h > @@ -115,12 +115,17 @@ extern struct cpumask __cpu_active_mask; > #define cpu_active(cpu)((cpu) == 0) > #endif > > -/* verify cpu argument to cpumask_* operators */ > -static inline unsigned int cpumask_check(unsigned int cpu) > +static inline void cpu_max_bits_warn(unsigned int cpu, unsigned int bits) > { > #ifdef CONFIG_DEBUG_PER_CPU_MAPS > - WARN_ON_ONCE(cpu >= nr_cpumask_bits); > + WARN_ON_ONCE(cpu >= bits); > #endif /* CONFIG_DEBUG_PER_CPU_MAPS */ > +} > + > +/* verify cpu argument to cpumask_* operators */ > +static inline unsigned int cpumask_check(unsigned int cpu) > +{ > + cpu_max_bits_warn(cpu, nr_cpumask_bits); > return cpu; > } > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h > index 03ed492..c2eeb36 100644 > --- a/include/linux/netdevice.h > +++ b/include/linux/netdevice.h > @@ -730,10 +730,21 @@ struct xps_map { > */ > struct xps_dev_maps { > struct rcu_head rcu; > - struct xps_map __rcu *cpu_map[0]; > + struct xps_map __rcu *attr_map[0]; > }; > -#define XPS_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) + \ > + > +#define XPS_CPU_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) + \ > (nr_cpu_ids * (_tcs) * sizeof(struct xps_map *))) > + > +#define XPS_RXQ_DEV_MAPS_SIZE(_tcs, _rxqs) (sizeof(struct xps_dev_maps) +\ > + (_rxqs * (_tcs) * sizeof(struct xps_map *))) > + > +enum xps_map_type { > + XPS_MAP_RXQS, > + XPS_MAP_CPUS, > + __XPS_MAP_MAX > +}; > + > #endif /* CONFIG_XPS */ > > #define TC_MAX_QUEUE 16 > @@ 
-1891,7 +1902,7 @@ struct net_device { > int watchdog_timeo; > > #ifdef CONFIG_XPS > - struct xps_dev_maps __rcu *xps_maps; > + struct xps_dev_maps __rcu *xps_maps[__XPS_MAP_MAX]; > #endif > #ifdef CONFIG_NET_CLS_ACT > struct mini_Qdisc __rcu *miniq_egress; > @@ -3229,6 +3240,61 @@ static inline void netif_wake_subqueue(struct > net_device *dev, u16 queue_index) > #ifdef CONFIG_XPS > int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask, > u16 index); > +int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask, > + u16 index, enum xps_map_type type); > + > +static inline bool attr_test_mask(unsigned long j, const unsigned long *mask, > + unsigned int nr_bits) > +{ > + cpu_max_bits_warn(j, nr_bits); > + return test_bit(j, mask); > +} > + > +static inline bool attr_test_online(unsigned long j, > + const unsigned long *online_mask, > + unsigned int nr_bits) > +{ > + cpu_max_bits_warn(j, nr_bits); > + > + if (online_mask) > + return test_bit(j, online_mask); > + > + if (j >= 0 && j < nr_bits) > + return true; > + > + return false; > +} > + > +static inline unsigned int attrmask_next(int n, const unsigned long *srcp, > +unsigned int nr_bits) > +{ > + /* -1 is a legal arg here. */ > + if (n != -1) > + cpu_max_bits_warn(n, nr_bits); > + > + if (srcp) > + return find_next_bit(srcp, nr_bits, n + 1); > + > + return n + 1; > +} > + > +static inline int attrmask_next_and(int n, const unsigned long *src1p, > + const unsigned long *src2p, > + unsigned int nr_bits) > +{ > + /* -1 is a legal arg here. 
*/ > + if (n != -1) > + cpu_max_bits_warn(n, nr_bits); > + > + if (src1p && src2p) > + return find_next_and_bit(src1p, src2p, nr_bits, n + 1); > + else if (src1p) > + return find_next_bit(src1p, nr_bits, n + 1); > + else if (src2p) > + return find_next_bit(src2p, nr_bits, n + 1); > + > + return n + 1; > +} > #else > static inline int netif_set_xps_queue(struct net_device *dev, > const struct cpumask *mask, > diff --git a/net/core/dev.c b/net/core/dev.c > index 9f43901..7e5dfdb 100644 > --- a/net/core/dev.c > +++ b/net/core/dev.c > @@ -2092,7 +2092,7 @@ static bool remove_xps_queue(struct xps_dev_maps > *dev_maps, >
Re: [net-next PATCH v2 2/4] net: Enable Tx queue selection based on Rx queues
On Tue, May 15, 2018 at 6:26 PM, Amritha Nambiarwrote: > This patch adds support to pick Tx queue based on the Rx queue map > configuration set by the admin through the sysfs attribute > for each Tx queue. If the user configuration for receive > queue map does not apply, then the Tx queue selection falls back > to CPU map based selection and finally to hashing. > > Signed-off-by: Amritha Nambiar > Signed-off-by: Sridhar Samudrala > --- > include/net/sock.h | 18 ++ > net/core/dev.c | 36 +--- > net/core/sock.c |5 + > net/ipv4/tcp_input.c |7 +++ > net/ipv4/tcp_ipv4.c |1 + > net/ipv4/tcp_minisocks.c |1 + > 6 files changed, 61 insertions(+), 7 deletions(-) > > diff --git a/include/net/sock.h b/include/net/sock.h > index 4f7c584..0613f63 100644 > --- a/include/net/sock.h > +++ b/include/net/sock.h > @@ -139,6 +139,8 @@ typedef __u64 __bitwise __addrpair; > * @skc_node: main hash linkage for various protocol lookup tables > * @skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol > * @skc_tx_queue_mapping: tx queue number for this connection > + * @skc_rx_queue_mapping: rx queue number for this connection > + * @skc_rx_ifindex: rx ifindex for this connection > * @skc_flags: place holder for sk_flags > * %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE, > * %SO_OOBINLINE settings, %SO_TIMESTAMPING settings > @@ -215,6 +217,10 @@ struct sock_common { > struct hlist_nulls_node skc_nulls_node; > }; > int skc_tx_queue_mapping; > +#ifdef CONFIG_XPS > + int skc_rx_queue_mapping; > + int skc_rx_ifindex; Isn't this increasing size of sock_common for a narrow use case functionality? 
> +#endif > union { > int skc_incoming_cpu; > u32 skc_rcv_wnd; > @@ -326,6 +332,10 @@ struct sock { > #define sk_nulls_node __sk_common.skc_nulls_node > #define sk_refcnt __sk_common.skc_refcnt > #define sk_tx_queue_mapping__sk_common.skc_tx_queue_mapping > +#ifdef CONFIG_XPS > +#define sk_rx_queue_mapping__sk_common.skc_rx_queue_mapping > +#define sk_rx_ifindex __sk_common.skc_rx_ifindex > +#endif > > #define sk_dontcopy_begin __sk_common.skc_dontcopy_begin > #define sk_dontcopy_end__sk_common.skc_dontcopy_end > @@ -1696,6 +1706,14 @@ static inline int sk_tx_queue_get(const struct sock > *sk) > return sk ? sk->sk_tx_queue_mapping : -1; > } > > +static inline void sk_mark_rx_queue(struct sock *sk, struct sk_buff *skb) > +{ > +#ifdef CONFIG_XPS > + sk->sk_rx_ifindex = skb->skb_iif; > + sk->sk_rx_queue_mapping = skb_get_rx_queue(skb); > +#endif > +} > + > static inline void sk_set_socket(struct sock *sk, struct socket *sock) > { > sk_tx_queue_clear(sk); > diff --git a/net/core/dev.c b/net/core/dev.c > index 7e5dfdb..4030368 100644 > --- a/net/core/dev.c > +++ b/net/core/dev.c > @@ -3458,18 +3458,14 @@ sch_handle_egress(struct sk_buff *skb, int *ret, > struct net_device *dev) > } > #endif /* CONFIG_NET_EGRESS */ > > -static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb) > -{ > #ifdef CONFIG_XPS > - struct xps_dev_maps *dev_maps; > +static int __get_xps_queue_idx(struct net_device *dev, struct sk_buff *skb, > + struct xps_dev_maps *dev_maps, unsigned int > tci) > +{ > struct xps_map *map; > int queue_index = -1; > > - rcu_read_lock(); > - dev_maps = rcu_dereference(dev->xps_maps[XPS_MAP_CPUS]); > if (dev_maps) { > - unsigned int tci = skb->sender_cpu - 1; > - > if (dev->num_tc) { > tci *= dev->num_tc; > tci += netdev_get_prio_tc_map(dev, skb->priority); > @@ -3486,6 +3482,32 @@ static inline int get_xps_queue(struct net_device > *dev, struct sk_buff *skb) > queue_index = -1; > } > } > + return queue_index; > +} > +#endif > + > +static int 
get_xps_queue(struct net_device *dev, struct sk_buff *skb) > +{ > +#ifdef CONFIG_XPS > + enum xps_map_type i = XPS_MAP_RXQS; > + struct xps_dev_maps *dev_maps; > + struct sock *sk = skb->sk; > + int queue_index = -1; > + unsigned int tci = 0; > + > + if (sk && sk->sk_rx_queue_mapping <= dev->real_num_rx_queues && > + dev->ifindex == sk->sk_rx_ifindex) > + tci = sk->sk_rx_queue_mapping; > + > + rcu_read_lock(); > + while (queue_index < 0 && i < __XPS_MAP_MAX) { > + if (i == XPS_MAP_CPUS) This while loop typifies exactly why I don't think the XPS maps should be
Re: [PATCH bpf-next v3 00/15] Introducing AF_XDP support
On 5/16/18 11:46 PM, Björn Töpel wrote:
> 2018-05-04 1:38 GMT+02:00 Alexei Starovoitov:
>> On Fri, May 04, 2018 at 12:49:09AM +0200, Daniel Borkmann wrote:
>>> On 05/02/2018 01:01 PM, Björn Töpel wrote:
>>>> From: Björn Töpel
>>>>
>>>> This patch set introduces a new address family called AF_XDP that is
>>>> optimized for high performance packet processing and, in upcoming
>>>> patch sets, zero-copy semantics. In this patch set, we have removed
>>>> all zero-copy related code in order to make it smaller, simpler and
>>>> hopefully more review friendly. This patch set only supports copy-mode
>>>> for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
>>>> for RX using the XDP_DRV path. Zero-copy support requires XDP and
>>>> driver changes that Jesper Dangaard Brouer is working on. Some of his
>>>> work has already been accepted. We will publish our zero-copy support
>>>> for RX and TX on top of his patch sets at a later point in time.
>>>
>>> +1, would be great to see it land this cycle. Saw few minor nits here
>>> and there but nothing to hold it up, for the series:
>>>
>>> Acked-by: Daniel Borkmann
>>
>> Thanks everyone! Great stuff!
>>
>> Applied to bpf-next, with one condition.
>> Upcoming zero-copy patches for both RX and TX need to be posted
>> and reviewed within this release window.
>> If netdev community as a whole won't be able to agree on the zero-copy
>> bits we'd need to revert this feature before the next merge window.
>>
>> Few other minor nits:
>> patch 3:
>> +struct xdp_ring {
>> +	__u32 producer __attribute__((aligned(64)));
>> +	__u32 consumer __attribute__((aligned(64)));
>> +};
>> It kinda begs for cacheline_aligned_in_smp to be introduced for uapi
>> headers.
>
> Hmm, I need some guidance on what a sane uapi variant would be. We
> can't have the uapi depend on the kernel build. ARM64, e.g., can have
> both 64B and 128B according to the specs. Contemporary IA processors
> have 64B.
>
> The simplest, and maybe most future-proof, would be 128B aligned for
> all. Another is having 128B for ARM and 64B for all IA. A third option
> is having a hand-shaking API (I think virtio has that) for determining
> the cache line size, but I'd rather not go down that route.
>
> Thoughts/ideas on what a uapi cacheline_aligned_in_smp version would
> look like?

I suspect i40e+arm combination wasn't tested anyway.
The api may have endianness issues too on something like sparc.
I think the way to be backwards compatible in this area
is to make the api usable on x86 only by adding
to include/uapi/linux/if_xdp.h

#if defined(__x86_64__)
#define AF_XDP_CACHE_BYTES 64
#else
#error "AF_XDP support is not yet available for this architecture"
#endif

and doing:

	__u32 producer __attribute__((aligned(AF_XDP_CACHE_BYTES)));
	__u32 consumer __attribute__((aligned(AF_XDP_CACHE_BYTES)));

And progressively add to this for arm64 and few other archs.
Eventually removing #error and adding some generic define
that's good enough for long tail of architectures that
we really cannot test.
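To make the layout consequence concrete, here is a small userspace sketch of what the aligned(64) proposal implies for the ring header. It assumes GCC or Clang (for the attribute syntax) and C11; struct xdp_ring_sketch and the two helper functions are made-up names for illustration, and the 64-byte value is just the x86-64 figure from the proposal, not a statement about other architectures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Value proposed above for x86-64; other architectures are not covered. */
#define AF_XDP_CACHE_BYTES 64

/* Illustrative copy of the ring header with both indices aligned. */
struct xdp_ring_sketch {
	uint32_t producer __attribute__((aligned(AF_XDP_CACHE_BYTES)));
	uint32_t consumer __attribute__((aligned(AF_XDP_CACHE_BYTES)));
};

/* Each index starts its own cache line, so they never share one. */
static_assert(offsetof(struct xdp_ring_sketch, consumer) == AF_XDP_CACHE_BYTES,
	      "consumer must start on its own cache line");

/* Offset of the consumer index within the header. */
static size_t ring_consumer_offset(void)
{
	return offsetof(struct xdp_ring_sketch, consumer);
}

/* Total header size, including the padding after each index. */
static size_t ring_header_size(void)
{
	return sizeof(struct xdp_ring_sketch);
}
```

The point of the layout is that the producer and consumer indices, which are written by different sides, land on different cache lines. Because the offsets become part of the user-visible ABI, changing the alignment later changes where userspace must look, which is why the value has to be fixed per architecture.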
Re: pull-request: bpf 2018-05-18
From: Daniel Borkmann
Date: Fri, 18 May 2018 02:26:17 +0200

> The following pull-request contains BPF updates for your *net* tree.
>
> The main changes are:
 ...
> Please consider pulling these changes from:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Pulled.

> When this gets later merged into net-next there are two trivial
> BPF conflicts to resolve:
 ...

Thanks a lot for the conflict guidance.
[PATCH v2] net: qcom/emac: Allocate buffers from local node
Currently we use non-NUMA-aware allocation for the TPD and RRD buffers.
This patch switches to NUMA-friendly allocation.

Signed-off-by: Hemanth Puranik
---
Change since v1:
- Addressed comments related to ordering

 drivers/net/ethernet/qualcomm/emac/emac-mac.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/qualcomm/emac/emac-mac.c b/drivers/net/ethernet/qualcomm/emac/emac-mac.c
index 092718a..031f6e6 100644
--- a/drivers/net/ethernet/qualcomm/emac/emac-mac.c
+++ b/drivers/net/ethernet/qualcomm/emac/emac-mac.c
@@ -683,10 +683,11 @@ static int emac_tx_q_desc_alloc(struct emac_adapter *adpt,
 				struct emac_tx_queue *tx_q)
 {
 	struct emac_ring_header *ring_header = &adpt->ring_header;
+	int node = dev_to_node(adpt->netdev->dev.parent);
 	size_t size;
 
 	size = sizeof(struct emac_buffer) * tx_q->tpd.count;
-	tx_q->tpd.tpbuff = kzalloc(size, GFP_KERNEL);
+	tx_q->tpd.tpbuff = kzalloc_node(size, GFP_KERNEL, node);
 	if (!tx_q->tpd.tpbuff)
 		return -ENOMEM;
 
@@ -723,11 +724,12 @@ static void emac_rx_q_bufs_free(struct emac_adapter *adpt)
 static int emac_rx_descs_alloc(struct emac_adapter *adpt)
 {
 	struct emac_ring_header *ring_header = &adpt->ring_header;
+	int node = dev_to_node(adpt->netdev->dev.parent);
 	struct emac_rx_queue *rx_q = &adpt->rx_q;
 	size_t size;
 
 	size = sizeof(struct emac_buffer) * rx_q->rfd.count;
-	rx_q->rfd.rfbuff = kzalloc(size, GFP_KERNEL);
+	rx_q->rfd.rfbuff = kzalloc_node(size, GFP_KERNEL, node);
 	if (!rx_q->rfd.rfbuff)
 		return -ENOMEM;
 
-- 
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux
Foundation Collaborative Project.
[net-next:master 1230/1233] arch/sparc/include/asm/io_64.h:177:20: note: in expansion of macro 'writel'
tree: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git master head: 538e2de104cfb4ef1acb35af42427bff42adbe4d commit: 2652113ff043ca2ce1cb3be529b5ca9270c421d4 [1230/1233] net: ethernet: ti: Allow most drivers with COMPILE_TEST config: sparc64-allyesconfig (attached as .config) compiler: sparc64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0 reproduce: wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross git checkout 2652113ff043ca2ce1cb3be529b5ca9270c421d4 # save the attached .config to linux build tree make.cross ARCH=sparc64 All warnings (new ones prefixed by >>): In file included from arch/sparc/include/asm/bug.h:25:0, from include/linux/bug.h:5, from include/linux/thread_info.h:12, from include/asm-generic/preempt.h:5, from ./arch/sparc/include/generated/asm/preempt.h:1, from include/linux/preempt.h:81, from include/linux/spinlock.h:51, from drivers/net/ethernet/ti/davinci_cpdma.c:16: drivers/net/ethernet/ti/davinci_cpdma.c: In function 'cpdma_desc_pool_destroy': drivers/net/ethernet/ti/davinci_cpdma.c:194:7: warning: format '%d' expects argument of type 'int', but argument 4 has type 'size_t {aka long unsigned int}' [-Wformat=] "cpdma_desc_pool size %d != avail %d", ^ gen_pool_size(pool->gen_pool), ~ include/asm-generic/bug.h:92:69: note: in definition of macro '__WARN_printf' #define __WARN_printf(arg...) 
warn_slowpath_fmt(__FILE__, __LINE__, arg) ^~~ drivers/net/ethernet/ti/davinci_cpdma.c:193:2: note: in expansion of macro 'WARN' WARN(gen_pool_size(pool->gen_pool) != gen_pool_avail(pool->gen_pool), ^~~~ drivers/net/ethernet/ti/davinci_cpdma.c:194:7: warning: format '%d' expects argument of type 'int', but argument 5 has type 'size_t {aka long unsigned int}' [-Wformat=] "cpdma_desc_pool size %d != avail %d", ^ drivers/net/ethernet/ti/davinci_cpdma.c:196:7: gen_pool_avail(pool->gen_pool)); ~~ include/asm-generic/bug.h:92:69: note: in definition of macro '__WARN_printf' #define __WARN_printf(arg...) warn_slowpath_fmt(__FILE__, __LINE__, arg) ^~~ drivers/net/ethernet/ti/davinci_cpdma.c:193:2: note: in expansion of macro 'WARN' WARN(gen_pool_size(pool->gen_pool) != gen_pool_avail(pool->gen_pool), ^~~~ drivers/net/ethernet/ti/davinci_cpdma.c: In function 'cpdma_chan_submit': drivers/net/ethernet/ti/davinci_cpdma.c:1083:17: warning: passing argument 1 of 'writel' makes integer from pointer without a cast [-Wint-conversion] writel_relaxed(token, >sw_token); ^ In file included from arch/sparc/include/asm/io.h:5:0, from include/linux/scatterlist.h:9, from include/linux/dma-mapping.h:11, from drivers/net/ethernet/ti/davinci_cpdma.c:21: arch/sparc/include/asm/io_64.h:175:16: note: expected 'u32 {aka unsigned int}' but argument is of type 'void *' #define writel writel ^ >> arch/sparc/include/asm/io_64.h:177:20: note: in expansion of macro 'writel' static inline void writel(u32 l, volatile void __iomem *addr) ^~ drivers/net/ethernet/ti/davinci_cpdma.c: In function '__cpdma_chan_free': drivers/net/ethernet/ti/davinci_cpdma.c:1126:15: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast] token = (void *)desc_read(desc, sw_token); ^ -- In file included from arch/sparc/include/asm/bug.h:25:0, from include/linux/bug.h:5, from include/linux/thread_info.h:12, from include/asm-generic/preempt.h:5, from ./arch/sparc/include/generated/asm/preempt.h:1, from 
include/linux/preempt.h:81, from include/linux/spinlock.h:51, from drivers/net//ethernet/ti/davinci_cpdma.c:16: drivers/net//ethernet/ti/davinci_cpdma.c: In function 'cpdma_desc_pool_destroy': drivers/net//ethernet/ti/davinci_cpdma.c:194:7: warning: format '%d' expects argument of type 'int', but argument 4 has type 'size_t {aka long unsigned int}' [-Wformat=] "cpdma_desc_pool size %d != avail %d", ^ gen_pool_size(pool->gen_pool), ~ include/asm-generic/bug.h:92:69: note: in definition of macro '__WARN_printf' #define __WARN_printf(arg...) warn_slowpath_fmt(__FILE__, __LINE__, arg)
Re: [Cake] [PATCH net-next v12 3/7] sch_cake: Add optional ACK filter
On 17 May 2018 at 22:41, Toke Høiland-Jørgensen wrote:
> Eric Dumazet writes:
>
>> On 05/17/2018 04:23 AM, Toke Høiland-Jørgensen wrote:
>>
>>>
>>> We don't do full parsing of SACKs, no; we were trying to keep things
>>> simple... We do detect the presence of SACK options, though, and the
>>> presence of SACK options on an ACK will make previous ACKs be considered
>>> redundant.
>>>
>>
>> But they are not redundant in some cases, particularly when reorders
>> happen in the network.
>
> Huh. I was under the impression that SACKs were basically cumulative
> until cleared.
>
> I.e., in packet sequence ABCDE where B and D are lost, C would have
> SACK(B) and E would have SACK(B,D). Are you saying that E would only
> have SACK(D)?

SACK works by acknowledging additional ranges above those that have
been ACKed, rather than ACKing up to the largest seen sequence number
and reporting missing ranges before that.

A - ACK(A)
B - lost
C - ACK(A) + SACK(C)
D - lost
E - ACK(A) + SACK(C, E)

Cake does check that the ACK sequence number is greater, or if it is
equal and the 'newer' ACK has the SACK option present. It doesn't
compare the sequence numbers inside two SACKs. If the two SACKs in the
above example had been reordered before reaching cake's ACK filter in
aggressive mode, the wrong one will be filtered.

This is a limitation of my naive SACK handling in cake. The default
'conservative' mode happens to mitigate the problem in the above
scenario, but the issue could still present itself in more
pathological cases. It's fixable, however I'm not sure this corner
case is sufficiently common or severe to warrant the extra complexity.

Ryan.
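The A..E sequence above can be reproduced with a toy receiver model. This is a simplified sketch, not TCP code: segments are unit-sized and identified by index (A = 0 ... E = 4), cum_ack counts the in-order segments received so far, and struct ack, struct range and build_ack() are hypothetical names. Following RFC 2018 section 4, the block containing the segment that triggered the ACK is placed first; the remaining blocks are simply listed in ascending order, which is a simplification of the RFC's "most recently reported" ordering.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SEGS 16
#define MAX_SACK 4

struct range {
	size_t first;  /* first segment index in the block (inclusive) */
	size_t last;   /* last segment index in the block (inclusive) */
};

struct ack {
	size_t cum_ack;  /* segments [0, cum_ack) were received in order */
	size_t n_sack;
	struct range sack[MAX_SACK];
};

/* Build the ACK a receiver would send after segment `trigger` arrived. */
static struct ack build_ack(const bool received[MAX_SEGS], size_t trigger)
{
	struct ack a = { 0 };
	bool have_trigger_block = false;
	size_t i;

	assert(trigger < MAX_SEGS && received[trigger]);

	/* Cumulative ACK: everything up to the first hole. */
	while (a.cum_ack < MAX_SEGS && received[a.cum_ack])
		a.cum_ack++;

	/* The block containing the triggering segment is reported first. */
	if (trigger >= a.cum_ack) {
		struct range r = { trigger, trigger };

		while (r.first > a.cum_ack && received[r.first - 1])
			r.first--;
		while (r.last + 1 < MAX_SEGS && received[r.last + 1])
			r.last++;
		a.sack[a.n_sack++] = r;
		have_trigger_block = true;
	}

	/* Remaining out-of-order blocks, in ascending order. */
	for (i = a.cum_ack; i < MAX_SEGS && a.n_sack < MAX_SACK; i++) {
		size_t start;

		if (!received[i])
			continue;
		start = i;
		while (i + 1 < MAX_SEGS && received[i + 1])
			i++;
		if (have_trigger_block && a.sack[0].first == start)
			continue;  /* already reported as the first block */
		a.sack[a.n_sack].first = start;
		a.sack[a.n_sack].last = i;
		a.n_sack++;
	}
	return a;
}
```

With B and D lost, the ACK triggered by C carries one block (C), and the ACK triggered by E carries two blocks with E first and C second, while the cumulative ACK stays at A. Dropping the E-triggered ACK in favour of the C-triggered one would lose the information about E, which is the reordering hazard discussed in the thread.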
Re: [patch net-next RFC 04/12] dsa: set devlink port attrs for dsa ports
> >>> ds = dsa_switch_alloc(>dev, DSA_MAX_PORTS);
> >>>
> >>> It is allocating a switch with 12 ports. However only 4 of them have
> >>> names. So the core only creates slave devices for those 4.
> >>>
> >>> This is a useful test. Real hardware often has unused ports. A WiFi AP
> >>> with a 7 port switch which only uses 6 ports is often seen.
> >>
> >> The following patch should fix this:
> >>
> >>
> >> diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
> >> index adf50fbc4c13..a06c29ec91f0 100644
> >> --- a/net/dsa/dsa2.c
> >> +++ b/net/dsa/dsa2.c
> >> @@ -262,13 +262,14 @@ static int dsa_port_setup(struct dsa_port *dp)
> >>
> >>         memset(&dp->devlink_port, 0, sizeof(dp->devlink_port));
> >>
> >> +       if (dp->type == DSA_PORT_TYPE_UNUSED)
> >> +               return 0;
> >> +
> >>         err = devlink_port_register(ds->devlink, &dp->devlink_port,
> >>                                     dp->index);
> >
> > Hi Florian, Jiri
> >
> > Maybe it is better to add a devlink port type unused?
>
> The port does not exist on the switch, so it should not even be
> registered IMHO.

Hi Florian

The ports do exist, when you called dsa_switch_alloc() you said the
switch has 12 ports.

Andrew
[for-next 11/15] net/mlx5e: Add ingress/egress indication for offloaded TC flows
From: Or Gerlitz

When an e-switch TC rule is offloaded through the egdev (egress device)
mechanism, we treat this as egress; all other cases (NIC and e-switch)
are considered ingress.

This is a preparation step that will allow us to identify "wrong" stat/del
offload calls made by the TC core on egdev based flows and ignore them.

Signed-off-by: Or Gerlitz
Signed-off-by: Jiri Pirko
Reviewed-by: Paul Blakey
Signed-off-by: Saeed Mahameed
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h | 3 --
 .../net/ethernet/mellanox/mlx5/core/en_main.c | 15
 .../net/ethernet/mellanox/mlx5/core/en_rep.c | 32
 .../net/ethernet/mellanox/mlx5/core/en_tc.c | 38 ++-
 .../net/ethernet/mellanox/mlx5/core/en_tc.h | 13 +--
 5 files changed, 70 insertions(+), 31 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 7c930088e96e..51a1d36a56c5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -1118,9 +1118,6 @@ int mlx5e_ethtool_get_ts_info(struct mlx5e_priv *priv,
 int mlx5e_ethtool_flash_device(struct mlx5e_priv *priv,
 			       struct ethtool_flash *flash);
 
-int mlx5e_setup_tc_block_cb(enum tc_setup_type type, void *type_data,
-			    void *cb_priv);
-
 /* mlx5e generic netdev management API */
 struct net_device*
 mlx5e_create_netdev(struct mlx5_core_dev *mdev, const struct mlx5e_profile *profile,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 417bf2e8ab85..27e8375a476b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -3136,22 +3136,23 @@ static int mlx5e_setup_tc_mqprio(struct net_device *netdev,
 
 #ifdef CONFIG_MLX5_ESWITCH
 static int mlx5e_setup_tc_cls_flower(struct mlx5e_priv *priv,
-				     struct tc_cls_flower_offload *cls_flower)
+				     struct tc_cls_flower_offload *cls_flower,
+				     int flags)
 {
 	switch (cls_flower->command) {
 	case TC_CLSFLOWER_REPLACE:
-		return mlx5e_configure_flower(priv, cls_flower);
+		return mlx5e_configure_flower(priv, cls_flower, flags);
 	case TC_CLSFLOWER_DESTROY:
-		return mlx5e_delete_flower(priv, cls_flower);
+		return mlx5e_delete_flower(priv, cls_flower, flags);
 	case TC_CLSFLOWER_STATS:
-		return mlx5e_stats_flower(priv, cls_flower);
+		return mlx5e_stats_flower(priv, cls_flower, flags);
 	default:
 		return -EOPNOTSUPP;
 	}
 }
 
-int mlx5e_setup_tc_block_cb(enum tc_setup_type type, void *type_data,
-			    void *cb_priv)
+static int mlx5e_setup_tc_block_cb(enum tc_setup_type type, void *type_data,
+				   void *cb_priv)
 {
 	struct mlx5e_priv *priv = cb_priv;
 
@@ -3160,7 +3161,7 @@ int mlx5e_setup_tc_block_cb(enum tc_setup_type type, void *type_data,
 
 	switch (type) {
 	case TC_SETUP_CLSFLOWER:
-		return mlx5e_setup_tc_cls_flower(priv, type_data);
+		return mlx5e_setup_tc_cls_flower(priv, type_data, MLX5E_TC_INGRESS);
 	default:
 		return -EOPNOTSUPP;
 	}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index a689f4c90fe3..182b636552a6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -723,15 +723,31 @@ static int mlx5e_rep_get_phys_port_name(struct net_device *dev,
 
 static int mlx5e_rep_setup_tc_cls_flower(struct mlx5e_priv *priv,
-			      struct tc_cls_flower_offload *cls_flower)
+			      struct tc_cls_flower_offload *cls_flower, int flags)
 {
 	switch (cls_flower->command) {
 	case TC_CLSFLOWER_REPLACE:
-		return mlx5e_configure_flower(priv, cls_flower);
+		return mlx5e_configure_flower(priv, cls_flower, flags);
 	case TC_CLSFLOWER_DESTROY:
-		return mlx5e_delete_flower(priv, cls_flower);
+		return mlx5e_delete_flower(priv, cls_flower, flags);
 	case TC_CLSFLOWER_STATS:
-		return mlx5e_stats_flower(priv, cls_flower);
+		return mlx5e_stats_flower(priv, cls_flower, flags);
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
+static int mlx5e_rep_setup_tc_cb_egdev(enum tc_setup_type type, void *type_data,
+				       void *cb_priv)
+{
+	struct mlx5e_priv *priv = cb_priv;
+
+	if (!tc_cls_can_offload_and_chain0(priv->netdev, type_data))
+
[for-next 14/15] net/mlx5e: Ignore attempts to offload multiple times a TC flow
From: Or Gerlitz

For VF->VF and uplink->VF rules, the TC core (cls_api) attempts to offload
the same flow multiple times into the driver, b/c we registered to the
egdev callback.

Use the flow cookie to ignore attempts to add such flows, we can't reject
them (return error), b/c this will fail the offload attempt, so we ignore
that.

We identify wrong stat/del calls using the flow ingress/egress flags, here
we do return error to the core.

Signed-off-by: Or Gerlitz
Signed-off-by: Jiri Pirko
Reviewed-by: Paul Blakey
Signed-off-by: Saeed Mahameed
---
 .../net/ethernet/mellanox/mlx5/core/en_tc.c | 21 +--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 05c90b4f8a31..674f1d7d2737 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -2666,6 +2666,12 @@ int mlx5e_configure_flower(struct mlx5e_priv *priv,
 
 	get_flags(flags, &flow_flags);
 
+	flow = rhashtable_lookup_fast(tc_ht, &f->cookie, tc_ht_params);
+	if (flow) {
+		netdev_warn_once(priv->netdev, "flow cookie %lx already exists, ignoring\n", f->cookie);
+		return 0;
+	}
+
 	if (esw && esw->mode == SRIOV_OFFLOADS) {
 		flow_flags |= MLX5E_TC_FLOW_ESWITCH;
 		attr_size = sizeof(struct mlx5_esw_flow_attr);
@@ -2728,6 +2734,17 @@ int mlx5e_configure_flower(struct mlx5e_priv *priv,
 	return err;
 }
 
+#define DIRECTION_MASK (MLX5E_TC_INGRESS | MLX5E_TC_EGRESS)
+#define FLOW_DIRECTION_MASK (MLX5E_TC_FLOW_INGRESS | MLX5E_TC_FLOW_EGRESS)
+
+static bool same_flow_direction(struct mlx5e_tc_flow *flow, int flags)
+{
+	if ((flow->flags & FLOW_DIRECTION_MASK) == (flags & DIRECTION_MASK))
+		return true;
+
+	return false;
+}
+
 int mlx5e_delete_flower(struct mlx5e_priv *priv,
 			struct tc_cls_flower_offload *f, int flags)
 {
@@ -2735,7 +2752,7 @@ int mlx5e_delete_flower(struct mlx5e_priv *priv,
 	struct mlx5e_tc_flow *flow;
 
 	flow = rhashtable_lookup_fast(tc_ht, &f->cookie, tc_ht_params);
-	if (!flow)
+	if (!flow || !same_flow_direction(flow, flags))
 		return -EINVAL;
 
 	rhashtable_remove_fast(tc_ht, &flow->node, tc_ht_params);
@@ -2758,7 +2775,7 @@ int mlx5e_stats_flower(struct mlx5e_priv *priv,
 	u64 lastuse;
 
 	flow = rhashtable_lookup_fast(tc_ht, &f->cookie, tc_ht_params);
-	if (!flow)
+	if (!flow || !same_flow_direction(flow, flags))
 		return -EINVAL;
 
 	if (!(flow->flags & MLX5E_TC_FLOW_OFFLOADED))
-- 
2.17.0
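The two checks this patch introduces can be modelled in plain C. The sketch below is a user-space illustration under stated assumptions, not driver code: the flag values, the `flow` struct, and the fixed-size array standing in for the hash table are all hypothetical, and a single set of direction bits is used for both the stored flow flags and the call flags. It shows a duplicate add being ignored with a success return, while a delete arriving with the wrong direction gets an error.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical direction flags; the values are illustrative only. */
enum { DIR_INGRESS = 1 << 0, DIR_EGRESS = 1 << 1 };
#define DIR_MASK (DIR_INGRESS | DIR_EGRESS)

struct flow {
	uint64_t cookie;  /* key identifying the TC flow */
	int flags;        /* direction recorded when the flow was added */
	bool used;
};

#define MAX_FLOWS 16
static struct flow flows[MAX_FLOWS];  /* stand-in for the hash table */

static struct flow *flow_lookup(uint64_t cookie)
{
	for (size_t i = 0; i < MAX_FLOWS; i++)
		if (flows[i].used && flows[i].cookie == cookie)
			return &flows[i];
	return NULL;
}

static bool same_flow_direction(const struct flow *flow, int flags)
{
	return (flow->flags & DIR_MASK) == (flags & DIR_MASK);
}

/* Add a flow; a repeated cookie is ignored and reported as success. */
static int flow_add(uint64_t cookie, int flags)
{
	if (flow_lookup(cookie))
		return 0;  /* duplicate offload attempt: ignore, do not fail */

	for (size_t i = 0; i < MAX_FLOWS; i++) {
		if (!flows[i].used) {
			flows[i] = (struct flow){ cookie, flags & DIR_MASK, true };
			return 0;
		}
	}
	return -ENOSPC;
}

/* Delete a flow; a call with the wrong direction is rejected. */
static int flow_del(uint64_t cookie, int flags)
{
	struct flow *flow = flow_lookup(cookie);

	if (!flow || !same_flow_direction(flow, flags))
		return -EINVAL;
	flow->used = false;
	return 0;
}
```

Returning 0 for the duplicate add mirrors the reasoning in the commit message: an error there would fail the whole offload attempt, whereas an error on a mismatched stat/del call only tells the caller that this particular call was not for us.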
[for-next 12/15] net/mlx5e: Prepare for shared table to keep TC eswitch flows
From: Or Gerlitz

This is a refactoring step to be able to store the hash table which keeps
track of offloaded TC flows in a different location for NIC vs e-switch
rules.

Signed-off-by: Or Gerlitz
Signed-off-by: Jiri Pirko
Reviewed-by: Paul Blakey
Signed-off-by: Saeed Mahameed
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h | 1 -
 .../net/ethernet/mellanox/mlx5/core/en_tc.c | 39 ++-
 2 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 51a1d36a56c5..bc91a7335c93 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -634,7 +634,6 @@ struct mlx5e_flow_table {
 struct mlx5e_tc_table {
 	struct mlx5_flow_table *t;
 
-	struct rhashtable_params ht_params;
 	struct rhashtable ht;
 
 	DECLARE_HASHTABLE(mod_hdr_tbl, 8);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 26a1312ec9f8..1c90586d7f58 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -2634,12 +2634,24 @@ static void get_flags(int flags, u8 *flow_flags)
 	*flow_flags = __flow_flags;
 }
 
+static const struct rhashtable_params tc_ht_params = {
+	.head_offset = offsetof(struct mlx5e_tc_flow, node),
+	.key_offset = offsetof(struct mlx5e_tc_flow, cookie),
+	.key_len = sizeof(((struct mlx5e_tc_flow *)0)->cookie),
+	.automatic_shrinking = true,
+};
+
+static struct rhashtable *get_tc_ht(struct mlx5e_priv *priv)
+{
+	return &priv->fs.tc.ht;
+}
+
 int mlx5e_configure_flower(struct mlx5e_priv *priv,
 			   struct tc_cls_flower_offload *f, int flags)
 {
 	struct mlx5_eswitch *esw = priv->mdev->priv.eswitch;
 	struct mlx5e_tc_flow_parse_attr *parse_attr;
-	struct mlx5e_tc_table *tc = &priv->fs.tc;
+	struct rhashtable *tc_ht = get_tc_ht(priv);
 	struct mlx5e_tc_flow *flow;
 	int attr_size, err = 0;
 	u8 flow_flags = 0;
@@ -2693,8 +2705,7 @@ int mlx5e_configure_flower(struct mlx5e_priv *priv,
 	    !(flow->esw_attr->action & MLX5_FLOW_CONTEXT_ACTION_ENCAP))
 		kvfree(parse_attr);
 
-	err = rhashtable_insert_fast(&tc->ht, &flow->node,
-				     tc->ht_params);
+	err = rhashtable_insert_fast(tc_ht, &flow->node, tc_ht_params);
 	if (err) {
 		mlx5e_tc_del_flow(priv, flow);
 		kfree(flow);
@@ -2711,15 +2722,14 @@ int mlx5e_configure_flower(struct mlx5e_priv *priv,
 int mlx5e_delete_flower(struct mlx5e_priv *priv,
 			struct tc_cls_flower_offload *f, int flags)
 {
+	struct rhashtable *tc_ht = get_tc_ht(priv);
 	struct mlx5e_tc_flow *flow;
-	struct mlx5e_tc_table *tc = &priv->fs.tc;
 
-	flow = rhashtable_lookup_fast(&tc->ht, &f->cookie,
-				      tc->ht_params);
+	flow = rhashtable_lookup_fast(tc_ht, &f->cookie, tc_ht_params);
 	if (!flow)
 		return -EINVAL;
 
-	rhashtable_remove_fast(&tc->ht, &flow->node, tc->ht_params);
+	rhashtable_remove_fast(tc_ht, &flow->node, tc_ht_params);
 
 	mlx5e_tc_del_flow(priv, flow);
 
@@ -2731,15 +2741,14 @@ int mlx5e_delete_flower(struct mlx5e_priv *priv,
 int mlx5e_stats_flower(struct mlx5e_priv *priv,
 		       struct tc_cls_flower_offload *f, int flags)
 {
-	struct mlx5e_tc_table *tc = &priv->fs.tc;
+	struct rhashtable *tc_ht = get_tc_ht(priv);
 	struct mlx5e_tc_flow *flow;
 	struct mlx5_fc *counter;
 	u64 bytes;
 	u64 packets;
 	u64 lastuse;
 
-	flow = rhashtable_lookup_fast(&tc->ht, &f->cookie,
-				      tc->ht_params);
+	flow = rhashtable_lookup_fast(tc_ht, &f->cookie, tc_ht_params);
 	if (!flow)
 		return -EINVAL;
 
@@ -2757,13 +2766,6 @@ int mlx5e_stats_flower(struct mlx5e_priv *priv,
 	return 0;
 }
 
-static const struct rhashtable_params mlx5e_tc_flow_ht_params = {
-	.head_offset = offsetof(struct mlx5e_tc_flow, node),
-	.key_offset = offsetof(struct mlx5e_tc_flow, cookie),
-	.key_len = sizeof(((struct mlx5e_tc_flow *)0)->cookie),
-	.automatic_shrinking = true,
-};
-
 int mlx5e_tc_init(struct mlx5e_priv *priv)
 {
 	struct mlx5e_tc_table *tc = &priv->fs.tc;
@@ -2771,8 +2773,7 @@ int mlx5e_tc_init(struct mlx5e_priv *priv)
 	hash_init(tc->mod_hdr_tbl);
 	hash_init(tc->hairpin_tbl);
 
-	tc->ht_params = mlx5e_tc_flow_ht_params;
-	return rhashtable_init(&tc->ht, &tc->ht_params);
+	return rhashtable_init(&tc->ht, &tc_ht_params);
 }
 
 static void _mlx5e_tc_del_flow(void *ptr, void *arg)
-- 
2.17.0
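The `tc_ht_params` initializer in this patch describes where the key lives inside each object by offset and length. The C sketch below shows the general idea with a hypothetical `table_params` struct and a linear `lookup()` over an array. It is not the rhashtable implementation, only a demonstration of how `offsetof()` and `sizeof()` on a member let generic code find and compare a key without knowing the object type.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified counterpart of a params struct. */
struct table_params {
	size_t key_offset;  /* where the key starts inside an object */
	size_t key_len;     /* key size in bytes */
};

struct tc_flow {
	int flags;
	uint64_t cookie;    /* the lookup key */
};

static const struct table_params flow_params = {
	.key_offset = offsetof(struct tc_flow, cookie),
	.key_len = sizeof(((struct tc_flow *)0)->cookie),
};

/* Generic lookup: compares raw key bytes at key_offset in each object. */
static void *lookup(void *objs, size_t nobjs, size_t obj_size,
		    const void *key, const struct table_params *p)
{
	char *base = objs;

	for (size_t i = 0; i < nobjs; i++) {
		char *obj = base + i * obj_size;

		if (memcmp(obj + p->key_offset, key, p->key_len) == 0)
			return obj;
	}
	return NULL;
}
```

Because the parameters are a `static const` object rather than a per-table copy, every caller can pass the same description, which is what lets the patch drop the `ht_params` member from the table struct.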
[pull request][for-next 00/15] Mellanox, mlx5 core and netdev updates 2018-05-17
Hi Dave and Doug,

Below you can find two pull requests,

1. mlx5 core updates to be shared for both netdev and RDMA, (patches 1..9)
which is based on the last mlx5-next pull request

The following changes since commit a8408f4e6db775e245f20edf12b13fd58cc03a1c:

  net/mlx5: fix spelling mistake: "modfiy" -> "modify" (2018-05-04 12:11:51 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux.git tags/mlx5-updates-2018-05-17

for you to fetch changes up to 10ff5359f883412728ba816046ee3a696625ca02:

  net/mlx5e: Explicitly set source e-switch in offloaded TC rules (2018-05-17 14:17:35 -0700)

2. mlx5e netdev updates only for net-next branch (patches 10..15) based on
net-next and the above pull request.

The following changes since commit 538e2de104cfb4ef1acb35af42427bff42adbe4d:

  Merge branch 'net-Allow-more-drivers-with-COMPILE_TEST' (2018-05-17 17:11:07 -0400)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git tags/mlx5e-updates-2018-05-17

for you to fetch changes up to a228060a7c9ab88597eeac131e4578595d5d46ae:

  net/mlx5e: Add HW vport counters to representor ethtool stats (2018-05-17 17:48:54 -0700)

Dave, for your convenience you can either pull 1. and then 2. or pull 2.
directly.

For more information please see tags logs below.

Please pull and let me know if there's any problem.

Thanks,
Saeed.

mlx5-updates-2018-05-17

mlx5 core driver updates for both net-next and rdma-next branches.

From Christophe JAILLET, first three patches to use kvfree where needed.

From: Or Gerlitz

Next six patches from Roi and Co add support for merged sriov e-switch,
which comes to serve cases where both PFs, VFs set on them and both uplinks
are to be used in single v-switch SW model. When merged e-switch is
supported, the per-port e-switch is logically merged into one e-switch that
spans both physical ports and all the VFs.

This model allows offloading TC eswitch rules between VFs belonging to
different PFs (and hence have different eswitch affinity), it also sets
some of the foundations needed for uplink LAG support.

mlx5e-updates-2018-05-17

From: Or Gerlitz

This series addresses a regression introduced by the shared block TC
changes [1]. Currently, for VF->VF and uplink->VF rules, the TC core
(cls_api) attempts to offload the same flow multiple times into the driver,
as a side effect of the mlx5 registration to the egdev callback.

We use the flow cookie to ignore attempts to add such flows, we can't
reject them (return error), b/c this will fail the offload attempt, so we
ignore that.

The last patch of the series deals with exposing HW stats counters through
ethtool for the vport reps.

Dave - the regression that we are addressing was introduced in 4.15 [1] and
applies to nfp and mlx5. Jiri suggested to push driver side fixes to
net-next, this is already done for nfp [2][3].

Once this is upstream, we will submit a small/point single patch fix for
the TC core code which can serve for net and stable, but not carried into
net-next, b/c it might limit some future use-cases.

[1] 208c0f4b5237 "net: sched: use tc_setup_cb_call to call per-block callbacks"
[2] c50647d "nfp: flower: ignore duplicate cb requests for same rule"
[3] 54a4a03 "nfp: flower: support offloading multiple rules with same cookie"

Christophe JAILLET (3):
      net/mlx5: Vport, Use 'kvfree()' for memory allocated by 'kvzalloc()'
      net/mlx5: Eswitch, Use 'kvfree()' for memory allocated by 'kvzalloc()'
      IB/mlx5: Use 'kvfree()' for memory allocated by 'kvzalloc()'

Or Gerlitz (5):
      net/mlx5e: Add ingress/egress indication for offloaded TC flows
      net/mlx5e: Prepare for shared table to keep TC eswitch flows
      net/mlx5e: Use shared table for offloaded TC eswitch flows
      net/mlx5e: Ignore attempts to offload multiple times a TC flow
      net/mlx5e: Add HW vport counters to representor ethtool stats

Rabie Loulou (2):
      net/mlx5e: Explicitly set destination e-switch in FDB rules
      net/mlx5e: Offload TC eswitch rules for VFs belonging to different PFs

Roi Dayan (1):
      net/mlx5: Add merged e-switch cap

Saeed Mahameed (1):
      Merge tag 'mlx5-updates-2018-05-17' of git://git.kernel.org/.../mellanox/linux

Shahar Klein (4):
      net/mlx5: Properly handle a vport destination when setting FTE
      net/mlx5: Add destination e-switch owner
      net/mlx5: Add source e-switch owner
      net/mlx5e: Explicitly set source e-switch in offloaded TC rules

 drivers/infiniband/hw/mlx5/cq.c | 2 +-
 .../mellanox/mlx5/core/diag/fs_tracepoint.c | 2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en.h | 4 -
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 19 +--
[for-next 09/15] net/mlx5e: Explicitly set source e-switch in offloaded TC rules
From: Shahar Klein

Set a specific source e-switch when setting a rule that matches on the
ingress port.

Signed-off-by: Shahar Klein
Reviewed-by: Or Gerlitz
Reviewed-by: Roi Dayan
Signed-off-by: Saeed Mahameed
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 1 +
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.h | 1 +
 .../net/ethernet/mellanox/mlx5/core/eswitch_offloads.c | 8
 3 files changed, 10 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 880adc810ccc..1d2ba687b902 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -2462,6 +2462,7 @@ static int parse_tc_fdb_actions(struct mlx5e_priv *priv, struct tcf_exts *exts,
 
 	memset(attr, 0, sizeof(*attr));
 	attr->in_rep = rpriv->rep;
+	attr->in_mdev = priv->mdev;
 
 	tcf_exts_to_list(exts, );
 	list_for_each_entry(a, , list) {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index ac5db54823a1..98a306e02640 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -231,6 +231,7 @@ struct mlx5_esw_flow_attr {
 	struct mlx5_eswitch_rep *in_rep;
 	struct mlx5_eswitch_rep *out_rep;
 	struct mlx5_core_dev *out_mdev;
+	struct mlx5_core_dev *in_mdev;
 
 	int action;
 	__be16 vlan_proto;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index ea93867d1ab4..6c83eef5141a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -93,8 +93,16 @@ mlx5_eswitch_add_offloaded_rule(struct mlx5_eswitch *esw,
 	misc = MLX5_ADDR_OF(fte_match_param, spec->match_value, misc_parameters);
 	MLX5_SET(fte_match_set_misc, misc, source_port, attr->in_rep->vport);
 
+	if (MLX5_CAP_ESW(esw->dev, merged_eswitch))
+		MLX5_SET(fte_match_set_misc, misc,
+			 source_eswitch_owner_vhca_id,
+			 MLX5_CAP_GEN(attr->in_mdev, vhca_id));
+
 	misc = MLX5_ADDR_OF(fte_match_param, spec->match_criteria, misc_parameters);
 	MLX5_SET_TO_ONES(fte_match_set_misc, misc, source_port);
+	if (MLX5_CAP_ESW(esw->dev, merged_eswitch))
+		MLX5_SET_TO_ONES(fte_match_set_misc, misc,
+				 source_eswitch_owner_vhca_id);
 
 	spec->match_criteria_enable = MLX5_MATCH_OUTER_HEADERS |
 				      MLX5_MATCH_MISC_PARAMETERS;
-- 
2.17.0
[for-next 03/15] IB/mlx5: Use 'kvfree()' for memory allocated by 'kvzalloc()'
From: Christophe JAILLET

When 'kvzalloc()' is used to allocate memory, 'kvfree()' must be used to
free it.

Fixes: 1cbe6fc86ccfe ("IB/mlx5: Add support for CQE compressing")
Signed-off-by: Christophe JAILLET
Acked-by: Jason Gunthorpe
Signed-off-by: Saeed Mahameed
---
 drivers/infiniband/hw/mlx5/cq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
index 77d257ec899b..6d52ea03574e 100644
--- a/drivers/infiniband/hw/mlx5/cq.c
+++ b/drivers/infiniband/hw/mlx5/cq.c
@@ -849,7 +849,7 @@ static int create_cq_user(struct mlx5_ib_dev *dev, struct ib_udata *udata,
 	return 0;
 
 err_cqb:
-	kfree(*cqb);
+	kvfree(*cqb);
 
 err_db:
 	mlx5_ib_db_unmap_user(to_mucontext(context), >db);
-- 
2.17.0
[for-next 04/15] net/mlx5: Add merged e-switch cap
From: Roi Dayan

When merged e-switch is supported, the per-port e-switch is logically
merged into one e-switch that spans both physical ports and all the VFs.

Under merged eswitch, both the matching on source vport and setting
destination vport can have a 2nd attribute which is the vhca id of the
eswitch owner.

For example:

esw0: {match: action: fwd to }

is a flow set on eswitch0 matching on source vport=1 from his eswitch and
the action being fwd to dest vport=7 of eswitch1.

Signed-off-by: Roi Dayan
Reviewed-by: Shahar Klein
Reviewed-by: Or Gerlitz Klein
Signed-off-by: Saeed Mahameed
---
 include/linux/mlx5/mlx5_ifc.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index 1aad455538f4..ef15f751a984 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -557,7 +557,8 @@ struct mlx5_ifc_e_switch_cap_bits {
 	u8 vport_svlan_insert[0x1];
 	u8 vport_cvlan_insert_if_not_exist[0x1];
 	u8 vport_cvlan_insert_overwrite[0x1];
-	u8 reserved_at_5[0x19];
+	u8 reserved_at_5[0x18];
+	u8 merged_eswitch[0x1];
 	u8 nic_vport_node_guid_modify[0x1];
 	u8 nic_vport_port_guid_modify[0x1];
 
-- 
2.17.0
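The layout change can be checked with a little arithmetic. In mlx5_ifc.h each `u8 name[0xN]` entry describes a field N bits wide, and a `reserved_at_X` name records the bit offset X at which that field starts. The C sketch below is illustrative only: the `field_offset()` helper and the two width arrays are hypothetical, not kernel code. It sums field widths to show that shrinking the reserved field by one bit places `merged_eswitch` at bit 29 and leaves the fields after it where they were.

```c
#include <stddef.h>

/*
 * Sum the widths of the first 'n' fields to get the bit offset of field 'n'.
 * Hypothetical helper for illustration; not part of mlx5_ifc.h.
 */
static unsigned int field_offset(const unsigned int *widths, size_t n)
{
	unsigned int off = 0;

	for (size_t i = 0; i < n; i++)
		off += widths[i];
	return off;
}

/* Widths in bits, in declaration order. */
static const unsigned int before_patch[] = {
	5,     /* bits 0..4: fields before the reserved range
	        * (offset taken from the name reserved_at_5) */
	0x19,  /* reserved_at_5 */
	1,     /* nic_vport_node_guid_modify */
	1,     /* nic_vport_port_guid_modify */
};

static const unsigned int after_patch[] = {
	5,     /* bits 0..4: unchanged */
	0x18,  /* reserved_at_5, one bit shorter */
	1,     /* merged_eswitch (new) */
	1,     /* nic_vport_node_guid_modify */
	1,     /* nic_vport_port_guid_modify */
};
```

Carving the new capability out of the tail of the reserved range is what keeps the two GUID-modify bits at offsets 30 and 31 in both layouts.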
[for-next 05/15] net/mlx5: Properly handle a vport destination when setting FTE
From: Shahar Klein

When creating FTE, properly distinguish between destination being vport
or tir. The previous code just worked accidentally b/c of both dest being
in the same offset within a union.

Signed-off-by: Shahar Klein
Reviewed-by: Or Gerlitz
Reviewed-by: Roi Dayan
Signed-off-by: Saeed Mahameed
---
 drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
index ef5afd7c9325..0bfce6a82c91 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
@@ -372,6 +372,9 @@ static int mlx5_cmd_set_fte(struct mlx5_core_dev *dev,
 			if (dst->dest_attr.type ==
 			    MLX5_FLOW_DESTINATION_TYPE_FLOW_TABLE) {
 				id = dst->dest_attr.ft->id;
+			} else if (dst->dest_attr.type ==
+				   MLX5_FLOW_DESTINATION_TYPE_VPORT) {
+				id = dst->dest_attr.vport_num;
 			} else {
 				id = dst->dest_attr.tir_num;
 			}
-- 
2.17.0
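Why the old code worked by accident can be shown with a small tagged union. The sketch below is a user-space illustration with hypothetical type and field names (`dest`, `DEST_*`, `dest_id()`), not the mlx5 definitions. Because `vport_num` and `tir_num` share storage at the same offset, reading the wrong member happens to return the same bits, but selecting the member by type stays correct if the layout ever changes.

```c
#include <stdint.h>

enum dest_type { DEST_FLOW_TABLE, DEST_VPORT, DEST_TIR };

struct flow_table { uint32_t id; };

/* Hypothetical destination descriptor: one type tag plus a union. */
struct dest {
	enum dest_type type;
	union {
		uint32_t tir_num;
		uint32_t vport_num;
		struct flow_table *ft;
	} u;
};

/* Pick the id from the union member that matches the type tag. */
static uint32_t dest_id(const struct dest *d)
{
	if (d->type == DEST_FLOW_TABLE)
		return d->u.ft->id;
	else if (d->type == DEST_VPORT)
		return d->u.vport_num;
	else
		return d->u.tir_num;
}
```

The explicit vport branch matters once the vport destination carries more than a number, as later patches in this series add an e-switch owner id next to it.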
[for-next 15/15] net/mlx5e: Add HW vport counters to representor ethtool stats
From: Or Gerlitz

Currently the representor only reports the SW (slow-path) traffic counters.

Add packet/bytes reporting of the HW counters, which account for the total
amount of traffic that was handled by the vport, both slow and fast
(offloaded) paths. The newly exposed counters are named
vport_rx/tx_packets/bytes.

Signed-off-by: Or Gerlitz
Signed-off-by: Adi Nissim
Signed-off-by: Saeed Mahameed
---
 .../net/ethernet/mellanox/mlx5/core/en_rep.c | 35 +++
 1 file changed, 29 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index aa32592a54cb..c3034f58aa33 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -66,18 +66,36 @@ static const struct counter_desc sw_rep_stats_desc[] = {
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_bytes) },
 };
 
-#define NUM_VPORT_REP_COUNTERS ARRAY_SIZE(sw_rep_stats_desc)
+struct vport_stats {
+	u64 vport_rx_packets;
+	u64 vport_tx_packets;
+	u64 vport_rx_bytes;
+	u64 vport_tx_bytes;
+};
+
+static const struct counter_desc vport_rep_stats_desc[] = {
+	{ MLX5E_DECLARE_STAT(struct vport_stats, vport_rx_packets) },
+	{ MLX5E_DECLARE_STAT(struct vport_stats, vport_rx_bytes) },
+	{ MLX5E_DECLARE_STAT(struct vport_stats, vport_tx_packets) },
+	{ MLX5E_DECLARE_STAT(struct vport_stats, vport_tx_bytes) },
+};
+
+#define NUM_VPORT_REP_SW_COUNTERS ARRAY_SIZE(sw_rep_stats_desc)
+#define NUM_VPORT_REP_HW_COUNTERS ARRAY_SIZE(vport_rep_stats_desc)
 
 static void mlx5e_rep_get_strings(struct net_device *dev,
 				  u32 stringset, uint8_t *data)
 {
-	int i;
+	int i, j;
 
 	switch (stringset) {
 	case ETH_SS_STATS:
-		for (i = 0; i < NUM_VPORT_REP_COUNTERS; i++)
+		for (i = 0; i < NUM_VPORT_REP_SW_COUNTERS; i++)
 			strcpy(data + (i * ETH_GSTRING_LEN),
 			       sw_rep_stats_desc[i].format);
+		for (j = 0; j < NUM_VPORT_REP_HW_COUNTERS; j++, i++)
+			strcpy(data + (i * ETH_GSTRING_LEN),
+			       vport_rep_stats_desc[j].format);
 		break;
 	}
 }
@@ -140,7 +158,7 @@ static void mlx5e_rep_get_ethtool_stats(struct net_device *dev,
 					struct ethtool_stats *stats, u64 *data)
 {
 	struct mlx5e_priv *priv = netdev_priv(dev);
-	int i;
+	int i, j;
 
 	if (!data)
 		return;
@@ -148,18 +166,23 @@ static void mlx5e_rep_get_ethtool_stats(struct net_device *dev,
 	mutex_lock(&priv->state_lock);
 	if (test_bit(MLX5E_STATE_OPENED, &priv->state))
 		mlx5e_rep_update_sw_counters(priv);
+	mlx5e_rep_update_hw_counters(priv);
 	mutex_unlock(&priv->state_lock);
 
-	for (i = 0; i < NUM_VPORT_REP_COUNTERS; i++)
+	for (i = 0; i < NUM_VPORT_REP_SW_COUNTERS; i++)
 		data[i] = MLX5E_READ_CTR64_CPU(&priv->stats.sw,
 					       sw_rep_stats_desc, i);
+
+	for (j = 0; j < NUM_VPORT_REP_HW_COUNTERS; j++, i++)
+		data[i] = MLX5E_READ_CTR64_CPU(&priv->stats.vf_vport,
+					       vport_rep_stats_desc, j);
 }
 
 static int mlx5e_rep_get_sset_count(struct net_device *dev, int sset)
 {
 	switch (sset) {
 	case ETH_SS_STATS:
-		return NUM_VPORT_REP_COUNTERS;
+		return NUM_VPORT_REP_SW_COUNTERS + NUM_VPORT_REP_HW_COUNTERS;
 	default:
 		return -EOPNOTSUPP;
 	}
-- 
2.17.0
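The indexing pattern used in this patch (one running index `i` over the output, a second index `j` over the appended table) can be tried in isolation. The following user-space C sketch uses invented SW counter names and a hypothetical `NAME_LEN` in place of `ETH_GSTRING_LEN`. It is not the driver code, only an illustration of how two descriptor tables end up back to back in a single buffer.

```c
#include <stddef.h>
#include <string.h>

#define NAME_LEN 32  /* stand-in for ETH_GSTRING_LEN (assumed value) */
#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Illustrative name tables; the SW names are invented for this sketch. */
static const char *const sw_names[] = { "rx_packets", "tx_packets" };
static const char *const hw_names[] = {
	"vport_rx_packets", "vport_rx_bytes",
	"vport_tx_packets", "vport_tx_bytes",
};

#define NUM_SW ARRAY_SIZE(sw_names)
#define NUM_HW ARRAY_SIZE(hw_names)

/* Copy SW names first, then HW names, each into its own NAME_LEN slot. */
static size_t fill_strings(char *data)
{
	size_t i, j;

	for (i = 0; i < NUM_SW; i++)
		strcpy(data + i * NAME_LEN, sw_names[i]);
	for (j = 0; j < NUM_HW; j++, i++)
		strcpy(data + i * NAME_LEN, hw_names[j]);
	return i;  /* total slots written: NUM_SW + NUM_HW */
}
```

The string, value, and count callbacks all have to agree on this ordering and on the total, which is why the patch changes the three of them together.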
[for-next 07/15] net/mlx5e: Explicitly set destination e-switch in FDB rules
From: Rabie Loulou

Set a specific destination e-switch when setting a destination vport.

Signed-off-by: Rabie Loulou
Reviewed-by: Or Gerlitz
Reviewed-by: Roi Dayan
Reviewed-by: Shahar Klein
Signed-off-by: Saeed Mahameed
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 2 ++
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.h | 1 +
 drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c | 5 +
 3 files changed, 8 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 4197001f9801..880adc810ccc 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -836,6 +836,7 @@ mlx5e_tc_add_fdb_flow(struct mlx5e_priv *priv,
 		out_priv = netdev_priv(encap_dev);
 		rpriv = out_priv->ppriv;
 		attr->out_rep = rpriv->rep;
+		attr->out_mdev = out_priv->mdev;
 	}
 
 	err = mlx5_eswitch_add_vlan_action(esw, attr);
@@ -2501,6 +2502,7 @@ static int parse_tc_fdb_actions(struct mlx5e_priv *priv, struct tcf_exts *exts,
 				out_priv = netdev_priv(out_dev);
 				rpriv = out_priv->ppriv;
 				attr->out_rep = rpriv->rep;
+				attr->out_mdev = out_priv->mdev;
 			} else if (encap) {
 				parse_attr->mirred_ifindex = out_dev->ifindex;
 				parse_attr->tun_info = *info;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index 4cd773fa55e3..ac5db54823a1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -230,6 +230,7 @@ enum {
 struct mlx5_esw_flow_attr {
 	struct mlx5_eswitch_rep *in_rep;
 	struct mlx5_eswitch_rep *out_rep;
+	struct mlx5_core_dev *out_mdev;
 
 	int action;
 	__be16 vlan_proto;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 90c8cb31e633..ea93867d1ab4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -72,6 +72,11 @@ mlx5_eswitch_add_offloaded_rule(struct mlx5_eswitch *esw,
 	if (flow_act.action & MLX5_FLOW_CONTEXT_ACTION_FWD_DEST) {
 		dest[i].type = MLX5_FLOW_DESTINATION_TYPE_VPORT;
 		dest[i].vport.num = attr->out_rep->vport;
+		if (MLX5_CAP_ESW(esw->dev, merged_eswitch)) {
+			dest[i].vport.vhca_id =
+				MLX5_CAP_GEN(attr->out_mdev, vhca_id);
+			dest[i].vport.vhca_id_valid = 1;
+		}
 		i++;
 	}
 	if (flow_act.action & MLX5_FLOW_CONTEXT_ACTION_COUNT) {
-- 
2.17.0
[for-next 13/15] net/mlx5e: Use shared table for offloaded TC eswitch flows
From: Or Gerlitz

Currently, each representor netdev uses its own hash table to keep the
mapping from TC flow (f->cookie) to the driver offloaded instance. The
table is the one which originally was added for offloading TC NIC (not
eswitch) rules.

This scheme breaks when the core TC code calls us to add the same flow
twice (e.g. under the egdev use case), since we don't spot that and offload
a 2nd flow into the HW with the wrong source vport.

As a pre-step to solve that, we move to use a single table which keeps all
offloaded TC eswitch flows. The table is located at the eswitch uplink
representor object.

Signed-off-by: Or Gerlitz
Signed-off-by: Jiri Pirko
Reviewed-by: Paul Blakey
Signed-off-by: Saeed Mahameed
---
 .../net/ethernet/mellanox/mlx5/core/en_main.c | 4 +--
 .../net/ethernet/mellanox/mlx5/core/en_rep.c | 19 ++--
 .../net/ethernet/mellanox/mlx5/core/en_rep.h | 1 +
 .../net/ethernet/mellanox/mlx5/core/en_tc.c | 29 +++
 .../net/ethernet/mellanox/mlx5/core/en_tc.h | 11 ---
 5 files changed, 43 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 27e8375a476b..b5a7580b12fe 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -4462,7 +4462,7 @@ static int mlx5e_init_nic_rx(struct mlx5e_priv *priv)
 		goto err_destroy_direct_tirs;
 	}
 
-	err = mlx5e_tc_init(priv);
+	err = mlx5e_tc_nic_init(priv);
 	if (err)
 		goto err_destroy_flow_steering;
 
@@ -4483,7 +4483,7 @@ static int mlx5e_init_nic_rx(struct mlx5e_priv *priv)
 
 static void mlx5e_cleanup_nic_rx(struct mlx5e_priv *priv)
 {
-	mlx5e_tc_cleanup(priv);
+	mlx5e_tc_nic_cleanup(priv);
 	mlx5e_destroy_flow_steering(priv);
 	mlx5e_destroy_direct_tirs(priv);
 	mlx5e_destroy_indirect_tirs(priv);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index 182b636552a6..aa32592a54cb 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -981,14 +981,8 @@ static int mlx5e_init_rep_rx(struct mlx5e_priv *priv)
 	}
 	rpriv->vport_rx_rule = flow_rule;
 
-	err = mlx5e_tc_init(priv);
-	if (err)
-		goto err_del_flow_rule;
-
 	return 0;
 
-err_del_flow_rule:
-	mlx5_del_flow_rules(rpriv->vport_rx_rule);
 err_destroy_direct_tirs:
 	mlx5e_destroy_direct_tirs(priv);
 err_destroy_direct_rqts:
@@ -1000,7 +994,6 @@ static void mlx5e_cleanup_rep_rx(struct mlx5e_priv *priv)
 {
 	struct mlx5e_rep_priv *rpriv = priv->ppriv;
 
-	mlx5e_tc_cleanup(priv);
 	mlx5_del_flow_rules(rpriv->vport_rx_rule);
 	mlx5e_destroy_direct_tirs(priv);
 	mlx5e_destroy_direct_rqts(priv);
@@ -1058,8 +1051,15 @@ mlx5e_nic_rep_load(struct mlx5_core_dev *dev, struct mlx5_eswitch_rep *rep)
 	if (err)
 		goto err_remove_sqs;
 
+	/* init shared tc flow table */
+	err = mlx5e_tc_esw_init(&rpriv->tc_ht);
+	if (err)
+		goto err_neigh_cleanup;
+
 	return 0;
 
+err_neigh_cleanup:
+	mlx5e_rep_neigh_cleanup(rpriv);
 err_remove_sqs:
 	mlx5e_remove_sqs_fwd_rules(priv);
 	return err;
@@ -1074,9 +1074,8 @@ mlx5e_nic_rep_unload(struct mlx5_eswitch_rep *rep)
 	if (test_bit(MLX5E_STATE_OPENED, &priv->state))
 		mlx5e_remove_sqs_fwd_rules(priv);
 
-	/* clean (and re-init) existing uplink offloaded TC rules */
-	mlx5e_tc_cleanup(priv);
-	mlx5e_tc_init(priv);
+	/* clean uplink offloaded TC rules, delete shared tc flow table */
+	mlx5e_tc_esw_cleanup(&rpriv->tc_ht);
 
 	mlx5e_rep_neigh_cleanup(rpriv);
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.h b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.h
index b9b481f2833a..844d32d5c29f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.h
@@ -59,6 +59,7 @@ struct mlx5e_rep_priv {
 	struct net_device *netdev;
 	struct mlx5_flow_handle *vport_rx_rule;
 	struct list_head vport_sqs_list;
+	struct rhashtable tc_ht; /* valid for uplink rep */
 };
 
 static inline
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 1c90586d7f58..05c90b4f8a31 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -76,6 +76,7 @@ enum {
 
 struct mlx5e_tc_flow {
 	struct rhash_head node;
+	struct mlx5e_priv *priv;
 	u64 cookie;
 	u8 flags;
 	struct mlx5_flow_handle *rule;
@@
[for-next 01/15] net/mlx5: Vport, Use 'kvfree()' for memory allocated by 'kvzalloc()'
From: Christophe JAILLET

When 'kvzalloc()' is used to allocate memory, 'kvfree()' must be used to
free it.

Fixes: 9efa75254593d ("net/mlx5_core: Introduce access functions to query vport RoCE fields")
Signed-off-by: Christophe JAILLET
Signed-off-by: Saeed Mahameed
---
 drivers/net/ethernet/mellanox/mlx5/core/vport.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/vport.c b/drivers/net/ethernet/mellanox/mlx5/core/vport.c
index 177e076b8d17..719cecb182c6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/vport.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/vport.c
@@ -511,7 +511,7 @@ int mlx5_query_nic_vport_system_image_guid(struct mlx5_core_dev *mdev,
 	*system_image_guid = MLX5_GET64(query_nic_vport_context_out, out,
 					nic_vport_context.system_image_guid);

-	kfree(out);
+	kvfree(out);

 	return 0;
 }
@@ -531,7 +531,7 @@ int mlx5_query_nic_vport_node_guid(struct mlx5_core_dev *mdev, u64 *node_guid)
 	*node_guid = MLX5_GET64(query_nic_vport_context_out, out,
 				nic_vport_context.node_guid);

-	kfree(out);
+	kvfree(out);

 	return 0;
 }
@@ -587,7 +587,7 @@ int mlx5_query_nic_vport_qkey_viol_cntr(struct mlx5_core_dev *mdev,
 	*qkey_viol_cntr = MLX5_GET(query_nic_vport_context_out, out,
 				   nic_vport_context.qkey_violation_counter);

-	kfree(out);
+	kvfree(out);

 	return 0;
 }
-- 
2.17.0
[for-next 02/15] net/mlx5: Eswitch, Use 'kvfree()' for memory allocated by 'kvzalloc()'
From: Christophe JAILLET

When 'kvzalloc()' is used to allocate memory, 'kvfree()' must be used to
free it.

Fixes: fed9ce22bf8ae ("net/mlx5: E-Switch, Add API to create vport rx rules")
Signed-off-by: Christophe JAILLET
Signed-off-by: Saeed Mahameed
---
 drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 35e256eb2f6e..b123f8a52ad8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -663,7 +663,7 @@ static int esw_create_vport_rx_group(struct mlx5_eswitch *esw)
 	esw->offloads.vport_rx_group = g;
 out:
-	kfree(flow_group_in);
+	kvfree(flow_group_in);
 	return err;
 }
-- 
2.17.0
[for-next 08/15] net/mlx5: Add source e-switch owner
From: Shahar Klein

The source e-switch owner allows a vport on one e-switch port to be
associated with a rule defined on the second port's e-switch.

The role of the source eswitch owner valid bit in the flow group is to
allow the firmware to fail driver attempts to wildcard the source eswitch
match field. If this bit is not set, the firmware ignores the source
eswitch owner field entirely.

Signed-off-by: Shahar Klein
Reviewed-by: Or Gerlitz
Reviewed-by: Roi Dayan
Signed-off-by: Saeed Mahameed
---
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 10 ++++++++++
 include/linux/mlx5/mlx5_ifc.h                     |  6 ++++--
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
index 5a80279b052a..b1a2ca0ff320 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
@@ -1372,6 +1372,8 @@ static int create_auto_flow_group(struct mlx5_flow_table *ft,
 	struct mlx5_core_dev *dev = get_dev(&ft->node);
 	int inlen = MLX5_ST_SZ_BYTES(create_flow_group_in);
 	void *match_criteria_addr;
+	u8 src_esw_owner_mask_on;
+	void *misc;
 	int err;
 	u32 *in;

@@ -1384,6 +1386,14 @@ static int create_auto_flow_group(struct mlx5_flow_table *ft,
 	MLX5_SET(create_flow_group_in, in, start_flow_index, fg->start_index);
 	MLX5_SET(create_flow_group_in, in, end_flow_index, fg->start_index +
 		 fg->max_ftes - 1);
+
+	misc = MLX5_ADDR_OF(fte_match_param, fg->mask.match_criteria,
+			    misc_parameters);
+	src_esw_owner_mask_on = !!MLX5_GET(fte_match_set_misc, misc,
+					 source_eswitch_owner_vhca_id);
+	MLX5_SET(create_flow_group_in, in,
+		 source_eswitch_owner_vhca_id_valid, src_esw_owner_mask_on);
+
 	match_criteria_addr = MLX5_ADDR_OF(create_flow_group_in, in,
 					   match_criteria);
 	memcpy(match_criteria_addr, fg->mask.match_criteria,
diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index 3d17709bc30c..9c3538f1b8b9 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -412,7 +412,7 @@ struct mlx5_ifc_fte_match_set_misc_bits {
 	u8         reserved_at_0[0x8];
 	u8         source_sqn[0x18];

-	u8         reserved_at_20[0x10];
+	u8         source_eswitch_owner_vhca_id[0x10];
 	u8         source_port[0x10];

 	u8         outer_second_prio[0x3];
@@ -6995,7 +6995,9 @@ struct mlx5_ifc_create_flow_group_in_bits {
 	u8         reserved_at_a0[0x8];
 	u8         table_id[0x18];

-	u8         reserved_at_c0[0x20];
+	u8         source_eswitch_owner_vhca_id_valid[0x1];
+
+	u8         reserved_at_c1[0x1f];

 	u8         start_flow_index[0x20];

-- 
2.17.0
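The valid-bit handling in this patch can be illustrated outside the kernel. The following is only a minimal sketch in plain C: the two structs and the function name are hypothetical stand-ins, not the real mlx5_ifc layouts or the MLX5_GET/MLX5_SET accessors. It shows how any non-zero owner mask is collapsed to a single valid bit.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the firmware layouts; not the real mlx5_ifc structs. */
struct misc_mask_sketch {
	uint16_t source_eswitch_owner_vhca_id; /* match mask for the owner field */
	uint16_t source_port;
};

struct flow_group_in_sketch {
	uint8_t source_eswitch_owner_vhca_id_valid; /* a 1-bit field in the real layout */
};

/* Set the valid bit only when the group mask actually matches on the owner. */
void set_src_owner_valid(struct flow_group_in_sketch *in,
			 const struct misc_mask_sketch *mask)
{
	/* !! collapses any non-zero mask to exactly 1 so it fits a 1-bit field */
	in->source_eswitch_owner_vhca_id_valid =
		!!mask->source_eswitch_owner_vhca_id;
}
```

With a mask of 0xffff the flag becomes 1; with a mask of 0 it stays 0, which per the commit text means the firmware ignores the owner field.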
[for-next 06/15] net/mlx5: Add destination e-switch owner
From: Shahar Klein

The destination e-switch owner allows a rule in namespace of one
e-switch owner to point to a vport that is natively associated with
another e-switch owner.

Signed-off-by: Shahar Klein
Reviewed-by: Or Gerlitz
Reviewed-by: Roi Dayan
Signed-off-by: Saeed Mahameed
---
 .../net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.c | 2 +-
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.c         | 2 +-
 .../net/ethernet/mellanox/mlx5/core/eswitch_offloads.c    | 6 +++---
 drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c          | 8 +++++++-
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c         | 2 +-
 include/linux/mlx5/fs.h                                   | 6 +++++-
 include/linux/mlx5/mlx5_ifc.h                             | 5 +++--
 7 files changed, 21 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.c b/drivers/net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.c
index d93ff567b40d..b3820a34e773 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.c
@@ -235,7 +235,7 @@ const char *parse_fs_dst(struct trace_seq *p,

 	switch (dst->type) {
 	case MLX5_FLOW_DESTINATION_TYPE_VPORT:
-		trace_seq_printf(p, "vport=%u\n", dst->vport_num);
+		trace_seq_printf(p, "vport=%u\n", dst->vport.num);
 		break;
 	case MLX5_FLOW_DESTINATION_TYPE_FLOW_TABLE:
 		trace_seq_printf(p, "ft=%p\n", dst->ft);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index 332bc56306bf..9a24314b817a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -192,7 +192,7 @@ __esw_fdb_set_vport_rule(struct mlx5_eswitch *esw, u32 vport, bool rx_rule,
 	}

 	dest.type = MLX5_FLOW_DESTINATION_TYPE_VPORT;
-	dest.vport_num = vport;
+	dest.vport.num = vport;

 	esw_debug(esw->dev,
 		  "\tFDB add rule dmac_v(%pM) dmac_c(%pM) -> vport(%d)\n",
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index b123f8a52ad8..90c8cb31e633 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -71,7 +71,7 @@ mlx5_eswitch_add_offloaded_rule(struct mlx5_eswitch *esw,

 	if (flow_act.action & MLX5_FLOW_CONTEXT_ACTION_FWD_DEST) {
 		dest[i].type = MLX5_FLOW_DESTINATION_TYPE_VPORT;
-		dest[i].vport_num = attr->out_rep->vport;
+		dest[i].vport.num = attr->out_rep->vport;
 		i++;
 	}
 	if (flow_act.action & MLX5_FLOW_CONTEXT_ACTION_COUNT) {
@@ -343,7 +343,7 @@ mlx5_eswitch_add_send_to_vport_rule(struct mlx5_eswitch *esw, int vport, u32 sqn

 	spec->match_criteria_enable = MLX5_MATCH_MISC_PARAMETERS;
 	dest.type = MLX5_FLOW_DESTINATION_TYPE_VPORT;
-	dest.vport_num = vport;
+	dest.vport.num = vport;
 	flow_act.action = MLX5_FLOW_CONTEXT_ACTION_FWD_DEST;

 	flow_rule = mlx5_add_flow_rules(esw->fdb_table.offloads.fdb, spec,
@@ -387,7 +387,7 @@ static int esw_add_fdb_miss_rule(struct mlx5_eswitch *esw)
 	dmac_c[0] = 0x01;

 	dest.type = MLX5_FLOW_DESTINATION_TYPE_VPORT;
-	dest.vport_num = 0;
+	dest.vport.num = 0;
 	flow_act.action = MLX5_FLOW_CONTEXT_ACTION_FWD_DEST;

 	flow_rule = mlx5_add_flow_rules(esw->fdb_table.offloads.fdb, spec,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
index 0bfce6a82c91..5a00deff5457 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
@@ -374,7 +374,13 @@ static int mlx5_cmd_set_fte(struct mlx5_core_dev *dev,
 				id = dst->dest_attr.ft->id;
 			} else if (dst->dest_attr.type ==
 				   MLX5_FLOW_DESTINATION_TYPE_VPORT) {
-				id = dst->dest_attr.vport_num;
+				id = dst->dest_attr.vport.num;
+				MLX5_SET(dest_format_struct, in_dests,
+					 destination_eswitch_owner_vhca_id_valid,
+					 dst->dest_attr.vport.vhca_id_valid);
+				MLX5_SET(dest_format_struct, in_dests,
+					 destination_eswitch_owner_vhca_id,
+					 dst->dest_attr.vport.vhca_id);
 			} else {
 				id = dst->dest_attr.tir_num;
 			}
diff --git
[for-next 10/15] net/mlx5e: Offload TC eswitch rules for VFs belonging to different PFs
From: Rabie Loulou

When the merged eswitch capability is supported, allow offloading rules
between VFs which belong to different PFs (and hence have different
eswitch affinity).

Signed-off-by: Rabie Loulou
Reviewed-by: Or Gerlitz
Reviewed-by: Roi Dayan
Reviewed-by: Shahar Klein
Signed-off-by: Saeed Mahameed
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 630dd6dcabb9..77c3f8b8ae96 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -2077,6 +2077,20 @@ static int mlx5e_route_lookup_ipv4(struct mlx5e_priv *priv,
 	return 0;
 }

+static bool is_merged_eswitch_dev(struct mlx5e_priv *priv,
+				  struct net_device *peer_netdev)
+{
+	struct mlx5e_priv *peer_priv;
+
+	peer_priv = netdev_priv(peer_netdev);
+
+	return (MLX5_CAP_ESW(priv->mdev, merged_eswitch) &&
+		(priv->netdev->netdev_ops == peer_netdev->netdev_ops) &&
+		same_hw_devs(priv, peer_priv) &&
+		MLX5_VPORT_MANAGER(peer_priv->mdev) &&
+		(peer_priv->mdev->priv.eswitch->mode == SRIOV_OFFLOADS));
+}
+
 static int mlx5e_route_lookup_ipv6(struct mlx5e_priv *priv,
 				   struct net_device *mirred_dev,
 				   struct net_device **out_dev,
@@ -2535,7 +2549,8 @@ static int parse_tc_fdb_actions(struct mlx5e_priv *priv, struct tcf_exts *exts,
 			out_dev = tcf_mirred_dev(a);

 			if (switchdev_port_same_parent_id(priv->netdev,
-							  out_dev)) {
+							  out_dev) ||
+			    is_merged_eswitch_dev(priv, out_dev)) {
 				action |= MLX5_FLOW_CONTEXT_ACTION_FWD_DEST |
 					  MLX5_FLOW_CONTEXT_ACTION_COUNT;
 				out_priv = netdev_priv(out_dev);
-- 
2.17.0
RE: [PATCH net-next v3 2/3] net: ethernet: freescale: Allow FEC with COMPILE_TEST
From: Florian Fainelli
Sent: May 18, 2018 4:08
> The Freescale FEC driver builds fine with COMPILE_TEST, so make that
> possible.
>
> Signed-off-by: Florian Fainelli

Acked-by: Fugang Duan

> ---
>  drivers/net/ethernet/freescale/Kconfig    | 2 +-
>  drivers/net/ethernet/freescale/fec.h      | 2 +-
>  drivers/net/ethernet/freescale/fec_main.c | 2 +-
>  3 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/ethernet/freescale/Kconfig
> b/drivers/net/ethernet/freescale/Kconfig
> index 6e490fd2345d..a580a3dcbe59 100644
> --- a/drivers/net/ethernet/freescale/Kconfig
> +++ b/drivers/net/ethernet/freescale/Kconfig
> @@ -22,7 +22,7 @@ if NET_VENDOR_FREESCALE
>  config FEC
>  	tristate "FEC ethernet controller (of ColdFire and some i.MX CPUs)"
>  	depends on (M523x || M527x || M5272 || M528x || M520x ||
> M532x || \
> -		   ARCH_MXC || SOC_IMX28)
> +		   ARCH_MXC || SOC_IMX28 || COMPILE_TEST)
>  	default ARCH_MXC || SOC_IMX28 if ARM
>  	select PHYLIB
>  	imply PTP_1588_CLOCK
> diff --git a/drivers/net/ethernet/freescale/fec.h
> b/drivers/net/ethernet/freescale/fec.h
> index e7381f8ef89d..4778b663653e 100644
> --- a/drivers/net/ethernet/freescale/fec.h
> +++ b/drivers/net/ethernet/freescale/fec.h
> @@ -21,7 +21,7 @@
>
>  #if defined(CONFIG_M523x) || defined(CONFIG_M527x) ||
> defined(CONFIG_M528x) || \
>      defined(CONFIG_M520x) || defined(CONFIG_M532x) ||
> defined(CONFIG_ARM) || \
> -    defined(CONFIG_ARM64)
> +    defined(CONFIG_ARM64) || defined(CONFIG_COMPILE_TEST)
>  /*
>   *	Just figures, Motorola would have to change the offsets for
>   *	registers in the same peripheral device on different models
> diff --git a/drivers/net/ethernet/freescale/fec_main.c
> b/drivers/net/ethernet/freescale/fec_main.c
> index f3e43db0d6cb..4358f586e28f 100644
> --- a/drivers/net/ethernet/freescale/fec_main.c
> +++ b/drivers/net/ethernet/freescale/fec_main.c
> @@ -2107,7 +2107,7 @@ static int fec_enet_get_regs_len(struct
> net_device *ndev)
>  /* List of registers that can be safety be read to dump them with ethtool
> */
>  #if defined(CONFIG_M523x) || defined(CONFIG_M527x) ||
> defined(CONFIG_M528x) || \
>  	defined(CONFIG_M520x) || defined(CONFIG_M532x) ||
> defined(CONFIG_ARM) || \
> -	defined(CONFIG_ARM64)
> +	defined(CONFIG_ARM64) || defined(CONFIG_COMPILE_TEST)
>  static u32 fec_enet_register_offset[] = {
>  	FEC_IEVENT, FEC_IMASK, FEC_R_DES_ACTIVE_0,
> FEC_X_DES_ACTIVE_0,
>  	FEC_ECNTRL, FEC_MII_DATA, FEC_MII_SPEED, FEC_MIB_CTRLSTAT,
> FEC_R_CNTRL,
> --
> 2.14.1
Re: [PATCH net-next ] net: mscc: Add SPDX identifier
On Thu, 2018-05-17 at 21:39 +0200, Alexandre Belloni wrote:
> On 17/05/2018 12:28:59-0700, Joe Perches wrote:
> > On Thu, 2018-05-17 at 21:23 +0200, Alexandre Belloni wrote:
> > > ocelot_qsys.h is missing the SPDX identifier, fix that.
> > >
> > > Signed-off-by: Alexandre Belloni
> >
> > Only the copyright holders should ideally be modifying
> > these and also removing other license content.
> >
> > For instance, what's the real intent here?
> >
>
> Well, if you have a look, I submitted that file this cycle and it is the
> only one that doesn't have the proper SPDX identifier. This is a mistake
> I'm fixing.

Just because you submitted it does not mean you are the
copyright holder.

> > > diff --git a/drivers/net/ethernet/mscc/ocelot_qsys.h
> > > b/drivers/net/ethernet/mscc/ocelot_qsys.h
> >
> > []
> > > @@ -1,7 +1,7 @@
> > > +/* SPDX-License-Identifier: (GPL-2.0 OR MIT) */
> >
> > GPL 2.0+ or 2.0?
> >
>
> 2.0

How do you know that?
Re: [PATCH bpf-next 3/3] bpf: Add mtu checking to FIB forwarding helper
On 5/17/18 4:22 PM, Daniel Borkmann wrote:
> On 05/17/2018 06:09 PM, David Ahern wrote:
>> Add check that egress MTU can handle packet to be forwarded. If
>> the MTU is less than the packet length, return 0 meaning the
>> packet is expected to continue up the stack for help - eg.,
>> fragmenting the packet or sending an ICMP.
>>
>> Signed-off-by: David Ahern
>> ---
>>  net/core/filter.c | 10 ++++++++++
>>  1 file changed, 10 insertions(+)
>>
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index 6d0d1560bd70..c47c47a75d4b 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -4098,6 +4098,7 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
>>  	struct fib_nh *nh;
>>  	struct flowi4 fl4;
>>  	int err;
>> +	u32 mtu;
>>
>>  	dev = dev_get_by_index_rcu(net, params->ifindex);
>>  	if (unlikely(!dev))
>> @@ -4149,6 +4150,10 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
>>  	if (res.fi->fib_nhs > 1)
>>  		fib_select_path(net, &res, &fl4, NULL);
>>
>> +	mtu = ip_mtu_from_fib_result(&res, params->ipv4_dst);
>> +	if (params->tot_len > mtu)
>> +		return 0;
>> +
>>  	nh = &res.fi->fib_nh[res.nh_sel];
>>
>>  	/* do not handle lwt encaps right now */
>> @@ -4188,6 +4193,7 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
>>  	struct flowi6 fl6;
>>  	int strict = 0;
>>  	int oif;
>> +	u32 mtu;
>>
>>  	/* link local addresses are never forwarded */
>>  	if (rt6_need_strict(dst) || rt6_need_strict(src))
>> @@ -4250,6 +4256,10 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
>>  				       fl6.flowi6_oif, NULL,
>>  				       strict);
>>
>> +	mtu = ip6_mtu_from_fib6(f6i, dst, src);
>> +	if (params->tot_len > mtu)
>> +		return 0;
>> +
>>  	if (f6i->fib6_nh.nh_lwtstate)
>>  		return 0;
>
> Could you elaborate how this interacts in tc BPF use case where you have e.g.
> GSO packets and tot_len from aggregated packets would definitely be larger
> than MTU (e.g. see is_skb_forwardable() as one example on such checks)? Should
> this be an opt-in via a new flag for the helper?

It should not be opt-in for XDP. I could add a flag to the internal call --
bpf_skb_fib_lookup sets the flag to skip the MTU check in
bpf_ipv4_fib_lookup and bpf_ipv6_fib_lookup.

For the skb case do you want bpf_skb_fib_lookup to call is_skb_forwardable
or leave that to the BPF program?
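The internal flag floated in the reply above could look roughly like the sketch below. This is an assumption-laden illustration in plain C, not code from the patch: the function name, the return-code enum and the `check_mtu` parameter are all hypothetical. The only behaviour taken from the thread is that a packet longer than the egress MTU returns 0 (continue up the stack) and that an skb caller might skip the check.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical return codes for this sketch only. */
enum { FIB_SKETCH_PASS_TO_STACK = 0, FIB_SKETCH_FORWARD = 1 };

/*
 * check_mtu models the internal flag discussed in the reply: an XDP caller
 * would pass true, an skb caller could pass false and validate the skb
 * itself (for example because GSO packets exceed the MTU by design).
 */
int fib_mtu_gate(uint32_t tot_len, uint32_t mtu, bool check_mtu)
{
	if (check_mtu && tot_len > mtu)
		return FIB_SKETCH_PASS_TO_STACK;

	return FIB_SKETCH_FORWARD;
}
```

Keeping the decision in one small gate like this would let both the IPv4 and IPv6 lookup paths share the same rule; whether the real helper is structured that way is not stated in the thread.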
pull-request: bpf 2018-05-18
Hi David,

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix two bugs in sockmap, a use after free in sockmap's error path
   from sock_map_ctx_update_elem() where we mistakenly drop a reference
   we didn't take prior to that, and in the same function fix a race
   in bpf_prog_inc_not_zero() where we didn't use the progs from prior
   READ_ONCE(), from John.

2) Reject program expansions once we figure out that their jump target
   which crosses patchlet boundaries could otherwise get truncated in
   insn->off space, from Daniel.

3) Check the return value of fopen() in BPF selftest's test_verifier
   where we determine whether unpriv BPF is disabled, and iff we do
   fail there then just assume it is disabled. This fixes a segfault
   when used with older kernels, from Jesper.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

When this gets later merged into net-next there are two trivial BPF
conflicts to resolve:

In kernel/bpf/sockmap.c the bpf_prog_inc_not_zero() cases must use verdict,
parse and tx_msg as their arguments as opposed to the buggy old version
where progs->bpf_{verdict,parse,tx_msg} were used as passed args.

In tools/lib/bpf/libbpf.c use the hunk from net-next with the
__bpf_object__open() + IS_ERR(obj) test combination. Thus, net-next code
only is sufficient here.

Thanks a lot!

The following changes since commit 02f99df1875c11330cd0be69a40fa8ccd14749b2:

  erspan: fix invalid erspan version. (2018-05-17 15:48:49 -0400)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

for you to fetch changes up to 050fad7c4534c13c8eb1d9c2ba66012e014773cb:

  bpf: fix truncated jump targets on heavy expansions (2018-05-17 16:05:35 -0700)

Daniel Borkmann (1):
      bpf: fix truncated jump targets on heavy expansions

Jesper Dangaard Brouer (1):
      selftests/bpf: check return value of fopen in test_verifier.c

John Fastabend (2):
      bpf: sockmap update rollback on error can incorrectly dec prog refcnt
      bpf: parse and verdict prog attach may race with bpf map update

 kernel/bpf/core.c                           | 100 +---
 kernel/bpf/sockmap.c                        |  18 ++---
 net/core/filter.c                           |  11 ++-
 tools/testing/selftests/bpf/test_verifier.c |   5 ++
 4 files changed, 98 insertions(+), 36 deletions(-)
Re: [PATCH iproute2] ip link: Do not call ll_name_to_index when creating a new link
On 5/17/18 4:36 PM, Stephen Hemminger wrote:
> On Thu, 17 May 2018 16:22:37 -0600
> dsah...@kernel.org wrote:
>
>> From: David Ahern
>>
>> Using iproute2 to create a bridge and add 4094 vlans to it can take from
>> 2 to 3 *minutes*. The reason is the extraneous call to ll_name_to_index.
>> ll_name_to_index results in an ioctl(SIOCGIFINDEX) call which in turn
>> invokes dev_load. If the index does not exist, which it won't when
>> creating a new link, dev_load calls modprobe twice -- once for
>> netdev-NAME and again for NAME. This is unnecessary overhead for each
>> link create.
>>
>> When ip link is invoked for a new device, there is no reason to
>> call ll_name_to_index for the new device. With this patch, creating
>> a bridge and adding 4094 vlans takes less than 3 *seconds*.
>>
>> Signed-off-by: David Ahern
>
> Yes this looks like a real problem.
> Isn't the cache supposed to reduce this?
>
> Don't like to make lots of special case flags.
>

The device does not exist, so it won't be in any cache. ll_name_to_index
already checks it though before calling if_nametoindex.
[PATCH net v2] net: dsa: Do not register devlink for unused ports
Even if commit 1d27732f411d ("net: dsa: setup and teardown ports")
indicated that registering a devlink instance for unused ports is not a
problem, and this is true, this can be confusing nonetheless, so let's
not do it.

Fixes: 1d27732f411d ("net: dsa: setup and teardown ports")
Reported-by: Jiri Pirko
Signed-off-by: Florian Fainelli
---
 net/dsa/dsa2.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index adf50fbc4c13..47725250b4ca 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -258,11 +258,13 @@ static void dsa_tree_teardown_default_cpu(struct dsa_switch_tree *dst)
 static int dsa_port_setup(struct dsa_port *dp)
 {
 	struct dsa_switch *ds = dp->ds;
-	int err;
+	int err = 0;

 	memset(&dp->devlink_port, 0, sizeof(dp->devlink_port));

-	err = devlink_port_register(ds->devlink, &dp->devlink_port, dp->index);
+	if (dp->type != DSA_PORT_TYPE_UNUSED)
+		err = devlink_port_register(ds->devlink, &dp->devlink_port,
+					    dp->index);
 	if (err)
 		return err;

@@ -293,7 +295,8 @@ static int dsa_port_setup(struct dsa_port *dp)

 static void dsa_port_teardown(struct dsa_port *dp)
 {
-	devlink_port_unregister(&dp->devlink_port);
+	if (dp->type != DSA_PORT_TYPE_UNUSED)
+		devlink_port_unregister(&dp->devlink_port);

 	switch (dp->type) {
 	case DSA_PORT_TYPE_UNUSED:
-- 
2.14.1
[net-next:master 1230/1233] arch/mips/include/asm/io.h:422:1: note: in expansion of macro '__BUILD_MEMORY_SINGLE'
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git master
head:   538e2de104cfb4ef1acb35af42427bff42adbe4d
commit: 2652113ff043ca2ce1cb3be529b5ca9270c421d4 [1230/1233] net: ethernet: ti: Allow most drivers with COMPILE_TEST
config: mips-allyesconfig (attached as .config)
compiler: mips-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        git checkout 2652113ff043ca2ce1cb3be529b5ca9270c421d4
        # save the attached .config to linux build tree
        make.cross ARCH=mips

All warnings (new ones prefixed by >>):

   drivers/net//ethernet/ti/davinci_cpdma.c: In function 'cpdma_chan_submit':
   drivers/net//ethernet/ti/davinci_cpdma.c:1083:17: warning: passing argument 1 of 'writel' makes integer from pointer without a cast [-Wint-conversion]
     writel_relaxed(token, &desc->sw_token);
                    ^
   In file included from arch/mips/include/asm/page.h:194:0,
                    from include/linux/mmzone.h:21,
                    from include/linux/gfp.h:6,
                    from include/linux/idr.h:16,
                    from include/linux/kernfs.h:14,
                    from include/linux/sysfs.h:16,
                    from include/linux/kobject.h:20,
                    from include/linux/device.h:16,
                    from drivers/net//ethernet/ti/davinci_cpdma.c:17:
   arch/mips/include/asm/io.h:315:25: note: expected 'u32 {aka unsigned int}' but argument is of type 'void *'
    static inline void pfx##write##bwlq(type val,    \
                            ^
>> arch/mips/include/asm/io.h:422:1: note: in expansion of macro '__BUILD_MEMORY_SINGLE'
    __BUILD_MEMORY_SINGLE(bus, bwlq, type, 1)
    ^
>> arch/mips/include/asm/io.h:427:1: note: in expansion of macro '__BUILD_MEMORY_PFX'
    __BUILD_MEMORY_PFX(, bwlq, type)   \
    ^~
>> arch/mips/include/asm/io.h:432:1: note: in expansion of macro 'BUILDIO_MEM'
    BUILDIO_MEM(l, u32)
    ^~~
--
   drivers/net/ethernet/ti/davinci_cpdma.c: In function 'cpdma_chan_submit':
   drivers/net/ethernet/ti/davinci_cpdma.c:1083:17: warning: passing argument 1 of 'writel' makes integer from pointer without a cast [-Wint-conversion]
     writel_relaxed(token, &desc->sw_token);
                    ^
   In file included from arch/mips/include/asm/page.h:194:0,
                    from include/linux/mmzone.h:21,
                    from include/linux/gfp.h:6,
                    from include/linux/idr.h:16,
                    from include/linux/kernfs.h:14,
                    from include/linux/sysfs.h:16,
                    from include/linux/kobject.h:20,
                    from include/linux/device.h:16,
                    from drivers/net/ethernet/ti/davinci_cpdma.c:17:
   arch/mips/include/asm/io.h:315:25: note: expected 'u32 {aka unsigned int}' but argument is of type 'void *'
    static inline void pfx##write##bwlq(type val,    \
                            ^
>> arch/mips/include/asm/io.h:422:1: note: in expansion of macro '__BUILD_MEMORY_SINGLE'
    __BUILD_MEMORY_SINGLE(bus, bwlq, type, 1)
    ^
>> arch/mips/include/asm/io.h:427:1: note: in expansion of macro '__BUILD_MEMORY_PFX'
    __BUILD_MEMORY_PFX(, bwlq, type)   \
    ^~
>> arch/mips/include/asm/io.h:432:1: note: in expansion of macro 'BUILDIO_MEM'
    BUILDIO_MEM(l, u32)
    ^~~

vim +/__BUILD_MEMORY_SINGLE +422 arch/mips/include/asm/io.h

8faca49a6 arch/mips/include/asm/io.h David Daney    2008-12-11  312
^1da177e4 include/asm-mips/io.h      Linus Torvalds 2005-04-16  313  #define __BUILD_MEMORY_SINGLE(pfx, bwlq, type, irq) \
^1da177e4 include/asm-mips/io.h      Linus Torvalds 2005-04-16  314  \
^1da177e4 include/asm-mips/io.h      Linus Torvalds 2005-04-16 @315  static inline void pfx##write##bwlq(type val, \
^1da177e4 include/asm-mips/io.h      Linus Torvalds 2005-04-16  316  volatile void __iomem *mem) \
^1da177e4 include/asm-mips/io.h      Linus Torvalds 2005-04-16  317  { \
^1da177e4 include/asm-mips/io.h      Linus Torvalds 2005-04-16  318  volatile type *__mem; \
^1da177e4 include/asm-mips/io.h      Linus Torvalds 2005-04-16  319  type __val; \
^1da177e4 include/asm-mips/io.h      Linus Torvalds 2005-04-16  320  \
1e820da3c arch/mips/include/asm/io.h Huacai Chen    2016-03-03  321  war_io_reorder_wmb(); \
Re: [PATCH bpf-next 2/7] bpf: introduce bpf subcommand BPF_PERF_EVENT_QUERY
Hi Yonghong,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on bpf-next/master]

url:    https://github.com/0day-ci/linux/commits/Yonghong-Song/bpf-implement-BPF_PERF_EVENT_QUERY-for-perf-event-query/20180518-060508
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: i386-randconfig-x000-201819 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386

All warnings (new ones prefixed by >>):

   kernel/trace/trace_kprobe.c: In function 'bpf_get_kprobe_info':
>> kernel/trace/trace_kprobe.c:1315:17: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
      *probe_addr = (u64)tk->rp.kp.addr;
                    ^

vim +1315 kernel/trace/trace_kprobe.c

  1290
  1291	int bpf_get_kprobe_info(struct perf_event *event, u32 *prog_info,
  1292				const char **symbol, u64 *probe_offset,
  1293				u64 *probe_addr, bool perf_type_tracepoint)
  1294	{
  1295		const char *pevent = trace_event_name(event->tp_event);
  1296		const char *group = event->tp_event->class->system;
  1297		struct trace_kprobe *tk;
  1298
  1299		if (perf_type_tracepoint)
  1300			tk = find_trace_kprobe(pevent, group);
  1301		else
  1302			tk = event->tp_event->data;
  1303		if (!tk)
  1304			return -EINVAL;
  1305
  1306		*prog_info = trace_kprobe_is_return(tk) ? BPF_PERF_INFO_KRETPROBE
  1307							: BPF_PERF_INFO_KPROBE;
  1308		if (tk->symbol) {
  1309			*symbol = tk->symbol;
  1310			*probe_offset = tk->rp.kp.offset;
  1311			*probe_addr = 0;
  1312		} else {
  1313			*symbol = NULL;
  1314			*probe_offset = 0;
> 1315			*probe_addr = (u64)tk->rp.kp.addr;
  1316		}
  1317		return 0;
  1318	}
  1319	#endif	/* CONFIG_PERF_EVENTS */
  1320

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
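The warning on line 1315 appears because a pointer is 32 bits wide on i386 while the destination is 64 bits. A minimal user-space sketch of the usual remedy is to cast through a pointer-sized integer first and widen afterwards. The helper name `ptr_to_u64` is hypothetical; in kernel code a cast through `unsigned long` is commonly used for the same purpose, and whether the patch author chose that fix is not shown here.

```c
#include <stdint.h>

/* Widen a pointer to 64 bits without tripping -Wpointer-to-int-cast. */
uint64_t ptr_to_u64(const void *p)
{
	/*
	 * uintptr_t has pointer width, so the first cast preserves size;
	 * the second cast is an ordinary integer widening.
	 */
	return (uint64_t)(uintptr_t)p;
}
```

The two-step cast compiles cleanly on both 32-bit and 64-bit targets, which is why it suits code built for many architectures.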
[PATCH iproute2] tc: allow 0% for percent options
Allowing 0% is sometimes useful, for example in netem loss and drop, or
perhaps for dropping all traffic in an HTB bin.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199745
Reported-by: stuartmars...@gmail.com
Fixes: 927e3cfb52b5 ("tc: B.W limits can now be specified in %.")
Signed-off-by: Stephen Hemminger
---
 lib/utils.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/utils.c b/lib/utils.c
index 7b2c6dd19268..02ce67721915 100644
--- a/lib/utils.c
+++ b/lib/utils.c
@@ -105,7 +105,7 @@ int parse_percent(double *val, const char *str)
 	*val = strtod(str, &p) / 100.;
 	if (*val == HUGE_VALF || *val == HUGE_VALL)
 		return 1;
-	if (*val == 0.0 || (*p && strcmp(p, "%")))
+	if (*p && strcmp(p, "%"))
 		return -1;

 	return 0;
-- 
2.17.0
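To see what the change means for callers, here is a self-contained sketch that mirrors the post-patch body of parse_percent() under a different, hypothetical name (`parse_percent_sketch`). Only the visible hunk is reproduced; the rest of the real function is assumed to declare `p` as shown.

```c
#include <math.h>
#include <stdlib.h>
#include <string.h>

/*
 * Sketch of the post-patch behaviour: zero is now a valid result.
 * Returns 0 on success, 1 on overflow, -1 on trailing garbage.
 */
int parse_percent_sketch(double *val, const char *str)
{
	char *p;

	*val = strtod(str, &p) / 100.;
	if (*val == HUGE_VALF || *val == HUGE_VALL)
		return 1;
	/* anything after the number must be exactly "%" (or nothing) */
	if (*p && strcmp(p, "%"))
		return -1;

	return 0;
}
```

In this sketch "0%" and "50%" parse successfully and "10x" is rejected. One consequence worth noting is that, with the zero check gone, an input with no digits at all such as "%" also parses as 0, because strtod() then performs no conversion and returns 0.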
Re: [PATCH v3 net-next 3/6] tcp: add SACK compression
Eric Dumazet writes:

> When TCP receives an out-of-order packet, it immediately sends
> a SACK packet, generating network load but also forcing the
> receiver to send 1-MSS pathological packets, increasing its
> RTX queue length/depth, and thus processing time.
>
> Wifi networks suffer from this aggressive behavior, but generally
> speaking, all these SACK packets add fuel to the fire when networks
> are under congestion.
>
> This patch adds a high resolution timer and tp->compressed_ack counter.
>
> Instead of sending a SACK, we program this timer with a small delay,
> based on RTT and capped to 1 ms :
>
> 	delay = min ( 5 % of RTT, 1 ms)
>
> If subsequent SACKs need to be sent while the timer has not yet
> expired, we simply increment tp->compressed_ack.
>
> When timer expires, a SACK is sent with the latest information.
> Whenever an ACK is sent (if data is sent, or if in-order
> data is received) timer is canceled.
>
> Note that tcp_sack_new_ofo_skb() is able to force a SACK to be sent
> if the sack blocks need to be shuffled, even if the timer has not
> expired.
>
> A new SNMP counter is added in the following patch.
>
> Two other patches add sysctls to allow changing the 1,000,000 and 44
> values that this commit hard-coded.
>
> Signed-off-by: Eric Dumazet

Acked-by: Toke Høiland-Jørgensen
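The delay rule quoted above, min(5% of RTT, 1 ms), can be worked through with a small sketch. The function and macro names are hypothetical, and the RTT is assumed to be given in plain microseconds; the kernel's own RTT bookkeeping may use different units and scaling.

```c
#include <stdint.h>

#define NSEC_PER_USEC_SKETCH 1000ULL
#define SACK_DELAY_CAP_NS    1000000ULL /* the 1 ms cap from the commit text */

/* delay = min(5% of RTT, 1 ms), with the RTT given in microseconds. */
uint64_t sack_compress_delay_ns(uint32_t rtt_us)
{
	/* dividing by 20 takes 5% of the RTT, expressed in nanoseconds */
	uint64_t five_percent_ns = (uint64_t)rtt_us * NSEC_PER_USEC_SKETCH / 20;

	return five_percent_ns < SACK_DELAY_CAP_NS ? five_percent_ns
						   : SACK_DELAY_CAP_NS;
}
```

For a 10 ms RTT this gives 500,000 ns; any RTT of 20 ms or more hits the 1,000,000 ns cap, which matches the hard-coded value the commit message says a later patch turns into a sysctl.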
Re: [PATCH] ath10k: transmit queued frames after waking queues
On Thu, 17 May 2018 at 16:16, Niklas Cassel wrote:

> diff --git a/drivers/net/wireless/ath/ath10k/txrx.c b/drivers/net/wireless/ath/ath10k/txrx.c
> index cda164f6e9f6..1d3b2d2c3fee 100644
> --- a/drivers/net/wireless/ath/ath10k/txrx.c
> +++ b/drivers/net/wireless/ath/ath10k/txrx.c
> @@ -95,6 +95,9 @@ int ath10k_txrx_tx_unref(struct ath10k_htt *htt,
>  		wake_up(&htt->empty_tx_wq);
>  	spin_unlock_bh(&htt->tx_lock);
>
> +	if (htt->num_pending_tx <= 3 && !list_empty(&ar->txqs))
> +		ath10k_mac_tx_push_pending(ar);
> +

Just sanity checking - what's protecting htt->num_pending_tx? or is it
serialised some other way?

>  	dma_unmap_single(dev, skb_cb->paddr, msdu->len, DMA_TO_DEVICE);
>
>  	ath10k_report_offchan_tx(htt->ar, msdu);
> --
> 2.17.0
Regression bisected to: softirq: Let ksoftirqd do its job
One of my out-of-tree patches is a network impairment tool that acts a lot
like an Ethernet bridge with latency, jitter, etc.

We noticed recently that we were seeing igb adapter errors when testing
with our emulator at high speeds. For whatever reason, it is only easily
reproduced when we add jitter to our emulator. This would cause a bit more
CPU usage and lock contention in our software, and would increase the skb
pkts allocated at any given time.

I bisected the problem to the commit below:

Author: Eric Dumazet
Date:   Wed Aug 31 10:42:29 2016 -0700

    softirq: Let ksoftirqd do its job

    A while back, Paolo and Hannes sent an RFC patch adding threaded-able
    napi poll loop support : (https://patchwork.ozlabs.org/patch/620657/)

If I replace my emulator with a bridge, then I do not see the problem.
But, I also do not (or very rarely?) see the problem when configuring the
emulator with zero latency and jitter, which is how the bridge would act.

Any idea what sort of (bad?) behaviour would be able to cause this tx q
timeout?

If you have any interest, I will be happy to email you my out-of-tree
patches and instructions to reproduce the problem.

The kernel splat looks like this, and repeats often:

May 17 16:03:09 localhost.localdomain kernel: audit: type=1131 audit(1526598189.492:159): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-hostnamed comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 17 16:03:39 localhost.localdomain kernel: ------------[ cut here ]------------
May 17 16:03:39 localhost.localdomain kernel: WARNING: CPU: 5 PID: 0 at /home/greearb/git/linux-bisect/net/sched/sch_generic.c:316 dev_watchdog+0x234/0x240
May 17 16:03:39 localhost.localdomain kernel: NETDEV WATCHDOG: eth5 (igb): transmit queue 0 timed out
May 17 16:03:39 localhost.localdomain kernel: Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 fuse macvlan wanlink(O) pktgen cfg80211 sunrpc coretemp intel_rapl x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm irqbypass ipmi_ssif iTCO_wdt iTCO_vendor_support joydev i2c_i801 lpc_ich i2c_smbus ioatdma shpchp wmi ipmi_si ipmi_msghandler tpm_tis tpm_tis_core tpm acpi_power_meter acpi_pad sch_fq_codel ast drm_kms_helper ttm drm igb hwmon ptp pps_core dca i2c_algo_bit i2c_core fjes ipv6 crc_ccitt [last unloaded: nf_conntrack]
May 17 16:03:39 localhost.localdomain kernel: CPU: 5 PID: 0 Comm: swapper/5 Tainted: G           O    4.8.0-rc7+ #132
May 17 16:03:39 localhost.localdomain kernel: Hardware name: Iron_Systems,Inc CS-CAD-2U-A02/X10SRL-F, BIOS 2.0b 05/02/2017
May 17 16:03:39 localhost.localdomain kernel:  88087fd43d78 81417eb1 88087fd43dc8
May 17 16:03:39 localhost.localdomain kernel:  88087fd43db8 81103556 013c7fd43da8
May 17 16:03:39 localhost.localdomain kernel:  880854221940 0005 880854bb8000
May 17 16:03:39 localhost.localdomain kernel: Call Trace:
May 17 16:03:39 localhost.localdomain kernel:  [] dump_stack+0x63/0x82
May 17 16:03:39 localhost.localdomain kernel:  [] __warn+0xc6/0xe0
May 17 16:03:39 localhost.localdomain kernel:  [] warn_slowpath_fmt+0x4a/0x50
May 17 16:03:39 localhost.localdomain kernel:  [] dev_watchdog+0x234/0x240
May 17 16:03:39 localhost.localdomain kernel:  [] ? qdisc_rcu_free+0x40/0x40
May 17 16:03:39 localhost.localdomain kernel:  [] call_timer_fn+0x30/0x150
May 17 16:03:39 localhost.localdomain kernel:  [] ? qdisc_rcu_free+0x40/0x40
May 17 16:03:39 localhost.localdomain kernel:  [] run_timer_softirq+0x1ea/0x450
May 17 16:03:39 localhost.localdomain kernel:  [] ? ktime_get+0x37/0xa0
May 17 16:03:39 localhost.localdomain kernel:  [] ? lapic_next_deadline+0x21/0x30
May 17 16:03:39 localhost.localdomain kernel:  [] ? clockevents_program_event+0x7d/0x120
May 17 16:03:39 localhost.localdomain kernel:  [] __do_softirq+0xca/0x2d0
May 17 16:03:39 localhost.localdomain kernel:  [] irq_exit+0xb3/0xc0
May 17 16:03:39 localhost.localdomain kernel:  [] smp_apic_timer_interrupt+0x3d/0x50
May 17 16:03:39 localhost.localdomain kernel:  [] apic_timer_interrupt+0x82/0x90
May 17 16:03:39 localhost.localdomain kernel:  [] ? cpuidle_enter_state+0x126/0x300
May 17 16:03:39 localhost.localdomain kernel:  [] cpuidle_enter+0x12/0x20
May 17 16:03:39 localhost.localdomain kernel:  [] call_cpuidle+0x25/0x40
May 17 16:03:39 localhost.localdomain kernel:  [] cpu_startup_entry+0x2ba/0x380
May 17 16:03:39 localhost.localdomain kernel:  [] start_secondary+0x149/0x170
May 17 16:03:39 localhost.localdomain kernel: ---[ end trace f62c6dd947785e8f ]---

Thanks,
Ben

-- 
Ben Greear
Candela Technologies Inc  http://www.candelatech.com
[PATCH] ath10k: transmit queued frames after waking queues
The following problem was observed when running iperf:

[ 3] 0.0- 1.0 sec 2.00 MBytes 16.8 Mbits/sec
[ 3] 1.0- 2.0 sec 3.12 MBytes 26.2 Mbits/sec
[ 3] 2.0- 3.0 sec 3.25 MBytes 27.3 Mbits/sec
[ 3] 3.0- 4.0 sec 655 KBytes 5.36 Mbits/sec
[ 3] 4.0- 5.0 sec 0.00 Bytes 0.00 bits/sec
[ 3] 5.0- 6.0 sec 0.00 Bytes 0.00 bits/sec
[ 3] 6.0- 7.0 sec 0.00 Bytes 0.00 bits/sec
[ 3] 7.0- 8.0 sec 0.00 Bytes 0.00 bits/sec
[ 3] 8.0- 9.0 sec 0.00 Bytes 0.00 bits/sec
[ 3] 9.0-10.0 sec 0.00 Bytes 0.00 bits/sec
[ 3] 0.0-10.3 sec 9.01 MBytes 7.32 Mbits/sec

There are frames in the ieee80211_txq and there are frames that have been removed from this queue, but haven't yet been sent on the wire (num_pending_tx).

When num_pending_tx reaches max_num_pending_tx, we will stop the queues by calling ieee80211_stop_queues().

As frames that have previously been sent for transmission (num_pending_tx) are completed, we will decrease num_pending_tx and wake the queues by calling ieee80211_wake_queue(). ieee80211_wake_queue() does not call wake_tx_queue, so we might still have frames in the queue at this point.

While the queues were stopped, the socket buffer might have filled up, and in order for user space to write more, we need to free the frames in the queue, since they are accounted to the socket. In order to free them, we first need to transmit them.

In order to avoid trying to flush the queue every time we free a frame, only do this when there are 3 or fewer frames pending, and while we actually have frames in the queue. This logic was copied from mt76_txq_schedule (mt76), one of the few other drivers that are actually using wake_tx_queue.
Suggested-by: Toke Høiland-Jørgensen
Signed-off-by: Niklas Cassel
---
 drivers/net/wireless/ath/ath10k/txrx.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/wireless/ath/ath10k/txrx.c b/drivers/net/wireless/ath/ath10k/txrx.c
index cda164f6e9f6..1d3b2d2c3fee 100644
--- a/drivers/net/wireless/ath/ath10k/txrx.c
+++ b/drivers/net/wireless/ath/ath10k/txrx.c
@@ -95,6 +95,9 @@ int ath10k_txrx_tx_unref(struct ath10k_htt *htt,
 		wake_up(&htt->empty_tx_wq);
 	spin_unlock_bh(&htt->tx_lock);
 
+	if (htt->num_pending_tx <= 3 && !list_empty(&ar->txqs))
+		ath10k_mac_tx_push_pending(ar);
+
 	dma_unmap_single(dev, skb_cb->paddr, msdu->len, DMA_TO_DEVICE);
 
 	ath10k_report_offchan_tx(htt->ar, msdu);
-- 
2.17.0
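The flow control described in the commit message can be modelled in plain user-space C. This is only a sketch under stated assumptions: struct toy_htt, toy_push_pending() and toy_tx_unref() are made-up names, there is no locking, and "transmitting" is reduced to moving a counter. It is meant to show why waking the queues alone is not enough and what the "3 or fewer pending" threshold does, not how ath10k implements it.

```c
#include <stdbool.h>

/* Hypothetical user-space model; none of these names exist in ath10k. */
struct toy_htt {
	int num_pending_tx;     /* frames handed to the device, not yet completed */
	int max_num_pending_tx; /* limit at which the queues get stopped */
	int queued;             /* frames still sitting in the software txq */
	bool stopped;           /* stands in for the stop/wake queue state */
};

/* Move frames from the software queue to the device until it is full. */
static int toy_push_pending(struct toy_htt *htt)
{
	int pushed = 0;

	while (htt->queued > 0 &&
	       htt->num_pending_tx < htt->max_num_pending_tx) {
		htt->queued--;
		htt->num_pending_tx++;
		pushed++;
	}
	if (htt->num_pending_tx == htt->max_num_pending_tx)
		htt->stopped = true;
	return pushed;
}

/* Completion of one frame: wake the queues and, once nearly drained, refill. */
static int toy_tx_unref(struct toy_htt *htt)
{
	htt->num_pending_tx--;
	htt->stopped = false; /* waking by itself transmits nothing */

	/* Same condition as the patch: few frames pending, queue not empty. */
	if (htt->num_pending_tx <= 3 && htt->queued > 0)
		return toy_push_pending(htt);
	return 0;
}
```

With a limit of 5 and 10 queued frames, the first completion (4 pending) pushes nothing, and the second (3 pending) refills the device from the software queue.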
Re: [PATCH bpf] bpf: fix truncated jump targets on heavy expansions
On Thu, May 17, 2018 at 01:44:11AM +0200, Daniel Borkmann wrote:
> Recently during testing, I ran into the following panic:
>
> Therefore it becomes necessary to detect and reject any such occasions
> in a generic way for native eBPF and cBPF to eBPF migrations. For
> the latter we can simply check bounds in the bpf_convert_filter()'s
> BPF_EMIT_JMP helper macro and bail out once we surpass limits. The
> bpf_patch_insn_single() for native eBPF (and cBPF to eBPF in case
> of subsequent hardening) is a bit more complex in that we need to
> detect such truncations before hitting the bpf_prog_realloc(). Thus
> the latter is split into an extra pass to probe problematic offsets
> on the original program in order to fail early. With that in place
> and carefully tested I no longer hit the panic and the rewrites are
> rejected properly. The above example panic I've seen on bpf-next,
> though the issue itself is generic in that a guard against this issue
> in bpf seems more appropriate in this case.
>
> Signed-off-by: Daniel Borkmann

Nice catch! Applied.
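To make the "probe problematic offsets in order to fail early" idea concrete, here is a simplified user-space sketch. It relies on the fact that an eBPF jump offset is a signed 16-bit field; everything else (the helper name toy_adjust_jump(), the index conventions, the exact boundary handling) is an assumption made for illustration and is not the kernel's code.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper. A jump at instruction index `idx` with relative
 * offset `off` targets idx + off + 1. If `delta` extra instructions are
 * inserted at index `pos`, compute the adjusted offset and report whether
 * it still fits the signed 16-bit offset field.
 */
static bool toy_adjust_jump(int32_t idx, int32_t off, int32_t pos,
			    int32_t delta, int16_t *new_off)
{
	int32_t target = idx + off + 1;
	int32_t adjusted = off;

	if (idx < pos && target > pos)
		adjusted += delta; /* forward jump across the patch site */
	else if (idx > pos && target <= pos)
		adjusted -= delta; /* backward jump across the patch site */

	if (adjusted < INT16_MIN || adjusted > INT16_MAX)
		return false; /* would be truncated: reject before rewriting */

	*new_off = (int16_t)adjusted;
	return true;
}
```

Running such a check over the original program before any reallocation means a rewrite that would overflow is refused with the program left untouched, which is the property the patch description asks for.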
Re: [PATCH ghak81 V3 3/3] audit: collect audit task parameters
On Wed, May 16, 2018 at 7:55 AM, Richard Guy Briggs wrote:
> The audit-related parameters in struct task_struct should ideally be
> collected together and accessed through a standard audit API.
>
> Collect the existing loginuid, sessionid and audit_context together in a
> new struct audit_task_info called "audit" in struct task_struct.
>
> Use kmem_cache to manage this pool of memory.
> Un-inline audit_free() to be able to always recover that memory.
>
> See: https://github.com/linux-audit/audit-kernel/issues/81
>
> Signed-off-by: Richard Guy Briggs
> ---
> include/linux/audit.h | 34 --
> include/linux/sched.h | 5 +
> init/init_task.c | 3 +--
> init/main.c | 2 ++
> kernel/auditsc.c | 51 ++-
> kernel/fork.c | 2 +-
> 6 files changed, 71 insertions(+), 26 deletions(-)

As discussed on-list and offline, I'm going to hold off on this change until the audit container ID work is farther along. That is the main driver for this change, and until that is closer to ready I just can't justify the extra overhead.

> diff --git a/include/linux/audit.h b/include/linux/audit.h > index 69c7847..4f824c4 100644 > --- a/include/linux/audit.h > +++ b/include/linux/audit.h > @@ -216,8 +216,15 @@ static inline void audit_log_task_info(struct > audit_buffer *ab, > > /* These are defined in auditsc.c */ > /* Public API */ > +struct audit_task_info { > + kuid_t loginuid; > + unsigned intsessionid; > + struct audit_context*ctx; > +}; > +extern struct audit_task_info init_struct_audit; > +extern void __init audit_task_init(void); > extern int audit_alloc(struct task_struct *task); > -extern void __audit_free(struct task_struct *task); > +extern void audit_free(struct task_struct *task); > extern void __audit_syscall_entry(int major, unsigned long a0, unsigned long > a1, > unsigned long a2, unsigned long a3); > extern void __audit_syscall_exit(int ret_success, long ret_value); > @@ -239,12 +246,15 @@ extern void audit_seccomp_actions_logged(const char > *names, > > static inline void audit_set_context(struct task_struct *task, struct > audit_context *ctx) > { > - task->audit_context = ctx; > + task->audit->ctx = ctx; > } > > static inline struct audit_context *audit_context(void) > { > - return current->audit_context; > + if (current->audit) > + return current->audit->ctx; > + else > + return NULL; > } > > static inline bool audit_dummy_context(void) > @@ -252,11 +262,7 @@ static inline bool audit_dummy_context(void) > void *p = audit_context(); > return !p || *(int *)p; > } > -static inline void audit_free(struct task_struct *task) > -{ > - if (unlikely(task->audit_context)) > - __audit_free(task); > -} > + > static inline void audit_syscall_entry(int major, unsigned long a0, >unsigned long a1, unsigned long a2, >unsigned long a3) > @@ -328,12 +334,18 @@ extern int auditsc_get_stamp(struct audit_context *ctx, > > static inline kuid_t audit_get_loginuid(struct task_struct *tsk) > { > - return tsk->loginuid; > + if (tsk->audit) > + return tsk->audit->loginuid; > + else > + return 
INVALID_UID; > } > > static inline unsigned int audit_get_sessionid(struct task_struct *tsk) > { > - return tsk->sessionid; > + if (tsk->audit) > + return tsk->audit->sessionid; > + else > + return AUDIT_SID_UNSET; > } > > extern void __audit_ipc_obj(struct kern_ipc_perm *ipcp); > @@ -458,6 +470,8 @@ static inline void audit_fanotify(unsigned int response) > extern int audit_n_rules; > extern int audit_signals; > #else /* CONFIG_AUDITSYSCALL */ > +static inline void __init audit_task_init(void) > +{ } > static inline int audit_alloc(struct task_struct *task) > { > return 0; > diff --git a/include/linux/sched.h b/include/linux/sched.h > index b3d697f..6a5db0e 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -29,7 +29,6 @@ > #include > > /* task_struct member predeclarations (sorted alphabetically): */ > -struct audit_context; > struct backing_dev_info; > struct bio_list; > struct blk_plug; > @@ -832,10 +831,8 @@ struct task_struct { > > struct callback_head*task_works; > > - struct audit_context*audit_context; > #ifdef CONFIG_AUDITSYSCALL > - kuid_t loginuid; > - unsigned intsessionid; > + struct audit_task_info *audit; > #endif > struct seccomp seccomp; > > diff --git a/init/init_task.c b/init/init_task.c > index
Re: [PATCH ghak81 V3 1/3] audit: use new audit_context access funciton for seccomp_actions_logged
On Wed, May 16, 2018 at 7:55 AM, Richard Guy Briggs wrote:
> On the rebase of the following commit on the new seccomp actions_logged
> function, one audit_context access was missed.
>
> commit cdfb6b341f0f2409aba24b84f3b4b2bba50be5c5
> ("audit: use inline function to get audit context")
>
> Signed-off-by: Richard Guy Briggs
> ---
> kernel/auditsc.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)

Merged into audit/next, thanks for the follow-up.

> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> index cbab0da..f3d3dc6 100644
> --- a/kernel/auditsc.c
> +++ b/kernel/auditsc.c
> @@ -2497,7 +2497,7 @@ void audit_seccomp_actions_logged(const char *names, const char *old_names,
>  	if (!audit_enabled)
>  		return;
>
> -	ab = audit_log_start(current->audit_context, GFP_KERNEL,
> +	ab = audit_log_start(audit_context(), GFP_KERNEL,
>  			     AUDIT_CONFIG_CHANGE);
>  	if (unlikely(!ab))
>  		return;
> --
> 1.8.3.1

-- 
paul moore
www.paul-moore.com
Re: [PATCH ghak81 V3 2/3] audit: normalize loginuid read access
On Wed, May 16, 2018 at 7:55 AM, Richard Guy Briggswrote: > Recognizing that the loginuid is an internal audit value, use an access > function to retrieve the audit loginuid value for the task rather than > reaching directly into the task struct to get it. > > Signed-off-by: Richard Guy Briggs > --- > kernel/auditsc.c | 24 +++- > 1 file changed, 15 insertions(+), 9 deletions(-) Also merged into audit/next. > diff --git a/kernel/auditsc.c b/kernel/auditsc.c > index f3d3dc6..ef3e189 100644 > --- a/kernel/auditsc.c > +++ b/kernel/auditsc.c > @@ -374,7 +374,7 @@ static int audit_field_compare(struct task_struct *tsk, > case AUDIT_COMPARE_EGID_TO_OBJ_GID: > return audit_compare_gid(cred->egid, name, f, ctx); > case AUDIT_COMPARE_AUID_TO_OBJ_UID: > - return audit_compare_uid(tsk->loginuid, name, f, ctx); > + return audit_compare_uid(audit_get_loginuid(tsk), name, f, > ctx); > case AUDIT_COMPARE_SUID_TO_OBJ_UID: > return audit_compare_uid(cred->suid, name, f, ctx); > case AUDIT_COMPARE_SGID_TO_OBJ_GID: > @@ -385,7 +385,8 @@ static int audit_field_compare(struct task_struct *tsk, > return audit_compare_gid(cred->fsgid, name, f, ctx); > /* uid comparisons */ > case AUDIT_COMPARE_UID_TO_AUID: > - return audit_uid_comparator(cred->uid, f->op, tsk->loginuid); > + return audit_uid_comparator(cred->uid, f->op, > + audit_get_loginuid(tsk)); > case AUDIT_COMPARE_UID_TO_EUID: > return audit_uid_comparator(cred->uid, f->op, cred->euid); > case AUDIT_COMPARE_UID_TO_SUID: > @@ -394,11 +395,14 @@ static int audit_field_compare(struct task_struct *tsk, > return audit_uid_comparator(cred->uid, f->op, cred->fsuid); > /* auid comparisons */ > case AUDIT_COMPARE_AUID_TO_EUID: > - return audit_uid_comparator(tsk->loginuid, f->op, cred->euid); > + return audit_uid_comparator(audit_get_loginuid(tsk), f->op, > + cred->euid); > case AUDIT_COMPARE_AUID_TO_SUID: > - return audit_uid_comparator(tsk->loginuid, f->op, cred->suid); > + return audit_uid_comparator(audit_get_loginuid(tsk), f->op, > + 
cred->suid); > case AUDIT_COMPARE_AUID_TO_FSUID: > - return audit_uid_comparator(tsk->loginuid, f->op, > cred->fsuid); > + return audit_uid_comparator(audit_get_loginuid(tsk), f->op, > + cred->fsuid); > /* euid comparisons */ > case AUDIT_COMPARE_EUID_TO_SUID: > return audit_uid_comparator(cred->euid, f->op, cred->suid); > @@ -611,7 +615,8 @@ static int audit_filter_rules(struct task_struct *tsk, > result = match_tree_refs(ctx, rule->tree); > break; > case AUDIT_LOGINUID: > - result = audit_uid_comparator(tsk->loginuid, f->op, > f->uid); > + result = audit_uid_comparator(audit_get_loginuid(tsk), > + f->op, f->uid); > break; > case AUDIT_LOGINUID_SET: > result = audit_comparator(audit_loginuid_set(tsk), > f->op, f->val); > @@ -2278,14 +2283,15 @@ int audit_signal_info(int sig, struct task_struct *t) > { > struct audit_aux_data_pids *axp; > struct audit_context *ctx = audit_context(); > - kuid_t uid = current_uid(), t_uid = task_uid(t); > + kuid_t uid = current_uid(), auid, t_uid = task_uid(t); > > if (auditd_test_task(t) && > (sig == SIGTERM || sig == SIGHUP || > sig == SIGUSR1 || sig == SIGUSR2)) { > audit_sig_pid = task_tgid_nr(current); > - if (uid_valid(current->loginuid)) > - audit_sig_uid = current->loginuid; > + auid = audit_get_loginuid(current); > + if (uid_valid(auid)) > + audit_sig_uid = auid; > else > audit_sig_uid = uid; > security_task_getsecid(current, _sig_sid); > -- > 1.8.3.1 > -- paul moore www.paul-moore.com
Re: [patch net-next RFC 04/12] dsa: set devlink port attrs for dsa ports
On 05/17/2018 03:40 PM, Andrew Lunn wrote: > On Thu, May 17, 2018 at 03:06:36PM -0700, Florian Fainelli wrote: >> On 05/17/2018 02:08 PM, Andrew Lunn wrote: >>> On Thu, May 17, 2018 at 10:48:55PM +0200, Jiri Pirko wrote: Thu, May 17, 2018 at 09:14:32PM CEST, f.faine...@gmail.com wrote: > On 05/17/2018 10:39 AM, Jiri Pirko wrote: That is compiled inside "fixed_phy", isn't it? >>> >>> It matches what CONFIG_FIXED_PHY is, so if it's built-in it also becomes >>> built-in, if is modular, it is also modular, this was fixed with >>> 40013ff20b1beed31184935fc0aea6a859d4d4ef ("net: dsa: Fix functional >>> dsa-loop dependency on FIXED_PHY") >> >> Now I have it compiled as module, and after modprobe dsa_loop I see: >> [ 1168.129202] libphy: Fixed MDIO Bus: probed >> [ 1168.222716] dsa-loop fixed-0:1f: DSA mockup driver: 0x1f >> >> This messages I did not see when I had fixed_phy compiled as buildin. >> >> But I still see no netdevs :/ > > The platform data assumes there is a network device named "eth0" as the Oups, I missed, I created dummy device and modprobed again. Now I see: $ sudo devlink port mdio_bus/fixed-0:1f/0: type eth netdev lan1 mdio_bus/fixed-0:1f/1: type eth netdev lan2 mdio_bus/fixed-0:1f/2: type eth netdev lan3 mdio_bus/fixed-0:1f/3: type eth netdev lan4 mdio_bus/fixed-0:1f/4: type notset mdio_bus/fixed-0:1f/5: type notset mdio_bus/fixed-0:1f/6: type notset mdio_bus/fixed-0:1f/7: type notset mdio_bus/fixed-0:1f/8: type notset mdio_bus/fixed-0:1f/9: type notset mdio_bus/fixed-0:1f/10: type notset mdio_bus/fixed-0:1f/11: type notset I wonder why there are ports 4-11 >>> >>> Hi Jiri >>> >>> ds = dsa_switch_alloc(>dev, DSA_MAX_PORTS); >>> >>> It is allocating a switch with 12 ports. However only 4 of them have >>> names. So the core only creates slave devices for those 4. >>> >>> This is a useful test. Real hardware often has unused ports. A WiFi AP >>> with a 7 port switch which only uses 6 ports is often seen. 
>> >> The following patch should fix this: >> >> >> diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c >> index adf50fbc4c13..a06c29ec91f0 100644 >> --- a/net/dsa/dsa2.c >> +++ b/net/dsa/dsa2.c >> @@ -262,13 +262,14 @@ static int dsa_port_setup(struct dsa_port *dp) >> >> memset(>devlink_port, 0, sizeof(dp->devlink_port)); >> >> + if (dp->type == DSA_PORT_TYPE_UNUSED) >> + return 0; >> + >> err = devlink_port_register(ds->devlink, >devlink_port, >> dp->index); > > Hi Florian, Jiri > > Maybe it is better to add a devlink port type unused? The port does not exist on the switch, so it should not even be registered IMHO. -- Florian
Re: [net-next PATCH v2 3/4] net-sysfs: Add interface for Rx queue map per Tx queue
On 5/17/2018 12:05 PM, Florian Fainelli wrote:
> On 05/15/2018 06:26 PM, Amritha Nambiar wrote:
>> Extend transmit queue sysfs attribute to configure Rx queue map
>> per Tx queue. By default no receive queues are configured for the
>> Tx queue.
>>
>> - /sys/class/net/eth0/queues/tx-*/xps_rxqs
>
> Please include an update to Documentation/ABI/testing/sysfs-class-net
> with your new attribute.
>

Will do in the next version. Thanks.
Re: [patch net-next RFC 04/12] dsa: set devlink port attrs for dsa ports
On Thu, May 17, 2018 at 03:06:36PM -0700, Florian Fainelli wrote: > On 05/17/2018 02:08 PM, Andrew Lunn wrote: > > On Thu, May 17, 2018 at 10:48:55PM +0200, Jiri Pirko wrote: > >> Thu, May 17, 2018 at 09:14:32PM CEST, f.faine...@gmail.com wrote: > >>> On 05/17/2018 10:39 AM, Jiri Pirko wrote: > >> That is compiled inside "fixed_phy", isn't it? > > > > It matches what CONFIG_FIXED_PHY is, so if it's built-in it also becomes > > built-in, if is modular, it is also modular, this was fixed with > > 40013ff20b1beed31184935fc0aea6a859d4d4ef ("net: dsa: Fix functional > > dsa-loop dependency on FIXED_PHY") > > Now I have it compiled as module, and after modprobe dsa_loop I see: > [ 1168.129202] libphy: Fixed MDIO Bus: probed > [ 1168.222716] dsa-loop fixed-0:1f: DSA mockup driver: 0x1f > > This messages I did not see when I had fixed_phy compiled as buildin. > > But I still see no netdevs :/ > >>> > >>> The platform data assumes there is a network device named "eth0" as the > >> > >> Oups, I missed, I created dummy device and modprobed again. Now I see: > >> > >> $ sudo devlink port > >> mdio_bus/fixed-0:1f/0: type eth netdev lan1 > >> mdio_bus/fixed-0:1f/1: type eth netdev lan2 > >> mdio_bus/fixed-0:1f/2: type eth netdev lan3 > >> mdio_bus/fixed-0:1f/3: type eth netdev lan4 > >> mdio_bus/fixed-0:1f/4: type notset > >> mdio_bus/fixed-0:1f/5: type notset > >> mdio_bus/fixed-0:1f/6: type notset > >> mdio_bus/fixed-0:1f/7: type notset > >> mdio_bus/fixed-0:1f/8: type notset > >> mdio_bus/fixed-0:1f/9: type notset > >> mdio_bus/fixed-0:1f/10: type notset > >> mdio_bus/fixed-0:1f/11: type notset > >> > >> I wonder why there are ports 4-11 > > > > Hi Jiri > > > > ds = dsa_switch_alloc(>dev, DSA_MAX_PORTS); > > > > It is allocating a switch with 12 ports. However only 4 of them have > > names. So the core only creates slave devices for those 4. > > > > This is a useful test. Real hardware often has unused ports. 
A WiFi AP > > with a 7 port switch which only uses 6 ports is often seen. > > The following patch should fix this: > > > diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c > index adf50fbc4c13..a06c29ec91f0 100644 > --- a/net/dsa/dsa2.c > +++ b/net/dsa/dsa2.c > @@ -262,13 +262,14 @@ static int dsa_port_setup(struct dsa_port *dp) > > memset(>devlink_port, 0, sizeof(dp->devlink_port)); > > + if (dp->type == DSA_PORT_TYPE_UNUSED) > + return 0; > + > err = devlink_port_register(ds->devlink, >devlink_port, > dp->index); Hi Florian, Jiri Maybe it is better to add a devlink port type unused? Andrew
Re: [PATCH iproute2] ip link: Do not call ll_name_to_index when creating a new link
On Thu, 17 May 2018 16:22:37 -0600 dsah...@kernel.org wrote:
> From: David Ahern
>
> Using iproute2 to create a bridge and add 4094 vlans to it can take from
> 2 to 3 *minutes*. The reason is the extraneous call to ll_name_to_index.
> ll_name_to_index results in an ioctl(SIOCGIFINDEX) call which in turn
> invokes dev_load. If the index does not exist, which it won't when
> creating a new link, dev_load calls modprobe twice -- once for
> netdev-NAME and again for NAME. This is unnecessary overhead for each
> link create.
>
> When ip link is invoked for a new device, there is no reason to
> call ll_name_to_index for the new device. With this patch, creating
> a bridge and adding 4094 vlans takes less than 3 *seconds*.
>
> Signed-off-by: David Ahern

Yes, this looks like a real problem. Isn't the cache supposed to reduce this? I don't like adding lots of special-case flags.
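For readers following the thread, the flag-based approach being questioned here can be illustrated with a small stand-alone model. All names below (toy_name_to_index(), toy_parse_dev(), toy_lookups) are hypothetical stand-ins; the real parser in ip/iplink.c handles many more arguments, and the real lookup is ll_name_to_index(). The sketch only shows that the expensive lookup is skipped when the command is "add".

```c
#include <stdbool.h>
#include <string.h>

static int toy_lookups; /* how many (expensive) name lookups were made */

/* Stand-in for the name-to-index lookup; here it always reports "not found". */
static unsigned int toy_name_to_index(const char *name)
{
	(void)name;
	toy_lookups++;
	return 0;
}

/* Parse "dev NAME" pairs; skip the lookup when a new link is being added. */
static unsigned int toy_parse_dev(int argc, char **argv, bool is_add_cmd)
{
	unsigned int dev_index = 0;
	int i;

	for (i = 0; i + 1 < argc; i++) {
		if (strcmp(argv[i], "dev") == 0 && !is_add_cmd)
			dev_index = toy_name_to_index(argv[i + 1]);
	}
	return dev_index;
}
```

The alternative raised in the reply, relying on a name cache instead of a flag, would keep the parser signature unchanged; this sketch does not model that option.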
Re: [bpf PATCH v2 1/2] bpf: sockmap update rollback on error can incorrectly dec prog refcnt
On 05/17/2018 11:06 PM, John Fastabend wrote:
> If the user were to only attach one of the parse or verdict programs
> then it is possible a subsequent sockmap update could incorrectly
> decrement the refcnt on the program. This happens because in the
> rollback logic, after an error, we have to decrement the program
> reference count when it has been incremented. However, we only increment
> the program reference count if the user has both a verdict and a
> parse program. The reason for this is because, at least at the
> moment, both are required for any one to be meaningful. The problem
> fixed here is in the rollback path we decrement the program refcnt
> even if only one exists. But we never incremented the refcnt in
> the first place, creating an imbalance.
>
> This patch fixes the error path to handle this case.
>
> Fixes: 2f857d04601a ("bpf: sockmap, remove STRPARSER map_flags and add multi-map support")
> Reported-by: Daniel Borkmann
> Signed-off-by: John Fastabend
> Acked-by: Martin KaFai Lau

Applied to bpf tree, thanks!
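The imbalance described above is easy to reproduce in a toy model. The following sketch is not the sockmap code: struct toy_prog and toy_update() are invented for illustration, and the "error" is injected by a flag. It shows the rule the fix enforces, namely that the rollback path drops exactly the references that were taken on the way in.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_prog {
	int refcnt; /* hypothetical stand-in for a program reference count */
};

/*
 * References are taken only when both programs are attached. On failure
 * the rollback mirrors that condition instead of decrementing blindly.
 */
static int toy_update(struct toy_prog *parse, struct toy_prog *verdict,
		      bool fail_later)
{
	bool took_refs = false;

	if (parse && verdict) {
		parse->refcnt++;
		verdict->refcnt++;
		took_refs = true;
	}

	if (fail_later) {
		if (took_refs) { /* undo only what was done above */
			parse->refcnt--;
			verdict->refcnt--;
		}
		return -1;
	}
	return 0;
}
```

A rollback that decremented whenever either pointer was non-NULL would leave a lone verdict program with one reference too few after a failed update, which is the bug being reported.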
Re: [PATCH net] net: dsa: Do not register devlink for unused ports
On 05/17/2018 03:16 PM, Florian Fainelli wrote:
> Even if commit 1d27732f411d ("net: dsa: setup and teardown ports") indicated
> that registering a devlink instance for unused ports is not a problem, and this
> is true, this can be confusing nonetheless, so let's not do it.
>
> Fixes: 1d27732f411d ("net: dsa: setup and teardown ports")
> Reported-by: Jiri Pirko
> Signed-off-by: Florian Fainelli
> ---
> net/dsa/dsa2.c | 10 ++
> 1 file changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
> index adf50fbc4c13..cc45a8ca45fb 100644
> --- a/net/dsa/dsa2.c
> +++ b/net/dsa/dsa2.c
> @@ -262,13 +262,14 @@ static int dsa_port_setup(struct dsa_port *dp)
>
>  	memset(&dp->devlink_port, 0, sizeof(dp->devlink_port));
>
> +	if (dp->type == DSA_PORT_TYPE_UNUSED)
> +		return 0;
> +
>  	err = devlink_port_register(ds->devlink, &dp->devlink_port, dp->index);
>  	if (err)
>  		return err;
>
>  	switch (dp->type) {
> -	case DSA_PORT_TYPE_UNUSED:
> -		break;
>  	case DSA_PORT_TYPE_CPU:
>  	case DSA_PORT_TYPE_DSA:
>  		err = dsa_port_link_register_of(dp);
> @@ -293,11 +294,12 @@ static int dsa_port_setup(struct dsa_port *dp)
>
>  static void dsa_port_teardown(struct dsa_port *dp)
>  {
> +	if (dp->type == DSA_PORT_TYPE_UNUSED)
> +		return;
> +
>  	devlink_port_unregister(&dp->devlink_port);
>
>  	switch (dp->type) {
> -	case DSA_PORT_TYPE_UNUSED:
> -		break;

Actually those should be kept in there so that the compiler does not warn about DSA_PORT_TYPE_UNUSED not being handled by the switch() statement. I will resubmit that shortly, or we could even move the registration until after; either way is likely fine.
-- 
Florian
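The warning mentioned in the reply comes from switching over an enum without covering every enumerator and without a default label (GCC's -Wswitch, part of -Wall). A minimal stand-alone sketch, using made-up names (enum toy_port_type, toy_port_setup(), toy_registered) rather than the DSA ones, shows the early return combined with a retained case label:

```c
enum toy_port_type { /* hypothetical mirror of the port types in the patch */
	TOY_PORT_UNUSED,
	TOY_PORT_CPU,
	TOY_PORT_DSA,
	TOY_PORT_USER,
};

static int toy_registered; /* counts devlink-style registrations */

static int toy_port_setup(enum toy_port_type type)
{
	if (type == TOY_PORT_UNUSED)
		return 0; /* bail out before registering anything */

	toy_registered++;

	switch (type) {
	case TOY_PORT_UNUSED: /* unreachable, kept so -Wswitch stays quiet */
		break;
	case TOY_PORT_CPU:
	case TOY_PORT_DSA:
		return 1; /* would register the link */
	case TOY_PORT_USER:
		return 2; /* would create the user-facing netdev */
	}
	return 0;
}
```

Adding a default label, as a draft posted elsewhere in this thread does, also silences the warning, at the cost of no longer being told when a new enumerator is added and left unhandled.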
Re: [PATCH bpf-next 3/3] bpf: Add mtu checking to FIB forwarding helper
On 05/17/2018 06:09 PM, David Ahern wrote: > Add check that egress MTU can handle packet to be forwarded. If > the MTU is less than the packet lenght, return 0 meaning the > packet is expected to continue up the stack for help - eg., > fragmenting the packet or sending an ICMP. > > Signed-off-by: David Ahern> --- > net/core/filter.c | 10 ++ > 1 file changed, 10 insertions(+) > > diff --git a/net/core/filter.c b/net/core/filter.c > index 6d0d1560bd70..c47c47a75d4b 100644 > --- a/net/core/filter.c > +++ b/net/core/filter.c > @@ -4098,6 +4098,7 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct > bpf_fib_lookup *params, > struct fib_nh *nh; > struct flowi4 fl4; > int err; > + u32 mtu; > > dev = dev_get_by_index_rcu(net, params->ifindex); > if (unlikely(!dev)) > @@ -4149,6 +4150,10 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct > bpf_fib_lookup *params, > if (res.fi->fib_nhs > 1) > fib_select_path(net, , , NULL); > > + mtu = ip_mtu_from_fib_result(, params->ipv4_dst); > + if (params->tot_len > mtu) > + return 0; > + > nh = >fib_nh[res.nh_sel]; > > /* do not handle lwt encaps right now */ > @@ -4188,6 +4193,7 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct > bpf_fib_lookup *params, > struct flowi6 fl6; > int strict = 0; > int oif; > + u32 mtu; > > /* link local addresses are never forwarded */ > if (rt6_need_strict(dst) || rt6_need_strict(src)) > @@ -4250,6 +4256,10 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct > bpf_fib_lookup *params, > fl6.flowi6_oif, NULL, > strict); > > + mtu = ip6_mtu_from_fib6(f6i, dst, src); > + if (params->tot_len > mtu) > + return 0; > + > if (f6i->fib6_nh.nh_lwtstate) > return 0; Could you elaborate how this interacts in tc BPF use case where you have e.g. GSO packets and tot_len from aggregated packets would definitely be larger than MTU (e.g. see is_skb_forwardable() as one example on such checks)? Should this be an opt-in via a new flag for the helper? Thanks, Daniel
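One possible reading of the concern raised above, sketched as a stand-alone model: a plain length comparison rejects aggregated packets whose total length exceeds the MTU even though they would be segmented before transmission. The function toy_fits_mtu() and its is_gso parameter are assumptions for illustration only; they are not part of the helper's interface and do not describe what the patch does.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of the check under discussion. Returns true when the
 * packet may be forwarded directly, false when it should continue up the
 * stack (for fragmentation or an ICMP error).
 */
static bool toy_fits_mtu(uint32_t tot_len, uint32_t mtu, bool is_gso)
{
	if (tot_len <= mtu)
		return true;

	/*
	 * An aggregated (GSO) packet is split into MTU-sized segments later,
	 * so an oversized total length alone does not rule out forwarding.
	 */
	return is_gso;
}
```

Whether such an exemption should be implicit, or opt-in through a flag as the reply suggests, is exactly the open question in this thread.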
[PATCH iproute2] ip link: Do not call ll_name_to_index when creating a new link
From: David AhernUsing iproute2 to create a bridge and add 4094 vlans to it can take from 2 to 3 *minutes*. The reason is the extraneous call to ll_name_to_index. ll_name_to_index results in an ioctl(SIOCGIFINDEX) call which in turn invokes dev_load. If the index does not exist, which it won't when creating a new link, dev_load calls modprobe twice -- once for netdev-NAME and again for NAME. This is unnecessary overhead for each link create. When ip link is invoked for a new device, there is no reason to call ll_name_to_index for the new device. With this patch, creating a bridge and adding 4094 vlans takes less than 3 *seconds*. Signed-off-by: David Ahern --- ip/ip_common.h| 3 ++- ip/iplink.c | 22 +- ip/iplink_vxcan.c | 3 ++- ip/link_veth.c| 3 ++- 4 files changed, 19 insertions(+), 12 deletions(-) diff --git a/ip/ip_common.h b/ip/ip_common.h index 1b89795caa58..67f413474631 100644 --- a/ip/ip_common.h +++ b/ip/ip_common.h @@ -132,7 +132,8 @@ struct link_util { struct link_util *get_link_kind(const char *kind); -int iplink_parse(int argc, char **argv, struct iplink_req *req, char **type); +int iplink_parse(int argc, char **argv, struct iplink_req *req, +char **type, bool is_add_cmd); /* iplink_bridge.c */ void br_dump_bridge_id(const struct ifla_bridge_id *id, char *buf, size_t len); diff --git a/ip/iplink.c b/ip/iplink.c index e6bb4493120e..c8bf49ed3d24 100644 --- a/ip/iplink.c +++ b/ip/iplink.c @@ -571,7 +571,8 @@ static int iplink_parse_vf(int vf, int *argcp, char ***argvp, return 0; } -int iplink_parse(int argc, char **argv, struct iplink_req *req, char **type) +int iplink_parse(int argc, char **argv, struct iplink_req *req, char **type, +bool is_add_cmd) { char *name = NULL; char *dev = NULL; @@ -610,7 +611,8 @@ int iplink_parse(int argc, char **argv, struct iplink_req *req, char **type) name = *argv; if (!dev) { dev = name; - dev_index = ll_name_to_index(dev); + if (!is_add_cmd) + dev_index = ll_name_to_index(dev); } } else if (strcmp(*argv, "index") == 0) { 
NEXT_ARG(); @@ -919,7 +921,8 @@ int iplink_parse(int argc, char **argv, struct iplink_req *req, char **type) if (check_ifname(*argv)) invarg("\"dev\" not a valid ifname", *argv); dev = *argv; - dev_index = ll_name_to_index(dev); + if (!is_add_cmd) + dev_index = ll_name_to_index(dev); } argc--; argv++; } @@ -1011,7 +1014,8 @@ int iplink_parse(int argc, char **argv, struct iplink_req *req, char **type) return ret; } -static int iplink_modify(int cmd, unsigned int flags, int argc, char **argv) +static int iplink_modify(int cmd, unsigned int flags, int argc, char **argv, +bool is_add_cmd) { char *type = NULL; struct iplink_req req = { @@ -1022,7 +1026,7 @@ static int iplink_modify(int cmd, unsigned int flags, int argc, char **argv) }; int ret; - ret = iplink_parse(argc, argv, , ); + ret = iplink_parse(argc, argv, , , is_add_cmd); if (ret < 0) return ret; @@ -1630,18 +1634,18 @@ int do_iplink(int argc, char **argv) if (matches(*argv, "add") == 0) return iplink_modify(RTM_NEWLINK, NLM_F_CREATE|NLM_F_EXCL, -argc-1, argv+1); +argc-1, argv+1, true); if (matches(*argv, "set") == 0 || matches(*argv, "change") == 0) return iplink_modify(RTM_NEWLINK, 0, -argc-1, argv+1); +argc-1, argv+1, false); if (matches(*argv, "replace") == 0) return iplink_modify(RTM_NEWLINK, NLM_F_CREATE|NLM_F_REPLACE, -argc-1, argv+1); +argc-1, argv+1, false); if (matches(*argv, "delete") == 0) return iplink_modify(RTM_DELLINK, 0, -argc-1, argv+1); +argc-1, argv+1, false); } else { #if IPLINK_IOCTL_COMPAT if (matches(*argv, "set") == 0) diff --git a/ip/iplink_vxcan.c b/ip/iplink_vxcan.c index 8b08c9a70c65..e30a784d9851 100644 --- a/ip/iplink_vxcan.c +++
[PATCH net] net: dsa: Do not register devlink for unused ports
Even if commit 1d27732f411d ("net: dsa: setup and teardown ports") indicated that registering a devlink instance for unused ports is not a problem, and this is true, this can be confusing nonetheless, so let's not do it.

Fixes: 1d27732f411d ("net: dsa: setup and teardown ports")
Reported-by: Jiri Pirko
Signed-off-by: Florian Fainelli
---
 net/dsa/dsa2.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index adf50fbc4c13..cc45a8ca45fb 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -262,13 +262,14 @@ static int dsa_port_setup(struct dsa_port *dp)
 
 	memset(&dp->devlink_port, 0, sizeof(dp->devlink_port));
 
+	if (dp->type == DSA_PORT_TYPE_UNUSED)
+		return 0;
+
 	err = devlink_port_register(ds->devlink, &dp->devlink_port, dp->index);
 	if (err)
 		return err;
 
 	switch (dp->type) {
-	case DSA_PORT_TYPE_UNUSED:
-		break;
 	case DSA_PORT_TYPE_CPU:
 	case DSA_PORT_TYPE_DSA:
 		err = dsa_port_link_register_of(dp);
@@ -293,11 +294,12 @@ static int dsa_port_setup(struct dsa_port *dp)
 
 static void dsa_port_teardown(struct dsa_port *dp)
 {
+	if (dp->type == DSA_PORT_TYPE_UNUSED)
+		return;
+
 	devlink_port_unregister(&dp->devlink_port);
 
 	switch (dp->type) {
-	case DSA_PORT_TYPE_UNUSED:
-		break;
 	case DSA_PORT_TYPE_CPU:
 	case DSA_PORT_TYPE_DSA:
 		dsa_port_link_unregister_of(dp);
-- 
2.14.1
Re: [PATCH v3 net-next 3/6] tcp: add SACK compression
On Thu, May 17, 2018 at 2:57 PM, Neal Cardwell wrote:
> On Thu, May 17, 2018 at 5:47 PM Eric Dumazet wrote:
>
>> When TCP receives an out-of-order packet, it immediately sends
>> a SACK packet, generating network load but also forcing the
>> receiver to send 1-MSS pathological packets, increasing its
>> RTX queue length/depth, and thus processing time.
>
>> Wifi networks suffer from this aggressive behavior, but generally
>> speaking, all these SACK packets add fuel to the fire when networks
>> are under congestion.
>
>> This patch adds a high resolution timer and tp->compressed_ack counter.
>
>> Instead of sending a SACK, we program this timer with a small delay,
>> based on RTT and capped to 1 ms :
>
>> delay = min ( 5 % of RTT, 1 ms)
>
>> If subsequent SACKs need to be sent while the timer has not yet
>> expired, we simply increment tp->compressed_ack.
>
>> When timer expires, a SACK is sent with the latest information.
>> Whenever an ACK is sent (if data is sent, or if in-order
>> data is received) timer is canceled.
>
>> Note that tcp_sack_new_ofo_skb() is able to force a SACK to be sent
>> if the sack blocks need to be shuffled, even if the timer has not
>> expired.
>
>> A new SNMP counter is added in the following patch.
>
>> Two other patches add sysctls to allow changing the 1,000,000 and 44
>> values that this commit hard-coded.
>
>> Signed-off-by: Eric Dumazet
>> ---
>
> Very nice. I like the constants and the min(rcv_rtt, srtt).
>
> Acked-by: Neal Cardwell

Acked-by: Yuchung Cheng

Great work. Hopefully this would save middle-boxes from handling TCP-ACK themselves.

>
> Thanks!
>
> neal
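The delay formula quoted above, delay = min(5 % of RTT, 1 ms), can be worked through with a few lines of C. This is a sketch, not the kernel implementation: toy_sack_delay_ns() and TOY_MAX_DELAY_NS are made-up names, the RTT is passed in microseconds as a single value (the thread mentions min(rcv_rtt, srtt) as the actual input), and the 1,000,000 ns cap is the figure named in the commit message.

```c
#include <stdint.h>

#define TOY_MAX_DELAY_NS 1000000ULL /* 1 ms cap from the commit message */

/* delay = min(5 % of RTT, 1 ms); RTT in microseconds, result in ns. */
static uint64_t toy_sack_delay_ns(uint64_t rtt_us)
{
	uint64_t delay = rtt_us * 1000ULL / 20; /* 5 % of the RTT, in ns */

	return delay < TOY_MAX_DELAY_NS ? delay : TOY_MAX_DELAY_NS;
}
```

For a 100 us RTT this yields 5 us; the cap takes over at a 20 ms RTT, so any slower path waits 1 ms at most before the compressed SACK goes out.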
Re: [patch net-next RFC 04/12] dsa: set devlink port attrs for dsa ports
On 05/17/2018 02:08 PM, Andrew Lunn wrote: > On Thu, May 17, 2018 at 10:48:55PM +0200, Jiri Pirko wrote: >> Thu, May 17, 2018 at 09:14:32PM CEST, f.faine...@gmail.com wrote: >>> On 05/17/2018 10:39 AM, Jiri Pirko wrote: >> That is compiled inside "fixed_phy", isn't it? > > It matches what CONFIG_FIXED_PHY is, so if it's built-in it also becomes > built-in, if is modular, it is also modular, this was fixed with > 40013ff20b1beed31184935fc0aea6a859d4d4ef ("net: dsa: Fix functional > dsa-loop dependency on FIXED_PHY") Now I have it compiled as module, and after modprobe dsa_loop I see: [ 1168.129202] libphy: Fixed MDIO Bus: probed [ 1168.222716] dsa-loop fixed-0:1f: DSA mockup driver: 0x1f This messages I did not see when I had fixed_phy compiled as buildin. But I still see no netdevs :/ >>> >>> The platform data assumes there is a network device named "eth0" as the >> >> Oups, I missed, I created dummy device and modprobed again. Now I see: >> >> $ sudo devlink port >> mdio_bus/fixed-0:1f/0: type eth netdev lan1 >> mdio_bus/fixed-0:1f/1: type eth netdev lan2 >> mdio_bus/fixed-0:1f/2: type eth netdev lan3 >> mdio_bus/fixed-0:1f/3: type eth netdev lan4 >> mdio_bus/fixed-0:1f/4: type notset >> mdio_bus/fixed-0:1f/5: type notset >> mdio_bus/fixed-0:1f/6: type notset >> mdio_bus/fixed-0:1f/7: type notset >> mdio_bus/fixed-0:1f/8: type notset >> mdio_bus/fixed-0:1f/9: type notset >> mdio_bus/fixed-0:1f/10: type notset >> mdio_bus/fixed-0:1f/11: type notset >> >> I wonder why there are ports 4-11 > > Hi Jiri > > ds = dsa_switch_alloc(>dev, DSA_MAX_PORTS); > > It is allocating a switch with 12 ports. However only 4 of them have > names. So the core only creates slave devices for those 4. > > This is a useful test. Real hardware often has unused ports. A WiFi AP > with a 7 port switch which only uses 6 ports is often seen. 
The following patch should fix this:

diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index adf50fbc4c13..a06c29ec91f0 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -262,13 +262,14 @@ static int dsa_port_setup(struct dsa_port *dp)

 	memset(&dp->devlink_port, 0, sizeof(dp->devlink_port));

+	if (dp->type == DSA_PORT_TYPE_UNUSED)
+		return 0;
+
 	err = devlink_port_register(ds->devlink, &dp->devlink_port, dp->index);
 	if (err)
 		return err;

 	switch (dp->type) {
-	case DSA_PORT_TYPE_UNUSED:
-		break;
 	case DSA_PORT_TYPE_CPU:
 	case DSA_PORT_TYPE_DSA:
 		err = dsa_port_link_register_of(dp);
@@ -286,6 +287,8 @@ static int dsa_port_setup(struct dsa_port *dp)
 		else
 			devlink_port_type_eth_set(&dp->devlink_port, dp->slave);
 		break;
+	default:
+		break;
 	}

 	return 0;
@@ -293,11 +296,12 @@ static int dsa_port_setup(struct dsa_port *dp)

 static void dsa_port_teardown(struct dsa_port *dp)
 {
+	if (dp->type == DSA_PORT_TYPE_UNUSED)
+		return;
+
 	devlink_port_unregister(&dp->devlink_port);

 	switch (dp->type) {
-	case DSA_PORT_TYPE_UNUSED:
-		break;
 	case DSA_PORT_TYPE_CPU:
 	case DSA_PORT_TYPE_DSA:
 		dsa_port_link_unregister_of(dp);
@@ -308,6 +312,8 @@ static void dsa_port_teardown(struct dsa_port *dp)
 			dp->slave = NULL;
 		}
 		break;
+	default:
+		break;
 	}
 }

-- 
Florian
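The effect of the early returns can be modelled in plain userspace C. This is only a sketch under stated assumptions: the `port` struct, the `registered` flag and the helper names are hypothetical stand-ins for the devlink registration state, not DSA or devlink API.

```c
/* Hypothetical port types, mirroring the DSA_PORT_TYPE_* idea. */
enum port_type { PORT_UNUSED, PORT_CPU, PORT_DSA, PORT_USER };

struct port {
	enum port_type type;
	int registered;		/* stands in for devlink_port_register() state */
};

/* Setup: unused ports return early and are never registered. */
static int port_setup(struct port *p)
{
	p->registered = 0;

	if (p->type == PORT_UNUSED)
		return 0;

	p->registered = 1;
	return 0;
}

/* Teardown mirrors setup, so an unused port is never unregistered. */
static void port_teardown(struct port *p)
{
	if (p->type == PORT_UNUSED)
		return;

	p->registered = 0;
}

/* Count how many ports ended up registered. */
static int count_registered(const struct port *ports, int n)
{
	int i, count = 0;

	for (i = 0; i < n; i++)
		count += ports[i].registered;
	return count;
}
```

In this model, a 12-port switch with only 4 named user ports ends up with 4 registered entries instead of 12, and setup and teardown stay symmetric because both skip the same ports.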
Re: [PATCH v3 net-next 6/6] tcp: add tcp_comp_sack_nr sysctl
On Thu, May 17, 2018 at 5:47 PM Eric Dumazet wrote:

> This per netns sysctl allows for TCP SACK compression fine-tuning.

> This limits number of SACK that can be compressed.
> Using 0 disables SACK compression.

> Signed-off-by: Eric Dumazet
> ---

Acked-by: Neal Cardwell

Thanks!
neal
Re: [PATCH v3 net-next 5/6] tcp: add tcp_comp_sack_delay_ns sysctl
On Thu, May 17, 2018 at 5:47 PM Eric Dumazet wrote:

> This per netns sysctl allows for TCP SACK compression fine-tuning.

> Its default value is 1,000,000, or 1 ms to meet TSO autosizing period.

> Signed-off-by: Eric Dumazet
> ---

Acked-by: Neal Cardwell

Thanks!
neal
[net PATCH] net: Fix a bug in removing queues from XPS map
While removing queues from the XPS map, the individual CPU ID alone was
used to index the CPUs map. This should be changed to also factor in the
traffic class mapping for the CPU-to-queue lookup.

Fixes: 184c449f91fe ("net: Add support for XPS with QoS via traffic classes")
Signed-off-by: Amritha Nambiar
Acked-by: Alexander Duyck
---
 net/core/dev.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 9f43901..9397577 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2125,7 +2125,7 @@ static bool remove_xps_queue_cpu(struct net_device *dev,
 		int i, j;

 		for (i = count, j = offset; i--; j++) {
-			if (!remove_xps_queue(dev_maps, cpu, j))
+			if (!remove_xps_queue(dev_maps, tci, j))
 				break;
 		}

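To illustrate why the bare CPU ID is the wrong index, here is a minimal userspace sketch. It assumes, for illustration only, that the map is a flat array with `num_tc` consecutive slots per CPU, so the slot for a (CPU, traffic class) pair is `cpu * num_tc + tc`. The names `slot`, `xps_tci` and `remove_queue_for_cpu` are hypothetical and not taken from net/core/dev.c.

```c
#include <stdbool.h>

#define NUM_CPUS 4
#define NUM_TC   2

/* Hypothetical flat map: slot[tci] is the queue for (cpu, tc), or -1 if none. */
static int slot[NUM_CPUS * NUM_TC];

/* Flat index for a (cpu, tc) pair, assuming num_tc consecutive slots per CPU. */
static int xps_tci(int cpu, int num_tc, int tc)
{
	return cpu * num_tc + tc;
}

/* Clear every traffic-class slot owned by one CPU; true if anything was removed. */
static bool remove_queue_for_cpu(int cpu, int num_tc)
{
	bool removed = false;
	int tc;

	for (tc = 0; tc < num_tc; tc++) {
		int tci = xps_tci(cpu, num_tc, tc);

		if (slot[tci] >= 0) {
			slot[tci] = -1;
			removed = true;
		}
	}
	return removed;
}
```

With two traffic classes, CPU 2 owns slots 4 and 5 in this layout. Using the bare CPU ID 2 as the index would instead touch slot 2, which belongs to CPU 1.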
Re: [PATCH v3 net-next 3/6] tcp: add SACK compression
On Thu, May 17, 2018 at 5:47 PM Eric Dumazet wrote:

> When TCP receives an out-of-order packet, it immediately sends
> a SACK packet, generating network load but also forcing the
> receiver to send 1-MSS pathological packets, increasing its
> RTX queue length/depth, and thus processing time.

> Wifi networks suffer from this aggressive behavior, but generally
> speaking, all these SACK packets add fuel to the fire when networks
> are under congestion.

> This patch adds a high resolution timer and tp->compressed_ack counter.

> Instead of sending a SACK, we program this timer with a small delay,
> based on RTT and capped to 1 ms :

>     delay = min ( 5 % of RTT, 1 ms)

> If subsequent SACKs need to be sent while the timer has not yet
> expired, we simply increment tp->compressed_ack.

> When timer expires, a SACK is sent with the latest information.
> Whenever an ACK is sent (if data is sent, or if in-order
> data is received) timer is canceled.

> Note that tcp_sack_new_ofo_skb() is able to force a SACK to be sent
> if the sack blocks need to be shuffled, even if the timer has not
> expired.

> A new SNMP counter is added in the following patch.

> Two other patches add sysctls to allow changing the 1,000,000 and 44
> values that this commit hard-coded.

> Signed-off-by: Eric Dumazet
> ---

Very nice. I like the constants and the min(rcv_rtt, srtt).

Acked-by: Neal Cardwell

Thanks!
neal
Re: [RFC PATCH ghak32 V2 01/13] audit: add container id
On 2018-05-17 17:00, Steve Grubb wrote: > On Fri, 16 Mar 2018 05:00:28 -0400 > Richard Guy Briggswrote: > > > Implement the proc fs write to set the audit container ID of a > > process, emitting an AUDIT_CONTAINER record to document the event. > > > > This is a write from the container orchestrator task to a proc entry > > of the form /proc/PID/containerid where PID is the process ID of the > > newly created task that is to become the first task in a container, > > or an additional task added to a container. > > > > The write expects up to a u64 value (unset: 18446744073709551615). > > > > This will produce a record such as this: > > type=CONTAINER msg=audit(1519903238.968:261): op=set pid=596 uid=0 > > subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 auid=0 > > tty=pts0 ses=1 opid=596 old-contid=18446744073709551615 contid=123455 > > res=0 > > The was one thing I was wondering about. Currently when we set the > loginuid, the record is AUDIT_LOGINUID. The corollary is that when we > set the container id, the event should be AUDIT_CONTAINERID or > AUDIT_CONTAINER_ID. The record type is actually AUDIT_LOGIN. The field type is AUDIT_LOGINUID. Given that correction, I think we're fine and could potentially violently agree. The existing naming is consistent. > During syscall events, the path info is returned in a a record simply > called AUDIT_PATH, cwd info is returned in AUDIT_CWD. So, rather than > calling the record that gets attached to everything > AUDIT_CONTAINER_INFO, how about simply AUDIT_CONTAINER. Considering the container initiation record is different than the record to document the container involved in an otherwise normal syscall, we need two names. I don't have a strong opinion what they are. 
I'd prefer AUDIT_CONTAINER and AUDIT_CONTAINER_INFO so that the two are different enough to be visually distinct while leaving AUDIT_CONTAINERID for the field type in patch 4 ("audit: add containerid filtering") > > The "op" field indicates an initial set. The "pid" to "ses" fields > > are the orchestrator while the "opid" field is the object's PID, the > > process being "contained". Old and new container ID values are given > > in the "contid" fields, while res indicates its success. > > > > It is not permitted to self-set, unset or re-set the container ID. A > > child inherits its parent's container ID, but then can be set only > > once after. > > > > See: https://github.com/linux-audit/audit-kernel/issues/32 > > > > Signed-off-by: Richard Guy Briggs > > --- > > fs/proc/base.c | 37 > > include/linux/audit.h | 16 + > > include/linux/init_task.h | 4 ++- > > include/linux/sched.h | 1 + > > include/uapi/linux/audit.h | 2 ++ > > kernel/auditsc.c | 84 > > ++ 6 files changed, 143 > > insertions(+), 1 deletion(-) > > > > diff --git a/fs/proc/base.c b/fs/proc/base.c > > index 60316b5..6ce4fbe 100644 > > --- a/fs/proc/base.c > > +++ b/fs/proc/base.c > > @@ -1299,6 +1299,41 @@ static ssize_t proc_sessionid_read(struct file > > * file, char __user * buf, .read= proc_sessionid_read, > > .llseek = generic_file_llseek, > > }; > > + > > +static ssize_t proc_containerid_write(struct file *file, const char > > __user *buf, > > + size_t count, loff_t *ppos) > > +{ > > + struct inode *inode = file_inode(file); > > + u64 containerid; > > + int rv; > > + struct task_struct *task = get_proc_task(inode); > > + > > + if (!task) > > + return -ESRCH; > > + if (*ppos != 0) { > > + /* No partial writes. 
*/ > > + put_task_struct(task); > > + return -EINVAL; > > + } > > + > > + rv = kstrtou64_from_user(buf, count, 10, ); > > + if (rv < 0) { > > + put_task_struct(task); > > + return rv; > > + } > > + > > + rv = audit_set_containerid(task, containerid); > > + put_task_struct(task); > > + if (rv < 0) > > + return rv; > > + return count; > > +} > > + > > +static const struct file_operations proc_containerid_operations = { > > + .write = proc_containerid_write, > > + .llseek = generic_file_llseek, > > +}; > > + > > #endif > > > > #ifdef CONFIG_FAULT_INJECTION > > @@ -2961,6 +2996,7 @@ static int proc_pid_patch_state(struct seq_file > > *m, struct pid_namespace *ns, #ifdef CONFIG_AUDITSYSCALL > > REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations), > > REG("sessionid", S_IRUGO, proc_sessionid_operations), > > + REG("containerid", S_IWUSR, proc_containerid_operations), > > #endif > > #ifdef CONFIG_FAULT_INJECTION > > REG("make-it-fail", S_IRUGO|S_IWUSR, > > proc_fault_inject_operations), @@ -3355,6 +3391,7 @@ static int > > proc_tid_comm_permission(struct inode *inode, int mask) #ifdef > > CONFIG_AUDITSYSCALL REG("loginuid", S_IWUSR|S_IRUGO, > >
Re: [PATCH net 0/7] net: ip6_gre: Fixes in headroom handling
David Miller writes:

> Luckily for you, your Fixes: tags went out before I pushed, so I could
> actually fix up the commit messages and add the tags.

I was hoping that would be the case.

Thanks,
Petr
[PATCH v3 net-next 2/6] tcp: do not force quickack when receiving out-of-order packets
As explained in commit 9f9843a751d0 ("tcp: properly handle stretch
acks in slow start"), TCP stacks have to consider how many packets
are acknowledged in one single ACK, because of GRO, but also
because of ACK compression or losses.

We plan to add SACK compression in the following patch, so we
must not call tcp_enter_quickack_mode() here.

Signed-off-by: Eric Dumazet
Acked-by: Neal Cardwell
Acked-by: Soheil Hassas Yeganeh
---
 net/ipv4/tcp_input.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 0bf032839548f8dccb7f24a6fb5a7d47ea29208b..f5622b250665178e44460fa2cd4a11af23dfb23d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4715,8 +4715,6 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 	if (!before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt + tcp_receive_window(tp)))
 		goto out_of_window;

-	tcp_enter_quickack_mode(sk);
-
 	if (before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt)) {
 		/* Partial packet, seq < rcv_next < end_seq */
 		SOCK_DEBUG(sk, "partial packet: rcv_next %X seq %X - %X\n",
-- 
2.17.0.441.gb46fe60e1d-goog
[PATCH v3 net-next 1/6] tcp: use __sock_put() instead of sock_put() in tcp_clear_xmit_timers()
Socket can not disappear under us.

Signed-off-by: Eric Dumazet
Acked-by: Neal Cardwell
---
 include/net/tcp.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6deb540297ccaa1f05ce633efe313d1ca2c15dd9..511bd0fde1dc1dd842598d083905b0425bcb05f8 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -559,7 +559,7 @@ void tcp_init_xmit_timers(struct sock *);
 static inline void tcp_clear_xmit_timers(struct sock *sk)
 {
 	if (hrtimer_try_to_cancel(&tcp_sk(sk)->pacing_timer) == 1)
-		sock_put(sk);
+		__sock_put(sk);

 	inet_csk_clear_xmit_timers(sk);
 }
-- 
2.17.0.441.gb46fe60e1d-goog
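The one-line change relies on the difference between the two put helpers: sock_put() drops a reference and frees the socket if that was the last one, while __sock_put() only decrements and is meant for callers that know another reference is still held. A rough userspace model of that distinction follows. The `obj` struct and all function names are hypothetical stand-ins, not kernel API.

```c
#include <stdbool.h>

struct obj {
	int refcnt;
	bool freed;
};

/* Take an extra reference, e.g. when arming a timer. */
static void obj_hold(struct obj *o)
{
	o->refcnt++;
}

/* Models sock_put(): drop a reference and free on the last one. */
static void obj_put(struct obj *o)
{
	if (--o->refcnt == 0)
		o->freed = true;
}

/* Models __sock_put(): drop a reference that is known not to be the last. */
static void obj_put_nofree(struct obj *o)
{
	o->refcnt--;
}

/*
 * Cancel a pending timer. The caller still owns its own reference, so the
 * count cannot reach zero here and the cheaper put is sufficient.
 */
static void clear_timer(struct obj *o, bool *timer_armed)
{
	if (*timer_armed) {
		*timer_armed = false;
		obj_put_nofree(o);
	}
}
```

The object is only released later, when the owner drops its own reference through the freeing variant.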
Re: [PATCH v3 1/2] media: rc: introduce BPF_PROG_RAWIR_EVENT
Hi, Again thanks for a thoughtful review. This will definitely will improve the code. On Thu, May 17, 2018 at 10:02:52AM -0700, Y Song wrote: > On Wed, May 16, 2018 at 2:04 PM, Sean Youngwrote: > > Add support for BPF_PROG_RAWIR_EVENT. This type of BPF program can call > > rc_keydown() to reported decoded IR scancodes, or rc_repeat() to report > > that the last key should be repeated. > > > > The bpf program can be attached to using the bpf(BPF_PROG_ATTACH) syscall; > > the target_fd must be the /dev/lircN device. > > > > Signed-off-by: Sean Young > > --- > > drivers/media/rc/Kconfig | 13 ++ > > drivers/media/rc/Makefile | 1 + > > drivers/media/rc/bpf-rawir-event.c | 363 + > > drivers/media/rc/lirc_dev.c| 24 ++ > > drivers/media/rc/rc-core-priv.h| 24 ++ > > drivers/media/rc/rc-ir-raw.c | 14 +- > > include/linux/bpf_rcdev.h | 30 +++ > > include/linux/bpf_types.h | 3 + > > include/uapi/linux/bpf.h | 55 - > > kernel/bpf/syscall.c | 7 + > > 10 files changed, 531 insertions(+), 3 deletions(-) > > create mode 100644 drivers/media/rc/bpf-rawir-event.c > > create mode 100644 include/linux/bpf_rcdev.h > > > > diff --git a/drivers/media/rc/Kconfig b/drivers/media/rc/Kconfig > > index eb2c3b6eca7f..2172d65b0213 100644 > > --- a/drivers/media/rc/Kconfig > > +++ b/drivers/media/rc/Kconfig > > @@ -25,6 +25,19 @@ config LIRC > >passes raw IR to and from userspace, which is needed for > >IR transmitting (aka "blasting") and for the lirc daemon. > > > > +config BPF_RAWIR_EVENT > > + bool "Support for eBPF programs attached to lirc devices" > > + depends on BPF_SYSCALL > > + depends on RC_CORE=y > > + depends on LIRC > > + help > > + Allow attaching eBPF programs to a lirc device using the bpf(2) > > + syscall command BPF_PROG_ATTACH. This is supported for raw IR > > + receivers. > > + > > + These eBPF programs can be used to decode IR into scancodes, for > > + IR protocols not supported by the kernel decoders. 
> > + > > menuconfig RC_DECODERS > > bool "Remote controller decoders" > > depends on RC_CORE > > diff --git a/drivers/media/rc/Makefile b/drivers/media/rc/Makefile > > index 2e1c87066f6c..74907823bef8 100644 > > --- a/drivers/media/rc/Makefile > > +++ b/drivers/media/rc/Makefile > > @@ -5,6 +5,7 @@ obj-y += keymaps/ > > obj-$(CONFIG_RC_CORE) += rc-core.o > > rc-core-y := rc-main.o rc-ir-raw.o > > rc-core-$(CONFIG_LIRC) += lirc_dev.o > > +rc-core-$(CONFIG_BPF_RAWIR_EVENT) += bpf-rawir-event.o > > obj-$(CONFIG_IR_NEC_DECODER) += ir-nec-decoder.o > > obj-$(CONFIG_IR_RC5_DECODER) += ir-rc5-decoder.o > > obj-$(CONFIG_IR_RC6_DECODER) += ir-rc6-decoder.o > > diff --git a/drivers/media/rc/bpf-rawir-event.c > > b/drivers/media/rc/bpf-rawir-event.c > > new file mode 100644 > > index ..7cb48b8d87b5 > > --- /dev/null > > +++ b/drivers/media/rc/bpf-rawir-event.c > > @@ -0,0 +1,363 @@ > > +// SPDX-License-Identifier: GPL-2.0 > > +// bpf-rawir-event.c - handles bpf > > +// > > +// Copyright (C) 2018 Sean Young > > + > > +#include > > +#include > > +#include > > +#include "rc-core-priv.h" > > + > > +/* > > + * BPF interface for raw IR > > + */ > > +const struct bpf_prog_ops rawir_event_prog_ops = { > > +}; > > + > > +BPF_CALL_1(bpf_rc_repeat, struct bpf_rawir_event*, event) > > +{ > > + struct ir_raw_event_ctrl *ctrl; > > + > > + ctrl = container_of(event, struct ir_raw_event_ctrl, > > bpf_rawir_event); > > + > > + rc_repeat(ctrl->dev); > > + > > + return 0; > > +} > > + > > +static const struct bpf_func_proto rc_repeat_proto = { > > + .func = bpf_rc_repeat, > > + .gpl_only = true, /* rc_repeat is EXPORT_SYMBOL_GPL */ > > + .ret_type = RET_INTEGER, > > + .arg1_type = ARG_PTR_TO_CTX, > > +}; > > + > > +BPF_CALL_4(bpf_rc_keydown, struct bpf_rawir_event*, event, u32, protocol, > > + u32, scancode, u32, toggle) > > +{ > > + struct ir_raw_event_ctrl *ctrl; > > + > > + ctrl = container_of(event, struct ir_raw_event_ctrl, > > bpf_rawir_event); > > + > > + rc_keydown(ctrl->dev, protocol, 
scancode, toggle != 0); > > + > > + return 0; > > +} > > + > > +static const struct bpf_func_proto rc_keydown_proto = { > > + .func = bpf_rc_keydown, > > + .gpl_only = true, /* rc_keydown is EXPORT_SYMBOL_GPL */ > > + .ret_type = RET_INTEGER, > > + .arg1_type = ARG_PTR_TO_CTX, > > + .arg2_type = ARG_ANYTHING, > > + .arg3_type = ARG_ANYTHING, > > + .arg4_type = ARG_ANYTHING, > > +}; > > + > > +static const struct bpf_func_proto * > > +rawir_event_func_proto(enum bpf_func_id func_id, const struct bpf_prog > > *prog) > > +{ > > + switch (func_id) { > > + case BPF_FUNC_rc_repeat: > > +
[PATCH v3 net-next 3/6] tcp: add SACK compression
When TCP receives an out-of-order packet, it immediately sends a SACK packet, generating network load but also forcing the receiver to send 1-MSS pathological packets, increasing its RTX queue length/depth, and thus processing time. Wifi networks suffer from this aggressive behavior, but generally speaking, all these SACK packets add fuel to the fire when networks are under congestion. This patch adds a high resolution timer and tp->compressed_ack counter. Instead of sending a SACK, we program this timer with a small delay, based on RTT and capped to 1 ms : delay = min ( 5 % of RTT, 1 ms) If subsequent SACKs need to be sent while the timer has not yet expired, we simply increment tp->compressed_ack. When timer expires, a SACK is sent with the latest information. Whenever an ACK is sent (if data is sent, or if in-order data is received) timer is canceled. Note that tcp_sack_new_ofo_skb() is able to force a SACK to be sent if the sack blocks need to be shuffled, even if the timer has not expired. A new SNMP counter is added in the following patch. Two other patches add sysctls to allow changing the 1,000,000 and 44 values that this commit hard-coded. 
Signed-off-by: Eric Dumazet--- include/linux/tcp.h | 2 ++ include/net/tcp.h | 3 +++ net/ipv4/tcp.c| 1 + net/ipv4/tcp_input.c | 35 +-- net/ipv4/tcp_output.c | 7 +++ net/ipv4/tcp_timer.c | 25 + 6 files changed, 67 insertions(+), 6 deletions(-) diff --git a/include/linux/tcp.h b/include/linux/tcp.h index 807776928cb8610fe97121fbc3c600b08d5d2991..72705eaf4b84060a45bf04d5170f389a18010eac 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -218,6 +218,7 @@ struct tcp_sock { reord:1; /* reordering detected */ } rack; u16 advmss; /* Advertised MSS */ + u8 compressed_ack; u32 chrono_start; /* Start time in jiffies of a TCP chrono */ u32 chrono_stat[3]; /* Time in jiffies for chrono_stat stats */ u8 chrono_type:2, /* current chronograph type */ @@ -297,6 +298,7 @@ struct tcp_sock { u32 sacked_out; /* SACK'd packets */ struct hrtimer pacing_timer; + struct hrtimer compressed_ack_timer; /* from STCP, retrans queue hinting */ struct sk_buff* lost_skb_hint; diff --git a/include/net/tcp.h b/include/net/tcp.h index 511bd0fde1dc1dd842598d083905b0425bcb05f8..952d842a604a3ed79e1bf87a712db20a461c35a9 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -561,6 +561,9 @@ static inline void tcp_clear_xmit_timers(struct sock *sk) if (hrtimer_try_to_cancel(_sk(sk)->pacing_timer) == 1) __sock_put(sk); + if (hrtimer_try_to_cancel(_sk(sk)->compressed_ack_timer) == 1) + __sock_put(sk); + inet_csk_clear_xmit_timers(sk); } diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 62b776f9003798eaf06992a4eb0914d17646aa61..0a2ea0bbf867271db05aedd7d48b193677664321 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -2595,6 +2595,7 @@ int tcp_disconnect(struct sock *sk, int flags) dst_release(sk->sk_rx_dst); sk->sk_rx_dst = NULL; tcp_saved_syn_free(tp); + tp->compressed_ack = 0; /* Clean up fastopen related fields */ tcp_free_fastopen_req(tp); diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index f5622b250665178e44460fa2cd4a11af23dfb23d..cc2ac5346b92b968593f919192d543384865bcb8 
100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -4249,6 +4249,8 @@ static void tcp_sack_new_ofo_skb(struct sock *sk, u32 seq, u32 end_seq) * If the sack array is full, forget about the last one. */ if (this_sack >= TCP_NUM_SACKS) { + if (tp->compressed_ack) + tcp_send_ack(sk); this_sack--; tp->rx_opt.num_sacks--; sp--; @@ -5081,6 +5083,7 @@ static inline void tcp_data_snd_check(struct sock *sk) static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible) { struct tcp_sock *tp = tcp_sk(sk); + unsigned long rtt, delay; /* More than one full frame received... */ if (((tp->rcv_nxt - tp->rcv_wup) > inet_csk(sk)->icsk_ack.rcv_mss && @@ -5092,15 +5095,35 @@ static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible) (tp->rcv_nxt - tp->copied_seq < sk->sk_rcvlowat || __tcp_select_window(sk) >= tp->rcv_wnd)) || /* We ACK each frame or... */ - tcp_in_quickack_mode(sk) || - /* We have out of order data. */ - (ofo_possible && !RB_EMPTY_ROOT(>out_of_order_queue))) { - /* Then ack it now */ + tcp_in_quickack_mode(sk)) { +send_now: tcp_send_ack(sk); - } else { - /* Else, send delayed
[PATCH v3 net-next 5/6] tcp: add tcp_comp_sack_delay_ns sysctl
This per netns sysctl allows for TCP SACK compression fine-tuning.

Its default value is 1,000,000, or 1 ms to meet TSO autosizing period.

Signed-off-by: Eric Dumazet
---
 Documentation/networking/ip-sysctl.txt | 7 +++++++
 include/net/netns/ipv4.h               | 1 +
 net/ipv4/sysctl_net_ipv4.c             | 7 +++++++
 net/ipv4/tcp_input.c                   | 4 ++--
 net/ipv4/tcp_ipv4.c                    | 1 +
 5 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index ea304a23c8d72c92a925d0c107bfd2bcfbbb92ec..7ba952959bca0eee4ecf81fb5837e17790db0fde 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -525,6 +525,13 @@ tcp_rmem - vector of 3 INTEGERs: min, default, max
 tcp_sack - BOOLEAN
 	Enable select acknowledgments (SACKS).

+tcp_comp_sack_delay_ns - LONG INTEGER
+	TCP tries to reduce number of SACK sent, using a timer
+	based on 5% of SRTT, capped by this sysctl, in nano seconds.
+	The default is 1ms, based on TSO autosizing period.
+
+	Default : 1,000,000 ns (1 ms)
+
 tcp_slow_start_after_idle - BOOLEAN
 	If set, provide RFC2861 behavior and time out the congestion
 	window after an idle period. An idle period is defined at
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 8491bc9c86b1553ab603e4363e8e38ca7ff547e0..927318243cfaa2ddd8eb423c6ba6e66253f771d3 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -160,6 +160,7 @@ struct netns_ipv4 {
 	int sysctl_tcp_pacing_ca_ratio;
 	int sysctl_tcp_wmem[3];
 	int sysctl_tcp_rmem[3];
+	unsigned long sysctl_tcp_comp_sack_delay_ns;
 	struct inet_timewait_death_row tcp_death_row;
 	int sysctl_max_syn_backlog;
 	int sysctl_tcp_fastopen;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 4b195bac8ac0eefe0a224528ad854338c4f8e6e3..11fbfdc1566eca95f91360522178295318277588 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -1151,6 +1151,13 @@ static struct ctl_table ipv4_net_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= ,
 	},
+	{
+		.procname	= "tcp_comp_sack_delay_ns",
+		.data		= &init_net.ipv4.sysctl_tcp_comp_sack_delay_ns,
+		.maxlen		= sizeof(unsigned long),
+		.mode		= 0644,
+		.proc_handler	= proc_doulongvec_minmax,
+	},
 	{
 		.procname	= "udp_rmem_min",
 		.data		= &init_net.ipv4.sysctl_udp_rmem_min,
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index cc2ac5346b92b968593f919192d543384865bcb8..6a1dae38c9558c7bc9dd31e9f16c4e8ea8c78149 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5113,13 +5113,13 @@ static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
 	if (hrtimer_is_queued(&tp->compressed_ack_timer))
 		return;

-	/* compress ack timer : 5 % of rtt, but no more than 1 ms */
+	/* compress ack timer : 5 % of rtt, but no more than tcp_comp_sack_delay_ns */

 	rtt = tp->rcv_rtt_est.rtt_us;
 	if (tp->srtt_us && tp->srtt_us < rtt)
 		rtt = tp->srtt_us;

-	delay = min_t(unsigned long, NSEC_PER_MSEC,
+	delay = min_t(unsigned long, sock_net(sk)->ipv4.sysctl_tcp_comp_sack_delay_ns,
 		      rtt * (NSEC_PER_USEC >> 3)/20);
 	sock_hold(sk);
 	hrtimer_start(&tp->compressed_ack_timer, ns_to_ktime(delay),
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index caf23de88f8a369c2038cecd34ce42c522487e90..a3f4647341db2eb5a63c3e9f1e8b93099aedadab 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2572,6 +2572,7 @@ static int __net_init tcp_sk_init(struct net *net)
 		       init_net.ipv4.sysctl_tcp_wmem,
 		       sizeof(init_net.ipv4.sysctl_tcp_wmem));
 	}
+	net->ipv4.sysctl_tcp_comp_sack_delay_ns = NSEC_PER_MSEC;
 	net->ipv4.sysctl_tcp_fastopen = TFO_CLIENT_ENABLE;
 	spin_lock_init(&net->ipv4.tcp_fastopen_ctx_lock);
 	net->ipv4.sysctl_tcp_fastopen_blackhole_timeout = 60 * 60;
-- 
2.17.0.441.gb46fe60e1d-goog
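The delay arithmetic in the tcp_input.c hunk can be checked in isolation. The sketch below is a userspace model, not the kernel code: it assumes, as the `>> 3` in the patch suggests, that both RTT values are kept in units of 1/8 microsecond, so multiplying by `NSEC_PER_USEC >> 3` (125) yields nanoseconds and dividing by 20 takes 5 %. The function name `comp_sack_delay_ns` and its parameters are hypothetical.

```c
#define NSEC_PER_USEC 1000UL
#define NSEC_PER_MSEC 1000000UL

/*
 * rcv_rtt_us8 and srtt_us8 are RTT estimates in 1/8 us units (assumed).
 * cap_ns plays the role of the tcp_comp_sack_delay_ns sysctl.
 * Returns the compression timer delay in nanoseconds.
 */
static unsigned long comp_sack_delay_ns(unsigned long rcv_rtt_us8,
					unsigned long srtt_us8,
					unsigned long cap_ns)
{
	unsigned long rtt = rcv_rtt_us8;
	unsigned long delay;

	/* rtt = min(srtt, rcv_rtt), ignoring an srtt that is not set yet */
	if (srtt_us8 && srtt_us8 < rtt)
		rtt = srtt_us8;

	/* 5 % of the RTT, converted from 1/8 us units to ns */
	delay = rtt * (NSEC_PER_USEC >> 3) / 20;

	return delay < cap_ns ? delay : cap_ns;
}
```

Under these assumptions a 10 ms RTT (80000 in 1/8 us units) gives a 0.5 ms delay, while a 100 ms RTT would give 5 ms and is capped to the 1 ms default.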
[PATCH v3 net-next 6/6] tcp: add tcp_comp_sack_nr sysctl
This per netns sysctl allows for TCP SACK compression fine-tuning.

This limits number of SACK that can be compressed.
Using 0 disables SACK compression.

Signed-off-by: Eric Dumazet
---
 Documentation/networking/ip-sysctl.txt |  6 ++++++
 include/net/netns/ipv4.h               |  1 +
 net/ipv4/sysctl_net_ipv4.c             | 10 ++++++++++
 net/ipv4/tcp_input.c                   |  3 ++-
 net/ipv4/tcp_ipv4.c                    |  1 +
 5 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 7ba952959bca0eee4ecf81fb5837e17790db0fde..924bd51327b7a8dff3503d7afccdd54e1eb5c29b 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -532,6 +532,12 @@ tcp_comp_sack_delay_ns - LONG INTEGER

 	Default : 1,000,000 ns (1 ms)

+tcp_comp_sack_nr - INTEGER
+	Max number of SACK that can be compressed.
+	Using 0 disables SACK compression.
+
+	Default : 44
+
 tcp_slow_start_after_idle - BOOLEAN
 	If set, provide RFC2861 behavior and time out the congestion
 	window after an idle period. An idle period is defined at
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 927318243cfaa2ddd8eb423c6ba6e66253f771d3..661348f23ea5a3a9320b2cafcd17e23960214771 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -160,6 +160,7 @@ struct netns_ipv4 {
 	int sysctl_tcp_pacing_ca_ratio;
 	int sysctl_tcp_wmem[3];
 	int sysctl_tcp_rmem[3];
+	int sysctl_tcp_comp_sack_nr;
 	unsigned long sysctl_tcp_comp_sack_delay_ns;
 	struct inet_timewait_death_row tcp_death_row;
 	int sysctl_max_syn_backlog;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 11fbfdc1566eca95f91360522178295318277588..d2eed3ddcb0a1ad9778d96d46c685f6c60b93d8d 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -46,6 +46,7 @@ static int tcp_syn_retries_min = 1;
 static int tcp_syn_retries_max = MAX_TCP_SYNCNT;
 static int ip_ping_group_range_min[] = { 0, 0 };
 static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX };
+static int comp_sack_nr_max = 255;

 /* obsolete */
 static int sysctl_tcp_low_latency __read_mostly;
@@ -1158,6 +1159,15 @@ static struct ctl_table ipv4_net_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_doulongvec_minmax,
 	},
+	{
+		.procname	= "tcp_comp_sack_nr",
+		.data		= &init_net.ipv4.sysctl_tcp_comp_sack_nr,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= ,
+		.extra2		= &comp_sack_nr_max,
+	},
 	{
 		.procname	= "udp_rmem_min",
 		.data		= &init_net.ipv4.sysctl_udp_rmem_min,
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 6a1dae38c9558c7bc9dd31e9f16c4e8ea8c78149..aebb29ab2fdf2ceaa182cd11928f145a886149ff 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5106,7 +5106,8 @@ static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
 		return;
 	}

-	if (!tcp_is_sack(tp) || tp->compressed_ack >= 44)
+	if (!tcp_is_sack(tp) ||
+	    tp->compressed_ack >= sock_net(sk)->ipv4.sysctl_tcp_comp_sack_nr)
 		goto send_now;
 	tp->compressed_ack++;

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index a3f4647341db2eb5a63c3e9f1e8b93099aedadab..adbdb503db0c983ef4185f83b138aa51bafd15bf 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2573,6 +2573,7 @@ static int __net_init tcp_sk_init(struct net *net)
 		       sizeof(init_net.ipv4.sysctl_tcp_wmem));
 	}
 	net->ipv4.sysctl_tcp_comp_sack_delay_ns = NSEC_PER_MSEC;
+	net->ipv4.sysctl_tcp_comp_sack_nr = 44;
 	net->ipv4.sysctl_tcp_fastopen = TFO_CLIENT_ENABLE;
 	spin_lock_init(&net->ipv4.tcp_fastopen_ctx_lock);
 	net->ipv4.sysctl_tcp_fastopen_blackhole_timeout = 60 * 60;
-- 
2.17.0.441.gb46fe60e1d-goog
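The threshold check in the tcp_input.c hunk, and the reason a value of 0 disables compression, can be seen in a small userspace model. The struct and function names here are hypothetical; only the comparison and the increment mirror the patch.

```c
#include <stdbool.h>

struct comp_state {
	unsigned char compressed_ack;	/* SACKs withheld so far */
};

/*
 * Decide whether an out-of-order arrival must be acknowledged immediately.
 * Returns true for "send now", false when the SACK is compressed (counted
 * and left for the timer). comp_sack_nr models the sysctl value.
 */
static bool must_send_now(struct comp_state *st, bool sack_enabled,
			  int comp_sack_nr)
{
	if (!sack_enabled || st->compressed_ack >= comp_sack_nr)
		return true;

	st->compressed_ack++;
	return false;
}
```

With `comp_sack_nr` set to 0 the comparison `0 >= 0` is true on the very first packet, so nothing is ever compressed. With a limit of 2, two SACKs are withheld and the third forces an immediate ACK.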
[PATCH v3 net-next 4/6] tcp: add TCPAckCompressed SNMP counter
This counter tracks number of ACK packets that the host has not sent,
thanks to ACK compression.

Sample output :

$ nstat -n;sleep 1;nstat|egrep "IpInReceives|IpOutRequests|TcpInSegs|TcpOutSegs|TcpExtTCPAckCompressed"
IpInReceives                    123250             0.0
IpOutRequests                   3684               0.0
TcpInSegs                       123251             0.0
TcpOutSegs                      3684               0.0
TcpExtTCPAckCompressed          119252             0.0

Signed-off-by: Eric Dumazet
Acked-by: Neal Cardwell
---
 include/uapi/linux/snmp.h | 1 +
 net/ipv4/proc.c           | 1 +
 net/ipv4/tcp_output.c     | 2 ++
 3 files changed, 4 insertions(+)

diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index d02e859301ff499dd72a1c0e1b56bed10a9397a6..750d89120335eb489f698191edb6c5110969fa8c 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -278,6 +278,7 @@ enum
 	LINUX_MIB_TCPMTUPSUCCESS,		/* TCPMTUPSuccess */
 	LINUX_MIB_TCPDELIVERED,			/* TCPDelivered */
 	LINUX_MIB_TCPDELIVEREDCE,		/* TCPDeliveredCE */
+	LINUX_MIB_TCPACKCOMPRESSED,		/* TCPAckCompressed */
 	__LINUX_MIB_MAX
 };

diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 261b71d0ccc5c17c6032bf67eb8f842006766e64..6c1ff89a60fa0a3485dcc71fafc799e798d5dc11 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -298,6 +298,7 @@ static const struct snmp_mib snmp4_net_list[] = {
 	SNMP_MIB_ITEM("TCPMTUPSuccess", LINUX_MIB_TCPMTUPSUCCESS),
 	SNMP_MIB_ITEM("TCPDelivered", LINUX_MIB_TCPDELIVERED),
 	SNMP_MIB_ITEM("TCPDeliveredCE", LINUX_MIB_TCPDELIVEREDCE),
+	SNMP_MIB_ITEM("TCPAckCompressed", LINUX_MIB_TCPACKCOMPRESSED),
 	SNMP_MIB_SENTINEL
 };

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 7ee98aad82b758674ca7f3e90bd3fc165e8fcd45..437bb7ceba7fd388abac1c12f2920b02be77bad9 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -165,6 +165,8 @@ static inline void tcp_event_ack_sent(struct sock *sk, unsigned int pkts)
 	struct tcp_sock *tp = tcp_sk(sk);

 	if (unlikely(tp->compressed_ack)) {
+		NET_ADD_STATS(sock_net(sk), LINUX_MIB_TCPACKCOMPRESSED,
+			      tp->compressed_ack);
 		tp->compressed_ack = 0;
 		if (hrtimer_try_to_cancel(&tp->compressed_ack_timer) == 1)
 			__sock_put(sk);
-- 
2.17.0.441.gb46fe60e1d-goog
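The accounting added to tcp_event_ack_sent() amounts to "flush the per-socket count into a global counter whenever an ACK actually goes out". A minimal userspace sketch of that bookkeeping, with hypothetical names standing in for the socket field and the MIB counter:

```c
struct ack_model {
	unsigned char compressed_ack;		/* SACKs withheld since the last ACK */
	unsigned long mib_ack_compressed;	/* stands in for TCPAckCompressed */
};

/* A SACK was withheld: remember it on the socket. */
static void model_ack_compressed(struct ack_model *m)
{
	m->compressed_ack++;
}

/* An ACK was really sent: credit the withheld SACKs and reset the count. */
static void model_ack_sent(struct ack_model *m)
{
	if (m->compressed_ack) {
		m->mib_ack_compressed += m->compressed_ack;
		m->compressed_ack = 0;
	}
}
```

In this model the global counter only moves when an ACK is transmitted, so it reports ACKs that were folded into a later one rather than ACKs still pending.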
[PATCH v3 net-next 0/6] tcp: implement SACK compression
When TCP receives an out-of-order packet, it immediately sends
a SACK packet, generating network load but also forcing the
receiver to send 1-MSS pathological packets, increasing its
RTX queue length/depth, and thus processing time.

Wifi networks suffer from this aggressive behavior, but generally
speaking, all these SACK packets add fuel to the fire when networks
are under congestion.

This patch series adds SACK compression, but the infrastructure
could be leveraged to also compress ACK in the future.

v2: Addressed Neal feedback.
    Added two sysctls to allow fine tuning, or even disabling the feature.

v3: take rtt = min(srtt, rcv_rtt) as Yuchung suggested, because rcv_rtt
    can be over estimated for RPC (or sender limited)

Eric Dumazet (6):
  tcp: use __sock_put() instead of sock_put() in tcp_clear_xmit_timers()
  tcp: do not force quickack when receiving out-of-order packets
  tcp: add SACK compression
  tcp: add TCPAckCompressed SNMP counter
  tcp: add tcp_comp_sack_delay_ns sysctl
  tcp: add tcp_comp_sack_nr sysctl

 Documentation/networking/ip-sysctl.txt | 13
 include/linux/tcp.h                    |  2
 include/net/netns/ipv4.h               |  2
 include/net/tcp.h                      |  5
 include/uapi/linux/snmp.h              |  1
 net/ipv4/proc.c                        |  1
 net/ipv4/sysctl_net_ipv4.c             | 17
 net/ipv4/tcp.c                         |  1
 net/ipv4/tcp_input.c                   | 38
 net/ipv4/tcp_ipv4.c                    |  2
 net/ipv4/tcp_output.c                  |  9
 net/ipv4/tcp_timer.c                   | 25
 12 files changed, 107 insertions(+), 9 deletions(-)

-- 
2.17.0.441.gb46fe60e1d-goog
Re: [PATCH v3] mlx4_core: allocate ICM memory in page size chunks
On 5/17/2018 2:14 PM, Eric Dumazet wrote:
> On 05/17/2018 01:53 PM, Qing Huang wrote:
>> When a system is under memory pressure (high usage with fragments),
>> the original 256KB ICM chunk allocations will likely trigger kernel
>> memory management to enter slow path doing memory compact/migration
>> ops in order to complete high order memory allocations.
>>
>> When that happens, user processes calling uverb APIs may get stuck
>> for more than 120s easily even though there are a lot of free pages
>> in smaller chunks available in the system.
>>
>> Syslog:
>> ...
>> Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
>> oracle_205573_e:205573 blocked for more than 120 seconds.
>> ...
>
> NACK on this patch.
>
> You have been asked repeatedly to use kvmalloc()
>
> This is not a minor suggestion.
>
> Take a look at
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d8c13f2271ec5178c52fbde072ec7b562651ed9d

Would you please take a look at how table->icm is being used in the mlx4
driver? It's a meta data used for individual pointer variable referencing,
not as data frag or in/out buffer. It has no need for contiguous phy. memory.

Thanks.

> And you'll understand some people care about this. Strongly.
>
> Thanks.
Re: [RFC PATCH ghak32 V2 03/13] audit: log container info of syscalls
On 2018-05-17 17:09, Steve Grubb wrote: > On Fri, 16 Mar 2018 05:00:30 -0400 > Richard Guy Briggswrote: > > > Create a new audit record AUDIT_CONTAINER_INFO to document the > > container ID of a process if it is present. > > As mentioned in a previous email, I think AUDIT_CONTAINER is more > suitable for the container record. One more comment below... > > > Called from audit_log_exit(), syscalls are covered. > > > > A sample raw event: > > type=SYSCALL msg=audit(1519924845.499:257): arch=c03e syscall=257 > > success=yes exit=3 a0=ff9c a1=56374e1cef30 a2=241 a3=1b6 items=2 > > ppid=606 pid=635 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 > > sgid=0 fsgid=0 tty=pts0 ses=3 comm="bash" exe="/usr/bin/bash" > > subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 > > key="tmpcontainerid" type=CWD msg=audit(1519924845.499:257): > > cwd="/root" type=PATH msg=audit(1519924845.499:257): item=0 > > name="/tmp/" inode=13863 dev=00:27 mode=041777 ouid=0 ogid=0 > > rdev=00:00 obj=system_u:object_r:tmp_t:s0 nametype= PARENT > > cap_fp= cap_fi= cap_fe=0 cap_fver=0 > > type=PATH msg=audit(1519924845.499:257): item=1 > > name="/tmp/tmpcontainerid" inode=17729 dev=00:27 mode=0100644 ouid=0 > > ogid=0 rdev=00:00 obj=unconfined_u:object_r:user_tmp_t:s0 > > nametype=CREATE cap_fp= cap_fi= > > cap_fe=0 cap_fver=0 type=PROCTITLE msg=audit(1519924845.499:257): > > proctitle=62617368002D6300736C65657020313B206563686F2074657374203E202F746D702F746D70636F6E7461696E65726964 > > type=CONTAINER_INFO msg=audit(1519924845.499:257): op=task > > contid=123458 > > > > See: https://github.com/linux-audit/audit-kernel/issues/32 > > Signed-off-by: Richard Guy Briggs > > --- > > include/linux/audit.h | 5 + > > include/uapi/linux/audit.h | 1 + > > kernel/audit.c | 20 > > kernel/auditsc.c | 2 ++ > > 4 files changed, 28 insertions(+) > > > > diff --git a/include/linux/audit.h b/include/linux/audit.h > > index fe4ba3f..3acbe9d 100644 > > --- a/include/linux/audit.h > > +++ b/include/linux/audit.h > > 
@@ -154,6 +154,8 @@ extern void > > audit_log_link_denied(const char *operation, extern int > > audit_log_task_context(struct audit_buffer *ab); extern void > > audit_log_task_info(struct audit_buffer *ab, struct task_struct *tsk); > > +extern int audit_log_container_info(struct task_struct *tsk, > > +struct audit_context *context); > > > > extern int audit_update_lsm_rules(void); > > > > @@ -205,6 +207,9 @@ static inline int audit_log_task_context(struct > > audit_buffer *ab) static inline void audit_log_task_info(struct > > audit_buffer *ab, struct task_struct *tsk) > > { } > > +static inline int audit_log_container_info(struct task_struct *tsk, > > + struct audit_context > > *context); +{ } > > #define audit_enabled 0 > > #endif /* CONFIG_AUDIT */ > > > > diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h > > index 921a71f..e83ccbd 100644 > > --- a/include/uapi/linux/audit.h > > +++ b/include/uapi/linux/audit.h > > @@ -115,6 +115,7 @@ > > #define AUDIT_REPLACE 1329/* Replace auditd > > if this packet unanswerd */ #define AUDIT_KERN_MODULE > > 1330/* Kernel Module events */ #define > > AUDIT_FANOTIFY 1331/* Fanotify access decision > > */ +#define AUDIT_CONTAINER_INFO1332/* Container ID > > information */ #define AUDIT_AVC1400/* SE > > Linux avc denial or grant */ #define AUDIT_SELINUX_ERR > > 1401/* Internal SE Linux Errors */ diff --git > > a/kernel/audit.c b/kernel/audit.c index 3f2f143..a12f21f 100644 > > --- a/kernel/audit.c > > +++ b/kernel/audit.c > > @@ -2049,6 +2049,26 @@ void audit_log_session_info(struct > > audit_buffer *ab) audit_log_format(ab, " auid=%u ses=%u", auid, > > sessionid); } > > > > +/* > > + * audit_log_container_info - report container info > > + * @tsk: task to be recorded > > + * @context: task or local context for record > > + */ > > +int audit_log_container_info(struct task_struct *tsk, struct > > audit_context *context) +{ > > + struct audit_buffer *ab; > > + > > + if (!audit_containerid_set(tsk)) > > + return 0; 
> > + /* Generate AUDIT_CONTAINER_INFO with container ID */ > > + ab = audit_log_start(context, GFP_KERNEL, > > AUDIT_CONTAINER_INFO); > > + if (!ab) > > + return -ENOMEM; > > + audit_log_format(ab, "contid=%llu", > > audit_get_containerid(tsk)); > > + audit_log_end(ab); > > + return 0; > > +} > > + > > void audit_log_key(struct audit_buffer *ab, char *key) > > { > > audit_log_format(ab, " key="); > > diff --git a/kernel/auditsc.c b/kernel/auditsc.c > > index a6b0a52..65be110 100644 > > --- a/kernel/auditsc.c > > +++ b/kernel/auditsc.c > > @@ -1453,6 +1453,8 @@ static void audit_log_exit(struct
Re: [RFC PATCH bpf-next 12/12] i40e: implement Tx zero-copy
On Tue, 15 May 2018 21:06:15 +0200 Björn Töpel wrote:

> From: Magnus Karlsson
>
> Here, the zero-copy ndo is implemented. As a shortcut, the existing
> XDP Tx rings are used for zero-copy. This means that an XDP program
> cannot redirect to an AF_XDP enabled XDP Tx ring.

This "shortcut" is not acceptable, and completely broken. The
XDP_REDIRECT queue_index is based on smp_processor_id(), and can easily
clash with the configured XSK queue_index. Provided a bit more code
context below...

On Tue, 15 May 2018 21:06:15 +0200 Björn Töpel wrote:

int i40e_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf)
{
	struct i40e_netdev_priv *np = netdev_priv(dev);
	unsigned int queue_index = smp_processor_id();
	struct i40e_vsi *vsi = np->vsi;
	int err;

	if (test_bit(__I40E_VSI_DOWN, vsi->state))
		return -ENETDOWN;

> @@ -4025,6 +4158,9 @@ int i40e_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf)
> 	if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
> 		return -ENXIO;
>
> +	if (vsi->xdp_rings[queue_index]->xsk_umem)
> +		return -ENXIO;
> +

Using the same errno makes this impossible to debug (via the tracepoints).

> 	err = i40e_xmit_xdp_ring(xdpf, vsi->xdp_rings[queue_index]);
> 	if (err != I40E_XDP_TX)
> 		return -ENOSPC;
> @@ -4048,5 +4184,34 @@ void i40e_xdp_flush(struct net_device *dev)
> 	if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
> 		return;
>
> +	if (vsi->xdp_rings[queue_index]->xsk_umem)
> +		return;
> +
> 	i40e_xdp_ring_update_tail(vsi->xdp_rings[queue_index]);
> }

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH bpf] bpf: fix truncated jump targets on heavy expansions
On Thu, May 17, 2018 at 01:44:11AM +0200, Daniel Borkmann wrote: > Recently during testing, I ran into the following panic: > > [ 207.892422] Internal error: Accessing user space memory outside > uaccess.h routines: 9604 [#1] SMP > [ 207.901637] Modules linked in: binfmt_misc [...] > [ 207.966530] CPU: 45 PID: 2256 Comm: test_verifier Tainted: GW > 4.17.0-rc3+ #7 > [ 207.974956] Hardware name: FOXCONN R2-1221R-A4/C2U4N_MB, BIOS G31FB18A > 03/31/2017 > [ 207.982428] pstate: 6045 (nZCv daif +PAN -UAO) > [ 207.987214] pc : bpf_skb_load_helper_8_no_cache+0x34/0xc0 > [ 207.992603] lr : 0x00bdb754 > [ 207.996080] sp : 13703ca0 > [ 207.999384] x29: 13703ca0 x28: 0001 > [ 208.004688] x27: 0001 x26: > [ 208.009992] x25: 13703ce0 x24: 800fb4afcb00 > [ 208.015295] x23: 7d2f5038 x22: 7d2f5000 > [ 208.020599] x21: feff2a6f x20: 000a > [ 208.025903] x19: 09578000 x18: 0a03 > [ 208.031206] x17: x16: > [ 208.036510] x15: 9de83000 x14: > [ 208.041813] x13: x12: > [ 208.047116] x11: 0001 x10: 089e7f18 > [ 208.052419] x9 : feff2a6f x8 : > [ 208.057723] x7 : 000a x6 : 00280c616000 > [ 208.063026] x5 : 0018 x4 : 7db6 > [ 208.068329] x3 : 0008647a x2 : 19868179b1484500 > [ 208.073632] x1 : x0 : 09578c08 > [ 208.078938] Process test_verifier (pid: 2256, stack limit = > 0x49ca7974) > [ 208.086235] Call trace: > [ 208.088672] bpf_skb_load_helper_8_no_cache+0x34/0xc0 > [ 208.093713] 0x00bdb754 > [ 208.096845] bpf_test_run+0x78/0xf8 > [ 208.100324] bpf_prog_test_run_skb+0x148/0x230 > [ 208.104758] sys_bpf+0x314/0x1198 > [ 208.108064] el0_svc_naked+0x30/0x34 > [ 208.111632] Code: 91302260 f941 f9001fa1 d281 (29500680) > [ 208.117717] ---[ end trace 263cb8a59b5bf29f ]--- > > The program itself which caused this had a long jump over the whole > instruction sequence where all of the inner instructions required > heavy expansions into multiple BPF instructions. Additionally, I also > had BPF hardening enabled which requires once more rewrites of all > constant values in order to blind them. 
> Each time we rewrite insns,
> bpf_adj_branches() would need to potentially adjust branch targets
> which cross the patchlet boundary to accommodate for the additional
> delta. Eventually that led to the case where the target offset could
> not fit into insn->off's upper 0x7fff limit anymore where then offset
> wraps around becoming negative (in s16 universe), or vice versa
> depending on the jump direction.
>
> Therefore it becomes necessary to detect and reject any such occasions
> in a generic way for native eBPF and cBPF to eBPF migrations. For
> the latter we can simply check bounds in the bpf_convert_filter()'s
> BPF_EMIT_JMP helper macro and bail out once we surpass limits. The
> bpf_patch_insn_single() for native eBPF (and cBPF to eBPF in case
> of subsequent hardening) is a bit more complex in that we need to
> detect such truncations before hitting the bpf_prog_realloc(). Thus
> the latter is split into an extra pass to probe problematic offsets
> on the original program in order to fail early. With that in place
> and carefully tested I no longer hit the panic and the rewrites are
> rejected properly. The above example panic I've seen on bpf-next,
> though the issue itself is generic in that a guard against this issue
> in bpf seems more appropriate in this case.
>
> Signed-off-by: Daniel Borkmann

Acked-by: Martin KaFai Lau
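The truncation described above comes down to a signed 16-bit range check done before the adjusted offset is stored. A minimal user-space sketch, assuming a hypothetical helper demo_adjust_jump() (this is not the patch's code, which probes the whole program in an extra pass):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: apply 'delta' to a 16-bit signed jump offset.
 * The sum is computed in 32 bits so the range check sees the real value
 * instead of an already wrapped one. On failure *off is left untouched,
 * so the caller can reject the rewrite as a whole.
 */
static bool demo_adjust_jump(int16_t *off, int32_t delta)
{
	int32_t new_off = (int32_t)*off + delta;

	if (new_off < INT16_MIN || new_off > INT16_MAX)
		return false;	/* would wrap around in s16 */
	*off = (int16_t)new_off;
	return true;
}
```

Checking before writing is the point: once the wrapped value is stored, a forward jump silently turns into a backward one and nothing later can tell the difference.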
Re: [PATCH net 0/7] net: ip6_gre: Fixes in headroom handling
From: Petr Machata
Date: Fri, 18 May 2018 00:03:58 +0300

> David Miller writes:
>
>> Series applied, thank you.
>
> Hi David, I forgot to add Fixes lines to the individual patches. I
> replied to the e-mails with those. Let me know if you want me to send a
> v2 with that and the Acked-by's.

When something is already in my tree, it can't be changed as it is
committed to the permanent record of my GIT tree and I cannot rebase
since so many people clone my tree.

Luckily for you, your Fixes: tags went out before I pushed, so I could
actually fix up the commit messages and add the tags.

>> Those reproducible test cases in the various commit messages are
>> pretty awesome. Could you please extract them and put them somewhere
>> appropriate under selftests?
>
> The ip6gretap one is covered by the mirror_gre test if you run it
> with veth devices instead of HW ports, but I can make it self-contained
> if you think that would be better.
>
> I'll add the erspan one.

Thank you.
[bpf-next PATCH v2 2/2] bpf: add sk_msg prog sk access tests to test_verifier
Add tests for BPF_PROG_TYPE_SK_MSG to test_verifier for read access to new sk fields. Signed-off-by: John FastabendAcked-by: Martin KaFai Lau --- tools/include/uapi/linux/bpf.h |8 ++ tools/testing/selftests/bpf/test_verifier.c | 115 +++ 2 files changed, 123 insertions(+) diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index d94d333..97446bb 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -2176,6 +2176,14 @@ enum sk_action { struct sk_msg_md { void *data; void *data_end; + + __u32 family; + __u32 remote_ip4; /* Stored in network byte order */ + __u32 local_ip4;/* Stored in network byte order */ + __u32 remote_ip6[4];/* Stored in network byte order */ + __u32 local_ip6[4]; /* Stored in network byte order */ + __u32 remote_port; /* Stored in network byte order */ + __u32 local_port; /* stored in host byte order */ }; #define BPF_TAG_SIZE 8 diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c index a877af0..6ec4d9d 100644 --- a/tools/testing/selftests/bpf/test_verifier.c +++ b/tools/testing/selftests/bpf/test_verifier.c @@ -1686,6 +1686,121 @@ static void bpf_fill_rand_ld_dw(struct bpf_test *self) .prog_type = BPF_PROG_TYPE_SK_SKB, }, { + "valid access family in SK_MSG", + .insns = { + BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1, + offsetof(struct sk_msg_md, family)), + BPF_EXIT_INSN(), + }, + .result = ACCEPT, + .prog_type = BPF_PROG_TYPE_SK_MSG, + }, + { + "valid access remote_ip4 in SK_MSG", + .insns = { + BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1, + offsetof(struct sk_msg_md, remote_ip4)), + BPF_EXIT_INSN(), + }, + .result = ACCEPT, + .prog_type = BPF_PROG_TYPE_SK_MSG, + }, + { + "valid access local_ip4 in SK_MSG", + .insns = { + BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1, + offsetof(struct sk_msg_md, local_ip4)), + BPF_EXIT_INSN(), + }, + .result = ACCEPT, + .prog_type = BPF_PROG_TYPE_SK_MSG, + }, + { + "valid access remote_port in SK_MSG", + .insns = { + 
BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1, + offsetof(struct sk_msg_md, remote_port)), + BPF_EXIT_INSN(), + }, + .result = ACCEPT, + .prog_type = BPF_PROG_TYPE_SK_MSG, + }, + { + "valid access local_port in SK_MSG", + .insns = { + BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1, + offsetof(struct sk_msg_md, local_port)), + BPF_EXIT_INSN(), + }, + .result = ACCEPT, + .prog_type = BPF_PROG_TYPE_SK_MSG, + }, + { + "valid access remote_ip6 in SK_MSG", + .insns = { + BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1, + offsetof(struct sk_msg_md, remote_ip6[0])), + BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1, + offsetof(struct sk_msg_md, remote_ip6[1])), + BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1, + offsetof(struct sk_msg_md, remote_ip6[2])), + BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1, + offsetof(struct sk_msg_md, remote_ip6[3])), + BPF_EXIT_INSN(), + }, + .result = ACCEPT, + .prog_type = BPF_PROG_TYPE_SK_SKB, + }, + { + "valid access local_ip6 in SK_MSG", + .insns = { + BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1, + offsetof(struct sk_msg_md, local_ip6[0])), + BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1, + offsetof(struct sk_msg_md, local_ip6[1])), + BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1, + offsetof(struct sk_msg_md, local_ip6[2])), + BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1, + offsetof(struct sk_msg_md, local_ip6[3])), + BPF_EXIT_INSN(), + }, + .result = ACCEPT, +
[bpf-next PATCH v2 1/2] bpf: allow sk_msg programs to read sock fields
Currently sk_msg programs only have access to the raw data. However, it is often useful when building policies to have the policies specific to the socket endpoint. This allows using the socket tuple as input into filters, etc. This patch adds ctx access to the sock fields. Signed-off-by: John FastabendAcked-by: Martin KaFai Lau --- include/linux/filter.h |1 include/uapi/linux/bpf.h |8 +++ kernel/bpf/sockmap.c |1 net/core/filter.c| 114 +- 4 files changed, 121 insertions(+), 3 deletions(-) diff --git a/include/linux/filter.h b/include/linux/filter.h index 9dbcb9d..d358d18 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -517,6 +517,7 @@ struct sk_msg_buff { bool sg_copy[MAX_SKB_FRAGS]; __u32 flags; struct sock *sk_redir; + struct sock *sk; struct sk_buff *skb; struct list_head list; }; diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index d94d333..97446bb 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -2176,6 +2176,14 @@ enum sk_action { struct sk_msg_md { void *data; void *data_end; + + __u32 family; + __u32 remote_ip4; /* Stored in network byte order */ + __u32 local_ip4;/* Stored in network byte order */ + __u32 remote_ip6[4];/* Stored in network byte order */ + __u32 local_ip6[4]; /* Stored in network byte order */ + __u32 remote_port; /* Stored in network byte order */ + __u32 local_port; /* stored in host byte order */ }; #define BPF_TAG_SIZE 8 diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c index c6de139..0ebf157 100644 --- a/kernel/bpf/sockmap.c +++ b/kernel/bpf/sockmap.c @@ -523,6 +523,7 @@ static unsigned int smap_do_tx_msg(struct sock *sk, } bpf_compute_data_pointers_sg(md); + md->sk = sk; rc = (*prog->bpf_func)(md, prog->insnsi); psock->apply_bytes = md->apply_bytes; diff --git a/net/core/filter.c b/net/core/filter.c index 6d0d156..aec5eba 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -5148,18 +5148,23 @@ static bool sk_msg_is_valid_access(int off, int size, switch (off) { 
case offsetof(struct sk_msg_md, data): info->reg_type = PTR_TO_PACKET; + if (size != sizeof(__u64)) + return false; break; case offsetof(struct sk_msg_md, data_end): info->reg_type = PTR_TO_PACKET_END; + if (size != sizeof(__u64)) + return false; break; + default: + if (size != sizeof(__u32)) + return false; } if (off < 0 || off >= sizeof(struct sk_msg_md)) return false; if (off % size != 0) return false; - if (size != sizeof(__u64)) - return false; return true; } @@ -5835,7 +5840,8 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type, break; case offsetof(struct bpf_sock_ops, local_ip4): - BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_rcv_saddr) != 4); + BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, + skc_rcv_saddr) != 4); *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF( struct bpf_sock_ops_kern, sk), @@ -6152,6 +6158,7 @@ static u32 sk_msg_convert_ctx_access(enum bpf_access_type type, struct bpf_prog *prog, u32 *target_size) { struct bpf_insn *insn = insn_buf; + int off; switch (si->off) { case offsetof(struct sk_msg_md, data): @@ -6164,6 +6171,107 @@ static u32 sk_msg_convert_ctx_access(enum bpf_access_type type, si->dst_reg, si->src_reg, offsetof(struct sk_msg_buff, data_end)); break; + case offsetof(struct sk_msg_md, family): + BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_family) != 2); + + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF( + struct sk_msg_buff, sk), + si->dst_reg, si->src_reg, + offsetof(struct sk_msg_buff, sk)); + *insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->dst_reg, + offsetof(struct sock_common, skc_family)); + break; + + case offsetof(struct sk_msg_md, remote_ip4): + BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_daddr) != 4); + + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF( + struct sk_msg_buff, sk), +
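As a rough illustration of the size rules in the sk_msg_is_valid_access() hunk above (8-byte loads for data and data_end, 4-byte loads for everything else), here is a self-contained user-space sketch. struct demo_msg_md is a hypothetical mirror of struct sk_msg_md that uses uint64_t in place of the pointers so its layout is the same on any host, and demo_is_valid_access() is not the kernel function. The bounds check is done first here, which differs from the order in the hunk.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of struct sk_msg_md with fixed-width fields. */
struct demo_msg_md {
	uint64_t data;
	uint64_t data_end;
	uint32_t family;
	uint32_t remote_ip4;
	uint32_t local_ip4;
	uint32_t remote_ip6[4];
	uint32_t local_ip6[4];
	uint32_t remote_port;
	uint32_t local_port;
};

/* Sketch of the access rule: pointer fields need 8-byte loads, the rest 4. */
static bool demo_is_valid_access(int off, int size)
{
	if (off < 0 || size <= 0 ||
	    (size_t)off >= sizeof(struct demo_msg_md))
		return false;
	if (off % size != 0)
		return false;	/* misaligned load */

	if ((size_t)off == offsetof(struct demo_msg_md, data) ||
	    (size_t)off == offsetof(struct demo_msg_md, data_end))
		return size == (int)sizeof(uint64_t);

	return size == (int)sizeof(uint32_t);
}
```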
[bpf-next PATCH v2 0/2] SK_MSG programs: read sock fields
In this series we add the ability for sk msg programs to read basic
sock information about the sock they are attached to. The second
patch adds the tests to the selftest test_verifier.

One observation that I had from writing this series is lots of the
./net/core/filter.c code is almost duplicated across program types.
I thought about building a template/macro that we could use as a
single block of code to read sock data out for multiple programs,
but I wasn't convinced it was worth it yet. The result was using a
macro saved a couple lines of code per block but made the code
a bit harder to read IMO. We can probably revisit the idea later
if we get more duplication.

v2: add errstr field to negative test_verifier test cases to ensure
    we get the expected err string back from the verifier.

---

John Fastabend (2):
      bpf: allow sk_msg programs to read sock fields
      bpf: add sk_msg prog sk access tests to test_verifier

 include/linux/filter.h                      |    1
 include/uapi/linux/bpf.h                    |    8 ++
 kernel/bpf/sockmap.c                        |    1
 net/core/filter.c                           |  114 ++-
 tools/include/uapi/linux/bpf.h              |    8 ++
 tools/testing/selftests/bpf/test_verifier.c |  115 +++
 6 files changed, 244 insertions(+), 3 deletions(-)

--
Signature
Re: [PATCH v3] mlx4_core: allocate ICM memory in page size chunks
On 05/17/2018 01:53 PM, Qing Huang wrote:
> When a system is under memory pressure (high usage with fragments),
> the original 256KB ICM chunk allocations will likely trigger kernel
> memory management to enter slow path doing memory compact/migration
> ops in order to complete high order memory allocations.
>
> When that happens, user processes calling uverb APIs may get stuck
> for more than 120s easily even though there are a lot of free pages
> in smaller chunks available in the system.
>
> Syslog:
> ...
> Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
> oracle_205573_e:205573 blocked for more than 120 seconds.
> ...
>

NACK on this patch.

You have been asked repeatedly to use kvmalloc()

This is not a minor suggestion.

Take a look at
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d8c13f2271ec5178c52fbde072ec7b562651ed9d

And you'll understand some people care about this.

Strongly.

Thanks.
Re: [PATCH net-next v3 0/3] net: Allow more drivers with COMPILE_TEST
From: Florian Fainelli
Date: Thu, 17 May 2018 13:07:42 -0700

> Hi David,
>
> This patch series includes more drivers to be build tested with COMPILE_TEST
> enabled. This helps cover some of the issues I just ran into with missing
> a driver *sigh*.
>
> Changes in v3:
>
> - drop the TI Keystone NETCP driver from the COMPILE_TEST additions
>
> Changes in v2:
>
> - allow FEC to build outside of CONFIG_ARM/ARM64 by defining a layout of
>   registers, this is not meant to run, so this is not a real issue if we
>   are not matching the correct register layout

Ok, series applied.

Just some printf format string warnings to clear up on 64-bit in TI driver
files davinci_cpdma.c, cpsw.c, and cpts.c.

In file included from ./arch/x86/include/asm/bug.h:83:0,
                 from ./include/linux/bug.h:5,
                 from ./include/linux/thread_info.h:12,
                 from ./arch/x86/include/asm/preempt.h:7,
                 from ./include/linux/preempt.h:81,
                 from ./include/linux/spinlock.h:51,
                 from drivers/net/ethernet/ti/davinci_cpdma.c:16:
drivers/net/ethernet/ti/davinci_cpdma.c: In function ‘cpdma_desc_pool_destroy’:
drivers/net/ethernet/ti/davinci_cpdma.c:194:7: warning: format ‘%d’ expects argument of type ‘int’, but argument 2 has type ‘size_t {aka long unsigned int}’ [-Wformat=]
       "cpdma_desc_pool size %d != avail %d",
       ^
       gen_pool_size(pool->gen_pool),
       ~
./include/asm-generic/bug.h:98:50: note: in definition of macro ‘__WARN_printf’
 #define __WARN_printf(arg...) do { __warn_printk(arg); __WARN(); } while (0)
                                                  ^~~
drivers/net/ethernet/ti/davinci_cpdma.c:193:2: note: in expansion of macro ‘WARN’
  WARN(gen_pool_size(pool->gen_pool) != gen_pool_avail(pool->gen_pool),
  ^~~~
drivers/net/ethernet/ti/davinci_cpdma.c:194:7: warning: format ‘%d’ expects argument of type ‘int’, but argument 3 has type ‘size_t {aka long unsigned int}’ [-Wformat=]
       "cpdma_desc_pool size %d != avail %d",
       ^
drivers/net/ethernet/ti/davinci_cpdma.c:196:7:
       gen_pool_avail(pool->gen_pool));
       ~~
./include/asm-generic/bug.h:98:50: note: in definition of macro ‘__WARN_printf’
 #define __WARN_printf(arg...) do { __warn_printk(arg); __WARN(); } while (0)
                                                  ^~~
drivers/net/ethernet/ti/davinci_cpdma.c:193:2: note: in expansion of macro ‘WARN’
  WARN(gen_pool_size(pool->gen_pool) != gen_pool_avail(pool->gen_pool),
  ^~~~
In file included from ./arch/x86/include/asm/realmode.h:15:0,
                 from ./arch/x86/include/asm/acpi.h:33,
                 from ./arch/x86/include/asm/fixmap.h:19,
                 from ./arch/x86/include/asm/apic.h:10,
                 from ./arch/x86/include/asm/smp.h:13,
                 from ./arch/x86/include/asm/mmzone_64.h:11,
                 from ./arch/x86/include/asm/mmzone.h:5,
                 from ./include/linux/mmzone.h:911,
                 from ./include/linux/gfp.h:6,
                 from ./include/linux/idr.h:16,
                 from ./include/linux/kernfs.h:14,
                 from ./include/linux/sysfs.h:16,
                 from ./include/linux/kobject.h:20,
                 from ./include/linux/device.h:16,
                 from drivers/net/ethernet/ti/davinci_cpdma.c:17:
drivers/net/ethernet/ti/davinci_cpdma.c: In function ‘cpdma_chan_submit’:
drivers/net/ethernet/ti/davinci_cpdma.c:1083:17: warning: passing argument 1 of ‘__writel’ makes integer from pointer without a cast [-Wint-conversion]
  writel_relaxed(token, &desc->sw_token);
                 ^
./arch/x86/include/asm/io.h:88:39: note: in definition of macro ‘writel_relaxed’
 #define writel_relaxed(v, a) __writel(v, a)
                                       ^
./arch/x86/include/asm/io.h:71:18: note: expected ‘unsigned int’ but argument is of type ‘void *’
 build_mmio_write(__writel, "l", unsigned int, "r", )
                  ^
./arch/x86/include/asm/io.h:53:20: note: in definition of macro ‘build_mmio_write’
 static inline void name(type val, volatile void __iomem *addr) \
                    ^~~~
drivers/net/ethernet/ti/davinci_cpdma.c: In function ‘__cpdma_chan_free’:
drivers/net/ethernet/ti/davinci_cpdma.c:1126:15: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
  token      = (void *)desc_read(desc, sw_token);
               ^
In file included from ./include/linux/kernel.h:14:0,
                 from ./include/linux/uio.h:12,
                 from ./include/linux/socket.h:8,
                 from ./include/uapi/linux/if.h:25,
                 from drivers/net/ethernet/ti/cpts.c:21:
drivers/net/ethernet/ti/cpts.c: In function ‘cpts_overflow_check’:
drivers/net/ethernet/ti/cpts.c:297:11: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 3 has type ‘__kernel_time_t {aka long int}’ [-Wformat=]
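The -Wformat= warnings above are the usual 64-bit mismatch between '%d' and a size_t argument. One common way to silence that class of warning is the 'z' length modifier. The sketch below is user-space only and uses a hypothetical helper, demo_format_pool(); it is not the fix that was applied to davinci_cpdma.c.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: format two size_t values with %zu, which matches
 * size_t on both 32-bit and 64-bit targets. Returns what snprintf()
 * returns, i.e. the length the full message needs.
 */
static int demo_format_pool(char *buf, size_t len, size_t size, size_t avail)
{
	return snprintf(buf, len, "cpdma_desc_pool size %zu != avail %zu",
			size, avail);
}
```

Casting the arguments to int would also quiet the compiler, but it would truncate large values; matching the format to the type avoids that.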
Re: [PATCH 0/2] bpf: sockmap, fix uninitialized variable and double-free
Hi Daniel,

On 05/17/2018 03:51 PM, Daniel Borkmann wrote:
> On 05/17/2018 04:04 PM, Gustavo A. R. Silva wrote:
>> This patchset aims to fix an uninitialized variable issue and
>> a double-free issue in __sock_map_ctx_update_elem.
>>
>> Both issues were reported by Coverity.
>>
>> Thanks.
>>
>> Gustavo A. R. Silva (2):
>>   bpf: sockmap, fix uninitialized variable
>>   bpf: sockmap, fix double-free
>>
>>  kernel/bpf/sockmap.c | 3 +--
>>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> Applied to bpf-next, thanks Gustavo!

Glad to help. :)

> P.s.: Please indicate that next time in the email subject via '[PATCH bpf-next]'.

OK. Will do that.

Thanks
--
Gustavo
Re: [RFC PATCH ghak32 V2 03/13] audit: log container info of syscalls
On Fri, 16 Mar 2018 05:00:30 -0400 Richard Guy Briggswrote: > Create a new audit record AUDIT_CONTAINER_INFO to document the > container ID of a process if it is present. As mentioned in a previous email, I think AUDIT_CONTAINER is more suitable for the container record. One more comment below... > Called from audit_log_exit(), syscalls are covered. > > A sample raw event: > type=SYSCALL msg=audit(1519924845.499:257): arch=c03e syscall=257 > success=yes exit=3 a0=ff9c a1=56374e1cef30 a2=241 a3=1b6 items=2 > ppid=606 pid=635 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 > sgid=0 fsgid=0 tty=pts0 ses=3 comm="bash" exe="/usr/bin/bash" > subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 > key="tmpcontainerid" type=CWD msg=audit(1519924845.499:257): > cwd="/root" type=PATH msg=audit(1519924845.499:257): item=0 > name="/tmp/" inode=13863 dev=00:27 mode=041777 ouid=0 ogid=0 > rdev=00:00 obj=system_u:object_r:tmp_t:s0 nametype= PARENT > cap_fp= cap_fi= cap_fe=0 cap_fver=0 > type=PATH msg=audit(1519924845.499:257): item=1 > name="/tmp/tmpcontainerid" inode=17729 dev=00:27 mode=0100644 ouid=0 > ogid=0 rdev=00:00 obj=unconfined_u:object_r:user_tmp_t:s0 > nametype=CREATE cap_fp= cap_fi= > cap_fe=0 cap_fver=0 type=PROCTITLE msg=audit(1519924845.499:257): > proctitle=62617368002D6300736C65657020313B206563686F2074657374203E202F746D702F746D70636F6E7461696E65726964 > type=CONTAINER_INFO msg=audit(1519924845.499:257): op=task > contid=123458 > > See: https://github.com/linux-audit/audit-kernel/issues/32 > Signed-off-by: Richard Guy Briggs > --- > include/linux/audit.h | 5 + > include/uapi/linux/audit.h | 1 + > kernel/audit.c | 20 > kernel/auditsc.c | 2 ++ > 4 files changed, 28 insertions(+) > > diff --git a/include/linux/audit.h b/include/linux/audit.h > index fe4ba3f..3acbe9d 100644 > --- a/include/linux/audit.h > +++ b/include/linux/audit.h > @@ -154,6 +154,8 @@ extern void > audit_log_link_denied(const char *operation, extern int > audit_log_task_context(struct 
audit_buffer *ab); extern void > audit_log_task_info(struct audit_buffer *ab, struct task_struct *tsk); > +extern int audit_log_container_info(struct task_struct *tsk, > + struct audit_context *context); > > extern int audit_update_lsm_rules(void); > > @@ -205,6 +207,9 @@ static inline int audit_log_task_context(struct > audit_buffer *ab) static inline void audit_log_task_info(struct > audit_buffer *ab, struct task_struct *tsk) > { } > +static inline int audit_log_container_info(struct task_struct *tsk, > + struct audit_context > *context); +{ } > #define audit_enabled 0 > #endif /* CONFIG_AUDIT */ > > diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h > index 921a71f..e83ccbd 100644 > --- a/include/uapi/linux/audit.h > +++ b/include/uapi/linux/audit.h > @@ -115,6 +115,7 @@ > #define AUDIT_REPLACE1329/* Replace auditd > if this packet unanswerd */ #define AUDIT_KERN_MODULE > 1330 /* Kernel Module events */ #define > AUDIT_FANOTIFY1331/* Fanotify access decision > */ +#define AUDIT_CONTAINER_INFO 1332/* Container ID > information */ #define AUDIT_AVC 1400/* SE > Linux avc denial or grant */ #define AUDIT_SELINUX_ERR > 1401 /* Internal SE Linux Errors */ diff --git > a/kernel/audit.c b/kernel/audit.c index 3f2f143..a12f21f 100644 > --- a/kernel/audit.c > +++ b/kernel/audit.c > @@ -2049,6 +2049,26 @@ void audit_log_session_info(struct > audit_buffer *ab) audit_log_format(ab, " auid=%u ses=%u", auid, > sessionid); } > > +/* > + * audit_log_container_info - report container info > + * @tsk: task to be recorded > + * @context: task or local context for record > + */ > +int audit_log_container_info(struct task_struct *tsk, struct > audit_context *context) +{ > + struct audit_buffer *ab; > + > + if (!audit_containerid_set(tsk)) > + return 0; > + /* Generate AUDIT_CONTAINER_INFO with container ID */ > + ab = audit_log_start(context, GFP_KERNEL, > AUDIT_CONTAINER_INFO); > + if (!ab) > + return -ENOMEM; > + audit_log_format(ab, "contid=%llu", > 
audit_get_containerid(tsk)); > + audit_log_end(ab); > + return 0; > +} > + > void audit_log_key(struct audit_buffer *ab, char *key) > { > audit_log_format(ab, " key="); > diff --git a/kernel/auditsc.c b/kernel/auditsc.c > index a6b0a52..65be110 100644 > --- a/kernel/auditsc.c > +++ b/kernel/auditsc.c > @@ -1453,6 +1453,8 @@ static void audit_log_exit(struct audit_context > *context, struct task_struct *ts > audit_log_proctitle(tsk, context); > > + audit_log_container_info(tsk, context); Would there be any problem moving audit_log_container_info before audit_log_proctitle? There are some
Re: [PATCH net-next] vlan: Add extack messages for link create
From: David Ahern
Date: Thu, 17 May 2018 12:29:47 -0700

> Add informative messages for error paths related to adding a
> VLAN to a device.
>
> Signed-off-by: David Ahern

Applied, thanks David.
Re: [patch net-next RFC 04/12] dsa: set devlink port attrs for dsa ports
On Thu, May 17, 2018 at 10:48:55PM +0200, Jiri Pirko wrote:
> Thu, May 17, 2018 at 09:14:32PM CEST, f.faine...@gmail.com wrote:
> >On 05/17/2018 10:39 AM, Jiri Pirko wrote:
> >>>> That is compiled inside "fixed_phy", isn't it?
> >>>
> >>> It matches what CONFIG_FIXED_PHY is, so if it's built-in it also becomes
> >>> built-in, if it is modular, it is also modular, this was fixed with
> >>> 40013ff20b1beed31184935fc0aea6a859d4d4ef ("net: dsa: Fix functional
> >>> dsa-loop dependency on FIXED_PHY")
> >>
> >> Now I have it compiled as module, and after modprobe dsa_loop I see:
> >> [ 1168.129202] libphy: Fixed MDIO Bus: probed
> >> [ 1168.222716] dsa-loop fixed-0:1f: DSA mockup driver: 0x1f
> >>
> >> These messages I did not see when I had fixed_phy compiled as built-in.
> >>
> >> But I still see no netdevs :/
> >
> >The platform data assumes there is a network device named "eth0" as the
>
> Oops, I missed that. I created a dummy device and modprobed again. Now I see:
>
> $ sudo devlink port
> mdio_bus/fixed-0:1f/0: type eth netdev lan1
> mdio_bus/fixed-0:1f/1: type eth netdev lan2
> mdio_bus/fixed-0:1f/2: type eth netdev lan3
> mdio_bus/fixed-0:1f/3: type eth netdev lan4
> mdio_bus/fixed-0:1f/4: type notset
> mdio_bus/fixed-0:1f/5: type notset
> mdio_bus/fixed-0:1f/6: type notset
> mdio_bus/fixed-0:1f/7: type notset
> mdio_bus/fixed-0:1f/8: type notset
> mdio_bus/fixed-0:1f/9: type notset
> mdio_bus/fixed-0:1f/10: type notset
> mdio_bus/fixed-0:1f/11: type notset
>
> I wonder why there are ports 4-11

Hi Jiri

	ds = dsa_switch_alloc(&mdiodev->dev, DSA_MAX_PORTS);

It is allocating a switch with 12 ports. However only 4 of them have
names. So the core only creates slave devices for those 4.

This is a useful test. Real hardware often has unused ports. A WiFi AP
with a 7 port switch which only uses 6 ports is often seen.

	Andrew
Re: [PATCH net-next 1/1] qede: Add build_skb() support.
From: Manish Chopra
Date: Thu, 17 May 2018 12:05:00 -0700

> This patch makes use of build_skb() throughout in driver's receive
> data path [HW gro flow and non HW gro flow]. With this, driver can
> build skb directly from the page segments which are already mapped
> to the hardware instead of allocating new SKB via netdev_alloc_skb()
> and memcpy the data which is quite costly.
>
> This really improves performance (keeping same or slight gain in rx
> throughput) in terms of CPU utilization which is significantly reduced
> [almost half] in non HW gro flow where for every incoming MTU sized
> packet driver had to allocate skb, memcpy headers etc. Additionally
> in that flow, it also gets rid of bunch of additional overheads
> [eth_get_headlen() etc.] to split headers and data in the skb.
>
> Tested with:
> system: 2 sockets, 4 cores per socket, hyperthreading, 2x4x2=16 cores
> iperf [server]: iperf -s
> iperf [client]: iperf -c -t 500 -i 10 -P 32
>
> HW GRO off - w/o build_skb(), throughput: 36.8 Gbits/sec
>
> Average:  CPU   %usr  %nice   %sys  %iowait   %irq  %soft  %steal  %guest  %idle
> Average:  all   0.59   0.00  32.93     0.00   0.00  43.07    0.00    0.00  23.42
>
> HW GRO off - with build_skb(), throughput: 36.9 Gbits/sec
>
> Average:  CPU   %usr  %nice   %sys  %iowait   %irq  %soft  %steal  %guest  %idle
> Average:  all   0.70   0.00  31.70     0.00   0.00  25.68    0.00    0.00  41.92
>                                                       ^                       ^
>
> HW GRO on - w/o build_skb(), throughput: 36.9 Gbits/sec
>
> Average:  CPU   %usr  %nice   %sys  %iowait   %irq  %soft  %steal  %guest  %idle
> Average:  all   0.86   0.00  24.14     0.00   0.00   6.59    0.00    0.00  68.41
>
> HW GRO on - with build_skb(), throughput: 37.5 Gbits/sec
>
> Average:  CPU   %usr  %nice   %sys  %iowait   %irq  %soft  %steal  %guest  %idle
> Average:  all   0.87   0.00  23.75     0.00   0.00   6.19    0.00    0.00  69.19
>
> Signed-off-by: Ariel Elior
> Signed-off-by: Manish Chopra

Looks great, applied, thank you.
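
As a rough check on the "almost half" claim, the CPU figures quoted in the commit message can be compared with a little arithmetic. The helpers below are a sketch with hypothetical names (`relative_drop_pct`, `busy_pct`); they only restate the numbers already posted for the HW GRO off case and are not part of the patch.

```c
#include <assert.h>

/* Percentage by which `after` is lower than `before`. */
double relative_drop_pct(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before * 100.0;
}

/* Busy CPU share derived from an mpstat-style %idle column. */
double busy_pct(double idle_pct)
{
	return 100.0 - idle_pct;
}
```

With the posted HW GRO off numbers, `relative_drop_pct(43.07, 25.68)` gives a drop of roughly 40% in %soft, and `relative_drop_pct(busy_pct(23.42), busy_pct(41.92))` gives a drop of roughly 24% in overall busy CPU, at essentially unchanged throughput.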
Re: [PATCH net v2] net: test tailroom before appending to linear skb
From: Willem de Bruijn
Date: Thu, 17 May 2018 13:13:29 -0400

> From: Willem de Bruijn
>
> Device features may change during transmission. In particular with
> corking, a device may toggle scatter-gather in between allocating
> and writing to an skb.
>
> Do not unconditionally assume that !NETIF_F_SG at write time implies
> that the same held at alloc time and thus the skb has sufficient
> tailroom.
>
> This issue predates git history.
>
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Reported-by: Eric Dumazet
> Signed-off-by: Willem de Bruijn
>
> ---
>
> v2: fix ipv4 boundary condition

Applied and queued up for -stable, thanks Willem.
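
The diff itself is not quoted in this message, so the following is only a user-space sketch of the idea in the commit message: check the space that is actually left in the linear area at write time, rather than inferring it from a device flag that may have changed since allocation. `model_buf`, `model_tailroom` and `model_append` are hypothetical names, not the kernel's sk_buff API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a linear buffer; not struct sk_buff. */
struct model_buf {
	unsigned char data[64];
	size_t len;	/* bytes already written */
	size_t cap;	/* bytes reserved at allocation time, <= sizeof(data) */
};

/* Space still free at the end of the linear area. */
size_t model_tailroom(const struct model_buf *b)
{
	assert(b->len <= b->cap && b->cap <= sizeof(b->data));
	return b->cap - b->len;
}

/*
 * Append only if the bytes fit right now. The decision depends on the
 * measured tailroom, not on what the device advertised when the buffer
 * was allocated.
 */
bool model_append(struct model_buf *b, const void *src, size_t n)
{
	if (n > model_tailroom(b))
		return false;	/* caller must take another path, e.g. a new buffer */
	memcpy(b->data + b->len, src, n);
	b->len += n;
	return true;
}
```

In this model a buffer allocated with `cap = 8` accepts a 6-byte write, rejects a following 3-byte write without touching the buffer, and still accepts a 2-byte write that exactly fills the remaining space.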