[PATCH bpf-next v2 1/7] perf/core: add perf_get_event() to return perf_event given a struct file

2018-05-17 Thread Yonghong Song
A new extern function, perf_get_event(), is added to return a perf event
given a struct file. This function will be used in later patches.
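
The f_op comparison is what makes returning private_data safe: only files
created by the perf subsystem carry a perf_event there. A minimal userspace
sketch of that pattern follows. The struct definitions, other_fops, and the
simplified ERR_PTR()/IS_ERR() helpers are mock stand-ins assumed for
illustration, not the kernel's own definitions.

```c
#include <errno.h>
#include <stdint.h>

/* Mock stand-ins for the kernel types; illustration only. */
struct file_operations { int unused; };
struct file {
	const struct file_operations *f_op;
	void *private_data;
};
struct perf_event { int id; };

/* Simplified versions of the kernel's ERR_PTR()/IS_ERR() helpers. */
#define MAX_ERRNO 4095
static inline void *ERR_PTR(intptr_t error)
{
	return (void *)error;
}
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Two distinct operation tables: one "owned" by perf, one by something else. */
static const struct file_operations perf_fops = { 0 };
static const struct file_operations other_fops = { 0 };

/* Hand back the perf_event only if the file really is a perf event file. */
static struct perf_event *perf_get_event(struct file *file)
{
	if (file->f_op != &perf_fops)
		return ERR_PTR(-EINVAL);

	return file->private_data;
}
```

A caller would test the result with IS_ERR() before dereferencing it, since
a non-perf file yields an encoded -EINVAL rather than NULL.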

Signed-off-by: Yonghong Song 
---
 include/linux/perf_event.h | 5 +
 kernel/events/core.c   | 8 
 2 files changed, 13 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index e71e99e..b5c1ad3 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -868,6 +868,7 @@ extern void perf_event_exit_task(struct task_struct *child);
 extern void perf_event_free_task(struct task_struct *task);
 extern void perf_event_delayed_put(struct task_struct *task);
 extern struct file *perf_event_get(unsigned int fd);
+extern struct perf_event *perf_get_event(struct file *file);
 extern const struct perf_event_attr *perf_event_attrs(struct perf_event 
*event);
 extern void perf_event_print_debug(void);
 extern void perf_pmu_disable(struct pmu *pmu);
@@ -1289,6 +1290,10 @@ static inline void perf_event_exit_task(struct 
task_struct *child)   { }
 static inline void perf_event_free_task(struct task_struct *task)  { }
 static inline void perf_event_delayed_put(struct task_struct *task){ }
 static inline struct file *perf_event_get(unsigned int fd) { return 
ERR_PTR(-EINVAL); }
+static inline struct perf_event *perf_get_event(struct file *file)
+{
+   return ERR_PTR(-EINVAL);
+}
 static inline const struct perf_event_attr *perf_event_attrs(struct perf_event 
*event)
 {
return ERR_PTR(-EINVAL);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 67612ce..1e3cddb 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -11212,6 +11212,14 @@ struct file *perf_event_get(unsigned int fd)
return file;
 }
 
+struct perf_event *perf_get_event(struct file *file)
+{
+   if (file->f_op != &perf_fops)
+   return ERR_PTR(-EINVAL);
+
+   return file->private_data;
+}
+
 const struct perf_event_attr *perf_event_attrs(struct perf_event *event)
 {
if (!event)
-- 
2.9.5



Re: [PATCH v3 1/2] media: rc: introduce BPF_PROG_RAWIR_EVENT

2018-05-17 Thread Y Song
On Thu, May 17, 2018 at 2:45 PM, Sean Young  wrote:
> Hi,
>
> Again thanks for a thoughtful review. This will definitely improve
> the code.
>
> On Thu, May 17, 2018 at 10:02:52AM -0700, Y Song wrote:
>> On Wed, May 16, 2018 at 2:04 PM, Sean Young  wrote:
>> > Add support for BPF_PROG_RAWIR_EVENT. This type of BPF program can call
>> > rc_keydown() to report decoded IR scancodes, or rc_repeat() to report
>> > that the last key should be repeated.
>> >
>> > The bpf program can be attached using the bpf(BPF_PROG_ATTACH) syscall;
>> > the target_fd must be the /dev/lircN device.
>> >
>> > Signed-off-by: Sean Young 
>> > ---
>> >  drivers/media/rc/Kconfig   |  13 ++
>> >  drivers/media/rc/Makefile  |   1 +
>> >  drivers/media/rc/bpf-rawir-event.c | 363 +
>> >  drivers/media/rc/lirc_dev.c|  24 ++
>> >  drivers/media/rc/rc-core-priv.h|  24 ++
>> >  drivers/media/rc/rc-ir-raw.c   |  14 +-
>> >  include/linux/bpf_rcdev.h  |  30 +++
>> >  include/linux/bpf_types.h  |   3 +
>> >  include/uapi/linux/bpf.h   |  55 -
>> >  kernel/bpf/syscall.c   |   7 +
>> >  10 files changed, 531 insertions(+), 3 deletions(-)
>> >  create mode 100644 drivers/media/rc/bpf-rawir-event.c
>> >  create mode 100644 include/linux/bpf_rcdev.h
>> >
>> > diff --git a/drivers/media/rc/Kconfig b/drivers/media/rc/Kconfig
>> > index eb2c3b6eca7f..2172d65b0213 100644
>> > --- a/drivers/media/rc/Kconfig
>> > +++ b/drivers/media/rc/Kconfig
>> > @@ -25,6 +25,19 @@ config LIRC
>> >passes raw IR to and from userspace, which is needed for
>> >IR transmitting (aka "blasting") and for the lirc daemon.
>> >
>> > +config BPF_RAWIR_EVENT
>> > +   bool "Support for eBPF programs attached to lirc devices"
>> > +   depends on BPF_SYSCALL
>> > +   depends on RC_CORE=y
>> > +   depends on LIRC
>> > +   help
>> > +  Allow attaching eBPF programs to a lirc device using the bpf(2)
>> > +  syscall command BPF_PROG_ATTACH. This is supported for raw IR
>> > +  receivers.
>> > +
>> > +  These eBPF programs can be used to decode IR into scancodes, for
>> > +  IR protocols not supported by the kernel decoders.
>> > +
>> >  menuconfig RC_DECODERS
>> > bool "Remote controller decoders"
>> > depends on RC_CORE
>> > diff --git a/drivers/media/rc/Makefile b/drivers/media/rc/Makefile
>> > index 2e1c87066f6c..74907823bef8 100644
>> > --- a/drivers/media/rc/Makefile
>> > +++ b/drivers/media/rc/Makefile
>> > @@ -5,6 +5,7 @@ obj-y += keymaps/
>> >  obj-$(CONFIG_RC_CORE) += rc-core.o
>> >  rc-core-y := rc-main.o rc-ir-raw.o
>> >  rc-core-$(CONFIG_LIRC) += lirc_dev.o
>> > +rc-core-$(CONFIG_BPF_RAWIR_EVENT) += bpf-rawir-event.o
>> >  obj-$(CONFIG_IR_NEC_DECODER) += ir-nec-decoder.o
>> >  obj-$(CONFIG_IR_RC5_DECODER) += ir-rc5-decoder.o
>> >  obj-$(CONFIG_IR_RC6_DECODER) += ir-rc6-decoder.o
>> > diff --git a/drivers/media/rc/bpf-rawir-event.c 
>> > b/drivers/media/rc/bpf-rawir-event.c
>> > new file mode 100644
>> > index ..7cb48b8d87b5
>> > --- /dev/null
>> > +++ b/drivers/media/rc/bpf-rawir-event.c
>> > @@ -0,0 +1,363 @@
>> > +// SPDX-License-Identifier: GPL-2.0
>> > +// bpf-rawir-event.c - handles bpf
>> > +//
>> > +// Copyright (C) 2018 Sean Young 
>> > +
>> > +#include 
>> > +#include 
>> > +#include 
>> > +#include "rc-core-priv.h"
>> > +
>> > +/*
>> > + * BPF interface for raw IR
>> > + */
>> > +const struct bpf_prog_ops rawir_event_prog_ops = {
>> > +};
>> > +
>> > +BPF_CALL_1(bpf_rc_repeat, struct bpf_rawir_event*, event)
>> > +{
>> > +   struct ir_raw_event_ctrl *ctrl;
>> > +
>> > +   ctrl = container_of(event, struct ir_raw_event_ctrl, 
>> > bpf_rawir_event);
>> > +
>> > +   rc_repeat(ctrl->dev);
>> > +
>> > +   return 0;
>> > +}
>> > +
>> > +static const struct bpf_func_proto rc_repeat_proto = {
>> > +   .func  = bpf_rc_repeat,
>> > +   .gpl_only  = true, /* rc_repeat is EXPORT_SYMBOL_GPL */
>> > +   .ret_type  = RET_INTEGER,
>> > +   .arg1_type = ARG_PTR_TO_CTX,
>> > +};
>> > +
>> > +BPF_CALL_4(bpf_rc_keydown, struct bpf_rawir_event*, event, u32, protocol,
>> > +  u32, scancode, u32, toggle)
>> > +{
>> > +   struct ir_raw_event_ctrl *ctrl;
>> > +
>> > +   ctrl = container_of(event, struct ir_raw_event_ctrl, 
>> > bpf_rawir_event);
>> > +
>> > +   rc_keydown(ctrl->dev, protocol, scancode, toggle != 0);
>> > +
>> > +   return 0;
>> > +}
>> > +
>> > +static const struct bpf_func_proto rc_keydown_proto = {
>> > +   .func  = bpf_rc_keydown,
>> > +   .gpl_only  = true, /* rc_keydown is EXPORT_SYMBOL_GPL */
>> > +   .ret_type  = RET_INTEGER,
>> > +   .arg1_type = ARG_PTR_TO_CTX,
>> > +   .arg2_type = ARG_ANYTHING,
>> > +   .arg3_type = ARG_ANYTHING,
>> > +   .arg4_type = ARG_ANYTHING,
>> > +};
>> > +
>> > +static 

[RFC PATCH net-next] tcp: tcp_rack_reo_wnd() can be static

2018-05-17 Thread kbuild test robot

Fixes: 20b654dfe1be ("tcp: support DUPACK threshold in RACK")
Signed-off-by: kbuild test robot 
---
 tcp_recovery.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_recovery.c b/net/ipv4/tcp_recovery.c
index 30cbfb6..71593e4 100644
--- a/net/ipv4/tcp_recovery.c
+++ b/net/ipv4/tcp_recovery.c
@@ -21,7 +21,7 @@ static bool tcp_rack_sent_after(u64 t1, u64 t2, u32 seq1, u32 
seq2)
return t1 > t2 || (t1 == t2 && after(seq1, seq2));
 }
 
-u32 tcp_rack_reo_wnd(const struct sock *sk)
+static u32 tcp_rack_reo_wnd(const struct sock *sk)
 {
struct tcp_sock *tp = tcp_sk(sk);
 


[net-next:master 1200/1233] net/ipv4/tcp_recovery.c:24:5: sparse: symbol 'tcp_rack_reo_wnd' was not declared. Should it be static?

2018-05-17 Thread kbuild test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git 
master
head:   538e2de104cfb4ef1acb35af42427bff42adbe4d
commit: 20b654dfe1beaca60ab51894ff405a049248433d [1200/1233] tcp: support 
DUPACK threshold in RACK
reproduce:
# apt-get install sparse
git checkout 20b654dfe1beaca60ab51894ff405a049248433d
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

   net/ipv4/tcp_recovery.c:46:16: sparse: expression using sizeof(void)
   net/ipv4/tcp_recovery.c:46:16: sparse: expression using sizeof(void)
>> net/ipv4/tcp_recovery.c:24:5: sparse: symbol 'tcp_rack_reo_wnd' was not 
>> declared. Should it be static?
   include/net/tcp.h:738:16: sparse: expression using sizeof(void)
   net/ipv4/tcp_recovery.c:102:40: sparse: expression using sizeof(void)
   net/ipv4/tcp_recovery.c:102:40: sparse: expression using sizeof(void)
   include/net/tcp.h:738:16: sparse: expression using sizeof(void)
   net/ipv4/tcp_recovery.c:210:42: sparse: expression using sizeof(void)

Please review and possibly fold the followup patch.

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation


[PATCH bpf-next v2 2/7] bpf: introduce bpf subcommand BPF_TASK_FD_QUERY

2018-05-17 Thread Yonghong Song
Currently, suppose a userspace application has loaded a bpf program
and attached it to a tracepoint/kprobe/uprobe, and a bpf
introspection tool, e.g., bpftool, wants to show which bpf program
is attached to which tracepoint/kprobe/uprobe. Such attachment
information will be really useful to understand the overall bpf
deployment in the system.

There is a name field (16 bytes) for each program, which could
be used to encode the attachment point. There are some drawbacks
to this approach. First, a bpftool user (e.g., an admin) may not
really understand the association between the name and the
attachment point. Second, if one program is attached to multiple
places, encoding a proper name which can imply all these
attachments becomes difficult.

This patch introduces a new bpf subcommand BPF_TASK_FD_QUERY.
Given a pid and fd, if the <pid, fd> is associated with a
tracepoint/kprobe/uprobe perf event, BPF_TASK_FD_QUERY will return
   . prog_id
   . tracepoint name, or
   . k[ret]probe funcname + offset or kernel addr, or
   . u[ret]probe filename + offset
to the userspace.
The user can use "bpftool prog" to find more information about
bpf program itself with prog_id.
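
To make the returned fields concrete, here is a hedged sketch of how a
consumer might turn them into a one-line description. The enum mirrors the
BPF_ATTACH_* values proposed in this series; describe_attachment() is a
hypothetical helper, not part of the patch, and the rule that an empty
symbol string means "report probe_addr instead" is an assumption.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Local copy of the attach kinds proposed in this series. */
enum {
	BPF_ATTACH_RAW_TRACEPOINT,	/* tp name */
	BPF_ATTACH_TRACEPOINT,		/* tp name */
	BPF_ATTACH_KPROBE,		/* (symbol + offset) or addr */
	BPF_ATTACH_KRETPROBE,		/* (symbol + offset) or addr */
	BPF_ATTACH_UPROBE,		/* filename + offset */
	BPF_ATTACH_URETPROBE,		/* filename + offset */
};

/*
 * Hypothetical helper: format one query result into out.
 * Returns the snprintf() result, or -1 for an unknown attach_info.
 */
static int describe_attachment(char *out, size_t len, uint32_t prog_id,
			       uint32_t attach_info, const char *buf,
			       uint64_t probe_offset, uint64_t probe_addr)
{
	const char *kind;

	switch (attach_info) {
	case BPF_ATTACH_RAW_TRACEPOINT:
	case BPF_ATTACH_TRACEPOINT:
		kind = attach_info == BPF_ATTACH_TRACEPOINT ?
		       "tracepoint" : "raw_tracepoint";
		return snprintf(out, len, "prog_id %" PRIu32 "  %s  %s",
				prog_id, kind, buf);
	case BPF_ATTACH_KPROBE:
	case BPF_ATTACH_KRETPROBE:
		kind = attach_info == BPF_ATTACH_KPROBE ?
		       "kprobe" : "kretprobe";
		/* Assumption: an empty symbol means the probe was set by address. */
		if (buf[0] != '\0')
			return snprintf(out, len,
					"prog_id %" PRIu32 "  %s  func %s  offset %" PRIu64,
					prog_id, kind, buf, probe_offset);
		return snprintf(out, len, "prog_id %" PRIu32 "  %s  addr 0x%" PRIx64,
				prog_id, kind, probe_addr);
	case BPF_ATTACH_UPROBE:
	case BPF_ATTACH_URETPROBE:
		kind = attach_info == BPF_ATTACH_UPROBE ?
		       "uprobe" : "uretprobe";
		return snprintf(out, len,
				"prog_id %" PRIu32 "  %s  filename %s  offset %" PRIu64,
				prog_id, kind, buf, probe_offset);
	default:
		return -1;
	}
}
```

The switch keeps the three output shapes (name only, symbol or address,
filename plus offset) visibly separate, which matches the bullet list above.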

Signed-off-by: Yonghong Song 
---
 include/linux/trace_events.h |  15 ++
 include/uapi/linux/bpf.h |  27 ++
 kernel/bpf/syscall.c | 124 +++
 kernel/trace/bpf_trace.c |  48 +
 kernel/trace/trace_kprobe.c  |  29 ++
 kernel/trace/trace_uprobe.c  |  22 
 6 files changed, 265 insertions(+)

diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 2bde3ef..bd08e11 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -473,6 +473,9 @@ int perf_event_query_prog_array(struct perf_event *event, 
void __user *info);
 int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog *prog);
 int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_prog *prog);
 struct bpf_raw_event_map *bpf_find_raw_tracepoint(const char *name);
+int bpf_get_perf_event_info(struct perf_event *event, u32 *prog_id,
+   u32 *attach_info, const char **buf,
+   u64 *probe_offset, u64 *probe_addr);
 #else
 static inline unsigned int trace_call_bpf(struct trace_event_call *call, void 
*ctx)
 {
@@ -504,6 +507,12 @@ static inline struct bpf_raw_event_map 
*bpf_find_raw_tracepoint(const char *name
 {
return NULL;
 }
+static inline int bpf_get_perf_event_info(struct file *file, u32 *prog_id,
+ u32 *attach_info, const char **buf,
+ u64 *probe_offset, u64 *probe_addr)
+{
+   return -EOPNOTSUPP;
+}
 #endif
 
 enum {
@@ -560,10 +569,16 @@ extern void perf_trace_del(struct perf_event *event, int 
flags);
 #ifdef CONFIG_KPROBE_EVENTS
 extern int  perf_kprobe_init(struct perf_event *event, bool is_retprobe);
 extern void perf_kprobe_destroy(struct perf_event *event);
+extern int bpf_get_kprobe_info(struct perf_event *event, u32 *attach_info,
+  const char **symbol, u64 *probe_offset,
+  u64 *probe_addr, bool perf_type_tracepoint);
 #endif
 #ifdef CONFIG_UPROBE_EVENTS
 extern int  perf_uprobe_init(struct perf_event *event, bool is_retprobe);
 extern void perf_uprobe_destroy(struct perf_event *event);
+extern int bpf_get_uprobe_info(struct perf_event *event, u32 *attach_info,
+  const char **filename, u64 *probe_offset,
+  bool perf_type_tracepoint);
 #endif
 extern int  ftrace_profile_set_filter(struct perf_event *event, int event_id,
 char *filter_str);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index d94d333..6a22ad4 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -97,6 +97,7 @@ enum bpf_cmd {
BPF_RAW_TRACEPOINT_OPEN,
BPF_BTF_LOAD,
BPF_BTF_GET_FD_BY_ID,
+   BPF_TASK_FD_QUERY,
 };
 
 enum bpf_map_type {
@@ -379,6 +380,22 @@ union bpf_attr {
__u32   btf_log_size;
__u32   btf_log_level;
};
+
+   struct {
+   int pid;/* input: pid */
+   int fd; /* input: fd */
+   __u32   flags;  /* input: flags */
+   __u32   buf_len;/* input: buf len */
+   __aligned_u64   buf;/* input/output:
+*   tp_name for tracepoint
+*   symbol for kprobe
+*   filename for uprobe
+*/
+   __u32   prog_id;/* output: prog_id */
+   __u32   attach_info;   

[PATCH bpf-next v2 4/7] tools/bpf: add ksym_get_addr() in trace_helpers

2018-05-17 Thread Yonghong Song
Given a kernel function name, ksym_get_addr() will return the kernel
address for this function, or 0 if it cannot find this function name
in /proc/kallsyms. This function will be used later when a kernel
address is used to initiate a kprobe perf event.
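
The lookup itself is a plain linear scan by name. A self-contained sketch
of the idea is shown here; it takes the symbol table as an argument instead
of using the file-scope syms[] array filled by load_kallsyms(), and both
ksym_lookup() and the addresses in any table passed to it are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Simplified symbol record: address plus name, as parsed from /proc/kallsyms. */
struct ksym {
	long addr;
	const char *name;
};

/*
 * Hypothetical variant of ksym_get_addr() that searches a caller-supplied
 * table. Returns the symbol's address, or 0 if the name is not present.
 */
static long ksym_lookup(const struct ksym *syms, size_t cnt, const char *name)
{
	size_t i;

	for (i = 0; i < cnt; i++) {
		if (strcmp(syms[i].name, name) == 0)
			return syms[i].addr;
	}

	return 0;
}
```

Because 0 doubles as the "not found" value, callers should treat a zero
result as a failed lookup rather than as a usable probe address.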

Signed-off-by: Yonghong Song 
---
 tools/testing/selftests/bpf/trace_helpers.c | 12 
 tools/testing/selftests/bpf/trace_helpers.h |  1 +
 2 files changed, 13 insertions(+)

diff --git a/tools/testing/selftests/bpf/trace_helpers.c 
b/tools/testing/selftests/bpf/trace_helpers.c
index 8fb4fe8..3868dcb 100644
--- a/tools/testing/selftests/bpf/trace_helpers.c
+++ b/tools/testing/selftests/bpf/trace_helpers.c
@@ -72,6 +72,18 @@ struct ksym *ksym_search(long key)
return &syms[0];
 }
 
+long ksym_get_addr(const char *name)
+{
+   int i;
+
+   for (i = 0; i < sym_cnt; i++) {
+   if (strcmp(syms[i].name, name) == 0)
+   return syms[i].addr;
+   }
+
+   return 0;
+}
+
 static int page_size;
 static int page_cnt = 8;
 static struct perf_event_mmap_page *header;
diff --git a/tools/testing/selftests/bpf/trace_helpers.h 
b/tools/testing/selftests/bpf/trace_helpers.h
index 36d90e3..3b4bcf7 100644
--- a/tools/testing/selftests/bpf/trace_helpers.h
+++ b/tools/testing/selftests/bpf/trace_helpers.h
@@ -11,6 +11,7 @@ struct ksym {
 
 int load_kallsyms(void);
 struct ksym *ksym_search(long key);
+long ksym_get_addr(const char *name);
 
 typedef enum bpf_perf_event_ret (*perf_event_print_fn)(void *data, int size);
 
-- 
2.9.5



[PATCH bpf-next v2 0/7] bpf: implement BPF_TASK_FD_QUERY

2018-05-17 Thread Yonghong Song
Currently, suppose a userspace application has loaded a bpf program
and attached it to a tracepoint/kprobe/uprobe, and a bpf
introspection tool, e.g., bpftool, wants to show which bpf program
is attached to which tracepoint/kprobe/uprobe. Such attachment
information will be really useful to understand the overall bpf
deployment in the system.

There is a name field (16 bytes) for each program, which could
be used to encode the attachment point. There are some drawbacks
to this approach. First, a bpftool user (e.g., an admin) may not
really understand the association between the name and the
attachment point. Second, if one program is attached to multiple
places, encoding a proper name which can imply all these
attachments becomes difficult.

This patch introduces a new bpf subcommand BPF_TASK_FD_QUERY.
Given a pid and fd, this command will return bpf related information
to user space. Right now it only supports tracepoint/kprobe/uprobe
perf event fd's. For such a fd, BPF_TASK_FD_QUERY will return
   . prog_id
   . tracepoint name, or
   . k[ret]probe funcname + offset or kernel addr, or
   . u[ret]probe filename + offset
to the userspace.
The user can use "bpftool prog" to find more information about
bpf program itself with prog_id.

Patch #1 adds function perf_get_event() in kernel/events/core.c.
Patch #2 implements the bpf subcommand BPF_TASK_FD_QUERY.
Patch #3 syncs tools bpf.h header and also adds bpf_task_fd_query()
in the libbpf library for samples/selftests/bpftool to use.
Patch #4 adds ksym_get_addr() utility function.
Patch #5 adds a test in samples/bpf for querying k[ret]probes and
u[ret]probes.
Patch #6 adds a test in tools/testing/selftests/bpf for querying
raw_tracepoint and tracepoint.
Patch #7 adds a new subcommand "perf" to bpftool.

Changelogs:
  v1 -> v2:
 . changed bpf subcommand name from BPF_PERF_EVENT_QUERY
   to BPF_TASK_FD_QUERY.
 . fixed various "bpftool perf" issues and added documentation
   and auto-completion.

Yonghong Song (7):
  perf/core: add perf_get_event() to return perf_event given a struct
file
  bpf: introduce bpf subcommand BPF_TASK_FD_QUERY
  tools/bpf: sync kernel header bpf.h and add bpf_trace_event_query in
libbpf
  tools/bpf: add ksym_get_addr() in trace_helpers
  samples/bpf: add a samples/bpf test for BPF_TASK_FD_QUERY
  tools/bpf: add two BPF_TASK_FD_QUERY tests in test_progs
  tools/bpftool: add perf subcommand

 include/linux/perf_event.h   |   5 +
 include/linux/trace_events.h |  15 +
 include/uapi/linux/bpf.h |  27 ++
 kernel/bpf/syscall.c | 124 
 kernel/events/core.c |   8 +
 kernel/trace/bpf_trace.c |  48 +++
 kernel/trace/trace_kprobe.c  |  29 ++
 kernel/trace/trace_uprobe.c  |  22 ++
 samples/bpf/Makefile |   4 +
 samples/bpf/task_fd_query_kern.c |  19 ++
 samples/bpf/task_fd_query_user.c | 379 +++
 tools/bpf/bpftool/Documentation/bpftool-perf.rst |  81 +
 tools/bpf/bpftool/Documentation/bpftool.rst  |   5 +-
 tools/bpf/bpftool/bash-completion/bpftool|   9 +
 tools/bpf/bpftool/main.c |   3 +-
 tools/bpf/bpftool/main.h |   1 +
 tools/bpf/bpftool/perf.c | 200 
 tools/include/uapi/linux/bpf.h   |  27 ++
 tools/lib/bpf/bpf.c  |  24 ++
 tools/lib/bpf/bpf.h  |   3 +
 tools/testing/selftests/bpf/test_progs.c | 133 
 tools/testing/selftests/bpf/trace_helpers.c  |  12 +
 tools/testing/selftests/bpf/trace_helpers.h  |   1 +
 23 files changed, 1177 insertions(+), 2 deletions(-)
 create mode 100644 samples/bpf/task_fd_query_kern.c
 create mode 100644 samples/bpf/task_fd_query_user.c
 create mode 100644 tools/bpf/bpftool/Documentation/bpftool-perf.rst
 create mode 100644 tools/bpf/bpftool/perf.c

-- 
2.9.5



[PATCH bpf-next v2 7/7] tools/bpftool: add perf subcommand

2018-05-17 Thread Yonghong Song
The new command "bpftool perf [show | list]" will traverse
all processes under /proc, and if any fd is associated
with a perf event, it will print out related perf event
information. Documentation is also added.
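
The traversal described above can be sketched as follows. This is an
illustration under assumptions, not the code from tools/bpf/bpftool/perf.c:
is_pid_name() and for_each_pid_fd() are hypothetical names, and the visit
callback stands in for the per-<pid, fd> query that bpftool would issue.

```c
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>

/* Return 1 if name is all decimal digits, as /proc/<pid> and fd entries are. */
static int is_pid_name(const char *name)
{
	if (*name == '\0')
		return 0;
	for (; *name; name++) {
		if (!isdigit((unsigned char)*name))
			return 0;
	}
	return 1;
}

/*
 * Call visit(pid, fd) for every open fd of every process visible in /proc.
 * Returns the number of <pid, fd> pairs seen, or -1 if /proc cannot be opened.
 */
static long for_each_pid_fd(void (*visit)(int pid, int fd))
{
	struct dirent *p, *f;
	char path[300];
	DIR *proc, *fds;
	long n = 0;

	proc = opendir("/proc");
	if (!proc)
		return -1;

	while ((p = readdir(proc)) != NULL) {
		if (!is_pid_name(p->d_name))
			continue;	/* not a process directory */

		snprintf(path, sizeof(path), "/proc/%s/fd", p->d_name);
		fds = opendir(path);
		if (!fds)
			continue;	/* process exited or access denied */

		while ((f = readdir(fds)) != NULL) {
			if (!is_pid_name(f->d_name))
				continue;	/* skip "." and ".." */
			if (visit)
				visit(atoi(p->d_name), atoi(f->d_name));
			n++;
		}
		closedir(fds);
	}
	closedir(proc);

	return n;
}
```

Processes can exit between the two directory reads, so a failed opendir()
on the fd directory is skipped rather than treated as an error.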

Below is an example to show the results using bcc commands.
Running the following 4 bcc commands:
  kprobe: trace.py '__x64_sys_nanosleep'
  kretprobe:  trace.py 'r::__x64_sys_nanosleep'
  tracepoint: trace.py 't:syscalls:sys_enter_nanosleep'
  uprobe: trace.py 'p:/home/yhs/a.out:main'

The bpftool command line and result:

  $ bpftool perf
  pid 21711  fd 5: prog_id 5  kprobe  func __x64_sys_write  offset 0
  pid 21765  fd 5: prog_id 7  kretprobe  func __x64_sys_nanosleep  offset 0
  pid 21767  fd 5: prog_id 8  tracepoint  sys_enter_nanosleep
  pid 21800  fd 5: prog_id 9  uprobe  filename /home/yhs/a.out  offset 1159

  $ bpftool -j perf
  {"pid":21711,"fd":5,"prog_id":5,"attach_info":"kprobe","func":"__x64_sys_write","offset":0}, \
  {"pid":21765,"fd":5,"prog_id":7,"attach_info":"kretprobe","func":"__x64_sys_nanosleep","offset":0}, \
  {"pid":21767,"fd":5,"prog_id":8,"attach_info":"tracepoint","tracepoint":"sys_enter_nanosleep"}, \
  {"pid":21800,"fd":5,"prog_id":9,"attach_info":"uprobe","filename":"/home/yhs/a.out","offset":1159}

  $ bpftool prog
  5: kprobe  name probe___x64_sys  tag e495a0c82f2c7a8d  gpl
  loaded_at 2018-05-15T04:46:37-0700  uid 0
  xlated 200B  not jited  memlock 4096B  map_ids 4
  7: kprobe  name probe___x64_sys  tag f2fdee479a503abf  gpl
  loaded_at 2018-05-15T04:48:32-0700  uid 0
  xlated 200B  not jited  memlock 4096B  map_ids 7
  8: tracepoint  name tracepoint__sys  tag 5390badef2395fcf  gpl
  loaded_at 2018-05-15T04:48:48-0700  uid 0
  xlated 200B  not jited  memlock 4096B  map_ids 8
  9: kprobe  name probe_main_1  tag 0a87bdc2e2953b6d  gpl
  loaded_at 2018-05-15T04:49:52-0700  uid 0
  xlated 200B  not jited  memlock 4096B  map_ids 9

  $ ps ax | grep "python ./trace.py"
  21711 pts/0T  0:03 python ./trace.py __x64_sys_write
  21765 pts/0S+ 0:00 python ./trace.py r::__x64_sys_nanosleep
  21767 pts/2S+ 0:00 python ./trace.py t:syscalls:sys_enter_nanosleep
  21800 pts/3S+ 0:00 python ./trace.py p:/home/yhs/a.out:main
  22374 pts/1S+ 0:00 grep --color=auto python ./trace.py

Signed-off-by: Yonghong Song 
---
 tools/bpf/bpftool/Documentation/bpftool-perf.rst |  81 +
 tools/bpf/bpftool/Documentation/bpftool.rst  |   5 +-
 tools/bpf/bpftool/bash-completion/bpftool|   9 +
 tools/bpf/bpftool/main.c |   3 +-
 tools/bpf/bpftool/main.h |   1 +
 tools/bpf/bpftool/perf.c | 200 +++
 6 files changed, 297 insertions(+), 2 deletions(-)
 create mode 100644 tools/bpf/bpftool/Documentation/bpftool-perf.rst
 create mode 100644 tools/bpf/bpftool/perf.c

diff --git a/tools/bpf/bpftool/Documentation/bpftool-perf.rst 
b/tools/bpf/bpftool/Documentation/bpftool-perf.rst
new file mode 100644
index 000..3e65375
--- /dev/null
+++ b/tools/bpf/bpftool/Documentation/bpftool-perf.rst
@@ -0,0 +1,81 @@
+
+bpftool-perf
+
+---
+tool for inspection of perf related bpf prog attachments
+---
+
+:Manual section: 8
+
+SYNOPSIS
+
+
+   **bpftool** [*OPTIONS*] **perf** *COMMAND*
+
+   *OPTIONS* := { [{ **-j** | **--json** }] [{ **-p** | **--pretty** }] }
+
+   *COMMANDS* :=
+   { **show** | **list** | **help** }
+
+PERF COMMANDS
+=
+
+|  **bpftool** **perf { show | list }**
+|  **bpftool** **perf help**
+
+DESCRIPTION
+===
+   **bpftool perf { show | list }**
+ List all raw_tracepoint, tracepoint, kprobe attachment in the 
system.
+
+ Output will start with process id and file descriptor in that 
process,
+ followed by bpf program id, attachment information, and 
attachment point.
+ The attachment point for raw_tracepoint/tracepoint is the 
trace probe name.
+ The attachment point for k[ret]probe is either symbol name 
and offset,
+ or a kernel virtual address.
+ The attachment point for u[ret]probe is the file name and the 
file offset.
+
+   **bpftool perf help**
+ Print short help message.
+
+OPTIONS
+===
+   -h, --help
+ Print short generic help message (similar to **bpftool 
help**).
+
+   -v, --version
+ Print version number (similar to **bpftool version**).
+
+   -j, --json
+ Generate JSON output. For commands that cannot produce JSON, 
this
+ option has no effect.
+
+   -p, --pretty
+ Generate 

[PATCH bpf-next v2 3/7] tools/bpf: sync kernel header bpf.h and add bpf_trace_event_query in libbpf

2018-05-17 Thread Yonghong Song
Sync kernel header bpf.h to tools/include/uapi/linux/bpf.h and
implement bpf_task_fd_query() in libbpf. The test programs
in samples/bpf and tools/testing/selftests/bpf, and later bpftool,
will use this libbpf function to query the kernel.

Signed-off-by: Yonghong Song 
---
 tools/include/uapi/linux/bpf.h | 27 +++
 tools/lib/bpf/bpf.c| 24 
 tools/lib/bpf/bpf.h|  3 +++
 3 files changed, 54 insertions(+)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index d94d333..6a22ad4 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -97,6 +97,7 @@ enum bpf_cmd {
BPF_RAW_TRACEPOINT_OPEN,
BPF_BTF_LOAD,
BPF_BTF_GET_FD_BY_ID,
+   BPF_TASK_FD_QUERY,
 };
 
 enum bpf_map_type {
@@ -379,6 +380,22 @@ union bpf_attr {
__u32   btf_log_size;
__u32   btf_log_level;
};
+
+   struct {
+   int pid;/* input: pid */
+   int fd; /* input: fd */
+   __u32   flags;  /* input: flags */
+   __u32   buf_len;/* input: buf len */
+   __aligned_u64   buf;/* input/output:
+*   tp_name for tracepoint
+*   symbol for kprobe
+*   filename for uprobe
+*/
+   __u32   prog_id;/* output: prog_id */
+   __u32   attach_info;/* output: BPF_ATTACH_* */
+   __u64   probe_offset;   /* output: probe_offset */
+   __u64   probe_addr; /* output: probe_addr */
+   } task_fd_query;
 } __attribute__((aligned(8)));
 
 /* The description below is an attempt at providing documentation to eBPF
@@ -2450,4 +2467,14 @@ struct bpf_fib_lookup {
__u8dmac[6]; /* ETH_ALEN */
 };
 
+/* used by <pid, fd> based query */
+enum {
+   BPF_ATTACH_RAW_TRACEPOINT,  /* tp name */
+   BPF_ATTACH_TRACEPOINT,  /* tp name */
+   BPF_ATTACH_KPROBE,  /* (symbol + offset) or addr */
+   BPF_ATTACH_KRETPROBE,   /* (symbol + offset) or addr */
+   BPF_ATTACH_UPROBE,  /* filename + offset */
+   BPF_ATTACH_URETPROBE,   /* filename + offset */
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 6a8a000..da3f336 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -643,3 +643,27 @@ int bpf_load_btf(void *btf, __u32 btf_size, char *log_buf, 
__u32 log_buf_size,
 
return fd;
 }
+
+int bpf_task_fd_query(int pid, int fd, __u32 flags, char *buf, __u32 buf_len,
+ __u32 *prog_id, __u32 *attach_info,
+ __u64 *probe_offset, __u64 *probe_addr)
+{
+   union bpf_attr attr = {};
+   int err;
+
+   attr.task_fd_query.pid = pid;
+   attr.task_fd_query.fd = fd;
+   attr.task_fd_query.flags = flags;
+   attr.task_fd_query.buf = ptr_to_u64(buf);
+   attr.task_fd_query.buf_len = buf_len;
+
+   err = sys_bpf(BPF_TASK_FD_QUERY, &attr, sizeof(attr));
+   if (!err) {
+   *prog_id = attr.task_fd_query.prog_id;
+   *attach_info = attr.task_fd_query.attach_info;
+   *probe_offset = attr.task_fd_query.probe_offset;
+   *probe_addr = attr.task_fd_query.probe_addr;
+   }
+
+   return err;
+}
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 15bff77..9adfde6 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -107,4 +107,7 @@ int bpf_prog_query(int target_fd, enum bpf_attach_type 
type, __u32 query_flags,
 int bpf_raw_tracepoint_open(const char *name, int prog_fd);
 int bpf_load_btf(void *btf, __u32 btf_size, char *log_buf, __u32 log_buf_size,
 bool do_log);
+int bpf_task_fd_query(int pid, int fd, __u32 flags, char *buf, __u32 buf_len,
+ __u32 *prog_id, __u32 *prog_info,
+ __u64 *probe_offset, __u64 *probe_addr);
 #endif
-- 
2.9.5



[PATCH bpf-next v2 5/7] samples/bpf: add a samples/bpf test for BPF_TASK_FD_QUERY

2018-05-17 Thread Yonghong Song
This is mostly to test kprobe/uprobe which needs kernel headers.

Signed-off-by: Yonghong Song 
---
 samples/bpf/Makefile |   4 +
 samples/bpf/task_fd_query_kern.c |  19 ++
 samples/bpf/task_fd_query_user.c | 379 +++
 3 files changed, 402 insertions(+)
 create mode 100644 samples/bpf/task_fd_query_kern.c
 create mode 100644 samples/bpf/task_fd_query_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 62d1aa1..7dc85ed 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -51,6 +51,7 @@ hostprogs-y += cpustat
 hostprogs-y += xdp_adjust_tail
 hostprogs-y += xdpsock
 hostprogs-y += xdp_fwd
+hostprogs-y += task_fd_query
 
 # Libbpf dependencies
 LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a
@@ -105,6 +106,7 @@ cpustat-objs := bpf_load.o cpustat_user.o
 xdp_adjust_tail-objs := xdp_adjust_tail_user.o
 xdpsock-objs := bpf_load.o xdpsock_user.o
 xdp_fwd-objs := bpf_load.o xdp_fwd_user.o
+task_fd_query-objs := bpf_load.o task_fd_query_user.o $(TRACE_HELPERS)
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -160,6 +162,7 @@ always += cpustat_kern.o
 always += xdp_adjust_tail_kern.o
 always += xdpsock_kern.o
 always += xdp_fwd_kern.o
+always += task_fd_query_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/lib/
@@ -175,6 +178,7 @@ HOSTCFLAGS_offwaketime_user.o += -I$(srctree)/tools/lib/bpf/
 HOSTCFLAGS_spintest_user.o += -I$(srctree)/tools/lib/bpf/
 HOSTCFLAGS_trace_event_user.o += -I$(srctree)/tools/lib/bpf/
 HOSTCFLAGS_sampleip_user.o += -I$(srctree)/tools/lib/bpf/
+HOSTCFLAGS_task_fd_query_user.o += -I$(srctree)/tools/lib/bpf/
 
 HOST_LOADLIBES += $(LIBBPF) -lelf
 HOSTLOADLIBES_tracex4  += -lrt
diff --git a/samples/bpf/task_fd_query_kern.c b/samples/bpf/task_fd_query_kern.c
new file mode 100644
index 000..f4b0a9e
--- /dev/null
+++ b/samples/bpf/task_fd_query_kern.c
@@ -0,0 +1,19 @@
+// SPDX-License-Identifier: GPL-2.0
+#include 
+#include 
+#include 
+#include "bpf_helpers.h"
+
+SEC("kprobe/blk_start_request")
+int bpf_prog1(struct pt_regs *ctx)
+{
+   return 0;
+}
+
+SEC("kretprobe/blk_account_io_completion")
+int bpf_prog2(struct pt_regs *ctx)
+{
+   return 0;
+}
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/task_fd_query_user.c b/samples/bpf/task_fd_query_user.c
new file mode 100644
index 000..792ef24
--- /dev/null
+++ b/samples/bpf/task_fd_query_user.c
@@ -0,0 +1,379 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "libbpf.h"
+#include "bpf_load.h"
+#include "bpf_util.h"
+#include "perf-sys.h"
+#include "trace_helpers.h"
+
+#define CHECK_PERROR_RET(condition) ({ \
+   int __ret = !!(condition);  \
+   if (__ret) {\
+   printf("FAIL: %s:\n", __func__);\
+   perror(""); \
+   return -1;  \
+   }   \
+})
+
+#define CHECK_AND_RET(condition) ({\
+   int __ret = !!(condition);  \
+   if (__ret)  \
+   return -1;  \
+})
+
+static __u64 ptr_to_u64(void *ptr)
+{
+   return (__u64) (unsigned long) ptr;
+}
+
+#define PMU_TYPE_FILE "/sys/bus/event_source/devices/%s/type"
+static int bpf_find_probe_type(const char *event_type)
+{
+   char buf[256];
+   int fd, ret;
+
+   ret = snprintf(buf, sizeof(buf), PMU_TYPE_FILE, event_type);
+   CHECK_PERROR_RET(ret < 0 || ret >= sizeof(buf));
+
+   fd = open(buf, O_RDONLY);
+   CHECK_PERROR_RET(fd < 0);
+
+   ret = read(fd, buf, sizeof(buf));
+   close(fd);
+   CHECK_PERROR_RET(ret < 0 || ret >= sizeof(buf));
+
+   errno = 0;
+   ret = (int)strtol(buf, NULL, 10);
+   CHECK_PERROR_RET(errno);
+   return ret;
+}
+
+#define PMU_RETPROBE_FILE "/sys/bus/event_source/devices/%s/format/retprobe"
+static int bpf_get_retprobe_bit(const char *event_type)
+{
+   char buf[256];
+   int fd, ret;
+
+   ret = snprintf(buf, sizeof(buf), PMU_RETPROBE_FILE, event_type);
+   CHECK_PERROR_RET(ret < 0 || ret >= sizeof(buf));
+
+   fd = open(buf, O_RDONLY);
+   CHECK_PERROR_RET(fd < 0);
+
+   ret = read(fd, buf, sizeof(buf));
+   close(fd);
+   CHECK_PERROR_RET(ret < 0 || ret >= sizeof(buf));
+   CHECK_PERROR_RET(strlen(buf) < strlen("config:"));
+
+   errno = 0;
+   ret = (int)strtol(buf + strlen("config:"), NULL, 10);
+   CHECK_PERROR_RET(errno);
+   return ret;
+}
+
+static int test_debug_fs_kprobe(int fd_idx, const char *fn_name,
+   

[PATCH bpf-next v2 6/7] tools/bpf: add two BPF_TASK_FD_QUERY tests in test_progs

2018-05-17 Thread Yonghong Song
The new tests are added to query perf_event information
for raw_tracepoint and tracepoint attachment. For tracepoint,
both syscalls and non-syscalls tracepoints are queried as
they are treated slightly differently inside the kernel.

Signed-off-by: Yonghong Song 
---
 tools/testing/selftests/bpf/test_progs.c | 133 +++
 1 file changed, 133 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_progs.c 
b/tools/testing/selftests/bpf/test_progs.c
index 3ecf733..f7ede03 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -1542,6 +1542,137 @@ static void test_get_stack_raw_tp(void)
bpf_object__close(obj);
 }
 
+static void test_task_fd_query_rawtp(void)
+{
+   const char *file = "./test_get_stack_rawtp.o";
+   struct perf_event_attr attr = {};
+   __u64 probe_offset, probe_addr;
+   int efd, err, prog_fd, pmu_fd;
+   __u32 prog_id, attach_info;
+   struct bpf_object *obj;
+   __u32 duration = 0;
+   char buf[256];
+
+   err = bpf_prog_load(file, BPF_PROG_TYPE_RAW_TRACEPOINT, &obj, &prog_fd);
+   if (CHECK(err, "prog_load raw tp", "err %d errno %d\n", err, errno))
+   return;
+
+   efd = bpf_raw_tracepoint_open("sys_enter", prog_fd);
+   if (CHECK(efd < 0, "raw_tp_open", "err %d errno %d\n", efd, errno))
+   goto close_prog;
+
+   attr.sample_type = PERF_SAMPLE_RAW;
+   attr.type = PERF_TYPE_SOFTWARE;
+   attr.config = PERF_COUNT_SW_BPF_OUTPUT;
+   pmu_fd = syscall(__NR_perf_event_open, &attr, getpid(), -1, -1, 0);
+   if (CHECK(pmu_fd < 0, "perf_event_open", "err %d errno %d\n", pmu_fd,
+ errno))
+   goto close_prog;
+
+   err = ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0);
+   if (CHECK(err < 0, "ioctl PERF_EVENT_IOC_ENABLE", "err %d errno %d\n",
+ err, errno))
+   goto close_prog;
+
+   /* query (getpid(), efd) */
+   err = bpf_task_fd_query(getpid(), efd, 0, buf, 256, &prog_id,
+   &attach_info, &probe_offset, &probe_addr);
+   if (CHECK(err < 0, "bpf_trace_event_query", "err %d errno %d\n", err,
+ errno))
+   goto close_prog;
+
+   err = (attach_info == BPF_ATTACH_RAW_TRACEPOINT) &&
+ (strcmp(buf, "sys_enter") == 0);
+   if (CHECK(!err, "check_results", "attach_info %d tp_name %s\n",
+ attach_info, buf))
+   goto close_prog;
+
+   goto close_prog_noerr;
+close_prog:
+   error_cnt++;
+close_prog_noerr:
+   bpf_object__close(obj);
+}
+
+static void test_task_fd_query_tp_core(const char *probe_name,
+  const char *tp_name)
+{
+   const char *file = "./test_tracepoint.o";
+   int err, bytes, efd, prog_fd, pmu_fd;
+   struct perf_event_attr attr = {};
+   __u64 probe_offset, probe_addr;
+   __u32 prog_id, attach_info;
+   struct bpf_object *obj;
+   __u32 duration = 0;
+   char buf[256];
+
+   err = bpf_prog_load(file, BPF_PROG_TYPE_TRACEPOINT, &obj, &prog_fd);
+   if (CHECK(err, "bpf_prog_load", "err %d errno %d\n", err, errno))
+   goto close_prog;
+
+   snprintf(buf, sizeof(buf),
+"/sys/kernel/debug/tracing/events/%s/id", probe_name);
+   efd = open(buf, O_RDONLY, 0);
+   if (CHECK(efd < 0, "open", "err %d errno %d\n", efd, errno))
+   goto close_prog;
+   bytes = read(efd, buf, sizeof(buf));
+   close(efd);
+   if (CHECK(bytes <= 0 || bytes >= sizeof(buf), "read",
+ "bytes %d errno %d\n", bytes, errno))
+   goto close_prog;
+
+   attr.config = strtol(buf, NULL, 0);
+   attr.type = PERF_TYPE_TRACEPOINT;
+   attr.sample_type = PERF_SAMPLE_RAW;
+   attr.sample_period = 1;
+   attr.wakeup_events = 1;
+   pmu_fd = syscall(__NR_perf_event_open, &attr, -1 /* pid */,
+0 /* cpu 0 */, -1 /* group id */,
+0 /* flags */);
+   if (CHECK(pmu_fd < 0, "perf_event_open", "err %d errno %d\n", pmu_fd,
+ errno))
+   goto close_pmu;
+
+   err = ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0);
+   if (CHECK(err, "perf_event_ioc_enable", "err %d errno %d\n", err,
+ errno))
+   goto close_pmu;
+
+   err = ioctl(pmu_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
+   if (CHECK(err, "perf_event_ioc_set_bpf", "err %d errno %d\n", err,
+ errno))
+   goto close_pmu;
+
+   /* query (getpid(), pmu_fd) */
+   err = bpf_task_fd_query(getpid(), pmu_fd, 0, buf, 256, &prog_id,
+   &attach_info, &probe_offset, &probe_addr);
+   if (CHECK(err < 0, "bpf_trace_event_query", "err %d errno %d\n", err,
+ errno))
+   goto close_pmu;
+
+   err = (attach_info == BPF_ATTACH_TRACEPOINT) && !strcmp(buf, tp_name);
+   if (CHECK(!err, "check_results", "attach_info %d tp_name %s\n",
+  

Re: [PATCH bpf 5/6] tools: bpftool: resolve calls without using imm field

2018-05-17 Thread Sandipan Das
Hi Jakub,

On 05/18/2018 12:21 AM, Jakub Kicinski wrote:
> On Thu, 17 May 2018 12:05:47 +0530, Sandipan Das wrote:
>> Currently, we resolve the callee's address for a JITed function
>> call by using the imm field of the call instruction as an offset
>> from __bpf_call_base. If bpf_jit_kallsyms is enabled, we further
>> use this address to get the callee's kernel symbol's name.
>>
>> For some architectures, such as powerpc64, the imm field is not
>> large enough to hold this offset. So, instead of assigning this
>> offset to the imm field, the verifier now assigns the subprog
>> id. Also, a list of kernel symbol addresses for all the JITed
>> functions is provided in the program info. We now use the imm
>> field as an index for this list to lookup a callee's symbol's
>> address and resolve its name.
>>
>> Suggested-by: Daniel Borkmann 
>> Signed-off-by: Sandipan Das 
> 
> A few nit-picks below, thank you for the patch!
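
The lookup described in the quoted commit message can be sketched in
isolation as below. This is only an illustration, not the code from the
patch: resolve_callee_addr() and its parameters are hypothetical names,
and returning 0 for an out-of-range index is an assumption made for the
sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not taken from the patch: resolve the address of a
 * BPF-to-BPF callee from the imm field of the call instruction.
 */
static uint64_t resolve_callee_addr(const uint64_t *jited_ksyms,
				    uint32_t nr_jited_ksyms,
				    int32_t imm, uint64_t call_base)
{
	/* New scheme: imm holds the subprog id, used as an index in the list. */
	if (jited_ksyms && nr_jited_ksyms) {
		if (imm >= 0 && (uint32_t)imm < nr_jited_ksyms)
			return jited_ksyms[imm];
		return 0;	/* assumption: 0 means "address unknown" */
	}

	/* Old scheme: imm is a signed offset from __bpf_call_base. */
	return call_base + (uint64_t)(int64_t)imm;
}
```

When the kernel supplies no address list, the helper falls back to the
offset-based calculation, so callers do not need to know which scheme the
running kernel uses.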
> 
>>  tools/bpf/bpftool/prog.c  | 31 +++
>>  tools/bpf/bpftool/xlated_dumper.c | 24 +---
>>  tools/bpf/bpftool/xlated_dumper.h |  2 ++
>>  3 files changed, 50 insertions(+), 7 deletions(-)
>>
>> diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
>> index 9bdfdf2d3fbe..ac2f62a97e84 100644
>> --- a/tools/bpf/bpftool/prog.c
>> +++ b/tools/bpf/bpftool/prog.c
>> @@ -430,6 +430,10 @@ static int do_dump(int argc, char **argv)
>>  unsigned char *buf;
>>  __u32 *member_len;
>>  __u64 *member_ptr;
>> +unsigned int nr_addrs;
>> +unsigned long *addrs = NULL;
>> +__u32 *ksyms_len;
>> +__u64 *ksyms_ptr;
> 
> nit: please try to keep the variables ordered longest to shortest like
> we do in networking code (please do it in all functions).
> 
>>  ssize_t n;
>>  int err;
>>  int fd;
>> @@ -437,6 +441,8 @@ static int do_dump(int argc, char **argv)
>>  if (is_prefix(*argv, "jited")) {
>>  member_len = &info.jited_prog_len;
>>  member_ptr = &info.jited_prog_insns;
>> +ksyms_len = &info.nr_jited_ksyms;
>> +ksyms_ptr = &info.jited_ksyms;
>>  } else if (is_prefix(*argv, "xlated")) {
>>  member_len = &info.xlated_prog_len;
>>  member_ptr = &info.xlated_prog_insns;
>> @@ -496,10 +502,23 @@ static int do_dump(int argc, char **argv)
>>  return -1;
>>  }
>>  
>> +nr_addrs = *ksyms_len;
> 
> Here and ...
> 
>> +if (nr_addrs) {
>> +addrs = malloc(nr_addrs * sizeof(__u64));
>> +if (!addrs) {
>> +p_err("mem alloc failed");
>> +free(buf);
>> +close(fd);
>> +return -1;
> 
> You can just jump to err_free here.
> 
>> +}
>> +}
>> +
>>  memset(&info, 0, sizeof(info));
>>  
>>  *member_ptr = ptr_to_u64(buf);
>>  *member_len = buf_size;
>> +*ksyms_ptr = ptr_to_u64(addrs);
>> +*ksyms_len = nr_addrs;
> 
> ... here - this function is getting long, so maybe I'm not seeing
> something, but are ksyms_ptr and ksyms_len guaranteed to be initialized?
> 
>>  err = bpf_obj_get_info_by_fd(fd, &info, &len);
>>  close(fd);
>> @@ -513,6 +532,11 @@ static int do_dump(int argc, char **argv)
>>  goto err_free;
>>  }
>>  
>> +if (*ksyms_len > nr_addrs) {
>> +p_err("too many addresses returned");
>> +goto err_free;
>> +}
>> +
>>  if ((member_len == &info.jited_prog_len &&
>>   info.jited_prog_insns == 0) ||
>>  (member_len == &info.xlated_prog_len &&
>> @@ -558,6 +582,9 @@ static int do_dump(int argc, char **argv)
>>  dump_xlated_cfg(buf, *member_len);
>>  } else {
>>  kernel_syms_load();
>> +dd.jited_ksyms = ksyms_ptr;
>> +dd.nr_jited_ksyms = *ksyms_len;
>> +
>>  if (json_output)
>>  dump_xlated_json(&dd, buf, *member_len, opcodes);
>>  else
>> @@ -566,10 +593,14 @@ static int do_dump(int argc, char **argv)
>>  }
>>  
>>  free(buf);
>> +if (addrs)
>> +free(addrs);
> 
> Free can deal with NULL pointers, no need for an if.
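
Both cleanup remarks (jump to the common error label, and call free()
unconditionally) can be shown with a small stand-alone sketch.
alloc_dump_buffers() and its force_error parameter are hypothetical
stand-ins for the allocation part of do_dump(), not code from bpftool.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for the allocation part of do_dump(): every failure
 * path jumps to one label, and free() is called without a NULL check because
 * the C standard defines free(NULL) as a no-op.
 */
static int alloc_dump_buffers(size_t buf_size, size_t nr_addrs, int force_error)
{
	unsigned long long *addrs = NULL;
	unsigned char *buf = NULL;
	int ret = -1;

	buf = malloc(buf_size);
	if (!buf)
		goto out_free;

	if (nr_addrs) {
		addrs = calloc(nr_addrs, sizeof(*addrs));
		if (!addrs)
			goto out_free;
	}

	/* Simulates a later failure, e.g. the info query returning an error. */
	if (force_error)
		goto out_free;

	memset(buf, 0, buf_size);
	ret = 0;

out_free:
	free(addrs);	/* addrs may still be NULL here, which is fine */
	free(buf);
	return ret;
}
```

With pointers initialized to NULL and a single label, adding another
buffer later only needs one more free() call.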
> 
>>  return 0;
>>  
>>  err_free:
>>  free(buf);
>> +if (addrs)
>> +free(addrs);
>>  return -1;
>>  }
>>  
>> diff --git a/tools/bpf/bpftool/xlated_dumper.c 
>> b/tools/bpf/bpftool/xlated_dumper.c
>> index 7a3173b76c16..dc8e4eca0387 100644
>> --- a/tools/bpf/bpftool/xlated_dumper.c
>> +++ b/tools/bpf/bpftool/xlated_dumper.c
>> @@ -178,8 +178,12 @@ static const char *print_call_pcrel(struct dump_data 
>> *dd,
>>  snprintf(dd->scratch_buff, sizeof(dd->scratch_buff),
>>   "%+d#%s", insn->off, sym->name);
>>  else
> 
> else if (address)
> 
> saves us the indentation.
> 
>> -snprintf(dd->scratch_buff, sizeof(dd->scratch_buff),
>> - "%+d#0x%lx", insn->off, address);
>> +if (address)
>> +snprintf(dd->scratch_buff, 

Re: [PATCH net-next v12 3/7] sch_cake: Add optional ACK filter

2018-05-17 Thread Cong Wang
On Thu, May 17, 2018 at 4:23 AM, Toke Høiland-Jørgensen  wrote:
> Eric Dumazet  writes:
>
>> On 05/16/2018 01:29 PM, Toke Høiland-Jørgensen wrote:
>>> The ACK filter is an optional feature of CAKE which is designed to improve
>>> performance on links with very asymmetrical rate limits. On such links
>>> (which are unfortunately quite prevalent, especially for DSL and cable
>>> subscribers), the downstream throughput can be limited by the number of
>>> ACKs capable of being transmitted in the *upstream* direction.
>>>
>>
>> ...
>>
>>>
>>> Signed-off-by: Toke Høiland-Jørgensen 
>>> ---
>>>  net/sched/sch_cake.c |  260 
>>> ++
>>>  1 file changed, 258 insertions(+), 2 deletions(-)
>>>
>>>
>>
>> I have decided to implement ACK compression in TCP stack itself.
>
> Awesome! Will look forward to seeing that!

+1

It is really odd to put into a TC qdisc, TCP stack is a much better
place.


Re: [RFC PATCH bpf-next 12/12] i40e: implement Tx zero-copy

2018-05-17 Thread Björn Töpel
2018-05-17 23:31 GMT+02:00 Jesper Dangaard Brouer :
>
> On Tue, 15 May 2018 21:06:15 +0200 Björn Töpel  wrote:
>
>> From: Magnus Karlsson 
>>
>> Here, the zero-copy ndo is implemented. As a shortcut, the existing
>> XDP Tx rings are used for zero-copy. This means that an XDP program
>> cannot redirect to an AF_XDP enabled XDP Tx ring.
>
> This "shortcut" is not acceptable, and completely broken.  The
> XDP_REDIRECT queue_index is based on smp_processor_id(), and can easily
> clash with the configured XSK queue_index.  Provided a bit more code
> context below...
>

Yes, and this is the reason we need to go for a solution with
dedicated Tx rings. Again, we chose not to, and simply drop
XDP_REDIRECT where the AF_XDP queue id clashes with the processor id.
The queue id is hijacked by AF_XDP's egress side.

> On Tue, 15 May 2018 21:06:15 +0200
> Björn Töpel  wrote:
>
> int i40e_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf)
> {
> struct i40e_netdev_priv *np = netdev_priv(dev);
> unsigned int queue_index = smp_processor_id();
> struct i40e_vsi *vsi = np->vsi;
> int err;
>
> if (test_bit(__I40E_VSI_DOWN, vsi->state))
> return -ENETDOWN;
>
>> @@ -4025,6 +4158,9 @@ int i40e_xdp_xmit(struct net_device *dev, struct 
>> xdp_frame *xdpf)
>>   if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
>>   return -ENXIO;
>>
>> + if (vsi->xdp_rings[queue_index]->xsk_umem)
>> + return -ENXIO;
>> +
>
> Using the same errno makes this impossible to debug (via the tracepoints).
>

The rationale was that the situation was similar to an incorrectly
configured receiving (from an XDP_REDIRECT perspective) interface.

We'll rework this! Thanks for looking into this, Jesper!


Björn

>>   err = i40e_xmit_xdp_ring(xdpf, vsi->xdp_rings[queue_index]);
>>   if (err != I40E_XDP_TX)
>>   return -ENOSPC;
>> @@ -4048,5 +4184,34 @@ void i40e_xdp_flush(struct net_device *dev)
>>   if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
>>   return;
>>
>> + if (vsi->xdp_rings[queue_index]->xsk_umem)
>> + return;
>> +
>>   i40e_xdp_ring_update_tail(vsi->xdp_rings[queue_index]);
>>  }
>
>
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer


Re: [net-next PATCH v2 0/4] Symmetric queue selection using XPS for Rx queues

2018-05-17 Thread Tom Herbert
On Tue, May 15, 2018 at 6:26 PM, Amritha Nambiar
 wrote:
> This patch series implements support for Tx queue selection based on
> Rx queue(s) map. This is done by configuring Rx queue(s) map per Tx-queue
> using sysfs attribute. If the user configuration for Rx queues does
> not apply, then the Tx queue selection falls back to XPS using CPUs and
> finally to hashing.
>
> XPS is refactored to support Tx queue selection based on either the
> CPUs map or the Rx-queues map. The config option CONFIG_XPS needs to be
> enabled. By default no receive queues are configured for the Tx queue.
>
> - /sys/class/net//queues/tx-*/xps_rxqs
>
> This is to enable sending packets on the same Tx-Rx queue pair as this

If I'm reading the patch correctly, isn't this mapping an rxq to a set of
txqs (in other words, not strictly a queue pair, which has other
connotations in NIC HW)? It is important to make it clear that this
feature is not HW dependent.
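
The fallback order described in the cover letter (Rx-queue map, then CPU
map, then hashing) can be sketched as below. This is a simplified model,
not the kernel implementation: struct txq_maps, pick_txq() and NO_QUEUE
are hypothetical names, and each map entry holds a single Tx queue here
rather than a set of Tx queues.

```c
#include <assert.h>
#include <stdint.h>

#define NO_QUEUE (-1)

/* Hypothetical, simplified maps: one Tx queue (or NO_QUEUE) per entry. */
struct txq_maps {
	const int *rxq_to_txq;	/* indexed by Rx queue, may be NULL */
	unsigned int nr_rxqs;
	const int *cpu_to_txq;	/* indexed by CPU, may be NULL */
	unsigned int nr_cpus;
	unsigned int nr_txqs;	/* must be non-zero */
};

static int pick_txq(const struct txq_maps *m, int rxq, int cpu, uint32_t hash)
{
	/* 1. Rx-queue map, if one is configured for this Rx queue. */
	if (m->rxq_to_txq && rxq >= 0 && (unsigned int)rxq < m->nr_rxqs &&
	    m->rxq_to_txq[rxq] != NO_QUEUE)
		return m->rxq_to_txq[rxq];

	/* 2. CPU map (classic XPS). */
	if (m->cpu_to_txq && cpu >= 0 && (unsigned int)cpu < m->nr_cpus &&
	    m->cpu_to_txq[cpu] != NO_QUEUE)
		return m->cpu_to_txq[cpu];

	/* 3. Last resort: spread flows over all Tx queues by hash. */
	return (int)(hash % m->nr_txqs);
}
```

Nothing in this selection depends on NIC hardware; the maps are plain
software state, which is the point raised above.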

> is useful for busy polling multi-threaded workloads where it is not
> possible to pin the threads to a CPU. This is a rework of Sridhar's
> patch for symmetric queueing via socket option:
> https://www.spinics.net/lists/netdev/msg453106.html
>
Please add something about how this was tested and what the
performance gain is to justify the feature.

> v2:
> - Added documentation in networking/scaling.txt
> - Added a simple routine to replace multiple ifdef blocks.
>
> ---
>
> Amritha Nambiar (4):
>   net: Refactor XPS for CPUs and Rx queues
>   net: Enable Tx queue selection based on Rx queues
>   net-sysfs: Add interface for Rx queue map per Tx queue
>   Documentation: Add explanation for XPS using Rx-queue map
>
>
>  Documentation/networking/scaling.txt |   58 +++-
>  include/linux/cpumask.h  |   11 +-
>  include/linux/netdevice.h|   72 ++
>  include/net/sock.h   |   18 +++
>  net/core/dev.c   |  242 
> +++---
>  net/core/net-sysfs.c |   85 
>  net/core/sock.c  |5 +
>  net/ipv4/tcp_input.c |7 +
>  net/ipv4/tcp_ipv4.c  |1
>  net/ipv4/tcp_minisocks.c |1
>  10 files changed, 404 insertions(+), 96 deletions(-)
>
> --


Re: [Cake] [PATCH net-next v12 3/7] sch_cake: Add optional ACK filter

2018-05-17 Thread Eric Dumazet


On 05/17/2018 07:36 PM, Ryan Mounce wrote:
> On 17 May 2018 at 22:41, Toke Høiland-Jørgensen  wrote:
>> Eric Dumazet  writes:
>>
>>> On 05/17/2018 04:23 AM, Toke Høiland-Jørgensen wrote:
>>>

>>>> We don't do full parsing of SACKs, no; we were trying to keep things
>>>> simple... We do detect the presence of SACK options, though, and the
>>>> presence of SACK options on an ACK will make previous ACKs be considered
>>>> redundant.

>>>
>>> But they are not redundant in some cases, particularly when reorders
>>> happen in the network.
>>
>> Huh. I was under the impression that SACKs were basically cumulative
>> until cleared.
>>
>> I.e., in packet sequence ABCDE where B and D are lost, C would have
>> SACK(B) and E would have SACK(B,D). Are you saying that E would only
>> have SACK(D)?
> 
> SACK works by acknowledging additional ranges above those that have
> been ACKed, rather than ACKing up to the largest seen sequence number
> and reporting missing ranges before that.
> 
> A - ACK(A)
> B - lost
> C - ACK(A) + SACK(C)
> D - lost
> E - ACK(A) + SACK(C, E)
> 
> Cake does check that the ACK sequence number is greater, or if it is
> equal and the 'newer' ACK has the SACK option present. It doesn't
> compare the sequence numbers inside two SACKs. If the two SACKs in the
> above example had been reordered before reaching cake's ACK filter in
> aggressive mode, the wrong one will be filtered.
> 
> This is a limitation of my naive SACK handling in cake. The default
> 'conservative' mode happens to mitigate the problem in the above
> scenario, but the issue could still present itself in more
> pathological cases. It's fixable, however I'm not sure this corner
> case is sufficiently common or severe to warrant the extra complexity.

The extra complexity is absolutely required for inclusion in upstream Linux.

I recommend reading RFC 2018, the whole of section 4 (Generating Sack Options:
Data Receiver Behavior).

The proposed ACK filter in Cake breaks the protocol, since the first rule is
not respected:

* The first SACK block (i.e., the one immediately following the
  kind and length fields in the option) MUST specify the contiguous
  block of data containing the segment which triggered this ACK,
  unless that segment advanced the Acknowledgment Number field in
  the header.  This assures that the ACK with the SACK option
  reflects the most recent change in the data receiver's buffer
  queue.


An ACK filter must either:

Not merge ACKs if they contain different SACK blocks.

Or make a precise analysis of the SACK blocks to determine whether the merge
is allowed, i.e. no useful information is lost.

The sender should get all the information about which segments were received
correctly, assuming no ACKs are dropped because of congestion on the return path.
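
A minimal sketch of the first option (never merge ACKs whose SACK blocks
differ) is shown below. It is not Cake's filter: struct ack_info,
can_drop_older_ack() and the already-parsed SACK array are hypothetical,
and the check is deliberately conservative, keeping the older ACK
whenever the two SACK lists are not identical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_SACK_BLOCKS 4	/* upper bound on SACK blocks in one TCP option */

struct sack_block {
	uint32_t start;
	uint32_t end;
};

/* Hypothetical, already-parsed view of a pure ACK. */
struct ack_info {
	uint32_t ack_seq;
	unsigned int nr_sacks;
	struct sack_block sacks[MAX_SACK_BLOCKS];
};

/* Wraparound-safe "a is before b" for 32-bit sequence numbers. */
static bool seq_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/* True only if dropping the older ACK cannot lose SACK information. */
static bool can_drop_older_ack(const struct ack_info *older,
			       const struct ack_info *newer)
{
	unsigned int i;

	/* The newer ACK must not acknowledge less than the older one. */
	if (seq_before(newer->ack_seq, older->ack_seq))
		return false;

	/* Different SACK contents: keep both ACKs. */
	if (older->nr_sacks != newer->nr_sacks)
		return false;
	for (i = 0; i < older->nr_sacks; i++) {
		if (older->sacks[i].start != newer->sacks[i].start ||
		    older->sacks[i].end != newer->sacks[i].end)
			return false;
	}
	return true;
}
```

In the ABCDE example quoted above, the ACK carrying SACK(C) and the one
carrying SACK(C, E) have different block lists, so this check keeps both
regardless of the order in which they reach the filter.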










Re: [net-next PATCH v2 1/4] net: Refactor XPS for CPUs and Rx queues

2018-05-17 Thread Tom Herbert
On Tue, May 15, 2018 at 6:26 PM, Amritha Nambiar
 wrote:
> Refactor XPS code to support Tx queue selection based on
> CPU map or Rx queue map.
>
> Signed-off-by: Amritha Nambiar 
> ---
>  include/linux/cpumask.h   |   11 ++
>  include/linux/netdevice.h |   72 +++-
>  net/core/dev.c|  208 
> +
>  net/core/net-sysfs.c  |4 -
>  4 files changed, 215 insertions(+), 80 deletions(-)
>
> diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> index bf53d89..57f20a0 100644
> --- a/include/linux/cpumask.h
> +++ b/include/linux/cpumask.h
> @@ -115,12 +115,17 @@ extern struct cpumask __cpu_active_mask;
>  #define cpu_active(cpu)((cpu) == 0)
>  #endif
>
> -/* verify cpu argument to cpumask_* operators */
> -static inline unsigned int cpumask_check(unsigned int cpu)
> +static inline void cpu_max_bits_warn(unsigned int cpu, unsigned int bits)
>  {
>  #ifdef CONFIG_DEBUG_PER_CPU_MAPS
> -   WARN_ON_ONCE(cpu >= nr_cpumask_bits);
> +   WARN_ON_ONCE(cpu >= bits);
>  #endif /* CONFIG_DEBUG_PER_CPU_MAPS */
> +}
> +
> +/* verify cpu argument to cpumask_* operators */
> +static inline unsigned int cpumask_check(unsigned int cpu)
> +{
> +   cpu_max_bits_warn(cpu, nr_cpumask_bits);
> return cpu;
>  }
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 03ed492..c2eeb36 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -730,10 +730,21 @@ struct xps_map {
>   */
>  struct xps_dev_maps {
> struct rcu_head rcu;
> -   struct xps_map __rcu *cpu_map[0];
> +   struct xps_map __rcu *attr_map[0];
>  };
> -#define XPS_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) + \
> +
> +#define XPS_CPU_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) + \
> (nr_cpu_ids * (_tcs) * sizeof(struct xps_map *)))
> +
> +#define XPS_RXQ_DEV_MAPS_SIZE(_tcs, _rxqs) (sizeof(struct xps_dev_maps) +\
> +   (_rxqs * (_tcs) * sizeof(struct xps_map *)))
> +
> +enum xps_map_type {
> +   XPS_MAP_RXQS,
> +   XPS_MAP_CPUS,
> +   __XPS_MAP_MAX
> +};
> +
>  #endif /* CONFIG_XPS */
>
>  #define TC_MAX_QUEUE   16
> @@ -1891,7 +1902,7 @@ struct net_device {
> int watchdog_timeo;
>
>  #ifdef CONFIG_XPS
> -   struct xps_dev_maps __rcu *xps_maps;
> +   struct xps_dev_maps __rcu *xps_maps[__XPS_MAP_MAX];
>  #endif
>  #ifdef CONFIG_NET_CLS_ACT
> struct mini_Qdisc __rcu *miniq_egress;
> @@ -3229,6 +3240,61 @@ static inline void netif_wake_subqueue(struct 
> net_device *dev, u16 queue_index)
>  #ifdef CONFIG_XPS
>  int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
> u16 index);
> +int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
> + u16 index, enum xps_map_type type);
> +
> +static inline bool attr_test_mask(unsigned long j, const unsigned long *mask,
> + unsigned int nr_bits)
> +{
> +   cpu_max_bits_warn(j, nr_bits);
> +   return test_bit(j, mask);
> +}
> +
> +static inline bool attr_test_online(unsigned long j,
> +   const unsigned long *online_mask,
> +   unsigned int nr_bits)
> +{
> +   cpu_max_bits_warn(j, nr_bits);
> +
> +   if (online_mask)
> +   return test_bit(j, online_mask);
> +
> +   if (j >= 0 && j < nr_bits)
> +   return true;
> +
> +   return false;
> +}
> +
> +static inline unsigned int attrmask_next(int n, const unsigned long *srcp,
> +unsigned int nr_bits)
> +{
> +   /* -1 is a legal arg here. */
> +   if (n != -1)
> +   cpu_max_bits_warn(n, nr_bits);
> +
> +   if (srcp)
> +   return find_next_bit(srcp, nr_bits, n + 1);
> +
> +   return n + 1;
> +}
> +
> +static inline int attrmask_next_and(int n, const unsigned long *src1p,
> +   const unsigned long *src2p,
> +   unsigned int nr_bits)
> +{
> +   /* -1 is a legal arg here. */
> +   if (n != -1)
> +   cpu_max_bits_warn(n, nr_bits);
> +
> +   if (src1p && src2p)
> +   return find_next_and_bit(src1p, src2p, nr_bits, n + 1);
> +   else if (src1p)
> +   return find_next_bit(src1p, nr_bits, n + 1);
> +   else if (src2p)
> +   return find_next_bit(src2p, nr_bits, n + 1);
> +
> +   return n + 1;
> +}
>  #else
>  static inline int netif_set_xps_queue(struct net_device *dev,
>   const struct cpumask *mask,
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 9f43901..7e5dfdb 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2092,7 +2092,7 @@ static bool remove_xps_queue(struct xps_dev_maps 
> *dev_maps,
>   

Re: [net-next PATCH v2 2/4] net: Enable Tx queue selection based on Rx queues

2018-05-17 Thread Tom Herbert
On Tue, May 15, 2018 at 6:26 PM, Amritha Nambiar
 wrote:
> This patch adds support to pick Tx queue based on the Rx queue map
> configuration set by the admin through the sysfs attribute
> for each Tx queue. If the user configuration for receive
> queue map does not apply, then the Tx queue selection falls back
> to CPU map based selection and finally to hashing.
>
> Signed-off-by: Amritha Nambiar 
> Signed-off-by: Sridhar Samudrala 
> ---
>  include/net/sock.h   |   18 ++
>  net/core/dev.c   |   36 +---
>  net/core/sock.c  |5 +
>  net/ipv4/tcp_input.c |7 +++
>  net/ipv4/tcp_ipv4.c  |1 +
>  net/ipv4/tcp_minisocks.c |1 +
>  6 files changed, 61 insertions(+), 7 deletions(-)
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 4f7c584..0613f63 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -139,6 +139,8 @@ typedef __u64 __bitwise __addrpair;
>   * @skc_node: main hash linkage for various protocol lookup tables
>   * @skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol
>   * @skc_tx_queue_mapping: tx queue number for this connection
> + * @skc_rx_queue_mapping: rx queue number for this connection
> + * @skc_rx_ifindex: rx ifindex for this connection
>   * @skc_flags: place holder for sk_flags
>   * %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
>   * %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
> @@ -215,6 +217,10 @@ struct sock_common {
> struct hlist_nulls_node skc_nulls_node;
> };
> int skc_tx_queue_mapping;
> +#ifdef CONFIG_XPS
> +   int skc_rx_queue_mapping;
> +   int skc_rx_ifindex;

Isn't this increasing size of sock_common for a narrow use case functionality?

> +#endif
> union {
> int skc_incoming_cpu;
> u32 skc_rcv_wnd;
> @@ -326,6 +332,10 @@ struct sock {
>  #define sk_nulls_node  __sk_common.skc_nulls_node
>  #define sk_refcnt  __sk_common.skc_refcnt
>  #define sk_tx_queue_mapping__sk_common.skc_tx_queue_mapping
> +#ifdef CONFIG_XPS
> +#define sk_rx_queue_mapping__sk_common.skc_rx_queue_mapping
> +#define sk_rx_ifindex  __sk_common.skc_rx_ifindex
> +#endif
>
>  #define sk_dontcopy_begin  __sk_common.skc_dontcopy_begin
>  #define sk_dontcopy_end__sk_common.skc_dontcopy_end
> @@ -1696,6 +1706,14 @@ static inline int sk_tx_queue_get(const struct sock 
> *sk)
> return sk ? sk->sk_tx_queue_mapping : -1;
>  }
>
> +static inline void sk_mark_rx_queue(struct sock *sk, struct sk_buff *skb)
> +{
> +#ifdef CONFIG_XPS
> +   sk->sk_rx_ifindex = skb->skb_iif;
> +   sk->sk_rx_queue_mapping = skb_get_rx_queue(skb);
> +#endif
> +}
> +
>  static inline void sk_set_socket(struct sock *sk, struct socket *sock)
>  {
> sk_tx_queue_clear(sk);
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 7e5dfdb..4030368 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3458,18 +3458,14 @@ sch_handle_egress(struct sk_buff *skb, int *ret, 
> struct net_device *dev)
>  }
>  #endif /* CONFIG_NET_EGRESS */
>
> -static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
> -{
>  #ifdef CONFIG_XPS
> -   struct xps_dev_maps *dev_maps;
> +static int __get_xps_queue_idx(struct net_device *dev, struct sk_buff *skb,
> +  struct xps_dev_maps *dev_maps, unsigned int 
> tci)
> +{
> struct xps_map *map;
> int queue_index = -1;
>
> -   rcu_read_lock();
> -   dev_maps = rcu_dereference(dev->xps_maps[XPS_MAP_CPUS]);
> if (dev_maps) {
> -   unsigned int tci = skb->sender_cpu - 1;
> -
> if (dev->num_tc) {
> tci *= dev->num_tc;
> tci += netdev_get_prio_tc_map(dev, skb->priority);
> @@ -3486,6 +3482,32 @@ static inline int get_xps_queue(struct net_device 
> *dev, struct sk_buff *skb)
> queue_index = -1;
> }
> }
> +   return queue_index;
> +}
> +#endif
> +
> +static int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
> +{
> +#ifdef CONFIG_XPS
> +   enum xps_map_type i = XPS_MAP_RXQS;
> +   struct xps_dev_maps *dev_maps;
> +   struct sock *sk = skb->sk;
> +   int queue_index = -1;
> +   unsigned int tci = 0;
> +
> +   if (sk && sk->sk_rx_queue_mapping <= dev->real_num_rx_queues &&
> +   dev->ifindex == sk->sk_rx_ifindex)
> +   tci = sk->sk_rx_queue_mapping;
> +
> +   rcu_read_lock();
> +   while (queue_index < 0 && i < __XPS_MAP_MAX) {
> +   if (i == XPS_MAP_CPUS)

This while loop typifies exactly why I don't think the XPS maps should
be 

Re: [PATCH bpf-next v3 00/15] Introducing AF_XDP support

2018-05-17 Thread Alexei Starovoitov

On 5/16/18 11:46 PM, Björn Töpel wrote:

2018-05-04 1:38 GMT+02:00 Alexei Starovoitov :

On Fri, May 04, 2018 at 12:49:09AM +0200, Daniel Borkmann wrote:

On 05/02/2018 01:01 PM, Björn Töpel wrote:

From: Björn Töpel 

This patch set introduces a new address family called AF_XDP that is
optimized for high performance packet processing and, in upcoming
patch sets, zero-copy semantics. In this patch set, we have removed
all zero-copy related code in order to make it smaller, simpler and
hopefully more review friendly. This patch set only supports copy-mode
for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
for RX using the XDP_DRV path. Zero-copy support requires XDP and
driver changes that Jesper Dangaard Brouer is working on. Some of his
work has already been accepted. We will publish our zero-copy support
for RX and TX on top of his patch sets at a later point in time.


+1, would be great to see it land this cycle. Saw few minor nits here
and there but nothing to hold it up, for the series:

Acked-by: Daniel Borkmann 

Thanks everyone!


Great stuff!

Applied to bpf-next, with one condition.
Upcoming zero-copy patches for both RX and TX need to be posted
and reviewed within this release window.
If netdev community as a whole won't be able to agree on the zero-copy
bits we'd need to revert this feature before the next merge window.

Few other minor nits:
patch 3:
+struct xdp_ring {
+   __u32 producer __attribute__((aligned(64)));
+   __u32 consumer __attribute__((aligned(64)));
+};
It kinda begs for cacheline_aligned_in_smp to be introduced for uapi headers.



Hmm, I need some guidance on what a sane uapi variant would be. We
can't have the uapi depend on the kernel build. ARM64, e.g., can have
both 64B and 128B according to the specs. Contemporary IA processors
have 64B.

The simplest, and maybe most future-proof, would be 128B aligned for
all. Another is having 128B for ARM and 64B for all IA. A third option
is having a hand-shaking API (I think virtio has that) for determining
the cache line size, but I'd rather not go down that route.

Thoughts/ideas on what a uapi cacheline_aligned_in_smp version
would look like?


I suspect i40e+arm combination wasn't tested anyway.
The api may have endianness issues too on something like sparc.
I think the way to be backwards compatible in this area
is to make the api usable on x86 only by adding
to include/uapi/linux/if_xdp.h
#if defined(__x86_64__)
#define AF_XDP_CACHE_BYTES 64
#else
#error "AF_XDP support is not yet available for this architecture"
#endif
and doing:
__u32 producer __attribute__((aligned(AF_XDP_CACHE_BYTES)));
__u32 consumer __attribute__((aligned(AF_XDP_CACHE_BYTES)));

And progressively add to this for arm64 and few other archs.
Eventually removing #error and adding some generic define
that's good enough for long tail of architectures that
we really cannot test.
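
The effect of the proposed define on the ring layout can be checked with
a small stand-alone sketch. struct xdp_ring_sketch is a hypothetical copy
of the structure for illustration only, and the non-x86 fallback of 64
bytes is an assumption made so the sketch builds everywhere; the proposal
above uses #error there instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__)
#define AF_XDP_CACHE_BYTES 64
#else
/* Assumption for this sketch only; the proposal uses #error here. */
#define AF_XDP_CACHE_BYTES 64
#endif

/* Hypothetical copy of the ring header, using fixed-width stdint types. */
struct xdp_ring_sketch {
	uint32_t producer __attribute__((aligned(AF_XDP_CACHE_BYTES)));
	uint32_t consumer __attribute__((aligned(AF_XDP_CACHE_BYTES)));
};

/* Producer and consumer indices must land on separate cache lines. */
_Static_assert(offsetof(struct xdp_ring_sketch, producer) == 0,
	       "producer starts the structure");
_Static_assert(offsetof(struct xdp_ring_sketch, consumer) == AF_XDP_CACHE_BYTES,
	       "consumer starts its own cache line");
_Static_assert(sizeof(struct xdp_ring_sketch) == 2 * AF_XDP_CACHE_BYTES,
	       "structure spans exactly two cache lines");
```

Because the layout is fixed by a compile-time constant, raising the value
for another architecture later changes the offsets, which is why the
define has to be settled per architecture before the ABI is exposed there.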



Re: pull-request: bpf 2018-05-18

2018-05-17 Thread David Miller
From: Daniel Borkmann 
Date: Fri, 18 May 2018 02:26:17 +0200

> The following pull-request contains BPF updates for your *net* tree.
> 
> The main changes are:
 ...
> Please consider pulling these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Pulled.

> When this gets later merged into net-next there are two trivial
> BPF conflicts to resolve:
 ...

Thanks a lot for the conflict guidance.


[PATCH v2] net: qcom/emac: Allocate buffers from local node

2018-05-17 Thread Hemanth Puranik
Currently we use non-NUMA-aware allocation for the TPD and RRD buffers;
this patch changes them to use NUMA-friendly allocation.

Signed-off-by: Hemanth Puranik 
---
Change since v1:
- Addressed comments related to ordering

 drivers/net/ethernet/qualcomm/emac/emac-mac.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/qualcomm/emac/emac-mac.c 
b/drivers/net/ethernet/qualcomm/emac/emac-mac.c
index 092718a..031f6e6 100644
--- a/drivers/net/ethernet/qualcomm/emac/emac-mac.c
+++ b/drivers/net/ethernet/qualcomm/emac/emac-mac.c
@@ -683,10 +683,11 @@ static int emac_tx_q_desc_alloc(struct emac_adapter *adpt,
struct emac_tx_queue *tx_q)
 {
struct emac_ring_header *ring_header = &adpt->ring_header;
+   int node = dev_to_node(adpt->netdev->dev.parent);
size_t size;
 
size = sizeof(struct emac_buffer) * tx_q->tpd.count;
-   tx_q->tpd.tpbuff = kzalloc(size, GFP_KERNEL);
+   tx_q->tpd.tpbuff = kzalloc_node(size, GFP_KERNEL, node);
if (!tx_q->tpd.tpbuff)
return -ENOMEM;
 
@@ -723,11 +724,12 @@ static void emac_rx_q_bufs_free(struct emac_adapter *adpt)
 static int emac_rx_descs_alloc(struct emac_adapter *adpt)
 {
struct emac_ring_header *ring_header = &adpt->ring_header;
+   int node = dev_to_node(adpt->netdev->dev.parent);
struct emac_rx_queue *rx_q = &adpt->rx_q;
size_t size;
 
size = sizeof(struct emac_buffer) * rx_q->rfd.count;
-   rx_q->rfd.rfbuff = kzalloc(size, GFP_KERNEL);
+   rx_q->rfd.rfbuff = kzalloc_node(size, GFP_KERNEL, node);
if (!rx_q->rfd.rfbuff)
return -ENOMEM;
 
-- 
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.



[net-next:master 1230/1233] arch/sparc/include/asm/io_64.h:177:20: note: in expansion of macro 'writel'

2018-05-17 Thread kbuild test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git master
head:   538e2de104cfb4ef1acb35af42427bff42adbe4d
commit: 2652113ff043ca2ce1cb3be529b5ca9270c421d4 [1230/1233] net: ethernet: ti: Allow most drivers with COMPILE_TEST
config: sparc64-allyesconfig (attached as .config)
compiler: sparc64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
git checkout 2652113ff043ca2ce1cb3be529b5ca9270c421d4
# save the attached .config to linux build tree
make.cross ARCH=sparc64 

All warnings (new ones prefixed by >>):

   In file included from arch/sparc/include/asm/bug.h:25:0,
from include/linux/bug.h:5,
from include/linux/thread_info.h:12,
from include/asm-generic/preempt.h:5,
from ./arch/sparc/include/generated/asm/preempt.h:1,
from include/linux/preempt.h:81,
from include/linux/spinlock.h:51,
from drivers/net/ethernet/ti/davinci_cpdma.c:16:
   drivers/net/ethernet/ti/davinci_cpdma.c: In function 
'cpdma_desc_pool_destroy':
   drivers/net/ethernet/ti/davinci_cpdma.c:194:7: warning: format '%d' expects 
argument of type 'int', but argument 4 has type 'size_t {aka long unsigned 
int}' [-Wformat=]
  "cpdma_desc_pool size %d != avail %d",
  ^
  gen_pool_size(pool->gen_pool),
  ~
   include/asm-generic/bug.h:92:69: note: in definition of macro '__WARN_printf'
#define __WARN_printf(arg...) warn_slowpath_fmt(__FILE__, __LINE__, arg)
^~~
   drivers/net/ethernet/ti/davinci_cpdma.c:193:2: note: in expansion of macro 
'WARN'
 WARN(gen_pool_size(pool->gen_pool) != gen_pool_avail(pool->gen_pool),
 ^~~~
   drivers/net/ethernet/ti/davinci_cpdma.c:194:7: warning: format '%d' expects 
argument of type 'int', but argument 5 has type 'size_t {aka long unsigned 
int}' [-Wformat=]
  "cpdma_desc_pool size %d != avail %d",
  ^
   drivers/net/ethernet/ti/davinci_cpdma.c:196:7:
  gen_pool_avail(pool->gen_pool));
  ~~
   include/asm-generic/bug.h:92:69: note: in definition of macro '__WARN_printf'
#define __WARN_printf(arg...) warn_slowpath_fmt(__FILE__, __LINE__, arg)
^~~
   drivers/net/ethernet/ti/davinci_cpdma.c:193:2: note: in expansion of macro 
'WARN'
 WARN(gen_pool_size(pool->gen_pool) != gen_pool_avail(pool->gen_pool),
 ^~~~
   drivers/net/ethernet/ti/davinci_cpdma.c: In function 'cpdma_chan_submit':
   drivers/net/ethernet/ti/davinci_cpdma.c:1083:17: warning: passing argument 1 
of 'writel' makes integer from pointer without a cast [-Wint-conversion]
 writel_relaxed(token, >sw_token);
^
   In file included from arch/sparc/include/asm/io.h:5:0,
from include/linux/scatterlist.h:9,
from include/linux/dma-mapping.h:11,
from drivers/net/ethernet/ti/davinci_cpdma.c:21:
   arch/sparc/include/asm/io_64.h:175:16: note: expected 'u32 {aka unsigned 
int}' but argument is of type 'void *'
#define writel writel
   ^
>> arch/sparc/include/asm/io_64.h:177:20: note: in expansion of macro 'writel'
static inline void writel(u32 l, volatile void __iomem *addr)
   ^~
   drivers/net/ethernet/ti/davinci_cpdma.c: In function '__cpdma_chan_free':
   drivers/net/ethernet/ti/davinci_cpdma.c:1126:15: warning: cast to pointer 
from integer of different size [-Wint-to-pointer-cast]
 token  = (void *)desc_read(desc, sw_token);
  ^
--
   In file included from arch/sparc/include/asm/bug.h:25:0,
from include/linux/bug.h:5,
from include/linux/thread_info.h:12,
from include/asm-generic/preempt.h:5,
from ./arch/sparc/include/generated/asm/preempt.h:1,
from include/linux/preempt.h:81,
from include/linux/spinlock.h:51,
from drivers/net//ethernet/ti/davinci_cpdma.c:16:
   drivers/net//ethernet/ti/davinci_cpdma.c: In function 
'cpdma_desc_pool_destroy':
   drivers/net//ethernet/ti/davinci_cpdma.c:194:7: warning: format '%d' expects 
argument of type 'int', but argument 4 has type 'size_t {aka long unsigned 
int}' [-Wformat=]
  "cpdma_desc_pool size %d != avail %d",
  ^
  gen_pool_size(pool->gen_pool),
  ~
   include/asm-generic/bug.h:92:69: note: in definition of macro '__WARN_printf'
#define __WARN_printf(arg...) warn_slowpath_fmt(__FILE__, __LINE__, arg)
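
The two -Wformat warnings above come from printing `size_t` values with `%d`; on sparc64 `size_t` is `long unsigned int`, so the argument width does not match. A minimal userspace sketch of the usual remedy, the `%zu` conversion (which the kernel's printk family also accepts), follows. The helper name `format_pool_msg()` is hypothetical and is not part of the driver.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper, for illustration only: formats two size_t values
 * with %zu, the conversion that matches size_t regardless of whether it
 * is 32 or 64 bits wide on the target.
 */
static int format_pool_msg(char *buf, size_t len, size_t size, size_t avail)
{
	return snprintf(buf, len, "cpdma_desc_pool size %zu != avail %zu",
			size, avail);
}
```

With `%zu` the same format string is correct on both 32-bit and 64-bit builds, which is what a COMPILE_TEST-enabled driver needs.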

Re: [Cake] [PATCH net-next v12 3/7] sch_cake: Add optional ACK filter

2018-05-17 Thread Ryan Mounce
On 17 May 2018 at 22:41, Toke Høiland-Jørgensen  wrote:
> Eric Dumazet  writes:
>
>> On 05/17/2018 04:23 AM, Toke Høiland-Jørgensen wrote:
>>
>>>
>>> We don't do full parsing of SACKs, no; we were trying to keep things
>>> simple... We do detect the presence of SACK options, though, and the
>>> presence of SACK options on an ACK will make previous ACKs be considered
>>> redundant.
>>>
>>
>> But they are not redundant in some cases, particularly when reorders
>> happen in the network.
>
> Huh. I was under the impression that SACKs were basically cumulative
> until cleared.
>
> I.e., in packet sequence ABCDE where B and D are lost, C would have
> SACK(B) and E would have SACK(B,D). Are you saying that E would only
> have SACK(D)?

SACK works by acknowledging additional ranges above those that have
been ACKed, rather than ACKing up to the largest seen sequence number
and reporting missing ranges before that.

A - ACK(A)
B - lost
C - ACK(A) + SACK(C)
D - lost
E - ACK(A) + SACK(C, E)

Cake does check that the ACK sequence number is greater, or if it is
equal and the 'newer' ACK has the SACK option present. It doesn't
compare the sequence numbers inside two SACKs. If the two SACKs in the
above example had been reordered before reaching cake's ACK filter in
aggressive mode, the wrong one will be filtered.

This is a limitation of my naive SACK handling in cake. The default
'conservative' mode happens to mitigate the problem in the above
scenario, but the issue could still present itself in more
pathological cases. It's fixable, however I'm not sure this corner
case is sufficiently common or severe to warrant the extra complexity.

Ryan.
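
The comparison described above can be sketched as follows. This is a simplified model, not the sch_cake code: the struct and function names are hypothetical, and only the two inputs the mail mentions (the cumulative ACK number and the presence of a SACK option) are modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of an ACK as the filter sees it. */
struct ack_model {
	uint32_t ack_seq;   /* cumulative ACK number */
	bool     has_sack;  /* a SACK option is present; its blocks are not parsed */
};

/* True if a is later than b in 32-bit sequence space (wraparound-safe). */
static bool seq_after(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) > 0;
}

/*
 * Decide whether the ACK already in the queue may be dropped in favour of
 * the one that just arrived: either the new ACK number is greater, or it is
 * equal and the new ACK carries a SACK option.
 */
static bool queued_ack_is_redundant(const struct ack_model *queued,
				    const struct ack_model *arrived)
{
	if (seq_after(arrived->ack_seq, queued->ack_seq))
		return true;
	return arrived->ack_seq == queued->ack_seq && arrived->has_sack;
}
```

In this model, if ACK(A)+SACK(C, E) is queued first and ACK(A)+SACK(C) arrives after it because of reordering, the function still returns true, so the ACK carrying more information is the one dropped. That is the limitation described above; avoiding it would require comparing the SACK blocks themselves.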


Re: [patch net-next RFC 04/12] dsa: set devlink port attrs for dsa ports

2018-05-17 Thread Andrew Lunn
> >>> ds = dsa_switch_alloc(>dev, DSA_MAX_PORTS);
> >>>
> >>> It is allocating a switch with 12 ports. However only 4 of them have
> >>> names. So the core only creates slave devices for those 4.
> >>>
> >>> This is a useful test. Real hardware often has unused ports. A WiFi AP
> >>> with a 7 port switch which only uses 6 ports is often seen.
> >>
> >> The following patch should fix this:
> >>
> >>
> >> diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
> >> index adf50fbc4c13..a06c29ec91f0 100644
> >> --- a/net/dsa/dsa2.c
> >> +++ b/net/dsa/dsa2.c
> >> @@ -262,13 +262,14 @@ static int dsa_port_setup(struct dsa_port *dp)
> >>
> >> memset(&dp->devlink_port, 0, sizeof(dp->devlink_port));
> >>
> >> +   if (dp->type == DSA_PORT_TYPE_UNUSED)
> >> +   return 0;
> >> +
> >> err = devlink_port_register(ds->devlink, &dp->devlink_port,
> >> dp->index);
> > 
> > Hi Florian, Jiri
> > 
> > Maybe it is better to add a devlink port type unused?
> 
> The port does not exist on the switch, so it should not even be
> registered IMHO.

Hi Florian

The ports do exist, when you called dsa_switch_alloc() you said the
switch has 12 ports.

   Andrew


[for-next 11/15] net/mlx5e: Add ingress/egress indication for offloaded TC flows

2018-05-17 Thread Saeed Mahameed
From: Or Gerlitz 

When an e-switch TC rule is offloaded through the egdev (egress
device) mechanism, we treat this as egress; all other cases (NIC
and e-switch) are considered ingress.

This is a preparation step that will allow us to identify "wrong"
stat/del offload calls made by the TC core on egdev based flows and
ignore them.

Signed-off-by: Or Gerlitz 
Signed-off-by: Jiri Pirko 
Reviewed-by: Paul Blakey 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  3 --
 .../net/ethernet/mellanox/mlx5/core/en_main.c | 15 
 .../net/ethernet/mellanox/mlx5/core/en_rep.c  | 32 
 .../net/ethernet/mellanox/mlx5/core/en_tc.c   | 38 ++-
 .../net/ethernet/mellanox/mlx5/core/en_tc.h   | 13 +--
 5 files changed, 70 insertions(+), 31 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 7c930088e96e..51a1d36a56c5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -1118,9 +1118,6 @@ int mlx5e_ethtool_get_ts_info(struct mlx5e_priv *priv,
 int mlx5e_ethtool_flash_device(struct mlx5e_priv *priv,
   struct ethtool_flash *flash);
 
-int mlx5e_setup_tc_block_cb(enum tc_setup_type type, void *type_data,
-   void *cb_priv);
-
 /* mlx5e generic netdev management API */
 struct net_device*
 mlx5e_create_netdev(struct mlx5_core_dev *mdev, const struct mlx5e_profile 
*profile,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 417bf2e8ab85..27e8375a476b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -3136,22 +3136,23 @@ static int mlx5e_setup_tc_mqprio(struct net_device 
*netdev,
 
 #ifdef CONFIG_MLX5_ESWITCH
 static int mlx5e_setup_tc_cls_flower(struct mlx5e_priv *priv,
-struct tc_cls_flower_offload *cls_flower)
+struct tc_cls_flower_offload *cls_flower,
+int flags)
 {
switch (cls_flower->command) {
case TC_CLSFLOWER_REPLACE:
-   return mlx5e_configure_flower(priv, cls_flower);
+   return mlx5e_configure_flower(priv, cls_flower, flags);
case TC_CLSFLOWER_DESTROY:
-   return mlx5e_delete_flower(priv, cls_flower);
+   return mlx5e_delete_flower(priv, cls_flower, flags);
case TC_CLSFLOWER_STATS:
-   return mlx5e_stats_flower(priv, cls_flower);
+   return mlx5e_stats_flower(priv, cls_flower, flags);
default:
return -EOPNOTSUPP;
}
 }
 
-int mlx5e_setup_tc_block_cb(enum tc_setup_type type, void *type_data,
-   void *cb_priv)
+static int mlx5e_setup_tc_block_cb(enum tc_setup_type type, void *type_data,
+  void *cb_priv)
 {
struct mlx5e_priv *priv = cb_priv;
 
@@ -3160,7 +3161,7 @@ int mlx5e_setup_tc_block_cb(enum tc_setup_type type, void 
*type_data,
 
switch (type) {
case TC_SETUP_CLSFLOWER:
-   return mlx5e_setup_tc_cls_flower(priv, type_data);
+   return mlx5e_setup_tc_cls_flower(priv, type_data, 
MLX5E_TC_INGRESS);
default:
return -EOPNOTSUPP;
}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index a689f4c90fe3..182b636552a6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -723,15 +723,31 @@ static int mlx5e_rep_get_phys_port_name(struct net_device 
*dev,
 
 static int
 mlx5e_rep_setup_tc_cls_flower(struct mlx5e_priv *priv,
- struct tc_cls_flower_offload *cls_flower)
+ struct tc_cls_flower_offload *cls_flower, int 
flags)
 {
switch (cls_flower->command) {
case TC_CLSFLOWER_REPLACE:
-   return mlx5e_configure_flower(priv, cls_flower);
+   return mlx5e_configure_flower(priv, cls_flower, flags);
case TC_CLSFLOWER_DESTROY:
-   return mlx5e_delete_flower(priv, cls_flower);
+   return mlx5e_delete_flower(priv, cls_flower, flags);
case TC_CLSFLOWER_STATS:
-   return mlx5e_stats_flower(priv, cls_flower);
+   return mlx5e_stats_flower(priv, cls_flower, flags);
+   default:
+   return -EOPNOTSUPP;
+   }
+}
+
+static int mlx5e_rep_setup_tc_cb_egdev(enum tc_setup_type type, void 
*type_data,
+  void *cb_priv)
+{
+   struct mlx5e_priv *priv = cb_priv;
+
+   if (!tc_cls_can_offload_and_chain0(priv->netdev, type_data))
+ 

[for-next 14/15] net/mlx5e: Ignore attempts to offload multiple times a TC flow

2018-05-17 Thread Saeed Mahameed
From: Or Gerlitz 

For VF->VF and uplink->VF rules, the TC core (cls_api) attempts
to offload the same flow multiple times into the driver, b/c we
registered to the egdev callback.

Use the flow cookie to ignore attempts to add such flows. We can't
reject them (return an error), b/c this would fail the offload attempt,
so we ignore them instead. We identify wrong stat/del calls using the flow
ingress/egress flags; here we do return an error to the core.

Signed-off-by: Or Gerlitz 
Signed-off-by: Jiri Pirko 
Reviewed-by: Paul Blakey 
Signed-off-by: Saeed Mahameed 
---
 .../net/ethernet/mellanox/mlx5/core/en_tc.c   | 21 +--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 05c90b4f8a31..674f1d7d2737 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -2666,6 +2666,12 @@ int mlx5e_configure_flower(struct mlx5e_priv *priv,
 
get_flags(flags, &flow_flags);
 
+   flow = rhashtable_lookup_fast(tc_ht, &f->cookie, tc_ht_params);
+   if (flow) {
+   netdev_warn_once(priv->netdev, "flow cookie %lx already exists, 
ignoring\n", f->cookie);
+   return 0;
+   }
+
if (esw && esw->mode == SRIOV_OFFLOADS) {
flow_flags |= MLX5E_TC_FLOW_ESWITCH;
attr_size  = sizeof(struct mlx5_esw_flow_attr);
@@ -2728,6 +2734,17 @@ int mlx5e_configure_flower(struct mlx5e_priv *priv,
return err;
 }
 
+#define DIRECTION_MASK (MLX5E_TC_INGRESS | MLX5E_TC_EGRESS)
+#define FLOW_DIRECTION_MASK (MLX5E_TC_FLOW_INGRESS | MLX5E_TC_FLOW_EGRESS)
+
+static bool same_flow_direction(struct mlx5e_tc_flow *flow, int flags)
+{
+   if ((flow->flags & FLOW_DIRECTION_MASK) == (flags & DIRECTION_MASK))
+   return true;
+
+   return false;
+}
+
 int mlx5e_delete_flower(struct mlx5e_priv *priv,
struct tc_cls_flower_offload *f, int flags)
 {
@@ -2735,7 +2752,7 @@ int mlx5e_delete_flower(struct mlx5e_priv *priv,
struct mlx5e_tc_flow *flow;
 
flow = rhashtable_lookup_fast(tc_ht, &f->cookie, tc_ht_params);
-   if (!flow)
+   if (!flow || !same_flow_direction(flow, flags))
return -EINVAL;
 
rhashtable_remove_fast(tc_ht, &flow->node, tc_ht_params);
@@ -2758,7 +2775,7 @@ int mlx5e_stats_flower(struct mlx5e_priv *priv,
u64 lastuse;
 
flow = rhashtable_lookup_fast(tc_ht, &f->cookie, tc_ht_params);
-   if (!flow)
+   if (!flow || !same_flow_direction(flow, flags))
return -EINVAL;
 
if (!(flow->flags & MLX5E_TC_FLOW_OFFLOADED))
-- 
2.17.0
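
The direction check in same_flow_direction() compares two differently named flag sets after masking, which only works if the caller-side and flow-side direction flags occupy the same bit positions. A minimal standalone sketch of that idea follows; the bit values and the shortened names are assumptions for illustration and are not taken from the driver headers.

```c
#include <stdbool.h>

/* Caller-side direction flags (bit values are assumed, not from the patch). */
enum { TC_INGRESS = 1 << 0, TC_EGRESS = 1 << 1 };

/* Flow-side flags: direction bits alias the caller-side bits. */
enum {
	FLOW_INGRESS = TC_INGRESS,
	FLOW_EGRESS  = TC_EGRESS,
	FLOW_ESWITCH = 1 << 2,	/* unrelated flag that the mask must hide */
};

#define DIR_MASK      (TC_INGRESS | TC_EGRESS)
#define FLOW_DIR_MASK (FLOW_INGRESS | FLOW_EGRESS)

/* True if the stored flow and the incoming call agree on direction. */
static bool same_direction(unsigned int flow_flags, unsigned int call_flags)
{
	return (flow_flags & FLOW_DIR_MASK) == (call_flags & DIR_MASK);
}
```

Under these assumptions, a flow stored as ingress is matched only by stat/del calls flagged ingress, so a duplicate call arriving through the egress path is rejected rather than acting on the wrong flow.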



[for-next 12/15] net/mlx5e: Prepare for shared table to keep TC eswitch flows

2018-05-17 Thread Saeed Mahameed
From: Or Gerlitz 

This is a refactoring step that makes it possible to store the hash table
which keeps track of offloaded TC flows in a different location for NIC
vs e-switch rules.

Signed-off-by: Or Gerlitz 
Signed-off-by: Jiri Pirko 
Reviewed-by: Paul Blakey 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  1 -
 .../net/ethernet/mellanox/mlx5/core/en_tc.c   | 39 ++-
 2 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 51a1d36a56c5..bc91a7335c93 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -634,7 +634,6 @@ struct mlx5e_flow_table {
 struct mlx5e_tc_table {
struct mlx5_flow_table  *t;
 
-   struct rhashtable_params ht_params;
struct rhashtable   ht;
 
DECLARE_HASHTABLE(mod_hdr_tbl, 8);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 26a1312ec9f8..1c90586d7f58 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -2634,12 +2634,24 @@ static void get_flags(int flags, u8 *flow_flags)
*flow_flags = __flow_flags;
 }
 
+static const struct rhashtable_params tc_ht_params = {
+   .head_offset = offsetof(struct mlx5e_tc_flow, node),
+   .key_offset = offsetof(struct mlx5e_tc_flow, cookie),
+   .key_len = sizeof(((struct mlx5e_tc_flow *)0)->cookie),
+   .automatic_shrinking = true,
+};
+
+static struct rhashtable *get_tc_ht(struct mlx5e_priv *priv)
+{
+   return &priv->fs.tc.ht;
+}
+
 int mlx5e_configure_flower(struct mlx5e_priv *priv,
   struct tc_cls_flower_offload *f, int flags)
 {
struct mlx5_eswitch *esw = priv->mdev->priv.eswitch;
struct mlx5e_tc_flow_parse_attr *parse_attr;
-   struct mlx5e_tc_table *tc = &priv->fs.tc;
+   struct rhashtable *tc_ht = get_tc_ht(priv);
struct mlx5e_tc_flow *flow;
int attr_size, err = 0;
u8 flow_flags = 0;
@@ -2693,8 +2705,7 @@ int mlx5e_configure_flower(struct mlx5e_priv *priv,
!(flow->esw_attr->action & MLX5_FLOW_CONTEXT_ACTION_ENCAP))
kvfree(parse_attr);
 
-   err = rhashtable_insert_fast(&tc->ht, &flow->node,
-tc->ht_params);
+   err = rhashtable_insert_fast(tc_ht, &flow->node, tc_ht_params);
if (err) {
mlx5e_tc_del_flow(priv, flow);
kfree(flow);
@@ -2711,15 +2722,14 @@ int mlx5e_configure_flower(struct mlx5e_priv *priv,
 int mlx5e_delete_flower(struct mlx5e_priv *priv,
struct tc_cls_flower_offload *f, int flags)
 {
+   struct rhashtable *tc_ht = get_tc_ht(priv);
struct mlx5e_tc_flow *flow;
-   struct mlx5e_tc_table *tc = &priv->fs.tc;
 
-   flow = rhashtable_lookup_fast(&tc->ht, &f->cookie,
- tc->ht_params);
+   flow = rhashtable_lookup_fast(tc_ht, &f->cookie, tc_ht_params);
if (!flow)
return -EINVAL;
 
-   rhashtable_remove_fast(&tc->ht, &flow->node, tc->ht_params);
+   rhashtable_remove_fast(tc_ht, &flow->node, tc_ht_params);
 
mlx5e_tc_del_flow(priv, flow);
 
@@ -2731,15 +2741,14 @@ int mlx5e_delete_flower(struct mlx5e_priv *priv,
 int mlx5e_stats_flower(struct mlx5e_priv *priv,
   struct tc_cls_flower_offload *f, int flags)
 {
-   struct mlx5e_tc_table *tc = &priv->fs.tc;
+   struct rhashtable *tc_ht = get_tc_ht(priv);
struct mlx5e_tc_flow *flow;
struct mlx5_fc *counter;
u64 bytes;
u64 packets;
u64 lastuse;
 
-   flow = rhashtable_lookup_fast(&tc->ht, &f->cookie,
- tc->ht_params);
+   flow = rhashtable_lookup_fast(tc_ht, &f->cookie, tc_ht_params);
if (!flow)
return -EINVAL;
 
@@ -2757,13 +2766,6 @@ int mlx5e_stats_flower(struct mlx5e_priv *priv,
return 0;
 }
 
-static const struct rhashtable_params mlx5e_tc_flow_ht_params = {
-   .head_offset = offsetof(struct mlx5e_tc_flow, node),
-   .key_offset = offsetof(struct mlx5e_tc_flow, cookie),
-   .key_len = sizeof(((struct mlx5e_tc_flow *)0)->cookie),
-   .automatic_shrinking = true,
-};
-
 int mlx5e_tc_init(struct mlx5e_priv *priv)
 {
struct mlx5e_tc_table *tc = &priv->fs.tc;
@@ -2771,8 +2773,7 @@ int mlx5e_tc_init(struct mlx5e_priv *priv)
hash_init(tc->mod_hdr_tbl);
hash_init(tc->hairpin_tbl);
 
-   tc->ht_params = mlx5e_tc_flow_ht_params;
-   return rhashtable_init(&tc->ht, &tc->ht_params);
+   return rhashtable_init(&tc->ht, &tc_ht_params);
 }
 
 static void _mlx5e_tc_del_flow(void *ptr, void *arg)
-- 
2.17.0



[pull request][for-next 00/15] Mellanox, mlx5 core and netdev updates 2018-05-17

2018-05-17 Thread Saeed Mahameed
Hi Dave and Doug,

Below you can find two pull requests,

1. mlx5 core updates to be shared for both netdev and RDMA, (patches 1..9)
 which is based on the last mlx5-next pull request
 
The following changes since commit a8408f4e6db775e245f20edf12b13fd58cc03a1c:

  net/mlx5: fix spelling mistake: "modfiy" -> "modify" (2018-05-04 12:11:51 
-0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux.git 
tags/mlx5-updates-2018-05-17

for you to fetch changes up to 10ff5359f883412728ba816046ee3a696625ca02:

  net/mlx5e: Explicitly set source e-switch in offloaded TC rules (2018-05-17 
14:17:35 -0700)

2. mlx5e netdev updates only for net-next branch (patches 10..15) based on 
net-next
and the above pull request.

The following changes since commit 538e2de104cfb4ef1acb35af42427bff42adbe4d:

  Merge branch 'net-Allow-more-drivers-with-COMPILE_TEST' (2018-05-17 17:11:07 
-0400)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git 
tags/mlx5e-updates-2018-05-17

for you to fetch changes up to a228060a7c9ab88597eeac131e4578595d5d46ae:

  net/mlx5e: Add HW vport counters to representor ethtool stats (2018-05-17 
17:48:54 -0700)


Dave, for your convenience you can either pull 1. and then 2. or pull 2.
directly.

For more information please see tags logs below.
Please pull and let me know if there's any problem.

Thanks,
Saeed.


mlx5-updates-2018-05-17

mlx5 core driver updates for both net-next and rdma-next branches.

From Christophe JAILLET, first three patches to use kvfree where needed.

From: Or Gerlitz 

The next six patches from Roi and Co add support for merged
sriov e-switch, which serves cases where both PFs, the VFs set
on them and both uplinks are to be used in a single v-switch SW model.
When merged e-switch is supported, the per-port e-switch is logically
merged into one e-switch that spans both physical ports and all the VFs.

This model allows offloading TC eswitch rules between VFs belonging
to different PFs (and hence having different eswitch affinity); it also
lays some of the foundations needed for uplink LAG support.


mlx5e-updates-2018-05-17

From: Or Gerlitz 

This series addresses a regression introduced by the
shared block TC changes [1]. Currently, for VF->VF and uplink->VF rules, the
TC core (cls_api) attempts to offload the same flow multiple times into
the driver, as a side effect of the mlx5 registration to the egdev callback.

We use the flow cookie to ignore attempts to add such flows, we can't
reject them (return error), b/c this will fail the offload attempt, so we
ignore that.

The last patch of the series deals with exposing HW stats counters through
ethtool for the vport reps.

Dave - the regression that we are addressing was introduced in 4.15 [1] and
applies to nfp and mlx5. Jiri suggested pushing driver-side fixes to net-next;
this is already done for nfp [2][3]. Once this is upstream, we will submit a
small, single-patch point fix for the TC core code which can serve net and
stable but will not be carried into net-next, b/c it might limit some future
use-cases.

[1] 208c0f4b5237 "net: sched: use tc_setup_cb_call to call per-block callbacks"
[2] c50647d "nfp: flower: ignore duplicate cb requests for same rule"
[3] 54a4a03 "nfp: flower: support offloading multiple rules with same cookie"


Christophe JAILLET (3):
  net/mlx5: Vport, Use 'kvfree()' for memory allocated by 'kvzalloc()'
  net/mlx5: Eswitch, Use 'kvfree()' for memory allocated by 'kvzalloc()'
  IB/mlx5: Use 'kvfree()' for memory allocated by 'kvzalloc()'

Or Gerlitz (5):
  net/mlx5e: Add ingress/egress indication for offloaded TC flows
  net/mlx5e: Prepare for shared table to keep TC eswitch flows
  net/mlx5e: Use shared table for offloaded TC eswitch flows
  net/mlx5e: Ignore attempts to offload multiple times a TC flow
  net/mlx5e: Add HW vport counters to representor ethtool stats

Rabie Loulou (2):
  net/mlx5e: Explicitly set destination e-switch in FDB rules
  net/mlx5e: Offload TC eswitch rules for VFs belonging to different PFs

Roi Dayan (1):
  net/mlx5: Add merged e-switch cap

Saeed Mahameed (1):
  Merge tag 'mlx5-updates-2018-05-17' of 
git://git.kernel.org/.../mellanox/linux

Shahar Klein (4):
  net/mlx5: Properly handle a vport destination when setting FTE
  net/mlx5: Add destination e-switch owner
  net/mlx5: Add source e-switch owner
  net/mlx5e: Explicitly set source e-switch in offloaded TC rules

 drivers/infiniband/hw/mlx5/cq.c|   2 +-
 .../mellanox/mlx5/core/diag/fs_tracepoint.c|   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en.h   |   4 -
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  19 +--
 

[for-next 09/15] net/mlx5e: Explicitly set source e-switch in offloaded TC rules

2018-05-17 Thread Saeed Mahameed
From: Shahar Klein 

Set a specific source e-switch when setting a rule that matches on the
ingress port.

Signed-off-by: Shahar Klein 
Reviewed-by: Or Gerlitz 
Reviewed-by: Roi Dayan 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c   | 1 +
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.h | 1 +
 .../net/ethernet/mellanox/mlx5/core/eswitch_offloads.c| 8 
 3 files changed, 10 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 880adc810ccc..1d2ba687b902 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -2462,6 +2462,7 @@ static int parse_tc_fdb_actions(struct mlx5e_priv *priv, 
struct tcf_exts *exts,
 
memset(attr, 0, sizeof(*attr));
attr->in_rep = rpriv->rep;
+   attr->in_mdev = priv->mdev;
 
tcf_exts_to_list(exts, );
list_for_each_entry(a, , list) {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h 
b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index ac5db54823a1..98a306e02640 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -231,6 +231,7 @@ struct mlx5_esw_flow_attr {
struct mlx5_eswitch_rep *in_rep;
struct mlx5_eswitch_rep *out_rep;
struct mlx5_core_dev *out_mdev;
+   struct mlx5_core_dev *in_mdev;
 
int action;
__be16  vlan_proto;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c 
b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index ea93867d1ab4..6c83eef5141a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -93,8 +93,16 @@ mlx5_eswitch_add_offloaded_rule(struct mlx5_eswitch *esw,
misc = MLX5_ADDR_OF(fte_match_param, spec->match_value, 
misc_parameters);
MLX5_SET(fte_match_set_misc, misc, source_port, attr->in_rep->vport);
 
+   if (MLX5_CAP_ESW(esw->dev, merged_eswitch))
+   MLX5_SET(fte_match_set_misc, misc,
+source_eswitch_owner_vhca_id,
+MLX5_CAP_GEN(attr->in_mdev, vhca_id));
+
misc = MLX5_ADDR_OF(fte_match_param, spec->match_criteria, 
misc_parameters);
MLX5_SET_TO_ONES(fte_match_set_misc, misc, source_port);
+   if (MLX5_CAP_ESW(esw->dev, merged_eswitch))
+   MLX5_SET_TO_ONES(fte_match_set_misc, misc,
+source_eswitch_owner_vhca_id);
 
spec->match_criteria_enable = MLX5_MATCH_OUTER_HEADERS |
  MLX5_MATCH_MISC_PARAMETERS;
-- 
2.17.0



[for-next 03/15] IB/mlx5: Use 'kvfree()' for memory allocated by 'kvzalloc()'

2018-05-17 Thread Saeed Mahameed
From: Christophe JAILLET 

When 'kvzalloc()' is used to allocate memory, 'kvfree()' must be used to
free it.

Fixes: 1cbe6fc86ccfe ("IB/mlx5: Add support for CQE compressing")
Signed-off-by: Christophe JAILLET 
Acked-by: Jason Gunthorpe 
Signed-off-by: Saeed Mahameed 
---
 drivers/infiniband/hw/mlx5/cq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
index 77d257ec899b..6d52ea03574e 100644
--- a/drivers/infiniband/hw/mlx5/cq.c
+++ b/drivers/infiniband/hw/mlx5/cq.c
@@ -849,7 +849,7 @@ static int create_cq_user(struct mlx5_ib_dev *dev, struct 
ib_udata *udata,
return 0;
 
 err_cqb:
-   kfree(*cqb);
+   kvfree(*cqb);
 
 err_db:
mlx5_ib_db_unmap_user(to_mucontext(context), >db);
-- 
2.17.0



[for-next 04/15] net/mlx5: Add merged e-switch cap

2018-05-17 Thread Saeed Mahameed
From: Roi Dayan 

When merged e-switch is supported, the per-port e-switch is logically
merged into one e-switch that spans both physical ports and all the VFs.
Under merged eswitch, both the matching on source vport and setting
destination vport can have a 2nd attribute which is the vhca id of the
eswitch owner.

For example:
esw0: {match: <vport=1 of esw0> action: fwd to <vport=7 of esw1>}
is a flow set on eswitch0 matching on source vport=1 from his eswitch
and the action being fwd to dest vport=7 of eswitch1.

Signed-off-by: Roi Dayan 
Reviewed-by: Shahar Klein 
Reviewed-by: Or Gerlitz Klein 
Signed-off-by: Saeed Mahameed 
---
 include/linux/mlx5/mlx5_ifc.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index 1aad455538f4..ef15f751a984 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -557,7 +557,8 @@ struct mlx5_ifc_e_switch_cap_bits {
u8 vport_svlan_insert[0x1];
u8 vport_cvlan_insert_if_not_exist[0x1];
u8 vport_cvlan_insert_overwrite[0x1];
-   u8 reserved_at_5[0x19];
+   u8 reserved_at_5[0x18];
+   u8 merged_eswitch[0x1];
u8 nic_vport_node_guid_modify[0x1];
u8 nic_vport_port_guid_modify[0x1];
 
-- 
2.17.0



[for-next 05/15] net/mlx5: Properly handle a vport destination when setting FTE

2018-05-17 Thread Saeed Mahameed
From: Shahar Klein 

When creating FTE, properly distinguish between destination being vport
or tir. The previous code just worked accidentally b/c of both dest being
in the same offset within a union.

Signed-off-by: Shahar Klein 
Reviewed-by: Or Gerlitz 
Reviewed-by: Roi Dayan 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c 
b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
index ef5afd7c9325..0bfce6a82c91 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
@@ -372,6 +372,9 @@ static int mlx5_cmd_set_fte(struct mlx5_core_dev *dev,
if (dst->dest_attr.type ==
MLX5_FLOW_DESTINATION_TYPE_FLOW_TABLE) {
id = dst->dest_attr.ft->id;
+   } else if (dst->dest_attr.type ==
+  MLX5_FLOW_DESTINATION_TYPE_VPORT) {
+   id = dst->dest_attr.vport_num;
} else {
id = dst->dest_attr.tir_num;
}
-- 
2.17.0
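
The "worked accidentally" remark in the commit message refers to union members that share an offset: reading the TIR member of a vport destination returns the right number only because both fields overlay the same storage. A standalone sketch of selecting the member by destination type is shown below; all type and field names here are hypothetical stand-ins, not the mlx5 definitions.

```c
#include <stdint.h>

/* Hypothetical destination kinds, standing in for the driver's enum. */
enum dest_type { DEST_VPORT, DEST_FLOW_TABLE, DEST_TIR };

struct flow_table {
	uint32_t id;
};

struct dest {
	enum dest_type type;
	union {
		uint32_t tir_num;	/* shares storage with vport_num */
		uint32_t vport_num;
		struct flow_table *ft;
	} u;
};

/* Pick the id from the union member that matches the destination type. */
static uint32_t dest_id(const struct dest *d)
{
	switch (d->type) {
	case DEST_FLOW_TABLE:
		return d->u.ft->id;
	case DEST_VPORT:
		return d->u.vport_num;
	default:
		return d->u.tir_num;
	}
}
```

Reading the member that was actually written keeps the code correct if the vport destination later grows extra fields and the two members stop lining up.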



[for-next 15/15] net/mlx5e: Add HW vport counters to representor ethtool stats

2018-05-17 Thread Saeed Mahameed
From: Or Gerlitz 

Currently the representor only reports the SW (slow-path) traffic
counters.

Add packet/bytes reporting of the HW counters, which account for the
total amount of traffic that was handled by the vport, both slow and
fast (offloaded) paths. The newly exposed counters are named
vport_rx/tx_packets/bytes.

Signed-off-by: Or Gerlitz 
Signed-off-by: Adi Nissim 
Signed-off-by: Saeed Mahameed 
---
 .../net/ethernet/mellanox/mlx5/core/en_rep.c  | 35 +++
 1 file changed, 29 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index aa32592a54cb..c3034f58aa33 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -66,18 +66,36 @@ static const struct counter_desc sw_rep_stats_desc[] = {
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_bytes) },
 };
 
-#define NUM_VPORT_REP_COUNTERS ARRAY_SIZE(sw_rep_stats_desc)
+struct vport_stats {
+   u64 vport_rx_packets;
+   u64 vport_tx_packets;
+   u64 vport_rx_bytes;
+   u64 vport_tx_bytes;
+};
+
+static const struct counter_desc vport_rep_stats_desc[] = {
+   { MLX5E_DECLARE_STAT(struct vport_stats, vport_rx_packets) },
+   { MLX5E_DECLARE_STAT(struct vport_stats, vport_rx_bytes) },
+   { MLX5E_DECLARE_STAT(struct vport_stats, vport_tx_packets) },
+   { MLX5E_DECLARE_STAT(struct vport_stats, vport_tx_bytes) },
+};
+
+#define NUM_VPORT_REP_SW_COUNTERS ARRAY_SIZE(sw_rep_stats_desc)
+#define NUM_VPORT_REP_HW_COUNTERS ARRAY_SIZE(vport_rep_stats_desc)
 
 static void mlx5e_rep_get_strings(struct net_device *dev,
  u32 stringset, uint8_t *data)
 {
-   int i;
+   int i, j;
 
switch (stringset) {
case ETH_SS_STATS:
-   for (i = 0; i < NUM_VPORT_REP_COUNTERS; i++)
+   for (i = 0; i < NUM_VPORT_REP_SW_COUNTERS; i++)
strcpy(data + (i * ETH_GSTRING_LEN),
   sw_rep_stats_desc[i].format);
+   for (j = 0; j < NUM_VPORT_REP_HW_COUNTERS; j++, i++)
+   strcpy(data + (i * ETH_GSTRING_LEN),
+  vport_rep_stats_desc[j].format);
break;
}
 }
@@ -140,7 +158,7 @@ static void mlx5e_rep_get_ethtool_stats(struct net_device 
*dev,
struct ethtool_stats *stats, u64 *data)
 {
struct mlx5e_priv *priv = netdev_priv(dev);
-   int i;
+   int i, j;
 
if (!data)
return;
@@ -148,18 +166,23 @@ static void mlx5e_rep_get_ethtool_stats(struct net_device 
*dev,
mutex_lock(&priv->state_lock);
if (test_bit(MLX5E_STATE_OPENED, &priv->state))
mlx5e_rep_update_sw_counters(priv);
+   mlx5e_rep_update_hw_counters(priv);
mutex_unlock(&priv->state_lock);
 
-   for (i = 0; i < NUM_VPORT_REP_COUNTERS; i++)
+   for (i = 0; i < NUM_VPORT_REP_SW_COUNTERS; i++)
data[i] = MLX5E_READ_CTR64_CPU(&priv->stats.sw,
   sw_rep_stats_desc, i);
+
+   for (j = 0; j < NUM_VPORT_REP_HW_COUNTERS; j++, i++)
+   data[i] = MLX5E_READ_CTR64_CPU(&priv->stats.vf_vport,
+  vport_rep_stats_desc, j);
 }
 
 static int mlx5e_rep_get_sset_count(struct net_device *dev, int sset)
 {
switch (sset) {
case ETH_SS_STATS:
-   return NUM_VPORT_REP_COUNTERS;
+   return NUM_VPORT_REP_SW_COUNTERS + NUM_VPORT_REP_HW_COUNTERS;
default:
return -EOPNOTSUPP;
}
-- 
2.17.0
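
The patch above emits the SW counters first and the HW counters second, with the output index `i` carried over from the first loop into the second while `j` walks the second table. A userspace sketch of that two-table layout follows. `GSTRING_LEN` stands in for ETH_GSTRING_LEN, and the SW counter names other than tx_bytes are assumptions for illustration.

```c
#include <stddef.h>
#include <string.h>

#define GSTRING_LEN 32	/* stand-in for ETH_GSTRING_LEN */
#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* SW names are assumed for illustration; HW names follow the patch. */
static const char *const sw_names[] = {
	"rx_packets", "rx_bytes", "tx_packets", "tx_bytes",
};
static const char *const hw_names[] = {
	"vport_rx_packets", "vport_rx_bytes",
	"vport_tx_packets", "vport_tx_bytes",
};

/* Fill fixed-width name slots: SW names first, then HW names. */
static size_t fill_strings(char *data)
{
	size_t i, j;

	for (i = 0; i < ARRAY_SIZE(sw_names); i++)
		strcpy(data + i * GSTRING_LEN, sw_names[i]);
	/* i keeps counting output slots while j indexes the HW table */
	for (j = 0; j < ARRAY_SIZE(hw_names); j++, i++)
		strcpy(data + i * GSTRING_LEN, hw_names[j]);

	return i;	/* total slots written */
}
```

The returned slot count has to equal what the sset-count callback reports, and the values callback must fill its array in the same order, otherwise ethtool would pair names with the wrong numbers.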



[for-next 07/15] net/mlx5e: Explicitly set destination e-switch in FDB rules

2018-05-17 Thread Saeed Mahameed
From: Rabie Loulou 

Set a specific destination e-switch when setting a destination vport.

Signed-off-by: Rabie Loulou 
Reviewed-by: Or Gerlitz 
Reviewed-by: Roi Dayan 
Reviewed-by: Shahar Klein 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c| 2 ++
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.h  | 1 +
 drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c | 5 +
 3 files changed, 8 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 4197001f9801..880adc810ccc 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -836,6 +836,7 @@ mlx5e_tc_add_fdb_flow(struct mlx5e_priv *priv,
out_priv = netdev_priv(encap_dev);
rpriv = out_priv->ppriv;
attr->out_rep = rpriv->rep;
+   attr->out_mdev = out_priv->mdev;
}
 
err = mlx5_eswitch_add_vlan_action(esw, attr);
@@ -2501,6 +2502,7 @@ static int parse_tc_fdb_actions(struct mlx5e_priv *priv, 
struct tcf_exts *exts,
out_priv = netdev_priv(out_dev);
rpriv = out_priv->ppriv;
attr->out_rep = rpriv->rep;
+   attr->out_mdev = out_priv->mdev;
} else if (encap) {
parse_attr->mirred_ifindex = out_dev->ifindex;
parse_attr->tun_info = *info;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h 
b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index 4cd773fa55e3..ac5db54823a1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -230,6 +230,7 @@ enum {
 struct mlx5_esw_flow_attr {
struct mlx5_eswitch_rep *in_rep;
struct mlx5_eswitch_rep *out_rep;
+   struct mlx5_core_dev*out_mdev;
 
int action;
__be16  vlan_proto;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c 
b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 90c8cb31e633..ea93867d1ab4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -72,6 +72,11 @@ mlx5_eswitch_add_offloaded_rule(struct mlx5_eswitch *esw,
if (flow_act.action & MLX5_FLOW_CONTEXT_ACTION_FWD_DEST) {
dest[i].type = MLX5_FLOW_DESTINATION_TYPE_VPORT;
dest[i].vport.num = attr->out_rep->vport;
+   if (MLX5_CAP_ESW(esw->dev, merged_eswitch)) {
+   dest[i].vport.vhca_id =
+   MLX5_CAP_GEN(attr->out_mdev, vhca_id);
+   dest[i].vport.vhca_id_valid = 1;
+   }
i++;
}
if (flow_act.action & MLX5_FLOW_CONTEXT_ACTION_COUNT) {
-- 
2.17.0



[for-next 13/15] net/mlx5e: Use shared table for offloaded TC eswitch flows

2018-05-17 Thread Saeed Mahameed
From: Or Gerlitz 

Currently, each representor netdev uses its own hash table to keep
the mapping from TC flow (f->cookie) to the driver offloaded instance.
The table is the one which originally was added for offloading TC NIC
(not eswitch) rules.

This scheme breaks when the core TC code calls us to add the same flow
twice (e.g. under the egdev use case), since we don't spot that and offload
a 2nd flow into the HW with the wrong source vport.

As a pre-step to solve that, we move to use a single table which keeps
all offloaded TC eswitch flows. The table is located at the eswitch
uplink representor object.

Signed-off-by: Or Gerlitz 
Signed-off-by: Jiri Pirko 
Reviewed-by: Paul Blakey 
Signed-off-by: Saeed Mahameed 
---
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  4 +--
 .../net/ethernet/mellanox/mlx5/core/en_rep.c  | 19 ++--
 .../net/ethernet/mellanox/mlx5/core/en_rep.h  |  1 +
 .../net/ethernet/mellanox/mlx5/core/en_tc.c   | 29 +++
 .../net/ethernet/mellanox/mlx5/core/en_tc.h   | 11 ---
 5 files changed, 43 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 27e8375a476b..b5a7580b12fe 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -4462,7 +4462,7 @@ static int mlx5e_init_nic_rx(struct mlx5e_priv *priv)
goto err_destroy_direct_tirs;
}
 
-   err = mlx5e_tc_init(priv);
+   err = mlx5e_tc_nic_init(priv);
if (err)
goto err_destroy_flow_steering;
 
@@ -4483,7 +4483,7 @@ static int mlx5e_init_nic_rx(struct mlx5e_priv *priv)
 
 static void mlx5e_cleanup_nic_rx(struct mlx5e_priv *priv)
 {
-   mlx5e_tc_cleanup(priv);
+   mlx5e_tc_nic_cleanup(priv);
mlx5e_destroy_flow_steering(priv);
mlx5e_destroy_direct_tirs(priv);
mlx5e_destroy_indirect_tirs(priv);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index 182b636552a6..aa32592a54cb 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -981,14 +981,8 @@ static int mlx5e_init_rep_rx(struct mlx5e_priv *priv)
}
rpriv->vport_rx_rule = flow_rule;
 
-   err = mlx5e_tc_init(priv);
-   if (err)
-   goto err_del_flow_rule;
-
return 0;
 
-err_del_flow_rule:
-   mlx5_del_flow_rules(rpriv->vport_rx_rule);
 err_destroy_direct_tirs:
mlx5e_destroy_direct_tirs(priv);
 err_destroy_direct_rqts:
@@ -1000,7 +994,6 @@ static void mlx5e_cleanup_rep_rx(struct mlx5e_priv *priv)
 {
struct mlx5e_rep_priv *rpriv = priv->ppriv;
 
-   mlx5e_tc_cleanup(priv);
mlx5_del_flow_rules(rpriv->vport_rx_rule);
mlx5e_destroy_direct_tirs(priv);
mlx5e_destroy_direct_rqts(priv);
@@ -1058,8 +1051,15 @@ mlx5e_nic_rep_load(struct mlx5_core_dev *dev, struct 
mlx5_eswitch_rep *rep)
if (err)
goto err_remove_sqs;
 
+   /* init shared tc flow table */
+   err = mlx5e_tc_esw_init(&rpriv->tc_ht);
+   if (err)
+   goto  err_neigh_cleanup;
+
return 0;
 
+err_neigh_cleanup:
+   mlx5e_rep_neigh_cleanup(rpriv);
 err_remove_sqs:
mlx5e_remove_sqs_fwd_rules(priv);
return err;
@@ -1074,9 +1074,8 @@ mlx5e_nic_rep_unload(struct mlx5_eswitch_rep *rep)
if (test_bit(MLX5E_STATE_OPENED, >state))
mlx5e_remove_sqs_fwd_rules(priv);
 
-   /* clean (and re-init) existing uplink offloaded TC rules */
-   mlx5e_tc_cleanup(priv);
-   mlx5e_tc_init(priv);
+   /* clean uplink offloaded TC rules, delete shared tc flow table */
+   mlx5e_tc_esw_cleanup(&rpriv->tc_ht);
 
mlx5e_rep_neigh_cleanup(rpriv);
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.h
index b9b481f2833a..844d32d5c29f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.h
@@ -59,6 +59,7 @@ struct mlx5e_rep_priv {
struct net_device  *netdev;
struct mlx5_flow_handle *vport_rx_rule;
struct list_head   vport_sqs_list;
+   struct rhashtable  tc_ht; /* valid for uplink rep */
 };
 
 static inline
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 1c90586d7f58..05c90b4f8a31 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -76,6 +76,7 @@ enum {
 
 struct mlx5e_tc_flow {
struct rhash_head   node;
+   struct mlx5e_priv   *priv;
u64 cookie;
u8  flags;
struct mlx5_flow_handle *rule;
@@ 

[for-next 01/15] net/mlx5: Vport, Use 'kvfree()' for memory allocated by 'kvzalloc()'

2018-05-17 Thread Saeed Mahameed
From: Christophe JAILLET 

When 'kvzalloc()' is used to allocate memory, 'kvfree()' must be used to
free it.

Fixes: 9efa75254593d ("net/mlx5_core: Introduce access functions to query vport 
RoCE fields")
Signed-off-by: Christophe JAILLET 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/vport.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/vport.c 
b/drivers/net/ethernet/mellanox/mlx5/core/vport.c
index 177e076b8d17..719cecb182c6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/vport.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/vport.c
@@ -511,7 +511,7 @@ int mlx5_query_nic_vport_system_image_guid(struct 
mlx5_core_dev *mdev,
*system_image_guid = MLX5_GET64(query_nic_vport_context_out, out,
nic_vport_context.system_image_guid);
 
-   kfree(out);
+   kvfree(out);
 
return 0;
 }
@@ -531,7 +531,7 @@ int mlx5_query_nic_vport_node_guid(struct mlx5_core_dev 
*mdev, u64 *node_guid)
*node_guid = MLX5_GET64(query_nic_vport_context_out, out,
nic_vport_context.node_guid);
 
-   kfree(out);
+   kvfree(out);
 
return 0;
 }
@@ -587,7 +587,7 @@ int mlx5_query_nic_vport_qkey_viol_cntr(struct 
mlx5_core_dev *mdev,
*qkey_viol_cntr = MLX5_GET(query_nic_vport_context_out, out,
   nic_vport_context.qkey_violation_counter);
 
-   kfree(out);
+   kvfree(out);
 
return 0;
 }
-- 
2.17.0



[for-next 02/15] net/mlx5: Eswitch, Use 'kvfree()' for memory allocated by 'kvzalloc()'

2018-05-17 Thread Saeed Mahameed
From: Christophe JAILLET 

When 'kvzalloc()' is used to allocate memory, 'kvfree()' must be used to
free it.

Fixes: fed9ce22bf8ae ("net/mlx5: E-Switch, Add API to create vport rx rules")
Signed-off-by: Christophe JAILLET 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c 
b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 35e256eb2f6e..b123f8a52ad8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -663,7 +663,7 @@ static int esw_create_vport_rx_group(struct mlx5_eswitch 
*esw)
 
esw->offloads.vport_rx_group = g;
 out:
-   kfree(flow_group_in);
+   kvfree(flow_group_in);
return err;
 }
 
-- 
2.17.0



[for-next 08/15] net/mlx5: Add source e-switch owner

2018-05-17 Thread Saeed Mahameed
From: Shahar Klein 

The source e-switch owner allows a vport on one e-switch port to be associated
with a rule defined on the second port e-switch.

The role of the source eswitch owner valid bit in the flow group is to
allow the firmware to fail driver attempts to wildcard the source eswitch
match field. If this bit is not set, the firmware ignores the source
eswitch owner field entirely.
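
The valid-bit rule described above can be sketched in plain userspace C.
This is only an illustration under stated assumptions: the two structs and
the helper name are hypothetical, simplified stand-ins and do not reflect the
real mlx5 command layouts. The valid bit is set exactly when the match mask
covers the source e-switch owner field at all.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the misc match mask. */
struct match_mask_sketch {
	uint16_t source_eswitch_owner_vhca_id;
	uint16_t source_port;
};

/* Hypothetical, simplified stand-in for the create-flow-group command. */
struct flow_group_cmd_sketch {
	uint8_t source_eswitch_owner_vhca_id_valid;
};

/* Set the valid bit iff any bit of the owner field is part of the mask. */
static void set_owner_valid_sketch(struct flow_group_cmd_sketch *cmd,
				   const struct match_mask_sketch *mask)
{
	cmd->source_eswitch_owner_vhca_id_valid =
		!!mask->source_eswitch_owner_vhca_id;
}
```

With a mask that leaves the owner field at zero, the bit stays clear, which
per the description above means the firmware ignores the field.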

Signed-off-by: Shahar Klein 
Reviewed-by: Or Gerlitz 
Reviewed-by: Roi Dayan 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 10 ++
 include/linux/mlx5/mlx5_ifc.h |  6 --
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c 
b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
index 5a80279b052a..b1a2ca0ff320 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
@@ -1372,6 +1372,8 @@ static int create_auto_flow_group(struct mlx5_flow_table 
*ft,
struct mlx5_core_dev *dev = get_dev(>node);
int inlen = MLX5_ST_SZ_BYTES(create_flow_group_in);
void *match_criteria_addr;
+   u8 src_esw_owner_mask_on;
+   void *misc;
int err;
u32 *in;
 
@@ -1384,6 +1386,14 @@ static int create_auto_flow_group(struct mlx5_flow_table 
*ft,
MLX5_SET(create_flow_group_in, in, start_flow_index, fg->start_index);
MLX5_SET(create_flow_group_in, in, end_flow_index,   fg->start_index +
 fg->max_ftes - 1);
+
+   misc = MLX5_ADDR_OF(fte_match_param, fg->mask.match_criteria,
+   misc_parameters);
+   src_esw_owner_mask_on = !!MLX5_GET(fte_match_set_misc, misc,
+source_eswitch_owner_vhca_id);
+   MLX5_SET(create_flow_group_in, in,
+source_eswitch_owner_vhca_id_valid, src_esw_owner_mask_on);
+
match_criteria_addr = MLX5_ADDR_OF(create_flow_group_in,
   in, match_criteria);
memcpy(match_criteria_addr, fg->mask.match_criteria,
diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index 3d17709bc30c..9c3538f1b8b9 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -412,7 +412,7 @@ struct mlx5_ifc_fte_match_set_misc_bits {
u8 reserved_at_0[0x8];
u8 source_sqn[0x18];
 
-   u8 reserved_at_20[0x10];
+   u8 source_eswitch_owner_vhca_id[0x10];
u8 source_port[0x10];
 
u8 outer_second_prio[0x3];
@@ -6995,7 +6995,9 @@ struct mlx5_ifc_create_flow_group_in_bits {
u8 reserved_at_a0[0x8];
u8 table_id[0x18];
 
-   u8 reserved_at_c0[0x20];
+   u8 source_eswitch_owner_vhca_id_valid[0x1];
+
+   u8 reserved_at_c1[0x1f];
 
u8 start_flow_index[0x20];
 
-- 
2.17.0



[for-next 06/15] net/mlx5: Add destination e-switch owner

2018-05-17 Thread Saeed Mahameed
From: Shahar Klein 

The destination e-switch owner allows a rule in namespace of one e-switch
owner to point to a vport that is natively associated with another
e-switch owner.

Signed-off-by: Shahar Klein 
Reviewed-by: Or Gerlitz 
Reviewed-by: Roi Dayan 
Signed-off-by: Saeed Mahameed 
---
 .../net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.c  | 2 +-
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.c | 2 +-
 .../net/ethernet/mellanox/mlx5/core/eswitch_offloads.c| 6 +++---
 drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c  | 8 +++-
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 2 +-
 include/linux/mlx5/fs.h   | 6 +-
 include/linux/mlx5/mlx5_ifc.h | 5 +++--
 7 files changed, 21 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.c 
b/drivers/net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.c
index d93ff567b40d..b3820a34e773 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.c
@@ -235,7 +235,7 @@ const char *parse_fs_dst(struct trace_seq *p,
 
switch (dst->type) {
case MLX5_FLOW_DESTINATION_TYPE_VPORT:
-   trace_seq_printf(p, "vport=%u\n", dst->vport_num);
+   trace_seq_printf(p, "vport=%u\n", dst->vport.num);
break;
case MLX5_FLOW_DESTINATION_TYPE_FLOW_TABLE:
trace_seq_printf(p, "ft=%p\n", dst->ft);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c 
b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index 332bc56306bf..9a24314b817a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -192,7 +192,7 @@ __esw_fdb_set_vport_rule(struct mlx5_eswitch *esw, u32 
vport, bool rx_rule,
}
 
dest.type = MLX5_FLOW_DESTINATION_TYPE_VPORT;
-   dest.vport_num = vport;
+   dest.vport.num = vport;
 
esw_debug(esw->dev,
  "\tFDB add rule dmac_v(%pM) dmac_c(%pM) -> vport(%d)\n",
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c 
b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index b123f8a52ad8..90c8cb31e633 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -71,7 +71,7 @@ mlx5_eswitch_add_offloaded_rule(struct mlx5_eswitch *esw,
 
if (flow_act.action & MLX5_FLOW_CONTEXT_ACTION_FWD_DEST) {
dest[i].type = MLX5_FLOW_DESTINATION_TYPE_VPORT;
-   dest[i].vport_num = attr->out_rep->vport;
+   dest[i].vport.num = attr->out_rep->vport;
i++;
}
if (flow_act.action & MLX5_FLOW_CONTEXT_ACTION_COUNT) {
@@ -343,7 +343,7 @@ mlx5_eswitch_add_send_to_vport_rule(struct mlx5_eswitch 
*esw, int vport, u32 sqn
 
spec->match_criteria_enable = MLX5_MATCH_MISC_PARAMETERS;
dest.type = MLX5_FLOW_DESTINATION_TYPE_VPORT;
-   dest.vport_num = vport;
+   dest.vport.num = vport;
flow_act.action = MLX5_FLOW_CONTEXT_ACTION_FWD_DEST;
 
flow_rule = mlx5_add_flow_rules(esw->fdb_table.offloads.fdb, spec,
@@ -387,7 +387,7 @@ static int esw_add_fdb_miss_rule(struct mlx5_eswitch *esw)
dmac_c[0] = 0x01;
 
dest.type = MLX5_FLOW_DESTINATION_TYPE_VPORT;
-   dest.vport_num = 0;
+   dest.vport.num = 0;
flow_act.action = MLX5_FLOW_CONTEXT_ACTION_FWD_DEST;
 
flow_rule = mlx5_add_flow_rules(esw->fdb_table.offloads.fdb, spec,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c 
b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
index 0bfce6a82c91..5a00deff5457 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
@@ -374,7 +374,13 @@ static int mlx5_cmd_set_fte(struct mlx5_core_dev *dev,
id = dst->dest_attr.ft->id;
} else if (dst->dest_attr.type ==
   MLX5_FLOW_DESTINATION_TYPE_VPORT) {
-   id = dst->dest_attr.vport_num;
+   id = dst->dest_attr.vport.num;
+   MLX5_SET(dest_format_struct, in_dests,
+
destination_eswitch_owner_vhca_id_valid,
+dst->dest_attr.vport.vhca_id_valid);
+   MLX5_SET(dest_format_struct, in_dests,
+destination_eswitch_owner_vhca_id,
+dst->dest_attr.vport.vhca_id);
} else {
id = dst->dest_attr.tir_num;
}
diff --git 

[for-next 10/15] net/mlx5e: Offload TC eswitch rules for VFs belonging to different PFs

2018-05-17 Thread Saeed Mahameed
From: Rabie Loulou 

When the merged eswitch capability is supported, allow offloading rules
between VFs which belong to different PFs (and hence have different
eswitch affinity).

Signed-off-by: Rabie Loulou 
Reviewed-by: Or Gerlitz 
Reviewed-by: Roi Dayan 
Reviewed-by: Shahar Klein 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 17 -
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 630dd6dcabb9..77c3f8b8ae96 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -2077,6 +2077,20 @@ static int mlx5e_route_lookup_ipv4(struct mlx5e_priv 
*priv,
return 0;
 }
 
+static bool is_merged_eswitch_dev(struct mlx5e_priv *priv,
+ struct net_device *peer_netdev)
+{
+   struct mlx5e_priv *peer_priv;
+
+   peer_priv = netdev_priv(peer_netdev);
+
+   return (MLX5_CAP_ESW(priv->mdev, merged_eswitch) &&
+   (priv->netdev->netdev_ops == peer_netdev->netdev_ops) &&
+   same_hw_devs(priv, peer_priv) &&
+   MLX5_VPORT_MANAGER(peer_priv->mdev) &&
+   (peer_priv->mdev->priv.eswitch->mode == SRIOV_OFFLOADS));
+}
+
 static int mlx5e_route_lookup_ipv6(struct mlx5e_priv *priv,
   struct net_device *mirred_dev,
   struct net_device **out_dev,
@@ -2535,7 +2549,8 @@ static int parse_tc_fdb_actions(struct mlx5e_priv *priv, 
struct tcf_exts *exts,
out_dev = tcf_mirred_dev(a);
 
if (switchdev_port_same_parent_id(priv->netdev,
- out_dev)) {
+ out_dev) ||
+   is_merged_eswitch_dev(priv, out_dev)) {
action |= MLX5_FLOW_CONTEXT_ACTION_FWD_DEST |
  MLX5_FLOW_CONTEXT_ACTION_COUNT;
out_priv = netdev_priv(out_dev);
-- 
2.17.0



RE: [PATCH net-next v3 2/3] net: ethernet: freescale: Allow FEC with COMPILE_TEST

2018-05-17 Thread Andy Duan
From: Florian Fainelli  Sent: May 18, 2018 4:08
> The Freescale FEC driver builds fine with COMPILE_TEST, so make that
> possible.
> 
> Signed-off-by: Florian Fainelli 

Acked-by: Fugang Duan 

> ---
>  drivers/net/ethernet/freescale/Kconfig| 2 +-
>  drivers/net/ethernet/freescale/fec.h  | 2 +-
>  drivers/net/ethernet/freescale/fec_main.c | 2 +-
>  3 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/freescale/Kconfig
> b/drivers/net/ethernet/freescale/Kconfig
> index 6e490fd2345d..a580a3dcbe59 100644
> --- a/drivers/net/ethernet/freescale/Kconfig
> +++ b/drivers/net/ethernet/freescale/Kconfig
> @@ -22,7 +22,7 @@ if NET_VENDOR_FREESCALE  config FEC
>   tristate "FEC ethernet controller (of ColdFire and some i.MX CPUs)"
>   depends on (M523x || M527x || M5272 || M528x || M520x ||
> M532x || \
> -ARCH_MXC || SOC_IMX28)
> +ARCH_MXC || SOC_IMX28 || COMPILE_TEST)
>   default ARCH_MXC || SOC_IMX28 if ARM
>   select PHYLIB
>   imply PTP_1588_CLOCK
> diff --git a/drivers/net/ethernet/freescale/fec.h
> b/drivers/net/ethernet/freescale/fec.h
> index e7381f8ef89d..4778b663653e 100644
> --- a/drivers/net/ethernet/freescale/fec.h
> +++ b/drivers/net/ethernet/freescale/fec.h
> @@ -21,7 +21,7 @@
> 
>  #if defined(CONFIG_M523x) || defined(CONFIG_M527x) ||
> defined(CONFIG_M528x) || \
>  defined(CONFIG_M520x) || defined(CONFIG_M532x) ||
> defined(CONFIG_ARM) || \
> -defined(CONFIG_ARM64)
> +defined(CONFIG_ARM64) || defined(CONFIG_COMPILE_TEST)
>  /*
>   *   Just figures, Motorola would have to change the offsets for
>   *   registers in the same peripheral device on different models
> diff --git a/drivers/net/ethernet/freescale/fec_main.c
> b/drivers/net/ethernet/freescale/fec_main.c
> index f3e43db0d6cb..4358f586e28f 100644
> --- a/drivers/net/ethernet/freescale/fec_main.c
> +++ b/drivers/net/ethernet/freescale/fec_main.c
> @@ -2107,7 +2107,7 @@ static int fec_enet_get_regs_len(struct
> net_device *ndev)
>  /* List of registers that can be safety be read to dump them with ethtool
> */  #if defined(CONFIG_M523x) || defined(CONFIG_M527x) ||
> defined(CONFIG_M528x) || \
>   defined(CONFIG_M520x) || defined(CONFIG_M532x) ||
> defined(CONFIG_ARM) || \
> - defined(CONFIG_ARM64)
> + defined(CONFIG_ARM64) || defined(CONFIG_COMPILE_TEST)
>  static u32 fec_enet_register_offset[] = {
>   FEC_IEVENT, FEC_IMASK, FEC_R_DES_ACTIVE_0,
> FEC_X_DES_ACTIVE_0,
>   FEC_ECNTRL, FEC_MII_DATA, FEC_MII_SPEED, FEC_MIB_CTRLSTAT,
> FEC_R_CNTRL,
> --
> 2.14.1



Re: [PATCH net-next ] net: mscc: Add SPDX identifier

2018-05-17 Thread Joe Perches
On Thu, 2018-05-17 at 21:39 +0200, Alexandre Belloni wrote:
> On 17/05/2018 12:28:59-0700, Joe Perches wrote:
> > On Thu, 2018-05-17 at 21:23 +0200, Alexandre Belloni wrote:
> > > ocelot_qsys.h is missing the SPDX identifier, fix that.
> > > 
> > > Signed-off-by: Alexandre Belloni 
> > 
> > Only the copyright holders should ideally be modifying
> > these and also removing other license content.
> > 
> > For instance, what's the real intent here?
> > 
> 
> Well, if you have a look, I submitted that file this cycle and it is the
> only one that doesn't have the proper SPDX identifier. This is a mistake
> I'm fixing.

Just because you submitted it does not mean you
are the copyright holder.

> > > diff --git a/drivers/net/ethernet/mscc/ocelot_qsys.h 
> > > b/drivers/net/ethernet/mscc/ocelot_qsys.h
> > 
> > []
> > > @@ -1,7 +1,7 @@
> > > +/* SPDX-License-Identifier: (GPL-2.0 OR MIT) */
> > 
> > GPL 2.0+ or 2.0?
> > 
> 
> 2.0

How do you know that?



Re: [PATCH bpf-next 3/3] bpf: Add mtu checking to FIB forwarding helper

2018-05-17 Thread David Ahern
On 5/17/18 4:22 PM, Daniel Borkmann wrote:
> On 05/17/2018 06:09 PM, David Ahern wrote:
>> Add check that egress MTU can handle packet to be forwarded. If
>> the MTU is less than the packet length, return 0 meaning the
>> packet is expected to continue up the stack for help - e.g.,
>> fragmenting the packet or sending an ICMP.
>>
>> Signed-off-by: David Ahern 
>> ---
>>  net/core/filter.c | 10 ++
>>  1 file changed, 10 insertions(+)
>>
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index 6d0d1560bd70..c47c47a75d4b 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -4098,6 +4098,7 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct 
>> bpf_fib_lookup *params,
>>  struct fib_nh *nh;
>>  struct flowi4 fl4;
>>  int err;
>> +u32 mtu;
>>  
>>  dev = dev_get_by_index_rcu(net, params->ifindex);
>>  if (unlikely(!dev))
>> @@ -4149,6 +4150,10 @@ static int bpf_ipv4_fib_lookup(struct net *net, 
>> struct bpf_fib_lookup *params,
>>  if (res.fi->fib_nhs > 1)
>>  fib_select_path(net, , , NULL);
>>  
>> +mtu = ip_mtu_from_fib_result(, params->ipv4_dst);
>> +if (params->tot_len > mtu)
>> +return 0;
>> +
>>  nh = >fib_nh[res.nh_sel];
>>  
>>  /* do not handle lwt encaps right now */
>> @@ -4188,6 +4193,7 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct 
>> bpf_fib_lookup *params,
>>  struct flowi6 fl6;
>>  int strict = 0;
>>  int oif;
>> +u32 mtu;
>>  
>>  /* link local addresses are never forwarded */
>>  if (rt6_need_strict(dst) || rt6_need_strict(src))
>> @@ -4250,6 +4256,10 @@ static int bpf_ipv6_fib_lookup(struct net *net, 
>> struct bpf_fib_lookup *params,
>> fl6.flowi6_oif, NULL,
>> strict);
>>  
>> +mtu = ip6_mtu_from_fib6(f6i, dst, src);
>> +if (params->tot_len > mtu)
>> +return 0;
>> +
>>  if (f6i->fib6_nh.nh_lwtstate)
>>  return 0;
> 
> Could you elaborate how this interacts in tc BPF use case where you have e.g.
> GSO packets and tot_len from aggregated packets would definitely be larger
> than MTU (e.g. see is_skb_forwardable() as one example on such checks)? Should
> this be an opt-in via a new flag for the helper?

It should not be opt-in for XDP.

I could add a flag to the internal call -- bpf_skb_fib_lookup sets the
flag to skip the MTU check in bpf_ipv4_fib_lookup and bpf_ipv6_fib_lookup.

For the skb case do you want bpf_skb_fib_lookup to call is_skb_forwardable
or leave that to the BPF program?
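
The flag idea above can be sketched in userspace C. This is a rough
illustration, not the kernel code: the enum, the function names and the
check_mtu parameter are all hypothetical. The shared lookup takes an internal
flag, the XDP-style entry point always enables the MTU check, and the
skb-style entry point skips it so that aggregated (GSO) lengths are not
rejected here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical verdicts; 0 mirrors the "return 0" used in the patch. */
enum fib_verdict_sketch {
	FIB_SKETCH_PASS_TO_STACK = 0,
	FIB_SKETCH_FORWARD = 1,
};

/* Shared lookup body; check_mtu is the assumed internal flag. */
static enum fib_verdict_sketch
fib_lookup_sketch(uint32_t tot_len, uint32_t mtu, bool check_mtu)
{
	if (check_mtu && tot_len > mtu)
		return FIB_SKETCH_PASS_TO_STACK;
	return FIB_SKETCH_FORWARD;
}

/* XDP-style caller: the MTU check is always on. */
static enum fib_verdict_sketch xdp_lookup_sketch(uint32_t tot_len, uint32_t mtu)
{
	return fib_lookup_sketch(tot_len, mtu, true);
}

/* skb-style caller: the MTU check is skipped and left to the caller. */
static enum fib_verdict_sketch skb_lookup_sketch(uint32_t tot_len, uint32_t mtu)
{
	return fib_lookup_sketch(tot_len, mtu, false);
}
```

Whether the skb path should then run its own forwardability check is the
open question in the message above; the sketch leaves that out.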


pull-request: bpf 2018-05-18

2018-05-17 Thread Daniel Borkmann
Hi David,

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix two bugs in sockmap, a use after free in sockmap's error path
   from sock_map_ctx_update_elem() where we mistakenly drop a reference
   we didn't take prior to that, and in the same function fix a race
   in bpf_prog_inc_not_zero() where we didn't use the progs from prior
   READ_ONCE(), from John.

2) Reject program expansions once we figure out that their jump target
   which crosses patchlet boundaries could otherwise get truncated in
   insn->off space, from Daniel.

3) Check the return value of fopen() in BPF selftest's test_verifier
   where we determine whether unpriv BPF is disabled, and iff we do
   fail there then just assume it is disabled. This fixes a segfault
   when used with older kernels, from Jesper.
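
Item 3 can be illustrated with a small sketch. It assumes a hypothetical
helper name and takes the file path as a parameter; it is not the selftest
code itself. A failed fopen() is treated as "disabled" instead of passing a
NULL stream on to later calls. Treating an unreadable file as disabled too is
an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper: returns true when unprivileged use should be
 * treated as disabled. */
static bool unpriv_disabled_sketch(const char *path)
{
	char buf[2] = { 0 };
	bool disabled = true;	/* default when nothing can be determined */
	FILE *fd = fopen(path, "r");

	if (!fd)
		return true;	/* file missing (e.g. older kernel): assume disabled */

	/* Only a readable file that starts with '0' counts as enabled. */
	if (fgets(buf, sizeof(buf), fd) == buf && buf[0] == '0')
		disabled = false;

	fclose(fd);
	return disabled;
}
```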

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

When this gets later merged into net-next there are two trivial
BPF conflicts to resolve:

In kernel/bpf/sockmap.c the bpf_prog_inc_not_zero() cases must
use verdict, parse and tx_msg as their arguments as opposed to
the buggy old version where progs->bpf_{verdict,parse,tx_msg}
were used as passed args.

In tools/lib/bpf/libbpf.c use the hunk from net-next with the
__bpf_object__open() + IS_ERR(obj) test combination. Thus, net-next
code only is sufficient here.

Thanks a lot!



The following changes since commit 02f99df1875c11330cd0be69a40fa8ccd14749b2:

  erspan: fix invalid erspan version. (2018-05-17 15:48:49 -0400)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git 

for you to fetch changes up to 050fad7c4534c13c8eb1d9c2ba66012e014773cb:

  bpf: fix truncated jump targets on heavy expansions (2018-05-17 16:05:35 -0700)


Daniel Borkmann (1):
  bpf: fix truncated jump targets on heavy expansions

Jesper Dangaard Brouer (1):
  selftests/bpf: check return value of fopen in test_verifier.c

John Fastabend (2):
  bpf: sockmap update rollback on error can incorrectly dec prog refcnt
  bpf: parse and verdict prog attach may race with bpf map update

 kernel/bpf/core.c   | 100 +---
 kernel/bpf/sockmap.c|  18 ++---
 net/core/filter.c   |  11 ++-
 tools/testing/selftests/bpf/test_verifier.c |   5 ++
 4 files changed, 98 insertions(+), 36 deletions(-)


Re: [PATCH iproute2] ip link: Do not call ll_name_to_index when creating a new link

2018-05-17 Thread David Ahern
On 5/17/18 4:36 PM, Stephen Hemminger wrote:
> On Thu, 17 May 2018 16:22:37 -0600
> dsah...@kernel.org wrote:
> 
>> From: David Ahern 
>>
>> Using iproute2 to create a bridge and add 4094 vlans to it can take from
>> 2 to 3 *minutes*. The reason is the extraneous call to ll_name_to_index.
>> ll_name_to_index results in an ioctl(SIOCGIFINDEX) call which in turn
>> invokes dev_load. If the index does not exist, which it won't when
>> creating a new link, dev_load calls modprobe twice -- once for
>> netdev-NAME and again for NAME. This is unnecessary overhead for each
>> link create.
>>
>> When ip link is invoked for a new device, there is no reason to
>> call ll_name_to_index for the new device. With this patch, creating
>> a bridge and adding 4094 vlans takes less than 3 *seconds*.
>>
>> Signed-off-by: David Ahern 
> 
> Yes this looks like a real problem.
> Isn't the cache supposed to reduce this?
> 
> Don't like to make lots of special case flags.
> 

The device does not exist, so it won't be in any cache. ll_name_to_index
already checks it though before calling if_nametoindex.


[PATCH net v2] net: dsa: Do not register devlink for unused ports

2018-05-17 Thread Florian Fainelli
Even if commit 1d27732f411d ("net: dsa: setup and teardown ports") indicated
that registering a devlink instance for unused ports is not a problem, and this
is true, this can be confusing nonetheless, so let's not do it.

Fixes: 1d27732f411d ("net: dsa: setup and teardown ports")
Reported-by: Jiri Pirko 
Signed-off-by: Florian Fainelli 
---
 net/dsa/dsa2.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index adf50fbc4c13..47725250b4ca 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -258,11 +258,13 @@ static void dsa_tree_teardown_default_cpu(struct 
dsa_switch_tree *dst)
 static int dsa_port_setup(struct dsa_port *dp)
 {
struct dsa_switch *ds = dp->ds;
-   int err;
+   int err = 0;
 
memset(&dp->devlink_port, 0, sizeof(dp->devlink_port));
 
-   err = devlink_port_register(ds->devlink, &dp->devlink_port, dp->index);
+   if (dp->type != DSA_PORT_TYPE_UNUSED)
+   err = devlink_port_register(ds->devlink, &dp->devlink_port,
+   dp->index);
if (err)
return err;
 
@@ -293,7 +295,8 @@ static int dsa_port_setup(struct dsa_port *dp)
 
 static void dsa_port_teardown(struct dsa_port *dp)
 {
-   devlink_port_unregister(&dp->devlink_port);
+   if (dp->type != DSA_PORT_TYPE_UNUSED)
+   devlink_port_unregister(&dp->devlink_port);
 
switch (dp->type) {
case DSA_PORT_TYPE_UNUSED:
-- 
2.14.1



[net-next:master 1230/1233] arch/mips/include/asm/io.h:422:1: note: in expansion of macro '__BUILD_MEMORY_SINGLE'

2018-05-17 Thread kbuild test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git 
master
head:   538e2de104cfb4ef1acb35af42427bff42adbe4d
commit: 2652113ff043ca2ce1cb3be529b5ca9270c421d4 [1230/1233] net: ethernet: ti: 
Allow most drivers with COMPILE_TEST
config: mips-allyesconfig (attached as .config)
compiler: mips-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
git checkout 2652113ff043ca2ce1cb3be529b5ca9270c421d4
# save the attached .config to linux build tree
make.cross ARCH=mips 

All warnings (new ones prefixed by >>):

   drivers/net//ethernet/ti/davinci_cpdma.c: In function 'cpdma_chan_submit':
   drivers/net//ethernet/ti/davinci_cpdma.c:1083:17: warning: passing argument 
1 of 'writel' makes integer from pointer without a cast [-Wint-conversion]
 writel_relaxed(token, >sw_token);
^
   In file included from arch/mips/include/asm/page.h:194:0,
from include/linux/mmzone.h:21,
from include/linux/gfp.h:6,
from include/linux/idr.h:16,
from include/linux/kernfs.h:14,
from include/linux/sysfs.h:16,
from include/linux/kobject.h:20,
from include/linux/device.h:16,
from drivers/net//ethernet/ti/davinci_cpdma.c:17:
   arch/mips/include/asm/io.h:315:25: note: expected 'u32 {aka unsigned int}' 
but argument is of type 'void *'
static inline void pfx##write##bwlq(type val,\
^
>> arch/mips/include/asm/io.h:422:1: note: in expansion of macro 
>> '__BUILD_MEMORY_SINGLE'
__BUILD_MEMORY_SINGLE(bus, bwlq, type, 1)
^
>> arch/mips/include/asm/io.h:427:1: note: in expansion of macro 
>> '__BUILD_MEMORY_PFX'
__BUILD_MEMORY_PFX(, bwlq, type) \
^~
>> arch/mips/include/asm/io.h:432:1: note: in expansion of macro 'BUILDIO_MEM'
BUILDIO_MEM(l, u32)
^~~
--
   drivers/net/ethernet/ti/davinci_cpdma.c: In function 'cpdma_chan_submit':
   drivers/net/ethernet/ti/davinci_cpdma.c:1083:17: warning: passing argument 1 
of 'writel' makes integer from pointer without a cast [-Wint-conversion]
 writel_relaxed(token, >sw_token);
^
   In file included from arch/mips/include/asm/page.h:194:0,
from include/linux/mmzone.h:21,
from include/linux/gfp.h:6,
from include/linux/idr.h:16,
from include/linux/kernfs.h:14,
from include/linux/sysfs.h:16,
from include/linux/kobject.h:20,
from include/linux/device.h:16,
from drivers/net/ethernet/ti/davinci_cpdma.c:17:
   arch/mips/include/asm/io.h:315:25: note: expected 'u32 {aka unsigned int}' 
but argument is of type 'void *'
static inline void pfx##write##bwlq(type val,\
^
>> arch/mips/include/asm/io.h:422:1: note: in expansion of macro 
>> '__BUILD_MEMORY_SINGLE'
__BUILD_MEMORY_SINGLE(bus, bwlq, type, 1)
^
>> arch/mips/include/asm/io.h:427:1: note: in expansion of macro 
>> '__BUILD_MEMORY_PFX'
__BUILD_MEMORY_PFX(, bwlq, type) \
^~
>> arch/mips/include/asm/io.h:432:1: note: in expansion of macro 'BUILDIO_MEM'
BUILDIO_MEM(l, u32)
^~~

vim +/__BUILD_MEMORY_SINGLE +422 arch/mips/include/asm/io.h

8faca49a6 arch/mips/include/asm/io.h David Daney    2008-12-11  312
^1da177e4 include/asm-mips/io.h      Linus Torvalds 2005-04-16  313  #define __BUILD_MEMORY_SINGLE(pfx, bwlq, type, irq) \
^1da177e4 include/asm-mips/io.h      Linus Torvalds 2005-04-16  314  \
^1da177e4 include/asm-mips/io.h      Linus Torvalds 2005-04-16 @315  static inline void pfx##write##bwlq(type val, \
^1da177e4 include/asm-mips/io.h      Linus Torvalds 2005-04-16  316  volatile void __iomem *mem) \
^1da177e4 include/asm-mips/io.h      Linus Torvalds 2005-04-16  317  { \
^1da177e4 include/asm-mips/io.h      Linus Torvalds 2005-04-16  318  volatile type *__mem; \
^1da177e4 include/asm-mips/io.h      Linus Torvalds 2005-04-16  319  type __val; \
^1da177e4 include/asm-mips/io.h      Linus Torvalds 2005-04-16  320  \
1e820da3c arch/mips/include/asm/io.h Huacai Chen    2016-03-03  321  war_io_reorder_wmb(); \

Re: [PATCH bpf-next 2/7] bpf: introduce bpf subcommand BPF_PERF_EVENT_QUERY

2018-05-17 Thread kbuild test robot
Hi Yonghong,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on bpf-next/master]

url:
https://github.com/0day-ci/linux/commits/Yonghong-Song/bpf-implement-BPF_PERF_EVENT_QUERY-for-perf-event-query/20180518-060508
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: i386-randconfig-x000-201819 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All warnings (new ones prefixed by >>):

   kernel/trace/trace_kprobe.c: In function 'bpf_get_kprobe_info':
>> kernel/trace/trace_kprobe.c:1315:17: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
  *probe_addr = (u64)tk->rp.kp.addr;
^

vim +1315 kernel/trace/trace_kprobe.c

  1290  
  1291  int bpf_get_kprobe_info(struct perf_event *event, u32 *prog_info,
  1292  const char **symbol, u64 *probe_offset,
  1293  u64 *probe_addr, bool perf_type_tracepoint)
  1294  {
  1295  const char *pevent = trace_event_name(event->tp_event);
  1296  const char *group = event->tp_event->class->system;
  1297  struct trace_kprobe *tk;
  1298  
  1299  if (perf_type_tracepoint)
  1300  tk = find_trace_kprobe(pevent, group);
  1301  else
  1302  tk = event->tp_event->data;
  1303  if (!tk)
  1304  return -EINVAL;
  1305  
  1306  *prog_info = trace_kprobe_is_return(tk) ? BPF_PERF_INFO_KRETPROBE
  1307  : BPF_PERF_INFO_KPROBE;
  1308  if (tk->symbol) {
  1309  *symbol = tk->symbol;
  1310  *probe_offset = tk->rp.kp.offset;
  1311  *probe_addr = 0;
  1312  } else {
  1313  *symbol = NULL;
  1314  *probe_offset = 0;
> 1315  *probe_addr = (u64)tk->rp.kp.addr;
  1316  }
  1317  return 0;
  1318  }
  1319  #endif  /* CONFIG_PERF_EVENTS */
  1320  
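
The warning appears because on i386 a pointer is 32 bits wide while u64 is 64 bits, so the direct cast changes size. A common way to silence it is to convert through a pointer-sized integer first. The userspace sketch below uses uintptr_t for that; the helper name ptr_to_u64() is made up for illustration, and this is not necessarily the fix the series ends up using.

```c
#include <stdint.h>

/*
 * Hypothetical helper: widen a pointer to 64 bits without a
 * size-mismatch warning. The first cast goes to uintptr_t, an integer
 * type as wide as a pointer; the second is a plain integer widening.
 */
uint64_t ptr_to_u64(const void *p)
{
	return (uint64_t)(uintptr_t)p;
}
```

In kernel code the equivalent idiom would be a cast through unsigned long before the cast to u64.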

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation






[PATCH iproute2] tc: allow 0% for percent options

2018-05-17 Thread Stephen Hemminger
Allowing 0% is sometimes useful for example in netem loss and drop
or perhaps dropping all traffic in a HTB bin.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199745
Reported-by: stuartmars...@gmail.com
Fixes: 927e3cfb52b5 ("tc: B.W limits can now be specified in %.")
Signed-off-by: Stephen Hemminger 
---
 lib/utils.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/utils.c b/lib/utils.c
index 7b2c6dd19268..02ce67721915 100644
--- a/lib/utils.c
+++ b/lib/utils.c
@@ -105,7 +105,7 @@ int parse_percent(double *val, const char *str)
*val = strtod(str, &p) / 100.;
if (*val == HUGE_VALF || *val == HUGE_VALL)
return 1;
-   if (*val == 0.0 || (*p && strcmp(p, "%")))
+   if (*p && strcmp(p, "%"))
return -1;
 
return 0;
-- 
2.17.0
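
To make the effect of the one-line change concrete, here is a small userspace sketch modeled on the patched function. It is an illustration only: the name parse_percent_sketch() is hypothetical, and the body is reconstructed from the diff context above rather than copied from lib/utils.c.

```c
#include <math.h>
#include <stdlib.h>
#include <string.h>

/*
 * Sketch of the post-patch logic: 0% is accepted, and only trailing
 * text other than a single "%" is rejected.
 */
int parse_percent_sketch(double *val, const char *str)
{
	char *p;

	*val = strtod(str, &p) / 100.;
	if (*val == HUGE_VALF || *val == HUGE_VALL)
		return 1;	/* overflow reported by strtod() */
	if (*p && strcmp(p, "%"))
		return -1;	/* trailing text that is not exactly "%" */

	return 0;
}
```

With this logic "0%" parses to 0.0 and returns 0, while an input such as "10x" is still rejected with -1.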



Re: [PATCH v3 net-next 3/6] tcp: add SACK compression

2018-05-17 Thread Toke Høiland-Jørgensen
Eric Dumazet  writes:

> When TCP receives an out-of-order packet, it immediately sends
> a SACK packet, generating network load but also forcing the
> receiver to send 1-MSS pathological packets, increasing its
> RTX queue length/depth, and thus processing time.
>
> Wifi networks suffer from this aggressive behavior, but generally
> speaking, all these SACK packets add fuel to the fire when networks
> are under congestion.
>
> This patch adds a high resolution timer and tp->compressed_ack counter.
>
> Instead of sending a SACK, we program this timer with a small delay,
> based on RTT and capped to 1 ms :
>
>   delay = min ( 5 % of RTT, 1 ms)
>
> If subsequent SACKs need to be sent while the timer has not yet
> expired, we simply increment tp->compressed_ack.
>
> When timer expires, a SACK is sent with the latest information.
> Whenever an ACK is sent (if data is sent, or if in-order
> data is received) timer is canceled.
>
> Note that tcp_sack_new_ofo_skb() is able to force a SACK to be sent
> if the sack blocks need to be shuffled, even if the timer has not
> expired.
>
> A new SNMP counter is added in the following patch.
>
> Two other patches add sysctls to allow changing the 1,000,000 and 44
> values that this commit hard-coded.
>
> Signed-off-by: Eric Dumazet 

Acked-by: Toke Høiland-Jørgensen 
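
For readers following along, the delay rule quoted above (5 % of RTT, capped at 1 ms) can be sketched as below. This is a standalone illustration, not the code from the patch: the function name and the plain microsecond RTT argument are assumptions.

```c
#include <stdint.h>

#define NSEC_PER_USEC 1000ULL
#define NSEC_PER_MSEC 1000000ULL

/*
 * Hypothetical helper: compute the SACK compression delay in
 * nanoseconds from an RTT given in microseconds.
 */
uint64_t sack_compress_delay_ns(uint64_t rtt_us)
{
	uint64_t delay = rtt_us * NSEC_PER_USEC / 20;	/* 5% of the RTT */

	return delay < NSEC_PER_MSEC ? delay : NSEC_PER_MSEC;
}
```

A 10 ms RTT gives a 500 us delay; anything with an RTT above 20 ms hits the 1 ms cap.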



Re: [PATCH] ath10k: transmit queued frames after waking queues

2018-05-17 Thread Adrian Chadd
On Thu, 17 May 2018 at 16:16, Niklas Cassel 
wrote:

> diff --git a/drivers/net/wireless/ath/ath10k/txrx.c
b/drivers/net/wireless/ath/ath10k/txrx.c
> index cda164f6e9f6..1d3b2d2c3fee 100644
> --- a/drivers/net/wireless/ath/ath10k/txrx.c
> +++ b/drivers/net/wireless/ath/ath10k/txrx.c
> @@ -95,6 +95,9 @@ int ath10k_txrx_tx_unref(struct ath10k_htt *htt,
>  wake_up(&htt->empty_tx_wq);
>  spin_unlock_bh(&htt->tx_lock);

> +   if (htt->num_pending_tx <= 3 && !list_empty(&ar->txqs))
> +   ath10k_mac_tx_push_pending(ar);
> +

Just sanity checking - what's protecting htt->num_pending_tx? Or is it
serialised some other way?

>  dma_unmap_single(dev, skb_cb->paddr, msdu->len, DMA_TO_DEVICE);

>  ath10k_report_offchan_tx(htt->ar, msdu);
> --
> 2.17.0


Regression bisected to: softirq: Let ksoftirqd do its job

2018-05-17 Thread Ben Greear

One of my out-of-tree patches is a network impairment tool that acts a lot like
an Ethernet bridge with latency, jitter, etc.

We noticed recently that we were seeing igb adapter errors when testing with
our emulator at high speeds.  For whatever reason, it is only easily
reproduced when we add jitter to our emulator.  This would cause a bit more
CPU usage and lock contention in our software, and would increase the skb
pkts allocated at any given time.

I bisected the problem to the commit below:

Author: Eric Dumazet 
Date:   Wed Aug 31 10:42:29 2016 -0700

softirq: Let ksoftirqd do its job

A while back, Paolo and Hannes sent an RFC patch adding threaded-able
napi poll loop support : (https://patchwork.ozlabs.org/patch/620657/)


If I replace my emulator with a bridge, then I do not see the problem.  But,
I also do not (or very rarely?) see the problem when configuring the emulator
with zero latency and jitter, which is how the bridge would act.

Any idea what sort of (bad?) behaviour would be able to cause this tx q timeout?

If you have any interest, I will be happy to email you my out-of-tree patches
and instructions to reproduce the problem.


The kernel splat looks like this, and repeats often:


May 17 16:03:09 localhost.localdomain kernel: audit: type=1131 audit(1526598189.492:159): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-hostnamed 
comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'

May 17 16:03:39 localhost.localdomain kernel: [ cut here 
]
May 17 16:03:39 localhost.localdomain kernel: WARNING: CPU: 5 PID: 0 at 
/home/greearb/git/linux-bisect/net/sched/sch_generic.c:316 
dev_watchdog+0x234/0x240
May 17 16:03:39 localhost.localdomain kernel: NETDEV WATCHDOG: eth5 (igb): 
transmit queue 0 timed out
May 17 16:03:39 localhost.localdomain kernel: Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 fuse macvlan wanlink(O) pktgen 
cfg80211 sunrpc coretemp intel_rapl x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm irqbypass ipmi_ssif iTCO_wdt iTCO_vendor_support joydev i2c_i801 lpc_ich 
i2c_smbus ioatdma shpchp wmi ipmi_si ipmi_msghandler tpm_tis tpm_tis_core tpm acpi_power_meter acpi_pad sch_fq_codel ast drm_kms_helper ttm drm igb hwmon ptp 
pps_core dca i2c_algo_bit i2c_core fjes ipv6 crc_ccitt [last unloaded: nf_conntrack]

May 17 16:03:39 localhost.localdomain kernel: CPU: 5 PID: 0 Comm: swapper/5 
Tainted: G   O4.8.0-rc7+ #132
May 17 16:03:39 localhost.localdomain kernel: Hardware name: Iron_Systems,Inc 
CS-CAD-2U-A02/X10SRL-F, BIOS 2.0b 05/02/2017
May 17 16:03:39 localhost.localdomain kernel:   
88087fd43d78 81417eb1 88087fd43dc8
May 17 16:03:39 localhost.localdomain kernel:   
88087fd43db8 81103556 013c7fd43da8
May 17 16:03:39 localhost.localdomain kernel:   
880854221940 0005 880854bb8000
May 17 16:03:39 localhost.localdomain kernel: Call Trace:
May 17 16:03:39 localhost.localdomain kernel:[] 
dump_stack+0x63/0x82
May 17 16:03:39 localhost.localdomain kernel:  [] 
__warn+0xc6/0xe0
May 17 16:03:39 localhost.localdomain kernel:  [] 
warn_slowpath_fmt+0x4a/0x50
May 17 16:03:39 localhost.localdomain kernel:  [] 
dev_watchdog+0x234/0x240
May 17 16:03:39 localhost.localdomain kernel:  [] ? 
qdisc_rcu_free+0x40/0x40
May 17 16:03:39 localhost.localdomain kernel:  [] 
call_timer_fn+0x30/0x150
May 17 16:03:39 localhost.localdomain kernel:  [] ? 
qdisc_rcu_free+0x40/0x40
May 17 16:03:39 localhost.localdomain kernel:  [] 
run_timer_softirq+0x1ea/0x450
May 17 16:03:39 localhost.localdomain kernel:  [] ? 
ktime_get+0x37/0xa0
May 17 16:03:39 localhost.localdomain kernel:  [] ? 
lapic_next_deadline+0x21/0x30
May 17 16:03:39 localhost.localdomain kernel:  [] ? 
clockevents_program_event+0x7d/0x120
May 17 16:03:39 localhost.localdomain kernel:  [] 
__do_softirq+0xca/0x2d0
May 17 16:03:39 localhost.localdomain kernel:  [] 
irq_exit+0xb3/0xc0
May 17 16:03:39 localhost.localdomain kernel:  [] 
smp_apic_timer_interrupt+0x3d/0x50
May 17 16:03:39 localhost.localdomain kernel:  [] 
apic_timer_interrupt+0x82/0x90
May 17 16:03:39 localhost.localdomain kernel:[] ? 
cpuidle_enter_state+0x126/0x300
May 17 16:03:39 localhost.localdomain kernel:  [] 
cpuidle_enter+0x12/0x20
May 17 16:03:39 localhost.localdomain kernel:  [] 
call_cpuidle+0x25/0x40
May 17 16:03:39 localhost.localdomain kernel:  [] 
cpu_startup_entry+0x2ba/0x380
May 17 16:03:39 localhost.localdomain kernel:  [] 
start_secondary+0x149/0x170
May 17 16:03:39 localhost.localdomain kernel: ---[ end trace f62c6dd947785e8f 
]---


Thanks,
Ben

--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



[PATCH] ath10k: transmit queued frames after waking queues

2018-05-17 Thread Niklas Cassel
The following problem was observed when running iperf:

[  3]  0.0- 1.0 sec  2.00 MBytes  16.8 Mbits/sec
[  3]  1.0- 2.0 sec  3.12 MBytes  26.2 Mbits/sec
[  3]  2.0- 3.0 sec  3.25 MBytes  27.3 Mbits/sec
[  3]  3.0- 4.0 sec   655 KBytes  5.36 Mbits/sec
[  3]  4.0- 5.0 sec  0.00 Bytes  0.00 bits/sec
[  3]  5.0- 6.0 sec  0.00 Bytes  0.00 bits/sec
[  3]  6.0- 7.0 sec  0.00 Bytes  0.00 bits/sec
[  3]  7.0- 8.0 sec  0.00 Bytes  0.00 bits/sec
[  3]  8.0- 9.0 sec  0.00 Bytes  0.00 bits/sec
[  3]  9.0-10.0 sec  0.00 Bytes  0.00 bits/sec
[  3]  0.0-10.3 sec  9.01 MBytes  7.32 Mbits/sec

There are frames in the ieee80211_txq and there are frames that have
been removed from this queue, but haven't yet been sent on the wire
(num_pending_tx).

When num_pending_tx reaches max_num_pending_tx, we will stop the queues
by calling ieee80211_stop_queues().

As frames that have previously been sent for transmission
(num_pending_tx) are completed, we will decrease num_pending_tx and wake
the queues by calling ieee80211_wake_queue(). ieee80211_wake_queue()
does not call wake_tx_queue, so we might still have frames in the
queue at this point.

While the queues were stopped, the socket buffer might have filled up,
and in order for user space to write more, we need to free the frames
in the queue, since they are accounted to the socket. In order to free
them, we first need to transmit them.

In order to avoid trying to flush the queue every time we free a frame,
only do this when there are 3 or fewer frames pending, and while we
actually have frames in the queue. This logic was copied from
mt76_txq_schedule (mt76), one of few other drivers that are actually
using wake_tx_queue.

Suggested-by: Toke Høiland-Jørgensen 
Signed-off-by: Niklas Cassel 
---
 drivers/net/wireless/ath/ath10k/txrx.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/wireless/ath/ath10k/txrx.c 
b/drivers/net/wireless/ath/ath10k/txrx.c
index cda164f6e9f6..1d3b2d2c3fee 100644
--- a/drivers/net/wireless/ath/ath10k/txrx.c
+++ b/drivers/net/wireless/ath/ath10k/txrx.c
@@ -95,6 +95,9 @@ int ath10k_txrx_tx_unref(struct ath10k_htt *htt,
wake_up(&htt->empty_tx_wq);
spin_unlock_bh(&htt->tx_lock);
 
+   if (htt->num_pending_tx <= 3 && !list_empty(&ar->txqs))
+   ath10k_mac_tx_push_pending(ar);
+
dma_unmap_single(dev, skb_cb->paddr, msdu->len, DMA_TO_DEVICE);
 
ath10k_report_offchan_tx(htt->ar, msdu);
-- 
2.17.0
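
The stall and the fix described in the commit message can be modeled with a toy queue. Everything here is hypothetical: struct txq_model, push_pending() and tx_complete() are stand-ins invented for illustration and do not mirror the ath10k or mac80211 data structures. Only the "3 or fewer pending and queue not empty" condition comes from the patch.

```c
#include <stdbool.h>

/* Toy model of a driver TX path (not the real ath10k structures). */
struct txq_model {
	int num_pending;	/* frames handed to hardware, not yet completed */
	int max_pending;	/* limit at which the queues are stopped */
	int queued;		/* frames still waiting in the software queue */
	bool stopped;		/* true while the queues are stopped */
};

/* Move frames from the software queue to hardware until the limit. */
void push_pending(struct txq_model *q)
{
	while (q->queued > 0 && q->num_pending < q->max_pending) {
		q->queued--;
		q->num_pending++;
	}
	if (q->num_pending >= q->max_pending)
		q->stopped = true;
}

/* One frame finished transmitting. */
void tx_complete(struct txq_model *q)
{
	q->num_pending--;
	q->stopped = false;	/* models waking the queues */

	/* Condition from the patch: flush only when nearly drained. */
	if (q->num_pending <= 3 && q->queued > 0)
		push_pending(q);
}
```

Without the final check in tx_complete(), frames left in the software queue would wait for some later event to push them, which is the stall visible in the iperf output.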



Re: [PATCH bpf] bpf: fix truncated jump targets on heavy expansions

2018-05-17 Thread Alexei Starovoitov
On Thu, May 17, 2018 at 01:44:11AM +0200, Daniel Borkmann wrote:
> Recently during testing, I ran into the following panic:
> 
> Therefore it becomes necessary to detect and reject any such occasions
> in a generic way for native eBPF and cBPF to eBPF migrations. For
> the latter we can simply check bounds in the bpf_convert_filter()'s
> BPF_EMIT_JMP helper macro and bail out once we surpass limits. The
> bpf_patch_insn_single() for native eBPF (and cBPF to eBPF in case
> of subsequent hardening) is a bit more complex in that we need to
> detect such truncations before hitting the bpf_prog_realloc(). Thus
> the latter is split into an extra pass to probe problematic offsets
> on the original program in order to fail early. With that in place
> and carefully tested I no longer hit the panic and the rewrites are
> rejected properly. The above example panic I've seen on bpf-next,
> though the issue itself is generic in that a guard against this issue
> in bpf seems more appropriate in this case.
> 
> Signed-off-by: Daniel Borkmann 

Nice catch! Applied.
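
For context, the jump offset in an eBPF instruction is a signed 16-bit field, so a rewrite that pushes a target further away than that cannot be encoded. A bounds check of the kind the commit message describes might look like the sketch below; the function name is an assumption and this is not the code from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical check: does a recomputed jump offset still fit the
 * signed 16-bit off field of an eBPF instruction?
 */
bool jmp_off_fits(int64_t off)
{
	return off >= INT16_MIN && off <= INT16_MAX;
}
```

A caller would compute the new offset in a wider type first and reject the rewrite when this returns false, instead of letting the value truncate.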



Re: [PATCH ghak81 V3 3/3] audit: collect audit task parameters

2018-05-17 Thread Paul Moore
On Wed, May 16, 2018 at 7:55 AM, Richard Guy Briggs  wrote:
> The audit-related parameters in struct task_struct should ideally be
> collected together and accessed through a standard audit API.
>
> Collect the existing loginuid, sessionid and audit_context together in a
> new struct audit_task_info called "audit" in struct task_struct.
>
> Use kmem_cache to manage this pool of memory.
> Un-inline audit_free() to be able to always recover that memory.
>
> See: https://github.com/linux-audit/audit-kernel/issues/81
>
> Signed-off-by: Richard Guy Briggs 
> ---
>  include/linux/audit.h | 34 --
>  include/linux/sched.h |  5 +
>  init/init_task.c  |  3 +--
>  init/main.c   |  2 ++
>  kernel/auditsc.c  | 51 
> ++-
>  kernel/fork.c |  2 +-
>  6 files changed, 71 insertions(+), 26 deletions(-)

As discussed on-list and offline, I'm going to hold off on this change
until the audit container ID work is further along.  That is the main
driver for this change, and until that is closer to ready I just can't
justify the extra overhead.

> diff --git a/include/linux/audit.h b/include/linux/audit.h
> index 69c7847..4f824c4 100644
> --- a/include/linux/audit.h
> +++ b/include/linux/audit.h
> @@ -216,8 +216,15 @@ static inline void audit_log_task_info(struct 
> audit_buffer *ab,
>
>  /* These are defined in auditsc.c */
> /* Public API */
> +struct audit_task_info {
> +   kuid_t  loginuid;
> +   unsigned intsessionid;
> +   struct audit_context*ctx;
> +};
> +extern struct audit_task_info init_struct_audit;
> +extern void __init audit_task_init(void);
>  extern int  audit_alloc(struct task_struct *task);
> -extern void __audit_free(struct task_struct *task);
> +extern void audit_free(struct task_struct *task);
>  extern void __audit_syscall_entry(int major, unsigned long a0, unsigned long 
> a1,
>   unsigned long a2, unsigned long a3);
>  extern void __audit_syscall_exit(int ret_success, long ret_value);
> @@ -239,12 +246,15 @@ extern void audit_seccomp_actions_logged(const char 
> *names,
>
>  static inline void audit_set_context(struct task_struct *task, struct 
> audit_context *ctx)
>  {
> -   task->audit_context = ctx;
> +   task->audit->ctx = ctx;
>  }
>
>  static inline struct audit_context *audit_context(void)
>  {
> -   return current->audit_context;
> +   if (current->audit)
> +   return current->audit->ctx;
> +   else
> +   return NULL;
>  }
>
>  static inline bool audit_dummy_context(void)
> @@ -252,11 +262,7 @@ static inline bool audit_dummy_context(void)
> void *p = audit_context();
> return !p || *(int *)p;
>  }
> -static inline void audit_free(struct task_struct *task)
> -{
> -   if (unlikely(task->audit_context))
> -   __audit_free(task);
> -}
> +
>  static inline void audit_syscall_entry(int major, unsigned long a0,
>unsigned long a1, unsigned long a2,
>unsigned long a3)
> @@ -328,12 +334,18 @@ extern int auditsc_get_stamp(struct audit_context *ctx,
>
>  static inline kuid_t audit_get_loginuid(struct task_struct *tsk)
>  {
> -   return tsk->loginuid;
> +   if (tsk->audit)
> +   return tsk->audit->loginuid;
> +   else
> +   return INVALID_UID;
>  }
>
>  static inline unsigned int audit_get_sessionid(struct task_struct *tsk)
>  {
> -   return tsk->sessionid;
> +   if (tsk->audit)
> +   return tsk->audit->sessionid;
> +   else
> +   return AUDIT_SID_UNSET;
>  }
>
>  extern void __audit_ipc_obj(struct kern_ipc_perm *ipcp);
> @@ -458,6 +470,8 @@ static inline void audit_fanotify(unsigned int response)
>  extern int audit_n_rules;
>  extern int audit_signals;
>  #else /* CONFIG_AUDITSYSCALL */
> +static inline void __init audit_task_init(void)
> +{ }
>  static inline int audit_alloc(struct task_struct *task)
>  {
> return 0;
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index b3d697f..6a5db0e 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -29,7 +29,6 @@
>  #include 
>
>  /* task_struct member predeclarations (sorted alphabetically): */
> -struct audit_context;
>  struct backing_dev_info;
>  struct bio_list;
>  struct blk_plug;
> @@ -832,10 +831,8 @@ struct task_struct {
>
> struct callback_head*task_works;
>
> -   struct audit_context*audit_context;
>  #ifdef CONFIG_AUDITSYSCALL
> -   kuid_t  loginuid;
> -   unsigned intsessionid;
> +   struct audit_task_info  *audit;
>  #endif
> struct seccomp  seccomp;
>
> diff --git a/init/init_task.c b/init/init_task.c
> index 

Re: [PATCH ghak81 V3 1/3] audit: use new audit_context access funciton for seccomp_actions_logged

2018-05-17 Thread Paul Moore
On Wed, May 16, 2018 at 7:55 AM, Richard Guy Briggs  wrote:
> On the rebase of the following commit on the new seccomp actions_logged
> function, one audit_context access was missed.
>
> commit cdfb6b341f0f2409aba24b84f3b4b2bba50be5c5
> ("audit: use inline function to get audit context")
>
> Signed-off-by: Richard Guy Briggs 
> ---
>  kernel/auditsc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Merged into audit/next, thanks for the follow-up.

> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> index cbab0da..f3d3dc6 100644
> --- a/kernel/auditsc.c
> +++ b/kernel/auditsc.c
> @@ -2497,7 +2497,7 @@ void audit_seccomp_actions_logged(const char *names, 
> const char *old_names,
> if (!audit_enabled)
> return;
>
> -   ab = audit_log_start(current->audit_context, GFP_KERNEL,
> +   ab = audit_log_start(audit_context(), GFP_KERNEL,
>  AUDIT_CONFIG_CHANGE);
> if (unlikely(!ab))
> return;
> --
> 1.8.3.1

-- 
paul moore
www.paul-moore.com


Re: [PATCH ghak81 V3 2/3] audit: normalize loginuid read access

2018-05-17 Thread Paul Moore
On Wed, May 16, 2018 at 7:55 AM, Richard Guy Briggs  wrote:
> Recognizing that the loginuid is an internal audit value, use an access
> function to retrieve the audit loginuid value for the task rather than
> reaching directly into the task struct to get it.
>
> Signed-off-by: Richard Guy Briggs 
> ---
>  kernel/auditsc.c | 24 +++-
>  1 file changed, 15 insertions(+), 9 deletions(-)

Also merged into audit/next.

> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> index f3d3dc6..ef3e189 100644
> --- a/kernel/auditsc.c
> +++ b/kernel/auditsc.c
> @@ -374,7 +374,7 @@ static int audit_field_compare(struct task_struct *tsk,
> case AUDIT_COMPARE_EGID_TO_OBJ_GID:
> return audit_compare_gid(cred->egid, name, f, ctx);
> case AUDIT_COMPARE_AUID_TO_OBJ_UID:
> -   return audit_compare_uid(tsk->loginuid, name, f, ctx);
> +   return audit_compare_uid(audit_get_loginuid(tsk), name, f, 
> ctx);
> case AUDIT_COMPARE_SUID_TO_OBJ_UID:
> return audit_compare_uid(cred->suid, name, f, ctx);
> case AUDIT_COMPARE_SGID_TO_OBJ_GID:
> @@ -385,7 +385,8 @@ static int audit_field_compare(struct task_struct *tsk,
> return audit_compare_gid(cred->fsgid, name, f, ctx);
> /* uid comparisons */
> case AUDIT_COMPARE_UID_TO_AUID:
> -   return audit_uid_comparator(cred->uid, f->op, tsk->loginuid);
> +   return audit_uid_comparator(cred->uid, f->op,
> +   audit_get_loginuid(tsk));
> case AUDIT_COMPARE_UID_TO_EUID:
> return audit_uid_comparator(cred->uid, f->op, cred->euid);
> case AUDIT_COMPARE_UID_TO_SUID:
> @@ -394,11 +395,14 @@ static int audit_field_compare(struct task_struct *tsk,
> return audit_uid_comparator(cred->uid, f->op, cred->fsuid);
> /* auid comparisons */
> case AUDIT_COMPARE_AUID_TO_EUID:
> -   return audit_uid_comparator(tsk->loginuid, f->op, cred->euid);
> +   return audit_uid_comparator(audit_get_loginuid(tsk), f->op,
> +   cred->euid);
> case AUDIT_COMPARE_AUID_TO_SUID:
> -   return audit_uid_comparator(tsk->loginuid, f->op, cred->suid);
> +   return audit_uid_comparator(audit_get_loginuid(tsk), f->op,
> +   cred->suid);
> case AUDIT_COMPARE_AUID_TO_FSUID:
> -   return audit_uid_comparator(tsk->loginuid, f->op, 
> cred->fsuid);
> +   return audit_uid_comparator(audit_get_loginuid(tsk), f->op,
> +   cred->fsuid);
> /* euid comparisons */
> case AUDIT_COMPARE_EUID_TO_SUID:
> return audit_uid_comparator(cred->euid, f->op, cred->suid);
> @@ -611,7 +615,8 @@ static int audit_filter_rules(struct task_struct *tsk,
> result = match_tree_refs(ctx, rule->tree);
> break;
> case AUDIT_LOGINUID:
> -   result = audit_uid_comparator(tsk->loginuid, f->op, 
> f->uid);
> +   result = audit_uid_comparator(audit_get_loginuid(tsk),
> + f->op, f->uid);
> break;
> case AUDIT_LOGINUID_SET:
> result = audit_comparator(audit_loginuid_set(tsk), 
> f->op, f->val);
> @@ -2278,14 +2283,15 @@ int audit_signal_info(int sig, struct task_struct *t)
>  {
> struct audit_aux_data_pids *axp;
> struct audit_context *ctx = audit_context();
> -   kuid_t uid = current_uid(), t_uid = task_uid(t);
> +   kuid_t uid = current_uid(), auid, t_uid = task_uid(t);
>
> if (auditd_test_task(t) &&
> (sig == SIGTERM || sig == SIGHUP ||
>  sig == SIGUSR1 || sig == SIGUSR2)) {
> audit_sig_pid = task_tgid_nr(current);
> -   if (uid_valid(current->loginuid))
> -   audit_sig_uid = current->loginuid;
> +   auid = audit_get_loginuid(current);
> +   if (uid_valid(auid))
> +   audit_sig_uid = auid;
> else
> audit_sig_uid = uid;
> security_task_getsecid(current, &audit_sig_sid);
> --
> 1.8.3.1
>



-- 
paul moore
www.paul-moore.com


Re: [patch net-next RFC 04/12] dsa: set devlink port attrs for dsa ports

2018-05-17 Thread Florian Fainelli
On 05/17/2018 03:40 PM, Andrew Lunn wrote:
> On Thu, May 17, 2018 at 03:06:36PM -0700, Florian Fainelli wrote:
>> On 05/17/2018 02:08 PM, Andrew Lunn wrote:
>>> On Thu, May 17, 2018 at 10:48:55PM +0200, Jiri Pirko wrote:
 Thu, May 17, 2018 at 09:14:32PM CEST, f.faine...@gmail.com wrote:
> On 05/17/2018 10:39 AM, Jiri Pirko wrote:
 That is compiled inside "fixed_phy", isn't it?
>>>
>>> It matches what CONFIG_FIXED_PHY is, so if it's built-in it also becomes
>>> built-in, if is modular, it is also modular, this was fixed with
>>> 40013ff20b1beed31184935fc0aea6a859d4d4ef ("net: dsa: Fix functional
>>> dsa-loop dependency on FIXED_PHY")
>>
>> Now I have it compiled as module, and after modprobe dsa_loop I see:
>> [ 1168.129202] libphy: Fixed MDIO Bus: probed
>> [ 1168.222716] dsa-loop fixed-0:1f: DSA mockup driver: 0x1f
>>
>> These messages I did not see when I had fixed_phy compiled as built-in.
>>
>> But I still see no netdevs :/
>
> The platform data assumes there is a network device named "eth0" as the

 Oups, I missed, I created dummy device and modprobed again. Now I see:

 $ sudo devlink port
 mdio_bus/fixed-0:1f/0: type eth netdev lan1
 mdio_bus/fixed-0:1f/1: type eth netdev lan2
 mdio_bus/fixed-0:1f/2: type eth netdev lan3
 mdio_bus/fixed-0:1f/3: type eth netdev lan4
 mdio_bus/fixed-0:1f/4: type notset
 mdio_bus/fixed-0:1f/5: type notset
 mdio_bus/fixed-0:1f/6: type notset
 mdio_bus/fixed-0:1f/7: type notset
 mdio_bus/fixed-0:1f/8: type notset
 mdio_bus/fixed-0:1f/9: type notset
 mdio_bus/fixed-0:1f/10: type notset
 mdio_bus/fixed-0:1f/11: type notset

 I wonder why there are ports 4-11
>>>
>>> Hi Jiri
>>>
>>> ds = dsa_switch_alloc(>dev, DSA_MAX_PORTS);
>>>
>>> It is allocating a switch with 12 ports. However only 4 of them have
>>> names. So the core only creates slave devices for those 4.
>>>
>>> This is a useful test. Real hardware often has unused ports. A WiFi AP
>>> with a 7 port switch which only uses 6 ports is often seen.
>>
>> The following patch should fix this:
>>
>>
>> diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
>> index adf50fbc4c13..a06c29ec91f0 100644
>> --- a/net/dsa/dsa2.c
>> +++ b/net/dsa/dsa2.c
>> @@ -262,13 +262,14 @@ static int dsa_port_setup(struct dsa_port *dp)
>>
>> memset(&dp->devlink_port, 0, sizeof(dp->devlink_port));
>>
>> +   if (dp->type == DSA_PORT_TYPE_UNUSED)
>> +   return 0;
>> +
>> err = devlink_port_register(ds->devlink, &dp->devlink_port,
>> dp->index);
> 
> Hi Florian, Jiri
> 
> Maybe it is better to add a devlink port type unused?

The port does not exist on the switch, so it should not even be
registered IMHO.
-- 
Florian


Re: [net-next PATCH v2 3/4] net-sysfs: Add interface for Rx queue map per Tx queue

2018-05-17 Thread Nambiar, Amritha
On 5/17/2018 12:05 PM, Florian Fainelli wrote:
> On 05/15/2018 06:26 PM, Amritha Nambiar wrote:
>> Extend transmit queue sysfs attribute to configure Rx queue map
>> per Tx queue. By default no receive queues are configured for the
>> Tx queue.
>>
>> - /sys/class/net/eth0/queues/tx-*/xps_rxqs
> 
> Please include an update to Documentation/ABI/testing/sysfs-class-net
> with your new attribute.
> 
Will do in the next version.
Thanks.


Re: [patch net-next RFC 04/12] dsa: set devlink port attrs for dsa ports

2018-05-17 Thread Andrew Lunn
On Thu, May 17, 2018 at 03:06:36PM -0700, Florian Fainelli wrote:
> On 05/17/2018 02:08 PM, Andrew Lunn wrote:
> > On Thu, May 17, 2018 at 10:48:55PM +0200, Jiri Pirko wrote:
> >> Thu, May 17, 2018 at 09:14:32PM CEST, f.faine...@gmail.com wrote:
> >>> On 05/17/2018 10:39 AM, Jiri Pirko wrote:
> >> That is compiled inside "fixed_phy", isn't it?
> >
> > It matches what CONFIG_FIXED_PHY is, so if it's built-in it also becomes
> > built-in, if is modular, it is also modular, this was fixed with
> > 40013ff20b1beed31184935fc0aea6a859d4d4ef ("net: dsa: Fix functional
> > dsa-loop dependency on FIXED_PHY")
> 
>  Now I have it compiled as module, and after modprobe dsa_loop I see:
>  [ 1168.129202] libphy: Fixed MDIO Bus: probed
>  [ 1168.222716] dsa-loop fixed-0:1f: DSA mockup driver: 0x1f
> 
>  These messages I did not see when I had fixed_phy compiled as built-in.
> 
>  But I still see no netdevs :/
> >>>
> >>> The platform data assumes there is a network device named "eth0" as the
> >>
> >> Oups, I missed, I created dummy device and modprobed again. Now I see:
> >>
> >> $ sudo devlink port
> >> mdio_bus/fixed-0:1f/0: type eth netdev lan1
> >> mdio_bus/fixed-0:1f/1: type eth netdev lan2
> >> mdio_bus/fixed-0:1f/2: type eth netdev lan3
> >> mdio_bus/fixed-0:1f/3: type eth netdev lan4
> >> mdio_bus/fixed-0:1f/4: type notset
> >> mdio_bus/fixed-0:1f/5: type notset
> >> mdio_bus/fixed-0:1f/6: type notset
> >> mdio_bus/fixed-0:1f/7: type notset
> >> mdio_bus/fixed-0:1f/8: type notset
> >> mdio_bus/fixed-0:1f/9: type notset
> >> mdio_bus/fixed-0:1f/10: type notset
> >> mdio_bus/fixed-0:1f/11: type notset
> >>
> >> I wonder why there are ports 4-11
> > 
> > Hi Jiri
> > 
> > ds = dsa_switch_alloc(>dev, DSA_MAX_PORTS);
> > 
> > It is allocating a switch with 12 ports. However only 4 of them have
> > names. So the core only creates slave devices for those 4.
> > 
> > This is a useful test. Real hardware often has unused ports. A WiFi AP
> > with a 7 port switch which only uses 6 ports is often seen.
> 
> The following patch should fix this:
> 
> 
> diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
> index adf50fbc4c13..a06c29ec91f0 100644
> --- a/net/dsa/dsa2.c
> +++ b/net/dsa/dsa2.c
> @@ -262,13 +262,14 @@ static int dsa_port_setup(struct dsa_port *dp)
> 
> memset(&dp->devlink_port, 0, sizeof(dp->devlink_port));
> 
> +   if (dp->type == DSA_PORT_TYPE_UNUSED)
> +   return 0;
> +
> err = devlink_port_register(ds->devlink, &dp->devlink_port,
> dp->index);

Hi Florian, Jiri

Maybe it is better to add a devlink port type unused?

  Andrew


Re: [PATCH iproute2] ip link: Do not call ll_name_to_index when creating a new link

2018-05-17 Thread Stephen Hemminger
On Thu, 17 May 2018 16:22:37 -0600
dsah...@kernel.org wrote:

> From: David Ahern 
> 
> Using iproute2 to create a bridge and add 4094 vlans to it can take from
> 2 to 3 *minutes*. The reason is the extraneous call to ll_name_to_index.
> ll_name_to_index results in an ioctl(SIOCGIFINDEX) call which in turn
> invokes dev_load. If the index does not exist, which it won't when
> creating a new link, dev_load calls modprobe twice -- once for
> netdev-NAME and again for NAME. This is unnecessary overhead for each
> link create.
> 
> When ip link is invoked for a new device, there is no reason to
> call ll_name_to_index for the new device. With this patch, creating
> a bridge and adding 4094 vlans takes less than 3 *seconds*.
> 
> Signed-off-by: David Ahern 

Yes this looks like a real problem.
Isn't the cache supposed to reduce this?

Don't like to make lots of special case flags.



Re: [bpf PATCH v2 1/2] bpf: sockmap update rollback on error can incorrectly dec prog refcnt

2018-05-17 Thread Daniel Borkmann
On 05/17/2018 11:06 PM, John Fastabend wrote:
> If the user were to only attach one of the parse or verdict programs
> then it is possible a subsequent sockmap update could incorrectly
> decrement the refcnt on the program. This happens because in the
> rollback logic, after an error, we have to decrement the program
> reference count when its been incremented. However, we only increment
> the program reference count if the user has both a verdict and a
> parse program. The reason for this is because, at least at the
> moment, both are required for any one to be meaningful. The problem
> fixed here is in the rollback path we decrement the program refcnt
> even if only one existing. But we never incremented the refcnt in
> the first place creating an imbalance.
> 
> This patch fixes the error path to handle this case.
> 
> Fixes: 2f857d04601a ("bpf: sockmap, remove STRPARSER map_flags and add 
> multi-map support")
> Reported-by: Daniel Borkmann 
> Signed-off-by: John Fastabend 
> Acked-by: Martin KaFai Lau 

Applied to bpf tree, thanks!
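
The imbalance described in the commit message comes down to a rollback that undoes more than was done. A minimal sketch of the balanced version follows; struct prog, the fail flag and update_with_rollback() are hypothetical stand-ins, not the sockmap code.

```c
#include <stddef.h>

/* Hypothetical stand-in for a refcounted BPF program. */
struct prog {
	int refcnt;
};

/*
 * References are taken only when both programs are attached, so the
 * error path must drop them only in that same case.
 */
int update_with_rollback(struct prog *parse, struct prog *verdict, int fail)
{
	int took_refs = 0;

	if (parse && verdict) {
		parse->refcnt++;
		verdict->refcnt++;
		took_refs = 1;
	}

	if (fail) {
		/* Roll back exactly what was taken above, nothing more. */
		if (took_refs) {
			parse->refcnt--;
			verdict->refcnt--;
		}
		return -1;
	}

	return 0;
}
```

With only one program attached, a failed update leaves its count untouched, which is the case the fix addresses.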


Re: [PATCH net] net: dsa: Do not register devlink for unused ports

2018-05-17 Thread Florian Fainelli
On 05/17/2018 03:16 PM, Florian Fainelli wrote:
> Even if commit 1d27732f411d ("net: dsa: setup and teardown ports") indicated
> that registering a devlink instance for unused ports is not a problem, and
> this is true, this can be confusing nonetheless, so let's not do it.
> 
> Fixes: 1d27732f411d ("net: dsa: setup and teardown ports")
> Reported-by: Jiri Pirko 
> Signed-off-by: Florian Fainelli 
> ---
>  net/dsa/dsa2.c | 10 ++
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
> index adf50fbc4c13..cc45a8ca45fb 100644
> --- a/net/dsa/dsa2.c
> +++ b/net/dsa/dsa2.c
> @@ -262,13 +262,14 @@ static int dsa_port_setup(struct dsa_port *dp)
>  
>   memset(&dp->devlink_port, 0, sizeof(dp->devlink_port));
>  
> + if (dp->type == DSA_PORT_TYPE_UNUSED)
> + return 0;
> +
>   err = devlink_port_register(ds->devlink, &dp->devlink_port, dp->index);
>   if (err)
>   return err;
>  
>   switch (dp->type) {
> - case DSA_PORT_TYPE_UNUSED:
> - break;
>   case DSA_PORT_TYPE_CPU:
>   case DSA_PORT_TYPE_DSA:
>   err = dsa_port_link_register_of(dp);
> @@ -293,11 +294,12 @@ static int dsa_port_setup(struct dsa_port *dp)
>  
>  static void dsa_port_teardown(struct dsa_port *dp)
>  {
> + if (dp->type == DSA_PORT_TYPE_UNUSED)
> + return;
> +
>   devlink_port_unregister(&dp->devlink_port);
>  
>   switch (dp->type) {
> - case DSA_PORT_TYPE_UNUSED:
> - break;

Actually those should be kept in there in order not to generate a
warning about DSA_PORT_TYPE_UNUSED not being handled by the switch()
case statement, I will resubmit that shortly, or we could even move the
registration until after, either way is likely fine.
-- 
Florian


Re: [PATCH bpf-next 3/3] bpf: Add mtu checking to FIB forwarding helper

2018-05-17 Thread Daniel Borkmann
On 05/17/2018 06:09 PM, David Ahern wrote:
> Add check that egress MTU can handle packet to be forwarded. If
> the MTU is less than the packet length, return 0 meaning the
> packet is expected to continue up the stack for help - e.g.,
> fragmenting the packet or sending an ICMP.
> 
> Signed-off-by: David Ahern 
> ---
>  net/core/filter.c | 10 ++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 6d0d1560bd70..c47c47a75d4b 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -4098,6 +4098,7 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
>   struct fib_nh *nh;
>   struct flowi4 fl4;
>   int err;
> + u32 mtu;
>  
>   dev = dev_get_by_index_rcu(net, params->ifindex);
>   if (unlikely(!dev))
> @@ -4149,6 +4150,10 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
>   if (res.fi->fib_nhs > 1)
>   fib_select_path(net, &res, &fl4, NULL);
>  
> + mtu = ip_mtu_from_fib_result(&res, params->ipv4_dst);
> + if (params->tot_len > mtu)
> + return 0;
> +
>   nh = &res.fi->fib_nh[res.nh_sel];
>  
>   /* do not handle lwt encaps right now */
> @@ -4188,6 +4193,7 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
>   struct flowi6 fl6;
>   int strict = 0;
>   int oif;
> + u32 mtu;
>  
>   /* link local addresses are never forwarded */
>   if (rt6_need_strict(dst) || rt6_need_strict(src))
> @@ -4250,6 +4256,10 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
>  fl6.flowi6_oif, NULL,
>  strict);
>  
> + mtu = ip6_mtu_from_fib6(f6i, dst, src);
> + if (params->tot_len > mtu)
> + return 0;
> +
>   if (f6i->fib6_nh.nh_lwtstate)
>   return 0;

Could you elaborate how this interacts in tc BPF use case where you have e.g.
GSO packets and tot_len from aggregated packets would definitely be larger
than MTU (e.g. see is_skb_forwardable() as one example on such checks)? Should
this be an opt-in via a new flag for the helper?

Thanks,
Daniel


[PATCH iproute2] ip link: Do not call ll_name_to_index when creating a new link

2018-05-17 Thread dsahern
From: David Ahern 

Using iproute2 to create a bridge and add 4094 vlans to it can take from
2 to 3 *minutes*. The reason is the extraneous call to ll_name_to_index.
ll_name_to_index results in an ioctl(SIOCGIFINDEX) call which in turn
invokes dev_load. If the index does not exist, which it won't when
creating a new link, dev_load calls modprobe twice -- once for
netdev-NAME and again for NAME. This is unnecessary overhead for each
link create.

When ip link is invoked for a new device, there is no reason to
call ll_name_to_index for the new device. With this patch, creating
a bridge and adding 4094 vlans takes less than 3 *seconds*.

Signed-off-by: David Ahern 
---
 ip/ip_common.h|  3 ++-
 ip/iplink.c   | 22 +-
 ip/iplink_vxcan.c |  3 ++-
 ip/link_veth.c|  3 ++-
 4 files changed, 19 insertions(+), 12 deletions(-)

diff --git a/ip/ip_common.h b/ip/ip_common.h
index 1b89795caa58..67f413474631 100644
--- a/ip/ip_common.h
+++ b/ip/ip_common.h
@@ -132,7 +132,8 @@ struct link_util {
 
 struct link_util *get_link_kind(const char *kind);
 
-int iplink_parse(int argc, char **argv, struct iplink_req *req, char **type);
+int iplink_parse(int argc, char **argv, struct iplink_req *req,
+char **type, bool is_add_cmd);
 
 /* iplink_bridge.c */
 void br_dump_bridge_id(const struct ifla_bridge_id *id, char *buf, size_t len);
diff --git a/ip/iplink.c b/ip/iplink.c
index e6bb4493120e..c8bf49ed3d24 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -571,7 +571,8 @@ static int iplink_parse_vf(int vf, int *argcp, char ***argvp,
return 0;
 }
 
-int iplink_parse(int argc, char **argv, struct iplink_req *req, char **type)
+int iplink_parse(int argc, char **argv, struct iplink_req *req, char **type,
+bool is_add_cmd)
 {
char *name = NULL;
char *dev = NULL;
@@ -610,7 +611,8 @@ int iplink_parse(int argc, char **argv, struct iplink_req *req, char **type)
name = *argv;
if (!dev) {
dev = name;
-   dev_index = ll_name_to_index(dev);
+   if (!is_add_cmd)
+   dev_index = ll_name_to_index(dev);
}
} else if (strcmp(*argv, "index") == 0) {
NEXT_ARG();
@@ -919,7 +921,8 @@ int iplink_parse(int argc, char **argv, struct iplink_req *req, char **type)
if (check_ifname(*argv))
invarg("\"dev\" not a valid ifname", *argv);
dev = *argv;
-   dev_index = ll_name_to_index(dev);
+   if (!is_add_cmd)
+   dev_index = ll_name_to_index(dev);
}
argc--; argv++;
}
@@ -1011,7 +1014,8 @@ int iplink_parse(int argc, char **argv, struct iplink_req *req, char **type)
return ret;
 }
 
-static int iplink_modify(int cmd, unsigned int flags, int argc, char **argv)
+static int iplink_modify(int cmd, unsigned int flags, int argc, char **argv,
+bool is_add_cmd)
 {
char *type = NULL;
struct iplink_req req = {
@@ -1022,7 +1026,7 @@ static int iplink_modify(int cmd, unsigned int flags, int argc, char **argv)
};
int ret;
 
-   ret = iplink_parse(argc, argv, &req, &type);
+   ret = iplink_parse(argc, argv, &req, &type, is_add_cmd);
if (ret < 0)
return ret;
 
@@ -1630,18 +1634,18 @@ int do_iplink(int argc, char **argv)
if (matches(*argv, "add") == 0)
return iplink_modify(RTM_NEWLINK,
 NLM_F_CREATE|NLM_F_EXCL,
-argc-1, argv+1);
+argc-1, argv+1, true);
if (matches(*argv, "set") == 0 ||
matches(*argv, "change") == 0)
return iplink_modify(RTM_NEWLINK, 0,
-argc-1, argv+1);
+argc-1, argv+1, false);
if (matches(*argv, "replace") == 0)
return iplink_modify(RTM_NEWLINK,
 NLM_F_CREATE|NLM_F_REPLACE,
-argc-1, argv+1);
+argc-1, argv+1, false);
if (matches(*argv, "delete") == 0)
return iplink_modify(RTM_DELLINK, 0,
-argc-1, argv+1);
+argc-1, argv+1, false);
} else {
 #if IPLINK_IOCTL_COMPAT
if (matches(*argv, "set") == 0)
diff --git a/ip/iplink_vxcan.c b/ip/iplink_vxcan.c
index 8b08c9a70c65..e30a784d9851 100644
--- a/ip/iplink_vxcan.c
+++ 
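
The effect of the is_add_cmd flag in the patch above can be sketched in a few lines of standalone C. fake_name_to_index() is a hypothetical stand-in for ll_name_to_index() that only counts its invocations, and resolve_dev() is an illustrative helper, not a function from iproute2.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Counts how often the (simulated) expensive lookup ran. */
int lookup_calls;

/* Hypothetical stand-in for ll_name_to_index(); always reports
 * "not found" (0), as it would for a link that does not exist yet. */
unsigned int fake_name_to_index(const char *name)
{
    assert(name != NULL);
    lookup_calls++;
    return 0;
}

/* Resolve the device index, skipping the lookup when adding a link. */
unsigned int resolve_dev(const char *name, bool is_add_cmd)
{
    unsigned int dev_index = 0;

    if (!is_add_cmd)
        dev_index = fake_name_to_index(name);
    return dev_index;
}
```

For an add, the lookup is never attempted, so the ioctl and the modprobe calls described in the commit message are avoided.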

[PATCH net] net: dsa: Do not register devlink for unused ports

2018-05-17 Thread Florian Fainelli
Even if commit 1d27732f411d ("net: dsa: setup and teardown ports") indicated
that registering a devlink instance for unused ports is not a problem, and this
is true, this can be confusing nonetheless, so let's not do it.

Fixes: 1d27732f411d ("net: dsa: setup and teardown ports")
Reported-by: Jiri Pirko 
Signed-off-by: Florian Fainelli 
---
 net/dsa/dsa2.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index adf50fbc4c13..cc45a8ca45fb 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -262,13 +262,14 @@ static int dsa_port_setup(struct dsa_port *dp)
 
   memset(&dp->devlink_port, 0, sizeof(dp->devlink_port));
 
+   if (dp->type == DSA_PORT_TYPE_UNUSED)
+   return 0;
+
err = devlink_port_register(ds->devlink, &dp->devlink_port, dp->index);
if (err)
return err;
 
switch (dp->type) {
-   case DSA_PORT_TYPE_UNUSED:
-   break;
case DSA_PORT_TYPE_CPU:
case DSA_PORT_TYPE_DSA:
err = dsa_port_link_register_of(dp);
@@ -293,11 +294,12 @@ static int dsa_port_setup(struct dsa_port *dp)
 
 static void dsa_port_teardown(struct dsa_port *dp)
 {
+   if (dp->type == DSA_PORT_TYPE_UNUSED)
+   return;
+
devlink_port_unregister(&dp->devlink_port);
 
switch (dp->type) {
-   case DSA_PORT_TYPE_UNUSED:
-   break;
case DSA_PORT_TYPE_CPU:
case DSA_PORT_TYPE_DSA:
dsa_port_link_unregister_of(dp);
-- 
2.14.1



Re: [PATCH v3 net-next 3/6] tcp: add SACK compression

2018-05-17 Thread Yuchung Cheng
On Thu, May 17, 2018 at 2:57 PM, Neal Cardwell  wrote:
> On Thu, May 17, 2018 at 5:47 PM Eric Dumazet  wrote:
>
>> When TCP receives an out-of-order packet, it immediately sends
>> a SACK packet, generating network load but also forcing the
>> receiver to send 1-MSS pathological packets, increasing its
>> RTX queue length/depth, and thus processing time.
>
>> Wifi networks suffer from this aggressive behavior, but generally
>> speaking, all these SACK packets add fuel to the fire when networks
>> are under congestion.
>
>> This patch adds a high resolution timer and tp->compressed_ack counter.
>
>> Instead of sending a SACK, we program this timer with a small delay,
>> based on RTT and capped to 1 ms :
>
>>  delay = min ( 5 % of RTT, 1 ms)
>
>> If subsequent SACKs need to be sent while the timer has not yet
>> expired, we simply increment tp->compressed_ack.
>
>> When timer expires, a SACK is sent with the latest information.
>> Whenever an ACK is sent (if data is sent, or if in-order
>> data is received) timer is canceled.
>
>> Note that tcp_sack_new_ofo_skb() is able to force a SACK to be sent
>> if the sack blocks need to be shuffled, even if the timer has not
>> expired.
>
>> A new SNMP counter is added in the following patch.
>
>> Two other patches add sysctls to allow changing the 1,000,000 and 44
>> values that this commit hard-coded.
>
>> Signed-off-by: Eric Dumazet 
>> ---
>
> Very nice. I like the constants and the min(rcv_rtt, srtt).
>
> Acked-by: Neal Cardwell 
Acked-by: Yuchung Cheng 

Great work. Hopefully this will save middle-boxes from handling
TCP ACKs themselves.

>
> Thanks!
>
> neal


Re: [patch net-next RFC 04/12] dsa: set devlink port attrs for dsa ports

2018-05-17 Thread Florian Fainelli
On 05/17/2018 02:08 PM, Andrew Lunn wrote:
> On Thu, May 17, 2018 at 10:48:55PM +0200, Jiri Pirko wrote:
>> Thu, May 17, 2018 at 09:14:32PM CEST, f.faine...@gmail.com wrote:
>>> On 05/17/2018 10:39 AM, Jiri Pirko wrote:
>> That is compiled inside "fixed_phy", isn't it?
>
> It matches what CONFIG_FIXED_PHY is, so if it's built-in it also becomes
> built-in, if it is modular, it is also modular, this was fixed with
> 40013ff20b1beed31184935fc0aea6a859d4d4ef ("net: dsa: Fix functional
> dsa-loop dependency on FIXED_PHY")

 Now I have it compiled as module, and after modprobe dsa_loop I see:
 [ 1168.129202] libphy: Fixed MDIO Bus: probed
 [ 1168.222716] dsa-loop fixed-0:1f: DSA mockup driver: 0x1f

 These messages I did not see when I had fixed_phy compiled as built-in.

 But I still see no netdevs :/
>>>
>>> The platform data assumes there is a network device named "eth0" as the
>>
>> Oops, I missed that. I created a dummy device and modprobed again. Now I see:
>>
>> $ sudo devlink port
>> mdio_bus/fixed-0:1f/0: type eth netdev lan1
>> mdio_bus/fixed-0:1f/1: type eth netdev lan2
>> mdio_bus/fixed-0:1f/2: type eth netdev lan3
>> mdio_bus/fixed-0:1f/3: type eth netdev lan4
>> mdio_bus/fixed-0:1f/4: type notset
>> mdio_bus/fixed-0:1f/5: type notset
>> mdio_bus/fixed-0:1f/6: type notset
>> mdio_bus/fixed-0:1f/7: type notset
>> mdio_bus/fixed-0:1f/8: type notset
>> mdio_bus/fixed-0:1f/9: type notset
>> mdio_bus/fixed-0:1f/10: type notset
>> mdio_bus/fixed-0:1f/11: type notset
>>
>> I wonder why there are ports 4-11
> 
> Hi Jiri
> 
> ds = dsa_switch_alloc(&mdiodev->dev, DSA_MAX_PORTS);
> 
> It is allocating a switch with 12 ports. However only 4 of them have
> names. So the core only creates slave devices for those 4.
> 
> This is a useful test. Real hardware often has unused ports. A WiFi AP
> with a 7 port switch which only uses 6 ports is often seen.

The following patch should fix this:


diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index adf50fbc4c13..a06c29ec91f0 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -262,13 +262,14 @@ static int dsa_port_setup(struct dsa_port *dp)

memset(&dp->devlink_port, 0, sizeof(dp->devlink_port));

+   if (dp->type == DSA_PORT_TYPE_UNUSED)
+   return 0;
+
err = devlink_port_register(ds->devlink, &dp->devlink_port,
dp->index);
if (err)
return err;

switch (dp->type) {
-   case DSA_PORT_TYPE_UNUSED:
-   break;
case DSA_PORT_TYPE_CPU:
case DSA_PORT_TYPE_DSA:
err = dsa_port_link_register_of(dp);
@@ -286,6 +287,8 @@ static int dsa_port_setup(struct dsa_port *dp)
else
devlink_port_type_eth_set(&dp->devlink_port,
dp->slave);
break;
+   default:
+   break;
}

return 0;
@@ -293,11 +296,12 @@ static int dsa_port_setup(struct dsa_port *dp)

 static void dsa_port_teardown(struct dsa_port *dp)
 {
+   if (dp->type == DSA_PORT_TYPE_UNUSED)
+   return;
+
devlink_port_unregister(&dp->devlink_port);

switch (dp->type) {
-   case DSA_PORT_TYPE_UNUSED:
-   break;
case DSA_PORT_TYPE_CPU:
case DSA_PORT_TYPE_DSA:
dsa_port_link_unregister_of(dp);
@@ -308,6 +312,8 @@ static void dsa_port_teardown(struct dsa_port *dp)
dp->slave = NULL;
}
break;
+   default:
+   break;
}
 }


-- 
Florian


Re: [PATCH v3 net-next 6/6] tcp: add tcp_comp_sack_nr sysctl

2018-05-17 Thread Neal Cardwell
On Thu, May 17, 2018 at 5:47 PM Eric Dumazet  wrote:

> This per netns sysctl allows for TCP SACK compression fine-tuning.

> This limits number of SACK that can be compressed.
> Using 0 disables SACK compression.

> Signed-off-by: Eric Dumazet 
> ---

Acked-by: Neal Cardwell 

Thanks!

neal


Re: [PATCH v3 net-next 5/6] tcp: add tcp_comp_sack_delay_ns sysctl

2018-05-17 Thread Neal Cardwell
On Thu, May 17, 2018 at 5:47 PM Eric Dumazet  wrote:

> This per netns sysctl allows for TCP SACK compression fine-tuning.

> Its default value is 1,000,000, or 1 ms to meet TSO autosizing period.

> Signed-off-by: Eric Dumazet 
> ---

Acked-by: Neal Cardwell 

Thanks!

neal


[net PATCH] net: Fix a bug in removing queues from XPS map

2018-05-17 Thread Amritha Nambiar
While removing queues from the XPS map, the individual CPU ID
alone was used to index the CPUs map. The index should also
factor in the traffic class mapping for the CPU-to-queue lookup.

Fixes: 184c449f91fe ("net: Add support for XPS with QoS via traffic classes")
Signed-off-by: Amritha Nambiar 
Acked-by: Alexander Duyck 
---
 net/core/dev.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 9f43901..9397577 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2125,7 +2125,7 @@ static bool remove_xps_queue_cpu(struct net_device *dev,
int i, j;
 
for (i = count, j = offset; i--; j++) {
-   if (!remove_xps_queue(dev_maps, cpu, j))
+   if (!remove_xps_queue(dev_maps, tci, j))
break;
}
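
As a rough illustration of why the bare CPU ID is the wrong index: assuming the map is laid out as a flat array with one slot per (CPU, traffic class) pair, the slot index would be computed along these lines. xps_tci() is a hypothetical helper, and the flat layout is an assumption of this sketch rather than something the patch text states.

```c
#include <assert.h>

/* Index of the (cpu, tc) slot in a flat per-CPU, per-class map.
 * With num_tc == 1 this degenerates to the CPU ID itself. */
int xps_tci(int cpu, int num_tc, int tc)
{
    assert(cpu >= 0 && num_tc > 0 && tc >= 0 && tc < num_tc);
    return cpu * num_tc + tc;
}
```

With four traffic classes, CPU 2 and class 1 land on slot 9, whereas indexing by the CPU ID alone would touch slot 2, which belongs to CPU 0.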
 



Re: [PATCH v3 net-next 3/6] tcp: add SACK compression

2018-05-17 Thread Neal Cardwell
On Thu, May 17, 2018 at 5:47 PM Eric Dumazet  wrote:

> When TCP receives an out-of-order packet, it immediately sends
> a SACK packet, generating network load but also forcing the
> receiver to send 1-MSS pathological packets, increasing its
> RTX queue length/depth, and thus processing time.

> Wifi networks suffer from this aggressive behavior, but generally
> speaking, all these SACK packets add fuel to the fire when networks
> are under congestion.

> This patch adds a high resolution timer and tp->compressed_ack counter.

> Instead of sending a SACK, we program this timer with a small delay,
> based on RTT and capped to 1 ms :

>  delay = min ( 5 % of RTT, 1 ms)

> If subsequent SACKs need to be sent while the timer has not yet
> expired, we simply increment tp->compressed_ack.

> When timer expires, a SACK is sent with the latest information.
> Whenever an ACK is sent (if data is sent, or if in-order
> data is received) timer is canceled.

> Note that tcp_sack_new_ofo_skb() is able to force a SACK to be sent
> if the sack blocks need to be shuffled, even if the timer has not
> expired.

> A new SNMP counter is added in the following patch.

> Two other patches add sysctls to allow changing the 1,000,000 and 44
> values that this commit hard-coded.

> Signed-off-by: Eric Dumazet 
> ---

Very nice. I like the constants and the min(rcv_rtt, srtt).

Acked-by: Neal Cardwell 

Thanks!

neal
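
For readers following the thread, the delay rule quoted above, min(5% of RTT, 1 ms) with the RTT taken as the smaller of the receiver estimate and the smoothed RTT, can be written out as a small standalone function. This is only a sketch: the inputs are plain microseconds (the kernel's internal scaling is deliberately ignored), and sack_comp_delay_ns() is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_USEC 1000ULL
#define NSEC_PER_MSEC 1000000ULL

/*
 * rcv_rtt_us and srtt_us are plain microseconds here (an assumption
 * of this sketch). srtt_us == 0 means "no sample yet".
 * Returns the compression delay in nanoseconds.
 */
uint64_t sack_comp_delay_ns(uint64_t rcv_rtt_us, uint64_t srtt_us)
{
    uint64_t rtt_us = rcv_rtt_us;
    uint64_t delay_ns;

    /* Guard against overflow in the multiplication below. */
    assert(rcv_rtt_us <= UINT64_MAX / NSEC_PER_USEC);

    /* Use the smaller of the two RTT estimates when srtt is known. */
    if (srtt_us && srtt_us < rtt_us)
        rtt_us = srtt_us;

    delay_ns = rtt_us * NSEC_PER_USEC / 20;   /* 5% of RTT */
    if (delay_ns > NSEC_PER_MSEC)             /* capped at 1 ms */
        delay_ns = NSEC_PER_MSEC;
    return delay_ns;
}
```

A 10 ms RTT gives a 0.5 ms delay, while a 100 ms RTT is capped at 1 ms.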


Re: [RFC PATCH ghak32 V2 01/13] audit: add container id

2018-05-17 Thread Richard Guy Briggs
On 2018-05-17 17:00, Steve Grubb wrote:
> On Fri, 16 Mar 2018 05:00:28 -0400
> Richard Guy Briggs  wrote:
> 
> > Implement the proc fs write to set the audit container ID of a
> > process, emitting an AUDIT_CONTAINER record to document the event.
> > 
> > This is a write from the container orchestrator task to a proc entry
> > of the form /proc/PID/containerid where PID is the process ID of the
> > newly created task that is to become the first task in a container,
> > or an additional task added to a container.
> > 
> > The write expects up to a u64 value (unset: 18446744073709551615).
> > 
> > This will produce a record such as this:
> > type=CONTAINER msg=audit(1519903238.968:261): op=set pid=596 uid=0
> > subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 auid=0
> > tty=pts0 ses=1 opid=596 old-contid=18446744073709551615 contid=123455
> > res=0
> 
> There was one thing I was wondering about. Currently when we set the
> loginuid, the record is AUDIT_LOGINUID. The corollary is that when we
> set the container id, the event should be AUDIT_CONTAINERID or
> AUDIT_CONTAINER_ID.

The record type is actually AUDIT_LOGIN.  The field type is
AUDIT_LOGINUID.  Given that correction, I think we're fine and could
potentially violently agree.  The existing naming is consistent.

> During syscall events, the path info is returned in a record simply
> called AUDIT_PATH, cwd info is returned in AUDIT_CWD. So, rather than
> calling the record that gets attached to everything
> AUDIT_CONTAINER_INFO, how about simply AUDIT_CONTAINER.

Considering the container initiation record is different than the record
to document the container involved in an otherwise normal syscall, we
need two names.  I don't have a strong opinion what they are.

I'd prefer AUDIT_CONTAINER and AUDIT_CONTAINER_INFO so that the two are
different enough to be visually distinct while leaving
AUDIT_CONTAINERID for the field type in patch 4 ("audit: add containerid
filtering")

> > The "op" field indicates an initial set.  The "pid" to "ses" fields
> > are the orchestrator while the "opid" field is the object's PID, the
> > process being "contained".  Old and new container ID values are given
> > in the "contid" fields, while res indicates its success.
> > 
> > It is not permitted to self-set, unset or re-set the container ID.  A
> > child inherits its parent's container ID, but then can be set only
> > once after.
> > 
> > See: https://github.com/linux-audit/audit-kernel/issues/32
> > 
> > Signed-off-by: Richard Guy Briggs 
> > ---
> >  fs/proc/base.c | 37 
> >  include/linux/audit.h  | 16 +
> >  include/linux/init_task.h  |  4 ++-
> >  include/linux/sched.h  |  1 +
> >  include/uapi/linux/audit.h |  2 ++
> >  kernel/auditsc.c   | 84 ++
> >  6 files changed, 143 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > index 60316b5..6ce4fbe 100644
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -1299,6 +1299,41 @@ static ssize_t proc_sessionid_read(struct file * file, char __user * buf,
> > .read   = proc_sessionid_read,
> > .llseek = generic_file_llseek,
> >  };
> > +
> > +static ssize_t proc_containerid_write(struct file *file, const char
> > __user *buf,
> > +  size_t count, loff_t *ppos)
> > +{
> > +   struct inode *inode = file_inode(file);
> > +   u64 containerid;
> > +   int rv;
> > +   struct task_struct *task = get_proc_task(inode);
> > +
> > +   if (!task)
> > +   return -ESRCH;
> > +   if (*ppos != 0) {
> > +   /* No partial writes. */
> > +   put_task_struct(task);
> > +   return -EINVAL;
> > +   }
> > +
> > +   rv = kstrtou64_from_user(buf, count, 10, &containerid);
> > +   if (rv < 0) {
> > +   put_task_struct(task);
> > +   return rv;
> > +   }
> > +
> > +   rv = audit_set_containerid(task, containerid);
> > +   put_task_struct(task);
> > +   if (rv < 0)
> > +   return rv;
> > +   return count;
> > +}
> > +
> > +static const struct file_operations proc_containerid_operations = {
> > +   .write  = proc_containerid_write,
> > +   .llseek = generic_file_llseek,
> > +};
> > +
> >  #endif
> >  
> >  #ifdef CONFIG_FAULT_INJECTION
> > @@ -2961,6 +2996,7 @@ static int proc_pid_patch_state(struct seq_file *m, struct pid_namespace *ns,
> >  #ifdef CONFIG_AUDITSYSCALL
> > REG("loginuid",   S_IWUSR|S_IRUGO, proc_loginuid_operations),
> > REG("sessionid",  S_IRUGO, proc_sessionid_operations),
> > +   REG("containerid", S_IWUSR, proc_containerid_operations),
> >  #endif
> >  #ifdef CONFIG_FAULT_INJECTION
> > REG("make-it-fail", S_IRUGO|S_IWUSR, proc_fault_inject_operations),
> > @@ -3355,6 +3391,7 @@ static int proc_tid_comm_permission(struct inode *inode, int mask)
> >  #ifdef CONFIG_AUDITSYSCALL
> > REG("loginuid",  S_IWUSR|S_IRUGO,
> > 

Re: [PATCH net 0/7] net: ip6_gre: Fixes in headroom handling

2018-05-17 Thread Petr Machata
David Miller  writes:

> Luckily for you, your Fixes: tags went out before I pushed, so I could
> actually fix up the commit messages and add the tags.

I was hoping that would be the case.

Thanks,
Petr


[PATCH v3 net-next 2/6] tcp: do not force quickack when receiving out-of-order packets

2018-05-17 Thread Eric Dumazet
As explained in commit 9f9843a751d0 ("tcp: properly handle stretch
acks in slow start"), TCP stacks have to consider how many packets
are acknowledged in one single ACK, because of GRO, but also
because of ACK compression or losses.

We plan to add SACK compression in the following patch; we
must therefore not call tcp_enter_quickack_mode().

Signed-off-by: Eric Dumazet 
Acked-by: Neal Cardwell 
Acked-by: Soheil Hassas Yeganeh 
---
 net/ipv4/tcp_input.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 0bf032839548f8dccb7f24a6fb5a7d47ea29208b..f5622b250665178e44460fa2cd4a11af23dfb23d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4715,8 +4715,6 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
if (!before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt + tcp_receive_window(tp)))
goto out_of_window;
 
-   tcp_enter_quickack_mode(sk);
-
if (before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt)) {
/* Partial packet, seq < rcv_next < end_seq */
SOCK_DEBUG(sk, "partial packet: rcv_next %X seq %X - %X\n",
-- 
2.17.0.441.gb46fe60e1d-goog



[PATCH v3 net-next 1/6] tcp: use __sock_put() instead of sock_put() in tcp_clear_xmit_timers()

2018-05-17 Thread Eric Dumazet
Socket can not disappear under us.

Signed-off-by: Eric Dumazet 
Acked-by: Neal Cardwell 
---
 include/net/tcp.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6deb540297ccaa1f05ce633efe313d1ca2c15dd9..511bd0fde1dc1dd842598d083905b0425bcb05f8 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -559,7 +559,7 @@ void tcp_init_xmit_timers(struct sock *);
 static inline void tcp_clear_xmit_timers(struct sock *sk)
 {
if (hrtimer_try_to_cancel(&tcp_sk(sk)->pacing_timer) == 1)
-   sock_put(sk);
+   __sock_put(sk);
 
inet_csk_clear_xmit_timers(sk);
 }
-- 
2.17.0.441.gb46fe60e1d-goog



Re: [PATCH v3 1/2] media: rc: introduce BPF_PROG_RAWIR_EVENT

2018-05-17 Thread Sean Young
Hi,

Again, thanks for a thoughtful review. This will definitely improve
the code.

On Thu, May 17, 2018 at 10:02:52AM -0700, Y Song wrote:
> On Wed, May 16, 2018 at 2:04 PM, Sean Young  wrote:
> > Add support for BPF_PROG_RAWIR_EVENT. This type of BPF program can call
> > rc_keydown() to report decoded IR scancodes, or rc_repeat() to report
> > that the last key should be repeated.
> >
> > The bpf program can be attached using the bpf(BPF_PROG_ATTACH) syscall;
> > the target_fd must be the /dev/lircN device.
> >
> > Signed-off-by: Sean Young 
> > ---
> >  drivers/media/rc/Kconfig   |  13 ++
> >  drivers/media/rc/Makefile  |   1 +
> >  drivers/media/rc/bpf-rawir-event.c | 363 +
> >  drivers/media/rc/lirc_dev.c|  24 ++
> >  drivers/media/rc/rc-core-priv.h|  24 ++
> >  drivers/media/rc/rc-ir-raw.c   |  14 +-
> >  include/linux/bpf_rcdev.h  |  30 +++
> >  include/linux/bpf_types.h  |   3 +
> >  include/uapi/linux/bpf.h   |  55 -
> >  kernel/bpf/syscall.c   |   7 +
> >  10 files changed, 531 insertions(+), 3 deletions(-)
> >  create mode 100644 drivers/media/rc/bpf-rawir-event.c
> >  create mode 100644 include/linux/bpf_rcdev.h
> >
> > diff --git a/drivers/media/rc/Kconfig b/drivers/media/rc/Kconfig
> > index eb2c3b6eca7f..2172d65b0213 100644
> > --- a/drivers/media/rc/Kconfig
> > +++ b/drivers/media/rc/Kconfig
> > @@ -25,6 +25,19 @@ config LIRC
> >passes raw IR to and from userspace, which is needed for
> >IR transmitting (aka "blasting") and for the lirc daemon.
> >
> > +config BPF_RAWIR_EVENT
> > +   bool "Support for eBPF programs attached to lirc devices"
> > +   depends on BPF_SYSCALL
> > +   depends on RC_CORE=y
> > +   depends on LIRC
> > +   help
> > +  Allow attaching eBPF programs to a lirc device using the bpf(2)
> > +  syscall command BPF_PROG_ATTACH. This is supported for raw IR
> > +  receivers.
> > +
> > +  These eBPF programs can be used to decode IR into scancodes, for
> > +  IR protocols not supported by the kernel decoders.
> > +
> >  menuconfig RC_DECODERS
> > bool "Remote controller decoders"
> > depends on RC_CORE
> > diff --git a/drivers/media/rc/Makefile b/drivers/media/rc/Makefile
> > index 2e1c87066f6c..74907823bef8 100644
> > --- a/drivers/media/rc/Makefile
> > +++ b/drivers/media/rc/Makefile
> > @@ -5,6 +5,7 @@ obj-y += keymaps/
> >  obj-$(CONFIG_RC_CORE) += rc-core.o
> >  rc-core-y := rc-main.o rc-ir-raw.o
> >  rc-core-$(CONFIG_LIRC) += lirc_dev.o
> > +rc-core-$(CONFIG_BPF_RAWIR_EVENT) += bpf-rawir-event.o
> >  obj-$(CONFIG_IR_NEC_DECODER) += ir-nec-decoder.o
> >  obj-$(CONFIG_IR_RC5_DECODER) += ir-rc5-decoder.o
> >  obj-$(CONFIG_IR_RC6_DECODER) += ir-rc6-decoder.o
> > diff --git a/drivers/media/rc/bpf-rawir-event.c b/drivers/media/rc/bpf-rawir-event.c
> > new file mode 100644
> > index ..7cb48b8d87b5
> > --- /dev/null
> > +++ b/drivers/media/rc/bpf-rawir-event.c
> > @@ -0,0 +1,363 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +// bpf-rawir-event.c - handles bpf
> > +//
> > +// Copyright (C) 2018 Sean Young 
> > +
> > +#include 
> > +#include 
> > +#include 
> > +#include "rc-core-priv.h"
> > +
> > +/*
> > + * BPF interface for raw IR
> > + */
> > +const struct bpf_prog_ops rawir_event_prog_ops = {
> > +};
> > +
> > +BPF_CALL_1(bpf_rc_repeat, struct bpf_rawir_event*, event)
> > +{
> > +   struct ir_raw_event_ctrl *ctrl;
> > +
> > +   ctrl = container_of(event, struct ir_raw_event_ctrl, bpf_rawir_event);
> > +
> > +   rc_repeat(ctrl->dev);
> > +
> > +   return 0;
> > +}
> > +
> > +static const struct bpf_func_proto rc_repeat_proto = {
> > +   .func  = bpf_rc_repeat,
> > +   .gpl_only  = true, /* rc_repeat is EXPORT_SYMBOL_GPL */
> > +   .ret_type  = RET_INTEGER,
> > +   .arg1_type = ARG_PTR_TO_CTX,
> > +};
> > +
> > +BPF_CALL_4(bpf_rc_keydown, struct bpf_rawir_event*, event, u32, protocol,
> > +  u32, scancode, u32, toggle)
> > +{
> > +   struct ir_raw_event_ctrl *ctrl;
> > +
> > +   ctrl = container_of(event, struct ir_raw_event_ctrl, bpf_rawir_event);
> > +
> > +   rc_keydown(ctrl->dev, protocol, scancode, toggle != 0);
> > +
> > +   return 0;
> > +}
> > +
> > +static const struct bpf_func_proto rc_keydown_proto = {
> > +   .func  = bpf_rc_keydown,
> > +   .gpl_only  = true, /* rc_keydown is EXPORT_SYMBOL_GPL */
> > +   .ret_type  = RET_INTEGER,
> > +   .arg1_type = ARG_PTR_TO_CTX,
> > +   .arg2_type = ARG_ANYTHING,
> > +   .arg3_type = ARG_ANYTHING,
> > +   .arg4_type = ARG_ANYTHING,
> > +};
> > +
> > +static const struct bpf_func_proto *
> > +rawir_event_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> > +{
> > +   switch (func_id) {
> > +   case BPF_FUNC_rc_repeat:
> > +   
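
Both helpers in the quoted patch recover the enclosing control structure from the context pointer with container_of(). For readers unfamiliar with the idiom, here is a minimal userspace sketch. The macro is a simplified version without the kernel's type checking, and struct rawir_event, struct raw_ctrl and dev_id_from_event() are hypothetical stand-ins for the types in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified container_of: step back from a member to its container. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical context handed to the program. */
struct rawir_event {
    unsigned int duration;
};

/* Hypothetical wrapper that embeds the context next to private state. */
struct raw_ctrl {
    int dev_id;
    struct rawir_event event;
};

/* Given only the embedded event, find the device it belongs to. */
int dev_id_from_event(struct rawir_event *ev)
{
    struct raw_ctrl *ctrl;

    assert(ev != NULL);
    ctrl = container_of(ev, struct raw_ctrl, event);
    return ctrl->dev_id;
}
```

This only works when the pointer really is the embedded member, which is why the helpers declare their first argument as a pointer to the context.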

[PATCH v3 net-next 3/6] tcp: add SACK compression

2018-05-17 Thread Eric Dumazet
When TCP receives an out-of-order packet, it immediately sends
a SACK packet, generating network load but also forcing the
receiver to send 1-MSS pathological packets, increasing its
RTX queue length/depth, and thus processing time.

Wifi networks suffer from this aggressive behavior, but generally
speaking, all these SACK packets add fuel to the fire when networks
are under congestion.

This patch adds a high resolution timer and tp->compressed_ack counter.

Instead of sending a SACK, we program this timer with a small delay,
based on RTT and capped to 1 ms :

delay = min ( 5 % of RTT, 1 ms)

If subsequent SACKs need to be sent while the timer has not yet
expired, we simply increment tp->compressed_ack.

When timer expires, a SACK is sent with the latest information.
Whenever an ACK is sent (if data is sent, or if in-order
data is received) timer is canceled.

Note that tcp_sack_new_ofo_skb() is able to force a SACK to be sent
if the sack blocks need to be shuffled, even if the timer has not
expired.

A new SNMP counter is added in the following patch.

Two other patches add sysctls to allow changing the 1,000,000 and 44
values that this commit hard-coded.

Signed-off-by: Eric Dumazet 
---
 include/linux/tcp.h   |  2 ++
 include/net/tcp.h |  3 +++
 net/ipv4/tcp.c|  1 +
 net/ipv4/tcp_input.c  | 35 +--
 net/ipv4/tcp_output.c |  7 +++
 net/ipv4/tcp_timer.c  | 25 +
 6 files changed, 67 insertions(+), 6 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 807776928cb8610fe97121fbc3c600b08d5d2991..72705eaf4b84060a45bf04d5170f389a18010eac 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -218,6 +218,7 @@ struct tcp_sock {
   reord:1;  /* reordering detected */
} rack;
u16 advmss; /* Advertised MSS   */
+   u8  compressed_ack;
u32 chrono_start;   /* Start time in jiffies of a TCP chrono */
u32 chrono_stat[3]; /* Time in jiffies for chrono_stat stats */
u8  chrono_type:2,  /* current chronograph type */
@@ -297,6 +298,7 @@ struct tcp_sock {
u32 sacked_out; /* SACK'd packets   */
 
struct hrtimer  pacing_timer;
+   struct hrtimer  compressed_ack_timer;
 
/* from STCP, retrans queue hinting */
struct sk_buff* lost_skb_hint;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 511bd0fde1dc1dd842598d083905b0425bcb05f8..952d842a604a3ed79e1bf87a712db20a461c35a9 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -561,6 +561,9 @@ static inline void tcp_clear_xmit_timers(struct sock *sk)
if (hrtimer_try_to_cancel(&tcp_sk(sk)->pacing_timer) == 1)
__sock_put(sk);
 
+   if (hrtimer_try_to_cancel(&tcp_sk(sk)->compressed_ack_timer) == 1)
+   __sock_put(sk);
+
inet_csk_clear_xmit_timers(sk);
 }
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 62b776f9003798eaf06992a4eb0914d17646aa61..0a2ea0bbf867271db05aedd7d48b193677664321 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2595,6 +2595,7 @@ int tcp_disconnect(struct sock *sk, int flags)
dst_release(sk->sk_rx_dst);
sk->sk_rx_dst = NULL;
tcp_saved_syn_free(tp);
+   tp->compressed_ack = 0;
 
/* Clean up fastopen related fields */
tcp_free_fastopen_req(tp);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index f5622b250665178e44460fa2cd4a11af23dfb23d..cc2ac5346b92b968593f919192d543384865bcb8 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4249,6 +4249,8 @@ static void tcp_sack_new_ofo_skb(struct sock *sk, u32 seq, u32 end_seq)
 * If the sack array is full, forget about the last one.
 */
if (this_sack >= TCP_NUM_SACKS) {
+   if (tp->compressed_ack)
+   tcp_send_ack(sk);
this_sack--;
tp->rx_opt.num_sacks--;
sp--;
@@ -5081,6 +5083,7 @@ static inline void tcp_data_snd_check(struct sock *sk)
 static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
 {
struct tcp_sock *tp = tcp_sk(sk);
+   unsigned long rtt, delay;
 
/* More than one full frame received... */
if (((tp->rcv_nxt - tp->rcv_wup) > inet_csk(sk)->icsk_ack.rcv_mss &&
@@ -5092,15 +5095,35 @@ static void __tcp_ack_snd_check(struct sock *sk, int 
ofo_possible)
(tp->rcv_nxt - tp->copied_seq < sk->sk_rcvlowat ||
 __tcp_select_window(sk) >= tp->rcv_wnd)) ||
/* We ACK each frame or... */
-   tcp_in_quickack_mode(sk) ||
-   /* We have out of order data. */
-   (ofo_possible && !RB_EMPTY_ROOT(&tp->out_of_order_queue))) {
-   /* Then ack it now */
+   tcp_in_quickack_mode(sk)) {
+send_now:
tcp_send_ack(sk);
-   } else {
-   /* Else, send delayed 

[PATCH v3 net-next 5/6] tcp: add tcp_comp_sack_delay_ns sysctl

2018-05-17 Thread Eric Dumazet
This per netns sysctl allows for TCP SACK compression fine-tuning.

Its default value is 1,000,000, or 1 ms to meet TSO autosizing period.

Signed-off-by: Eric Dumazet 
---
 Documentation/networking/ip-sysctl.txt | 7 +++
 include/net/netns/ipv4.h   | 1 +
 net/ipv4/sysctl_net_ipv4.c | 7 +++
 net/ipv4/tcp_input.c   | 4 ++--
 net/ipv4/tcp_ipv4.c| 1 +
 5 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt 
b/Documentation/networking/ip-sysctl.txt
index 
ea304a23c8d72c92a925d0c107bfd2bcfbbb92ec..7ba952959bca0eee4ecf81fb5837e17790db0fde
 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -525,6 +525,13 @@ tcp_rmem - vector of 3 INTEGERs: min, default, max
 tcp_sack - BOOLEAN
Enable select acknowledgments (SACKS).
 
+tcp_comp_sack_delay_ns - LONG INTEGER
+   TCP tries to reduce number of SACK sent, using a timer
+   based on 5% of SRTT, capped by this sysctl, in nano seconds.
+   The default is 1ms, based on TSO autosizing period.
+
+   Default : 1,000,000 ns (1 ms)
+
 tcp_slow_start_after_idle - BOOLEAN
If set, provide RFC2861 behavior and time out the congestion
window after an idle period.  An idle period is defined at
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 
8491bc9c86b1553ab603e4363e8e38ca7ff547e0..927318243cfaa2ddd8eb423c6ba6e66253f771d3
 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -160,6 +160,7 @@ struct netns_ipv4 {
int sysctl_tcp_pacing_ca_ratio;
int sysctl_tcp_wmem[3];
int sysctl_tcp_rmem[3];
+   unsigned long sysctl_tcp_comp_sack_delay_ns;
struct inet_timewait_death_row tcp_death_row;
int sysctl_max_syn_backlog;
int sysctl_tcp_fastopen;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 
4b195bac8ac0eefe0a224528ad854338c4f8e6e3..11fbfdc1566eca95f91360522178295318277588
 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -1151,6 +1151,13 @@ static struct ctl_table ipv4_net_table[] = {
.proc_handler   = proc_dointvec_minmax,
.extra1 = ,
},
+   {
+   .procname   = "tcp_comp_sack_delay_ns",
+   .data   = &init_net.ipv4.sysctl_tcp_comp_sack_delay_ns,
+   .maxlen = sizeof(unsigned long),
+   .mode   = 0644,
+   .proc_handler   = proc_doulongvec_minmax,
+   },
{
.procname   = "udp_rmem_min",
.data   = &init_net.ipv4.sysctl_udp_rmem_min,
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 
cc2ac5346b92b968593f919192d543384865bcb8..6a1dae38c9558c7bc9dd31e9f16c4e8ea8c78149
 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5113,13 +5113,13 @@ static void __tcp_ack_snd_check(struct sock *sk, int 
ofo_possible)
if (hrtimer_is_queued(&tp->compressed_ack_timer))
return;
 
-   /* compress ack timer : 5 % of rtt, but no more than 1 ms */
+   /* compress ack timer : 5 % of rtt, but no more than tcp_comp_sack_delay_ns */
 
rtt = tp->rcv_rtt_est.rtt_us;
if (tp->srtt_us && tp->srtt_us < rtt)
rtt = tp->srtt_us;
 
-   delay = min_t(unsigned long, NSEC_PER_MSEC,
+   delay = min_t(unsigned long, sock_net(sk)->ipv4.sysctl_tcp_comp_sack_delay_ns,
  rtt * (NSEC_PER_USEC >> 3)/20);
sock_hold(sk);
hrtimer_start(&tp->compressed_ack_timer, ns_to_ktime(delay),
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 
caf23de88f8a369c2038cecd34ce42c522487e90..a3f4647341db2eb5a63c3e9f1e8b93099aedadab
 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2572,6 +2572,7 @@ static int __net_init tcp_sk_init(struct net *net)
   init_net.ipv4.sysctl_tcp_wmem,
   sizeof(init_net.ipv4.sysctl_tcp_wmem));
}
+   net->ipv4.sysctl_tcp_comp_sack_delay_ns = NSEC_PER_MSEC;
net->ipv4.sysctl_tcp_fastopen = TFO_CLIENT_ENABLE;
spin_lock_init(&net->ipv4.tcp_fastopen_ctx_lock);
net->ipv4.sysctl_tcp_fastopen_blackhole_timeout = 60 * 60;
-- 
2.17.0.441.gb46fe60e1d-goog
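
For readers following the arithmetic in the tcp_input.c hunk above, here is a minimal user-space sketch of the delay computation. It assumes that both srtt_us and rcv_rtt_est.rtt_us hold microseconds shifted left by 3, which is what the `NSEC_PER_USEC >> 3` factor in the patch suggests. The helper name comp_sack_delay_ns() and its parameter names are illustrative only and are not part of the kernel.

```c
#define NSEC_PER_USEC 1000UL
#define NSEC_PER_MSEC 1000000UL

/*
 * Hypothetical helper (not kernel API): mirrors the computation in
 * __tcp_ack_snd_check(). Both RTT inputs are assumed to be stored as
 * microseconds << 3; cap_ns plays the role of tcp_comp_sack_delay_ns.
 */
static unsigned long comp_sack_delay_ns(unsigned long rcv_rtt_us8,
					unsigned long srtt_us8,
					unsigned long cap_ns)
{
	unsigned long rtt = rcv_rtt_us8;
	unsigned long delay;

	/* rtt = min(srtt, rcv_rtt), ignoring an unset (zero) srtt */
	if (srtt_us8 && srtt_us8 < rtt)
		rtt = srtt_us8;

	/* 5% of the RTT, converted from (usec << 3) to nanoseconds */
	delay = rtt * (NSEC_PER_USEC >> 3) / 20;

	/* never wait longer than the sysctl cap */
	return delay < cap_ns ? delay : cap_ns;
}
```

Under these assumptions a 10 ms RTT (stored as 80000) gives 80000 * 125 / 20 = 500000 ns, while a 100 ms RTT would give 5 ms and is therefore capped at the 1 ms default.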



[PATCH v3 net-next 6/6] tcp: add tcp_comp_sack_nr sysctl

2018-05-17 Thread Eric Dumazet
This per netns sysctl allows for TCP SACK compression fine-tuning.

This limits the number of SACKs that can be compressed.
Using 0 disables SACK compression.

Signed-off-by: Eric Dumazet 
---
 Documentation/networking/ip-sysctl.txt |  6 ++
 include/net/netns/ipv4.h   |  1 +
 net/ipv4/sysctl_net_ipv4.c | 10 ++
 net/ipv4/tcp_input.c   |  3 ++-
 net/ipv4/tcp_ipv4.c|  1 +
 5 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/Documentation/networking/ip-sysctl.txt 
b/Documentation/networking/ip-sysctl.txt
index 
7ba952959bca0eee4ecf81fb5837e17790db0fde..924bd51327b7a8dff3503d7afccdd54e1eb5c29b
 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -532,6 +532,12 @@ tcp_comp_sack_delay_ns - LONG INTEGER
 
Default : 1,000,000 ns (1 ms)
 
+tcp_comp_sack_nr - INTEGER
+   Max number of SACK that can be compressed.
+   Using 0 disables SACK compression.
+
+   Default : 44
+
 tcp_slow_start_after_idle - BOOLEAN
If set, provide RFC2861 behavior and time out the congestion
window after an idle period.  An idle period is defined at
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 
927318243cfaa2ddd8eb423c6ba6e66253f771d3..661348f23ea5a3a9320b2cafcd17e23960214771
 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -160,6 +160,7 @@ struct netns_ipv4 {
int sysctl_tcp_pacing_ca_ratio;
int sysctl_tcp_wmem[3];
int sysctl_tcp_rmem[3];
+   int sysctl_tcp_comp_sack_nr;
unsigned long sysctl_tcp_comp_sack_delay_ns;
struct inet_timewait_death_row tcp_death_row;
int sysctl_max_syn_backlog;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 
11fbfdc1566eca95f91360522178295318277588..d2eed3ddcb0a1ad9778d96d46c685f6c60b93d8d
 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -46,6 +46,7 @@ static int tcp_syn_retries_min = 1;
 static int tcp_syn_retries_max = MAX_TCP_SYNCNT;
 static int ip_ping_group_range_min[] = { 0, 0 };
 static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX };
+static int comp_sack_nr_max = 255;
 
 /* obsolete */
 static int sysctl_tcp_low_latency __read_mostly;
@@ -1158,6 +1159,15 @@ static struct ctl_table ipv4_net_table[] = {
.mode   = 0644,
.proc_handler   = proc_doulongvec_minmax,
},
+   {
+   .procname   = "tcp_comp_sack_nr",
+   .data   = &init_net.ipv4.sysctl_tcp_comp_sack_nr,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec_minmax,
+   .extra1 = ,
+   .extra2 = &comp_sack_nr_max,
+   },
{
.procname   = "udp_rmem_min",
.data   = &init_net.ipv4.sysctl_udp_rmem_min,
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 
6a1dae38c9558c7bc9dd31e9f16c4e8ea8c78149..aebb29ab2fdf2ceaa182cd11928f145a886149ff
 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5106,7 +5106,8 @@ static void __tcp_ack_snd_check(struct sock *sk, int 
ofo_possible)
return;
}
 
-   if (!tcp_is_sack(tp) || tp->compressed_ack >= 44)
+   if (!tcp_is_sack(tp) ||
+   tp->compressed_ack >= sock_net(sk)->ipv4.sysctl_tcp_comp_sack_nr)
goto send_now;
tp->compressed_ack++;
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 
a3f4647341db2eb5a63c3e9f1e8b93099aedadab..adbdb503db0c983ef4185f83b138aa51bafd15bf
 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2573,6 +2573,7 @@ static int __net_init tcp_sk_init(struct net *net)
   sizeof(init_net.ipv4.sysctl_tcp_wmem));
}
net->ipv4.sysctl_tcp_comp_sack_delay_ns = NSEC_PER_MSEC;
+   net->ipv4.sysctl_tcp_comp_sack_nr = 44;
net->ipv4.sysctl_tcp_fastopen = TFO_CLIENT_ENABLE;
spin_lock_init(&net->ipv4.tcp_fastopen_ctx_lock);
net->ipv4.sysctl_tcp_fastopen_blackhole_timeout = 60 * 60;
-- 
2.17.0.441.gb46fe60e1d-goog
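
The check changed in the tcp_input.c hunk above can be sketched in user space as follows. This is only an illustration of the "send now or compress" decision: the struct and function names are hypothetical, and the surrounding conditions of __tcp_ack_snd_check() (quickack mode, window updates, the hrtimer) are left out.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two tcp_sock facts the check needs. */
struct comp_sack_state {
	bool sack_enabled;		/* what tcp_is_sack(tp) would report */
	unsigned char compressed_ack;	/* u8 counter, as in struct tcp_sock */
};

/*
 * Returns true when the ACK must be sent immediately, false when it
 * may be compressed (in which case the counter is incremented).
 * comp_sack_nr plays the role of the tcp_comp_sack_nr sysctl.
 */
static bool comp_sack_send_now(struct comp_sack_state *st, int comp_sack_nr)
{
	if (!st->sack_enabled || st->compressed_ack >= comp_sack_nr)
		return true;

	st->compressed_ack++;
	return false;
}
```

With comp_sack_nr set to 0 the comparison `0 >= 0` is true on the first call, so nothing is ever compressed, which matches the "Using 0 disables SACK compression" wording in the documentation hunk.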



[PATCH v3 net-next 4/6] tcp: add TCPAckCompressed SNMP counter

2018-05-17 Thread Eric Dumazet
This counter tracks number of ACK packets that the host has not sent,
thanks to ACK compression.

Sample output :

$ nstat -n;sleep 1;nstat|egrep "IpInReceives|IpOutRequests|TcpInSegs|TcpOutSegs|TcpExtTCPAckCompressed"
IpInReceives                    123250             0.0
IpOutRequests                   3684               0.0
TcpInSegs                       123251             0.0
TcpOutSegs                      3684               0.0
TcpExtTCPAckCompressed          119252             0.0

Signed-off-by: Eric Dumazet 
Acked-by: Neal Cardwell 
---
 include/uapi/linux/snmp.h | 1 +
 net/ipv4/proc.c   | 1 +
 net/ipv4/tcp_output.c | 2 ++
 3 files changed, 4 insertions(+)

diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index 
d02e859301ff499dd72a1c0e1b56bed10a9397a6..750d89120335eb489f698191edb6c5110969fa8c
 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -278,6 +278,7 @@ enum
LINUX_MIB_TCPMTUPSUCCESS,   /* TCPMTUPSuccess */
LINUX_MIB_TCPDELIVERED, /* TCPDelivered */
LINUX_MIB_TCPDELIVEREDCE,   /* TCPDeliveredCE */
+   LINUX_MIB_TCPACKCOMPRESSED, /* TCPAckCompressed */
__LINUX_MIB_MAX
 };
 
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 
261b71d0ccc5c17c6032bf67eb8f842006766e64..6c1ff89a60fa0a3485dcc71fafc799e798d5dc11
 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -298,6 +298,7 @@ static const struct snmp_mib snmp4_net_list[] = {
SNMP_MIB_ITEM("TCPMTUPSuccess", LINUX_MIB_TCPMTUPSUCCESS),
SNMP_MIB_ITEM("TCPDelivered", LINUX_MIB_TCPDELIVERED),
SNMP_MIB_ITEM("TCPDeliveredCE", LINUX_MIB_TCPDELIVEREDCE),
+   SNMP_MIB_ITEM("TCPAckCompressed", LINUX_MIB_TCPACKCOMPRESSED),
SNMP_MIB_SENTINEL
 };
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 
7ee98aad82b758674ca7f3e90bd3fc165e8fcd45..437bb7ceba7fd388abac1c12f2920b02be77bad9
 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -165,6 +165,8 @@ static inline void tcp_event_ack_sent(struct sock *sk, 
unsigned int pkts)
struct tcp_sock *tp = tcp_sk(sk);
 
if (unlikely(tp->compressed_ack)) {
+   NET_ADD_STATS(sock_net(sk), LINUX_MIB_TCPACKCOMPRESSED,
+ tp->compressed_ack);
tp->compressed_ack = 0;
if (hrtimer_try_to_cancel(&tp->compressed_ack_timer) == 1)
__sock_put(sk);
-- 
2.17.0.441.gb46fe60e1d-goog
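
The accounting added in the tcp_output.c hunk above can be pictured with a small user-space sketch. All names here are hypothetical stand-ins: sketch_mib models the per-netns TCPAckCompressed counter and sketch_conn models the compressed_ack field of one socket. Timer cancellation is omitted.

```c
/* Hypothetical stand-in for the TCPAckCompressed SNMP counter. */
struct sketch_mib {
	unsigned long ack_compressed;
};

/* Hypothetical stand-in for the per-socket compressed_ack field. */
struct sketch_conn {
	unsigned char compressed_ack;
};

/*
 * Called when an ACK actually leaves the host: every ACK that was
 * withheld since the previous one is added to the counter, and the
 * per-socket count starts again from zero.
 */
static void sketch_ack_sent(struct sketch_conn *c, struct sketch_mib *mib)
{
	if (c->compressed_ack) {
		mib->ack_compressed += c->compressed_ack;
		c->compressed_ack = 0;
	}
}
```

In the sample output of the commit message, the 119252 compressed ACKs plus the 3684 segments sent add up to 122936, close to the 123251 segments received.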



[PATCH v3 net-next 0/6] tcp: implement SACK compression

2018-05-17 Thread Eric Dumazet
When TCP receives an out-of-order packet, it immediately sends
a SACK packet, generating network load but also forcing the
receiver to send 1-MSS pathological packets, increasing its
RTX queue length/depth, and thus processing time.

Wifi networks suffer from this aggressive behavior, but generally
speaking, all these SACK packets add fuel to the fire when networks
are under congestion.

This patch series adds SACK compression, but the infrastructure
could be leveraged to also compress ACK in the future.

v2: Addressed Neal feedback.
Added two sysctls to allow fine tuning, or even disabling the feature.

v3: take rtt = min(srtt, rcv_rtt) as Yuchung suggested, because rcv_rtt
can be over estimated for RPC (or sender limited)

Eric Dumazet (6):
  tcp: use __sock_put() instead of sock_put() in tcp_clear_xmit_timers()
  tcp: do not force quickack when receiving out-of-order packets
  tcp: add SACK compression
  tcp: add TCPAckCompressed SNMP counter
  tcp: add tcp_comp_sack_delay_ns sysctl
  tcp: add tcp_comp_sack_nr sysctl

 Documentation/networking/ip-sysctl.txt | 13 +
 include/linux/tcp.h|  2 ++
 include/net/netns/ipv4.h   |  2 ++
 include/net/tcp.h  |  5 +++-
 include/uapi/linux/snmp.h  |  1 +
 net/ipv4/proc.c|  1 +
 net/ipv4/sysctl_net_ipv4.c | 17 
 net/ipv4/tcp.c |  1 +
 net/ipv4/tcp_input.c   | 38 --
 net/ipv4/tcp_ipv4.c|  2 ++
 net/ipv4/tcp_output.c  |  9 ++
 net/ipv4/tcp_timer.c   | 25 +
 12 files changed, 107 insertions(+), 9 deletions(-)

-- 
2.17.0.441.gb46fe60e1d-goog



Re: [PATCH v3] mlx4_core: allocate ICM memory in page size chunks

2018-05-17 Thread Qing Huang



On 5/17/2018 2:14 PM, Eric Dumazet wrote:
> On 05/17/2018 01:53 PM, Qing Huang wrote:
> > When a system is under memory pressure (high usage with fragments),
> > the original 256KB ICM chunk allocations will likely trigger kernel
> > memory management to enter slow path doing memory compact/migration
> > ops in order to complete high order memory allocations.
> >
> > When that happens, user processes calling uverb APIs may get stuck
> > for more than 120s easily even though there are a lot of free pages
> > in smaller chunks available in the system.
> >
> > Syslog:
> > ...
> > Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
> > oracle_205573_e:205573 blocked for more than 120 seconds.
> > ...
>
> NACK on this patch.
>
> You have been asked repeatedly to use kvmalloc()
>
> This is not a minor suggestion.
>
> Take a look at
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d8c13f2271ec5178c52fbde072ec7b562651ed9d

Would you please take a look at how table->icm is being used in the mlx4
driver? It's a meta data used for individual pointer variable referencing,
not as data frag or in/out buffer. It has no need for contiguous phy.
memory.

Thanks.

> And you'll understand some people care about this.
>
> Strongly.
>
> Thanks.





Re: [RFC PATCH ghak32 V2 03/13] audit: log container info of syscalls

2018-05-17 Thread Richard Guy Briggs
On 2018-05-17 17:09, Steve Grubb wrote:
> On Fri, 16 Mar 2018 05:00:30 -0400
> Richard Guy Briggs  wrote:
> 
> > Create a new audit record AUDIT_CONTAINER_INFO to document the
> > container ID of a process if it is present.
> 
> As mentioned in a previous email, I think AUDIT_CONTAINER is more
> suitable for the container record. One more comment below...
> 
> > Called from audit_log_exit(), syscalls are covered.
> > 
> > A sample raw event:
> > type=SYSCALL msg=audit(1519924845.499:257): arch=c03e syscall=257
> > success=yes exit=3 a0=ff9c a1=56374e1cef30 a2=241 a3=1b6 items=2
> > ppid=606 pid=635 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0
> > sgid=0 fsgid=0 tty=pts0 ses=3 comm="bash" exe="/usr/bin/bash"
> > subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
> > key="tmpcontainerid" type=CWD msg=audit(1519924845.499:257):
> > cwd="/root" type=PATH msg=audit(1519924845.499:257): item=0
> > name="/tmp/" inode=13863 dev=00:27 mode=041777 ouid=0 ogid=0
> > rdev=00:00 obj=system_u:object_r:tmp_t:s0 nametype= PARENT
> > cap_fp= cap_fi= cap_fe=0 cap_fver=0
> > type=PATH msg=audit(1519924845.499:257): item=1
> > name="/tmp/tmpcontainerid" inode=17729 dev=00:27 mode=0100644 ouid=0
> > ogid=0 rdev=00:00 obj=unconfined_u:object_r:user_tmp_t:s0
> > nametype=CREATE cap_fp= cap_fi=
> > cap_fe=0 cap_fver=0 type=PROCTITLE msg=audit(1519924845.499:257):
> > proctitle=62617368002D6300736C65657020313B206563686F2074657374203E202F746D702F746D70636F6E7461696E65726964
> > type=CONTAINER_INFO msg=audit(1519924845.499:257): op=task
> > contid=123458
> > 
> > See: https://github.com/linux-audit/audit-kernel/issues/32
> > Signed-off-by: Richard Guy Briggs 
> > ---
> >  include/linux/audit.h  |  5 +
> >  include/uapi/linux/audit.h |  1 +
> >  kernel/audit.c | 20 
> >  kernel/auditsc.c   |  2 ++
> >  4 files changed, 28 insertions(+)
> > 
> > diff --git a/include/linux/audit.h b/include/linux/audit.h
> > index fe4ba3f..3acbe9d 100644
> > --- a/include/linux/audit.h
> > +++ b/include/linux/audit.h
> > @@ -154,6 +154,8 @@ extern void
> > audit_log_link_denied(const char *operation, extern int
> > audit_log_task_context(struct audit_buffer *ab); extern void
> > audit_log_task_info(struct audit_buffer *ab, struct task_struct *tsk);
> > +extern int audit_log_container_info(struct task_struct *tsk,
> > +struct audit_context *context);
> >  
> >  extern int audit_update_lsm_rules(void);
> >  
> > @@ -205,6 +207,9 @@ static inline int audit_log_task_context(struct
> > audit_buffer *ab) static inline void audit_log_task_info(struct
> > audit_buffer *ab, struct task_struct *tsk)
> >  { }
> > +static inline int audit_log_container_info(struct task_struct *tsk,
> > +   struct audit_context
> > *context); +{ }
> >  #define audit_enabled 0
> >  #endif /* CONFIG_AUDIT */
> >  
> > diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
> > index 921a71f..e83ccbd 100644
> > --- a/include/uapi/linux/audit.h
> > +++ b/include/uapi/linux/audit.h
> > @@ -115,6 +115,7 @@
> >  #define AUDIT_REPLACE  1329/* Replace auditd
> > if this packet unanswerd */ #define AUDIT_KERN_MODULE
> > 1330/* Kernel Module events */ #define
> > AUDIT_FANOTIFY  1331/* Fanotify access decision
> > */ +#define AUDIT_CONTAINER_INFO1332/* Container ID
> > information */ #define AUDIT_AVC1400/* SE
> > Linux avc denial or grant */ #define AUDIT_SELINUX_ERR
> > 1401/* Internal SE Linux Errors */ diff --git
> > a/kernel/audit.c b/kernel/audit.c index 3f2f143..a12f21f 100644
> > --- a/kernel/audit.c
> > +++ b/kernel/audit.c
> > @@ -2049,6 +2049,26 @@ void audit_log_session_info(struct
> > audit_buffer *ab) audit_log_format(ab, " auid=%u ses=%u", auid,
> > sessionid); }
> >  
> > +/*
> > + * audit_log_container_info - report container info
> > + * @tsk: task to be recorded
> > + * @context: task or local context for record
> > + */
> > +int audit_log_container_info(struct task_struct *tsk, struct
> > audit_context *context) +{
> > +   struct audit_buffer *ab;
> > +
> > +   if (!audit_containerid_set(tsk))
> > +   return 0;
> > +   /* Generate AUDIT_CONTAINER_INFO with container ID */
> > +   ab = audit_log_start(context, GFP_KERNEL,
> > AUDIT_CONTAINER_INFO);
> > +   if (!ab)
> > +   return -ENOMEM;
> > +   audit_log_format(ab, "contid=%llu",
> > audit_get_containerid(tsk));
> > +   audit_log_end(ab);
> > +   return 0;
> > +}
> > +
> >  void audit_log_key(struct audit_buffer *ab, char *key)
> >  {
> > audit_log_format(ab, " key=");
> > diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> > index a6b0a52..65be110 100644
> > --- a/kernel/auditsc.c
> > +++ b/kernel/auditsc.c
> > @@ -1453,6 +1453,8 @@ static void audit_log_exit(struct 

Re: [RFC PATCH bpf-next 12/12] i40e: implement Tx zero-copy

2018-05-17 Thread Jesper Dangaard Brouer

On Tue, 15 May 2018 21:06:15 +0200 Björn Töpel  wrote:

> From: Magnus Karlsson 
> 
> Here, the zero-copy ndo is implemented. As a shortcut, the existing
> XDP Tx rings are used for zero-copy. This means that an XDP program
> cannot redirect to an AF_XDP enabled XDP Tx ring.

This "shortcut" is not acceptable, and completely broken.  The
XDP_REDIRECT queue_index is based on smp_processor_id(), and can easily
clash with the configured XSK queue_index.  Provided a bit more code
context below...

On Tue, 15 May 2018 21:06:15 +0200
Björn Töpel  wrote:

int i40e_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf)
{
struct i40e_netdev_priv *np = netdev_priv(dev);
unsigned int queue_index = smp_processor_id();
struct i40e_vsi *vsi = np->vsi;
int err;

if (test_bit(__I40E_VSI_DOWN, vsi->state))
return -ENETDOWN;

> @@ -4025,6 +4158,9 @@ int i40e_xdp_xmit(struct net_device *dev, struct 
> xdp_frame *xdpf)
>   if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
>   return -ENXIO;
>  
> + if (vsi->xdp_rings[queue_index]->xsk_umem)
> + return -ENXIO;
> +

Using the same errno for both cases makes this impossible to debug (via the tracepoints).

>   err = i40e_xmit_xdp_ring(xdpf, vsi->xdp_rings[queue_index]);
>   if (err != I40E_XDP_TX)
>   return -ENOSPC;
> @@ -4048,5 +4184,34 @@ void i40e_xdp_flush(struct net_device *dev)
>   if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
>   return;
>  
> + if (vsi->xdp_rings[queue_index]->xsk_umem)
> + return;
> +
>   i40e_xdp_ring_update_tail(vsi->xdp_rings[queue_index]);
>  }



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
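
To make the clash described above concrete, here is a small user-space sketch. It is not i40e code: the struct, the function, the fixed queue limit and the choice of -EBUSY as a second errno are all hypothetical and only serve to show the CPU-indexed queue selection and why two distinguishable return values help when reading tracepoints.

```c
#include <errno.h>
#include <stdbool.h>

#define SKETCH_MAX_QUEUES 8	/* hypothetical limit for this sketch */

/* Hypothetical stand-in for the VSI state the check looks at. */
struct sketch_vsi {
	unsigned int num_queue_pairs;		/* must be <= SKETCH_MAX_QUEUES */
	bool xsk_bound[SKETCH_MAX_QUEUES];	/* ring has an AF_XDP umem */
};

/*
 * The Tx queue is chosen from the CPU number, as in i40e_xdp_xmit().
 * If that ring happens to be bound to AF_XDP, the redirect fails.
 */
static int sketch_xdp_xmit_check(const struct sketch_vsi *vsi, unsigned int cpu)
{
	unsigned int queue_index = cpu;

	if (queue_index >= vsi->num_queue_pairs)
		return -ENXIO;

	/* Hypothetical choice: a different errno than the check above. */
	if (vsi->xsk_bound[queue_index])
		return -EBUSY;

	return 0;
}
```

In this sketch a redirect running on CPU 2 fails as soon as queue 2 is bound to an AF_XDP socket, even though the XDP program never asked for that queue.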


Re: [PATCH bpf] bpf: fix truncated jump targets on heavy expansions

2018-05-17 Thread Martin KaFai Lau
On Thu, May 17, 2018 at 01:44:11AM +0200, Daniel Borkmann wrote:
> Recently during testing, I ran into the following panic:
> 
>   [  207.892422] Internal error: Accessing user space memory outside 
> uaccess.h routines: 9604 [#1] SMP
>   [  207.901637] Modules linked in: binfmt_misc [...]
>   [  207.966530] CPU: 45 PID: 2256 Comm: test_verifier Tainted: GW
>  4.17.0-rc3+ #7
>   [  207.974956] Hardware name: FOXCONN R2-1221R-A4/C2U4N_MB, BIOS G31FB18A 
> 03/31/2017
>   [  207.982428] pstate: 6045 (nZCv daif +PAN -UAO)
>   [  207.987214] pc : bpf_skb_load_helper_8_no_cache+0x34/0xc0
>   [  207.992603] lr : 0x00bdb754
>   [  207.996080] sp : 13703ca0
>   [  207.999384] x29: 13703ca0 x28: 0001
>   [  208.004688] x27: 0001 x26: 
>   [  208.009992] x25: 13703ce0 x24: 800fb4afcb00
>   [  208.015295] x23: 7d2f5038 x22: 7d2f5000
>   [  208.020599] x21: feff2a6f x20: 000a
>   [  208.025903] x19: 09578000 x18: 0a03
>   [  208.031206] x17:  x16: 
>   [  208.036510] x15: 9de83000 x14: 
>   [  208.041813] x13:  x12: 
>   [  208.047116] x11: 0001 x10: 089e7f18
>   [  208.052419] x9 : feff2a6f x8 : 
>   [  208.057723] x7 : 000a x6 : 00280c616000
>   [  208.063026] x5 : 0018 x4 : 7db6
>   [  208.068329] x3 : 0008647a x2 : 19868179b1484500
>   [  208.073632] x1 :  x0 : 09578c08
>   [  208.078938] Process test_verifier (pid: 2256, stack limit = 
> 0x49ca7974)
>   [  208.086235] Call trace:
>   [  208.088672]  bpf_skb_load_helper_8_no_cache+0x34/0xc0
>   [  208.093713]  0x00bdb754
>   [  208.096845]  bpf_test_run+0x78/0xf8
>   [  208.100324]  bpf_prog_test_run_skb+0x148/0x230
>   [  208.104758]  sys_bpf+0x314/0x1198
>   [  208.108064]  el0_svc_naked+0x30/0x34
>   [  208.111632] Code: 91302260 f941 f9001fa1 d281 (29500680)
>   [  208.117717] ---[ end trace 263cb8a59b5bf29f ]---
> 
> The program itself which caused this had a long jump over the whole
> instruction sequence where all of the inner instructions required
> heavy expansions into multiple BPF instructions. Additionally, I also
> had BPF hardening enabled which requires once more rewrites of all
> constant values in order to blind them. Each time we rewrite insns,
> bpf_adj_branches() would need to potentially adjust branch targets
> which cross the patchlet boundary to accommodate for the additional
> delta. Eventually that led to the case where the target offset could
> not fit into insn->off's upper 0x7fff limit anymore where then offset
> wraps around becoming negative (in s16 universe), or vice versa
> depending on the jump direction.
> 
> Therefore it becomes necessary to detect and reject any such occasions
> in a generic way for native eBPF and cBPF to eBPF migrations. For
> the latter we can simply check bounds in the bpf_convert_filter()'s
> BPF_EMIT_JMP helper macro and bail out once we surpass limits. The
> bpf_patch_insn_single() for native eBPF (and cBPF to eBPF in case
> of subsequent hardening) is a bit more complex in that we need to
> detect such truncations before hitting the bpf_prog_realloc(). Thus
> the latter is split into an extra pass to probe problematic offsets
> on the original program in order to fail early. With that in place
> and carefully tested I no longer hit the panic and the rewrites are
> rejected properly. The above example panic I've seen on bpf-next,
> though the issue itself is generic in that a guard against this issue
> in bpf seems more appropriate in this case.
> 
> Signed-off-by: Daniel Borkmann 
Acked-by: Martin KaFai Lau 
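
The truncation described in the commit message can be illustrated with a short user-space sketch of a range-checked branch adjustment. This is not the code from the patch: the function name is hypothetical, and the use of -ERANGE as the failure value is an assumption made for the example.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper: apply a delta to a 16-bit signed jump offset
 * (the width of insn->off) and refuse the result if it no longer fits,
 * instead of letting it wrap around silently.
 */
static int sketch_adj_branch_off(int16_t off, int32_t delta, int16_t *new_off)
{
	/* widen first so the addition itself cannot overflow */
	int64_t adjusted = (int64_t)off + delta;

	if (adjusted < INT16_MIN || adjusted > INT16_MAX)
		return -ERANGE;

	*new_off = (int16_t)adjusted;
	return 0;
}
```

For example, an offset of 0x7ff0 grown by 32 instructions would become 0x8010, which as an s16 is negative; the sketch rejects it rather than producing a backward jump.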


Re: [PATCH net 0/7] net: ip6_gre: Fixes in headroom handling

2018-05-17 Thread David Miller
From: Petr Machata 
Date: Fri, 18 May 2018 00:03:58 +0300

> David Miller  writes:
> 
>> Series applied, thank you.
> 
> Hi David, I forgot to add Fixes lines to the individual patches. I
> replied to the e-mails with those. Let me know if you want me to send a
> v2 with that and the Acked-by's.

When something is already in my tree, it can't be changed as it is committed
to the permanent record of my GIT tree and I cannot rebase since so many
people clone my tree.

Luckily for you, your Fixes: tags went out before I pushed, so I could
actually fix up the commit messages and add the tags.

>> Those reproducible test cases in the various commit messages are
>> pretty awesome.  Could you please extract them and put them somewhere
>> appropriate under selftests?
> 
> The ip6gretap one is covered by the mirror_gre test if you run it
> with veth devices instead of HW ports, but I can make it self-contained
> if you think that would be better.
> 
> I'll add the erspan one.

Thank you.


[bpf-next PATCH v2 2/2] bpf: add sk_msg prog sk access tests to test_verifier

2018-05-17 Thread John Fastabend
Add tests for BPF_PROG_TYPE_SK_MSG to test_verifier for read access
to new sk fields.

Signed-off-by: John Fastabend 
Acked-by: Martin KaFai Lau 
---
 tools/include/uapi/linux/bpf.h  |8 ++
 tools/testing/selftests/bpf/test_verifier.c |  115 +++
 2 files changed, 123 insertions(+)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index d94d333..97446bb 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2176,6 +2176,14 @@ enum sk_action {
 struct sk_msg_md {
void *data;
void *data_end;
+
+   __u32 family;
+   __u32 remote_ip4;   /* Stored in network byte order */
+   __u32 local_ip4;/* Stored in network byte order */
+   __u32 remote_ip6[4];/* Stored in network byte order */
+   __u32 local_ip6[4]; /* Stored in network byte order */
+   __u32 remote_port;  /* Stored in network byte order */
+   __u32 local_port;   /* stored in host byte order */
 };
 
 #define BPF_TAG_SIZE   8
diff --git a/tools/testing/selftests/bpf/test_verifier.c 
b/tools/testing/selftests/bpf/test_verifier.c
index a877af0..6ec4d9d 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -1686,6 +1686,121 @@ static void bpf_fill_rand_ld_dw(struct bpf_test *self)
.prog_type = BPF_PROG_TYPE_SK_SKB,
},
{
+   "valid access family in SK_MSG",
+   .insns = {
+   BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+   offsetof(struct sk_msg_md, family)),
+   BPF_EXIT_INSN(),
+   },
+   .result = ACCEPT,
+   .prog_type = BPF_PROG_TYPE_SK_MSG,
+   },
+   {
+   "valid access remote_ip4 in SK_MSG",
+   .insns = {
+   BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+   offsetof(struct sk_msg_md, remote_ip4)),
+   BPF_EXIT_INSN(),
+   },
+   .result = ACCEPT,
+   .prog_type = BPF_PROG_TYPE_SK_MSG,
+   },
+   {
+   "valid access local_ip4 in SK_MSG",
+   .insns = {
+   BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+   offsetof(struct sk_msg_md, local_ip4)),
+   BPF_EXIT_INSN(),
+   },
+   .result = ACCEPT,
+   .prog_type = BPF_PROG_TYPE_SK_MSG,
+   },
+   {
+   "valid access remote_port in SK_MSG",
+   .insns = {
+   BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+   offsetof(struct sk_msg_md, remote_port)),
+   BPF_EXIT_INSN(),
+   },
+   .result = ACCEPT,
+   .prog_type = BPF_PROG_TYPE_SK_MSG,
+   },
+   {
+   "valid access local_port in SK_MSG",
+   .insns = {
+   BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+   offsetof(struct sk_msg_md, local_port)),
+   BPF_EXIT_INSN(),
+   },
+   .result = ACCEPT,
+   .prog_type = BPF_PROG_TYPE_SK_MSG,
+   },
+   {
+   "valid access remote_ip6 in SK_MSG",
+   .insns = {
+   BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+   offsetof(struct sk_msg_md, remote_ip6[0])),
+   BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+   offsetof(struct sk_msg_md, remote_ip6[1])),
+   BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+   offsetof(struct sk_msg_md, remote_ip6[2])),
+   BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+   offsetof(struct sk_msg_md, remote_ip6[3])),
+   BPF_EXIT_INSN(),
+   },
+   .result = ACCEPT,
+   .prog_type = BPF_PROG_TYPE_SK_MSG,
+   },
+   {
+   "valid access local_ip6 in SK_MSG",
+   .insns = {
+   BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+   offsetof(struct sk_msg_md, local_ip6[0])),
+   BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+   offsetof(struct sk_msg_md, local_ip6[1])),
+   BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+   offsetof(struct sk_msg_md, local_ip6[2])),
+   BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+   offsetof(struct sk_msg_md, local_ip6[3])),
+   BPF_EXIT_INSN(),
+   },
+   .result = ACCEPT,
+

[bpf-next PATCH v2 1/2] bpf: allow sk_msg programs to read sock fields

2018-05-17 Thread John Fastabend
Currently sk_msg programs only have access to the raw data. However,
it is often useful when building policies to have the policies specific
to the socket endpoint. This allows using the socket tuple as input
into filters, etc.

This patch adds ctx access to the sock fields.

Signed-off-by: John Fastabend 
Acked-by: Martin KaFai Lau 
---
 include/linux/filter.h   |1 
 include/uapi/linux/bpf.h |8 +++
 kernel/bpf/sockmap.c |1 
 net/core/filter.c|  114 +-
 4 files changed, 121 insertions(+), 3 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 9dbcb9d..d358d18 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -517,6 +517,7 @@ struct sk_msg_buff {
bool sg_copy[MAX_SKB_FRAGS];
__u32 flags;
struct sock *sk_redir;
+   struct sock *sk;
struct sk_buff *skb;
struct list_head list;
 };
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index d94d333..97446bb 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2176,6 +2176,14 @@ enum sk_action {
 struct sk_msg_md {
void *data;
void *data_end;
+
+   __u32 family;
+   __u32 remote_ip4;   /* Stored in network byte order */
+   __u32 local_ip4;/* Stored in network byte order */
+   __u32 remote_ip6[4];/* Stored in network byte order */
+   __u32 local_ip6[4]; /* Stored in network byte order */
+   __u32 remote_port;  /* Stored in network byte order */
+   __u32 local_port;   /* stored in host byte order */
 };
 
 #define BPF_TAG_SIZE   8
diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index c6de139..0ebf157 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -523,6 +523,7 @@ static unsigned int smap_do_tx_msg(struct sock *sk,
}
 
bpf_compute_data_pointers_sg(md);
+   md->sk = sk;
rc = (*prog->bpf_func)(md, prog->insnsi);
psock->apply_bytes = md->apply_bytes;
 
diff --git a/net/core/filter.c b/net/core/filter.c
index 6d0d156..aec5eba 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5148,18 +5148,23 @@ static bool sk_msg_is_valid_access(int off, int size,
switch (off) {
case offsetof(struct sk_msg_md, data):
info->reg_type = PTR_TO_PACKET;
+   if (size != sizeof(__u64))
+   return false;
break;
case offsetof(struct sk_msg_md, data_end):
info->reg_type = PTR_TO_PACKET_END;
+   if (size != sizeof(__u64))
+   return false;
break;
+   default:
+   if (size != sizeof(__u32))
+   return false;
}
 
if (off < 0 || off >= sizeof(struct sk_msg_md))
return false;
if (off % size != 0)
return false;
-   if (size != sizeof(__u64))
-   return false;
 
return true;
 }
@@ -5835,7 +5840,8 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
break;
 
case offsetof(struct bpf_sock_ops, local_ip4):
-   BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_rcv_saddr) != 
4);
+   BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common,
+ skc_rcv_saddr) != 4);
 
*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
  struct bpf_sock_ops_kern, sk),
@@ -6152,6 +6158,7 @@ static u32 sk_msg_convert_ctx_access(enum bpf_access_type 
type,
 struct bpf_prog *prog, u32 *target_size)
 {
struct bpf_insn *insn = insn_buf;
+   int off;
 
switch (si->off) {
case offsetof(struct sk_msg_md, data):
@@ -6164,6 +6171,107 @@ static u32 sk_msg_convert_ctx_access(enum 
bpf_access_type type,
  si->dst_reg, si->src_reg,
  offsetof(struct sk_msg_buff, data_end));
break;
+   case offsetof(struct sk_msg_md, family):
+   BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_family) != 2);
+
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
+ struct sk_msg_buff, sk),
+ si->dst_reg, si->src_reg,
+ offsetof(struct sk_msg_buff, sk));
+   *insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->dst_reg,
+ offsetof(struct sock_common, skc_family));
+   break;
+
+   case offsetof(struct sk_msg_md, remote_ip4):
+   BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_daddr) != 4);
+
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
+   struct sk_msg_buff, sk),
+  

[bpf-next PATCH v2 0/2] SK_MSG programs: read sock fields

2018-05-17 Thread John Fastabend
In this series we add the ability for sk msg programs to read basic
sock information about the sock they are attached to. The second
patch adds the tests to the selftest test_verifier.

One observation that I had from writing this series is that lots of the
./net/core/filter.c code is almost duplicated across program types.
I thought about building a template/macro that we could use as a
single block of code to read sock data out for multiple programs,
but I wasn't convinced it was worth it yet. The result was using a
macro saved a couple lines of code per block but made the code
a bit harder to read IMO. We can probably revisit the idea later
if we get more duplication.

v2: add errstr field to negative test_verifier test cases to ensure
we get the expected err string back from the verifier.
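
The access-size rule that patch 1/2 introduces in sk_msg_is_valid_access()
(8-byte loads for data and data_end, 4-byte loads for every other field)
can be modelled in userspace roughly as below. This is only a sketch:
struct msg_md_model and model_is_valid_access() are hypothetical names,
the struct layout is an assumed stand-in for struct sk_msg_md, and the
bounds check is done before the modulo to keep the model safe on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct sk_msg_md; NOT the uapi definition. */
struct msg_md_model {
	uint64_t data;          /* models the data pointer */
	uint64_t data_end;      /* models the data_end pointer */
	uint32_t family;
	uint32_t remote_ip4;
	uint32_t local_ip4;
	uint32_t remote_ip6[4];
	uint32_t local_ip6[4];
	uint32_t remote_port;
	uint32_t local_port;
};

/* Layout assumption of this model: the two pointer slots are 8 bytes each. */
static_assert(offsetof(struct msg_md_model, data_end) == 8,
	      "model assumes 8-byte pointer slots");

/* Returns true when a load of `size` bytes at offset `off` is allowed. */
static bool model_is_valid_access(int off, int size)
{
	if (off < 0 || (size_t)off >= sizeof(struct msg_md_model))
		return false;
	if (size <= 0 || off % size != 0)
		return false;

	switch (off) {
	case offsetof(struct msg_md_model, data):
	case offsetof(struct msg_md_model, data_end):
		/* pointer fields must be read as a full 64-bit value */
		return size == (int)sizeof(uint64_t);
	default:
		/* every other field is read as a 32-bit value */
		return size == (int)sizeof(uint32_t);
	}
}
```

In this model a 4-byte read of family is accepted, while an 8-byte read of
family or a 4-byte read of data is rejected.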

---

John Fastabend (2):
  bpf: allow sk_msg programs to read sock fields
  bpf: add sk_msg prog sk access tests to test_verifier


 include/linux/filter.h  |1 
 include/uapi/linux/bpf.h|8 ++
 kernel/bpf/sockmap.c|1 
 net/core/filter.c   |  114 ++-
 tools/include/uapi/linux/bpf.h  |8 ++
 tools/testing/selftests/bpf/test_verifier.c |  115 +++
 6 files changed, 244 insertions(+), 3 deletions(-)



Re: [PATCH v3] mlx4_core: allocate ICM memory in page size chunks

2018-05-17 Thread Eric Dumazet


On 05/17/2018 01:53 PM, Qing Huang wrote:
> When a system is under memory pressure (high usage with fragments),
> the original 256KB ICM chunk allocations will likely trigger kernel
> memory management to enter slow path doing memory compact/migration
> ops in order to complete high order memory allocations.
> 
> When that happens, user processes calling uverb APIs may get stuck
> for more than 120s easily even though there are a lot of free pages
> in smaller chunks available in the system.
> 
> Syslog:
> ...
> Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
> oracle_205573_e:205573 blocked for more than 120 seconds.
> ...
> 


NACK on this patch.

You have been asked repeatedly to use kvmalloc()

This is not a minor suggestion.

Take a look at 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d8c13f2271ec5178c52fbde072ec7b562651ed9d

And you'll understand some people care about this.

Strongly.

Thanks.



Re: [PATCH net-next v3 0/3] net: Allow more drivers with COMPILE_TEST

2018-05-17 Thread David Miller
From: Florian Fainelli 
Date: Thu, 17 May 2018 13:07:42 -0700

> Hi David,
> 
> This patch series includes more drivers to be build tested with COMPILE_TEST
> enabled. This helps cover some of the issues I just ran into with missing
> a driver *sigh*.
> 
> Changes in v3:
> 
> - drop the TI Keystone NETCP driver from the COMPILE_TEST additions
> 
> Changes in v2:
> 
> - allow FEC to build outside of CONFIG_ARM/ARM64 by defining a layout of
>   registers, this is not meant to run, so this is not a real issue if we
>   are not matching the correct register layout

Ok, series applied.

Just some printf format string warnings to clear up on 64-bit in TI
driver files davinci_cpdma.c, cpsw.c, and cpts.c.

In file included from ./arch/x86/include/asm/bug.h:83:0,
 from ./include/linux/bug.h:5,
 from ./include/linux/thread_info.h:12,
 from ./arch/x86/include/asm/preempt.h:7,
 from ./include/linux/preempt.h:81,
 from ./include/linux/spinlock.h:51,
 from drivers/net/ethernet/ti/davinci_cpdma.c:16:
drivers/net/ethernet/ti/davinci_cpdma.c: In function ‘cpdma_desc_pool_destroy’:
drivers/net/ethernet/ti/davinci_cpdma.c:194:7: warning: format ‘%d’ expects 
argument of type ‘int’, but argument 2 has type ‘size_t {aka long unsigned 
int}’ [-Wformat=]
   "cpdma_desc_pool size %d != avail %d",
   ^
   gen_pool_size(pool->gen_pool),
   ~
./include/asm-generic/bug.h:98:50: note: in definition of macro ‘__WARN_printf’
 #define __WARN_printf(arg...) do { __warn_printk(arg); __WARN(); } while (0)
  ^~~
drivers/net/ethernet/ti/davinci_cpdma.c:193:2: note: in expansion of macro 
‘WARN’
  WARN(gen_pool_size(pool->gen_pool) != gen_pool_avail(pool->gen_pool),
  ^~~~
drivers/net/ethernet/ti/davinci_cpdma.c:194:7: warning: format ‘%d’ expects 
argument of type ‘int’, but argument 3 has type ‘size_t {aka long unsigned 
int}’ [-Wformat=]
   "cpdma_desc_pool size %d != avail %d",
   ^
drivers/net/ethernet/ti/davinci_cpdma.c:196:7:
   gen_pool_avail(pool->gen_pool));
   ~~
./include/asm-generic/bug.h:98:50: note: in definition of macro ‘__WARN_printf’
 #define __WARN_printf(arg...) do { __warn_printk(arg); __WARN(); } while (0)
  ^~~
drivers/net/ethernet/ti/davinci_cpdma.c:193:2: note: in expansion of macro 
‘WARN’
  WARN(gen_pool_size(pool->gen_pool) != gen_pool_avail(pool->gen_pool),
  ^~~~
In file included from ./arch/x86/include/asm/realmode.h:15:0,
 from ./arch/x86/include/asm/acpi.h:33,
 from ./arch/x86/include/asm/fixmap.h:19,
 from ./arch/x86/include/asm/apic.h:10,
 from ./arch/x86/include/asm/smp.h:13,
 from ./arch/x86/include/asm/mmzone_64.h:11,
 from ./arch/x86/include/asm/mmzone.h:5,
 from ./include/linux/mmzone.h:911,
 from ./include/linux/gfp.h:6,
 from ./include/linux/idr.h:16,
 from ./include/linux/kernfs.h:14,
 from ./include/linux/sysfs.h:16,
 from ./include/linux/kobject.h:20,
 from ./include/linux/device.h:16,
 from drivers/net/ethernet/ti/davinci_cpdma.c:17:
drivers/net/ethernet/ti/davinci_cpdma.c: In function ‘cpdma_chan_submit’:
drivers/net/ethernet/ti/davinci_cpdma.c:1083:17: warning: passing argument 1 of 
‘__writel’ makes integer from pointer without a cast [-Wint-conversion]
  writel_relaxed(token, &desc->sw_token);
 ^
./arch/x86/include/asm/io.h:88:39: note: in definition of macro ‘writel_relaxed’
 #define writel_relaxed(v, a) __writel(v, a)
   ^
./arch/x86/include/asm/io.h:71:18: note: expected ‘unsigned int’ but argument 
is of type ‘void *’
 build_mmio_write(__writel, "l", unsigned int, "r", )
  ^
./arch/x86/include/asm/io.h:53:20: note: in definition of macro 
‘build_mmio_write’
 static inline void name(type val, volatile void __iomem *addr) \
^~~~
drivers/net/ethernet/ti/davinci_cpdma.c: In function ‘__cpdma_chan_free’:
drivers/net/ethernet/ti/davinci_cpdma.c:1126:15: warning: cast to pointer from 
integer of different size [-Wint-to-pointer-cast]
  token  = (void *)desc_read(desc, sw_token);
   ^
In file included from ./include/linux/kernel.h:14:0,
 from ./include/linux/uio.h:12,
 from ./include/linux/socket.h:8,
 from ./include/uapi/linux/if.h:25,
 from drivers/net/ethernet/ti/cpts.c:21:
drivers/net/ethernet/ti/cpts.c: In function ‘cpts_overflow_check’:
drivers/net/ethernet/ti/cpts.c:297:11: warning: format ‘%lld’ expects argument 
of type ‘long long int’, but argument 3 has type ‘__kernel_time_t {aka long 
int}’ [-Wformat=]
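
The -Wformat warnings in the log above come from printing a size_t with
%d and a long-typed time value with %lld. A common way to make such
format strings portable between 32-bit and 64-bit builds is %zu for
size_t and an explicit cast to long long for %lld. The userspace sketch
below only illustrates that idea; format_pool_msg() and format_seconds()
are hypothetical helpers, not the fix that went into the TI drivers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <time.h>

/* %zu matches size_t on every ABI, so no cast is needed. */
static int format_pool_msg(char *buf, size_t len, size_t size, size_t avail)
{
	assert(buf != NULL);
	return snprintf(buf, len, "cpdma_desc_pool size %zu != avail %zu",
			size, avail);
}

/* time_t may be long or long long; cast so the argument matches %lld. */
static int format_seconds(char *buf, size_t len, time_t sec)
{
	assert(buf != NULL);
	return snprintf(buf, len, "%lld", (long long)sec);
}
```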
  

Re: [PATCH 0/2] bpf: sockmap, fix uninitialized variable and double-free

2018-05-17 Thread Gustavo A. R. Silva

Hi Daniel,

On 05/17/2018 03:51 PM, Daniel Borkmann wrote:
> On 05/17/2018 04:04 PM, Gustavo A. R. Silva wrote:
>> This patchset aims to fix an uninitialized variable issue and
>> a double-free issue in __sock_map_ctx_update_elem.
>>
>> Both issues were reported by Coverity.
>>
>> Thanks.
>>
>> Gustavo A. R. Silva (2):
>>    bpf: sockmap, fix uninitialized variable
>>    bpf: sockmap, fix double-free
>>
>>   kernel/bpf/sockmap.c | 3 +--
>>   1 file changed, 1 insertion(+), 2 deletions(-)
>
> Applied to bpf-next, thanks Gustavo!

Glad to help. :)

> P.s.: Please indicate that next time in the email subject via '[PATCH
> bpf-next]'.

OK. Will do that.

Thanks
--
Gustavo


Re: [RFC PATCH ghak32 V2 03/13] audit: log container info of syscalls

2018-05-17 Thread Steve Grubb
On Fri, 16 Mar 2018 05:00:30 -0400
Richard Guy Briggs  wrote:

> Create a new audit record AUDIT_CONTAINER_INFO to document the
> container ID of a process if it is present.

As mentioned in a previous email, I think AUDIT_CONTAINER is more
suitable for the container record. One more comment below...

> Called from audit_log_exit(), syscalls are covered.
> 
> A sample raw event:
> type=SYSCALL msg=audit(1519924845.499:257): arch=c03e syscall=257
> success=yes exit=3 a0=ff9c a1=56374e1cef30 a2=241 a3=1b6 items=2
> ppid=606 pid=635 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0
> sgid=0 fsgid=0 tty=pts0 ses=3 comm="bash" exe="/usr/bin/bash"
> subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
> key="tmpcontainerid" type=CWD msg=audit(1519924845.499:257):
> cwd="/root" type=PATH msg=audit(1519924845.499:257): item=0
> name="/tmp/" inode=13863 dev=00:27 mode=041777 ouid=0 ogid=0
> rdev=00:00 obj=system_u:object_r:tmp_t:s0 nametype= PARENT
> cap_fp= cap_fi= cap_fe=0 cap_fver=0
> type=PATH msg=audit(1519924845.499:257): item=1
> name="/tmp/tmpcontainerid" inode=17729 dev=00:27 mode=0100644 ouid=0
> ogid=0 rdev=00:00 obj=unconfined_u:object_r:user_tmp_t:s0
> nametype=CREATE cap_fp= cap_fi=
> cap_fe=0 cap_fver=0 type=PROCTITLE msg=audit(1519924845.499:257):
> proctitle=62617368002D6300736C65657020313B206563686F2074657374203E202F746D702F746D70636F6E7461696E65726964
> type=CONTAINER_INFO msg=audit(1519924845.499:257): op=task
> contid=123458
> 
> See: https://github.com/linux-audit/audit-kernel/issues/32
> Signed-off-by: Richard Guy Briggs 
> ---
>  include/linux/audit.h  |  5 +
>  include/uapi/linux/audit.h |  1 +
>  kernel/audit.c | 20 
>  kernel/auditsc.c   |  2 ++
>  4 files changed, 28 insertions(+)
> 
> diff --git a/include/linux/audit.h b/include/linux/audit.h
> index fe4ba3f..3acbe9d 100644
> --- a/include/linux/audit.h
> +++ b/include/linux/audit.h
> @@ -154,6 +154,8 @@ extern void
> audit_log_link_denied(const char *operation, extern int
> audit_log_task_context(struct audit_buffer *ab); extern void
> audit_log_task_info(struct audit_buffer *ab, struct task_struct *tsk);
> +extern int audit_log_container_info(struct task_struct *tsk,
> +  struct audit_context *context);
>  
>  extern int   audit_update_lsm_rules(void);
>  
> @@ -205,6 +207,9 @@ static inline int audit_log_task_context(struct
> audit_buffer *ab) static inline void audit_log_task_info(struct
> audit_buffer *ab, struct task_struct *tsk)
>  { }
> +static inline int audit_log_container_info(struct task_struct *tsk,
> + struct audit_context
> *context); +{ }
>  #define audit_enabled 0
>  #endif /* CONFIG_AUDIT */
>  
> diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
> index 921a71f..e83ccbd 100644
> --- a/include/uapi/linux/audit.h
> +++ b/include/uapi/linux/audit.h
> @@ -115,6 +115,7 @@
>  #define AUDIT_REPLACE1329/* Replace auditd
> if this packet unanswerd */ #define AUDIT_KERN_MODULE
> 1330  /* Kernel Module events */ #define
> AUDIT_FANOTIFY1331/* Fanotify access decision
> */ +#define AUDIT_CONTAINER_INFO  1332/* Container ID
> information */ #define AUDIT_AVC  1400/* SE
> Linux avc denial or grant */ #define AUDIT_SELINUX_ERR
> 1401  /* Internal SE Linux Errors */ diff --git
> a/kernel/audit.c b/kernel/audit.c index 3f2f143..a12f21f 100644
> --- a/kernel/audit.c
> +++ b/kernel/audit.c
> @@ -2049,6 +2049,26 @@ void audit_log_session_info(struct
> audit_buffer *ab) audit_log_format(ab, " auid=%u ses=%u", auid,
> sessionid); }
>  
> +/*
> + * audit_log_container_info - report container info
> + * @tsk: task to be recorded
> + * @context: task or local context for record
> + */
> +int audit_log_container_info(struct task_struct *tsk, struct
> audit_context *context) +{
> + struct audit_buffer *ab;
> +
> + if (!audit_containerid_set(tsk))
> + return 0;
> + /* Generate AUDIT_CONTAINER_INFO with container ID */
> + ab = audit_log_start(context, GFP_KERNEL,
> AUDIT_CONTAINER_INFO);
> + if (!ab)
> + return -ENOMEM;
> + audit_log_format(ab, "contid=%llu",
> audit_get_containerid(tsk));
> + audit_log_end(ab);
> + return 0;
> +}
> +
>  void audit_log_key(struct audit_buffer *ab, char *key)
>  {
>   audit_log_format(ab, " key=");
> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> index a6b0a52..65be110 100644
> --- a/kernel/auditsc.c
> +++ b/kernel/auditsc.c
> @@ -1453,6 +1453,8 @@ static void audit_log_exit(struct audit_context
> *context, struct task_struct *ts 
>   audit_log_proctitle(tsk, context);
>  
> + audit_log_container_info(tsk, context);

Would there be any problem moving audit_log_container_info before
audit_log_proctitle? There are some 

Re: [PATCH net-next] vlan: Add extack messages for link create

2018-05-17 Thread David Miller
From: David Ahern 
Date: Thu, 17 May 2018 12:29:47 -0700

> Add informative messages for error paths related to adding a
> VLAN to a device.
> 
> Signed-off-by: David Ahern 

Applied, thanks David.


Re: [patch net-next RFC 04/12] dsa: set devlink port attrs for dsa ports

2018-05-17 Thread Andrew Lunn
On Thu, May 17, 2018 at 10:48:55PM +0200, Jiri Pirko wrote:
> Thu, May 17, 2018 at 09:14:32PM CEST, f.faine...@gmail.com wrote:
> >On 05/17/2018 10:39 AM, Jiri Pirko wrote:
>  That is compiled inside "fixed_phy", isn't it?
> >>>
> >>> It matches what CONFIG_FIXED_PHY is, so if it's built-in it also becomes
> >>> built-in, if is modular, it is also modular, this was fixed with
> >>> 40013ff20b1beed31184935fc0aea6a859d4d4ef ("net: dsa: Fix functional
> >>> dsa-loop dependency on FIXED_PHY")
> >> 
> >> Now I have it compiled as module, and after modprobe dsa_loop I see:
> >> [ 1168.129202] libphy: Fixed MDIO Bus: probed
> >> [ 1168.222716] dsa-loop fixed-0:1f: DSA mockup driver: 0x1f
> >> 
> >> These messages I did not see when I had fixed_phy compiled as built-in.
> >> 
> >> But I still see no netdevs :/
> >
> >The platform data assumes there is a network device named "eth0" as the
> 
> Oops, I missed that. I created a dummy device and modprobed again. Now I see:
> 
> $ sudo devlink port
> mdio_bus/fixed-0:1f/0: type eth netdev lan1
> mdio_bus/fixed-0:1f/1: type eth netdev lan2
> mdio_bus/fixed-0:1f/2: type eth netdev lan3
> mdio_bus/fixed-0:1f/3: type eth netdev lan4
> mdio_bus/fixed-0:1f/4: type notset
> mdio_bus/fixed-0:1f/5: type notset
> mdio_bus/fixed-0:1f/6: type notset
> mdio_bus/fixed-0:1f/7: type notset
> mdio_bus/fixed-0:1f/8: type notset
> mdio_bus/fixed-0:1f/9: type notset
> mdio_bus/fixed-0:1f/10: type notset
> mdio_bus/fixed-0:1f/11: type notset
> 
> I wonder why there are ports 4-11

Hi Jiri

ds = dsa_switch_alloc(&mdiodev->dev, DSA_MAX_PORTS);

It is allocating a switch with 12 ports. However only 4 of them have
names. So the core only creates slave devices for those 4.

This is a useful test. Real hardware often has unused ports. A WiFi AP
with a 7 port switch which only uses 6 ports is often seen.

 Andrew
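
The behaviour described above (12 ports allocated, slave devices created
only for the ports that have a name) can be illustrated with a small
standalone model. This is a sketch under stated assumptions, not the DSA
core code: struct model_port, MODEL_MAX_PORTS and model_count_slaves()
are hypothetical names, and "has a name" is modelled as a non-NULL string.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_PORTS 12	/* matches the 12 ports in the devlink output */

/* Hypothetical per-port description; a NULL name means "unused port". */
struct model_port {
	const char *name;
};

/* Count how many ports would get a slave netdev in this model. */
static size_t model_count_slaves(const struct model_port *ports, size_t n)
{
	size_t created = 0;

	assert(ports != NULL);
	for (size_t i = 0; i < n; i++) {
		if (ports[i].name)	/* only named ports become netdevs */
			created++;
	}
	return created;
}
```

With an array of MODEL_MAX_PORTS entries where only the first four are
named lan1 to lan4, the model reports four slave devices and leaves the
remaining eight ports without one, which is what the devlink listing shows.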


Re: [PATCH net-next 1/1] qede: Add build_skb() support.

2018-05-17 Thread David Miller
From: Manish Chopra 
Date: Thu, 17 May 2018 12:05:00 -0700

> This patch makes use of build_skb() throughout the driver's receive
> data path [HW gro flow and non HW gro flow]. With this, the driver can
> build an skb directly from the page segments which are already mapped
> to the hardware instead of allocating a new SKB via netdev_alloc_skb()
> and memcpy'ing the data, which is quite costly.
> 
> This really improves performance (keeping same or slight gain in rx
> throughput) in terms of CPU utilization which is significantly reduced
> [almost half] in non HW gro flow where for every incoming MTU sized
> packet driver had to allocate skb, memcpy headers etc. Additionally
> in that flow, it also gets rid of bunch of additional overheads
> [eth_get_headlen() etc.] to split headers and data in the skb.
> 
> Tested with:
> system: 2 sockets, 4 cores per socket, hyperthreading, 2x4x2=16 cores
> iperf [server]: iperf -s
> iperf [client]: iperf -c  -t 500 -i 10 -P 32
> 
> HW GRO off - w/o build_skb(), throughput: 36.8 Gbits/sec
> 
> Average:  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
> Average:  all     0.59     0.00    32.93     0.00     0.00    43.07     0.00     0.00    23.42
> 
> HW GRO off - with build_skb(), throughput: 36.9 Gbits/sec
> 
> Average:  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
> Average:  all     0.70     0.00    31.70     0.00     0.00    25.68     0.00     0.00    41.92
>                                                                ^^^^^                      ^^^^^
> 
> HW GRO on - w/o build_skb(), throughput: 36.9 Gbits/sec
> 
> Average:  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
> Average:  all     0.86     0.00    24.14     0.00     0.00     6.59     0.00     0.00    68.41
> 
> HW GRO on - with build_skb(), throughput: 37.5 Gbits/sec
> 
> Average:  CPU     %usr    %nice     %sys  %iowait     %irq    %soft   %steal   %guest    %idle
> Average:  all     0.87     0.00    23.75     0.00     0.00     6.19     0.00     0.00    69.19
> Signed-off-by: Ariel Elior 
> Signed-off-by: Manish Chopra 

Looks great, applied, thank you.


Re: [PATCH net v2] net: test tailroom before appending to linear skb

2018-05-17 Thread David Miller
From: Willem de Bruijn 
Date: Thu, 17 May 2018 13:13:29 -0400

> From: Willem de Bruijn 
> 
> Device features may change during transmission. In particular with
> corking, a device may toggle scatter-gather in between allocating
> and writing to an skb.
> 
> Do not unconditionally assume that !NETIF_F_SG at write time implies
> that the same held at alloc time and thus the skb has sufficient
> tailroom.
> 
> This issue predates git history.
> 
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Reported-by: Eric Dumazet 
> Signed-off-by: Willem de Bruijn 
> 
> ---
> 
> v2: fix ipv4 boundary condition

Applied and queued up for -stable, thanks Willem.
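
The idea in the quoted commit message, checking the available tailroom at
write time instead of inferring it from a device flag, can be sketched
with a plain byte buffer. This is a userspace illustration only:
struct model_buf, model_tailroom() and model_append() are hypothetical
names and do not correspond to the sk_buff API or to the patch itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical linear buffer: `len` bytes used out of `cap`. */
struct model_buf {
	unsigned char *data;
	size_t len;
	size_t cap;
};

/* Free space left at the end of the buffer. */
static size_t model_tailroom(const struct model_buf *b)
{
	assert(b != NULL);
	return b->cap - b->len;
}

/*
 * Append only if the bytes fit; return 0 on success and -1 when the
 * caller has to fall back to another path (for example a new buffer).
 */
static int model_append(struct model_buf *b, const void *src, size_t n)
{
	if (n > model_tailroom(b))
		return -1;
	memcpy(b->data + b->len, src, n);
	b->len += n;
	return 0;
}
```

The point of the sketch is that the decision depends on the buffer's
current state, so a feature flag that changed between allocation and
write cannot lead to a write past the end.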

