[PATCH V2 bpf-next 0/2] Perf-based event notification for sock_ops

2018-11-07 Thread Sowmini Varadhan
This patchset uses the eBPF perf-event based notification mechanism to solve
the problem described in
   https://marc.info/?l=linux-netdev&m=154022219423571&w=2.
Thanks to Daniel Borkmann for feedback/input.

V2: inlined the call to sys_perf_event_open() following the style
of existing code in kselftests/bpf

The problem statement is
  We would like to monitor some subset of TCP sockets in user-space
  (the monitoring application would define the 4-tuples it wants to monitor),
  using TCP_INFO stats to analyze reported problems. The idea is to
  use those stats to see where the bottlenecks are likely to be ("is it
  application-limited?" or "is there evidence of BufferBloat in the
  path?" etc)

  Today we can do this by periodically polling for tcp_info, but this
  could be made more efficient if the kernel would asynchronously
  notify the application via tcp_info when some "interesting"
  thresholds (e.g., "RTT variance > X", or "total_retrans > Y" etc)
  are reached. And to make this effective, it is better if
  we could apply the threshold check *before* constructing the
  tcp_info netlink notification, so that we don't waste resources
  constructing notifications that will be discarded by the filter.
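
For reference, the "poll periodically" approach mentioned above amounts to
something like the sketch below (a minimal user-space example; it assumes a
connected TCP socket fd and caller-chosen thresholds, and uses only the
standard struct tcp_info fields):

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>        /* TCP_INFO, struct tcp_info */

/* Return 1 if a threshold was crossed, 0 if not, -1 on error. */
static int check_tcp_info(int fd, unsigned int rttvar_max,
                          unsigned int retrans_max)
{
        struct tcp_info ti;
        socklen_t len = sizeof(ti);

        if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) < 0)
                return -1;

        /* The filtering happens here in user-space, *after* the kernel
         * has already filled in the whole tcp_info.
         */
        if (ti.tcpi_rttvar > rttvar_max ||
            ti.tcpi_total_retrans > retrans_max) {
                printf("threshold crossed: rttvar %u total_retrans %u\n",
                       ti.tcpi_rttvar, ti.tcpi_total_retrans);
                return 1;
        }
        return 0;
}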

This patchset solves the problem by adding perf-event based notification
support for sock_ops (Patch 1). The eBPF kernel module can thus
be designed to apply any desired filters to the bpf_sock_ops and
trigger a perf-event notification based on the verdict from the filter.
The user-space component can use these perf-event notifications to either
read any state managed by the eBPF kernel module, or issue a TCP_INFO
netlink call if desired.

Patch 2 provides a simple example that shows how to use this infra
(and also provides a test case for it)
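
To give a flavor of what such a filter looks like, here is a minimal sketch
of a sock_ops program using bpf_perf_event_output() (which Patch 1 allows
for sock_ops). This is illustrative only: the map name, the threshold and
the payload below are assumptions for this example, not the code from
Patch 2.

#include <linux/bpf.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") notify_events = {
        .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
        .key_size = sizeof(int),
        .value_size = sizeof(__u32),
        .max_entries = 2,
};

SEC("sockops")
int watch_retrans(struct bpf_sock_ops *skops)
{
        __u32 total;

        switch (skops->op) {
        case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
        case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
                /* ask for retransmit callbacks on this socket */
                bpf_sock_ops_cb_flags_set(skops,
                                          BPF_SOCK_OPS_RETRANS_CB_FLAG);
                break;
        case BPF_SOCK_OPS_RETRANS_CB:
                total = skops->total_retrans;
                if (total > 16)    /* arbitrary threshold for the example */
                        bpf_perf_event_output(skops, &notify_events,
                                              BPF_F_CURRENT_CPU,
                                              &total, sizeof(total));
                break;
        }
        skops->reply = 1;
        return 1;
}

char _license[] SEC("license") = "GPL";
int _version SEC("version") = 1;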

Sowmini Varadhan (2):
  bpf: add perf-event notification support for sock_ops
  selftests/bpf: add a test case for sock_ops perf-event notification

 net/core/filter.c |   19 ++
 tools/testing/selftests/bpf/Makefile  |4 +-
 tools/testing/selftests/bpf/test_tcpnotify.h  |   19 ++
 tools/testing/selftests/bpf/test_tcpnotify_kern.c |   95 +++
 tools/testing/selftests/bpf/test_tcpnotify_user.c |  186 +
 5 files changed, 322 insertions(+), 1 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_tcpnotify.h
 create mode 100644 tools/testing/selftests/bpf/test_tcpnotify_kern.c
 create mode 100644 tools/testing/selftests/bpf/test_tcpnotify_user.c



[PATCH V2 bpf-next 2/2] selftests/bpf: add a test case for sock_ops perf-event notification

2018-11-07 Thread Sowmini Varadhan
This patch provides a tcp_bpf based eBPF sample. The test
- uses ncat(1) as the TCP client program to connect() to a port
  with the intention of triggering SYN retransmissions: we
  first install an iptables DROP rule to make sure ncat SYNs are
  resent (instead of aborting instantly after a TCP RST)
- has a bpf kernel module that sends a perf-event notification for
  each TCP retransmit, and also tracks the number of such notifications
  sent in the global_map
The test passes when the number of event notifications intercepted
in user-space matches the value in the global_map.
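
That final pass/fail check amounts to something like the sketch below
(hedged: it uses the tools/lib/bpf bpf_map_lookup_elem() wrapper and assumes
the caller has already counted the perf-event notifications it received; it
is not the literal code from test_tcpnotify_user.c):

#include <stdio.h>
#include <linux/types.h>
#include <bpf/bpf.h>
#include "test_tcpnotify.h"

static int verify_result(int global_map_fd, __u32 perf_event_cnt)
{
        struct tcpnotify_globals g = {};
        __u32 key = 0;

        if (bpf_map_lookup_elem(global_map_fd, &key, &g) < 0)
                return -1;

        /* every retransmit counted by the bpf program must also have
         * been seen as a perf-event notification in user-space
         */
        if (g.ncalls != perf_event_cnt) {
                fprintf(stderr, "FAILED: map says %u, perf saw %u\n",
                        g.ncalls, perf_event_cnt);
                return -1;
        }
        printf("PASSED!\n");
        return 0;
}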

Signed-off-by: Sowmini Varadhan 
---
V2: inline call to sys_perf_event_open() following the style of existing
code in kselftests/bpf

 tools/testing/selftests/bpf/Makefile  |4 +-
 tools/testing/selftests/bpf/test_tcpnotify.h  |   19 ++
 tools/testing/selftests/bpf/test_tcpnotify_kern.c |   95 +++
 tools/testing/selftests/bpf/test_tcpnotify_user.c |  186 +
 4 files changed, 303 insertions(+), 1 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_tcpnotify.h
 create mode 100644 tools/testing/selftests/bpf/test_tcpnotify_kern.c
 create mode 100644 tools/testing/selftests/bpf/test_tcpnotify_user.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index e39dfb4..6c94048 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -24,12 +24,13 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps 
test_lru_map test_lpm_map test
test_align test_verifier_log test_dev_cgroup test_tcpbpf_user \
test_sock test_btf test_sockmap test_lirc_mode2_user get_cgroup_id_user 
\
test_socket_cookie test_cgroup_storage test_select_reuseport 
test_section_names \
-   test_netcnt
+   test_netcnt test_tcpnotify_user
 
 TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o 
test_obj_id.o \
test_pkt_md_access.o test_xdp_redirect.o test_xdp_meta.o 
sockmap_parse_prog.o \
sockmap_verdict_prog.o dev_cgroup.o sample_ret0.o test_tracepoint.o \
test_l4lb_noinline.o test_xdp_noinline.o test_stacktrace_map.o \
+   test_tcpnotify_kern.o \
sample_map_ret0.o test_tcpbpf_kern.o test_stacktrace_build_id.o \
sockmap_tcp_msg_prog.o connect4_prog.o connect6_prog.o 
test_adjust_tail.o \
test_btf_haskv.o test_btf_nokv.o test_sockmap_kern.o test_tunnel_kern.o 
\
@@ -74,6 +75,7 @@ $(OUTPUT)/test_sock_addr: cgroup_helpers.c
 $(OUTPUT)/test_socket_cookie: cgroup_helpers.c
 $(OUTPUT)/test_sockmap: cgroup_helpers.c
 $(OUTPUT)/test_tcpbpf_user: cgroup_helpers.c
+$(OUTPUT)/test_tcpnotify_user: cgroup_helpers.c trace_helpers.c
 $(OUTPUT)/test_progs: trace_helpers.c
 $(OUTPUT)/get_cgroup_id_user: cgroup_helpers.c
 $(OUTPUT)/test_cgroup_storage: cgroup_helpers.c
diff --git a/tools/testing/selftests/bpf/test_tcpnotify.h b/tools/testing/selftests/bpf/test_tcpnotify.h
new file mode 100644
index 000..8b6cea0
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_tcpnotify.h
@@ -0,0 +1,19 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#ifndef _TEST_TCPBPF_H
+#define _TEST_TCPBPF_H
+
+struct tcpnotify_globals {
+   __u32 total_retrans;
+   __u32 ncalls;
+};
+
+struct tcp_notifier {
+   __u8 type;
+   __u8 subtype;
+   __u8 source;
+   __u8 hash;
+};
+
+#define TESTPORT 12877
+#endif
diff --git a/tools/testing/selftests/bpf/test_tcpnotify_kern.c b/tools/testing/selftests/bpf/test_tcpnotify_kern.c
new file mode 100644
index 000..edbca20
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_tcpnotify_kern.c
@@ -0,0 +1,95 @@
+// SPDX-License-Identifier: GPL-2.0
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+#include "test_tcpnotify.h"
+
+struct bpf_map_def SEC("maps") global_map = {
+   .type = BPF_MAP_TYPE_ARRAY,
+   .key_size = sizeof(__u32),
+   .value_size = sizeof(struct tcpnotify_globals),
+   .max_entries = 4,
+};
+
+struct bpf_map_def SEC("maps") perf_event_map = {
+   .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
+   .key_size = sizeof(int),
+   .value_size = sizeof(__u32),
+   .max_entries = 2,
+};
+
+int _version SEC("version") = 1;
+
+SEC("sockops")
+int bpf_testcb(struct bpf_sock_ops *skops)
+{
+   int rv = -1;
+   int op;
+
+   op = (int) skops->op;
+
+   if (bpf_ntohl(skops->remote_port) != TESTPORT) {
+   skops->reply = -1;
+   return 0;
+   }
+
+   switch (op) {
+   case BPF_SOCK_OPS_TIMEOUT_INIT:
+   case BPF_SOCK_OPS_RWND_INIT:
+   case BPF_SOCK_OPS_NEEDS_ECN:
+   case BPF_SOCK_OPS_BASE_RTT:
+   case BPF_SOCK_OPS_RTO_CB:
+   rv = 1;
+   break;
+
+   case BPF

[PATCH V2 bpf-next 1/2] bpf: add perf-event notification support for sock_ops

2018-11-07 Thread Sowmini Varadhan
This patch allows eBPF programs that use sock_ops to send
perf-based event notifications using bpf_perf_event_output()

Signed-off-by: Sowmini Varadhan 
---
 net/core/filter.c |   19 +++
 1 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index e521c5e..23464a3 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4048,6 +4048,23 @@ static unsigned long bpf_xdp_copy(void *dst_buff, const 
void *src_buff,
return ret;
 }
 
+BPF_CALL_5(bpf_sock_opts_event_output, struct bpf_sock_ops *, skops,
+  struct bpf_map *, map, u64, flags, void *, data, u64, size)
+{
+   return bpf_event_output(map, flags, data, size, NULL, 0, NULL);
+}
+
+static const struct bpf_func_proto bpf_sock_ops_event_output_proto =  {
+   .func   = bpf_sock_opts_event_output,
+   .gpl_only   = true,
+   .ret_type   = RET_INTEGER,
+   .arg1_type  = ARG_PTR_TO_CTX,
+   .arg2_type  = ARG_CONST_MAP_PTR,
+   .arg3_type  = ARG_ANYTHING,
+   .arg4_type  = ARG_PTR_TO_MEM,
+   .arg5_type  = ARG_CONST_SIZE_OR_ZERO,
+};
+
 static const struct bpf_func_proto bpf_setsockopt_proto = {
.func   = bpf_setsockopt,
.gpl_only   = false,
@@ -5226,6 +5243,8 @@ bool bpf_helper_changes_pkt_data(void *func)
 sock_ops_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
switch (func_id) {
+   case BPF_FUNC_perf_event_output:
+   return &bpf_sock_ops_event_output_proto;
case BPF_FUNC_setsockopt:
return &bpf_setsockopt_proto;
case BPF_FUNC_getsockopt:
-- 
1.7.1



[PATCH bpf-next 0/2] TCP-BPF event notification support

2018-11-06 Thread Sowmini Varadhan
This patchset uses the eBPF perf-event based notification mechanism to solve
the problem described in
   https://marc.info/?l=linux-netdev&m=154022219423571&w=2.
Thanks to Daniel Borkmann for feedback/input.

The problem statement is
  We would like to monitor some subset of TCP sockets in user-space
  (the monitoring application would define the 4-tuples it wants to monitor),
  using TCP_INFO stats to analyze reported problems. The idea is to
  use those stats to see where the bottlenecks are likely to be ("is it
  application-limited?" or "is there evidence of BufferBloat in the
  path?" etc)

  Today we can do this by periodically polling for tcp_info, but this
  could be made more efficient if the kernel would asynchronously
  notify the application via tcp_info when some "interesting"
  thresholds (e.g., "RTT variance > X", or "total_retrans > Y" etc)
  are reached. And to make this effective, it is better if
  we could apply the threshold check *before* constructing the
  tcp_info netlink notification, so that we don't waste resources
  constructing notifications that will be discarded by the filter.

This patchset solves the problem by adding perf-event based notification
support for sock_ops (Patch 1). The eBPF kernel module can thus
be designed to apply any desired filters to the bpf_sock_ops and
trigger a perf-event notification based on the verdict from the filter.
The user-space component can use these perf-event notifications to either
read any state managed by the eBPF kernel module, or issue a TCP_INFO
netlink call if desired.

Patch 2 provides a simple example that shows how to use this infra
(and also provides a test case for it)

Sowmini Varadhan (2):
  bpf: add perf-event notification support for sock_ops
  selftests/bpf: add a test case for sock_ops perf-event notification

 net/core/filter.c |   19 ++
 tools/testing/selftests/bpf/Makefile  |4 +-
 tools/testing/selftests/bpf/perf-sys.h|   74 
 tools/testing/selftests/bpf/test_tcpnotify.h  |   19 ++
 tools/testing/selftests/bpf/test_tcpnotify_kern.c |   95 +++
 tools/testing/selftests/bpf/test_tcpnotify_user.c |  186 +
 6 files changed, 396 insertions(+), 1 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/perf-sys.h
 create mode 100644 tools/testing/selftests/bpf/test_tcpnotify.h
 create mode 100644 tools/testing/selftests/bpf/test_tcpnotify_kern.c
 create mode 100644 tools/testing/selftests/bpf/test_tcpnotify_user.c



[PATCH bpf-next 2/2] selftests/bpf: add a test case for sock_ops perf-event notification

2018-11-06 Thread Sowmini Varadhan
This patch provides a tcp_bpf based eBPF sample. The test
- uses ncat(1) as the TCP client program to connect() to a port
  with the intention of triggering SYN retransmissions: we
  first install an iptables DROP rule to make sure ncat SYNs are
  resent (instead of aborting instantly after a TCP RST)
- has a bpf kernel module that sends a perf-event notification for
  each TCP retransmit, and also tracks the number of such notifications
  sent in the global_map
The test passes when the number of event notifications intercepted
in user-space matches the value in the global_map.

Signed-off-by: Sowmini Varadhan 
---
 tools/testing/selftests/bpf/Makefile  |4 +-
 tools/testing/selftests/bpf/perf-sys.h|   74 
 tools/testing/selftests/bpf/test_tcpnotify.h  |   19 ++
 tools/testing/selftests/bpf/test_tcpnotify_kern.c |   95 +++
 tools/testing/selftests/bpf/test_tcpnotify_user.c |  186 +
 5 files changed, 377 insertions(+), 1 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/perf-sys.h
 create mode 100644 tools/testing/selftests/bpf/test_tcpnotify.h
 create mode 100644 tools/testing/selftests/bpf/test_tcpnotify_kern.c
 create mode 100644 tools/testing/selftests/bpf/test_tcpnotify_user.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index e39dfb4..6c94048 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -24,12 +24,13 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps 
test_lru_map test_lpm_map test
test_align test_verifier_log test_dev_cgroup test_tcpbpf_user \
test_sock test_btf test_sockmap test_lirc_mode2_user get_cgroup_id_user 
\
test_socket_cookie test_cgroup_storage test_select_reuseport 
test_section_names \
-   test_netcnt
+   test_netcnt test_tcpnotify_user
 
 TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o 
test_obj_id.o \
test_pkt_md_access.o test_xdp_redirect.o test_xdp_meta.o 
sockmap_parse_prog.o \
sockmap_verdict_prog.o dev_cgroup.o sample_ret0.o test_tracepoint.o \
test_l4lb_noinline.o test_xdp_noinline.o test_stacktrace_map.o \
+   test_tcpnotify_kern.o \
sample_map_ret0.o test_tcpbpf_kern.o test_stacktrace_build_id.o \
sockmap_tcp_msg_prog.o connect4_prog.o connect6_prog.o 
test_adjust_tail.o \
test_btf_haskv.o test_btf_nokv.o test_sockmap_kern.o test_tunnel_kern.o 
\
@@ -74,6 +75,7 @@ $(OUTPUT)/test_sock_addr: cgroup_helpers.c
 $(OUTPUT)/test_socket_cookie: cgroup_helpers.c
 $(OUTPUT)/test_sockmap: cgroup_helpers.c
 $(OUTPUT)/test_tcpbpf_user: cgroup_helpers.c
+$(OUTPUT)/test_tcpnotify_user: cgroup_helpers.c trace_helpers.c
 $(OUTPUT)/test_progs: trace_helpers.c
 $(OUTPUT)/get_cgroup_id_user: cgroup_helpers.c
 $(OUTPUT)/test_cgroup_storage: cgroup_helpers.c
diff --git a/tools/testing/selftests/bpf/perf-sys.h b/tools/testing/selftests/bpf/perf-sys.h
new file mode 100644
index 000..3eb7a39
--- /dev/null
+++ b/tools/testing/selftests/bpf/perf-sys.h
@@ -0,0 +1,74 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _PERF_SYS_H
+#define _PERF_SYS_H
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#ifdef __powerpc__
+#define CPUINFO_PROC   {"cpu"}
+#endif
+
+#ifdef __s390__
+#define CPUINFO_PROC   {"vendor_id"}
+#endif
+
+#ifdef __sh__
+#define CPUINFO_PROC   {"cpu type"}
+#endif
+
+#ifdef __hppa__
+#define CPUINFO_PROC   {"cpu"}
+#endif
+
+#ifdef __sparc__
+#define CPUINFO_PROC   {"cpu"}
+#endif
+
+#ifdef __alpha__
+#define CPUINFO_PROC   {"cpu model"}
+#endif
+
+#ifdef __arm__
+#define CPUINFO_PROC   {"model name", "Processor"}
+#endif
+
+#ifdef __mips__
+#define CPUINFO_PROC   {"cpu model"}
+#endif
+
+#ifdef __arc__
+#define CPUINFO_PROC   {"Processor"}
+#endif
+
+#ifdef __xtensa__
+#define CPUINFO_PROC   {"core ID"}
+#endif
+
+#ifndef CPUINFO_PROC
+#define CPUINFO_PROC   { "model name", }
+#endif
+
+static inline int
+sys_perf_event_open(struct perf_event_attr *attr,
+ pid_t pid, int cpu, int group_fd,
+ unsigned long flags)
+{
+   int fd;
+
+   fd = syscall(__NR_perf_event_open, attr, pid, cpu,
+group_fd, flags);
+
+#ifdef HAVE_ATTR_TEST
+   if (unlikely(test_attr__enabled))
+   test_attr__open(attr, pid, cpu, fd, group_fd, flags);
+#endif
+   return fd;
+}
+
+#endif /* _PERF_SYS_H */
diff --git a/tools/testing/selftests/bpf/test_tcpnotify.h b/tools/testing/selftests/bpf/test_tcpnotify.h
new file mode 100644
index 000..8b6cea0
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_tcpnotify.h
@@ -0,0 +1,19 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#ifndef _TEST_TCPBPF_H
+#define _TEST_TCPBPF_H
+
+struct tcpnotify_globals {
+   __u32 total_retrans;
+   __u32 

[PATCH bpf-next 1/2] bpf: add perf-event notification support for sock_ops

2018-11-06 Thread Sowmini Varadhan
This patch allows eBPF programs that use sock_ops to send
perf-based event notifications using bpf_perf_event_output()

Signed-off-by: Sowmini Varadhan 
---
 net/core/filter.c |   19 +++
 1 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index e521c5e..23464a3 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4048,6 +4048,23 @@ static unsigned long bpf_xdp_copy(void *dst_buff, const 
void *src_buff,
return ret;
 }
 
+BPF_CALL_5(bpf_sock_opts_event_output, struct bpf_sock_ops *, skops,
+  struct bpf_map *, map, u64, flags, void *, data, u64, size)
+{
+   return bpf_event_output(map, flags, data, size, NULL, 0, NULL);
+}
+
+static const struct bpf_func_proto bpf_sock_ops_event_output_proto =  {
+   .func   = bpf_sock_opts_event_output,
+   .gpl_only   = true,
+   .ret_type   = RET_INTEGER,
+   .arg1_type  = ARG_PTR_TO_CTX,
+   .arg2_type  = ARG_CONST_MAP_PTR,
+   .arg3_type  = ARG_ANYTHING,
+   .arg4_type  = ARG_PTR_TO_MEM,
+   .arg5_type  = ARG_CONST_SIZE_OR_ZERO,
+};
+
 static const struct bpf_func_proto bpf_setsockopt_proto = {
.func   = bpf_setsockopt,
.gpl_only   = false,
@@ -5226,6 +5243,8 @@ bool bpf_helper_changes_pkt_data(void *func)
 sock_ops_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
switch (func_id) {
+   case BPF_FUNC_perf_event_output:
+   return _sock_ops_event_output_proto;
case BPF_FUNC_setsockopt:
return _setsockopt_proto;
case BPF_FUNC_getsockopt:
-- 
1.7.1



[PATCH RFC net-next 3/3] bpf: Added a sample for tcp_info_notify callback

2018-10-22 Thread Sowmini Varadhan
Simple Proof-Of-Concept test program for BPF_TCP_INFO_NOTIFY
(will move this to testing/selftests/net later)

Signed-off-by: Sowmini Varadhan 
---
 samples/bpf/Makefile  |1 +
 samples/bpf/tcp_notify_kern.c |   73 +
 2 files changed, 74 insertions(+), 0 deletions(-)
 create mode 100644 samples/bpf/tcp_notify_kern.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index be0a961..d937bd2 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -152,6 +152,7 @@ always += tcp_bufs_kern.o
 always += tcp_cong_kern.o
 always += tcp_iw_kern.o
 always += tcp_clamp_kern.o
+always += tcp_notify_kern.o
 always += tcp_basertt_kern.o
 always += tcp_tos_reflect_kern.o
 always += xdp_redirect_kern.o
diff --git a/samples/bpf/tcp_notify_kern.c b/samples/bpf/tcp_notify_kern.c
new file mode 100644
index 000..bc4efd8
--- /dev/null
+++ b/samples/bpf/tcp_notify_kern.c
@@ -0,0 +1,73 @@
+/* Sample BPF program to demonstrate how to trigger async tcp_info
+ * notification based on thresholds determined in the filter.
+ * The example here will trigger notification if skops->total_retrans > 16
+ *
+ * Use load_sock_ops to load this BPF program.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+
+#define DEBUG 0
+
+#define bpf_printk(fmt, ...)   \
+({ \
+  char _fmt[] = fmt;   \
+  bpf_trace_printk(_fmt, sizeof(_fmt), \
+   ##__VA_ARGS__); \
+})
+
+SEC("sockops")
+int bpf_tcp_info_notify(struct bpf_sock_ops *skops)
+{
+   int bufsize = 15;
+   int to_init = 10;
+   int clamp = 100;
+   int rv = 0;
+   int op;
+
+   /* For testing purposes, only execute rest of BPF program
+* if neither port number is 5001 (default iperf port)
+*/
+   if (bpf_ntohl(skops->remote_port) != 5001 &&
+   skops->local_port != 5001) {
+   skops->reply = -1;
+   return 0;
+   }
+
+   op = (int) skops->op;
+
+#ifdef DEBUG
+   bpf_printk("BPF command: %d\n", op);
+#endif
+
+   switch (op) {
+   case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
+   case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
+   bpf_sock_ops_cb_flags_set(skops,
+   (BPF_SOCK_OPS_RETRANS_CB_FLAG|
+   BPF_SOCK_OPS_RTO_CB_FLAG));
+   rv = 1;
+   break;
+   case BPF_SOCK_OPS_RETRANS_CB:
+   case BPF_SOCK_OPS_RTO_CB:
+   if (skops->total_retrans < 16)
+   rv = 1; /* skip */
+   else
+   rv = BPF_TCP_INFO_NOTIFY;
+   break;
+   default:
+   rv = -1;
+   }
+#ifdef DEBUG
+   bpf_printk("Returning %d\n", rv);
+#endif
+   skops->reply = rv;
+   return 1;
+}
+char _license[] SEC("license") = "GPL";
-- 
1.7.1



[PATCH RFC net-next 1/3] sock_diag: Refactor inet_sock_diag_destroy code

2018-10-22 Thread Sowmini Varadhan
We want to use the inet_sock_diag_destroy code to send notifications
for more types of TCP events than just socket_close(), so refactor
the code to allow this.

Signed-off-by: Sowmini Varadhan 
---
 include/linux/sock_diag.h  |   18 +-
 include/uapi/linux/sock_diag.h |2 ++
 net/core/sock.c|4 ++--
 net/core/sock_diag.c   |   11 ++-
 4 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/include/linux/sock_diag.h b/include/linux/sock_diag.h
index 15fe980..df85767 100644
--- a/include/linux/sock_diag.h
+++ b/include/linux/sock_diag.h
@@ -34,7 +34,7 @@ int sock_diag_put_filterinfo(bool may_report_filterinfo, 
struct sock *sk,
 struct sk_buff *skb, int attrtype);
 
 static inline
-enum sknetlink_groups sock_diag_destroy_group(const struct sock *sk)
+enum sknetlink_groups sock_diag_group(const struct sock *sk)
 {
switch (sk->sk_family) {
case AF_INET:
@@ -43,7 +43,15 @@ enum sknetlink_groups sock_diag_destroy_group(const struct 
sock *sk)
 
switch (sk->sk_protocol) {
case IPPROTO_TCP:
-   return SKNLGRP_INET_TCP_DESTROY;
+   switch (sk->sk_state) {
+   case TCP_ESTABLISHED:
+   return SKNLGRP_INET_TCP_CONNECTED;
+   case TCP_SYN_SENT:
+   case TCP_SYN_RECV:
+   return SKNLGRP_INET_TCP_3WH;
+   default:
+   return SKNLGRP_INET_TCP_DESTROY;
+   }
case IPPROTO_UDP:
return SKNLGRP_INET_UDP_DESTROY;
default:
@@ -67,15 +75,15 @@ enum sknetlink_groups sock_diag_destroy_group(const struct 
sock *sk)
 }
 
 static inline
-bool sock_diag_has_destroy_listeners(const struct sock *sk)
+bool sock_diag_has_listeners(const struct sock *sk)
 {
const struct net *n = sock_net(sk);
-   const enum sknetlink_groups group = sock_diag_destroy_group(sk);
+   const enum sknetlink_groups group = sock_diag_group(sk);
 
return group != SKNLGRP_NONE && n->diag_nlsk &&
netlink_has_listeners(n->diag_nlsk, group);
 }
-void sock_diag_broadcast_destroy(struct sock *sk);
+void sock_diag_broadcast(struct sock *sk);
 
 int sock_diag_destroy(struct sock *sk, int err);
 #endif
diff --git a/include/uapi/linux/sock_diag.h b/include/uapi/linux/sock_diag.h
index e592500..4252674 100644
--- a/include/uapi/linux/sock_diag.h
+++ b/include/uapi/linux/sock_diag.h
@@ -32,6 +32,8 @@ enum sknetlink_groups {
SKNLGRP_INET_UDP_DESTROY,
SKNLGRP_INET6_TCP_DESTROY,
SKNLGRP_INET6_UDP_DESTROY,
+   SKNLGRP_INET_TCP_3WH,
+   SKNLGRP_INET_TCP_CONNECTED,
__SKNLGRP_MAX,
 };
 #define SKNLGRP_MAX (__SKNLGRP_MAX - 1)
diff --git a/net/core/sock.c b/net/core/sock.c
index 7e8796a..6684840 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1600,8 +1600,8 @@ static void __sk_free(struct sock *sk)
if (likely(sk->sk_net_refcnt))
sock_inuse_add(sock_net(sk), -1);
 
-   if (unlikely(sk->sk_net_refcnt && sock_diag_has_destroy_listeners(sk)))
-   sock_diag_broadcast_destroy(sk);
+   if (unlikely(sk->sk_net_refcnt && sock_diag_has_listeners(sk)))
+   sock_diag_broadcast(sk);
else
sk_destruct(sk);
 }
diff --git a/net/core/sock_diag.c b/net/core/sock_diag.c
index 3312a58..dbd9e65 100644
--- a/net/core/sock_diag.c
+++ b/net/core/sock_diag.c
@@ -116,14 +116,14 @@ static size_t sock_diag_nlmsg_size(void)
   + nla_total_size_64bit(sizeof(struct tcp_info))); /* 
INET_DIAG_INFO */
 }
 
-static void sock_diag_broadcast_destroy_work(struct work_struct *work)
+static void sock_diag_broadcast_work(struct work_struct *work)
 {
struct broadcast_sk *bsk =
container_of(work, struct broadcast_sk, work);
struct sock *sk = bsk->sk;
const struct sock_diag_handler *hndl;
struct sk_buff *skb;
-   const enum sknetlink_groups group = sock_diag_destroy_group(sk);
+   const enum sknetlink_groups group = sock_diag_group(sk);
int err = -1;
 
WARN_ON(group == SKNLGRP_NONE);
@@ -144,11 +144,12 @@ static void sock_diag_broadcast_destroy_work(struct 
work_struct *work)
else
kfree_skb(skb);
 out:
-   sk_destruct(sk);
+   if (group <= SKNLGRP_INET6_UDP_DESTROY)
+   sk_destruct(sk);
kfree(bsk);
 }
 
-void sock_diag_broadcast_destroy(struct sock *sk)
+void sock_diag_broadcast(struct sock *sk)
 {
/* Note, this function is often called from an interrupt context. */
struct broadcast_sk *bsk =
@@ -156,7 +157,7 @@ void sock_diag_broadcast_destroy(struct sock *sk)
if (!bsk)
return sk_destruct(sk);
bsk->s

[PATCH RFC net-next 0/3] Extensions to allow asynchronous TCP_INFO notifications based on congestion parameters

2018-10-22 Thread Sowmini Varadhan
Problem statement:
  We would like to monitor some subset of TCP sockets in user-space
  (the monitoring application would define the 4-tuples it wants to monitor),
  using TCP_INFO stats to analyze reported problems. The idea is to
  use those stats to see where the bottlenecks are likely to be ("is it
  application-limited?" or "is there evidence of BufferBloat in the
  path?" etc)

  Today we can do this by periodically polling for tcp_info, but this
  could be made more efficient if the kernel would asynchronously
  notify the application via tcp_info when some "interesting"
  thresholds (e.g., "RTT variance > X", or "total_retrans > Y" etc)
  are reached. And to make this effective, it is better if
  we could apply the threshold check *before* constructing the
  tcp_info netlink notification, so that we don't waste resources
  constructing notifications that will be discarded by the filter.

One idea, implemented in this patchset, is to extend the tcp_call_bpf()
infra so that the BPF kernel module (the sock_ops filter/callback)
can examine the values in the sock_ops, apply any thresholds it wants,
and return a new status ("BPF_TCP_INFO_NOTIFY"). The TCP stack then uses
this status to queue up a tcp_info notification (similar to what
sock_diag_broadcast_destroy() does today).

Patch 1 in this set refactors the existing sock_diag code so that
the functions can be reused for notifications from states other than CLOSE.

Patch 2 provides a minor extension to tcp_call_bpf() so that it
will queue a tcp_info notification if the BPF callout returns
BPF_TCP_INFO_NOTIFY.

Patch 3, provided strictly as a demonstration/PoC to aid in reviewing
this proposal, shows a simple sample/bpf example where we trigger the
tcp_info notification for an iperf connection if the number of 
retransmits exceeds 16.
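
On the receive side, a user-space listener for the new notification groups
added in Patch 1 could look roughly like the sketch below (hedged:
SKNLGRP_INET_TCP_CONNECTED/SKNLGRP_INET_TCP_3WH exist only with Patch 1
applied; the rest is the standard sock_diag netlink multicast pattern):

#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/sock_diag.h>

#ifndef SOL_NETLINK
#define SOL_NETLINK 270
#endif

static int listen_tcp_notify(void)
{
        struct sockaddr_nl sa = { .nl_family = AF_NETLINK };
        int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_SOCK_DIAG);
        int grp = SKNLGRP_INET_TCP_CONNECTED;   /* added by Patch 1 */
        char buf[8192];

        if (fd < 0)
                return -1;
        if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0 ||
            setsockopt(fd, SOL_NETLINK, NETLINK_ADD_MEMBERSHIP,
                       &grp, sizeof(grp)) < 0) {
                close(fd);
                return -1;
        }
        for (;;) {
                ssize_t n = recv(fd, buf, sizeof(buf), 0);

                if (n <= 0)
                        break;
                /* each datagram carries an inet_diag_msg + tcp_info,
                 * just like the existing DESTROY broadcasts
                 */
                printf("got %zd byte tcp_info notification\n", n);
        }
        close(fd);
        return 0;
}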


Sowmini Varadhan (3):
  sock_diag: Refactor inet_sock_diag_destroy code
  tcp: BPF_TCP_INFO_NOTIFY support
  bpf: Added a sample for tcp_info_notify callback

 include/linux/sock_diag.h  |   18 +++---
 include/net/tcp.h  |   15 +++-
 include/uapi/linux/bpf.h   |4 ++
 include/uapi/linux/sock_diag.h |2 +
 net/core/sock.c|4 +-
 net/core/sock_diag.c   |   11 +++---
 samples/bpf/Makefile   |1 +
 samples/bpf/tcp_notify_kern.c  |   73 
 8 files changed, 114 insertions(+), 14 deletions(-)
 create mode 100644 samples/bpf/tcp_notify_kern.c



[PATCH RFC net-next 2/3] tcp: BPF_TCP_INFO_NOTIFY support

2018-10-22 Thread Sowmini Varadhan
We want to be able to set up the monitoring application so that it can
be asynchronously notified when "interesting" events happen, e.g., when
application-determined thresholds on parameters like RTT estimate, number
of retransmissions, or RTO are reached.

The bpf_sock_ops infrastructure provided as part of Commit 40304b2a1567
("bpf: BPF support for sock_ops") provides an elegant way to trigger
this asynchronous notification. The BPF program can examine the
current TCP state reported in the bpf_sock_ops and conditionally
return a (new) status BPF_TCP_INFO_NOTIFY. The return status is used
by the caller to queue up a tcp_info notification for the application.

Signed-off-by: Sowmini Varadhan 
---
 include/net/tcp.h|   15 +--
 include/uapi/linux/bpf.h |4 
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 0d29292..df06a9f 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -47,6 +47,7 @@
 #include 
 #include 
 #include 
+#include 
 
 extern struct inet_hashinfo tcp_hashinfo;
 
@@ -2065,6 +2066,12 @@ struct tcp_ulp_ops {
__MODULE_INFO(alias, alias_userspace, name);\
__MODULE_INFO(alias, alias_tcp_ulp, "tcp-ulp-" name)
 
+#define TCPDIAG_CB(sk)  \
+do {   \
+   if (unlikely(sk->sk_net_refcnt && sock_diag_has_listeners(sk))) \
+   sock_diag_broadcast(sk);\
+} while (0)
+
 /* Call BPF_SOCK_OPS program that returns an int. If the return value
  * is < 0, then the BPF op failed (for example if the loaded BPF
  * program does not support the chosen operation or there is no BPF
@@ -2088,9 +2095,13 @@ static inline int tcp_call_bpf(struct sock *sk, int op, 
u32 nargs, u32 *args)
memcpy(sock_ops.args, args, nargs * sizeof(*args));
 
ret = BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops);
-   if (ret == 0)
+   if (ret == 0) {
ret = sock_ops.reply;
-   else
+
+   /* XXX would be nice if we could use replylong[1] here */
+   if (ret == BPF_TCP_INFO_NOTIFY)
+   TCPDIAG_CB(sk);
+   } else
ret = -1;
return ret;
 }
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index aa5ccd2..bc45e5e 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2678,6 +2678,10 @@ enum {
BPF_TCP_MAX_STATES  /* Leave at the end! */
 };
 
+enum {
+   BPF_TCP_INFO_NOTIFY = 2
+};
+
 #define TCP_BPF_IW 1001/* Set TCP initial congestion window */
 #define TCP_BPF_SNDCWND_CLAMP  1002/* Set sndcwnd_clamp */
 
-- 
1.7.1



Re: [PATCH net-next 0/9] net: Kernel side filtering for route dumps

2018-10-11 Thread Sowmini Varadhan


Without getting into Ahern's patchset, which he obviously feels 
quite passionately about..

On (10/11/18 12:28), David Miller wrote:
> 
> Once you've composed the message, the whole point of filtering is lost.

it would be nice to apply the filter *before* constructing the skb,
but afaict most things in BPF today only operate on sk_buffs. How should
we use eBPF on something other than an sk_buff?

--Sowmini




Re: [PATCH net-next 0/9] net: Kernel side filtering for route dumps

2018-10-11 Thread Sowmini Varadhan
On (10/11/18 09:33), Roopa Prabhu wrote:
> 3. All networking subsystems already have this type of netlink
> attribute filtering that apps rely on. This series
> just makes it consistent for route dumps. Apps use such mechanism
> already when requesting dumps.
> Like everywhere else, BPF hook can be an alternate parallel mechanism.

sure, and that makes sense, though I hope we will explore those
alternate mechanisms too.

--Sowmini



Re: [PATCH net-next 0/9] net: Kernel side filtering for route dumps

2018-10-11 Thread Sowmini Varadhan
On (10/11/18 09:32), David Ahern wrote:
> 
> Route dumps are done for the entire FIB for each address family. As we
> approach internet routing tables (700k+ routes for IPv4, currently
> around 55k for IPv6) with many VRFs dumping the entire table is grossly
> inefficient when for example only a single VRF table is wanted.

I think someone mentioned a long time ago that a VRF is not an
interface/driver/net_device but rather a separate routing table with a
dedicated set of interfaces, iirc :-) :-)

In the latter model, if you wanted to dump a VRF table, you'd only
lock that table and walk it, instead of holding up other VRFs.

sorry, could not resist my i-told-you-so moment :-P

--Sowmini


Re: [PATCH net-next 0/9] net: Kernel side filtering for route dumps

2018-10-11 Thread Sowmini Varadhan
On (10/11/18 08:26), Stephen Hemminger wrote:
> You can do the something like this already with BPF socket filters.
> But writing BPF for multi-part messages is hard.

Indeed. And I was just experimenting with this for ARP last week.
So to handle the case of "ip neigh show a.b.c.d" without walking through
the entire ARP table and filtering in userspace, you could add a sk_filter()
hook like this:

@@ -2258,6 +2260,24 @@ static int neigh_fill_info(struct sk_buff *skb, struct ne
goto nla_put_failure;
 
nlmsg_end(skb, nlh);
+
+   /* XXX skb->sk can be null in the neigh_timer_handler->__neigh_notify 
+* path. Revisit..
+*/
+   if (!skb->sk)
+   return 0;
+
+   /* pull/push skb->data pointers so that sk_filter only sees the
+* most recent nlh that was just added.
+*/
+   len = skb->len - nlh->nlmsg_len;
+   skb_pull(skb, len);
+   ret = sk_filter(skb->sk, skb);
+   skb_push(skb, len);
+   if (ret)
+   nlmsg_cancel(skb, nlh);
return 0;
 
Writing the cBPF filter is not horrible, due to the nla extension. E.g.,
to pass a filter that matches on if_index and ipv4 address, the bpf_asm
src below will do the job. The benefit of using cBPF is that we can
use this on older kernels as well:

/*
 * Generated from the bpf_asm src
 * ldb [20] ; len(nlmsghdr) + offsetof(ndm_ifindex)
 * jne sll->sll_ifindex, skip
 * ld #28   ; A <- len(nlmsghdr) + len(ndmsg), payload offset
 * ldx #1   ; X <-  NDA_DST
 * ld #nla  ; A <- offset(NDA_DST)
 * jeq #0, skip
 * tax
 * ld [x + 4]   ; A <- value(NDA_DST)
 * jneq htonl(addr), skip
 * ret #-1
 * skip: ret #0
 */
struct sock_filter bpf_filter[] = {
{ 0x30,  0,  0, 0x0014 },
{ 0x15,  0,  1, sll->sll_ifindex },
{ ,  0,  0, 0x001c },
{ 0x01,  0,  0, 0x0001 },
{ 0x20,  0,  0, 0xf00c },
{ 0x15,  4,  0, 00 },
{ 0x07,  0,  0, 00 },
{ 0x40,  0,  0, 0x0004 },
{ 0x15,  0,  1, htonl(addr) },
{ 0x06,  0,  0, 0x },
{ 0x06,  0,  0, 00 },
{ 0x06,  0,  0, 0x },
{ 0x06,  0,  0, 00 },
};
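
Attaching that classic filter to the netlink socket is then just the usual
SO_ATTACH_FILTER step (sketch below; nlfd is assumed to be the already-open
NETLINK_ROUTE socket used for the dump request):

#include <sys/socket.h>
#include <linux/filter.h>

static int attach_neigh_filter(int nlfd, struct sock_filter *insns,
                               unsigned short count)
{
        struct sock_fprog prog = {
                .len = count,
                .filter = insns,
        };

        /* the kernel-side sk_filter() hook sketched above runs this
         * program against each netlink message before it is queued
         */
        return setsockopt(nlfd, SOL_SOCKET, SO_ATTACH_FILTER,
                          &prog, sizeof(prog));
}

/* e.g.: attach_neigh_filter(fd, bpf_filter,
 *                           sizeof(bpf_filter) / sizeof(bpf_filter[0]));
 */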


Re: [PATCH net-next 5/5] ebpf: Add sample ebpf program for SOCKET_SG_FILTER

2018-09-17 Thread Sowmini Varadhan
On (09/17/18 16:15), Alexei Starovoitov wrote:
> 
> if the goal is to add firewall ability to RDS then the patch set
> is going in the wrong direction.

The goal is to add the ability to process scatterlist directly,
just like we process skb's today.

Your main objection was that you wanted a test case in selftests
that was aligned with existing tests; Tushar is working on that
patchset. Why don't we wait for that patchset before continuing
this discussion further?

> May be the right answer is to teach rds to behave like the rest of protocols.
> Then all existing tooling and features will 'just work' ?

RDS does not need to be taught anything :-) I think KCM is modeled
on the RDS stack model. Before we "teach" rds anything, "we" need
to understand what RDS does first - google can provide a lot of slide-decks
that explain the rds stack to you; I suggest you look at those first.

Meanwhile, how about waiting for Tushar's next patchset, where
you will have your selftests that are based on veth/netns
just like existing tests for XDP, vxlan etc. I strongly suggest
waiting for that.

And btw, it would have been very useful/courteous to help with 
the RFC reviews to start with.

--Sowmini



Re: [PATCH net-next 5/5] ebpf: Add sample ebpf program for SOCKET_SG_FILTER

2018-09-13 Thread Sowmini Varadhan
On (09/12/18 19:07), Alexei Starovoitov wrote:
> 
> I didn't know that. The way I understand your statement that
> this new program type, new sg logic, and all the complexity
> are only applicable to RDMA capable hw and RDS.

I don't know if you have been following the RFC series at all
(and DanielB/JohnF feedback to it) but that is not what the patch
set is about.

To repeat a summary of the original problem statement:

RDS (hardly a "niche" driver, let's please not get carried away
with strong assertions based on incomplete understanding)
is an example of a driver that happens to pass up packets
as both scatterlists and sk_buffs to the ULPs.

The scatterlists come from IB, the sk_buffs come from the ethernet
drivers. At the moment, the only way to build firewalls for
this is to convert the scatterlist to an skb and use either netfilter
or eBPF on the skb. What Tushar is adding is support to use eBPF
on the scatterlist itself, so that you don't have to do this
inefficient scatterlist->skb conversion.

> In such case, I think, such BPF extensions do not belong
> in the upstream kernel. I don't want BPF to support niche technologies,

> since the maintenance overhead makes it prohibitive long term.

After I sent the message, I noticed that selftests/bpf has
some tests using veth/netns. I think Tushar should be able to write
tests for the rds-tcp path (and thus test the eBPF infrastructure, 
which seems to be what you are worried about)

Does that address your concern?

--Sowmini


Re: [PATCH net-next 5/5] ebpf: Add sample ebpf program for SOCKET_SG_FILTER

2018-09-12 Thread Sowmini Varadhan
> On 09/11/2018 09:00 PM, Alexei Starovoitov wrote:
> >please no samples.
> >Add this as proper test to tools/testing/selftests/bpf
> >that reports PASS/FAIL and can be run automatically.
> >samples/bpf is effectively dead code.

Just a second.

You do realize that RDS is doing real networking, so it needs
RDMA capable hardware to test the rds_rdma paths? Also, when we
"talk to ourselves" we default to the rds_loop transport, so
we would even bypass the rds-tcp module.

I don't think this can be tested with some academic "test it
over lo0" exercise. I suppose you can add example code in
selftests for this, but asking for a "proper test" may be
a little unrealistic here - a proper test needs proper hardware
in this case.

--Sowmini



Re: [Patch net] rds: mark bound socket with SOCK_RCU_FREE

2018-09-10 Thread Sowmini Varadhan
On (09/10/18 17:16), Cong Wang wrote:
> >
> > On (09/10/18 16:51), Cong Wang wrote:
> > >
> > > __rds_create_bind_key(key, addr, port, scope_id);
> > > -   rs = rhashtable_lookup_fast(&bind_hash_table, key, ht_parms);
> > > +   rcu_read_lock();
> > > +   rs = rhashtable_lookup(&bind_hash_table, key, ht_parms);
> > > if (rs && !sock_flag(rds_rs_to_sk(rs), SOCK_DEAD))
> > > rds_sock_addref(rs);
> > > else
> > > rs = NULL;
> > > +   rcu_read_unlock();
> >
> > aiui, the rcu_read lock/unlock is only useful if the write
> > side doing destructive operations  does something to make sure readers
> > are done before doing the destructive opertion. AFAIK, that does
> > not exist for rds socket management today
> 
> That is exactly why we need it here, right?

Maybe I am confused, what exactly is the patch you are proposing?

Does it have the SOCK_RCU_FREE change? 
Does it have the rcu_read_lock you have above? 
Where is the call_rcu?

> Hmm, so you are saying synchronize_rcu() is kinda more correct
> than call_rcu()??  


I'm not saying that, I'm asking "what exactly is the patch
you are proposing?" The only one on record is 
   http://patchwork.ozlabs.org/patch/968282/
which does not have either synchronize_rcu or call_rcu.

> I never hear this before, would like to know why.

Please post precise patches first.

Thanks.



Re: [Patch net] rds: mark bound socket with SOCK_RCU_FREE

2018-09-10 Thread Sowmini Varadhan
On (09/10/18 16:51), Cong Wang wrote:
> 
> __rds_create_bind_key(key, addr, port, scope_id);
> -   rs = rhashtable_lookup_fast(&bind_hash_table, key, ht_parms);
> +   rcu_read_lock();
> +   rs = rhashtable_lookup(&bind_hash_table, key, ht_parms);
> if (rs && !sock_flag(rds_rs_to_sk(rs), SOCK_DEAD))
> rds_sock_addref(rs);
> else
> rs = NULL;
> +   rcu_read_unlock();

aiui, the rcu_read lock/unlock is only useful if the write
side doing destructive operations does something to make sure readers
are done before doing the destructive operation. AFAIK, that does
not exist for rds socket management today.


> Although sock release path is not a very hot path, but blocking
> it isn't a good idea either, especially when you can use call_rcu(),
> which has the same effect.
> 
> I don't see any reason we should prefer synchronize_rcu() here.

Usually correctness (making sure all readers are done before nuking a
data structure) is a little bit more important than performance, aka
"safety before speed" is what I've always been taught.

Clearly, your mileage varies. As you please.

--Sowmini




Re: [Patch net] rds: mark bound socket with SOCK_RCU_FREE

2018-09-10 Thread Sowmini Varadhan
On (09/10/18 15:43), Santosh Shilimkar wrote:
> On 9/10/2018 3:24 PM, Cong Wang wrote:
> >When a rds sock is bound, it is inserted into the bind_hash_table
> >which is protected by RCU. But when releasing rd sock, after it
> >is removed from this hash table, it is freed immediately without
> >respecting RCU grace period. This could cause some use-after-free
> >as reported by syzbot.
> >
> Indeed.
> 
> >Mark the rds sock as SOCK_RCU_FREE before inserting it into the
> >bind_hash_table, so that it would be always freed after a RCU grace
> >period.

So I'm not sure I understand. 

Yes, Cong's fix may eliminate *some* of the syzbot failures, but the
basic problem is not solved.

To take one example of possible races (one that was discussed in
  https://www.spinics.net/lists/netdev/msg475074.html)
rds_recv_incoming->rds_find_bound is being called in rds_send_worker
context and the rds_find_bound code is

 63 rs = rhashtable_lookup_fast(&bind_hash_table, &key, ht_parms);
 64 if (rs && !sock_flag(rds_rs_to_sk(rs), SOCK_DEAD))
 65 rds_sock_addref(rs);
 66 else 
 67 rs = NULL;
 68 

After we find an rs at line 63, how can we be sure that the entire
logic of rds_release does not execute on another cpu, and free the rs,
before we hit line 64 with the bad rs?

Normally synchronize_rcu() or synchronize_net() in rds_release() would 
ensure this. How do we ensure this with SOCK_RCU_FREE (or is the
intention to just reduce *some* of the syzbot failures)?

--Sowmini




Re: [Patch net] rds: mark bound socket with SOCK_RCU_FREE

2018-09-10 Thread Sowmini Varadhan
On (09/10/18 15:24), Cong Wang wrote:
> 
> When a rds sock is bound, it is inserted into the bind_hash_table
> which is protected by RCU. But when releasing rd sock, after it
> is removed from this hash table, it is freed immediately without
> respecting RCU grace period. This could cause some use-after-free
> as reported by syzbot.
> 

I have no objection to the change itself, but the syzbot failures
are caused for a very simple reason: we need synchronize_net()
in rds_release before we remove the rds_sock from the bind_hash_table.

I already pointed this out in 
  https://www.spinics.net/lists/netdev/msg475074.html

I think the objection to synchronize_net() is that it can cause
perf issues (I'm told that rds_release() has been known to be held
up by other threads in rcu critical sections?) but I personally
don't see any other alternative to this (other than going back
to rwlock, instead of rcu)
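
The ordering being argued for here is the usual unpublish-then-wait
pattern, roughly as in the fragment below (a kernel-side sketch of the
ordering only, not a proposed patch; rds_remove_bound() and the release
path are the existing RDS functions):

        /* in the rds_release() path, before the rds_sock can be freed */
        rds_remove_bound(rs);   /* unpublish from bind_hash_table */
        synchronize_rcu();      /* readers in rds_find_bound() that may
                                 * already hold a pointer to rs are done
                                 */
        /* ... only now is it safe to free the rds_sock ... */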

--Sowmini



Re: [PATCH RFC net-next 00/11] udp gso

2018-09-03 Thread Sowmini Varadhan
On (09/03/18 10:02), Steffen Klassert wrote:
> I'm working on patches that builds such skb lists. The list is chained
> at the frag_list pointer of the first skb, all subsequent skbs are linked
> to the next pointer of the skb. It looks like this:

there are some risks to using the frag_list pointer; Alex Duyck
had pointed this out to me in
  https://www.mail-archive.com/netdev@vger.kernel.org/msg131081.html
(see the last paragraph, or search for the string "gotcha" there)

I don't know the details of your playground patch, but you might want to
watch out for those to make sure it is immune to these issues.

--Sowmini





[PATCH V2 ipsec-next 2/2] xfrm: reset crypto_done when iterating over multiple input xfrms

2018-09-03 Thread Sowmini Varadhan
We only support one offloaded xfrm (we do not have devices that
can handle more than one offload), so reset crypto_done when
iterating over multiple transforms in xfrm_input(), so that we can
invoke the appropriate x->type->input for the non-offloaded
transforms.

Fixes: d77e38e612a0 ("xfrm: Add an IPsec hardware offloading API")

Signed-off-by: Sowmini Varadhan 
---
v2: added "Fixes" tag

 net/xfrm/xfrm_input.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c
index b89c9c7..be3520e 100644
--- a/net/xfrm/xfrm_input.c
+++ b/net/xfrm/xfrm_input.c
@@ -458,6 +458,7 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 
spi, int encap_type)
XFRM_INC_STATS(net, LINUX_MIB_XFRMINHDRERROR);
goto drop;
}
+   crypto_done = false;
} while (!err);
 
err = xfrm_rcv_cb(skb, family, x->type->proto, 0);
-- 
1.7.1



[PATCH V2 ipsec-next 0/2] xfrm: bug fixes when processing multiple transforms

2018-09-03 Thread Sowmini Varadhan
This series contains bug fixes for problems that were encountered when I
set up a libreswan tunnel using the config below, which will set up
an IPsec policy involving 2 tmpls.

type=transport
compress=yes
esp=aes_gcm_c-128-null # offloaded to Niantic
auto=start

The non-offload test case uses  esp=aes_gcm_c-256-null.

Each patch has a technical description of the contents of the fix.

V2: added Fixes tag so that it can be backported to the stable trees.

Sowmini Varadhan (2):
  xfrm: reset transport header back to network header after all input
transforms have been applied
  xfrm: reset crypto_done when iterating over multiple input xfrms

 net/ipv4/xfrm4_input.c  |1 +
 net/ipv4/xfrm4_mode_transport.c |4 +---
 net/ipv6/xfrm6_input.c  |1 +
 net/ipv6/xfrm6_mode_transport.c |4 +---
 net/xfrm/xfrm_input.c   |1 +
 5 files changed, 5 insertions(+), 6 deletions(-)



[PATCH V2 ipsec-next 1/2] xfrm: reset transport header back to network header after all input transforms have been applied

2018-09-03 Thread Sowmini Varadhan
A policy may have been set up with multiple transforms (e.g., ESP
and ipcomp). In this situation, the ingress IPsec processing
iterates in xfrm_input() and applies each transform in turn,
processing the nexthdr to find any additional xfrm that may apply.

This patch resets the transport header back to network header
only after the last transformation so that subsequent xfrms
can find the correct transport header.

Fixes: 7785bba299a8 ("esp: Add a software GRO codepath")
Suggested-by: Steffen Klassert 
Signed-off-by: Sowmini Varadhan 
---
v2: added "Fixes" tag

 net/ipv4/xfrm4_input.c  |1 +
 net/ipv4/xfrm4_mode_transport.c |4 +---
 net/ipv6/xfrm6_input.c  |1 +
 net/ipv6/xfrm6_mode_transport.c |4 +---
 4 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/xfrm4_input.c b/net/ipv4/xfrm4_input.c
index bcfc00e..f8de248 100644
--- a/net/ipv4/xfrm4_input.c
+++ b/net/ipv4/xfrm4_input.c
@@ -67,6 +67,7 @@ int xfrm4_transport_finish(struct sk_buff *skb, int async)
 
if (xo && (xo->flags & XFRM_GRO)) {
skb_mac_header_rebuild(skb);
+   skb_reset_transport_header(skb);
return 0;
}
 
diff --git a/net/ipv4/xfrm4_mode_transport.c b/net/ipv4/xfrm4_mode_transport.c
index 3d36644..1ad2c2c 100644
--- a/net/ipv4/xfrm4_mode_transport.c
+++ b/net/ipv4/xfrm4_mode_transport.c
@@ -46,7 +46,6 @@ static int xfrm4_transport_output(struct xfrm_state *x, 
struct sk_buff *skb)
 static int xfrm4_transport_input(struct xfrm_state *x, struct sk_buff *skb)
 {
int ihl = skb->data - skb_transport_header(skb);
-   struct xfrm_offload *xo = xfrm_offload(skb);
 
if (skb->transport_header != skb->network_header) {
memmove(skb_transport_header(skb),
@@ -54,8 +53,7 @@ static int xfrm4_transport_input(struct xfrm_state *x, struct 
sk_buff *skb)
skb->network_header = skb->transport_header;
}
ip_hdr(skb)->tot_len = htons(skb->len + ihl);
-   if (!xo || !(xo->flags & XFRM_GRO))
-   skb_reset_transport_header(skb);
+   skb_reset_transport_header(skb);
return 0;
 }
 
diff --git a/net/ipv6/xfrm6_input.c b/net/ipv6/xfrm6_input.c
index 841f4a0..9ef490d 100644
--- a/net/ipv6/xfrm6_input.c
+++ b/net/ipv6/xfrm6_input.c
@@ -59,6 +59,7 @@ int xfrm6_transport_finish(struct sk_buff *skb, int async)
 
if (xo && (xo->flags & XFRM_GRO)) {
skb_mac_header_rebuild(skb);
+   skb_reset_transport_header(skb);
return -1;
}
 
diff --git a/net/ipv6/xfrm6_mode_transport.c b/net/ipv6/xfrm6_mode_transport.c
index 9ad07a9..3c29da5 100644
--- a/net/ipv6/xfrm6_mode_transport.c
+++ b/net/ipv6/xfrm6_mode_transport.c
@@ -51,7 +51,6 @@ static int xfrm6_transport_output(struct xfrm_state *x, 
struct sk_buff *skb)
 static int xfrm6_transport_input(struct xfrm_state *x, struct sk_buff *skb)
 {
int ihl = skb->data - skb_transport_header(skb);
-   struct xfrm_offload *xo = xfrm_offload(skb);
 
if (skb->transport_header != skb->network_header) {
memmove(skb_transport_header(skb),
@@ -60,8 +59,7 @@ static int xfrm6_transport_input(struct xfrm_state *x, struct 
sk_buff *skb)
}
ipv6_hdr(skb)->payload_len = htons(skb->len + ihl -
   sizeof(struct ipv6hdr));
-   if (!xo || !(xo->flags & XFRM_GRO))
-   skb_reset_transport_header(skb);
+   skb_reset_transport_header(skb);
return 0;
 }
 
-- 
1.7.1



[PATCH ipsec-next 2/2] xfrm: reset crypto_done when iterating over multiple input xfrms

2018-09-02 Thread Sowmini Varadhan
We only support one offloaded xfrm (we do not have devices that
can handle more than one offload), so reset crypto_done in
xfrm_input() when iterating over multiple transforms in xfrm_input, 
so that we can invoke the appropriate x->type->input for the 
non-offloaded transforms

Signed-off-by: Sowmini Varadhan 
---
 net/xfrm/xfrm_input.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c
index b89c9c7..be3520e 100644
--- a/net/xfrm/xfrm_input.c
+++ b/net/xfrm/xfrm_input.c
@@ -458,6 +458,7 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 
spi, int encap_type)
XFRM_INC_STATS(net, LINUX_MIB_XFRMINHDRERROR);
goto drop;
}
+   crypto_done = false;
} while (!err);
 
err = xfrm_rcv_cb(skb, family, x->type->proto, 0);
-- 
1.7.1



[PATCH ipsec-next 1/2] xfrm: reset transport header back to network header after all input transforms have been applied

2018-09-02 Thread Sowmini Varadhan
A policy may have been set up with multiple transforms (e.g., ESP
and ipcomp). In this situation, the ingress IPsec processing
iterates in xfrm_input() and applies each transform in turn,
processing the nexthdr to find any additional xfrm that may apply.

This patch resets the transport header back to network header
only after the last transformation so that subsequent xfrms
can find the correct transport header.

Suggested-by: Steffen Klassert 
Signed-off-by: Sowmini Varadhan 
---
 net/ipv4/xfrm4_input.c  |1 +
 net/ipv4/xfrm4_mode_transport.c |4 +---
 net/ipv6/xfrm6_input.c  |1 +
 net/ipv6/xfrm6_mode_transport.c |4 +---
 4 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/xfrm4_input.c b/net/ipv4/xfrm4_input.c
index bcfc00e..f8de248 100644
--- a/net/ipv4/xfrm4_input.c
+++ b/net/ipv4/xfrm4_input.c
@@ -67,6 +67,7 @@ int xfrm4_transport_finish(struct sk_buff *skb, int async)
 
if (xo && (xo->flags & XFRM_GRO)) {
skb_mac_header_rebuild(skb);
+   skb_reset_transport_header(skb);
return 0;
}
 
diff --git a/net/ipv4/xfrm4_mode_transport.c b/net/ipv4/xfrm4_mode_transport.c
index 3d36644..1ad2c2c 100644
--- a/net/ipv4/xfrm4_mode_transport.c
+++ b/net/ipv4/xfrm4_mode_transport.c
@@ -46,7 +46,6 @@ static int xfrm4_transport_output(struct xfrm_state *x, 
struct sk_buff *skb)
 static int xfrm4_transport_input(struct xfrm_state *x, struct sk_buff *skb)
 {
int ihl = skb->data - skb_transport_header(skb);
-   struct xfrm_offload *xo = xfrm_offload(skb);
 
if (skb->transport_header != skb->network_header) {
memmove(skb_transport_header(skb),
@@ -54,8 +53,7 @@ static int xfrm4_transport_input(struct xfrm_state *x, struct 
sk_buff *skb)
skb->network_header = skb->transport_header;
}
ip_hdr(skb)->tot_len = htons(skb->len + ihl);
-   if (!xo || !(xo->flags & XFRM_GRO))
-   skb_reset_transport_header(skb);
+   skb_reset_transport_header(skb);
return 0;
 }
 
diff --git a/net/ipv6/xfrm6_input.c b/net/ipv6/xfrm6_input.c
index 841f4a0..9ef490d 100644
--- a/net/ipv6/xfrm6_input.c
+++ b/net/ipv6/xfrm6_input.c
@@ -59,6 +59,7 @@ int xfrm6_transport_finish(struct sk_buff *skb, int async)
 
if (xo && (xo->flags & XFRM_GRO)) {
skb_mac_header_rebuild(skb);
+   skb_reset_transport_header(skb);
return -1;
}
 
diff --git a/net/ipv6/xfrm6_mode_transport.c b/net/ipv6/xfrm6_mode_transport.c
index 9ad07a9..3c29da5 100644
--- a/net/ipv6/xfrm6_mode_transport.c
+++ b/net/ipv6/xfrm6_mode_transport.c
@@ -51,7 +51,6 @@ static int xfrm6_transport_output(struct xfrm_state *x, 
struct sk_buff *skb)
 static int xfrm6_transport_input(struct xfrm_state *x, struct sk_buff *skb)
 {
int ihl = skb->data - skb_transport_header(skb);
-   struct xfrm_offload *xo = xfrm_offload(skb);
 
if (skb->transport_header != skb->network_header) {
memmove(skb_transport_header(skb),
@@ -60,8 +59,7 @@ static int xfrm6_transport_input(struct xfrm_state *x, struct 
sk_buff *skb)
}
ipv6_hdr(skb)->payload_len = htons(skb->len + ihl -
   sizeof(struct ipv6hdr));
-   if (!xo || !(xo->flags & XFRM_GRO))
-   skb_reset_transport_header(skb);
+   skb_reset_transport_header(skb);
return 0;
 }
 
-- 
1.7.1



[PATCH ipsec-next 0/2] xfrm: bug fixes when processing multiple transforms

2018-09-02 Thread Sowmini Varadhan
This series contains bug fixes for problems that were encountered when I
set up a libreswan tunnel using the config below, which will set up
an IPsec policy involving 2 tmpls.

type=transport
compress=yes
esp=aes_gcm_c-128-null # offloaded to Niantic
auto=start

The non-offload test case uses  esp=aes_gcm_c-256-null.

Each patch has a technical description of the contents of the fix.

Sowmini Varadhan (2):
  xfrm: reset transport header back to network header after all input
transforms have been applied
  xfrm: reset crypto_done when iterating over multiple input xfrms

 net/ipv4/xfrm4_input.c  |1 +
 net/ipv4/xfrm4_mode_transport.c |4 +---
 net/ipv6/xfrm6_input.c  |1 +
 net/ipv6/xfrm6_mode_transport.c |4 +---
 net/xfrm/xfrm_input.c   |1 +
 5 files changed, 5 insertions(+), 6 deletions(-)



Re: [PATCH] rds: tcp: remove duplicated include from tcp.c

2018-08-21 Thread Sowmini Varadhan
On (08/21/18 14:05), Yue Haibing wrote:
> Remove duplicated include.
> 
> Signed-off-by: Yue Haibing 

Acked-by: Sowmini Varadhan 



Re: [PATCH net-next] rds: avoid lock hierarchy violation between m_rs_lock and rs_recv_lock

2018-08-08 Thread Sowmini Varadhan
On (08/08/18 14:51), Santosh Shilimkar wrote:
> This bug doesn't make sense since two different transports are using
> same socket (Loop and rds_tcp) and running together.
> For same transport, such race can't happen with MSG_ON_SOCK flag.
> CPU1-> rds_loop_inc_free
> CPU0 -> rds_tcp_cork ...
> 

The test is just reporting a lock hierarchy violation.

As far as I can tell, this wasn't an actual deadlock that happened
because, as you point out, a socket has either the rds_tcp transport
or the rds_loop transport, so this particular pair of stack traces
would not happen with the code as it exists today.

But there is a valid lock hierarchy violation here, and
imho it's a good idea to get that cleaned up.

It also avoids needlessly holding the rs_recv_lock
when doing an rds_inc_put.

--Sowmini



[PATCH net-next] rds: avoid lock hierarchy violation between m_rs_lock and rs_recv_lock

2018-08-08 Thread Sowmini Varadhan
The following deadlock, reported by syzbot, can occur if CPU0 is in
rds_send_remove_from_sock() while CPU1 is in rds_clear_recv_queue()

       CPU0                            CPU1
       ----                            ----
  lock(&(&rm->m_rs_lock)->rlock);
                                       lock(&rs->rs_recv_lock);
                                       lock(&(&rm->m_rs_lock)->rlock);
  lock(&rs->rs_recv_lock);

The deadlock should be avoided by moving the messages from the
rs_recv_queue into a tmp_list in rds_clear_recv_queue() under
the rs_recv_lock, and then dropping the refcnt on the messages
in the tmp_list (potentially resulting in rds_message_purge())
after dropping the rs_recv_lock.

The same lock hierarchy violation also exists in rds_still_queued()
and should be avoided in a similar manner

Signed-off-by: Sowmini Varadhan 
Reported-by: syzbot+52140d69ac6dc6b92...@syzkaller.appspotmail.com
---
 net/rds/recv.c |   11 +--
 1 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/net/rds/recv.c b/net/rds/recv.c
index 504cd6b..1cf7072 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -429,6 +429,7 @@ static int rds_still_queued(struct rds_sock *rs, struct 
rds_incoming *inc,
struct sock *sk = rds_rs_to_sk(rs);
int ret = 0;
unsigned long flags;
+   bool drop_ref = false;
 
write_lock_irqsave(&rs->rs_recv_lock, flags);
if (!list_empty(&inc->i_item)) {
@@ -439,11 +440,13 @@ static int rds_still_queued(struct rds_sock *rs, struct 
rds_incoming *inc,
  -be32_to_cpu(inc->i_hdr.h_len),
  inc->i_hdr.h_dport);
list_del_init(&inc->i_item);
-   rds_inc_put(inc);
+   drop_ref = true;
}
}
write_unlock_irqrestore(&rs->rs_recv_lock, flags);
 
+   if (drop_ref)
+   rds_inc_put(inc);
rdsdebug("inc %p rs %p still %d dropped %d\n", inc, rs, ret, drop);
return ret;
 }
@@ -751,16 +754,20 @@ void rds_clear_recv_queue(struct rds_sock *rs)
struct sock *sk = rds_rs_to_sk(rs);
struct rds_incoming *inc, *tmp;
unsigned long flags;
+   LIST_HEAD(tmp_list);
 
write_lock_irqsave(>rs_recv_lock, flags);
list_for_each_entry_safe(inc, tmp, >rs_recv_queue, i_item) {
rds_recv_rcvbuf_delta(rs, sk, inc->i_conn->c_lcong,
  -be32_to_cpu(inc->i_hdr.h_len),
  inc->i_hdr.h_dport);
+   list_move_tail(&inc->i_item, &tmp_list);
+   }
+   write_unlock_irqrestore(&rs->rs_recv_lock, flags);
+   list_for_each_entry_safe(inc, tmp, &tmp_list, i_item) {
list_del_init(&inc->i_item);
rds_inc_put(inc);
}
-   write_unlock_irqrestore(&rs->rs_recv_lock, flags);
 }
 
 /*
-- 
1.7.1



Re: [PATCH] rds: send: Fix dead code in rds_sendmsg

2018-07-25 Thread Sowmini Varadhan
On (07/25/18 10:22), Gustavo A. R. Silva wrote:
> Currently, code at label *out* is unreachable. Fix this by updating
> variable *ret* with -EINVAL, so the jump to *out* can be properly
> executed instead of directly returning from function.
> 
> Addresses-Coverity-ID: 1472059 ("Structurally dead code")
> Fixes: 1e2b44e78eea ("rds: Enable RDS IPv6 support")
> Signed-off-by: Gustavo A. R. Silva 

Acked-by: Sowmini Varadhan 


Re: [PATCH v3 net-next 0/3] rds: IPv6 support

2018-07-18 Thread Sowmini Varadhan
On (07/18/18 15:19), Ka-Cheong Poon wrote:
> >bind() and connect() are using the sa_family/ss_family to have
> >the application signal to the kernel about whether ipv4 or ipv6 is
> >desired. (and bind and connect are doing the right thing for
> >v4mapped, so that doesnt seem to be a problem there)
> >
> >In this case you want the application to signal that info via
> >the optlen.  (And the reason for this inconsistency is that you dont
> >want to deal with the user->kernel copy in the same way?)
> 
> 
> Because doing that can break existing RDS apps.  Existing code
> does not check the address family in processing this socket
> option.  It only cares about the address and port.  If the new

I'll leave this up to DaveM. Existing code only handles IPv4;
everywhere else, we always check the sa_family or ss_family
first and verify the length afterward. This was DaveM's original
point about bind/connect/sendmsg. I don't know why rds sockopts have
to be special.

> code suddenly checks that and use it to decide on the address
> family, working app will break.  As I mentioned before, this
> patch set does not change existing behavior.  And doing what
> you mentioned will change existing behavior and break apps.

thank you.

--Sowmini


Re: [PATCH v3 net-next 0/3] rds: IPv6 support

2018-07-17 Thread Sowmini Varadhan
On (07/17/18 13:32), Ka-Cheong Poon wrote:
> 
> The app can use either structures to make the call.  When the
> app fills in the structure, it knows what it is filling in,
> either sockaddr_in or sockaddr_in6.  So it knows the right size
> to use.  The app can also use IPv4 mapped address in a sockaddr_in6
> without a problem.

typical applications that I have seen in routing applications
will use a union like
  union {
    struct sockaddr_in sin;
    struct sockaddr_in6 sin6;
  }
Or they will use sockaddr_storage. Passing down the sizeof that structure
will do the wrong thing in the existing code for ipv4 (even
though it will not generate EFAULT).

> Could you please explain the inconsistency?  An app can use IPv4
> mapped address in a sockaddr_in6 to operate on an IPv4 connection,
> in case you are thinking of this new addition in v3 of the patch.

bind() and connect() are using the sa_family/ss_family to have
the application signal to the kernel about whether ipv4 or ipv6 is 
desired. (and bind and connect are doing the right thing for
v4mapped, so that doesnt seem to be a problem there)

In this case you want the application to signal that info via
the optlen.  (And the reason for this inconsistency is that you dont 
want to deal with the user->kernel copy in the same way?)

--Sowmini



Re: [PATCH v3 net-next 0/3] rds: IPv6 support

2018-07-16 Thread Sowmini Varadhan


-  Looks like rds_connect() is checking things in the right order (thanks)
   However, rds_cancel_sent_to is still looking at the len to figure
   out the family. As we move to ipv6, it would be better if we allow
   the caller to specify struct sockaddr_storage, or even a union of
   sockaddr_in/sockaddr_in6, rather than require them to hint at which
   one of ipv4/ipv6 through the optlen.

   Please see __sys_connect and move_addr_to_kernel if the user-kernel
   copy is the reason you are not doing this. Similar to inet_dgram_connect
   you can then check the sa_family and use that to figure out the
   "Assume IPv4" etc stuff.

   This would also make the CANCEL_SEND_TO API consistent with the bind/
   connect etc semantics.
   
-  net/rds/rds.h: thanks for moving RDS_CM_PORT to the rdma specific file.

   I am guessing (?) that you want to update the comment to talk about
   the non-existent "RDS over UDP" based on the title of the IANA registration?
   I would just like to re-iterate that this is actually inaccurate
   (and confusing to someone looking at this for the first time, since
   there is no RDS-over-UDP today). If it were up to me, I would update
   the comment to say

/* The following ports, 16385, 18634, 18635, are registered with IANA as
 * the ports to be used for "RDS over TCP and UDP".
 * The current linux implementation supports RDS over TCP and IB, and uses
 * the ports as follows: 18634 is the historical value used for the
 * RDMA_CM listener port.  RDS/TCP uses port 16385.  After
 * IPv6 work, RDMA_CM also uses 16385 as the listener port.  18634 is kept
 * to ensure compatibility with older RDS modules.  Those ports are defined
 * in each transport's header file.

IMHO that makes the comment look a little less odd (I've already explained
to you why RDS-over-UDP does not make much practical sense for the RDS
use-cases we anticipate). YMMV.

Thanks,

--Sowmini


Re: [PATCH v2 net-next 1/3] rds: Changing IP address internal representation to struct in6_addr

2018-07-06 Thread Sowmini Varadhan
On (07/06/18 23:08), Ka-Cheong Poon wrote:
> 
> As mentioned in a previous mail, it is unclear why the
> port number is transport specific.  Most Internet services
> use the same port number running over TCP/UDP as shown
> in the IANA database.  And the IANA RDS registration is
> the same.  What is the rationale of having a transport
> specific number in the RDS implementation?

because not every transport may need a port number.

e.g., "RDS over pigeon carrier" may not need a port number.

in a different design (e.g., KCM) the listen port is configurable
with sysctl

Some may need more than one, e.g., rds_rdma, evidently.

What is the problem with using the unused 18635 for the RDS_CM_PORT?






Re: [PATCH v2 net-next 0/3] rds: IPv6 support

2018-07-06 Thread Sowmini Varadhan
On (07/06/18 22:36), Ka-Cheong Poon wrote:
> This patch series does not change existing behavior.  But
> I think this is a strange RDS semantics as it differs from
> other types of socket.  But this is not about IPv6 support
> and can be dealt with later.

sure, 

> >   Since we are choosing to treat this behavior as a feature we
> >   need to be consistent in rds_getname(). If we find
> >   (!peer and !ipv6_addr_any(rs_conn_addr)) and the socket is not yet bound,
> >   the returned address (address size, address family) should be based
> >   on the rs_conn_addr. (Otherwise if I connect to abc::1 without doing a
> >   bind(), and try to do a getsockname(), I get back a sockaddr_in?!)
> 
> 
> OK, will change that.
> >- rds_cancel_sent_to() and rds_connect and rds_bind and rds_sendmsg
> >   As DaveM has already pointed out, we should be using sa_family to figure
> >   out sockaddr_in vs sockaddr_in6 (not the other way around of inspecting
> >   len first, and then the family- that way wont work if I pass in
> >   sockaddr_storage).  E.g., see inet_dgram_connect.
> >
> > if (addr_len < sizeof(uaddr->sa_family))
> > return -EINVAL;
> OK, will change that.

thank you

> >- In net/rds/rds.h;
> >
> >   /* The following ports, 16385, 18634, 18635, are registered with IANA as
> >* the ports to be used for RDS over TCP and UDP.  18634 is the historical
> >
> >   What is "RDS over TCP and UDP"? There is no such thing as RDS-over-UDP.
> >   IN fact RDS has nothing to do with UDP. The comment is confused. See next
> >   item below, where the comment disappears.
> 
> 
> As mentioned in the above comments, they are registered with IANA.

16385 is registered for  RDS-TCP. 

what does it mean to run rds-tcp and rds_rdma with the CM at the same time? 
Is this possible?  (if it is, why not poach some other number like 2049
from NFS?!)

18635 is actually not used in the RDS code.
Why can you not use that for RDS_CM_PORT? 

In general, please do NOT pull up these definitions into net/rds/rds.h.
They may change in the future, and they really belong in the transport-
specific header file - you question this further below, see my answer
there.

> There is no RDS/UDP in Linux but the port numbers are registered
> nevertheless.  And RDS can work over UDP AFAICT.  BTW, it is really

No it cannot. RDS needs a reliable transport.  If you start using
UDP (or pigeon-carrier) for the transport you have to build the
reliability yourself.

The Oracle DB libraries either use RDS-TCP or RDS-IB or UDP (in the UDP
case they do their own congestion management in userspace, and 
history has shown that it is hard to get that right, thus the desire to
switch to rds-tcp).

Anyway, even though Andy Grover registered 18635 for "RDS-IB UDP",
that is probably the most harmless one to use for this case, because
- it is unused, 
- it calls itself rds-Infiniband,  
- the likelihood of doing rds-udp is infinitesimally small (it is more likely
  that we will need to send packets from abc::1 -> fe80:2 before we need
  rds-udp :-) :-) :-))
- and if rds-udp happens, we can use 16385/udp for rds-udp

so please use 18635 for your RDS_CM_PORT and move it to IB specific
header files and lets converge this thread.

Towing RDS_TCP_PORT around has absolutely nothing to do with your
ipv6 patch set.

> strange that RDS/TCP does not use the port number already registered.
> Anyway, the above comments are just a note on this fact in the IANA

16385 was in de facto use for a long time (since 2.6 kernels), thus I
registered it as well when I started fixing this up.

the 18634/18635 are registered for rds-iB, (caps intended) not rds-tcp.
They are available for your use.

> database.  The following is a link to the IANA assignment.

yes, I am aware

> IMHO, there should only be RDS_PORT in the first place.  And it
> can be used for all transports.  For example, if RDS/SCTP is added,
> it can also use the same port number.  There is no need to define

if/when we add rds/sctp, we shall keep that in mind. 

This is getting to be "arguing for the sake of arguing". I dont have
the time for that.

> a new port number for each transport.  What is the rationale that
> each transport needs to have its own number and be defined in its
> own header file?

Some transports may not even need a port number. Some may need
several. Some may use sysctl (suggested by Tom Herbert) to make
this configurable. Until recently, we had rds/iWARP, that may need
its own transport hooks, it does not make sense to peanut-butter
that into the core module.

That is why it has to be in the transport. I hope that answers the
question.

> If the behavior of a software component is modified/enhanced such
> that the existing API of this component has problems dealing with
> this new behavior, should the API be modified or a new API be added
> to handle that?  If neither is done, shouldn't this be considered
> a bug?

whatever. The design (parallelization for perf) is fine. Some
parts of 

Re: [PATCH v2 net-next 1/3] rds: Changing IP address internal representation to struct in6_addr

2018-07-06 Thread Sowmini Varadhan
On (07/06/18 17:08), Ka-Cheong Poon wrote:
> >Hmm. Why do you need to include tcp header in ib transport
> >code ? If there is any common function just move to core
> >common file and use it.
> 
> I think it can be removed as it is left over from earlier
> changes when the IB IPv6 listener port was RDS_TCP_PORT.
> Now all the port definitions are in rds.h.

Transport specific information such as port numbers should not
be smeared into the common rds-module.  Please fix that.

If IB is also piggybacking on port 16385 (why?), please use your
own definitions for it in IB specific header files. 

Santosh, David, I have to NACK this if it is not changed.

--Sowmini









Re: [PATCH v2 net-next 0/3] rds: IPv6 support

2018-07-05 Thread Sowmini Varadhan


Some additional comments on this patchset (consolidated here, 
please tease this apart into patch1/patch2/patch3 as appropriate)

I looked at most of rds-core, and the rds-tcp changes.
Please make sure Santosh looks at these carefully, especially:
- RDS bind key  changes
- connection.c
- all the rds_rdma related changes (e.g., the ib* and rdma* files)

- rds_getname(): one of the existing properties of PF_RDS is that a
  connect() does not involve an implicit bind(). Note that you are basing
  the changes in rds_bind() on this behavior, thus I guess we make the
  choice of treating this as a feature, not a bug.

  Since we are choosing to treat this behavior as a feature we
  need to be consistent in rds_getname(). If we find
  (!peer and !ipv6_addr_any(rs_conn_addr)) and the socket is not yet bound, 
  the returned address (address size, address family) should be based
  on the rs_conn_addr. (Otherwise if I connect to abc::1 without doing a 
  bind(), and try to do a getsockname(), I get back a sockaddr_in?!)

- rds_cancel_sent_to() and rds_connect and rds_bind and rds_sendmsg
  As DaveM has already pointed out, we should be using sa_family to figure
  out sockaddr_in vs sockaddr_in6 (not the other way around of inspecting
  len first, and then the family- that way wont work if I pass in
  sockaddr_storage).  E.g., see inet_dgram_connect.

if (addr_len < sizeof(uaddr->sa_family))
return -EINVAL;

- In net/rds/rds.h;

  /* The following ports, 16385, 18634, 18635, are registered with IANA as
   * the ports to be used for RDS over TCP and UDP.  18634 is the historical

  What is "RDS over TCP and UDP"? There is no such thing as RDS-over-UDP. 
  IN fact RDS has nothing to do with UDP. The comment is confused. See next
  item below, where the comment disappears.
  
- Also in net/rds/rds.h
  Please don't define transport-specific parameters like RDS_CM_PORT in the
  common rds.h header. It is unfortunate that we already have RDS_PORT there,
  and we should try to clean that up as well. Note that RDS_TCP_PORT
  is now in the correct transport-module-specific header (net/rds/tcp.h)
  and it's unclean to drag it from there into the common header as you are
  doing.

  In fact I just tried to move the RDS_PORT definition into 
  net/rds/rdma_transport.h and it built just-fine. So to summarize,
  please do the following:
  1. move RDS_PORT into rdma_transport.h
  2. add RDS_CM_PORT into rdma_transport.h
  3. stop dragging RDS_TCP_PORT from its current happy home in net/rds/tcp.h
 to net/rds/rds.h

- net/rds/connection.c
  As we have discussed offline before, the fact that we cannot report
  TCP seq# etc info via the existing rds-info API is not "a bug in the
  design of MPRDS" but rather a lacking in the API design. Moreover,
  much of the useful information around the TCP socket is already
  available via procfs, TCP_INFO etc, so the info by rds-info is rarely
  used for rds-tcp- the more useful information is around the RDS socket
  itself.  So there is a bug in the comment, would be nice if you removed it.

  Also, while you are there, s/exisiting/existing, please.

General comments:
-
I remain unconvinced by your global <-> link-local arguments.

For UDP sockets we can do this:

 eth0
   host1 -- host2
 abc::1/64   fe80::2
 add abc::/64 as onlink subnet route


  host1# traceroute6 -i eth0 -s abc::1 fe80::2

You just broke this for RDS and are using polemic to defend your case,
but the main thrust of your diatribe seems to be "why would you need 
this?" I'll try to address that briefly here. 

- There may be a lot of valid reasons why host2 does not want to be
  configured with a global prefix. e.g., I only want host2 to be able
  to talk to onlink hosts.

- RDS mandatorily requires sockets to be bound. So the normal src addr
  selection (that would have caused host1 to use a link-local to talk
  to host2) is suppressed in this case 

  This is exactly the same as a UDP socket bound to abc::1

  Note well, that one of the use-cases for RDS-TCP is to replace 
  existing infra that uses UDP for cluster-IPC. This has come up before
  on netdev:

  See 
https://www.mail-archive.com/search?l=netdev@vger.kernel.org=subject:%22Re%5C%3A+%5C%5BPATCH+net%5C-next+0%5C%2F6%5C%5D+kcm%5C%3A+Kernel+Connection+Multiplexor+%5C%28KCM%5C%29%22=newest=1

  so feature parity with udp is just as important as feature-parity
  for rds_rdma. 

  I hope that helps you see why we need to not break this gratuitously
  for rds-tcp.

BTW, etiquette is to cc folks who have offered review comments on the
code. Please make sure to cc me in follow-ups to this thread.

Thank you.

--Sowmini


[BISECTED] [4.17.0-rc6] IPv6 link-local address not getting added

2018-06-27 Thread Sowmini Varadhan


Hi David,

An IPv6 regression has been introduced in 4.17.0-rc6 by
  8308f3f net/ipv6: Add support for specifying metric of connected routes

The regression is that some interfaces on my test machine come
up with link-local addrs but the fe80::/64 prefix route is missing.
After this bug, I cannot send any packets to anyone onlink 
(including my routers). 

Here are the symptoms:

When everything is fine, "ip -6 route|grep eno" shows

2606:b400:400:18c8::/64 dev eno1 proto ra metric 100  pref medium
fe80::5:73ff:fea0:52d dev eno1 proto static metric 100  pref medium
fe80::/64 dev eno1 proto kernel metric 256  pref medium
fe80::/64 dev eno3 proto kernel metric 256  pref medium
fe80::/64 dev eno4 proto kernel metric 256  pref medium
default via fe80::5:73ff:fea0:52d dev eno1 proto static metric 100  pref medium

But after 8308f3f, I only find

# ip -6 route|grep eno
2606:b400:400:18c8::/64 dev eno1 proto ra metric 100  pref medium
fe80::5:73ff:fea0:52d dev eno1 proto static metric 100  pref medium
fe80::/64 dev eno1 proto kernel metric 256  pref medium
default via fe80::5:73ff:fea0:52d dev eno1 proto static metric 100  pref medium

(note that eno2 is not enabled in my config, so its absence is expected)

Please have a look, thanks.
--Sowmini


Re: [PATCH net-next 2/3] rds: Enable RDS IPv6 support

2018-06-27 Thread Sowmini Varadhan
On (06/27/18 18:07), Ka-Cheong Poon wrote:
> 
> There is a reason for that.  It is the way folks expect
> how IPv6 addresses are being used.

have you tried "traceoute6 -s abc::2 fe80::2" on linux?

> It is not just forwarding.  The simple case is that one
> picks a global address in a different link and then
> use it to send to a link local address in another link.

This is actually not any different than ipv4's strong/weak ES model.

Global addresses are supposed to be globally routable. For your
above example, if yuu do that, it is assumed that your routing
table has been set up suitably.

To state what may be well-known:
This does not work for link-locals, because, as the name
suggests, those are local to the link and you may have the same
link-local on multiple links.

> This does not work.  And the RDS connection created will
> be stuck forever.  

that is a different problem in the RDS implementation (that
it does not backoff and timeout a failing reconnect)

As you can see from the traceroute6 example, global <-> link-local 
is supported for udp (and probably also tcp sockets, I have not checked
that case)

> I don't expect RDS apps will want to use link local address
> in the first place.  In fact, most normal network apps don't.
   :
> Do you know of any IPv4 RDS app which uses IPv4 link local
> address?  In fact, IPv4 link local address is explicitly
> disallowed for active active bonding.

Are we talking about "why this ok for my particular use
of link-local, so I can slide my patch forward" or, 
"why this is correct IPv6 behavior"?

> Can you explain why DNS name resolution will return an IPv6
> link local address?  I'm surprised if it actually does.

It depends on how you set up your DNS.

It seems like this is all about "I dont want to deal with this
now", so I dont want to continue this discussion which is really
going nowhere.

Thanks

--Sowmini



Re: [rds-devel] [PATCH net-next] rds: clean up loopback rds_connections on netns deletion

2018-06-26 Thread Sowmini Varadhan
On (06/26/18 10:53), Sowmini Varadhan wrote:
> Date: Tue, 26 Jun 2018 10:53:23 -0400
> From: Sowmini Varadhan 
> To: David Miller 
> Cc: netdev@vger.kernel.org, rds-de...@oss.oracle.com
> Subject: Re: [rds-devel] [PATCH net-next] rds: clean up loopback
> 
> and just to add, the fix itself is logically correct, so belongs in
> net-next. What I dont have (and therefore did not target net) is
> official confirmation that the syzbot failures are root-caused to the
> absence of this patch (since there is no reproducer for many of these,
> and no crash dumps available from syzbot).  
> 

With help from Dmitry, I just got the confirmation from syzbot that
"syzbot has tested the proposed patch and the reproducer did not trigger 
crash:"

thus, we can mark this

Reported-and-tested-by: syzbot+4c20b3866171ce844...@syzkaller.appspotmail.com

and yes, it can target net.

Thanks
--Sowmini


Re: [PATCH net-next] rds: clean up loopback rds_connections on netns deletion

2018-06-26 Thread Sowmini Varadhan
On (06/26/18 16:48), Dmitry Vyukov wrote:
> it probably hit the race by a pure luck of the large program, but then
> never had the same luck when tried to remove any syscalls.
> So it can make sense to submit several test requests to get more testing.

How does one submit test requests by email? 

the last time I asked this question, the answer was a pointer to
https://groups.google.com/forum/#!msg/syzkaller-bugs/7ucgCkAJKSk/skZjgavRAQAJ

Thanks
--Sowmini




Re: [PATCH net-next] rds: clean up loopback rds_connections on netns deletion

2018-06-26 Thread Sowmini Varadhan
On (06/26/18 23:29), David Miller wrote:
> >> 
> >> Since this probably fixes syzbot reports, this can be targetted
> >> at 'net' instead?
> > 
> > that thought occurred to me but I wanted to be conservative and have
> > it in net-next first, have the syzkaller-bugs team confirm the
> > the fixes and then backport to earlier kernels (if needed)..
> 
> I think there is a way to ask syzbot to test a patch in an
> email.

and just to add, the fix itself is logically correct, so belongs in
net-next. What I dont have (and therefore did not target net) is
official confirmation that the syzbot failures are root-caused to the
absence of this patch (since there is no reproducer for many of these,
and no crash dumps available from syzbot).  

--Sowmini



Re: [PATCH net-next] rds: clean up loopback rds_connections on netns deletion

2018-06-26 Thread Sowmini Varadhan
On (06/26/18 23:29), David Miller wrote:
> 
> I think there is a way to ask syzbot to test a patch in an
> email.

Dmitry/syzkaller-bugs, can you clarify? 

This is for the cluster of dup reports like
 https://groups.google.com/forum/#!topic/syzkaller-bugs/zBph8Vu-q2U
and (most recently)
 https://www.spinics.net/lists/linux-rdma/msg66020.html

as I understand it, if there is no reproducer, you cannot really
have a pass/fail test to confirm the fix.

--Sowmini


Re: [PATCH net-next] rds: clean up loopback rds_connections on netns deletion

2018-06-26 Thread Sowmini Varadhan
On (06/26/18 22:23), David Miller wrote:
> 
> Since this probably fixes syzbot reports, this can be targetted
> at 'net' instead?

that thought occurred to me but I wanted to be conservative and have
it in net-next first, have the syzkaller-bugs team confirm the
fixes and then backport to earlier kernels (if needed).

--Sowmini



Re: [PATCH net-next 2/3] rds: Enable RDS IPv6 support

2018-06-26 Thread Sowmini Varadhan
On (06/26/18 21:02), Ka-Cheong Poon wrote:
> 
> In this case, RFC 6724 prefers link local address as source.

the keyword is "prefers". 

> While using non-link local address (say ULA) is not forbidden,
> doing this can easily cause inter-operability issues (does the
> app really know that the non-link local source and the link
> local destination addresses are really on the same link?).  I
> think it is prudent to disallow this in RDS unless there is a
> very clear and important reason to do so. 

I remember the issues that triggered 6724. The "interop" issue
is that when you send from Link-local to global, and need forwarding,
it may not work.

but I don't think an RDS application today expects to deal with
the case that "oh I got back an error when I tried to send to
address X on rds socket rs1, let me go and check what I am bound
to, and maybe create another socket, and bind it to link-local".

You're not doing this for IPv4 and RDS today (you don't have to do this
for UDP, afaik).

This is especially true if "X" is a hostname that got resolved using DNS.

> BTW, if it is really > needed, it can be added in future.

shrug. You are introducing a new error return.

--Sowmini



Re: [PATCH net-next 2/3] rds: Enable RDS IPv6 support

2018-06-26 Thread Sowmini Varadhan
On (06/26/18 13:30), Ka-Cheong Poon wrote:
> 
> My answer to this is that if a socket is not bound to a link
> local address (meaning it is bound to a non-link local address)
> and it is used to send to a link local peer, I think it should
> fail.

Hmm, I'm not sure I agree. I don't think this is forbidden
by RFC 6724 - yes, such a packet cannot be forwarded, but
if everything is on the same link, and the dest only has
a link-local, you should not need to (create and) bind
another socket to a link-local to talk to this destination.

>  This is consistent with the scope_id check I mentioned in
> the previous mail.  If the socket is not bound to a link local
> address, the bound_scope_id is 0.  So if the socket is used to
> send to a link local address (which has a non-zero scope_id), the
> check will catch it and fail the call.  A new conn should not
> be created in this case.


Re: [PATCH net-next 2/3] rds: Enable RDS IPv6 support

2018-06-25 Thread Sowmini Varadhan
On (06/26/18 01:43), Ka-Cheong Poon wrote:
> 
> Yes, I think if the socket is bound, it should check the scope_id
> in msg_name (if not NULL) to make sure that they match.  A bound
> RDS socket can send to multiple peers.  But if the bound local
> address is link local, it should only be allowed to send to peers
> on the same link.

agree.


> If a socket is bound, I guess the scope_id should be used.  So
> if a socket is not bound to a link local address and the socket
> is used to sent to a link local peer, it should fail.

PF_RDS sockets *MUST* always be bound.  See
Documentation/networking/rds.txt:
"   Sockets must be bound before you can send or receive data.
This is needed because binding also selects a transport and
attaches it to the socket. Once bound, the transport assignment
does not change."

Also, rds_sendmsg checks this (from net-next, your version
has the equivalent ipv6_addr_any etc check):

if (daddr == 0 || rs->rs_bound_addr == 0) {
release_sock(sk);
ret = -ENOTCONN; /* XXX not a great errno */
goto out;
}

> 
> >Also, why is there no IPv6 support in rds_connect?
> 
> 
> Oops, I missed this when I ported the internal version to the
> net-next version.  Will add it back.

Ok

--Sowmini



Re: [PATCH net-next 2/3] rds: Enable RDS IPv6 support

2018-06-25 Thread Sowmini Varadhan
On (06/25/18 03:38), Ka-Cheong Poon wrote:
> @@ -1105,8 +1105,27 @@ int rds_sendmsg(struct socket *sock, struct msghdr 
> *msg, size_t payload_len)
>   break;
>  
>   case sizeof(*sin6): {
> - ret = -EPROTONOSUPPORT;
> - goto out;
> + int addr_type;
 :
 :
> + daddr = sin6->sin6_addr;
> + dport = sin6->sin6_port;
> + scope_id = sin6->sin6_scope_id;
> + break;
>   }

In rds_sendmsg, the scopeid passed to rds_conn_create_outgoing
may come from the msg_name (if msg_name is a link-local) or
may come from the rs_bound_scope_id (for connected socket, change
made in Patch 1 of the series). 

This sounds inconsistent.

If I bind to scopeid if1 and then send to fe80::1%if2 (without connect()), 
we'd create an rds_connection with dev_if set to if2. 
(first off, its a bit unexpected to be sending to fe80::1%if2 when you
are bound to a link-local on if1!)

But then, if we got back a response from fe80::1%if2, I think we would
not find a matching conn in rds_recv_incoming? 

And this is even more confusing because the fastpath in rds_sendmsg
does not take the bound_scope_id into consideration at all:
1213 if (rs->rs_conn && ipv6_addr_equal(&rs->rs_conn->c_faddr, &daddr))
1214 conn = rs->rs_conn;
1215 else {
1216 conn = rds_conn_create_outgoing( /* .. */, scope_id)
so if I erroneously passed a msg_name on a connected rds socket, what
would happen? (see also question about rds_connect() itself, below)

Should we always use rs_bound_scope_id for creating the outgoing
rds_connection? (you may need something deterministic for this, 
like "if bound addr is linklocal, return error if daddr has a different
scopeid, else use the bound addr's scopeid", plus, "if bound addr is
not global, and daddr is link-local, we need a conn with the daddr's
scopeid")

Also, why is there no IPv6 support in rds_connect? 

(still looking through the rds-tcp changes, but wanted to get these
questions clarified first).

--Sowmini


Re: [PATCH net-next] rds: clean up loopback rds_connections on netns deletion

2018-06-25 Thread Sowmini Varadhan
On (06/25/18 06:41), Sowmini Varadhan wrote:
  :
> Add the changes aligned with the changes from
> commit ebeeb1ad9b8a ("rds: tcp: use rds_destroy_pending() to synchronize
> netns/module teardown and rds connection/workq management") for
> rds_loop_transport

FWIW, I am optimistic that this will take care of a number
of the use-after-free panics reported by syzbot (I have not
marked the patch with the recommended syzkaller Reported-by 
tags because I was not able to reproduce each original issue, 
but inspection of the traces suggests this missing patch may 
be behind the races that cause the reports).

--Sowmini



[PATCH net-next] rds: clean up loopback rds_connections on netns deletion

2018-06-25 Thread Sowmini Varadhan
The RDS core module creates rds_connections based on callbacks
from rds_loop_transport when sending/receiving packets to local
addresses.

These connections will need to be cleaned up when they are
created from a netns that is not init_net, and that netns is deleted.

Add the changes aligned with the changes from
commit ebeeb1ad9b8a ("rds: tcp: use rds_destroy_pending() to synchronize
netns/module teardown and rds connection/workq management") for
rds_loop_transport

Acked-by: Santosh Shilimkar 
Signed-off-by: Sowmini Varadhan 
---
 net/rds/connection.c |   11 +-
 net/rds/loop.c   |   56 ++
 net/rds/loop.h   |2 +
 3 files changed, 68 insertions(+), 1 deletions(-)

diff --git a/net/rds/connection.c b/net/rds/connection.c
index abef75d..cfb0595 100644
--- a/net/rds/connection.c
+++ b/net/rds/connection.c
@@ -659,11 +659,19 @@ static void rds_conn_info(struct socket *sock, unsigned 
int len,
 
 int rds_conn_init(void)
 {
+   int ret;
+
+   ret = rds_loop_net_init(); /* register pernet callback */
+   if (ret)
+   return ret;
+
rds_conn_slab = kmem_cache_create("rds_connection",
  sizeof(struct rds_connection),
  0, 0, NULL);
-   if (!rds_conn_slab)
+   if (!rds_conn_slab) {
+   rds_loop_net_exit();
return -ENOMEM;
+   }
 
rds_info_register_func(RDS_INFO_CONNECTIONS, rds_conn_info);
rds_info_register_func(RDS_INFO_SEND_MESSAGES,
@@ -676,6 +684,7 @@ int rds_conn_init(void)
 
 void rds_conn_exit(void)
 {
+   rds_loop_net_exit(); /* unregister pernet callback */
rds_loop_exit();
 
WARN_ON(!hlist_empty(rds_conn_hash));
diff --git a/net/rds/loop.c b/net/rds/loop.c
index dac6218..feea1f9 100644
--- a/net/rds/loop.c
+++ b/net/rds/loop.c
@@ -33,6 +33,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include "rds_single_path.h"
 #include "rds.h"
@@ -40,6 +42,17 @@
 
 static DEFINE_SPINLOCK(loop_conns_lock);
 static LIST_HEAD(loop_conns);
+static atomic_t rds_loop_unloading = ATOMIC_INIT(0);
+
+static void rds_loop_set_unloading(void)
+{
+   atomic_set(&rds_loop_unloading, 1);
+}
+
+static bool rds_loop_is_unloading(struct rds_connection *conn)
+{
+   return atomic_read(&rds_loop_unloading) != 0;
+}
 
 /*
  * This 'loopback' transport is a special case for flows that originate
@@ -165,6 +178,8 @@ void rds_loop_exit(void)
struct rds_loop_connection *lc, *_lc;
LIST_HEAD(tmp_list);
 
+   rds_loop_set_unloading();
+   synchronize_rcu();
/* avoid calling conn_destroy with irqs off */
spin_lock_irq(&loop_conns_lock);
list_splice(&loop_conns, &tmp_list);
@@ -177,6 +192,46 @@ void rds_loop_exit(void)
}
 }
 
+static void rds_loop_kill_conns(struct net *net)
+{
+   struct rds_loop_connection *lc, *_lc;
+   LIST_HEAD(tmp_list);
+
+   spin_lock_irq(&loop_conns_lock);
+   list_for_each_entry_safe(lc, _lc, &loop_conns, loop_node)  {
+   struct net *c_net = read_pnet(&lc->conn->c_net);
+
+   if (net != c_net)
+   continue;
+   list_move_tail(&lc->loop_node, &tmp_list);
+   }
+   spin_unlock_irq(&loop_conns_lock);
+
+   list_for_each_entry_safe(lc, _lc, &tmp_list, loop_node) {
+   WARN_ON(lc->conn->c_passive);
+   rds_conn_destroy(lc->conn);
+   }
+}
+
+static void __net_exit rds_loop_exit_net(struct net *net)
+{
+   rds_loop_kill_conns(net);
+}
+
+static struct pernet_operations rds_loop_net_ops = {
+   .exit = rds_loop_exit_net,
+};
+
+int rds_loop_net_init(void)
+{
+   return register_pernet_device(&rds_loop_net_ops);
+}
+
+void rds_loop_net_exit(void)
+{
+   unregister_pernet_device(&rds_loop_net_ops);
+}
+
 /*
  * This is missing .xmit_* because loop doesn't go through generic
  * rds_send_xmit() and doesn't call rds_recv_incoming().  .listen_stop and
@@ -194,4 +249,5 @@ struct rds_transport rds_loop_transport = {
.inc_free   = rds_loop_inc_free,
.t_name = "loopback",
.t_type = RDS_TRANS_LOOP,
+   .t_unloading= rds_loop_is_unloading,
 };
diff --git a/net/rds/loop.h b/net/rds/loop.h
index 469fa4b..bbc8cdd 100644
--- a/net/rds/loop.h
+++ b/net/rds/loop.h
@@ -5,6 +5,8 @@
 /* loop.c */
 extern struct rds_transport rds_loop_transport;
 
+int rds_loop_net_init(void);
+void rds_loop_net_exit(void);
 void rds_loop_exit(void);
 
 #endif
-- 
1.7.1



Re: KASAN: out-of-bounds Read in rds_cong_queue_updates (2)

2018-06-13 Thread Sowmini Varadhan
On (06/13/18 09:52), Dmitry Vyukov wrote:
> I think this is:
> 
> #syz dup: KASAN: use-after-free Read in rds_cong_queue_updates

Indeed. We'd had a discussion about getting a dump of threads
using sysrq (or similar), given the challenges around actually
getting a crash dump; is that now possible? That will certainly help.

another missing bit is that we still need the synchronize_net()
in rds_release(). I realize synchronize_net() is sub-optimal for perf, 
but leaving this existing hole where races can occur in unexpected
manifestations is not ideal either.
(See https://www.spinics.net/lists/netdev/msg475074.html for earlier
discussion thread)
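
(For context, the suggestion is roughly the following; the exact
placement within rds_release() is my reading from that earlier thread,
so treat it as a sketch rather than a patch:)

	rds_clear_recv_queue(rs);
	rds_cong_remove_socket(rs);
	rds_remove_bound(rs);
	/* wait for any datapath/softirq still referencing rs via the
	 * bind hash or congestion updates to drain before teardown
	 */
	synchronize_net();
	rds_send_drop_to(rs, NULL);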

--Sowmini




Re: [rds-devel] KASAN: null-ptr-deref Read in rds_ib_get_mr

2018-05-11 Thread Sowmini Varadhan
On (05/11/18 15:48), Yanjun Zhu wrote:
> diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
> index e678699..2228b50 100644
> --- a/net/rds/ib_rdma.c
> +++ b/net/rds/ib_rdma.c
> @@ -539,11 +539,17 @@ void rds_ib_flush_mrs(void)
>void *rds_ib_get_mr(struct scatterlist *sg, unsigned long nents,
> struct rds_sock *rs, u32 *key_ret)
>{
> - struct rds_ib_device *rds_ibdev;
> + struct rds_ib_device *rds_ibdev = NULL;
> struct rds_ib_mr *ibmr = NULL;
> - struct rds_ib_connection *ic = rs->rs_conn->c_transport_data;
> + struct rds_ib_connection *ic = NULL;
> int ret;
> 
> + if (rs->rs_bound_addr == 0) {
> + ret = -EPERM;
> + goto out;
> + }
> +
> + ic = rs->rs_conn->c_transport_data;
> rds_ibdev = rds_ib_get_device(rs->rs_bound_addr);
> if (!rds_ibdev) {
> ret = -ENODEV;
> 
> I made this raw patch. If you can reproduce this bug, please make tests 
> with it.

I dont think this solves the problem, I think it
just changes the timing under which it can still happen. 

what if the rds_remove_bound() in rds_bind() happens after the check
for if (rs->rs_bound_addr == 0) added above by the patch 

I believe you need some type of synchronization (either
through mutex, or some atomic flag in the rs or similar) to make
sure rds_bind() and rds_ib_get_mr() are mutually exclusive.
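
For example (only a sketch of the idea, not a patch; rs_bind_lock is a
made-up name for whatever lock ends up serializing rds_bind() and
rds_remove_bound() against the MR path):

	/* in rds_ib_get_mr(), with rds_bind()/rds_remove_bound() taking
	 * the same lock around the bind hash updates:
	 */
	mutex_lock(&rs->rs_bind_lock);
	if (rs->rs_bound_addr == 0) {
		mutex_unlock(&rs->rs_bind_lock);
		ret = -EPERM;
		goto out;
	}
	ic = rs->rs_conn ? rs->rs_conn->c_transport_data : NULL;
	rds_ibdev = rds_ib_get_device(rs->rs_bound_addr);
	/* ... MR allocation ... */
	mutex_unlock(&rs->rs_bind_lock);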

--Sowmini
 



Re: [PATCH RFC net-next 00/11] udp gso

2018-04-18 Thread Sowmini Varadhan
On (04/18/18 06:35), Eric Dumazet wrote:
> 
> There is no change at all.
> 
> This will only be used as a mechanism to send X packets of same size.
> 
> So instead of X system calls , one system call.
> 
> One traversal of some expensive part of the host stack.
> 
> The content on the wire should be the same.

I'm sorry that's not how I interpret Willem's email below
(and maybe I misunderstood)

the following taken from https://www.spinics.net/lists/netdev/msg496150.html

Sowmini> If yes, how will the recvmsg differentiate between the case
Sowmini> (2000 byte message followed by 512 byte message) and
Sowmini> (1472 byte message, 526 byte message, then 512 byte message),
Sowmini> in other words, how are UDP message boundary semantics preserved?

Willem> They aren't. This is purely an optimization to amortize the cost of
Willem> repeated tx stack traversal. Unlike UFO, which would preserve the
Willem> boundaries of the original larger than MTU datagram.

As I understand Willem's explanation, if I do a sendmsg of 2000 bytes,
- classic UDP will send 2 IP fragments, the first one with a full UDP
  header, and the IP header indicating that this is the first frag for
  that ipid, with more frags to follow. The second frag will have the
  rest with the same ipid, it will not have a udp header,
  and it will indicate that it is the last frag (no more frags).

  The receiver can thus use the ipid, "more-frags" bit, frag offset etc
  to stitch the 2000 byte udp message together and pass it up on the udp
  socket.

- in the "GSO" proposal my 2000  bytes of data are sent as *two*
  udp packets, each of them with a unique udp header, and uh_len set
  to 1476 (for first) and 526 (for second). The receiver has no clue
  that they are both part of the same UDP datagram, So wire format
  is not the same, am I mistaken?

--Sowmini




Re: [PATCH RFC net-next 00/11] udp gso

2018-04-18 Thread Sowmini Varadhan

I went through the patch set and the code looks fine- it extends existing
infra for TCP/GSO to UDP.

One thing that was not clear to me about the API: shouldn't UDP_SEGMENT
just be automatically determined in the stack from the pmtu? What's
the motivation for the socket option for this? Also, AIUI this can be
either a per-socket or a per-packet option?
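
For reference, this is how I read the proposed API from the cover
letter (sketch only - the SOL_UDP/UDP_SEGMENT constants and the u16
cmsg payload are what I infer from the patches, not something I have
run; buf/buf_len hold the large datagram):

	int gso_size = 1400;	/* segment size, <= MTU minus headers */

	/* per-socket form: subsequent sends are segmented at gso_size */
	setsockopt(fd, SOL_UDP, UDP_SEGMENT, &gso_size, sizeof(gso_size));

	/* per-packet form: pass the segment size as ancillary data */
	char ctl[CMSG_SPACE(sizeof(uint16_t))] = {0};
	struct iovec iov = { .iov_base = buf, .iov_len = buf_len };
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = ctl, .msg_controllen = sizeof(ctl),
	};
	struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);

	cm->cmsg_level = SOL_UDP;
	cm->cmsg_type  = UDP_SEGMENT;
	cm->cmsg_len   = CMSG_LEN(sizeof(uint16_t));
	*(uint16_t *)CMSG_DATA(cm) = gso_size;
	sendmsg(fd, &msg, 0);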

However, I share Sridhar's concerns about the very fundamental change
to UDP message boundary semantics here.  There is actually no such thing
as a "segment" in udp, so in general this feature makes me a little
uneasy.  Well behaved udp applications should already be sending mtu
sized datagrams. And the not-so-well-behaved ones are probably relying
on IP fragmentation/reassembly to take care of datagram boundary semantics
for them?

As Sridhar points out, the feature is not really "negotiated" - one side
unilaterally sets the option. If the receiver is a classic/POSIX UDP
implementation, it will have no way of knowing that message boundaries
have been re-adjusted at the sender.  

One thought to recover from this: use the infra being proposed in
  https://tools.ietf.org/html/draft-touch-tsvwg-udp-options-09
to include a new UDP TLV option that tracks datagram# (similar to IP ID)
to help the receiver reassemble the UDP datagram and pass it up with
the POSIX-conformant UDP message boundary. I realize that this is also
not a perfect solution: as you point out, there are risks from
packet re-ordering/drops- you may well end up just reinventing IP
frag/re-assembly when you are done (with just the slight improvement
that each "fragment" has a full UDP header, so it has a better shot
at ECMP and RSS).

--Sowmini





Re: [PATCH RFC net-next 00/11] udp gso

2018-04-17 Thread Sowmini Varadhan
On (04/17/18 16:23), Willem de Bruijn wrote:
> 
> Assuming IPv4 with an MTU of 1500 and the maximum segment
> size of 1472, the receiver will see three datagrams with MSS of
> 1472B, 528B and 512B.

so the recvmsg will also pass up 1472, 526, 512, right?
If yes, how will the recvmsg differentiate between the case
(2000 byte message followed by 512 byte message) and
(1472 byte message, 526 byte message, then 512 byte message),
in other words, how are UDP message boundary semantics preserved?

--Sowmini


Re: [PATCH RFC net-next 00/11] udp gso

2018-04-17 Thread Sowmini Varadhan
On (04/17/18 16:00), Willem de Bruijn wrote:
> 
> This patchset implements GSO for UDP. A process can concatenate and
> submit multiple datagrams to the same destination in one send call
> by setting socket option SOL_UDP/UDP_SEGMENT with the segment size,
> or passing an analogous cmsg at send time.
> 
> The stack will send the entire large (up to network layer max size)
> datagram through the protocol layer. At the GSO layer, it is broken
> up in individual segments. All receive the same network layer header
> and UDP src and dst port. All but the last segment have the same UDP
> header, but the last may differ in length and checksum.

I'll go through the patch-set later today/tomorrow (interesting!), 
but a question: what are message boundary semantics in this model? E.g.,
if I do a udp_sendmsg of 2000 bytes, and then then a udp_sendmsg of
512 bytes, with the receiver first recvmsg 2000 bytes and then the
512 bytes?

My understanding of the comment above is that no, it will not- 
because (assuming an mtu of 1500) there will be 2 UDP messages on 
the wire, the first one with 1500 bytes, and the second with 1012
bytes (with some discounts for the various L2/L3 etc headers)

--Sowmini


Re: KASAN: use-after-free Read in inet_create

2018-04-08 Thread Sowmini Varadhan

#syz dup: KASAN: use-after-free Read in rds_cong_queue_updates

There  are a number of manifestations of this bug, basically
all suggest that the connect/reconnect etc workqs are somehow
being scheduled after the netns is deleted, despite the
code refactoring in Commit  3db6e0d172c (and looks like
the WARN_ONs in that commit are not even being triggered).
We've not been able to reproduce these issues, and without
a crash dump (or some hint of other threads that were running
at the time of the problem) we are working on figuring out
the root-cause by code-inspection.

--Sowmini



Re: [rds-devel] [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL)

2018-03-20 Thread Sowmini Varadhan
On (03/20/18 12:37), Håkon Bugge wrote:
> 
> A little nit below. And some spelling issues in existing commentary
> you can consider fixing, since you reshuffle this file considerable.
> > +   if (net != &init_net && rtn->ctl_table)
> > +   kfree(rtn->ctl_table);
> 
> Well, this comes from the existing code, but as pointed out by Linus
> recently, kfree() handles NULL pointers. So, if rtn->ctl_table is
> NULL most likely, the code is OK _if_ you add an unlikely() around the
> if-clause. Otherwise, the "&& rtn->ctl_table" can simply be removed.

As you observe correctly, this comes from the existing code,
and is unrelated to the contents of the commit comment.

So if we feel passionately about htis, let us please fix this
separately, with its own ceremony.


> s/when/When/
> s/parameters,this/parameters, this/
> 
> Well, not part of your commit.

As above.
> 
> 
> >  * function  resets the RDS connections in that netns  so that we can
> 
> Two double spaces incidents above
> 
> Not part of your commit

As above.

Thanks much.
--Sowmini


[PATCH net-next] rds: tcp: remove register_netdevice_notifier infrastructure.

2018-03-19 Thread Sowmini Varadhan
The netns deletion path does not need to wait for all net_devices
to be unregistered before dismantling rds_tcp state for the netns
(we are able to dismantle this state on module unload even when
all net_devices are active so there is no dependency here).

This patch removes code related to netdevice notifiers and
refactors all the code needed to dismantle rds_tcp state
into a ->exit callback for the pernet_operations used with
register_pernet_device().

Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
 net/rds/tcp.c |   93 ++---
 1 files changed, 23 insertions(+), 70 deletions(-)

diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index 08ea9cd..4f3a32c 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -485,40 +485,6 @@ static __net_init int rds_tcp_init_net(struct net *net)
return err;
 }
 
-static void __net_exit rds_tcp_exit_net(struct net *net)
-{
-   struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);
-
-   if (rtn->rds_tcp_sysctl)
-   unregister_net_sysctl_table(rtn->rds_tcp_sysctl);
-
-   if (net != &init_net && rtn->ctl_table)
-   kfree(rtn->ctl_table);
-
-   /* If rds_tcp_exit_net() is called as a result of netns deletion,
-* the rds_tcp_kill_sock() device notifier would already have cleaned
-* up the listen socket, thus there is no work to do in this function.
-*
-* If rds_tcp_exit_net() is called as a result of module unload,
-* i.e., due to rds_tcp_exit() -> unregister_pernet_subsys(), then
-* we do need to clean up the listen socket here.
-*/
-   if (rtn->rds_tcp_listen_sock) {
-   struct socket *lsock = rtn->rds_tcp_listen_sock;
-
-   rtn->rds_tcp_listen_sock = NULL;
-   rds_tcp_listen_stop(lsock, &rtn->rds_tcp_accept_w);
-   }
-}
-
-static struct pernet_operations rds_tcp_net_ops = {
-   .init = rds_tcp_init_net,
-   .exit = rds_tcp_exit_net,
-   .id = &rds_tcp_netid,
-   .size = sizeof(struct rds_tcp_net),
-   .async = true,
-};
-
 static void rds_tcp_kill_sock(struct net *net)
 {
struct rds_tcp_connection *tc, *_tc;
@@ -546,40 +512,38 @@ static void rds_tcp_kill_sock(struct net *net)
rds_conn_destroy(tc->t_cpath->cp_conn);
 }
 
-void *rds_tcp_listen_sock_def_readable(struct net *net)
+static void __net_exit rds_tcp_exit_net(struct net *net)
 {
struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);
-   struct socket *lsock = rtn->rds_tcp_listen_sock;
 
-   if (!lsock)
-   return NULL;
+   rds_tcp_kill_sock(net);
 
-   return lsock->sk->sk_user_data;
+   if (rtn->rds_tcp_sysctl)
+   unregister_net_sysctl_table(rtn->rds_tcp_sysctl);
+
+   if (net != &init_net && rtn->ctl_table)
+   kfree(rtn->ctl_table);
 }
 
-static int rds_tcp_dev_event(struct notifier_block *this,
-unsigned long event, void *ptr)
+static struct pernet_operations rds_tcp_net_ops = {
+   .init = rds_tcp_init_net,
+   .exit = rds_tcp_exit_net,
+   .id = &rds_tcp_netid,
+   .size = sizeof(struct rds_tcp_net),
+   .async = true,
+};
+
+void *rds_tcp_listen_sock_def_readable(struct net *net)
 {
-   struct net_device *dev = netdev_notifier_info_to_dev(ptr);
+   struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);
+   struct socket *lsock = rtn->rds_tcp_listen_sock;
 
-   /* rds-tcp registers as a pernet subys, so the ->exit will only
-* get invoked after network acitivity has quiesced. We need to
-* clean up all sockets  to quiesce network activity, and use
-* the unregistration of the per-net loopback device as a trigger
-* to start that cleanup.
-*/
-   if (event == NETDEV_UNREGISTER_FINAL &&
-   dev->ifindex == LOOPBACK_IFINDEX)
-   rds_tcp_kill_sock(dev_net(dev));
+   if (!lsock)
+   return NULL;
 
-   return NOTIFY_DONE;
+   return lsock->sk->sk_user_data;
 }
 
-static struct notifier_block rds_tcp_dev_notifier = {
-   .notifier_call= rds_tcp_dev_event,
-   .priority = -10, /* must be called after other network notifiers */
-};
-
 /* when sysctl is used to modify some kernel socket parameters,this
  * function  resets the RDS connections in that netns  so that we can
  * restart with new parameters.  The assumption is that such reset
@@ -625,9 +589,7 @@ static void rds_tcp_exit(void)
rds_tcp_set_unloading();
synchronize_rcu();
rds_info_deregister_func(RDS_INFO_TCP_SOCKETS, rds_tcp_tc_info);
-   unregister_pernet_subsys(&rds_tcp_net_ops);
-   if (unregister_netdevice_notifier(&rds_tcp_dev_notifier))
-   pr_warn("could not unregister rds_tcp_dev_notifier\n");
+   unregister_pernet_device(_tcp_ne

Re: KASAN: slab-out-of-bounds Read in rds_cong_queue_updates

2018-03-19 Thread Sowmini Varadhan
On (03/19/18 09:29), Dmitry Vyukov wrote:
> 
> This looks the same as:
> 
> #syz dup: KASAN: use-after-free Read in rds_cong_queue_updates

correct, seems like the rds_destroy_pending() fixes did not seal
this race condition. I need to look at this more carefully to see
what race I missed.. no easy answer here, I am afraid.

--Sowmini


Re: [rds-devel] [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL)

2018-03-18 Thread Sowmini Varadhan
On (03/18/18 00:55), Kirill Tkhai wrote:
> 
> I just want to make rds not using NETDEV_UNREGISTER_FINAL. If there is
> another solution to do that, I'm not again that.

The patch below takes care of this. I've done some preliminary testing,
and I'll send it upstream after doing additional self-review/testing.
Please also take a look, if you can, to see if I missed something.

Thanks for the input,

--Sowmini
---patch follows

diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index 08ea9cd..87c2643 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -485,40 +485,6 @@ static __net_init int rds_tcp_init_net(struct net *net)
return err;
 }
 
-static void __net_exit rds_tcp_exit_net(struct net *net)
-{
-   struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);
-
-   if (rtn->rds_tcp_sysctl)
-   unregister_net_sysctl_table(rtn->rds_tcp_sysctl);
-
-   if (net != &init_net && rtn->ctl_table)
-   kfree(rtn->ctl_table);
-
-   /* If rds_tcp_exit_net() is called as a result of netns deletion,
-* the rds_tcp_kill_sock() device notifier would already have cleaned
-* up the listen socket, thus there is no work to do in this function.
-*
-* If rds_tcp_exit_net() is called as a result of module unload,
-* i.e., due to rds_tcp_exit() -> unregister_pernet_subsys(), then
-* we do need to clean up the listen socket here.
-*/
-   if (rtn->rds_tcp_listen_sock) {
-   struct socket *lsock = rtn->rds_tcp_listen_sock;
-
-   rtn->rds_tcp_listen_sock = NULL;
-   rds_tcp_listen_stop(lsock, >rds_tcp_accept_w);
-   }
-}
-
-static struct pernet_operations rds_tcp_net_ops = {
-   .init = rds_tcp_init_net,
-   .exit = rds_tcp_exit_net,
-   .id = &rds_tcp_netid,
-   .size = sizeof(struct rds_tcp_net),
-   .async = true,
-};
-
 static void rds_tcp_kill_sock(struct net *net)
 {
struct rds_tcp_connection *tc, *_tc;
@@ -546,40 +512,38 @@ static void rds_tcp_kill_sock(struct net *net)
rds_conn_destroy(tc->t_cpath->cp_conn);
 }
 
-void *rds_tcp_listen_sock_def_readable(struct net *net)
+static void __net_exit rds_tcp_exit_net(struct net *net)
 {
struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);
-   struct socket *lsock = rtn->rds_tcp_listen_sock;
 
-   if (!lsock)
-   return NULL;
+   rds_tcp_kill_sock(net);
 
-   return lsock->sk->sk_user_data;
+   if (rtn->rds_tcp_sysctl)
+   unregister_net_sysctl_table(rtn->rds_tcp_sysctl);
+
+   if (net != &init_net && rtn->ctl_table)
+   kfree(rtn->ctl_table);
 }
 
-static int rds_tcp_dev_event(struct notifier_block *this,
-unsigned long event, void *ptr)
+static struct pernet_operations rds_tcp_net_ops = {
+   .init = rds_tcp_init_net,
+   .exit = rds_tcp_exit_net,
+   .id = &rds_tcp_netid,
+   .size = sizeof(struct rds_tcp_net),
+   .async = true,
+};
+
+void *rds_tcp_listen_sock_def_readable(struct net *net)
 {
-   struct net_device *dev = netdev_notifier_info_to_dev(ptr);
+   struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);
+   struct socket *lsock = rtn->rds_tcp_listen_sock;
 
-   /* rds-tcp registers as a pernet subys, so the ->exit will only
-* get invoked after network acitivity has quiesced. We need to
-* clean up all sockets  to quiesce network activity, and use
-* the unregistration of the per-net loopback device as a trigger
-* to start that cleanup.
-*/
-   if (event == NETDEV_UNREGISTER_FINAL &&
-   dev->ifindex == LOOPBACK_IFINDEX)
-   rds_tcp_kill_sock(dev_net(dev));
+   if (!lsock)
+   return NULL;
 
-   return NOTIFY_DONE;
+   return lsock->sk->sk_user_data;
 }
 
-static struct notifier_block rds_tcp_dev_notifier = {
-   .notifier_call= rds_tcp_dev_event,
-   .priority = -10, /* must be called after other network notifiers */
-};
-
 /* when sysctl is used to modify some kernel socket parameters,this
  * function  resets the RDS connections in that netns  so that we can
  * restart with new parameters.  The assumption is that such reset
@@ -625,9 +589,7 @@ static void rds_tcp_exit(void)
rds_tcp_set_unloading();
synchronize_rcu();
rds_info_deregister_func(RDS_INFO_TCP_SOCKETS, rds_tcp_tc_info);
-   unregister_pernet_subsys(&rds_tcp_net_ops);
-   if (unregister_netdevice_notifier(&rds_tcp_dev_notifier))
-   pr_warn("could not unregister rds_tcp_dev_notifier\n");
+   unregister_pernet_device(&rds_tcp_net_ops);
rds_tcp_destroy_conns();
rds_trans_unregister(_tcp_transport);
rds_tcp_recv_exit();
@@ -651,24 +613,17 @@ static int rds_tcp_init(void)
if (ret)
goto out_slab;
 
-   ret = register_pernet_subsys(&rds_tcp_net_ops);
+   ret = 

Re: [rds-devel] [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL)

2018-03-17 Thread Sowmini Varadhan
On (03/17/18 10:15), Sowmini Varadhan wrote:
> To solve the scaling problem why not just have a well-defined 
> callback to modules when devices are quiesced, instead of 
> overloading the pernet_device registration in this obscure way?

I thought about this a bit, and maybe I missed your original point-
today we are able to do all the needed cleanup for rds-tcp when
we unload the module, even though network activity has not quiesced,
and there is no reason we cannot use the same code for netns cleanup
as well. I think this is what you were trying to ask, when you 
said "why do you need to know that loopback is down?"
I'm sorry I missed that, I will re-examine the code and get back to
you- it should be possible to just do one registration and
clean up rds state and avoid the hack of registering twice.

(saw your most recent long mail- sorry- both v1 and v2 are hacks)

I'm on the road at the moment, so I'll get back to you on this.

Thanks
--Sowmini



Re: [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL)

2018-03-17 Thread Sowmini Varadhan

I spent a long time staring at both v1 and v2 of your patch.

I understand the overall goal, but I am afraid to say that these
patches are complete hacks.

I was trying to understand why patch v1 blows up with a null rtn in
rds_tcp_init_net, but v2 does not, and the analysis is ugly.

I'm going to put down the analysis here, and others can
decide if this sort of hack is a healthy solution for a scaling
issue (IMHO  it is not- we should get the real fix for the
scaling instead of using duck-tape-and-chewing-gum solutions)

What is happening in v1 is this:

1. Wnen I do "modprobe rds_tcp" in init_net, we end up doing the
   following in rds_tcp_init
   register_pernet_device(_tcp_dev_ops);
   register_pernet_device(_tcp_net_ops);
   Where rds_tcp_dev_ops has 
.id = _tcp_netid,
.size = sizeof(struct rds_tcp_net),
   and rds_tcp_net_ops has 0 values for both of these.

2. So now pernet_list has &rds_tcp_net_ops as the first member of the
   pernet_list.

3. Now I do "ip netns create blue". As part of setup_net(), we walk
   the pernet_list and call the ->init of each member (through ops_init()). 
   So we'd hit rds_tcp_net_ops first. Since the id/size are 0, we'd
   skip the struct rds_tcp_net allocation, so rds_tcp_init_net would 
   find a null return from net_generic() and bomb.

The way I view it (and ymmv) the hack here is to call both
register_pernet_device and register_pernet_subsys: the kernel only
guarantees that calling *one* of register_pernet_* will ensure
that you can safely call net_generic() afterwards.

The v2 patch "works around" this by reordering the registration. 
So this time, init_net will set up the rds_tcp_net_ops as the second 
member, and the first member will be the pernet_operations struct
that has non-zero id and size.

But then the unregistration (necessarily) works in the opposite order
you have to unregister_pernet_device first (so that interfaces are
quiesced) and then unregister_pernet_subsys() so that sockets can
be safely quiesced.

I don't think this type of hack makes the code cleaner, it just
makes things much harder to understand, and is completely brittle
for subsequent changes.

To solve the scaling problem why not just have a well-defined 
callback to modules when devices are quiesced, instead of 
overloading the pernet_device registration in this obscure way?
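
(Purely to illustrate the shape of what I mean - nothing like this
exists today, all the names below are invented:)

	/* hypothetical: a hook invoked once per dying netns, after all
	 * net_devices in that netns have been unregistered but before
	 * the pernet subsys ->exit methods run.
	 */
	struct netns_quiesced_notifier {
		struct list_head	list;
		void			(*devices_quiesced)(struct net *net);
	};

	int register_netns_quiesced_notifier(struct netns_quiesced_notifier *nb);
	void unregister_netns_quiesced_notifier(struct netns_quiesced_notifier *nb);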





Re: [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL)

2018-03-16 Thread Sowmini Varadhan
On (03/16/18 21:48), Kirill Tkhai wrote:
> 
> > Thus I have to spend some time reviewing your patch,
> > and I cannot give you an answer in the next 5 minutes.
> 
> No problem, 5 minutes response never required. Thanks for
> your review.

Thank you. I would like to take some time this weekend
to understand why v1 of your patch hit the null rtn
but v2 did not (a superficial glance at the patch suggested
that you were registering twice in both cases, with just
a reordering, so I would like to understand the root cause of
the null ptr deref with v1).

As for registering twice, that needs some comments to
provide guidance for other subsystems. E.g., I found the
large block comment in net_namespace.h very helpful, so let's
please clearly document what, why, and when this should
be used.

--Sowmini



Re: [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL)

2018-03-16 Thread Sowmini Varadhan
On (03/16/18 21:14), Kirill Tkhai wrote:
> 
> I did the second version and sent you. Have you tried it?

I tried it briefly, and it works for the handful of test cases
that I tried, but I still think it's very weird to register
as both a device and a subsys, especially in light of the
warning in net_namespace.h.

Thus I have to spend some time reviewing your patch,
and I cannot give you an answer in the next 5 minutes.

> Calling netdevice handler for every event is more disturbing,
> as number of events is several times bigger, than one more
> pernet exit method.

So you are saying there are scaling constraints on subsystems
that register netdevice notifiers, because the handler is invoked
for every netdevice event. The disturbing part is that it does not scale.

Thanks.
--Sowmini


Re: [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL)

2018-03-16 Thread Sowmini Varadhan

I had taken some of this offline, but it occurs to me
that some of these notes should be saved to the netdev archives, 
in case this question pops up again in a few years.

When I run your patch, I get a repeatable panic by doing
  modprobe rds-tcp
  ip netns create blue
The panic is because we are finding a null rtn in rds_tcp_init_net.

I think there's something very disturbed about calling
register_pernet_operations() twice, once via 
register_pernet_device() and again via register_pernet_subsys().

I suspect this has everything to do with the panic but I have
not had time to debug every little detail here.

In general, rds_tcp is not a network device; it is a kernel
module.  That is the fundamental problem here.

To repeat the comments from net_namespace.h:
 * Network interfaces need to be removed from a dying netns _before_
 * subsys notifiers can be called, as most of the network code cleanup
 * (which is done from subsys notifiers) runs with the assumption that
 * dev_remove_pack has been called so no new packets will arrive during
 * and after the cleanup functions have been called.  dev_remove_pack
 * is not per namespace so instead the guarantee of no more packets
 * arriving in a network namespace is provided by ensuring that all
 * network devices and all sockets have left the network namespace
 * before the cleanup methods are called.

when the "blue" netns starts up, it creates at least one kernel listen
socket on *.16385. This socket, and any other child/client sockets 
created must be cleaned up before the cleanup_net can happen.

This is why I chose to call register_pernet_subsys. Again, as per
the comments in net_namespace.h:

 * Use these carefully.  If you implement a network device and it
 * needs per network namespace operations use device pernet operations,
 * otherwise use pernet subsys operations.

On (03/16/18 18:51), Kirill Tkhai wrote:
> > Let's find another approach. Could you tell what problem we have in 
> > case of rds_tcp_dev_ops is declared as pernet_device?

As above, rds-tcp is not a network device.

> One more question. Which time we take a reference of loopback device?
> Is it possible before we created a net completely?

We don't take a reference on the loopback device.
We make sure none of the kernel sockets does a get_net(), so
that we don't block cleanup_net, and then, when all
the network interfaces have been taken down (loopback is
the last one), we know there are no more packets coming in
and out, so it is safe to dismantle all kernel sockets
created by rds-tcp.

Hope that helps.

--Sowmini


Re: [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL)

2018-03-16 Thread Sowmini Varadhan

Found my previous question:

https://www.mail-archive.com/netdev@vger.kernel.org/msg72330.html

(see the section about "Comments are specifically invited ...")

> This is not a problem, and rds-tcp is not the only pernet_subsys registering
> a socket. It's OK to close it from .exit method. There are many examples,
> let me point you to icmp_sk_ops as one of them. But it's not the only.

I'm not averse to changing this to NETDEV_UNREGISTER
as long as it works for the 2 test cases below. You
can test it by using rds-ping from the rds-tools rpm,
from/to init_net and from/to the netns, against
some external machine (i.e., something not on the same
physical host).

> > For rds-tcp, we need to be able to do the right thing in both of these
> > cases
> > 1. modprobe -r rds-tcp (cleanup of rds-tcp state should happen in
> >every namespace, including init_net)
> > 2. netns delete (rds_tcp.ko should remain loaded for other namespaces)
> 
> The same as above, every pernet_subsys does this. It's not a problem.
> exit and exit_batch methods are called in both of the cases.
> 
> Please, see __unregister_pernet_operations()->ops_exit_list for the details.

I am familiar with ops_exit_list, but this is the sequence:
- when the module is loaded (or a netns is started) it starts a
  kernel listen socket on *.16385
- when you start the rds-pings above, it will create kernel
  tcp connections from/to port 16385 in the netns, and it will
  start socket keepalives for those connections. Each tcp
  connection is associated with an rds_connection.

As I recall, when I wrote the initial patchset, my problem
was that in order to let the module unload make progress,
all these sockets had to be cleaned up. But to clean up these
sockets, net_device cleanup had to complete (there should not be
any new incoming connections to the listen endpoint on a
non-loopback socket), so I ended up with a circular dependency.

> If we replace NETDEV_UNREGISTER_FINAL with NETDEV_UNREGISTER, the only change
> which happens is we call rds_tcp_kill_sock() earlier. So, it may be a reason
> of problems only if someone changes the list during the time between
> NETDEV_UNREGISTER and NETDEV_UNREGISTER_FINAL are called for loopback.
> But since this time noone related to this net can extend the list,
> there is no a problem to do that.

Please share your patch; I can review it and maybe help to test
it.

As I was trying to say in my RFC, I am quite open to ways to make
this cleanup more obvious.

--Sowmini




Re: [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL)

2018-03-16 Thread Sowmini Varadhan
On (03/16/18 15:38), Kirill Tkhai wrote:
> 
> 467fa15356acf by Sowmini Varadhan added NETDEV_UNREGISTER_FINAL dependence
> with the commentary:
> 
>   /* rds-tcp registers as a pernet subys, so the ->exit will only
>* get invoked after network acitivity has quiesced. We need to
>* clean up all sockets  to quiesce network activity, and use
>* the unregistration of the per-net loopback device as a trigger
>* to start that cleanup.
>*/
> 
> It seems all the protocols pernet subsystems meet this situation, but they
> solve it in generic way. What does rds so specific have?

The difference with rds is this: most consumers of netns associate
a net_device with the netns, and cleanup of the netns state
happens as part of the net_device teardown, without the constraint
above. rds-tcp has its netns state tied to a listening socket, not
to a specific network interface (net_device), so it registers
as a pernet subsys. But this means that cleanup has to be
done carefully (see the comments in net_namespace.h before
register_pernet_subsys).

For rds-tcp, we need to be able to do the right thing in both of these
cases (see the sketch below):
1. modprobe -r rds-tcp (cleanup of rds-tcp state should happen in
   every namespace, including init_net)
2. netns delete (rds_tcp.ko should remain loaded for other namespaces)
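
For reference, here is a minimal sketch of the plain pernet-subsys pattern
that covers both cases (hypothetical foo_* names; the real rds-tcp code has
to do much more work in ->exit to quiesce its sockets and connections):

#include <linux/module.h>
#include <net/net_namespace.h>

/* hypothetical hooks: start/stop this netns's kernel listen socket and
 * tear down any connections that belong to it */
static __net_init int foo_init_net(struct net *net)
{
	return 0;	/* real code: create the per-netns listen socket */
}

static __net_exit void foo_exit_net(struct net *net)
{
	/* real code: stop listening, destroy this netns's connections */
}

static struct pernet_operations foo_net_ops = {
	.init = foo_init_net,
	.exit = foo_exit_net,
};

static int __init foo_module_init(void)
{
	/* case 2: ->exit runs for a netns when that netns is deleted */
	return register_pernet_subsys(&foo_net_ops);
}

static void __exit foo_module_exit(void)
{
	/* case 1: unregister walks every live netns (including init_net)
	 * and calls ->exit for each of them */
	unregister_pernet_subsys(&foo_net_ops);
}

module_init(foo_module_init);
module_exit(foo_module_exit);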

> This commit makes event handler to iterate rds_tcp_conn_list and
> kill them. If we change the stage to NETDEV_UNREGISTER, what will change?

The above two cases need to work correctly.

> Can unregistered loopback device on dead net add new items to 
> rds_tcp_conn_list?
> How it's possible?

I don't understand the question: no, unregistered loopback devices
cannot add items.

fwiw, I had asked questions about this (netns per net_device
vs netns for module) on the netdev list a few years ago; I can
try to hunt down that thread for you later (nobody replied to
it, but maybe it will help answer your questions).

--Sowmini




[PATCH net-next] rds: tcp: must use spin_lock_irq* and not spin_lock_bh with rds_tcp_conn_lock

2018-03-15 Thread Sowmini Varadhan
rds_tcp_connection allocation/free management has the potential to be
called from __rds_conn_create after IRQs have been disabled, so
spin_[un]lock_bh cannot be used with rds_tcp_conn_lock.

Bottom-halves that need to synchronize for critical sections protected
by rds_tcp_conn_lock should instead use rds_destroy_pending() correctly.

Reported-by: syzbot+c68e51bb5e699d3f8...@syzkaller.appspotmail.com
Fixes: ebeeb1ad9b8a ("rds: tcp: use rds_destroy_pending() to synchronize
   netns/module teardown and rds connection/workq management")
Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
 net/rds/tcp.c |   17 +
 1 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index eb04e7f..08ea9cd 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -272,13 +272,14 @@ static int rds_tcp_laddr_check(struct net *net, __be32 addr)
 static void rds_tcp_conn_free(void *arg)
 {
	struct rds_tcp_connection *tc = arg;
+	unsigned long flags;
 
	rdsdebug("freeing tc %p\n", tc);
 
-	spin_lock_bh(&rds_tcp_conn_lock);
+	spin_lock_irqsave(&rds_tcp_conn_lock, flags);
	if (!tc->t_tcp_node_detached)
		list_del(&tc->t_tcp_node);
-	spin_unlock_bh(&rds_tcp_conn_lock);
+	spin_unlock_irqrestore(&rds_tcp_conn_lock, flags);
 
	kmem_cache_free(rds_tcp_conn_slab, tc);
 }
@@ -308,13 +309,13 @@ static int rds_tcp_conn_alloc(struct rds_connection *conn, gfp_t gfp)
		rdsdebug("rds_conn_path [%d] tc %p\n", i,
			 conn->c_path[i].cp_transport_data);
	}
-	spin_lock_bh(&rds_tcp_conn_lock);
+	spin_lock_irq(&rds_tcp_conn_lock);
	for (i = 0; i < RDS_MPATH_WORKERS; i++) {
		tc = conn->c_path[i].cp_transport_data;
		tc->t_tcp_node_detached = false;
		list_add_tail(&tc->t_tcp_node, &rds_tcp_conn_list);
	}
-	spin_unlock_bh(&rds_tcp_conn_lock);
+	spin_unlock_irq(&rds_tcp_conn_lock);
 fail:
	if (ret) {
		for (j = 0; j < i; j++)
@@ -527,7 +528,7 @@ static void rds_tcp_kill_sock(struct net *net)
 
	rtn->rds_tcp_listen_sock = NULL;
	rds_tcp_listen_stop(lsock, &rtn->rds_tcp_accept_w);
-	spin_lock_bh(&rds_tcp_conn_lock);
+	spin_lock_irq(&rds_tcp_conn_lock);
	list_for_each_entry_safe(tc, _tc, &rds_tcp_conn_list, t_tcp_node) {
		struct net *c_net = read_pnet(&tc->t_cpath->cp_conn->c_net);
 
@@ -540,7 +541,7 @@ static void rds_tcp_kill_sock(struct net *net)
			tc->t_tcp_node_detached = true;
		}
	}
-	spin_unlock_bh(&rds_tcp_conn_lock);
+	spin_unlock_irq(&rds_tcp_conn_lock);
	list_for_each_entry_safe(tc, _tc, &tmp_list, t_tcp_node)
		rds_conn_destroy(tc->t_cpath->cp_conn);
 }
@@ -588,7 +589,7 @@ static void rds_tcp_sysctl_reset(struct net *net)
 {
	struct rds_tcp_connection *tc, *_tc;
 
-	spin_lock_bh(&rds_tcp_conn_lock);
+	spin_lock_irq(&rds_tcp_conn_lock);
	list_for_each_entry_safe(tc, _tc, &rds_tcp_conn_list, t_tcp_node) {
		struct net *c_net = read_pnet(&tc->t_cpath->cp_conn->c_net);
 
@@ -598,7 +599,7 @@ static void rds_tcp_sysctl_reset(struct net *net)
		/* reconnect with new parameters */
		rds_conn_path_drop(tc->t_cpath, false);
	}
-	spin_unlock_bh(&rds_tcp_conn_lock);
+	spin_unlock_irq(&rds_tcp_conn_lock);
 }
 
 static int rds_tcp_skbuf_handler(struct ctl_table *ctl, int write,
-- 
1.7.1



Re: WARNING in __local_bh_enable_ip (2)

2018-03-14 Thread Sowmini Varadhan
On (03/14/18 14:28), Eric Dumazet wrote:
> 
> 
> spin_lock_bh(&rds_tcp_conn_lock)/spin_unlock_bh(&rds_tcp_conn_lock) in
> rds_tcp_conn_free()
> 
> is in conflict with the spin_lock_irqsave(&rds_conn_lock, flags);
> in __rds_conn_create()

yes I was going to look into this and fix it later.

> Hard to understand why RDS is messing with hard irqs really.

Some of it comes from the rds_rdma history: some parts of
the common rds and rds_rdma modules get called in various
driver contexts for infiniband.
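
For what it's worth, here is a tiny sketch of the locking pattern the fix
moves to (hypothetical names, not the rds code): since the free path can be
reached with hard IRQs already disabled, the lock must be taken with irqsave
rather than the _bh variants, whose unlock is what trips __local_bh_enable_ip():

#include <linux/spinlock.h>
#include <linux/list.h>

static DEFINE_SPINLOCK(example_conn_lock);	/* hypothetical */
static LIST_HEAD(example_conn_list);		/* entries added elsewhere */

struct example_conn {				/* hypothetical */
	struct list_head node;
};

static void example_conn_free(struct example_conn *ec)
{
	unsigned long flags;

	/* The caller may already hold a spin_lock_irqsave()'d lock, i.e.
	 * hard IRQs are off; spin_unlock_bh() here would re-enable BHs
	 * with IRQs disabled and trigger the __local_bh_enable_ip() warning. */
	spin_lock_irqsave(&example_conn_lock, flags);
	list_del(&ec->node);
	spin_unlock_irqrestore(&example_conn_lock, flags);
}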

--Sowmini


Re: [PATCH][rds-next] rds: make functions rds_info_from_znotifier and rds_message_zcopy_from_user static

2018-03-11 Thread Sowmini Varadhan
On (03/11/18 18:03), Colin King wrote:
> From: Colin Ian King 
> 
> Functions rds_info_from_znotifier and rds_message_zcopy_from_user are
> local to the source and do not need to be in global scope, so make them
> static.

the rds_message_zcopy_from_user warning was already flagged by the
kbuild robot last week. See
   https://www.spinics.net/lists/netdev/msg488041.html


 



Re: [rds-devel] [PATCH][rds-next] rds: remove redundant variable 'sg_off'

2018-03-11 Thread Sowmini Varadhan
On (03/11/18 17:27), Colin King wrote:
> Variable sg_off is assigned a value but it is never read, hence it is
> redundant and can be removed.
> 

Acked-by: Sowmini Varadhan <sowmini.varad...@oracle.com>




Re: [RFC PATCH linux-next] rds: rds_message_zcopy_from_user() can be static

2018-03-08 Thread Sowmini Varadhan
On (03/08/18 18:56), kbuild test robot wrote:
> 
> Fixes: d40a126b16ea ("rds: refactor zcopy code into 
> rds_message_zcopy_from_user")
> Signed-off-by: Fengguang Wu <fengguang...@intel.com>

Acked-by: Sowmini Varadhan <sowmini.varad...@oracle.com>

(do I need to separately submit a non-RFC patch for this?)



Re: [PATCH net-next] sock: Fix SO_ZEROCOPY switch case

2018-03-07 Thread Sowmini Varadhan
On (03/07/18 09:40), Jesus Sanchez-Palencia wrote:
> Fix the SO_ZEROCOPY switch case on sock_setsockopt() avoiding the
> ret values to be overwritten by the one set on the default case.

Acked-by: Sowmini Varadhan <sowmini.varad...@oracle.com>



[PATCH net-next 1/2] rds: refactor zcopy code into rds_message_zcopy_from_user

2018-03-06 Thread Sowmini Varadhan
Move the large block of code predicated on zcopy from
rds_message_copy_from_user into a new function,
rds_message_zcopy_from_user()

Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
 net/rds/message.c |  108 +---
 1 files changed, 60 insertions(+), 48 deletions(-)

diff --git a/net/rds/message.c b/net/rds/message.c
index 116cf87..c36edbb 100644
--- a/net/rds/message.c
+++ b/net/rds/message.c
@@ -333,14 +333,14 @@ struct rds_message *rds_message_map_pages(unsigned long 
*page_addrs, unsigned in
return rm;
 }
 
-int rds_message_copy_from_user(struct rds_message *rm, struct iov_iter *from,
-  bool zcopy)
+int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *from)
 {
-   unsigned long to_copy, nbytes;
unsigned long sg_off;
struct scatterlist *sg;
int ret = 0;
int length = iov_iter_count(from);
+   int total_copied = 0;
+   struct sk_buff *skb;
 
rm->m_inc.i_hdr.h_len = cpu_to_be32(iov_iter_count(from));
 
@@ -350,54 +350,66 @@ int rds_message_copy_from_user(struct rds_message *rm, 
struct iov_iter *from,
sg = rm->data.op_sg;
sg_off = 0; /* Dear gcc, sg->page will be null from kzalloc. */
 
-   if (zcopy) {
-   int total_copied = 0;
-   struct sk_buff *skb;
-
-   skb = alloc_skb(0, GFP_KERNEL);
-   if (!skb)
-   return -ENOMEM;
-   BUILD_BUG_ON(sizeof(skb->cb) <
-max_t(int, sizeof(struct rds_znotifier),
-  sizeof(struct rds_zcopy_cookies)));
-   rm->data.op_mmp_znotifier = RDS_ZCOPY_SKB(skb);
-   if (mm_account_pinned_pages(>data.op_mmp_znotifier->z_mmp,
-   length)) {
-   ret = -ENOMEM;
+   skb = alloc_skb(0, GFP_KERNEL);
+   if (!skb)
+   return -ENOMEM;
+   BUILD_BUG_ON(sizeof(skb->cb) < max_t(int, sizeof(struct rds_znotifier),
+sizeof(struct rds_zcopy_cookies)));
+   rm->data.op_mmp_znotifier = RDS_ZCOPY_SKB(skb);
+   if (mm_account_pinned_pages(>data.op_mmp_znotifier->z_mmp,
+   length)) {
+   ret = -ENOMEM;
+   goto err;
+   }
+   while (iov_iter_count(from)) {
+   struct page *pages;
+   size_t start;
+   ssize_t copied;
+
+   copied = iov_iter_get_pages(from, , PAGE_SIZE,
+   1, );
+   if (copied < 0) {
+   struct mmpin *mmp;
+   int i;
+
+   for (i = 0; i < rm->data.op_nents; i++)
+   put_page(sg_page(>data.op_sg[i]));
+   mmp = >data.op_mmp_znotifier->z_mmp;
+   mm_unaccount_pinned_pages(mmp);
+   ret = -EFAULT;
goto err;
}
-   while (iov_iter_count(from)) {
-   struct page *pages;
-   size_t start;
-   ssize_t copied;
-
-   copied = iov_iter_get_pages(from, , PAGE_SIZE,
-   1, );
-   if (copied < 0) {
-   struct mmpin *mmp;
-   int i;
-
-   for (i = 0; i < rm->data.op_nents; i++)
-   put_page(sg_page(>data.op_sg[i]));
-   mmp = >data.op_mmp_znotifier->z_mmp;
-   mm_unaccount_pinned_pages(mmp);
-   ret = -EFAULT;
-   goto err;
-   }
-   total_copied += copied;
-   iov_iter_advance(from, copied);
-   length -= copied;
-   sg_set_page(sg, pages, copied, start);
-   rm->data.op_nents++;
-   sg++;
-   }
-   WARN_ON_ONCE(length != 0);
-   return ret;
+   total_copied += copied;
+   iov_iter_advance(from, copied);
+   length -= copied;
+   sg_set_page(sg, pages, copied, start);
+   rm->data.op_nents++;
+   sg++;
+   }
+   WARN_ON_ONCE(length != 0);
+   return ret;
 err:
-   consume_skb(skb);
-   rm->data.op_mmp_znotifier = NULL;
-   return ret;
-   } /* zcopy */
+   consume_skb(skb);
+   rm->data.op_mmp_znotifier = NULL;
+   return ret;
+}
+
+int rds_message_copy_from_user(struct rds_message *r

[PATCH net-next 0/2] RDS: zerocopy code enhancements

2018-03-06 Thread Sowmini Varadhan
A couple of enhancements to the rds zerocopy code:
- patch 1 refactors rds_message_copy_from_user to pull the zcopy logic
  into its own function
- patch 2 drops the use of sk_buff to track MSG_ZEROCOPY cookies and
  uses a simple linked list (enhancement suggested by willemb during
  code review)

Sowmini Varadhan (2):
  rds: refactor zcopy code into rds_message_zcopy_from_user
  rds: use list structure to track information for zerocopy completion
notification



[PATCH net-next 2/2] rds: use list structure to track information for zerocopy completion notification

2018-03-06 Thread Sowmini Varadhan
Commit 401910db4cd4 ("rds: deliver zerocopy completion notification
with data") removes support for zerocopy completion notification
on the sk_error_queue, thus we no longer need to track the cookie
information in sk_buff structures.

This commit replaces the struct sk_buff_head rs_zcookie_queue with
a simpler list, which results in a smaller memory footprint as well
as more efficient memory allocation.

Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
 net/rds/af_rds.c  |6 ++--
 net/rds/message.c |   77 +---
 net/rds/rds.h |   23 +++
 net/rds/recv.c|   23 +++-
 4 files changed, 85 insertions(+), 44 deletions(-)

diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index f712610..ab751a1 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -77,7 +77,7 @@ static int rds_release(struct socket *sock)
rds_send_drop_to(rs, NULL);
rds_rdma_drop_keys(rs);
rds_notify_queue_get(rs, NULL);
-   __skb_queue_purge(>rs_zcookie_queue);
+   rds_notify_msg_zcopy_purge(>rs_zcookie_queue);
 
spin_lock_bh(_sock_lock);
list_del_init(>rs_item);
@@ -180,7 +180,7 @@ static __poll_t rds_poll(struct file *file, struct socket 
*sock,
}
if (!list_empty(>rs_recv_queue) ||
!list_empty(>rs_notify_queue) ||
-   !skb_queue_empty(>rs_zcookie_queue))
+   !list_empty(>rs_zcookie_queue.zcookie_head))
mask |= (EPOLLIN | EPOLLRDNORM);
if (rs->rs_snd_bytes < rds_sk_sndbuf(rs))
mask |= (EPOLLOUT | EPOLLWRNORM);
@@ -515,7 +515,7 @@ static int __rds_create(struct socket *sock, struct sock 
*sk, int protocol)
INIT_LIST_HEAD(>rs_recv_queue);
INIT_LIST_HEAD(>rs_notify_queue);
INIT_LIST_HEAD(>rs_cong_list);
-   skb_queue_head_init(>rs_zcookie_queue);
+   rds_message_zcopy_queue_init(>rs_zcookie_queue);
spin_lock_init(>rs_rdma_lock);
rs->rs_rdma_keys = RB_ROOT;
rs->rs_rx_traces = 0;
diff --git a/net/rds/message.c b/net/rds/message.c
index c36edbb..90dcdcf 100644
--- a/net/rds/message.c
+++ b/net/rds/message.c
@@ -48,7 +48,6 @@
 [RDS_EXTHDR_GEN_NUM]   = sizeof(u32),
 };
 
-
 void rds_message_addref(struct rds_message *rm)
 {
rdsdebug("addref rm %p ref %d\n", rm, refcount_read(>m_refcount));
@@ -56,9 +55,9 @@ void rds_message_addref(struct rds_message *rm)
 }
 EXPORT_SYMBOL_GPL(rds_message_addref);
 
-static inline bool skb_zcookie_add(struct sk_buff *skb, u32 cookie)
+static inline bool rds_zcookie_add(struct rds_msg_zcopy_info *info, u32 cookie)
 {
-   struct rds_zcopy_cookies *ck = (struct rds_zcopy_cookies *)skb->cb;
+   struct rds_zcopy_cookies *ck = >zcookies;
int ncookies = ck->num;
 
if (ncookies == RDS_MAX_ZCOOKIES)
@@ -68,38 +67,61 @@ static inline bool skb_zcookie_add(struct sk_buff *skb, u32 
cookie)
return true;
 }
 
+struct rds_msg_zcopy_info *rds_info_from_znotifier(struct rds_znotifier 
*znotif)
+{
+   return container_of(znotif, struct rds_msg_zcopy_info, znotif);
+}
+
+void rds_notify_msg_zcopy_purge(struct rds_msg_zcopy_queue *q)
+{
+   unsigned long flags;
+   LIST_HEAD(copy);
+   struct rds_msg_zcopy_info *info, *tmp;
+
+   spin_lock_irqsave(>lock, flags);
+   list_splice(>zcookie_head, );
+   INIT_LIST_HEAD(>zcookie_head);
+   spin_unlock_irqrestore(>lock, flags);
+
+   list_for_each_entry_safe(info, tmp, , rs_zcookie_next) {
+   list_del(>rs_zcookie_next);
+   kfree(info);
+   }
+}
+
 static void rds_rm_zerocopy_callback(struct rds_sock *rs,
 struct rds_znotifier *znotif)
 {
-   struct sk_buff *skb, *tail;
-   unsigned long flags;
-   struct sk_buff_head *q;
+   struct rds_msg_zcopy_info *info;
+   struct rds_msg_zcopy_queue *q;
u32 cookie = znotif->z_cookie;
struct rds_zcopy_cookies *ck;
+   struct list_head *head;
+   unsigned long flags;
 
+   mm_unaccount_pinned_pages(>z_mmp);
q = >rs_zcookie_queue;
spin_lock_irqsave(>lock, flags);
-   tail = skb_peek_tail(q);
-
-   if (tail && skb_zcookie_add(tail, cookie)) {
-   spin_unlock_irqrestore(>lock, flags);
-   mm_unaccount_pinned_pages(>z_mmp);
-   consume_skb(rds_skb_from_znotifier(znotif));
-   /* caller invokes rds_wake_sk_sleep() */
-   return;
+   head = >zcookie_head;
+   if (!list_empty(head)) {
+   info = list_entry(head, struct rds_msg_zcopy_info,
+ rs_zcookie_next);
+   if (info && rds_zcookie_add(info, cookie)) {
+   spin_unlock_irqrestore(>lock, flags);
+  

Re: [PATCH v2 net] rds: Incorrect reference counting in TCP socket creation

2018-03-02 Thread Sowmini Varadhan
On (03/01/18 21:07), Ka-Cheong Poon wrote:
> Commit 0933a578cd55 ("rds: tcp: use sock_create_lite() to create the
> accept socket") has a reference counting issue in TCP socket creation
> when accepting a new connection.  The code uses sock_create_lite() to
> create a kernel socket.  But it does not do __module_get() on the
> socket owner.  When the connection is shutdown and sock_release() is
> called to free the socket, the owner's reference count is decremented
> and becomes incorrect.  Note that this bug only shows up when the socket
> owner is configured as a kernel module.

Acked-by: Sowmini Varadhan <sowmini.varad...@oracle.com>



Re: [PATCH net] rds: Incorrect reference counting in TCP socket creation

2018-03-01 Thread Sowmini Varadhan
On (03/01/18 20:19), Ka-Cheong Poon wrote:
> >>
> >>-   new_sock->type = sock->type;
> >>-   new_sock->ops = sock->ops;
> >> ret = sock->ops->accept(sock, new_sock, O_NONBLOCK, true);
> >> if (ret < 0)
> >> goto out;
> >>
> >>+   new_sock->ops = sock->ops;
> >
> >How is this delta relevant to the commit comment? Seems unrelated?
> 
> 
> Note that sock_release() checks if sock->ops is set before
> decrementing the refcnt.  By moving the ops assignment after
> the ops->accept() call, we save increasing the refcnt in
> case the ops->accept() fails.  Otherwise, the __module_get()
> needs to be moved before ops->accept() to handle this failure
> case.

I see, thanks for the clarification.

It may be helpful to have some comment in there, in case some other
module trips on something similar in the future.
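
For future readers, a hedged sketch of the ordering described above
(illustrative only, not the exact rds_tcp patch): the ops pointer, and the
module reference that sock_release() will later drop through it, are only
set up once ops->accept() has succeeded:

#include <linux/net.h>
#include <linux/module.h>
#include <linux/fcntl.h>

static int example_accept_one(struct socket *sock, struct socket *new_sock)
{
	int ret;

	ret = sock->ops->accept(sock, new_sock, O_NONBLOCK, true);
	if (ret < 0)
		return ret;	/* new_sock->ops is still NULL, so sock_release()
				 * will not drop a module ref we never took */

	new_sock->type = sock->type;
	new_sock->ops = sock->ops;
	__module_get(new_sock->ops->owner);	/* balanced by sock_release() */
	return 0;
}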

--Sowmini



Re: [PATCH net] rds: Incorrect reference counting in TCP socket creation

2018-03-01 Thread Sowmini Varadhan
On Wed, Feb 28, 2018 at 11:44 PM, Ka-Cheong Poon
 wrote:
> Commit 0933a578cd55 ("rds: tcp: use sock_create_lite() to create the
> accept socket") has a reference counting issue in TCP socket creation
> when accepting a new connection.  The code uses sock_create_lite() to
> create a kernel socket.  But it does not do __module_get() on the
> socket owner.  When the connection is shutdown and sock_release() is
> called to free the socket, the owner's reference count is decremented
> and becomes incorrect.  Note that this bug only shows up when the socket
> owner is configured as a kernel module.

>
> -   new_sock->type = sock->type;
> -   new_sock->ops = sock->ops;
> ret = sock->ops->accept(sock, new_sock, O_NONBLOCK, true);
> if (ret < 0)
> goto out;
>
> +   new_sock->ops = sock->ops;

How is this delta relevant to the commit comment? Seems unrelated?

--Sowmini


[PATCH RESEND V3 net-next 3/3] selftests/net: reap zerocopy completions passed up as ancillary data.

2018-02-27 Thread Sowmini Varadhan
PF_RDS sockets pass up cookies for zerocopy completion as ancillary
data. Update msg_zerocopy to reap this information.

Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
Acked-by: Willem de Bruijn <will...@google.com>
Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
---
v2: receive zerocopy completion notification as POLLIN
v3: drop the ncookies arg in do_process_zerocopy_cookies; reverse christmas
tree declarations; check for MSG_CTRUNC; print a warning when encountering
ignored cmsghdrs in do_recvmsg_completion

 tools/testing/selftests/net/msg_zerocopy.c |   65 ---
 1 files changed, 57 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/net/msg_zerocopy.c 
b/tools/testing/selftests/net/msg_zerocopy.c
index eff9cf2..406cc70 100644
--- a/tools/testing/selftests/net/msg_zerocopy.c
+++ b/tools/testing/selftests/net/msg_zerocopy.c
@@ -344,7 +344,53 @@ static int do_setup_tx(int domain, int type, int protocol)
return fd;
 }
 
-static bool do_recv_completion(int fd)
+static uint32_t do_process_zerocopy_cookies(struct rds_zcopy_cookies *ck)
+{
+   int i;
+
+   if (ck->num > RDS_MAX_ZCOOKIES)
+   error(1, 0, "Returned %d cookies, max expected %d\n",
+ ck->num, RDS_MAX_ZCOOKIES);
+   for (i = 0; i < ck->num; i++)
+   if (cfg_verbose >= 2)
+   fprintf(stderr, "%d\n", ck->cookies[i]);
+   return ck->num;
+}
+
+static bool do_recvmsg_completion(int fd)
+{
+   char cmsgbuf[CMSG_SPACE(sizeof(struct rds_zcopy_cookies))];
+   struct rds_zcopy_cookies *ck;
+   struct cmsghdr *cmsg;
+   struct msghdr msg;
+   bool ret = false;
+
+   memset(, 0, sizeof(msg));
+   msg.msg_control = cmsgbuf;
+   msg.msg_controllen = sizeof(cmsgbuf);
+
+   if (recvmsg(fd, , MSG_DONTWAIT))
+   return ret;
+
+   if (msg.msg_flags & MSG_CTRUNC)
+   error(1, errno, "recvmsg notification: truncated");
+
+   for (cmsg = CMSG_FIRSTHDR(); cmsg; cmsg = CMSG_NXTHDR(, cmsg)) {
+   if (cmsg->cmsg_level == SOL_RDS &&
+   cmsg->cmsg_type == RDS_CMSG_ZCOPY_COMPLETION) {
+
+   ck = (struct rds_zcopy_cookies *)CMSG_DATA(cmsg);
+   completions += do_process_zerocopy_cookies(ck);
+   ret = true;
+   break;
+   }
+   error(0, 0, "ignoring cmsg at level %d type %d\n",
+   cmsg->cmsg_level, cmsg->cmsg_type);
+   }
+   return ret;
+}
+
+static bool do_recv_completion(int fd, int domain)
 {
struct sock_extended_err *serr;
struct msghdr msg = {};
@@ -353,6 +399,9 @@ static bool do_recv_completion(int fd)
int ret, zerocopy;
char control[100];
 
+   if (domain == PF_RDS)
+   return do_recvmsg_completion(fd);
+
msg.msg_control = control;
msg.msg_controllen = sizeof(control);
 
@@ -409,20 +458,20 @@ static bool do_recv_completion(int fd)
 }
 
 /* Read all outstanding messages on the errqueue */
-static void do_recv_completions(int fd)
+static void do_recv_completions(int fd, int domain)
 {
-   while (do_recv_completion(fd)) {}
+   while (do_recv_completion(fd, domain)) {}
 }
 
 /* Wait for all remaining completions on the errqueue */
-static void do_recv_remaining_completions(int fd)
+static void do_recv_remaining_completions(int fd, int domain)
 {
int64_t tstop = gettimeofday_ms() + cfg_waittime_ms;
 
while (completions < expected_completions &&
   gettimeofday_ms() < tstop) {
-   if (do_poll(fd, POLLERR))
-   do_recv_completions(fd);
+   if (do_poll(fd, domain == PF_RDS ? POLLIN : POLLERR))
+   do_recv_completions(fd, domain);
}
 
if (completions < expected_completions)
@@ -503,13 +552,13 @@ static void do_tx(int domain, int type, int protocol)
 
while (!do_poll(fd, POLLOUT)) {
if (cfg_zerocopy)
-   do_recv_completions(fd);
+   do_recv_completions(fd, domain);
}
 
} while (gettimeofday_ms() < tstop);
 
if (cfg_zerocopy)
-   do_recv_remaining_completions(fd);
+   do_recv_remaining_completions(fd, domain);
 
if (close(fd))
error(1, errno, "close");
-- 
1.7.1




[PATCH RESEND V3 net-next 0/3] RDS: optimized notification for zerocopy completion

2018-02-27 Thread Sowmini Varadhan
Resending with Acked-by additions: the previous attempt did not show
up in patchwork. This time with a new mail Message-Id.

RDS applications use predominantly request-response, transaction-based
IPC, so that ingress and egress traffic are well-balanced,
and it is possible/desirable to reduce system-call overhead by
piggybacking the notifications for zerocopy completion response
with data.

Moreover, it has been pointed out that socket functions block
if sk_err is non-zero; thus, if the RDS code does not plan/need
to use the sk_error_queue path for completion notification, it
is preferable to remove the sk_error_queue related paths in
RDS.

Both of these goals are implemented in this series.
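
As a rough illustration of the receive-side usage this enables (assuming the
RDS_CMSG_ZCOPY_COMPLETION / struct rds_zcopy_cookies definitions added in
patch 2), an application can reap the cookies from the same recvmsg() it
already issues for data; the reap_zcookies() helper below is hypothetical:

#include <sys/socket.h>
#include <sys/uio.h>
#include <linux/rds.h>

/* Sketch: reap zerocopy cookies delivered as SOL_RDS ancillary data on a
 * PF_RDS socket; returns the number of cookies seen, or -1 on error. */
static int reap_zcookies(int fd, void *buf, size_t len)
{
	char cbuf[CMSG_SPACE(sizeof(struct rds_zcopy_cookies))];
	struct rds_zcopy_cookies *ck;
	struct cmsghdr *cmsg;
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = cbuf, .msg_controllen = sizeof(cbuf),
	};
	int i, n = 0;

	if (recvmsg(fd, &msg, MSG_DONTWAIT) < 0)
		return -1;

	for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
		if (cmsg->cmsg_level != SOL_RDS ||
		    cmsg->cmsg_type != RDS_CMSG_ZCOPY_COMPLETION)
			continue;
		ck = (struct rds_zcopy_cookies *)CMSG_DATA(cmsg);
		for (i = 0; i < ck->num; i++)
			n++;		/* ck->cookies[i] identifies a completed send */
	}
	return n;
}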

v2: removed sk_error_queue support
v3: incorporated additional code review comments (details in each patch)

Sowmini Varadhan (3):
  selftests/net: revert the zerocopy Rx path for PF_RDS
  rds: deliver zerocopy completion notification with data
  selftests/net: reap zerocopy completions passed up as ancillary data.

 include/uapi/linux/errqueue.h  |2 -
 include/uapi/linux/rds.h   |7 ++
 net/rds/af_rds.c   |7 +-
 net/rds/message.c  |   38 -
 net/rds/rds.h  |2 +
 net/rds/recv.c |   31 +++-
 tools/testing/selftests/net/msg_zerocopy.c |  120 
 7 files changed, 111 insertions(+), 96 deletions(-)



[PATCH RESEND V3 net-next 1/3] selftests/net: revert the zerocopy Rx path for PF_RDS

2018-02-27 Thread Sowmini Varadhan
In preparation for optimized reception of zerocopy completion,
revert the Rx side changes introduced by Commit dfb8434b0a94
("selftests/net: add zerocopy support for PF_RDS test case")

Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
Acked-by: Willem de Bruijn <will...@google.com>
Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
---
v2: prepare to remove sk_error_queue based path; remove recvmsg() as well,
PF_RDS can also use recv() for the usage pattern in msg_zerocopy

 tools/testing/selftests/net/msg_zerocopy.c |   67 
 1 files changed, 0 insertions(+), 67 deletions(-)

diff --git a/tools/testing/selftests/net/msg_zerocopy.c 
b/tools/testing/selftests/net/msg_zerocopy.c
index 5cc2a53..eff9cf2 100644
--- a/tools/testing/selftests/net/msg_zerocopy.c
+++ b/tools/testing/selftests/net/msg_zerocopy.c
@@ -344,26 +344,6 @@ static int do_setup_tx(int domain, int type, int protocol)
return fd;
 }
 
-static int do_process_zerocopy_cookies(struct sock_extended_err *serr,
-  uint32_t *ckbuf, size_t nbytes)
-{
-   int ncookies, i;
-
-   if (serr->ee_errno != 0)
-   error(1, 0, "serr: wrong error code: %u", serr->ee_errno);
-   ncookies = serr->ee_data;
-   if (ncookies > SO_EE_ORIGIN_MAX_ZCOOKIES)
-   error(1, 0, "Returned %d cookies, max expected %d\n",
- ncookies, SO_EE_ORIGIN_MAX_ZCOOKIES);
-   if (nbytes != ncookies * sizeof(uint32_t))
-   error(1, 0, "Expected %d cookies, got %ld\n",
- ncookies, nbytes/sizeof(uint32_t));
-   for (i = 0; i < ncookies; i++)
-   if (cfg_verbose >= 2)
-   fprintf(stderr, "%d\n", ckbuf[i]);
-   return ncookies;
-}
-
 static bool do_recv_completion(int fd)
 {
struct sock_extended_err *serr;
@@ -372,17 +352,10 @@ static bool do_recv_completion(int fd)
uint32_t hi, lo, range;
int ret, zerocopy;
char control[100];
-   uint32_t ckbuf[SO_EE_ORIGIN_MAX_ZCOOKIES];
-   struct iovec iov;
 
msg.msg_control = control;
msg.msg_controllen = sizeof(control);
 
-   iov.iov_base = ckbuf;
-   iov.iov_len = (SO_EE_ORIGIN_MAX_ZCOOKIES * sizeof(ckbuf[0]));
-   msg.msg_iov = 
-   msg.msg_iovlen = 1;
-
ret = recvmsg(fd, , MSG_ERRQUEUE);
if (ret == -1 && errno == EAGAIN)
return false;
@@ -402,10 +375,6 @@ static bool do_recv_completion(int fd)
 
serr = (void *) CMSG_DATA(cm);
 
-   if (serr->ee_origin == SO_EE_ORIGIN_ZCOOKIE) {
-   completions += do_process_zerocopy_cookies(serr, ckbuf, ret);
-   return true;
-   }
if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
error(1, 0, "serr: wrong origin: %u", serr->ee_origin);
if (serr->ee_errno != 0)
@@ -631,40 +600,6 @@ static void do_flush_datagram(int fd, int type)
bytes += cfg_payload_len;
 }
 
-
-static void do_recvmsg(int fd)
-{
-   int ret, off = 0;
-   char *buf;
-   struct iovec iov;
-   struct msghdr msg;
-   struct sockaddr_storage din;
-
-   buf = calloc(cfg_payload_len, sizeof(char));
-   iov.iov_base = buf;
-   iov.iov_len = cfg_payload_len;
-
-   memset(, 0, sizeof(msg));
-   msg.msg_name = 
-   msg.msg_namelen = sizeof(din);
-   msg.msg_iov = 
-   msg.msg_iovlen = 1;
-
-   ret = recvmsg(fd, , MSG_TRUNC);
-
-   if (ret == -1)
-   error(1, errno, "recv");
-   if (ret != cfg_payload_len)
-   error(1, 0, "recv: ret=%u != %u", ret, cfg_payload_len);
-
-   if (memcmp(buf + off, payload, ret))
-   error(1, 0, "recv: data mismatch");
-
-   free(buf);
-   packets++;
-   bytes += cfg_payload_len;
-}
-
 static void do_rx(int domain, int type, int protocol)
 {
uint64_t tstop;
@@ -676,8 +611,6 @@ static void do_rx(int domain, int type, int protocol)
do {
if (type == SOCK_STREAM)
do_flush_tcp(fd);
-   else if (domain == PF_RDS)
-   do_recvmsg(fd);
else
do_flush_datagram(fd, type);
 
-- 
1.7.1



[PATCH RESEND V3 net-next 2/3] rds: deliver zerocopy completion notification with data

2018-02-27 Thread Sowmini Varadhan
This commit is an optimization over commit 01883eda72bd
("rds: support for zcopy completion notification") for PF_RDS sockets.

RDS applications are predominantly request-response transactions, so
it is more efficient to reduce the number of system calls and have
zerocopy completion notification delivered as ancillary data on the
POLLIN channel.

Cookies are passed up as ancillary data (at level SOL_RDS) in a
struct rds_zcopy_cookies when the returned value of recvmsg() is
greater than, or equal to, 0. A max of RDS_MAX_ZCOOKIES may be passed
with each message.

This commit removes support for zerocopy completion notification on
MSG_ERRQUEUE for PF_RDS sockets.

Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
Acked-by: Willem de Bruijn <will...@google.com>
Acked-by: Santosh Shilimkar <santosh.shilim...@oracle.com>
---
v2: remove sk_error_queue path; lots of cautionary checks in
rds_recvmsg_zcookie() and callers to make sure we don't remove cookies
from the queue and then fail to pass them up to the caller
v3:
  - bounds check on skb->cb to make sure there is enough room for
    struct rds_zcopy_cookies as well as the rds_znotifier;
  - refactor cautionary checks in rds_recvmsg_zcookie: if no msg_control
    has been passed, or if there is not enough msg_controllen for a
    rds_zcopy_cookies, return silently (do not return an error, as the
    caller may have wanted other ancillary data which may happen to fit
    in the space provided)
  - return bool from rds_recvmsg_zcookie, some other code cleanup

 include/uapi/linux/errqueue.h |2 --
 include/uapi/linux/rds.h  |7 +++
 net/rds/af_rds.c  |7 +--
 net/rds/message.c |   38 --
 net/rds/rds.h |2 ++
 net/rds/recv.c|   31 ++-
 6 files changed, 60 insertions(+), 27 deletions(-)

diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h
index 28812ed..dc64cfa 100644
--- a/include/uapi/linux/errqueue.h
+++ b/include/uapi/linux/errqueue.h
@@ -20,13 +20,11 @@ struct sock_extended_err {
 #define SO_EE_ORIGIN_ICMP6 3
 #define SO_EE_ORIGIN_TXSTATUS  4
 #define SO_EE_ORIGIN_ZEROCOPY  5
-#define SO_EE_ORIGIN_ZCOOKIE   6
 #define SO_EE_ORIGIN_TIMESTAMPING SO_EE_ORIGIN_TXSTATUS
 
 #define SO_EE_OFFENDER(ee) ((struct sockaddr*)((ee)+1))
 
 #define SO_EE_CODE_ZEROCOPY_COPIED 1
-#defineSO_EE_ORIGIN_MAX_ZCOOKIES   8
 
 /**
  * struct scm_timestamping - timestamps exposed through cmsg
diff --git a/include/uapi/linux/rds.h b/include/uapi/linux/rds.h
index 12e3bca..a66b213 100644
--- a/include/uapi/linux/rds.h
+++ b/include/uapi/linux/rds.h
@@ -104,6 +104,7 @@
 #define RDS_CMSG_MASKED_ATOMIC_CSWP9
 #define RDS_CMSG_RXPATH_LATENCY11
 #defineRDS_CMSG_ZCOPY_COOKIE   12
+#defineRDS_CMSG_ZCOPY_COMPLETION   13
 
 #define RDS_INFO_FIRST 1
 #define RDS_INFO_COUNTERS  1
@@ -317,6 +318,12 @@ struct rds_rdma_notify {
 #define RDS_RDMA_DROPPED   3
 #define RDS_RDMA_OTHER_ERROR   4
 
+#defineRDS_MAX_ZCOOKIES8
+struct rds_zcopy_cookies {
+   __u32 num;
+   __u32 cookies[RDS_MAX_ZCOOKIES];
+};
+
 /*
  * Common set of flags for all RDMA related structs
  */
diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index a937f18..f712610 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -77,6 +77,7 @@ static int rds_release(struct socket *sock)
rds_send_drop_to(rs, NULL);
rds_rdma_drop_keys(rs);
rds_notify_queue_get(rs, NULL);
+   __skb_queue_purge(>rs_zcookie_queue);
 
spin_lock_bh(_sock_lock);
list_del_init(>rs_item);
@@ -144,7 +145,7 @@ static int rds_getname(struct socket *sock, struct sockaddr 
*uaddr,
  *  -  to signal that a previously congested destination may have become
  * uncongested
  *  -  A notification has been queued to the socket (this can be a congestion
- * update, or a RDMA completion).
+ * update, or a RDMA completion, or a MSG_ZEROCOPY completion).
  *
  * EPOLLOUT is asserted if there is room on the send queue. This does not mean
  * however, that the next sendmsg() call will succeed. If the application tries
@@ -178,7 +179,8 @@ static __poll_t rds_poll(struct file *file, struct socket 
*sock,
spin_unlock(>rs_lock);
}
if (!list_empty(>rs_recv_queue) ||
-   !list_empty(>rs_notify_queue))
+   !list_empty(>rs_notify_queue) ||
+   !skb_queue_empty(>rs_zcookie_queue))
mask |= (EPOLLIN | EPOLLRDNORM);
if (rs->rs_snd_bytes < rds_sk_sndbuf(rs))
mask |= (EPOLLOUT | EPOLLWRNORM);
@@ -513,6 +515,7 @@ static int __rds_create(struct socket *sock, struct sock 
*sk, int protocol)
INIT_LIST_HEAD(>rs_recv_queue);
INIT_LIST_HEAD(>rs_notify_queue);
  

Re: [PATCH net-next 0/2] ntuple filters with RSS

2018-02-27 Thread Sowmini Varadhan
On (02/27/18 12:40), David Miller wrote:
> 
> I'm expecting a V3 respin of your zerocopy notification changes
> if that is what you're talking about, and I therefore marked
> the most recent spin as "changes requested".

sorry, I'm confused - you are waiting for V4?

I am not seeing v3 on patchwork; the only thing I see
is V2 (with "changes requested", as I expected).

V3 is archived here, and afaik needs no more respins:
 https://www.spinics.net/lists/netdev/msg485282.html

Willem had responded with
" For the series:
  Acked-by Willem de Bruijn 
  Small item I missed in v2: the recvmsg in patch 3 should fail hard with
  error() on an unexpected errno. Not worth sending a v4 just for that."

(see https://www.spinics.net/lists/netdev/msg485418.html)

Santosh had incorrectly flagged a smatch non-issue:
  https://www.spinics.net/lists/netdev/msg485424.html

I resent my patch a few minutes ago, but I suspect I may
now be hitting this well-known patchwork bug:
 https://www.spinics.net/lists/sparclinux/msg13787.html

Do I need to do something?

--Sowmini







Re: [PATCH net-next 0/2] ntuple filters with RSS

2018-02-27 Thread Sowmini Varadhan
On (02/27/18 11:49), David Miller wrote:
> > Do I need to resend?
> 
> Yes, see my other email.

Do we need to resend patches that are not showing up in patchwork?
I recall seeing email about picking things up manually from the inbox,
but I missed this.

--Sowmini



Re: [PATCH V3 net-next 2/3] rds: deliver zerocopy completion notification with data

2018-02-26 Thread Sowmini Varadhan
On (02/26/18 09:07), Santosh Shilimkar wrote:
> Just in case you haven't seen yet, Dan Carpenter reported skb deref
> warning on previous version of the patch. Not sure why it wasn't sent
> on netdev.

yes, I saw it, but that was for the previous version of the patch,
around code that delivers notifications on the sk_error_queue.

This patch series removes the sk_error_queue support, so the
smatch warning is not applicable after this patch.

On a different note, for some odd reason I'm not seeing this patch series
on the patch queue, though it's showing up in the archives.

--Sowmini


[PATCH V3 net-next 1/3] selftests/net: revert the zerocopy Rx path for PF_RDS

2018-02-25 Thread Sowmini Varadhan
In preparation for optimized reception of zerocopy completion,
revert the Rx side changes introduced by Commit dfb8434b0a94
("selftests/net: add zerocopy support for PF_RDS test case")

Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
v2: prepare to remove sk_error_queue based path; remove recvmsg() as well,
PF_RDS can also use recv() for the usage pattern in msg_zerocopy

 tools/testing/selftests/net/msg_zerocopy.c |   67 
 1 files changed, 0 insertions(+), 67 deletions(-)

diff --git a/tools/testing/selftests/net/msg_zerocopy.c 
b/tools/testing/selftests/net/msg_zerocopy.c
index 5cc2a53..eff9cf2 100644
--- a/tools/testing/selftests/net/msg_zerocopy.c
+++ b/tools/testing/selftests/net/msg_zerocopy.c
@@ -344,26 +344,6 @@ static int do_setup_tx(int domain, int type, int protocol)
return fd;
 }
 
-static int do_process_zerocopy_cookies(struct sock_extended_err *serr,
-  uint32_t *ckbuf, size_t nbytes)
-{
-   int ncookies, i;
-
-   if (serr->ee_errno != 0)
-   error(1, 0, "serr: wrong error code: %u", serr->ee_errno);
-   ncookies = serr->ee_data;
-   if (ncookies > SO_EE_ORIGIN_MAX_ZCOOKIES)
-   error(1, 0, "Returned %d cookies, max expected %d\n",
- ncookies, SO_EE_ORIGIN_MAX_ZCOOKIES);
-   if (nbytes != ncookies * sizeof(uint32_t))
-   error(1, 0, "Expected %d cookies, got %ld\n",
- ncookies, nbytes/sizeof(uint32_t));
-   for (i = 0; i < ncookies; i++)
-   if (cfg_verbose >= 2)
-   fprintf(stderr, "%d\n", ckbuf[i]);
-   return ncookies;
-}
-
 static bool do_recv_completion(int fd)
 {
struct sock_extended_err *serr;
@@ -372,17 +352,10 @@ static bool do_recv_completion(int fd)
uint32_t hi, lo, range;
int ret, zerocopy;
char control[100];
-   uint32_t ckbuf[SO_EE_ORIGIN_MAX_ZCOOKIES];
-   struct iovec iov;
 
msg.msg_control = control;
msg.msg_controllen = sizeof(control);
 
-   iov.iov_base = ckbuf;
-   iov.iov_len = (SO_EE_ORIGIN_MAX_ZCOOKIES * sizeof(ckbuf[0]));
-   msg.msg_iov = 
-   msg.msg_iovlen = 1;
-
ret = recvmsg(fd, , MSG_ERRQUEUE);
if (ret == -1 && errno == EAGAIN)
return false;
@@ -402,10 +375,6 @@ static bool do_recv_completion(int fd)
 
serr = (void *) CMSG_DATA(cm);
 
-   if (serr->ee_origin == SO_EE_ORIGIN_ZCOOKIE) {
-   completions += do_process_zerocopy_cookies(serr, ckbuf, ret);
-   return true;
-   }
if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
error(1, 0, "serr: wrong origin: %u", serr->ee_origin);
if (serr->ee_errno != 0)
@@ -631,40 +600,6 @@ static void do_flush_datagram(int fd, int type)
bytes += cfg_payload_len;
 }
 
-
-static void do_recvmsg(int fd)
-{
-   int ret, off = 0;
-   char *buf;
-   struct iovec iov;
-   struct msghdr msg;
-   struct sockaddr_storage din;
-
-   buf = calloc(cfg_payload_len, sizeof(char));
-   iov.iov_base = buf;
-   iov.iov_len = cfg_payload_len;
-
-   memset(, 0, sizeof(msg));
-   msg.msg_name = 
-   msg.msg_namelen = sizeof(din);
-   msg.msg_iov = 
-   msg.msg_iovlen = 1;
-
-   ret = recvmsg(fd, , MSG_TRUNC);
-
-   if (ret == -1)
-   error(1, errno, "recv");
-   if (ret != cfg_payload_len)
-   error(1, 0, "recv: ret=%u != %u", ret, cfg_payload_len);
-
-   if (memcmp(buf + off, payload, ret))
-   error(1, 0, "recv: data mismatch");
-
-   free(buf);
-   packets++;
-   bytes += cfg_payload_len;
-}
-
 static void do_rx(int domain, int type, int protocol)
 {
uint64_t tstop;
@@ -676,8 +611,6 @@ static void do_rx(int domain, int type, int protocol)
do {
if (type == SOCK_STREAM)
do_flush_tcp(fd);
-   else if (domain == PF_RDS)
-   do_recvmsg(fd);
else
do_flush_datagram(fd, type);
 
-- 
1.7.1



[PATCH V3 net-next 3/3] selftests/net: reap zerocopy completions passed up as ancillary data.

2018-02-25 Thread Sowmini Varadhan
PF_RDS sockets pass up cookies for zerocopy completion as ancillary
data. Update msg_zerocopy to reap this information.

Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
v2: receive zerocopy completion notification as POLLIN
v3: drop the ncookies arg in do_process_zerocopy_cookies; reverse christmas
tree declarations; check for MSG_CTRUNC; print a warning when encountering
ignored cmsghdrs in do_recvmsg_completion

 tools/testing/selftests/net/msg_zerocopy.c |   65 ---
 1 files changed, 57 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/net/msg_zerocopy.c 
b/tools/testing/selftests/net/msg_zerocopy.c
index eff9cf2..406cc70 100644
--- a/tools/testing/selftests/net/msg_zerocopy.c
+++ b/tools/testing/selftests/net/msg_zerocopy.c
@@ -344,7 +344,53 @@ static int do_setup_tx(int domain, int type, int protocol)
return fd;
 }
 
-static bool do_recv_completion(int fd)
+static uint32_t do_process_zerocopy_cookies(struct rds_zcopy_cookies *ck)
+{
+   int i;
+
+   if (ck->num > RDS_MAX_ZCOOKIES)
+   error(1, 0, "Returned %d cookies, max expected %d\n",
+ ck->num, RDS_MAX_ZCOOKIES);
+   for (i = 0; i < ck->num; i++)
+   if (cfg_verbose >= 2)
+   fprintf(stderr, "%d\n", ck->cookies[i]);
+   return ck->num;
+}
+
+static bool do_recvmsg_completion(int fd)
+{
+   char cmsgbuf[CMSG_SPACE(sizeof(struct rds_zcopy_cookies))];
+   struct rds_zcopy_cookies *ck;
+   struct cmsghdr *cmsg;
+   struct msghdr msg;
+   bool ret = false;
+
+   memset(, 0, sizeof(msg));
+   msg.msg_control = cmsgbuf;
+   msg.msg_controllen = sizeof(cmsgbuf);
+
+   if (recvmsg(fd, , MSG_DONTWAIT))
+   return ret;
+
+   if (msg.msg_flags & MSG_CTRUNC)
+   error(1, errno, "recvmsg notification: truncated");
+
+   for (cmsg = CMSG_FIRSTHDR(); cmsg; cmsg = CMSG_NXTHDR(, cmsg)) {
+   if (cmsg->cmsg_level == SOL_RDS &&
+   cmsg->cmsg_type == RDS_CMSG_ZCOPY_COMPLETION) {
+
+   ck = (struct rds_zcopy_cookies *)CMSG_DATA(cmsg);
+   completions += do_process_zerocopy_cookies(ck);
+   ret = true;
+   break;
+   }
+   error(0, 0, "ignoring cmsg at level %d type %d\n",
+   cmsg->cmsg_level, cmsg->cmsg_type);
+   }
+   return ret;
+}
+
+static bool do_recv_completion(int fd, int domain)
 {
struct sock_extended_err *serr;
struct msghdr msg = {};
@@ -353,6 +399,9 @@ static bool do_recv_completion(int fd)
int ret, zerocopy;
char control[100];
 
+   if (domain == PF_RDS)
+   return do_recvmsg_completion(fd);
+
msg.msg_control = control;
msg.msg_controllen = sizeof(control);
 
@@ -409,20 +458,20 @@ static bool do_recv_completion(int fd)
 }
 
 /* Read all outstanding messages on the errqueue */
-static void do_recv_completions(int fd)
+static void do_recv_completions(int fd, int domain)
 {
-   while (do_recv_completion(fd)) {}
+   while (do_recv_completion(fd, domain)) {}
 }
 
 /* Wait for all remaining completions on the errqueue */
-static void do_recv_remaining_completions(int fd)
+static void do_recv_remaining_completions(int fd, int domain)
 {
int64_t tstop = gettimeofday_ms() + cfg_waittime_ms;
 
while (completions < expected_completions &&
   gettimeofday_ms() < tstop) {
-   if (do_poll(fd, POLLERR))
-   do_recv_completions(fd);
+   if (do_poll(fd, domain == PF_RDS ? POLLIN : POLLERR))
+   do_recv_completions(fd, domain);
}
 
if (completions < expected_completions)
@@ -503,13 +552,13 @@ static void do_tx(int domain, int type, int protocol)
 
while (!do_poll(fd, POLLOUT)) {
if (cfg_zerocopy)
-   do_recv_completions(fd);
+   do_recv_completions(fd, domain);
}
 
} while (gettimeofday_ms() < tstop);
 
if (cfg_zerocopy)
-   do_recv_remaining_completions(fd);
+   do_recv_remaining_completions(fd, domain);
 
if (close(fd))
error(1, errno, "close");
-- 
1.7.1



[PATCH V3 net-next 2/3] rds: deliver zerocopy completion notification with data

2018-02-25 Thread Sowmini Varadhan
This commit is an optimization over commit 01883eda72bd
("rds: support for zcopy completion notification") for PF_RDS sockets.

RDS applications are predominantly request-response transactions, so
it is more efficient to reduce the number of system calls and have
zerocopy completion notification delivered as ancillary data on the
POLLIN channel.

Cookies are passed up as ancillary data (at level SOL_RDS) in a
struct rds_zcopy_cookies when the returned value of recvmsg() is
greater than, or equal to, 0. A max of RDS_MAX_ZCOOKIES may be passed
with each message.

This commit removes support for zerocopy completion notification on
MSG_ERRQUEUE for PF_RDS sockets.

Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
v2: remove sk_error_queue path; lots of cautionary checks in
rds_recvmsg_zcookie() and callers to make sure we don't remove cookies
from the queue and then fail to pass them up to the caller
v3:
  - bounds check on skb->cb to make sure there is enough room for
    struct rds_zcopy_cookies as well as the rds_znotifier;
  - refactor cautionary checks in rds_recvmsg_zcookie: if no msg_control
    has been passed, or if there is not enough msg_controllen for a
    rds_zcopy_cookies, return silently (do not return an error, as the
    caller may have wanted other ancillary data which may happen to fit
    in the space provided)
  - return bool from rds_recvmsg_zcookie, some other code cleanup

 include/uapi/linux/errqueue.h |2 --
 include/uapi/linux/rds.h  |7 +++
 net/rds/af_rds.c  |7 +--
 net/rds/message.c |   38 --
 net/rds/rds.h |2 ++
 net/rds/recv.c|   31 ++-
 6 files changed, 60 insertions(+), 27 deletions(-)

diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h
index 28812ed..dc64cfa 100644
--- a/include/uapi/linux/errqueue.h
+++ b/include/uapi/linux/errqueue.h
@@ -20,13 +20,11 @@ struct sock_extended_err {
 #define SO_EE_ORIGIN_ICMP6 3
 #define SO_EE_ORIGIN_TXSTATUS  4
 #define SO_EE_ORIGIN_ZEROCOPY  5
-#define SO_EE_ORIGIN_ZCOOKIE   6
 #define SO_EE_ORIGIN_TIMESTAMPING SO_EE_ORIGIN_TXSTATUS
 
 #define SO_EE_OFFENDER(ee) ((struct sockaddr*)((ee)+1))
 
 #define SO_EE_CODE_ZEROCOPY_COPIED 1
-#defineSO_EE_ORIGIN_MAX_ZCOOKIES   8
 
 /**
  * struct scm_timestamping - timestamps exposed through cmsg
diff --git a/include/uapi/linux/rds.h b/include/uapi/linux/rds.h
index 12e3bca..a66b213 100644
--- a/include/uapi/linux/rds.h
+++ b/include/uapi/linux/rds.h
@@ -104,6 +104,7 @@
 #define RDS_CMSG_MASKED_ATOMIC_CSWP9
 #define RDS_CMSG_RXPATH_LATENCY11
 #defineRDS_CMSG_ZCOPY_COOKIE   12
+#defineRDS_CMSG_ZCOPY_COMPLETION   13
 
 #define RDS_INFO_FIRST 1
 #define RDS_INFO_COUNTERS  1
@@ -317,6 +318,12 @@ struct rds_rdma_notify {
 #define RDS_RDMA_DROPPED   3
 #define RDS_RDMA_OTHER_ERROR   4
 
+#defineRDS_MAX_ZCOOKIES8
+struct rds_zcopy_cookies {
+   __u32 num;
+   __u32 cookies[RDS_MAX_ZCOOKIES];
+};
+
 /*
  * Common set of flags for all RDMA related structs
  */
diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index a937f18..f712610 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -77,6 +77,7 @@ static int rds_release(struct socket *sock)
rds_send_drop_to(rs, NULL);
rds_rdma_drop_keys(rs);
rds_notify_queue_get(rs, NULL);
+   __skb_queue_purge(>rs_zcookie_queue);
 
spin_lock_bh(_sock_lock);
list_del_init(>rs_item);
@@ -144,7 +145,7 @@ static int rds_getname(struct socket *sock, struct sockaddr 
*uaddr,
  *  -  to signal that a previously congested destination may have become
  * uncongested
  *  -  A notification has been queued to the socket (this can be a congestion
- * update, or a RDMA completion).
+ * update, or a RDMA completion, or a MSG_ZEROCOPY completion).
  *
  * EPOLLOUT is asserted if there is room on the send queue. This does not mean
  * however, that the next sendmsg() call will succeed. If the application tries
@@ -178,7 +179,8 @@ static __poll_t rds_poll(struct file *file, struct socket 
*sock,
spin_unlock(>rs_lock);
}
if (!list_empty(>rs_recv_queue) ||
-   !list_empty(>rs_notify_queue))
+   !list_empty(>rs_notify_queue) ||
+   !skb_queue_empty(>rs_zcookie_queue))
mask |= (EPOLLIN | EPOLLRDNORM);
if (rs->rs_snd_bytes < rds_sk_sndbuf(rs))
mask |= (EPOLLOUT | EPOLLWRNORM);
@@ -513,6 +515,7 @@ static int __rds_create(struct socket *sock, struct sock 
*sk, int protocol)
INIT_LIST_HEAD(>rs_recv_queue);
INIT_LIST_HEAD(>rs_notify_queue);
INIT_LIST_HEAD(>rs_cong_list);
+   skb_queue_head_init(>rs_zcookie_queue);
spin_lock_init(>

[PATCH V3 net-next 0/3] RDS: optimized notification for zerocopy completion

2018-02-25 Thread Sowmini Varadhan
RDS applications use predominantly request-response, transaction-based
IPC, so that ingress and egress traffic are well-balanced,
and it is possible/desirable to reduce system-call overhead by
piggybacking the notifications for zerocopy completion response
with data.

Moreover, it has been pointed out that socket functions block
if sk_err is non-zero; thus, if the RDS code does not plan/need
to use the sk_error_queue path for completion notification, it
is preferable to remove the sk_error_queue related paths in
RDS.

Both of these goals are implemented in this series.

v2: removed sk_error_queue support
v3: incorporated additional code review comments (details in each patch)

Sowmini Varadhan (3):
  selftests/net: revert the zerocopy Rx path for PF_RDS
  rds: deliver zerocopy completion notification with data
  selftests/net: reap zerocopy completions passed up as ancillary data.

 include/uapi/linux/errqueue.h  |2 -
 include/uapi/linux/rds.h   |7 ++
 net/rds/af_rds.c   |7 +-
 net/rds/message.c  |   38 -
 net/rds/rds.h  |2 +
 net/rds/recv.c |   31 +++-
 tools/testing/selftests/net/msg_zerocopy.c |  120 
 7 files changed, 111 insertions(+), 96 deletions(-)



Re: [PATCH V2 net-next 2/3] rds: deliver zerocopy completion notification with data

2018-02-25 Thread Sowmini Varadhan
On (02/25/18 10:56), Willem de Bruijn wrote:
> > @@ -91,22 +85,19 @@ static void rds_rm_zerocopy_callback(struct rds_sock 
> > *rs,
> > spin_unlock_irqrestore(>lock, flags);
> > mm_unaccount_pinned_pages(>z_mmp);
> > consume_skb(rds_skb_from_znotifier(znotif));
> > -   sk->sk_error_report(sk);
> > +   /* caller should wake up POLLIN */
> 
> sk->sk_data_ready(sk);

yes, this was my first thought, but everything else in rds
is calling rds_wake_sk_sleep (this is even done in
rds_recv_incoming(), which actually queues up the data),
so I chose to align with that model (and call this in the caller
of rds_rm_zerocopy_callback()).

> Without the error queue, the struct no longer needs to be an skb,
> per se. Converting to a different struct with list_head is definitely
> a longer patch. But kmalloc will be cheaper than alloc_skb.
> Perhaps something to try (as separate follow-on work).

right, I was thinking along these exact lines as well,
and was already planning a follow-up.
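
The rough shape I have in mind for that follow-up is something like the
hypothetical sketch below (field names illustrative, pending the actual
patch): a kmalloc'd entry carrying the cookies plus a list_head, hanging
off a small spinlock-protected per-socket queue, instead of an skb:

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/types.h>

#define EXAMPLE_MAX_ZCOOKIES	8	/* mirrors RDS_MAX_ZCOOKIES */

struct example_zcopy_entry {		/* hypothetical, kmalloc'd (no skb) */
	struct list_head node;
	u32 num;
	u32 cookies[EXAMPLE_MAX_ZCOOKIES];
};

struct example_zcopy_queue {		/* hypothetical per-socket queue */
	struct list_head head;
	spinlock_t lock;
};

static void example_zcopy_enqueue(struct example_zcopy_queue *q,
				  struct example_zcopy_entry *e)
{
	unsigned long flags;

	spin_lock_irqsave(&q->lock, flags);
	list_add_tail(&e->node, &q->head);
	spin_unlock_irqrestore(&q->lock, flags);
}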

> > +   if (!sock_flag(rds_rs_to_sk(rs), SOCK_ZEROCOPY) || !skb_peek(q))
> > +   return 0;
> 
> Racy read?

Can you elaborate? I only put the skb_peek there to quickly
bail for sockets that are not using zerocopy at all.
If you race against something that's queuing data, and
miss it on the peek, the next read/recv should find it.
Am I missing some race?


> 
> > +
> > +   if (!msg->msg_control ||
> 
> I'd move this first, so that the cookie queue need not even be probed
> in the common case.

you mean before the check for SOCK_ZEROCOPY?

> > +   msg->msg_controllen < CMSG_SPACE(sizeof(*done)))
> > +   return 0;
> 
> if caller does not satisfy the contract on controllen size, can be
> more explicit and return an error.

if SOCK_ZEROCOPY has been set, but the recv did not specify a cmsghdr,
you mean?

> > +   ncookies = rds_recvmsg_zcookie(rs, msg);

Will take care of the remaining comments in V3.


[PATCH V2 net-next 0/3] RDS: optimized notification for zerocopy completion

2018-02-23 Thread Sowmini Varadhan
RDS applications predominantly use request-response, transaction-based
IPC, so ingress and egress traffic are well-balanced, and it is
possible/desirable to reduce system-call overhead by piggybacking the
notifications for zerocopy completion with the response data.

Moreover, it has been pointed out that socket functions block if
sk_err is non-zero; thus, if the RDS code does not plan/need to use
the sk_error_queue path for completion notification, it is preferable
to remove the sk_error_queue related paths in RDS.

Both of these goals are implemented in this series.

Sowmini Varadhan (3):
  selftests/net: revert the zerocopy Rx path for PF_RDS
  rds: deliver zerocopy completion notification with data
  selftests/net: reap zerocopy completions passed up as ancillary data.

 include/uapi/linux/errqueue.h  |2 -
 include/uapi/linux/rds.h   |7 ++
 net/rds/af_rds.c   |7 ++-
 net/rds/message.c  |   35 --
 net/rds/rds.h  |2 +
 net/rds/recv.c |   34 +-
 tools/testing/selftests/net/msg_zerocopy.c |  109 +++-
 7 files changed, 103 insertions(+), 93 deletions(-)



[PATCH V2 net-next 3/3] selftests/net: reap zerocopy completions passed up as ancillary data.

2018-02-23 Thread Sowmini Varadhan
PF_RDS sockets pass up cookies for zerocopy completion as ancillary
data. Update msg_zerocopy to reap this information.

Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
v2: receive zerocopy completion notification as POLLIN

 tools/testing/selftests/net/msg_zerocopy.c |   60 
 1 files changed, 52 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/net/msg_zerocopy.c b/tools/testing/selftests/net/msg_zerocopy.c
index eff9cf2..8c466e8 100644
--- a/tools/testing/selftests/net/msg_zerocopy.c
+++ b/tools/testing/selftests/net/msg_zerocopy.c
@@ -344,7 +344,48 @@ static int do_setup_tx(int domain, int type, int protocol)
return fd;
 }
 
-static bool do_recv_completion(int fd)
+static int do_process_zerocopy_cookies(struct rds_zcopy_cookies *ck)
+{
+   int ncookies, i;
+
+   ncookies = ck->num;
+   if (ncookies > RDS_MAX_ZCOOKIES)
+   error(1, 0, "Returned %d cookies, max expected %d\n",
+ ncookies, RDS_MAX_ZCOOKIES);
+   for (i = 0; i < ncookies; i++)
+   if (cfg_verbose >= 2)
+   fprintf(stderr, "%d\n", ck->cookies[i]);
+   return ncookies;
+}
+
+static int do_recvmsg_completion(int fd)
+{
+   struct msghdr msg;
+   char cmsgbuf[256];
+   struct cmsghdr *cmsg;
+   bool ret = false;
+   struct rds_zcopy_cookies *ck;
+
+   memset(&msg, 0, sizeof(msg));
+   msg.msg_control = cmsgbuf;
+   msg.msg_controllen = sizeof(cmsgbuf);
+
+   if (recvmsg(fd, &msg, MSG_DONTWAIT))
+   return ret;
+   for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
+   if (cmsg->cmsg_level == SOL_RDS &&
+   cmsg->cmsg_type == RDS_CMSG_ZCOPY_COMPLETION) {
+
+   ck = (struct rds_zcopy_cookies *)CMSG_DATA(cmsg);
+   completions += do_process_zerocopy_cookies(ck);
+   ret = true;
+   break;
+   }
+   }
+   return ret;
+}
+
+static bool do_recv_completion(int fd, int domain)
 {
struct sock_extended_err *serr;
struct msghdr msg = {};
@@ -353,6 +394,9 @@ static bool do_recv_completion(int fd)
int ret, zerocopy;
char control[100];
 
+   if (domain == PF_RDS)
+   return do_recvmsg_completion(fd);
+
msg.msg_control = control;
msg.msg_controllen = sizeof(control);
 
@@ -409,20 +453,20 @@ static bool do_recv_completion(int fd)
 }
 
 /* Read all outstanding messages on the errqueue */
-static void do_recv_completions(int fd)
+static void do_recv_completions(int fd, int domain)
 {
-   while (do_recv_completion(fd)) {}
+   while (do_recv_completion(fd, domain)) {}
 }
 
 /* Wait for all remaining completions on the errqueue */
-static void do_recv_remaining_completions(int fd)
+static void do_recv_remaining_completions(int fd, int domain)
 {
int64_t tstop = gettimeofday_ms() + cfg_waittime_ms;
 
while (completions < expected_completions &&
   gettimeofday_ms() < tstop) {
-   if (do_poll(fd, POLLERR))
-   do_recv_completions(fd);
+   if (do_poll(fd, domain == PF_RDS ? POLLIN : POLLERR))
+   do_recv_completions(fd, domain);
}
 
if (completions < expected_completions)
@@ -503,13 +547,13 @@ static void do_tx(int domain, int type, int protocol)
 
while (!do_poll(fd, POLLOUT)) {
if (cfg_zerocopy)
-   do_recv_completions(fd);
+   do_recv_completions(fd, domain);
}
 
} while (gettimeofday_ms() < tstop);
 
if (cfg_zerocopy)
-   do_recv_remaining_completions(fd);
+   do_recv_remaining_completions(fd, domain);
 
if (close(fd))
error(1, errno, "close");
-- 
1.7.1



[PATCH V2 net-next 1/3] selftests/net: revert the zerocopy Rx path for PF_RDS

2018-02-23 Thread Sowmini Varadhan
In preparation for optimized reception of zerocopy completion,
revert the Rx side changes introduced by Commit dfb8434b0a94
("selftests/net: add zerocopy support for PF_RDS test case")

Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
v2: prepare to remove the sk_error_queue based path; remove the recvmsg()
helper as well, since PF_RDS can also use recv() for the usage pattern
in msg_zerocopy

 tools/testing/selftests/net/msg_zerocopy.c |   67 
 1 files changed, 0 insertions(+), 67 deletions(-)

diff --git a/tools/testing/selftests/net/msg_zerocopy.c b/tools/testing/selftests/net/msg_zerocopy.c
index 5cc2a53..eff9cf2 100644
--- a/tools/testing/selftests/net/msg_zerocopy.c
+++ b/tools/testing/selftests/net/msg_zerocopy.c
@@ -344,26 +344,6 @@ static int do_setup_tx(int domain, int type, int protocol)
return fd;
 }
 
-static int do_process_zerocopy_cookies(struct sock_extended_err *serr,
-  uint32_t *ckbuf, size_t nbytes)
-{
-   int ncookies, i;
-
-   if (serr->ee_errno != 0)
-   error(1, 0, "serr: wrong error code: %u", serr->ee_errno);
-   ncookies = serr->ee_data;
-   if (ncookies > SO_EE_ORIGIN_MAX_ZCOOKIES)
-   error(1, 0, "Returned %d cookies, max expected %d\n",
- ncookies, SO_EE_ORIGIN_MAX_ZCOOKIES);
-   if (nbytes != ncookies * sizeof(uint32_t))
-   error(1, 0, "Expected %d cookies, got %ld\n",
- ncookies, nbytes/sizeof(uint32_t));
-   for (i = 0; i < ncookies; i++)
-   if (cfg_verbose >= 2)
-   fprintf(stderr, "%d\n", ckbuf[i]);
-   return ncookies;
-}
-
 static bool do_recv_completion(int fd)
 {
struct sock_extended_err *serr;
@@ -372,17 +352,10 @@ static bool do_recv_completion(int fd)
uint32_t hi, lo, range;
int ret, zerocopy;
char control[100];
-   uint32_t ckbuf[SO_EE_ORIGIN_MAX_ZCOOKIES];
-   struct iovec iov;
 
msg.msg_control = control;
msg.msg_controllen = sizeof(control);
 
-   iov.iov_base = ckbuf;
-   iov.iov_len = (SO_EE_ORIGIN_MAX_ZCOOKIES * sizeof(ckbuf[0]));
-   msg.msg_iov = &iov;
-   msg.msg_iovlen = 1;
-
ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
if (ret == -1 && errno == EAGAIN)
return false;
@@ -402,10 +375,6 @@ static bool do_recv_completion(int fd)
 
serr = (void *) CMSG_DATA(cm);
 
-   if (serr->ee_origin == SO_EE_ORIGIN_ZCOOKIE) {
-   completions += do_process_zerocopy_cookies(serr, ckbuf, ret);
-   return true;
-   }
if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
error(1, 0, "serr: wrong origin: %u", serr->ee_origin);
if (serr->ee_errno != 0)
@@ -631,40 +600,6 @@ static void do_flush_datagram(int fd, int type)
bytes += cfg_payload_len;
 }
 
-
-static void do_recvmsg(int fd)
-{
-   int ret, off = 0;
-   char *buf;
-   struct iovec iov;
-   struct msghdr msg;
-   struct sockaddr_storage din;
-
-   buf = calloc(cfg_payload_len, sizeof(char));
-   iov.iov_base = buf;
-   iov.iov_len = cfg_payload_len;
-
-   memset(&msg, 0, sizeof(msg));
-   msg.msg_name = &din;
-   msg.msg_namelen = sizeof(din);
-   msg.msg_iov = &iov;
-   msg.msg_iovlen = 1;
-
-   ret = recvmsg(fd, &msg, MSG_TRUNC);
-
-   if (ret == -1)
-   error(1, errno, "recv");
-   if (ret != cfg_payload_len)
-   error(1, 0, "recv: ret=%u != %u", ret, cfg_payload_len);
-
-   if (memcmp(buf + off, payload, ret))
-   error(1, 0, "recv: data mismatch");
-
-   free(buf);
-   packets++;
-   bytes += cfg_payload_len;
-}
-
 static void do_rx(int domain, int type, int protocol)
 {
uint64_t tstop;
@@ -676,8 +611,6 @@ static void do_rx(int domain, int type, int protocol)
do {
if (type == SOCK_STREAM)
do_flush_tcp(fd);
-   else if (domain == PF_RDS)
-   do_recvmsg(fd);
else
do_flush_datagram(fd, type);
 
-- 
1.7.1



[PATCH V2 net-next 2/3] rds: deliver zerocopy completion notification with data

2018-02-23 Thread Sowmini Varadhan
This commit is an optimization of the commit 01883eda72bd
("rds: support for zcopy completion notification") for PF_RDS sockets.

RDS applications are predominantly request-response transactions, so
it is more efficient to reduce the number of system calls and have
zerocopy completion notification delivered as ancillary data on the
POLLIN channel.

Cookies are passed up as ancillary data (at level SOL_RDS) in a
struct rds_zcopy_cookies when the returned value of recvmsg() is
greater than, or equal to, 0. A max of RDS_MAX_ZCOOKIES may be passed
with each message.

This commit removes support for zerocopy completion notification on
MSG_ERRQUEUE for PF_RDS sockets.

Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
v2: remove the sk_error_queue path; add cautionary checks in
rds_recvmsg_zcookie() and its callers to make sure we don't remove
cookies from the queue and then fail to pass them up to the caller

 include/uapi/linux/errqueue.h |2 --
 include/uapi/linux/rds.h  |7 +++
 net/rds/af_rds.c  |7 +--
 net/rds/message.c |   35 +--
 net/rds/rds.h |2 ++
 net/rds/recv.c|   34 +-
 6 files changed, 60 insertions(+), 27 deletions(-)

diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h
index 28812ed..dc64cfa 100644
--- a/include/uapi/linux/errqueue.h
+++ b/include/uapi/linux/errqueue.h
@@ -20,13 +20,11 @@ struct sock_extended_err {
 #define SO_EE_ORIGIN_ICMP6 3
 #define SO_EE_ORIGIN_TXSTATUS  4
 #define SO_EE_ORIGIN_ZEROCOPY  5
-#define SO_EE_ORIGIN_ZCOOKIE   6
 #define SO_EE_ORIGIN_TIMESTAMPING SO_EE_ORIGIN_TXSTATUS
 
 #define SO_EE_OFFENDER(ee) ((struct sockaddr*)((ee)+1))
 
 #define SO_EE_CODE_ZEROCOPY_COPIED 1
-#define SO_EE_ORIGIN_MAX_ZCOOKIES   8
 
 /**
  * struct scm_timestamping - timestamps exposed through cmsg
diff --git a/include/uapi/linux/rds.h b/include/uapi/linux/rds.h
index 12e3bca..a66b213 100644
--- a/include/uapi/linux/rds.h
+++ b/include/uapi/linux/rds.h
@@ -104,6 +104,7 @@
#define RDS_CMSG_MASKED_ATOMIC_CSWP 9
#define RDS_CMSG_RXPATH_LATENCY 11
#define RDS_CMSG_ZCOPY_COOKIE   12
+#define RDS_CMSG_ZCOPY_COMPLETION   13
 
 #define RDS_INFO_FIRST 1
 #define RDS_INFO_COUNTERS  1
@@ -317,6 +318,12 @@ struct rds_rdma_notify {
 #define RDS_RDMA_DROPPED   3
 #define RDS_RDMA_OTHER_ERROR   4
 
+#define RDS_MAX_ZCOOKIES 8
+struct rds_zcopy_cookies {
+   __u32 num;
+   __u32 cookies[RDS_MAX_ZCOOKIES];
+};
+
 /*
  * Common set of flags for all RDMA related structs
  */
diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index a937f18..f712610 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -77,6 +77,7 @@ static int rds_release(struct socket *sock)
rds_send_drop_to(rs, NULL);
rds_rdma_drop_keys(rs);
rds_notify_queue_get(rs, NULL);
+   __skb_queue_purge(&rs->rs_zcookie_queue);
 
spin_lock_bh(&rds_sock_lock);
list_del_init(&rs->rs_item);
@@ -144,7 +145,7 @@ static int rds_getname(struct socket *sock, struct sockaddr *uaddr,
  *  -  to signal that a previously congested destination may have become
  * uncongested
  *  -  A notification has been queued to the socket (this can be a congestion
- * update, or a RDMA completion).
+ * update, or a RDMA completion, or a MSG_ZEROCOPY completion).
  *
  * EPOLLOUT is asserted if there is room on the send queue. This does not mean
  * however, that the next sendmsg() call will succeed. If the application tries
@@ -178,7 +179,8 @@ static __poll_t rds_poll(struct file *file, struct socket *sock,
spin_unlock(&rs->rs_lock);
}
if (!list_empty(&rs->rs_recv_queue) ||
-   !list_empty(&rs->rs_notify_queue))
+   !list_empty(&rs->rs_notify_queue) ||
+   !skb_queue_empty(&rs->rs_zcookie_queue))
mask |= (EPOLLIN | EPOLLRDNORM);
if (rs->rs_snd_bytes < rds_sk_sndbuf(rs))
mask |= (EPOLLOUT | EPOLLWRNORM);
@@ -513,6 +515,7 @@ static int __rds_create(struct socket *sock, struct sock *sk, int protocol)
INIT_LIST_HEAD(&rs->rs_recv_queue);
INIT_LIST_HEAD(&rs->rs_notify_queue);
INIT_LIST_HEAD(&rs->rs_cong_list);
+   skb_queue_head_init(&rs->rs_zcookie_queue);
spin_lock_init(&rs->rs_rdma_lock);
rs->rs_rdma_keys = RB_ROOT;
rs->rs_rx_traces = 0;
diff --git a/net/rds/message.c b/net/rds/message.c
index 6518345..2e8bdaf 100644
--- a/net/rds/message.c
+++ b/net/rds/message.c
@@ -58,32 +58,26 @@ void rds_message_addref(struct rds_message *rm)
 
 static inline bool skb_zcookie_add(struct sk_buff *skb, u32 cookie)
 {
-   struct sock_exterr_skb *serr = SKB_EXT_ERR(skb);
-   int ncookies;
-   u32 *ptr;
+   struct rds_zcopy_cookies *ck = (st

[PATCH net-next] rds: rds_msg_zcopy should return error of null rm->data.op_mmp_znotifier

2018-02-22 Thread Sowmini Varadhan
If either or both of MSG_ZEROCOPY and SOCK_ZEROCOPY have not been
specified, the rm->data.op_mmp_znotifier allocation will be skipped.
In this case, it is invalid to pass down a cmsghdr with
RDS_CMSG_ZCOPY_COOKIE, so return EINVAL from rds_msg_zcopy for this
case.
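
For reference, a sketch of the Tx-side usage that exercises this path
(illustration only; socket setup and error handling are minimal, the
function name is made up, and MSG_ZEROCOPY/SO_ZEROCOPY/SOL_RDS may need
local definitions with older userspace headers): the sender enables
SO_ZEROCOPY, passes MSG_ZEROCOPY on sendmsg(), and may then attach a
u32 cookie via RDS_CMSG_ZCOPY_COOKIE.

#include <string.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <linux/rds.h>

static ssize_t send_zcopy_with_cookie(int fd, void *buf, size_t len,
				      uint32_t cookie)
{
	char cbuf[CMSG_SPACE(sizeof(cookie))] = {0};
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = cbuf,
		.msg_controllen = CMSG_SPACE(sizeof(cookie)),
	};
	struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
	int one = 1;

	/* Zerocopy must be enabled on the socket and requested per call;
	 * otherwise attaching the cookie cmsg is invalid.
	 */
	if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
		return -1;

	cm->cmsg_level = SOL_RDS;
	cm->cmsg_type = RDS_CMSG_ZCOPY_COOKIE;
	cm->cmsg_len = CMSG_LEN(sizeof(cookie));
	memcpy(CMSG_DATA(cm), &cookie, sizeof(cookie));

	return sendmsg(fd, &msg, MSG_ZEROCOPY);
}

With the fix above, attaching the cookie cmsg without both flags makes
the kernel return EINVAL rather than touching the NULL
rm->data.op_mmp_znotifier.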

Reported-by: syzbot+f893ae7bb2f6456df...@syzkaller.appspotmail.com
Fixes: 0cebaccef3ac ("rds: zerocopy Tx support.")
Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
 net/rds/send.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/net/rds/send.c b/net/rds/send.c
index 028ab59..c848cbb 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -939,7 +939,8 @@ static int rds_cmsg_zcopy(struct rds_sock *rs, struct rds_message *rm,
 {
u32 *cookie;
 
-   if (cmsg->cmsg_len < CMSG_LEN(sizeof(*cookie)))
+   if (cmsg->cmsg_len < CMSG_LEN(sizeof(*cookie)) ||
+   !rm->data.op_mmp_znotifier)
return -EINVAL;
cookie = CMSG_DATA(cmsg);
rm->data.op_mmp_znotifier->z_cookie = *cookie;
-- 
1.7.1



Re: [PATCH net-next] RDS: deliver zerocopy completion notification with data as an optimization

2018-02-22 Thread Sowmini Varadhan
On (02/21/18 19:39), Willem de Bruijn wrote:
> >> By the way, the put_cmsg is unconditional even if the caller did
> >> not supply msg_control. So it is basically no longer safe to ever
> >> call read, recv or recvfrom on a socket if zerocopy notifications
> >> are outstanding.
> >
> > Wait, I thought put_cmsg already checks for these things.
> 
> It does, and sets MSG_CTRUNC to signal that it was unable to
> write all control data. But by then the notifications have already
> been dequeued.

Putting aside the hyperbole about it being "no longer safe to ever call
read etc",

put_cmsg can also fail with EFAULT if user space provides a bogus
cmsghdr (i.e., copy_to_user fails). So the only way to really protect
against every possible failure is to requeue the notification if
put_cmsg fails.
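
To make that concrete, a condensed sketch of the requeue-on-failure
approach (fragment only, not the merged code; it assumes the
rds_recvmsg_zcookie() context from the patch, with q pointing at the
socket's zcookie queue and the cookies assumed to live in the dequeued
skb's control block):

	skb = skb_dequeue(q);
	if (!skb)
		return 0;
	done = (struct rds_zcopy_cookies *)skb->cb;
	if (put_cmsg(msg, SOL_RDS, RDS_CMSG_ZCOPY_COMPLETION,
		     sizeof(*done), done)) {
		/* could not copy to user space: requeue so the
		 * notification is not lost and a later recvmsg()
		 * can retry.
		 */
		skb_queue_head(q, skb);
		return 0;
	}
	consume_skb(skb);
	return done->num;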


  1   2   3   4   5   6   7   >