Re: ANNOUNCE: Enhanced IP v1.4

2018-06-03 Thread PKU . 孙斌
On Sun, Jun 03, 2018 at 03:41:08PM -0700, Eric Dumazet wrote:
> 
> 
> On 06/03/2018 01:37 PM, Tom Herbert wrote:
> 
> > This is not an inconsequential mechanism that is being proposed. It's
> > a modification to IP protocol that is intended to work on the
> > Internet, but it looks like the draft hasn't been updated for two
> > years and it is not adopted by any IETF working group. I don't see how
> > this can go anywhere without IETF support. Also, I suggest that you
> > look at the IPv10 proposal since that was very similar in intent. One
> > of the reasons that IPv10 was shot down was that protocol transition
> > mechanisms were more interesting ten years ago than today. IPv6 has
> > good traction now. In fact, it's probably the case that it's now
> > easier to bring up IPv6 than to try to make IPv4 options work over the
> > Internet.
> 
> +1
> 
> Many hosts do not use IPv4 anymore.
> 
> We even have a project making IPv4 support in Linux optional.

I guess the Linux kernel then wouldn't be able to boot over the network
without IPv4 built in, e.g., when we only have old L2 links (without the
IPv6 frame type)...





Re: ANNOUNCE: Enhanced IP v1.4

2018-06-03 Thread Willy Tarreau
On Sun, Jun 03, 2018 at 03:41:08PM -0700, Eric Dumazet wrote:
> 
> 
> On 06/03/2018 01:37 PM, Tom Herbert wrote:
> 
> > This is not an inconsequential mechanism that is being proposed. It's
> > a modification to IP protocol that is intended to work on the
> > Internet, but it looks like the draft hasn't been updated for two
> > years and it is not adopted by any IETF working group. I don't see how
> > this can go anywhere without IETF support. Also, I suggest that you
> > look at the IPv10 proposal since that was very similar in intent. One
> > of the reasons that IPv10 was shot down was that protocol transition
> > mechanisms were more interesting ten years ago than today. IPv6 has
> > good traction now. In fact, it's probably the case that it's now
> > easier to bring up IPv6 than to try to make IPv4 options work over the
> > Internet.
> 
> +1
> 
> Many hosts do not use IPv4 anymore.
> 
> We even have a project making IPv4 support in Linux optional.

I agree on these points, but I'd like to figure out what can be done
to put a bit more pressure on ISPs to *always* provide IPv6. It's
still very hard to get decent connectivity at home, and without this
IPv6 will continue to be marginalized.

I do have IPv6 at home (a /48, waste of addressing space, I'd be fine
with less), there's none at work (I don't even know if the ISP supports
it, at least it was never ever mentioned so probably they don't know
about this), and some ISPs only provide a /64, which is as ridiculous
as providing a single address since it forces the end user to NAT,
thus breaking the end-to-end principle. Ideally, with IoT at the door,
every
home connection should have at least a /60 and enterprises should have
a /56, and this by default, without having to request anything.

Maybe setting up a public list of ISPs where users don't have at least
a /60 by default could help, but I suspect that most of them will
consider that as long as their competitors are on the list there's no
emergency.

Willy


Re: [PATCH net] net: ipv6: prevent use after free in ip6_route_mpath_notify()

2018-06-03 Thread Eric Dumazet



On 06/03/2018 07:46 AM, David Ahern wrote:

> It was a mistake to set rt_last before checking err. So the
> use-after-free exposed the semantic error.
> 

SGTM, please send the formal patch then, thanks !


Re: [PATCH bpf-next] bpf: flowlabel in bpf_fib_lookup should be flowinfo

2018-06-03 Thread Alexei Starovoitov
On Sun, Jun 03, 2018 at 07:47:11PM -0600, David Ahern wrote:
> On 6/3/18 7:41 PM, Alexei Starovoitov wrote:
> > On Sun, Jun 03, 2018 at 08:15:19AM -0700, dsah...@kernel.org wrote:
> >> From: David Ahern 
> >>
> >> As Michal noted the flow struct takes both the flow label and priority.
> >> Update the bpf_fib_lookup API to note that it is flowinfo and not just
> >> the flow label.
> >>
> >> Cc: Michal Kubecek 
> >> Signed-off-by: David Ahern 
> > 
> > Applied, Thanks
> > 
> 
> I noticed 4.17 was released. Just to make sure we are on the same page,
> this patch needs to be 4.18.

It was applied to bpf-next obviously.

As soon as we resolve the situation with af_xdp the PR will be sent to Dave
for net-next, so net-next can be sent to Linus.



Re: [PATCH bpf-next] bpf: flowlabel in bpf_fib_lookup should be flowinfo

2018-06-03 Thread David Ahern
On 6/3/18 7:41 PM, Alexei Starovoitov wrote:
> On Sun, Jun 03, 2018 at 08:15:19AM -0700, dsah...@kernel.org wrote:
>> From: David Ahern 
>>
>> As Michal noted the flow struct takes both the flow label and priority.
>> Update the bpf_fib_lookup API to note that it is flowinfo and not just
>> the flow label.
>>
>> Cc: Michal Kubecek 
>> Signed-off-by: David Ahern 
> 
> Applied, Thanks
> 

I noticed 4.17 was released. Just to make sure we are on the same page,
this patch needs to be 4.18.


Re: [PATCH bpf-next] bpf: flowlabel in bpf_fib_lookup should be flowinfo

2018-06-03 Thread Alexei Starovoitov
On Sun, Jun 03, 2018 at 08:15:19AM -0700, dsah...@kernel.org wrote:
> From: David Ahern 
> 
> As Michal noted the flow struct takes both the flow label and priority.
> Update the bpf_fib_lookup API to note that it is flowinfo and not just
> the flow label.
> 
> Cc: Michal Kubecek 
> Signed-off-by: David Ahern 

Applied, Thanks



Re: [PATCH net-next v2 0/3] bpf: implement bpf_get_current_cgroup_id() helper

2018-06-03 Thread Alexei Starovoitov
On Sun, Jun 03, 2018 at 03:59:40PM -0700, Yonghong Song wrote:
> bpf has been used extensively for tracing. For example, bcc
> contains an almost full set of bpf-based tools to trace kernel
> and user functions/events. Most tracing tools are currently
> either filtered based on pid or system-wide.
> 
> Containers have been used quite extensively in industry and
> cgroup is often used together to provide resource isolation
> and protection. Several processes may run inside the same
> container. It is often desirable to get container-level tracing
> results as well, e.g. syscall count, function count, I/O
> activity, etc.
> 
> This patch implements a new helper, bpf_get_current_cgroup_id(),
> which will return cgroup id based on the cgroup within which
> the current task is running.
> 
> Patch #1 implements the new helper in the kernel.
> Patch #2 syncs the uapi bpf.h header and helper between tools
> and kernel.
> Patch #3 shows how to get the same cgroup id in user space,
> so a filter or policy could be configured in the bpf program
> based on current task cgroup.
> 
> Changelog:
>   v1 -> v2:
>  . rebase to resolve merge conflict with latest bpf-next.

Applied, Thanks.



Re: [PATCH iproute2] iplink_vrf: Save device index from response for return code

2018-06-03 Thread Hangbin Liu
On Fri, Jun 01, 2018 at 08:50:16AM -0700, dsah...@kernel.org wrote:
> From: David Ahern 
> 
> A recent commit changed rtnl_talk_* to return the response message in
> allocated memory, so callers need to free it. The change to name_is_vrf
> did not save the device index, which points into a struct inside the
> now allocated-and-freed memory, resulting in garbage being returned
> in some cases.
> 
> Fix by using a stack variable to save the return value and only set
> it to ifi->ifi_index after all checks are done and before the answer
> buffer is freed.
> 
> Fixes: 86bf43c7c2fdc ("lib/libnetlink: update rtnl_talk to support malloc buff at run time")
> Cc: Hangbin Liu 
> Cc: Phil Sutter 
> Signed-off-by: David Ahern 
> ---
>  ip/iplink_vrf.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/ip/iplink_vrf.c b/ip/iplink_vrf.c
> index e9dd0df98412..6004bb4f305e 100644
> --- a/ip/iplink_vrf.c
> +++ b/ip/iplink_vrf.c
> @@ -191,6 +191,7 @@ int name_is_vrf(const char *name)
>   struct rtattr *tb[IFLA_MAX+1];
>   struct rtattr *li[IFLA_INFO_MAX+1];
>   struct ifinfomsg *ifi;
> + int ifindex = 0;
>   int len;
>  
>   addattr_l(&req, sizeof(req), IFLA_IFNAME, name, strlen(name) + 1);
> @@ -218,7 +219,8 @@ int name_is_vrf(const char *name)
>   if (strcmp(RTA_DATA(li[IFLA_INFO_KIND]), "vrf"))
>   goto out;
>  
> + ifindex = ifi->ifi_index;
>  out:
>   free(answer);
> - return ifi->ifi_index;
> + return ifindex;
>  }
> -- 
> 2.11.0
> 

Thanks for the fix.

Acked-by: Hangbin Liu 


Re: [PATCH v5 net] stmmac: 802.1ad tag stripping fix

2018-06-03 Thread Toshiaki Makita
On 2018/06/03 23:33, David Miller wrote:
> From: Elad Nachman 
> Date: Wed, 30 May 2018 08:48:25 +0300
> 
>>  static void stmmac_rx_vlan(struct net_device *dev, struct sk_buff *skb)
>>  {
>> -struct ethhdr *ehdr;
>> +struct vlan_ethhdr *veth;
>>  u16 vlanid;
>> +__be16 vlan_proto;
> 
> Please order local variables from longest to shortest line.
> 
>>  
>> -if ((dev->features & NETIF_F_HW_VLAN_CTAG_RX) ==
>> -NETIF_F_HW_VLAN_CTAG_RX &&
> >> -!__vlan_get_tag(skb, &vlanid)) {
> >> +if (!__vlan_get_tag(skb, &vlanid)) {
>>  /* pop the vlan tag */
>> -ehdr = (struct ethhdr *)skb->data;
>> -memmove(skb->data + VLAN_HLEN, ehdr, ETH_ALEN * 2);
>> +veth = (struct vlan_ethhdr *)skb->data;
>> +vlan_proto = veth->h_vlan_proto;
>> +memmove(skb->data + VLAN_HLEN, veth, ETH_ALEN * 2);
>>  skb_pull(skb, VLAN_HLEN);
>> -__vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), vlanid);
>> +__vlan_hwaccel_put_tag(skb, vlan_proto, vlanid);
>>  }
>>  }
> 
> I can't see how it is valid to do an unconditional software VLAN
> untagging even when VLAN is disabled in the kernel config or the
> NETIF_F_* feature bits are not set.

Right. It is not valid.

> 
> At a minimum that feature test has to stay there, and when it's clear
> we let the generic VLAN code untag the packet.

Since the NETIF_F_HW_VLAN_*_RX features are not protocol-agnostic, we
need two similar protocol-specific checks here:

veth = (struct vlan_ethhdr *)skb->data;
vlan_proto = veth->h_vlan_proto;
if ((vlan_proto == htons(ETH_P_8021Q) &&
     dev->features & NETIF_F_HW_VLAN_CTAG_RX) ||
    (vlan_proto == htons(ETH_P_8021AD) &&
     dev->features & NETIF_F_HW_VLAN_STAG_RX)) {
    vlanid = ntohs(veth->h_vlan_TCI);
    memmove(...);
    skb_pull(...);
    __vlan_hwaccel_put_tag(skb, vlan_proto, vlanid);
}

An alternative is not to check vlan_proto or the features here, but to
compile this code only when VLAN is enabled in the kernel config. This
is valid only because this driver does not advertise NETIF_F_HW_VLAN_*_RX
in hw_features, so the features cannot be toggled for now.

static void stmmac_rx_vlan(struct net_device *dev, struct sk_buff *skb)
{
#ifdef STMMAC_VLAN_TAG_USED
...
if (!__vlan_get_tag(skb, &vlanid)) {
...
__vlan_hwaccel_put_tag(skb, vlan_proto, vlanid);
}
#endif
}

-- 
Toshiaki Makita



Re: [PATCH net-next] net: ipv6: Generate random IID for addresses on RAWIP devices

2018-06-03 Thread 吉藤英明
Hello,

2018-06-04 6:54 GMT+09:00 Subash Abhinov Kasiviswanathan:
> RAWIP devices such as rmnet do not have a hardware address and
> instead require the kernel to generate a random IID for the
> temporary addresses. For permanent addresses, the device IID is
> used along with the prefix received.
>
> Signed-off-by: Subash Abhinov Kasiviswanathan 
> ---
>  net/ipv6/addrconf.c | 17 -
>  1 file changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index f09afc2..e4c4540 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
> @@ -2230,6 +2230,18 @@ static int addrconf_ifid_ip6tnl(u8 *eui, struct net_device *dev)
> return 0;
>  }
>
> +static int addrconf_ifid_rawip(u8 *eui, struct net_device *dev)
> +{
> +   struct in6_addr lladdr;
> +
> +   if (ipv6_get_lladdr(dev, &lladdr, IFA_F_TENTATIVE))
> +   get_random_bytes(eui, 8);

Please be aware of I/G bit and G/L bit.

--yoshfuji


Re: [PATCH net-next 0/3] bpf: implement bpf_get_current_cgroup_id() helper

2018-06-03 Thread Yonghong Song




On 6/3/18 1:00 PM, Alexei Starovoitov wrote:

> On Sun, Jun 03, 2018 at 12:36:51AM -0700, Yonghong Song wrote:
> > bpf has been used extensively for tracing. For example, bcc
> > contains an almost full set of bpf-based tools to trace kernel
> > and user functions/events. Most tracing tools are currently
> > either filtered based on pid or system-wide.
> > 
> > Containers have been used quite extensively in industry and
> > cgroup is often used together to provide resource isolation
> > and protection. Several processes may run inside the same
> > container. It is often desirable to get container-level tracing
> > results as well, e.g. syscall count, function count, I/O
> > activity, etc.
> > 
> > This patch implements a new helper, bpf_get_current_cgroup_id(),
> > which will return cgroup id based on the cgroup within which
> > the current task is running.
> > 
> > Patch #1 implements the new helper in the kernel.
> > Patch #2 syncs the uapi bpf.h header and helper between tools
> > and kernel.
> > Patch #3 shows how to get the same cgroup id in user space,
> > so a filter or policy could be configured in the bpf program
> > based on current task cgroup.
> 
> for all patches:
> Acked-by: Alexei Starovoitov 
> 
> please rebase, so it can be applied and s/net-next/bpf-next/ in subj.


Sorry, I missed changing the subject line from "net-next" to "bpf-next".
Do you want me to submit another revision?


Thanks!



[PATCH net-next v2 3/3] tools/bpf: add a selftest for bpf_get_current_cgroup_id() helper

2018-06-03 Thread Yonghong Song
Syscall name_to_handle_at() can be used to get the cgroup id
for a particular cgroup path in user space. The selftest gets
the cgroup id from both user space and the kernel, and compares
them to ensure they are equal.

Acked-by: Alexei Starovoitov 
Signed-off-by: Yonghong Song 
---
 tools/testing/selftests/bpf/.gitignore   |   1 +
 tools/testing/selftests/bpf/Makefile |   6 +-
 tools/testing/selftests/bpf/cgroup_helpers.c |  57 +
 tools/testing/selftests/bpf/cgroup_helpers.h |   1 +
 tools/testing/selftests/bpf/get_cgroup_id_kern.c |  28 +
 tools/testing/selftests/bpf/get_cgroup_id_user.c | 141 +++
 6 files changed, 232 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/get_cgroup_id_kern.c
 create mode 100644 tools/testing/selftests/bpf/get_cgroup_id_user.c

diff --git a/tools/testing/selftests/bpf/.gitignore 
b/tools/testing/selftests/bpf/.gitignore
index 6ea8359..49938d7 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -18,3 +18,4 @@ urandom_read
 test_btf
 test_sockmap
 test_lirc_mode2_user
+get_cgroup_id_user
diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index 553d181..607ed87 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -24,7 +24,7 @@ urandom_read: urandom_read.c
 # Order correspond to 'make run_tests' order
 TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map 
test_progs \
test_align test_verifier_log test_dev_cgroup test_tcpbpf_user \
-   test_sock test_btf test_sockmap test_lirc_mode2_user
+   test_sock test_btf test_sockmap test_lirc_mode2_user get_cgroup_id_user
 
 TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o 
test_obj_id.o \
test_pkt_md_access.o test_xdp_redirect.o test_xdp_meta.o 
sockmap_parse_prog.o \
@@ -34,7 +34,8 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o 
test_tcp_estats.o test
sockmap_tcp_msg_prog.o connect4_prog.o connect6_prog.o 
test_adjust_tail.o \
test_btf_haskv.o test_btf_nokv.o test_sockmap_kern.o test_tunnel_kern.o 
\
test_get_stack_rawtp.o test_sockmap_kern.o test_sockhash_kern.o \
-   test_lwt_seg6local.o sendmsg4_prog.o sendmsg6_prog.o 
test_lirc_mode2_kern.o
+   test_lwt_seg6local.o sendmsg4_prog.o sendmsg6_prog.o 
test_lirc_mode2_kern.o \
+   get_cgroup_id_kern.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
@@ -63,6 +64,7 @@ $(OUTPUT)/test_sock: cgroup_helpers.c
 $(OUTPUT)/test_sock_addr: cgroup_helpers.c
 $(OUTPUT)/test_sockmap: cgroup_helpers.c
 $(OUTPUT)/test_progs: trace_helpers.c
+$(OUTPUT)/get_cgroup_id_user: cgroup_helpers.c
 
 .PHONY: force
 
diff --git a/tools/testing/selftests/bpf/cgroup_helpers.c 
b/tools/testing/selftests/bpf/cgroup_helpers.c
index f3bca3a..c87b4e0 100644
--- a/tools/testing/selftests/bpf/cgroup_helpers.c
+++ b/tools/testing/selftests/bpf/cgroup_helpers.c
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -176,3 +177,59 @@ int create_and_get_cgroup(char *path)
 
return fd;
 }
+
+/**
+ * get_cgroup_id() - Get cgroup id for a particular cgroup path
+ * @path: The cgroup path, relative to the workdir, to join
+ *
+ * On success, it returns the cgroup id. On failure it returns 0,
+ * which is an invalid cgroup id.
+ * If there is a failure, it prints the error to stderr.
+ */
+unsigned long long get_cgroup_id(char *path)
+{
+   int dirfd, err, flags, mount_id, fhsize;
+   union {
+   unsigned long long cgid;
+   unsigned char raw_bytes[8];
+   } id;
+   char cgroup_workdir[PATH_MAX + 1];
+   struct file_handle *fhp, *fhp2;
+   unsigned long long ret = 0;
+
+   format_cgroup_path(cgroup_workdir, path);
+
+   dirfd = AT_FDCWD;
+   flags = 0;
+   fhsize = sizeof(*fhp);
+   fhp = calloc(1, fhsize);
+   if (!fhp) {
+   log_err("calloc");
+   return 0;
+   }
+   err = name_to_handle_at(dirfd, cgroup_workdir, fhp, &mount_id, flags);
+   if (err >= 0 || fhp->handle_bytes != 8) {
+   log_err("name_to_handle_at");
+   goto free_mem;
+   }
+
+   fhsize = sizeof(struct file_handle) + fhp->handle_bytes;
+   fhp2 = realloc(fhp, fhsize);
+   if (!fhp2) {
+   log_err("realloc");
+   goto free_mem;
+   }
+   err = name_to_handle_at(dirfd, cgroup_workdir, fhp2, &mount_id, flags);
+   fhp = fhp2;
+   if (err < 0) {
+   log_err("name_to_handle_at");
+   goto free_mem;
+   }
+
+   memcpy(id.raw_bytes, fhp->f_handle, 8);
+   ret = id.cgid;
+
+free_mem:
+   free(fhp);
+   return ret;
+}
diff --git a/tools/testing/selftests/bpf/cgroup_helpers.h 
b/tools/testing/selftests/bpf/cgroup_helpers.h
index 06485e0..20a4a5d 

[PATCH net-next v2 0/3] bpf: implement bpf_get_current_cgroup_id() helper

2018-06-03 Thread Yonghong Song
bpf has been used extensively for tracing. For example, bcc
contains an almost full set of bpf-based tools to trace kernel
and user functions/events. Most tracing tools are currently
either filtered based on pid or system-wide.

Containers have been used quite extensively in industry and
cgroup is often used together to provide resource isolation
and protection. Several processes may run inside the same
container. It is often desirable to get container-level tracing
results as well, e.g. syscall count, function count, I/O
activity, etc.

This patch implements a new helper, bpf_get_current_cgroup_id(),
which will return cgroup id based on the cgroup within which
the current task is running.

Patch #1 implements the new helper in the kernel.
Patch #2 syncs the uapi bpf.h header and helper between tools
and kernel.
Patch #3 shows how to get the same cgroup id in user space,
so a filter or policy could be configured in the bpf program
based on current task cgroup.

Changelog:
  v1 -> v2:
 . rebase to resolve merge conflict with latest bpf-next.

Yonghong Song (3):
  bpf: implement bpf_get_current_cgroup_id() helper
  tools/bpf: sync uapi bpf.h for bpf_get_current_cgroup_id() helper
  tools/bpf: add a selftest for bpf_get_current_cgroup_id() helper

 include/linux/bpf.h  |   1 +
 include/uapi/linux/bpf.h |   8 +-
 kernel/bpf/core.c|   1 +
 kernel/bpf/helpers.c |  15 +++
 kernel/trace/bpf_trace.c |   2 +
 tools/include/uapi/linux/bpf.h   |   8 +-
 tools/testing/selftests/bpf/.gitignore   |   1 +
 tools/testing/selftests/bpf/Makefile |   6 +-
 tools/testing/selftests/bpf/bpf_helpers.h|   2 +
 tools/testing/selftests/bpf/cgroup_helpers.c |  57 +
 tools/testing/selftests/bpf/cgroup_helpers.h |   1 +
 tools/testing/selftests/bpf/get_cgroup_id_kern.c |  28 +
 tools/testing/selftests/bpf/get_cgroup_id_user.c | 141 +++
 13 files changed, 267 insertions(+), 4 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/get_cgroup_id_kern.c
 create mode 100644 tools/testing/selftests/bpf/get_cgroup_id_user.c

-- 
2.9.5



[PATCH net-next v2 1/3] bpf: implement bpf_get_current_cgroup_id() helper

2018-06-03 Thread Yonghong Song
bpf has been used extensively for tracing. For example, bcc
contains an almost full set of bpf-based tools to trace kernel
and user functions/events. Most tracing tools are currently
either filtered based on pid or system-wide.

Containers have been used quite extensively in industry and
cgroup is often used together to provide resource isolation
and protection. Several processes may run inside the same
container. It is often desirable to get container-level tracing
results as well, e.g. syscall count, function count, I/O
activity, etc.

This patch implements a new helper, bpf_get_current_cgroup_id(),
which will return cgroup id based on the cgroup within which
the current task is running.

The later patch will provide an example to show that
userspace can get the same cgroup id so it could
configure a filter or policy in the bpf program based on
task cgroup id.

The helper is currently implemented for tracing. It can
be added to other program types as well when needed.

Acked-by: Alexei Starovoitov 
Signed-off-by: Yonghong Song 
---
 include/linux/bpf.h  |  1 +
 include/uapi/linux/bpf.h |  8 +++-
 kernel/bpf/core.c|  1 +
 kernel/bpf/helpers.c | 15 +++
 kernel/trace/bpf_trace.c |  2 ++
 5 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index bbe2974..995c3b1 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -746,6 +746,7 @@ extern const struct bpf_func_proto bpf_get_stackid_proto;
 extern const struct bpf_func_proto bpf_get_stack_proto;
 extern const struct bpf_func_proto bpf_sock_map_update_proto;
 extern const struct bpf_func_proto bpf_sock_hash_update_proto;
+extern const struct bpf_func_proto bpf_get_current_cgroup_id_proto;
 
 /* Shared helpers among cBPF and eBPF. */
 void bpf_user_rnd_init_once(void);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index f0b6608..18712b0 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2070,6 +2070,11 @@ union bpf_attr {
  * **CONFIG_SOCK_CGROUP_DATA** configuration option.
  * Return
  * The id is returned or 0 in case the id could not be retrieved.
+ *
+ * u64 bpf_get_current_cgroup_id(void)
+ * Return
+ * A 64-bit integer containing the current cgroup id based
+ * on the cgroup within which the current task is running.
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -2151,7 +2156,8 @@ union bpf_attr {
FN(lwt_seg6_action),\
FN(rc_repeat),  \
FN(rc_keydown), \
-   FN(skb_cgroup_id),
+   FN(skb_cgroup_id),  \
+   FN(get_current_cgroup_id),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 527587d..9f14937 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1765,6 +1765,7 @@ const struct bpf_func_proto bpf_get_current_uid_gid_proto 
__weak;
 const struct bpf_func_proto bpf_get_current_comm_proto __weak;
 const struct bpf_func_proto bpf_sock_map_update_proto __weak;
 const struct bpf_func_proto bpf_sock_hash_update_proto __weak;
+const struct bpf_func_proto bpf_get_current_cgroup_id_proto __weak;
 
 const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
 {
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 3d24e23..73065e2 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -179,3 +179,18 @@ const struct bpf_func_proto bpf_get_current_comm_proto = {
.arg1_type  = ARG_PTR_TO_UNINIT_MEM,
.arg2_type  = ARG_CONST_SIZE,
 };
+
+#ifdef CONFIG_CGROUPS
+BPF_CALL_0(bpf_get_current_cgroup_id)
+{
+   struct cgroup *cgrp = task_dfl_cgroup(current);
+
+   return cgrp->kn->id.id;
+}
+
+const struct bpf_func_proto bpf_get_current_cgroup_id_proto = {
+   .func   = bpf_get_current_cgroup_id,
+   .gpl_only   = false,
+   .ret_type   = RET_INTEGER,
+};
+#endif
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 752992c..e2ab5b7 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -564,6 +564,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
return &bpf_get_prandom_u32_proto;
case BPF_FUNC_probe_read_str:
return &bpf_probe_read_str_proto;
+   case BPF_FUNC_get_current_cgroup_id:
+   return &bpf_get_current_cgroup_id_proto;
default:
return NULL;
}
-- 
2.9.5



[PATCH net-next v2 2/3] tools/bpf: sync uapi bpf.h for bpf_get_current_cgroup_id() helper

2018-06-03 Thread Yonghong Song
Sync kernel uapi/linux/bpf.h with tools uapi/linux/bpf.h.
Also add the necessary helper define in bpf_helpers.h.

Acked-by: Alexei Starovoitov 
Signed-off-by: Yonghong Song 
---
 tools/include/uapi/linux/bpf.h| 8 +++-
 tools/testing/selftests/bpf/bpf_helpers.h | 2 ++
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index f0b6608..18712b0 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2070,6 +2070,11 @@ union bpf_attr {
  * **CONFIG_SOCK_CGROUP_DATA** configuration option.
  * Return
  * The id is returned or 0 in case the id could not be retrieved.
+ *
+ * u64 bpf_get_current_cgroup_id(void)
+ * Return
+ * A 64-bit integer containing the current cgroup id based
+ * on the cgroup within which the current task is running.
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -2151,7 +2156,8 @@ union bpf_attr {
FN(lwt_seg6_action),\
FN(rc_repeat),  \
FN(rc_keydown), \
-   FN(skb_cgroup_id),
+   FN(skb_cgroup_id),  \
+   FN(get_current_cgroup_id),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h 
b/tools/testing/selftests/bpf/bpf_helpers.h
index a66a9d9..f2f28b6 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -131,6 +131,8 @@ static int (*bpf_rc_repeat)(void *ctx) =
 static int (*bpf_rc_keydown)(void *ctx, unsigned int protocol,
 unsigned long long scancode, unsigned int toggle) =
(void *) BPF_FUNC_rc_keydown;
+static unsigned long long (*bpf_get_current_cgroup_id)(void) =
+   (void *) BPF_FUNC_get_current_cgroup_id;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
-- 
2.9.5



Re: ANNOUNCE: Enhanced IP v1.4

2018-06-03 Thread Eric Dumazet



On 06/03/2018 01:37 PM, Tom Herbert wrote:

> This is not an inconsequential mechanism that is being proposed. It's
> a modification to IP protocol that is intended to work on the
> Internet, but it looks like the draft hasn't been updated for two
> years and it is not adopted by any IETF working group. I don't see how
> this can go anywhere without IETF support. Also, I suggest that you
> look at the IPv10 proposal since that was very similar in intent. One
> of the reasons that IPv10 was shot down was that protocol transition
> mechanisms were more interesting ten years ago than today. IPv6 has
> good traction now. In fact, it's probably the case that it's now
> easier to bring up IPv6 than to try to make IPv4 options work over the
> Internet.

+1

Many hosts do not use IPv4 anymore.

We even have a project making IPv4 support in Linux optional.



[PATCH net] net: qualcomm: rmnet: Fix use after free while sending command ack

2018-06-03 Thread Subash Abhinov Kasiviswanathan
When sending an ack to a command packet, the skb is still referenced
after it is sent to the real device. Since the real device could
free the skb, the device pointer would be invalid.

Fixes: ceed73a2cf4a ("drivers: net: ethernet: qualcomm: rmnet: Initial implementation")
Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c
index 78fdad0..f530b07 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c
@@ -67,6 +67,7 @@ static void rmnet_map_send_ack(struct sk_buff *skb,
   struct rmnet_port *port)
 {
struct rmnet_map_control_command *cmd;
+   struct net_device *dev = skb->dev;
int xmit_status;
 
if (port->data_format & RMNET_FLAGS_INGRESS_MAP_CKSUMV4) {
@@ -86,9 +87,9 @@ static void rmnet_map_send_ack(struct sk_buff *skb,
cmd = RMNET_MAP_GET_CMD_START(skb);
cmd->cmd_type = type & 0x03;
 
-   netif_tx_lock(skb->dev);
-   xmit_status = skb->dev->netdev_ops->ndo_start_xmit(skb, skb->dev);
-   netif_tx_unlock(skb->dev);
+   netif_tx_lock(dev);
+   xmit_status = dev->netdev_ops->ndo_start_xmit(skb, dev);
+   netif_tx_unlock(dev);
 }
 
 /* Process MAP command frame and send N/ACK message as appropriate. Message cmd
-- 
1.9.1



[PATCH net-next] net: ipv6: Generate random IID for addresses on RAWIP devices

2018-06-03 Thread Subash Abhinov Kasiviswanathan
RAWIP devices such as rmnet do not have a hardware address and
instead require the kernel to generate a random IID for the
temporary addresses. For permanent addresses, the device IID is
used along with the prefix received.

Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 net/ipv6/addrconf.c | 17 -
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index f09afc2..e4c4540 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -2230,6 +2230,18 @@ static int addrconf_ifid_ip6tnl(u8 *eui, struct net_device *dev)
return 0;
 }
 
+static int addrconf_ifid_rawip(u8 *eui, struct net_device *dev)
+{
+   struct in6_addr lladdr;
+
+   if (ipv6_get_lladdr(dev, &lladdr, IFA_F_TENTATIVE))
+   get_random_bytes(eui, 8);
+   else
+   memcpy(eui, lladdr.s6_addr + 8, 8);
+
+   return 0;
+}
+
 static int ipv6_generate_eui64(u8 *eui, struct net_device *dev)
 {
switch (dev->type) {
@@ -2252,6 +2264,8 @@ static int ipv6_generate_eui64(u8 *eui, struct net_device *dev)
case ARPHRD_TUNNEL6:
case ARPHRD_IP6GRE:
return addrconf_ifid_ip6tnl(eui, dev);
+   case ARPHRD_RAWIP:
+   return addrconf_ifid_rawip(eui, dev);
}
return -1;
 }
@@ -3286,7 +3300,8 @@ static void addrconf_dev_config(struct net_device *dev)
(dev->type != ARPHRD_IP6GRE) &&
(dev->type != ARPHRD_IPGRE) &&
(dev->type != ARPHRD_TUNNEL) &&
-   (dev->type != ARPHRD_NONE)) {
+   (dev->type != ARPHRD_NONE) &&
+   (dev->type != ARPHRD_RAWIP)) {
/* Alas, we support only Ethernet autoconfiguration. */
return;
}
-- 
1.9.1



Re: ANNOUNCE: Enhanced IP v1.4

2018-06-03 Thread Tom Herbert
On Sat, Jun 2, 2018 at 9:17 AM, Sam Patton  wrote:
> Hello Willy, netdev,
>
> Thank you for your reply and advice.  I couldn't agree more with you
> about containers and the exciting prospects there, as well as the
> ADSL scenario you mention.
>
> As far as application examples, check out this simple netcat-like
> program I use for testing:
>
> https://github.com/EnIP/enhancedip/blob/master/userspace/netcat/netcat.c
>
> Lines 61-67 show how to connect directly via an EnIP address.  The
> netcat-like application uses a header file called eip.h.  You can
> look at it here:
>
> https://github.com/EnIP/enhancedip/blob/master/userspace/include/eip.h
>
> EnIP makes use of IPv6 AAAA records for DNS lookup.  We simply put
> 2001:0101 (which is an IPv6 experimental prefix) and then the 64-bit
> EnIP address into the next 8 bytes of the address.  The remaining
> bytes are set to zero.
>
> In the kernel, if you want to see how we convert the IPv6 DNS lookup
> into something connect() can manage, check out the add_enhanced_ip()
> routine found here:
>
> https://github.com/EnIP/enhancedip/blob/master/kernel/4.9.28/socket.c
>
> The reason we had to do changes for openssh and not other
> applications (that use DNS) is that openssh has a check to see if
> the socket is using IP options.  If the socket does, sshd drops the
> connection.  I had to work around that to get openssh working with
> EnIP.  The result: if you want to connect the netcat-like program
> with IP addresses you'll end up doing something like the example
> above.  If you're using DNS (getaddrinfo) to connect(), it should
> just work (except for sshd as noted).
>
> Here's the draft experimental RFC:
> https://tools.ietf.org/html/draft-chimiak-enhanced-ipv4-03
> I'll also note that I am doing this code part time as a hobby for a
> long time, so I appreciate your help and support.  It would be really
> great if the kernel community decided to pick this up, but if it's
> not a reality please let me know soonest so I can move on to a
Hi Sam,

This is not an inconsequential mechanism that is being proposed. It's
a modification to IP protocol that is intended to work on the
Internet, but it looks like the draft hasn't been updated for two
years and it is not adopted by any IETF working group. I don't see how
this can go anywhere without IETF support. Also, I suggest that you
look at the IPv10 proposal since that was very similar in intent. One
of the reasons that IPv10 was shot down was that protocol transition
mechanisms were more interesting ten years ago than today. IPv6 has
good traction now. In fact, it's probably the case that it's now
easier to bring up IPv6 than to try to make IPv4 options work over the
Internet.

Tom


> different hobby.  :)
>
> Thank you.
>
> Sam Patton
>
> On 6/2/18 1:57 AM, Willy Tarreau wrote:
>> Hello Sam,
>>
>> On Fri, Jun 01, 2018 at 09:48:28PM -0400, Sam Patton wrote:
>>> Hello!
>>>
>>> If you do not know what Enhanced IP is, read this post on netdev first:
>>>
>>> https://www.spinics.net/lists/netdev/msg327242.html
>>>
>>>
>>> The Enhanced IP project presents:
>>>
>>>  Enhanced IP v1.4
>>>
>>> The Enhanced IP (EnIP) code has been updated.  It now builds with OpenWRT 
>>> barrier breaker (for 148 different devices). We've been testing with the 
>>> Western Digital N600 and N750 wireless home routers.
>> (...) First note, please think about breaking your lines if you want your
>> mails to be read by the widest audience, as for some of us here, reading
>> lines wider than a terminal is really annoying, and often not considered
>> worth spending time on them considering there are so many easier ones
>> left to read.
>>
>>> Interested in seeing Enhanced IP in the Linux kernel, read on.  Not
>>> interested in seeing Enhanced IP in the Linux kernel, read on.
>> (...)
>>
>> So I personally find the concept quite interesting. It reminds me of the
>> previous IPv5/IPv7/IPv8 initiatives, which in my opinion were a bit hopeless.
>> Here the fact that you decide to consider the IPv4 address as a network opens
>> new perspectives. For containerized environments it could be considered that
>> each server, with one IPv4, can host 2^32 guests and that NAT is not needed
>> anymore, for example. It could also open the possibility that enthusiasts
>> can more easily host some services at home behind their ADSL line without
>> having to run on strange ports.
>>
>> However I think your approach is not the most efficient to encourage 
>> adoption.
>> It's important to understand that there will be little incentive for people
>> to patch their kernels to run some code if they don't have the applications
>> on top of it. The kernel is not the end goal for most users, the kernel is
>> just the lower layer needed to run applications on top. I looked at your site
>> and the github repo, and all I could find was a pre-patched openssh,
>> no simple explanation of what to change in an application.
>>
>> What you need to do 

Re: [PATCH net-next 0/3] bpf: implement bpf_get_current_cgroup_id() helper

2018-06-03 Thread Alexei Starovoitov
On Sun, Jun 03, 2018 at 12:36:51AM -0700, Yonghong Song wrote:
> bpf has been used extensively for tracing. For example, bcc
> contains an almost full set of bpf-based tools to trace kernel
> and user functions/events. Most tracing tools are currently
> either filtered based on pid or system-wide.
> 
> Containers have been used quite extensively in industry and
> cgroup is often used together to provide resource isolation
> and protection. Several processes may run inside the same
> container. It is often desirable to get container-level tracing
> results as well, e.g. syscall count, function count, I/O
> activity, etc.
> 
> This patch implements a new helper, bpf_get_current_cgroup_id(),
> which will return cgroup id based on the cgroup within which
> the current task is running.
> 
> Patch #1 implements the new helper in the kernel.
> Patch #2 syncs the uapi bpf.h header and helper between tools
> and kernel.
> Patch #3 shows how to get the same cgroup id in user space,
> so a filter or policy could be configured in the bpf program
> based on current task cgroup.

for all patches:
Acked-by: Alexei Starovoitov 

please rebase, so it can be applied and s/net-next/bpf-next/ in subj.
Thanks!



Re: [PATCH] net: do not allow changing SO_REUSEADDR/SO_REUSEPORT on bound sockets

2018-06-03 Thread Christoph Paasch
Hello,

On Sun, Jun 3, 2018 at 10:47 AM, Maciej Żenczykowski wrote:
> From: Maciej Żenczykowski 
>
> It is not safe to do so because such sockets are already in the
> hash tables and changing these options can result in invalidating
> the tb->fastreuse(port) caching.
>
> This can have later far reaching consequences wrt. bind conflict checks
> which rely on these caches (for optimization purposes).
>
> Not to mention that you can currently end up with two identical
> non-reuseport listening sockets bound to the same local ip:port
> by clearing reuseport on them after they've already both been bound.

as a side-note: Some time back I realized that one can also - on the
active opener side - create two TCP connections with the same 5-tuple
going out over the same interface.

One simply needs to first create a connection with a socket that has
SO_BINDTODEVICE set to the same interface as the default route. The
second socket (which doesn't use SO_BINDTODEVICE) can then end up
using the same source port, if the range of available ports has been
exhausted.
This makes for some interesting packet-traces! :)

This is because INET_MATCH in __inet_check_established only checks for
!(sk->sk_bound_dev_if). inet_hash_connect() probably would need info
of the route's outgoing interface (of the new socket) to decide
whether or not there is a match.

But even that wouldn't be failsafe as the routing could change later
on... So, I dropped the ball on that.

Not sure if it's a big deal or not...


Cheers,
Christoph



>
> There is unfortunately no EISBOUND error or anything similar,
> and EISCONN seems to be misleading for a bound-but-not-connected
> socket, so use EUCLEAN 'Structure needs cleaning' which AFAICT
> is the closest you can get to meaning 'socket in bad state'.
> (although perhaps EINVAL wouldn't be a bad choice either?)
>
> This does unfortunately run the risk of breaking buggy
> userspace programs...
>
> Signed-off-by: Maciej Żenczykowski 
> Cc: Eric Dumazet 
>
> Change-Id: I77c2b3429b2fdf42671eee0fa7a8ba721c94963b
> ---
>  net/core/sock.c | 15 ++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 435a0ba85e52..feca4c98f8a0 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -728,9 +728,22 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
> sock_valbool_flag(sk, SOCK_DBG, valbool);
> break;
> case SO_REUSEADDR:
> -   sk->sk_reuse = (valbool ? SK_CAN_REUSE : SK_NO_REUSE);
> +   val = (valbool ? SK_CAN_REUSE : SK_NO_REUSE);
> +   if ((sk->sk_family == PF_INET || sk->sk_family == PF_INET6) &&
> +   inet_sk(sk)->inet_num &&
> +   (sk->sk_reuse != val)) {
> +   ret = (sk->sk_state == TCP_ESTABLISHED) ? -EISCONN : -EUCLEAN;
> +   break;
> +   }
> +   sk->sk_reuse = val;
> break;
> case SO_REUSEPORT:
> +   if ((sk->sk_family == PF_INET || sk->sk_family == PF_INET6) &&
> +   inet_sk(sk)->inet_num &&
> +   (sk->sk_reuseport != valbool)) {
> +   ret = (sk->sk_state == TCP_ESTABLISHED) ? -EISCONN : -EUCLEAN;
> +   break;
> +   }
> sk->sk_reuseport = valbool;
> break;
> case SO_TYPE:
> --
> 2.17.1.1185.g55be947832-goog
>


Re: [PATCH net-next 0/2] cls_flower: Various fixes

2018-06-03 Thread Jiri Pirko
Sun, Jun 03, 2018 at 08:33:25PM CEST, xiyou.wangc...@gmail.com wrote:
>On Wed, May 30, 2018 at 1:17 AM, Paul Blakey  wrote:
>> Two of the fixes are for my multiple mask patch
>>
>> Paul Blakey (2):
>>   cls_flower: Fix missing free of rhashtable
>>   cls_flower: Fix comparing of old filter mask with new filter
>
>Both are bug fixes and one-line fixes, so definitely should go
>to -net tree and -stable tree.

I agree.


Re: [PATCH net-next 0/2] cls_flower: Various fixes

2018-06-03 Thread Cong Wang
On Wed, May 30, 2018 at 1:17 AM, Paul Blakey  wrote:
> Two of the fixes are for my multiple mask patch
>
> Paul Blakey (2):
>   cls_flower: Fix missing free of rhashtable
>   cls_flower: Fix comparing of old filter mask with new filter

Both are bug fixes and one-line fixes, so definitely should go
to -net tree and -stable tree.

I don't understand why you decide to rebase on net-next.


[PATCH] net: do not allow changing SO_REUSEADDR/SO_REUSEPORT on bound sockets

2018-06-03 Thread Maciej Żenczykowski
From: Maciej Żenczykowski 

It is not safe to do so because such sockets are already in the
hash tables and changing these options can result in invalidating
the tb->fastreuse(port) caching.

This can have later far reaching consequences wrt. bind conflict checks
which rely on these caches (for optimization purposes).

Not to mention that you can currently end up with two identical
non-reuseport listening sockets bound to the same local ip:port
by clearing reuseport on them after they've already both been bound.

There is unfortunately no EISBOUND error or anything similar,
and EISCONN seems to be misleading for a bound-but-not-connected
socket, so use EUCLEAN 'Structure needs cleaning' which AFAICT
is the closest you can get to meaning 'socket in bad state'.
(although perhaps EINVAL wouldn't be a bad choice either?)

This does unfortunately run the risk of breaking buggy
userspace programs...

Signed-off-by: Maciej Żenczykowski 
Cc: Eric Dumazet 

Change-Id: I77c2b3429b2fdf42671eee0fa7a8ba721c94963b
---
 net/core/sock.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 435a0ba85e52..feca4c98f8a0 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -728,9 +728,22 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
sock_valbool_flag(sk, SOCK_DBG, valbool);
break;
case SO_REUSEADDR:
-   sk->sk_reuse = (valbool ? SK_CAN_REUSE : SK_NO_REUSE);
+   val = (valbool ? SK_CAN_REUSE : SK_NO_REUSE);
+   if ((sk->sk_family == PF_INET || sk->sk_family == PF_INET6) &&
+   inet_sk(sk)->inet_num &&
+   (sk->sk_reuse != val)) {
+   ret = (sk->sk_state == TCP_ESTABLISHED) ? -EISCONN : -EUCLEAN;
+   break;
+   }
+   sk->sk_reuse = val;
break;
case SO_REUSEPORT:
+   if ((sk->sk_family == PF_INET || sk->sk_family == PF_INET6) &&
+   inet_sk(sk)->inet_num &&
+   (sk->sk_reuseport != valbool)) {
+   ret = (sk->sk_state == TCP_ESTABLISHED) ? -EISCONN : -EUCLEAN;
+   break;
+   }
sk->sk_reuseport = valbool;
break;
case SO_TYPE:
-- 
2.17.1.1185.g55be947832-goog



[PATCH] net-tcp: extend tcp_tw_reuse sysctl to enable loopback only optimization

2018-06-03 Thread Maciej Żenczykowski
From: Maciej Żenczykowski 

This changes the /proc/sys/net/ipv4/tcp_tw_reuse from a boolean
to an integer.

It now takes the values 0, 1 and 2, where 0 and 1 behave as before,
while 2 enables timewait socket reuse only for sockets that we can
prove are loopback connections:
  i.e. bound to the 'lo' interface or where one of the source or
  destination IPs is 127.0.0.0/8, ::ffff:127.0.0.0/104 or ::1.

This enables quicker reuse of ephemeral ports for loopback connections
- where tcp_tw_reuse is 100% safe from a protocol perspective
(this assumes no artificially induced packet loss on 'lo').

This also makes establishing many loopback connections *much* faster
(allocating ports out of the first half of the ephemeral port range
is significantly faster than allocating from the second half).

Without this change in a 32K ephemeral port space my sample program
(it just establishes and closes [::1]:ephemeral -> [::1]:server_port
connections in a tight loop) fails after 32765 connections in 24 seconds.
With it enabled 5 connections only take 4.7 seconds.

This is particularly problematic for IPv6 where we only have one local
address and cannot play tricks with varying source IP from 127.0.0.0/8
pool.

Signed-off-by: Maciej Żenczykowski 
Cc: Eric Dumazet 
Cc: Neal Cardwell 
Cc: Yuchung Cheng 
Cc: Wei Wang 

Change-Id: I0377961749979d0301b7b62871a32a4b34b654e1
---
 Documentation/networking/ip-sysctl.txt | 10 +---
 net/ipv4/sysctl_net_ipv4.c |  5 +++-
 net/ipv4/tcp_ipv4.c| 35 +++---
 3 files changed, 43 insertions(+), 7 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 924bd51327b7..6841c74eac00 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -667,11 +667,15 @@ tcp_tso_win_divisor - INTEGER
building larger TSO frames.
Default: 3
 
-tcp_tw_reuse - BOOLEAN
-   Allow to reuse TIME-WAIT sockets for new connections when it is
-   safe from protocol viewpoint. Default value is 0.
+tcp_tw_reuse - INTEGER
+   Enable reuse of TIME-WAIT sockets for new connections when it is
+   safe from protocol viewpoint.
+   0 - disable
+   1 - global enable
+   2 - enable for loopback traffic only
It should not be changed without advice/request of technical
experts.
+   Default: 2
 
 tcp_window_scaling - BOOLEAN
Enable window scaling as defined in RFC1323.
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index d2eed3ddcb0a..d06247ba08b2 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -30,6 +30,7 @@
 
 static int zero;
 static int one = 1;
+static int two = 2;
 static int four = 4;
 static int thousand = 1000;
 static int gso_max_segs = GSO_MAX_SEGS;
@@ -845,7 +846,9 @@ static struct ctl_table ipv4_net_table[] = {
.data   = &init_net.ipv4.sysctl_tcp_tw_reuse,
.maxlen = sizeof(int),
.mode   = 0644,
-   .proc_handler   = proc_dointvec
+   .proc_handler   = proc_dointvec_minmax,
+   .extra1 = &zero,
+   .extra2 = &two,
},
{
.procname   = "tcp_max_tw_buckets",
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index adbdb503db0c..29f922d5e55d 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -110,8 +110,38 @@ static u32 tcp_v4_init_ts_off(const struct net *net, const struct sk_buff *skb)
 
 int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp)
 {
+   const struct inet_timewait_sock *tw = inet_twsk(sktw);
const struct tcp_timewait_sock *tcptw = tcp_twsk(sktw);
struct tcp_sock *tp = tcp_sk(sk);
+   int reuse = sock_net(sk)->ipv4.sysctl_tcp_tw_reuse;
+
+   if (reuse == 2) {
+   /* Still does not detect *everything* that goes through
+* lo, since we require a loopback src or dst address
+* or direct binding to 'lo' interface.
+*/
+   bool loopback = false;
+   if (tw->tw_bound_dev_if == LOOPBACK_IFINDEX)
+   loopback = true;
+#if IS_ENABLED(CONFIG_IPV6)
+   if (tw->tw_family == AF_INET6) {
+   if (ipv6_addr_loopback(&tw->tw_v6_daddr) ||
+   (ipv6_addr_v4mapped(&tw->tw_v6_daddr) &&
+(tw->tw_v6_daddr.s6_addr[12] == 127)) ||
+   ipv6_addr_loopback(&tw->tw_v6_rcv_saddr) ||
+   (ipv6_addr_v4mapped(&tw->tw_v6_rcv_saddr) &&
+(tw->tw_v6_rcv_saddr.s6_addr[12] == 127)))
+   loopback = true;
+   } else
+#endif
+   {
+   if (ipv4_is_loopback(tw->tw_daddr) ||
+   ipv4_is_loopback(tw->tw_rcv_saddr))
+   

Re: [PATCH bpf-next v3 05/11] bpf: avoid retpoline for lookup/update/delete calls on maps

2018-06-03 Thread Jesper Dangaard Brouer
On Sun, 3 Jun 2018 18:11:45 +0200
Daniel Borkmann  wrote:

> On 06/03/2018 08:56 AM, Jesper Dangaard Brouer wrote:
> > On Sat,  2 Jun 2018 23:06:35 +0200
> > Daniel Borkmann  wrote:
> >   
> >> Before:
> >>
> >>   # bpftool p d x i 1  
> > 
> > Could this please be changed to:
> > 
> >  # bpftool prog dump xlated id 1
> > 
> > I requested this before, but you seem to have missed my feedback...
> > This makes the command "self-documenting" and searchable by Google.  
> 
> I recently wrote a howto here, but there's also excellent documentation
> in terms of man pages for bpftool.
> 
> http://cilium.readthedocs.io/en/latest/bpf/#bpftool
> 
> My original thinking was that it might be okay to also show usage of
> short option matching, like in iproute2 probably few people only write
> 'ip address' but majority uses 'ip a' instead. But I'm fine either way
> if there are strong opinions ... thanks Alexei for fixing up!

First of all I love your documentation effort.

Secondly I personally *hate* how 'ip' does its short-option parsing,
and especially the order/precedence ambiguity.  Phil Sutter
(Fedora/RHEL iproute2 maintainer) has a funny quiz illustrating the
ambiguity issues.

Quiz: https://youtu.be/cymH9pcFGa0?t=7m10s
Code problem: https://youtu.be/cymH9pcFGa0?t=9m8s

I hope the maintainers and developers of bpftool make sure we don't end
up in an ambiguity mess like we have with 'ip', pretty please.
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


[PATCH net-next v2] qed: Add srq core support for RoCE and iWARP

2018-06-03 Thread Yuval Bason
This patch adds support for configuring SRQ and provides the necessary
APIs for rdma upper layer driver (qedr) to enable the SRQ feature.

Signed-off-by: Michal Kalderon 
Signed-off-by: Ariel Elior 
Signed-off-by: Yuval Bason 
---
Changes from v1:
- sparse warnings
- replace memset with ={}
---
 drivers/net/ethernet/qlogic/qed/qed_cxt.c   |   5 +-
 drivers/net/ethernet/qlogic/qed/qed_cxt.h   |   1 +
 drivers/net/ethernet/qlogic/qed/qed_hsi.h   |   2 +
 drivers/net/ethernet/qlogic/qed/qed_iwarp.c |  23 
 drivers/net/ethernet/qlogic/qed/qed_main.c  |   2 +
 drivers/net/ethernet/qlogic/qed/qed_rdma.c  | 178 +++-
 drivers/net/ethernet/qlogic/qed/qed_rdma.h  |   2 +
 drivers/net/ethernet/qlogic/qed/qed_roce.c  |  17 ++-
 include/linux/qed/qed_rdma_if.h |  12 +-
 9 files changed, 234 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_cxt.c b/drivers/net/ethernet/qlogic/qed/qed_cxt.c
index 820b226..7ed6aa0 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_cxt.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_cxt.c
@@ -47,6 +47,7 @@
 #include "qed_hsi.h"
 #include "qed_hw.h"
 #include "qed_init_ops.h"
+#include "qed_rdma.h"
 #include "qed_reg_addr.h"
 #include "qed_sriov.h"
 
@@ -426,7 +427,7 @@ static void qed_cxt_set_srq_count(struct qed_hwfn *p_hwfn, u32 num_srqs)
p_mgr->srq_count = num_srqs;
 }
 
-static u32 qed_cxt_get_srq_count(struct qed_hwfn *p_hwfn)
+u32 qed_cxt_get_srq_count(struct qed_hwfn *p_hwfn)
 {
struct qed_cxt_mngr *p_mgr = p_hwfn->p_cxt_mngr;
 
@@ -2071,7 +2072,7 @@ static void qed_rdma_set_pf_params(struct qed_hwfn *p_hwfn,
u32 num_cons, num_qps, num_srqs;
enum protocol_type proto;
 
-   num_srqs = min_t(u32, 32 * 1024, p_params->num_srqs);
+   num_srqs = min_t(u32, QED_RDMA_MAX_SRQS, p_params->num_srqs);
 
if (p_hwfn->mcp_info->func_info.protocol == QED_PCI_ETH_RDMA) {
DP_NOTICE(p_hwfn,
diff --git a/drivers/net/ethernet/qlogic/qed/qed_cxt.h b/drivers/net/ethernet/qlogic/qed/qed_cxt.h
index a4e9586..758a8b4 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_cxt.h
+++ b/drivers/net/ethernet/qlogic/qed/qed_cxt.h
@@ -235,6 +235,7 @@ u32 qed_cxt_get_proto_tid_count(struct qed_hwfn *p_hwfn,
enum protocol_type type);
 u32 qed_cxt_get_proto_cid_start(struct qed_hwfn *p_hwfn,
enum protocol_type type);
+u32 qed_cxt_get_srq_count(struct qed_hwfn *p_hwfn);
 int qed_cxt_free_proto_ilt(struct qed_hwfn *p_hwfn, enum protocol_type proto);
 
 #define QED_CTX_WORKING_MEM 0
diff --git a/drivers/net/ethernet/qlogic/qed/qed_hsi.h b/drivers/net/ethernet/qlogic/qed/qed_hsi.h
index 8e1e6e1..82ce401 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_hsi.h
+++ b/drivers/net/ethernet/qlogic/qed/qed_hsi.h
@@ -9725,6 +9725,8 @@ enum iwarp_eqe_async_opcode {
IWARP_EVENT_TYPE_ASYNC_EXCEPTION_DETECTED,
IWARP_EVENT_TYPE_ASYNC_QP_IN_ERROR_STATE,
IWARP_EVENT_TYPE_ASYNC_CQ_OVERFLOW,
+   IWARP_EVENT_TYPE_ASYNC_SRQ_EMPTY,
+   IWARP_EVENT_TYPE_ASYNC_SRQ_LIMIT,
MAX_IWARP_EQE_ASYNC_OPCODE
 };
 
diff --git a/drivers/net/ethernet/qlogic/qed/qed_iwarp.c b/drivers/net/ethernet/qlogic/qed/qed_iwarp.c
index 2a2b101..474e6cf 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_iwarp.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_iwarp.c
@@ -271,6 +271,8 @@ int qed_iwarp_create_qp(struct qed_hwfn *p_hwfn,
p_ramrod->sq_num_pages = qp->sq_num_pages;
p_ramrod->rq_num_pages = qp->rq_num_pages;
 
+   p_ramrod->srq_id.srq_idx = cpu_to_le16(qp->srq_id);
+   p_ramrod->srq_id.opaque_fid = cpu_to_le16(p_hwfn->hw_info.opaque_fid);
p_ramrod->qp_handle_for_cqe.hi = cpu_to_le32(qp->qp_handle.hi);
p_ramrod->qp_handle_for_cqe.lo = cpu_to_le32(qp->qp_handle.lo);
 
@@ -3004,8 +3006,11 @@ static int qed_iwarp_async_event(struct qed_hwfn *p_hwfn,
 union event_ring_data *data,
 u8 fw_return_code)
 {
+   struct qed_rdma_events events = p_hwfn->p_rdma_info->events;
struct regpair *fw_handle = &data->rdma_data.async_handle;
struct qed_iwarp_ep *ep = NULL;
+   u16 srq_offset;
+   u16 srq_id;
u16 cid;
 
ep = (struct qed_iwarp_ep *)(uintptr_t)HILO_64(fw_handle->hi,
@@ -3067,6 +3072,24 @@ static int qed_iwarp_async_event(struct qed_hwfn *p_hwfn,
qed_iwarp_cid_cleaned(p_hwfn, cid);
 
break;
+   case IWARP_EVENT_TYPE_ASYNC_SRQ_EMPTY:
+   DP_NOTICE(p_hwfn, "IWARP_EVENT_TYPE_ASYNC_SRQ_EMPTY\n");
+   srq_offset = p_hwfn->p_rdma_info->srq_id_offset;
+   /* FW assigns value that is no greater than u16 */
+   srq_id = ((u16)le32_to_cpu(fw_handle->lo)) - srq_offset;
+   events.affiliated_event(events.context,
+   QED_IWARP_EVENT_SRQ_EMPTY,
+ 

Re: [PATCH bpf-next v3 05/11] bpf: avoid retpoline for lookup/update/delete calls on maps

2018-06-03 Thread Daniel Borkmann
On 06/03/2018 08:56 AM, Jesper Dangaard Brouer wrote:
> On Sat,  2 Jun 2018 23:06:35 +0200
> Daniel Borkmann  wrote:
> 
>> Before:
>>
>>   # bpftool p d x i 1
> 
> Could this please be changed to:
> 
>  # bpftool prog dump xlated id 1
> 
> I requested this before, but you seem to have missed my feedback...
> This makes the command "self-documenting" and searchable by Google.

I recently wrote a howto here, but there's also excellent documentation
in terms of man pages for bpftool.

http://cilium.readthedocs.io/en/latest/bpf/#bpftool

My original thinking was that it might be okay to also show usage of
short option matching, like in iproute2 probably few people only write
'ip address' but majority uses 'ip a' instead. But I'm fine either way
if there are strong opinions ... thanks Alexei for fixing up!


RE: [PATCH net-next] qed: Add srq core support for RoCE and iWARP

2018-06-03 Thread Bason, Yuval
From: Leon Romanovsky [mailto:l...@kernel.org]
Sent: Thursday, May 31, 2018 8:33 PM
> On Wed, May 30, 2018 at 04:11:37PM +0300, Yuval Bason wrote:
> > This patch adds support for configuring SRQ and provides the necessary
> > APIs for rdma upper layer driver (qedr) to enable the SRQ feature.
> >
> > Signed-off-by: Michal Kalderon 
> > Signed-off-by: Ariel Elior 
> > Signed-off-by: Yuval Bason 
> > ---
> >  drivers/net/ethernet/qlogic/qed/qed_cxt.c   |   5 +-
> >  drivers/net/ethernet/qlogic/qed/qed_cxt.h   |   1 +
> >  drivers/net/ethernet/qlogic/qed/qed_hsi.h   |   2 +
> >  drivers/net/ethernet/qlogic/qed/qed_iwarp.c |  23 
> >  drivers/net/ethernet/qlogic/qed/qed_main.c  |   2 +
> >  drivers/net/ethernet/qlogic/qed/qed_rdma.c  | 179
> +++-
> >  drivers/net/ethernet/qlogic/qed/qed_rdma.h  |   2 +
> >  drivers/net/ethernet/qlogic/qed/qed_roce.c  |  17 ++-
> >  include/linux/qed/qed_rdma_if.h |  12 +-
> >  9 files changed, 235 insertions(+), 8 deletions(-)
> >
> 
> ...
> 
> > +   struct qed_sp_init_data init_data;
> 
> ...
> 
> > +   memset(&init_data, 0, sizeof(init_data));
> 
> This pattern is so common in this patch, why?
> 
> "struct qed_sp_init_data init_data = {};" will do the trick.
> 
Thanks for pointing out, will be fixed in v2.

> Thanks


Re: [PATCH net] vrf: check the original netdevice for generating redirect

2018-06-03 Thread David Ahern
On 5/31/18 10:05 PM, Stephen Suryaputra wrote:
> Use the right device to determine if redirect should be sent especially
> when using vrf. Same as well as when sending the redirect.
> 
> Signed-off-by: Stephen Suryaputra 
> ---
>  net/ipv6/ip6_output.c | 3 ++-
>  net/ipv6/ndisc.c  | 6 ++
>  2 files changed, 8 insertions(+), 1 deletion(-)

skb->dev in this path is set to the vrf device if applicable, so yes the
change is needed. Thanks for the fix.

Acked-by: David Ahern 



Re: [bpf-next V2 PATCH 0/8] bpf/xdp: add flags argument to ndo_xdp_xmit and flag flush operation

2018-06-03 Thread Alexei Starovoitov
On Thu, May 31, 2018 at 10:59:42AM +0200, Jesper Dangaard Brouer wrote:
> As I mentioned in merge commit 10f678683e4 ("Merge branch 'xdp_xmit-bulking'")
> I plan to change the API for ndo_xdp_xmit once more, by adding a flags
> argument, which is done in this patchset.
> 
> I know it is late in the cycle (currently at rc7), but it would be
> nice to avoid changing NDOs over several kernel releases, as it is
> annoying to vendors and distro backporters, but it is not strictly
> UAPI so it is allowed (according to Alexei).
> 
> The end-goal is getting rid of the ndo_xdp_flush operation, as it will
> make it possible for drivers to implement a TXQ synchronization mechanism
> that is not necessarily derived from the CPU id (smp_processor_id).
> 
> This patchset removes all callers of the ndo_xdp_flush operation, but
> it doesn't take the last step of removing it from all drivers.  This
> can be done later, or I can update the patchset on request.
> 
> Micro-benchmarks only show a very small performance improvement, for
> map-redirect around ~2 ns, and for non-map redirect ~7 ns.  I've not
> benchmarked this with CONFIG_RETPOLINE, but the performance benefit
> should be more visible given we end-up removing an indirect call.
> 
> ---
> V2: Updated based on feedback from Song Liu 

Applied, but please send a follow up patch to remove ndo_xdp_flush().
Otherwise this patch set is just code churn that does the opposite
of what you're trying to achieve and creates more backport pains.



[PATCH bpf-next] bpf: flowlabel in bpf_fib_lookup should be flowinfo

2018-06-03 Thread dsahern
From: David Ahern 

As Michal noted, the flow struct takes both the flow label and priority.
Update the bpf_fib_lookup API to note that it is flowinfo and not just
the flow label.

Cc: Michal Kubecek 
Signed-off-by: David Ahern 
---
 include/uapi/linux/bpf.h   | 2 +-
 net/core/filter.c  | 2 +-
 samples/bpf/xdp_fwd_kern.c | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index f0b6608b1f1c..5ef032bc4746 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2623,7 +2623,7 @@ struct bpf_fib_lookup {
union {
/* inputs to lookup */
__u8tos;/* AF_INET  */
-   __be32  flowlabel;  /* AF_INET6 */
+   __be32  flowinfo;   /* AF_INET6, flow_label + priority */
 
/* output: metric of fib result (IPv4/IPv6 only) */
__u32   rt_metric;
diff --git a/net/core/filter.c b/net/core/filter.c
index 28e864777c0f..704d515de2df 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4222,7 +4222,7 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
fl6.flowi6_oif = 0;
strict = RT6_LOOKUP_F_HAS_SADDR;
}
-   fl6.flowlabel = params->flowlabel;
+   fl6.flowlabel = params->flowinfo;
fl6.flowi6_scope = 0;
fl6.flowi6_flags = 0;
fl6.mp_hash = 0;
diff --git a/samples/bpf/xdp_fwd_kern.c b/samples/bpf/xdp_fwd_kern.c
index 4a6be0f87505..6673cdb9f55c 100644
--- a/samples/bpf/xdp_fwd_kern.c
+++ b/samples/bpf/xdp_fwd_kern.c
@@ -88,7 +88,7 @@ static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags)
return XDP_PASS;
 
fib_params.family   = AF_INET6;
-   fib_params.flowlabel= *(__be32 *)ip6h & IPV6_FLOWINFO_MASK;
+   fib_params.flowinfo = *(__be32 *)ip6h & IPV6_FLOWINFO_MASK;
fib_params.l4_protocol  = ip6h->nexthdr;
fib_params.sport= 0;
fib_params.dport= 0;
-- 
2.11.0



Re: [PATCH bpf-next v3 00/11] Misc BPF improvements

2018-06-03 Thread Alexei Starovoitov
On Sat, Jun 02, 2018 at 11:06:30PM +0200, Daniel Borkmann wrote:
> This set adds various patches I still had in my queue, first two
> are test cases to provide coverage for the recent two fixes that
> went to bpf tree, then a small improvement on the error message
> for gpl helpers. Next, we expose prog and map id into fdinfo in
order to allow for inspection of these objects currently used
> in applications. Patch after that removes a retpoline call for
> map lookup/update/delete helpers. A new helper is added in the
> subsequent patch to lookup the skb's socket's cgroup v2 id which
> can be used in an efficient way for e.g. lookups on egress side.
> Next one is a fix to fully clear state info in tunnel/xfrm helpers.
> Given this is full cap_sys_admin from init ns and has same priv
> requirements like tracing, bpf-next should be okay. A small bug
> fix for bpf_asm follows, and next a fix for context access in
> tracing which was recently reported. Lastly, a small update in
> the maintainer's file to add patchwork url and missing files.
> 
> Thanks!
> 
> v2 -> v3:
>   - Noticed a merge artefact inside uapi header comment, sigh,
> fixed now.
> v1 -> v2:
>   - minor fix in getting context access to work on 32 bit for tracing
>   - add paragraph to uapi helper doc to better describe kernel
> build deps for cgroup helper

Applied, Thanks Daniel.
fixed up commit log s/bpftool p d x i/bpftool prog dump xlated id/
while applying, since it was indeed a bit cryptic.



Re: [PATCH] vlan: use non-archaic spelling of failes

2018-06-03 Thread David Miller
From: Thadeu Lima de Souza Cascardo 
Date: Thu, 31 May 2018 09:20:20 -0300

> Signed-off-by: Thadeu Lima de Souza Cascardo 

Applied.


Re: [PATCH v2 net] mlx4_core: restore optimal ICM memory allocation

2018-06-03 Thread David Miller
From: Eric Dumazet 
Date: Thu, 31 May 2018 05:52:24 -0700

> Commit 1383cb8103bb ("mlx4_core: allocate ICM memory in page size chunks")
> brought two regressions caught in our regression suite.
> 
> The big one is an additional cost of 256 bytes of overhead per 4096 bytes,
> or 6.25 % which is unacceptable since ICM can be pretty large.
> 
> This comes from having to allocate one struct mlx4_icm_chunk (256 bytes)
> per MLX4_TABLE_CHUNK, which the buggy commit shrank to 4KB
> (instead of prior 256KB)
> 
> Note that mlx4_alloc_icm() is already able to try high order allocations
> and fallback to low-order allocations under high memory pressure.
> 
> Most of these allocations happen right after boot time, when we get
> plenty of non-fragmented memory; there is really no point in being so
> pessimistic and breaking huge pages into order-0 ones just for fun.
> 
> We only have to tweak gfp_mask a bit, to help falling back faster,
> without risking OOM killings.
> 
> Second regression is a KASAN fault that will need further investigation.
> 
> Fixes: 1383cb8103bb ("mlx4_core: allocate ICM memory in page size chunks")
> Signed-off-by: Eric Dumazet 
> Acked-by: Tariq Toukan 

Applied, thanks Eric.


Re: [PATCH net] net: ipv6: prevent use after free in ip6_route_mpath_notify()

2018-06-03 Thread David Ahern
On 6/3/18 8:31 AM, Eric Dumazet wrote:
> 
> 
> On 06/03/2018 07:01 AM, David Ahern wrote:
>> On 6/3/18 7:35 AM, Eric Dumazet wrote:
>>> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
>>> index f4d61736c41abe8cd7f439c4a37100e90c1eacca..830eefdbdb6734eb81ea0322fb6077ee20be1889 100644
>>> --- a/net/ipv6/route.c
>>> +++ b/net/ipv6/route.c
>>> @@ -4263,7 +4263,9 @@ static int ip6_route_multipath_add(struct fib6_config 
>>> *cfg,
>>>  
>>> err_nh = NULL;
>>> list_for_each_entry(nh, &rt6_nh_list, next) {
>>> +   dst_release(&rt_last->dst);
>>> rt_last = nh->rt6_info;
>>> +   dst_hold(&rt_last->dst);
>>> err = __ip6_ins_rt(nh->rt6_info, info, &nh->mxc, extack);
>>> /* save reference to first route for notification */
>>> if (!rt_notif && !err)
>>> @@ -4317,7 +4319,7 @@ static int ip6_route_multipath_add(struct fib6_config 
>>> *cfg,
>>> list_del(&nh->next);
>>> kfree(nh);
>>> }
>>> -
>>> +   dst_release(&rt_last->dst);
>>> return err;
>>>  }
>>
>> Since the rtnl lock is held, a successfully inserted route can not be
>> removed until ip6_route_multipath_add finishes. This is a simpler change
>> that works with net-next as well:
> 
> Your patch changes the intent of your original commit.
> 
> It seems you wanted rt_last to point to the last attempted insertion,
> not the last successful one ?

The note in ip6_route_mpath_notify explains it:

/* if this is an APPEND route, then rt points to the first route
 * inserted and rt_last points to last route inserted. Userspace

> 
> Or have I misunderstood, and not only did we have a use-after-free, but
> also a semantic error?

It was a mistake to set rt_last before checking err. So the
use-after-free exposed the semantic error.


Re: [PATCH net] net: ipv6: prevent use after free in ip6_route_mpath_notify()

2018-06-03 Thread David Ahern
On 6/3/18 8:01 AM, David Ahern wrote:
> Is there a reproducer for the syzbot case?

One reproducer is to insert a route and then add a multipath route that
has a duplicate nexthop, e.g.:

ip -6 ro add vrf red 2001:db8:101::/64 nexthop via 2001:db8:1::2

ip -6 ro append vrf red 2001:db8:101::/64 nexthop via 2001:db8:1::4
nexthop via 2001:db8:1::2

Current net and net-next generate the trace; with the fix I proposed I
don't see it on either branch, and I do see the expected notifications to
userspace.


Re: [PATCH net-next] net/smc: fix error return code in smc_setsockopt()

2018-06-03 Thread David Miller
From: Wei Yongjun 
Date: Thu, 31 May 2018 02:31:22 +

> Fix to return error code -EINVAL instead of 0 if optlen is invalid.
> 
> Fixes: 01d2f7e2cdd3 ("net/smc: sockopts TCP_NODELAY and TCP_CORK")
> Signed-off-by: Wei Yongjun 

Although the TCP code should be checking this in the previous lines,
it's not good practice to depend so tightly upon that.

And it makes this code easier to audit if the check exists here
explicitly too.

So I'll apply this, thanks.


Re: [PATCH net-next] net/mlx5: Make function mlx5_fpga_tls_send_teardown_cmd() static

2018-06-03 Thread David Miller
From: Wei Yongjun 
Date: Thu, 31 May 2018 02:31:12 +

> Fixes the following sparse warning:
> 
> drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.c:199:6: warning:
>  symbol 'mlx5_fpga_tls_send_teardown_cmd' was not declared. Should it be 
> static?
> 
> Signed-off-by: Wei Yongjun 

Applied.


Re: [PATCH net-next] hv_netvsc: fix error return code in netvsc_probe()

2018-06-03 Thread David Miller
From: Wei Yongjun 
Date: Thu, 31 May 2018 02:04:43 +

> Fix to return a negative error code from the failover register fail
> error handling case instead of 0, as done elsewhere in this function.
> 
> Fixes: 1ff78076d8dd ("netvsc: refactor notifier/event handling code to use 
> the failover framework")
> Signed-off-by: Wei Yongjun 

Applied, thank you.


Re: [PATCH net] vrf: check the original netdevice for generating redirect

2018-06-03 Thread David Miller
From: Stephen Suryaputra 
Date: Fri,  1 Jun 2018 00:05:21 -0400

> Use the right device to determine whether a redirect should be sent,
> especially when using VRF, and likewise when sending the redirect.
> 
> Signed-off-by: Stephen Suryaputra 

David A., please review.


Re: [PATCH net-next] net: phy: consider PHY_IGNORE_INTERRUPT in state machine PHY_NOLINK handling

2018-06-03 Thread David Miller
From: Heiner Kallweit 
Date: Wed, 30 May 2018 22:13:20 +0200

> We can also bail out immediately in the case of PHY_IGNORE_INTERRUPT,
> because phy_mac_interrupt() informs us once the link is up.
> 
> Signed-off-by: Heiner Kallweit 

Applied, thanks.


Re: [PATCH v5 net] stmmac: 802.1ad tag stripping fix

2018-06-03 Thread David Miller
From: Elad Nachman 
Date: Wed, 30 May 2018 08:48:25 +0300

>  static void stmmac_rx_vlan(struct net_device *dev, struct sk_buff *skb)
>  {
> - struct ethhdr *ehdr;
> + struct vlan_ethhdr *veth;
>   u16 vlanid;
> + __be16 vlan_proto;

Please order local variables from longest to shortest line.

>  
> - if ((dev->features & NETIF_F_HW_VLAN_CTAG_RX) ==
> - NETIF_F_HW_VLAN_CTAG_RX &&
> - !__vlan_get_tag(skb, &vlanid)) {
> + if (!__vlan_get_tag(skb, &vlanid)) {
>   /* pop the vlan tag */
> - ehdr = (struct ethhdr *)skb->data;
> - memmove(skb->data + VLAN_HLEN, ehdr, ETH_ALEN * 2);
> + veth = (struct vlan_ethhdr *)skb->data;
> + vlan_proto = veth->h_vlan_proto;
> + memmove(skb->data + VLAN_HLEN, veth, ETH_ALEN * 2);
>   skb_pull(skb, VLAN_HLEN);
> - __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), vlanid);
> + __vlan_hwaccel_put_tag(skb, vlan_proto, vlanid);
>   }
>  }

I can't see how it is valid to do an unconditional software VLAN
untagging even when VLAN is disabled in the kernel config or the
NETIF_F_* feature bits are not set.

At a minimum that feature test has to stay there, and when it's clear
we let the generic VLAN code untag the packet.


Re: [PATCH net] net: ipv6: prevent use after free in ip6_route_mpath_notify()

2018-06-03 Thread Eric Dumazet



On 06/03/2018 07:01 AM, David Ahern wrote:
> On 6/3/18 7:35 AM, Eric Dumazet wrote:
>> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
>> index 
>> f4d61736c41abe8cd7f439c4a37100e90c1eacca..830eefdbdb6734eb81ea0322fb6077ee20be1889
>>  100644
>> --- a/net/ipv6/route.c
>> +++ b/net/ipv6/route.c
>> @@ -4263,7 +4263,9 @@ static int ip6_route_multipath_add(struct fib6_config 
>> *cfg,
>>  
>>  err_nh = NULL;
>>  list_for_each_entry(nh, &rt6_nh_list, next) {
>> +dst_release(&rt_last->dst);
>>  rt_last = nh->rt6_info;
>> +dst_hold(&rt_last->dst);
>>  err = __ip6_ins_rt(nh->rt6_info, info, &nh->mxc, extack);
>>  /* save reference to first route for notification */
>>  if (!rt_notif && !err)
>> @@ -4317,7 +4319,7 @@ static int ip6_route_multipath_add(struct fib6_config 
>> *cfg,
>>  list_del(&nh->next);
>>  kfree(nh);
>>  }
>> -
>> +dst_release(&rt_last->dst);
>>  return err;
>>  }
> 
> Since the rtnl lock is held, a successfully inserted route can not be
> removed until ip6_route_multipath_add finishes. This is a simpler change
> that works with net-next as well:

Your patch changes the intent of your original commit.

It seems you wanted rt_last to point to the last attempted insertion,
not the last successful one ?

Or have I misunderstood, and not only did we have a use-after-free, but
also a semantic error?


> 
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index f4d61736c41a..1684197c189f 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -4263,11 +4263,12 @@ static int ip6_route_multipath_add(struct
> fib6_config *cfg,
> 
> err_nh = NULL;
> list_for_each_entry(nh, &rt6_nh_list, next) {
> -   rt_last = nh->rt6_info;
> err = __ip6_ins_rt(nh->rt6_info, info, &nh->mxc, extack);
> /* save reference to first route for notification */
> if (!rt_notif && !err)
> rt_notif = nh->rt6_info;
> +   if (!err)
> +   rt_last = nh->rt6_info;
> 
> /* nh->rt6_info is used or freed at this point, reset to
> NULL*/
> nh->rt6_info = NULL;
> 
> 
> Is there a reproducer for the syzbot case?

Not yet.



Re: [PATCH net] net: ipv6: prevent use after free in ip6_route_mpath_notify()

2018-06-03 Thread David Ahern
On 6/3/18 7:35 AM, Eric Dumazet wrote:
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index 
> f4d61736c41abe8cd7f439c4a37100e90c1eacca..830eefdbdb6734eb81ea0322fb6077ee20be1889
>  100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -4263,7 +4263,9 @@ static int ip6_route_multipath_add(struct fib6_config 
> *cfg,
>  
>   err_nh = NULL;
>   list_for_each_entry(nh, &rt6_nh_list, next) {
> + dst_release(&rt_last->dst);
>   rt_last = nh->rt6_info;
> + dst_hold(&rt_last->dst);
>   err = __ip6_ins_rt(nh->rt6_info, info, &nh->mxc, extack);
>   /* save reference to first route for notification */
>   if (!rt_notif && !err)
> @@ -4317,7 +4319,7 @@ static int ip6_route_multipath_add(struct fib6_config 
> *cfg,
>   list_del(&nh->next);
>   kfree(nh);
>   }
> -
> + dst_release(&rt_last->dst);
>   return err;
>  }

Since the rtnl lock is held, a successfully inserted route can not be
removed until ip6_route_multipath_add finishes. This is a simpler change
that works with net-next as well:

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index f4d61736c41a..1684197c189f 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -4263,11 +4263,12 @@ static int ip6_route_multipath_add(struct
fib6_config *cfg,

err_nh = NULL;
list_for_each_entry(nh, &rt6_nh_list, next) {
-   rt_last = nh->rt6_info;
err = __ip6_ins_rt(nh->rt6_info, info, &nh->mxc, extack);
/* save reference to first route for notification */
if (!rt_notif && !err)
rt_notif = nh->rt6_info;
+   if (!err)
+   rt_last = nh->rt6_info;

/* nh->rt6_info is used or freed at this point, reset to
NULL*/
nh->rt6_info = NULL;


Is there a reproducer for the syzbot case?


[PATCH net] net: ipv6: prevent use after free in ip6_route_mpath_notify()

2018-06-03 Thread Eric Dumazet
syzbot reported a use-after-free [1]

Issue here is that rt_last might have been freed already.
We need to grab a refcount on it to prevent this.

[1]
BUG: KASAN: use-after-free in ip6_route_mpath_notify+0xe9/0x100 
net/ipv6/route.c:4180
Read of size 4 at addr 8801bf789cf0 by task syz-executor756/4555

CPU: 1 PID: 4555 Comm: syz-executor756 Not tainted 4.17.0-rc7+ #78
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1b9/0x294 lib/dump_stack.c:113
 print_address_description+0x6c/0x20b mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
 __asan_report_load4_noabort+0x14/0x20 mm/kasan/report.c:432
 ip6_route_mpath_notify+0xe9/0x100 net/ipv6/route.c:4180
 ip6_route_multipath_add+0x615/0x1910 net/ipv6/route.c:4303
 inet6_rtm_newroute+0xe3/0x160 net/ipv6/route.c:4391
 rtnetlink_rcv_msg+0x466/0xc10 net/core/rtnetlink.c:4646
 netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2448
 rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:4664
 netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
 netlink_unicast+0x58b/0x740 net/netlink/af_netlink.c:1336
 netlink_sendmsg+0x9f0/0xfa0 net/netlink/af_netlink.c:1901
 sock_sendmsg_nosec net/socket.c:629 [inline]
 sock_sendmsg+0xd5/0x120 net/socket.c:639
 ___sys_sendmsg+0x805/0x940 net/socket.c:2117
 __sys_sendmsg+0x115/0x270 net/socket.c:2155
 __do_sys_sendmsg net/socket.c:2164 [inline]
 __se_sys_sendmsg net/socket.c:2162 [inline]
 __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2162
 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x441819
RSP: 002b:7ffe841e19d8 EFLAGS: 0217 ORIG_RAX: 002e
RAX: ffda RBX: 0003 RCX: 00441819
RDX:  RSI: 2080 RDI: 0004
RBP: 006cd018 R08:  R09: 
R10:  R11: 0217 R12: 00402510
R13: 004025a0 R14:  R15: 

Allocated by task 4555:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:448
 set_track mm/kasan/kasan.c:460 [inline]
 kasan_kmalloc+0xc4/0xe0 mm/kasan/kasan.c:553
 kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:490
 kmem_cache_alloc+0x12e/0x760 mm/slab.c:3554
 dst_alloc+0xbb/0x1d0 net/core/dst.c:104
 __ip6_dst_alloc+0x35/0xa0 net/ipv6/route.c:361
 ip6_dst_alloc+0x29/0xb0 net/ipv6/route.c:376
 ip6_route_info_create+0x4d4/0x3a30 net/ipv6/route.c:2834
 ip6_route_multipath_add+0xc7e/0x1910 net/ipv6/route.c:4240
 inet6_rtm_newroute+0xe3/0x160 net/ipv6/route.c:4391
 rtnetlink_rcv_msg+0x466/0xc10 net/core/rtnetlink.c:4646
 netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2448
 rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:4664
 netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
 netlink_unicast+0x58b/0x740 net/netlink/af_netlink.c:1336
 netlink_sendmsg+0x9f0/0xfa0 net/netlink/af_netlink.c:1901
 sock_sendmsg_nosec net/socket.c:629 [inline]
 sock_sendmsg+0xd5/0x120 net/socket.c:639
 ___sys_sendmsg+0x805/0x940 net/socket.c:2117
 __sys_sendmsg+0x115/0x270 net/socket.c:2155
 __do_sys_sendmsg net/socket.c:2164 [inline]
 __se_sys_sendmsg net/socket.c:2162 [inline]
 __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2162
 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 4555:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:448
 set_track mm/kasan/kasan.c:460 [inline]
 __kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:521
 kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
 __cache_free mm/slab.c:3498 [inline]
 kmem_cache_free+0x86/0x2d0 mm/slab.c:3756
 dst_destroy+0x267/0x3c0 net/core/dst.c:140
 dst_release_immediate+0x71/0x9e net/core/dst.c:205
 fib6_add+0xa40/0x1650 net/ipv6/ip6_fib.c:1305
 __ip6_ins_rt+0x6c/0x90 net/ipv6/route.c:1011
 ip6_route_multipath_add+0x513/0x1910 net/ipv6/route.c:4267
 inet6_rtm_newroute+0xe3/0x160 net/ipv6/route.c:4391
 rtnetlink_rcv_msg+0x466/0xc10 net/core/rtnetlink.c:4646
 netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2448
 rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:4664
 netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
 netlink_unicast+0x58b/0x740 net/netlink/af_netlink.c:1336
 netlink_sendmsg+0x9f0/0xfa0 net/netlink/af_netlink.c:1901
 sock_sendmsg_nosec net/socket.c:629 [inline]
 sock_sendmsg+0xd5/0x120 net/socket.c:639
 ___sys_sendmsg+0x805/0x940 net/socket.c:2117
 __sys_sendmsg+0x115/0x270 net/socket.c:2155
 __do_sys_sendmsg net/socket.c:2164 [inline]
 __se_sys_sendmsg net/socket.c:2162 [inline]
 __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2162
 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

The buggy address belongs to the object at 8801bf789c40
 which belongs to the cache ip6_dst_cache of size 320
The buggy address is located 176 bytes inside of
 320-byte region 

[PATCH net-next 2/3] tools/bpf: sync uapi bpf.h for bpf_get_current_cgroup_id() helper

2018-06-03 Thread Yonghong Song
Sync kernel uapi/linux/bpf.h with tools uapi/linux/bpf.h.
Also add the necessary helper define in bpf_helpers.h.

Signed-off-by: Yonghong Song 
---
 tools/include/uapi/linux/bpf.h| 9 -
 tools/testing/selftests/bpf/bpf_helpers.h | 2 ++
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 64ac0f7..1108936 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2054,6 +2054,12 @@ union bpf_attr {
  *
  * Return
  * 0
+ *
+ * u64 bpf_get_current_cgroup_id(void)
+ * Return
+ * A 64-bit integer containing the current cgroup id based
+ * on the cgroup within which the current task is running.
+ *
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -2134,7 +2140,8 @@ union bpf_attr {
FN(lwt_seg6_adjust_srh),\
FN(lwt_seg6_action),\
FN(rc_repeat),  \
-   FN(rc_keydown),
+   FN(rc_keydown), \
+   FN(get_current_cgroup_id),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h 
b/tools/testing/selftests/bpf/bpf_helpers.h
index a66a9d9..f2f28b6 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -131,6 +131,8 @@ static int (*bpf_rc_repeat)(void *ctx) =
 static int (*bpf_rc_keydown)(void *ctx, unsigned int protocol,
 unsigned long long scancode, unsigned int toggle) =
(void *) BPF_FUNC_rc_keydown;
+static unsigned long long (*bpf_get_current_cgroup_id)(void) =
+   (void *) BPF_FUNC_get_current_cgroup_id;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
-- 
2.9.5



[PATCH net-next 0/3] bpf: implement bpf_get_current_cgroup_id() helper

2018-06-03 Thread Yonghong Song
bpf has been used extensively for tracing. For example, bcc
contains an almost full set of bpf-based tools to trace kernel
and user functions/events. Most tracing tools are currently
either filtered based on pid or system-wide.

Containers have been used quite extensively in industry and
cgroup is often used together to provide resource isolation
and protection. Several processes may run inside the same
container. It is often desirable to get container-level tracing
results as well, e.g. syscall count, function count, I/O
activity, etc.

This patch implements a new helper, bpf_get_current_cgroup_id(),
which will return cgroup id based on the cgroup within which
the current task is running.

Patch #1 implements the new helper in the kernel.
Patch #2 syncs the uapi bpf.h header and helper between tools
and kernel.
Patch #3 shows how to get the same cgroup id in user space,
so a filter or policy could be configured in the bpf program
based on the current task's cgroup.

Yonghong Song (3):
  bpf: implement bpf_get_current_cgroup_id() helper
  tools/bpf: sync uapi bpf.h for bpf_get_current_cgroup_id() helper
  tools/bpf: add a selftest for bpf_get_current_cgroup_id() helper

 include/linux/bpf.h  |   1 +
 include/uapi/linux/bpf.h |   9 +-
 kernel/bpf/core.c|   1 +
 kernel/bpf/helpers.c |  15 +++
 kernel/trace/bpf_trace.c |   2 +
 tools/include/uapi/linux/bpf.h   |   9 +-
 tools/testing/selftests/bpf/.gitignore   |   1 +
 tools/testing/selftests/bpf/Makefile |   6 +-
 tools/testing/selftests/bpf/bpf_helpers.h|   2 +
 tools/testing/selftests/bpf/cgroup_helpers.c |  57 +
 tools/testing/selftests/bpf/cgroup_helpers.h |   1 +
 tools/testing/selftests/bpf/get_cgroup_id_kern.c |  28 +
 tools/testing/selftests/bpf/get_cgroup_id_user.c | 141 +++
 13 files changed, 269 insertions(+), 4 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/get_cgroup_id_kern.c
 create mode 100644 tools/testing/selftests/bpf/get_cgroup_id_user.c

-- 
2.9.5



[PATCH net-next 3/3] tools/bpf: add a selftest for bpf_get_current_cgroup_id() helper

2018-06-03 Thread Yonghong Song
Syscall name_to_handle_at() can be used to get the cgroup id
for a particular cgroup path in user space. The selftest gets the
cgroup id from both user space and the kernel, and compares the two
to ensure they are equal.

Signed-off-by: Yonghong Song 
---
 tools/testing/selftests/bpf/.gitignore   |   1 +
 tools/testing/selftests/bpf/Makefile |   6 +-
 tools/testing/selftests/bpf/cgroup_helpers.c |  57 +
 tools/testing/selftests/bpf/cgroup_helpers.h |   1 +
 tools/testing/selftests/bpf/get_cgroup_id_kern.c |  28 +
 tools/testing/selftests/bpf/get_cgroup_id_user.c | 141 +++
 6 files changed, 232 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/get_cgroup_id_kern.c
 create mode 100644 tools/testing/selftests/bpf/get_cgroup_id_user.c

diff --git a/tools/testing/selftests/bpf/.gitignore 
b/tools/testing/selftests/bpf/.gitignore
index 6ea8359..49938d7 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -18,3 +18,4 @@ urandom_read
 test_btf
 test_sockmap
 test_lirc_mode2_user
+get_cgroup_id_user
diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index 553d181..607ed87 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -24,7 +24,7 @@ urandom_read: urandom_read.c
 # Order correspond to 'make run_tests' order
 TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map 
test_progs \
test_align test_verifier_log test_dev_cgroup test_tcpbpf_user \
-   test_sock test_btf test_sockmap test_lirc_mode2_user
+   test_sock test_btf test_sockmap test_lirc_mode2_user get_cgroup_id_user
 
 TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o 
test_obj_id.o \
test_pkt_md_access.o test_xdp_redirect.o test_xdp_meta.o 
sockmap_parse_prog.o \
@@ -34,7 +34,8 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o 
test_tcp_estats.o test
sockmap_tcp_msg_prog.o connect4_prog.o connect6_prog.o 
test_adjust_tail.o \
test_btf_haskv.o test_btf_nokv.o test_sockmap_kern.o test_tunnel_kern.o 
\
test_get_stack_rawtp.o test_sockmap_kern.o test_sockhash_kern.o \
-   test_lwt_seg6local.o sendmsg4_prog.o sendmsg6_prog.o 
test_lirc_mode2_kern.o
+   test_lwt_seg6local.o sendmsg4_prog.o sendmsg6_prog.o 
test_lirc_mode2_kern.o \
+   get_cgroup_id_kern.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
@@ -63,6 +64,7 @@ $(OUTPUT)/test_sock: cgroup_helpers.c
 $(OUTPUT)/test_sock_addr: cgroup_helpers.c
 $(OUTPUT)/test_sockmap: cgroup_helpers.c
 $(OUTPUT)/test_progs: trace_helpers.c
+$(OUTPUT)/get_cgroup_id_user: cgroup_helpers.c
 
 .PHONY: force
 
diff --git a/tools/testing/selftests/bpf/cgroup_helpers.c 
b/tools/testing/selftests/bpf/cgroup_helpers.c
index f3bca3a..c87b4e0 100644
--- a/tools/testing/selftests/bpf/cgroup_helpers.c
+++ b/tools/testing/selftests/bpf/cgroup_helpers.c
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -176,3 +177,59 @@ int create_and_get_cgroup(char *path)
 
return fd;
 }
+
+/**
+ * get_cgroup_id() - Get cgroup id for a particular cgroup path
+ * @path: The cgroup path, relative to the workdir, to join
+ *
+ * On success, it returns the cgroup id. On failure it returns 0,
+ * which is an invalid cgroup id.
+ * If there is a failure, it prints the error to stderr.
+ */
+unsigned long long get_cgroup_id(char *path)
+{
+   int dirfd, err, flags, mount_id, fhsize;
+   union {
+   unsigned long long cgid;
+   unsigned char raw_bytes[8];
+   } id;
+   char cgroup_workdir[PATH_MAX + 1];
+   struct file_handle *fhp, *fhp2;
+   unsigned long long ret = 0;
+
+   format_cgroup_path(cgroup_workdir, path);
+
+   dirfd = AT_FDCWD;
+   flags = 0;
+   fhsize = sizeof(*fhp);
+   fhp = calloc(1, fhsize);
+   if (!fhp) {
+   log_err("calloc");
+   return 0;
+   }
+   err = name_to_handle_at(dirfd, cgroup_workdir, fhp, &mount_id, flags);
+   if (err >= 0 || fhp->handle_bytes != 8) {
+   log_err("name_to_handle_at");
+   goto free_mem;
+   }
+
+   fhsize = sizeof(struct file_handle) + fhp->handle_bytes;
+   fhp2 = realloc(fhp, fhsize);
+   if (!fhp2) {
+   log_err("realloc");
+   goto free_mem;
+   }
+   err = name_to_handle_at(dirfd, cgroup_workdir, fhp2, &mount_id, flags);
+   fhp = fhp2;
+   if (err < 0) {
+   log_err("name_to_handle_at");
+   goto free_mem;
+   }
+
+   memcpy(id.raw_bytes, fhp->f_handle, 8);
+   ret = id.cgid;
+
+free_mem:
+   free(fhp);
+   return ret;
+}
diff --git a/tools/testing/selftests/bpf/cgroup_helpers.h 
b/tools/testing/selftests/bpf/cgroup_helpers.h
index 06485e0..20a4a5d 100644
--- 

[PATCH net-next 1/3] bpf: implement bpf_get_current_cgroup_id() helper

2018-06-03 Thread Yonghong Song
bpf has been used extensively for tracing. For example, bcc
contains an almost full set of bpf-based tools to trace kernel
and user functions/events. Most tracing tools are currently
either filtered based on pid or system-wide.

Containers have been used quite extensively in industry and
cgroup is often used together to provide resource isolation
and protection. Several processes may run inside the same
container. It is often desirable to get container-level tracing
results as well, e.g. syscall count, function count, I/O
activity, etc.

This patch implements a new helper, bpf_get_current_cgroup_id(),
which will return cgroup id based on the cgroup within which
the current task is running.

A later patch will provide an example showing that
user space can obtain the same cgroup id, so it can
configure a filter or policy in the bpf program based on
the task's cgroup id.

The helper is currently implemented for tracing. It can
be added to other program types as well when needed.

Signed-off-by: Yonghong Song 
---
 include/linux/bpf.h  |  1 +
 include/uapi/linux/bpf.h |  9 -
 kernel/bpf/core.c|  1 +
 kernel/bpf/helpers.c | 15 +++
 kernel/trace/bpf_trace.c |  2 ++
 5 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index bbe2974..995c3b1 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -746,6 +746,7 @@ extern const struct bpf_func_proto bpf_get_stackid_proto;
 extern const struct bpf_func_proto bpf_get_stack_proto;
 extern const struct bpf_func_proto bpf_sock_map_update_proto;
 extern const struct bpf_func_proto bpf_sock_hash_update_proto;
+extern const struct bpf_func_proto bpf_get_current_cgroup_id_proto;
 
 /* Shared helpers among cBPF and eBPF. */
 void bpf_user_rnd_init_once(void);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 64ac0f7..1108936 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2054,6 +2054,12 @@ union bpf_attr {
  *
  * Return
  * 0
+ *
+ * u64 bpf_get_current_cgroup_id(void)
+ * Return
+ * A 64-bit integer containing the current cgroup id based
+ * on the cgroup within which the current task is running.
+ *
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -2134,7 +2140,8 @@ union bpf_attr {
FN(lwt_seg6_adjust_srh),\
FN(lwt_seg6_action),\
FN(rc_repeat),  \
-   FN(rc_keydown),
+   FN(rc_keydown), \
+   FN(get_current_cgroup_id),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 527587d..9f14937 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1765,6 +1765,7 @@ const struct bpf_func_proto bpf_get_current_uid_gid_proto 
__weak;
 const struct bpf_func_proto bpf_get_current_comm_proto __weak;
 const struct bpf_func_proto bpf_sock_map_update_proto __weak;
 const struct bpf_func_proto bpf_sock_hash_update_proto __weak;
+const struct bpf_func_proto bpf_get_current_cgroup_id_proto __weak;
 
 const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
 {
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 3d24e23..73065e2 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -179,3 +179,18 @@ const struct bpf_func_proto bpf_get_current_comm_proto = {
.arg1_type  = ARG_PTR_TO_UNINIT_MEM,
.arg2_type  = ARG_CONST_SIZE,
 };
+
+#ifdef CONFIG_CGROUPS
+BPF_CALL_0(bpf_get_current_cgroup_id)
+{
+   struct cgroup *cgrp = task_dfl_cgroup(current);
+
+   return cgrp->kn->id.id;
+}
+
+const struct bpf_func_proto bpf_get_current_cgroup_id_proto = {
+   .func   = bpf_get_current_cgroup_id,
+   .gpl_only   = false,
+   .ret_type   = RET_INTEGER,
+};
+#endif
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index af1486d..6e4ade7 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -564,6 +564,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct 
bpf_prog *prog)
return &bpf_get_prandom_u32_proto;
case BPF_FUNC_probe_read_str:
return &bpf_probe_read_str_proto;
+   case BPF_FUNC_get_current_cgroup_id:
+   return &bpf_get_current_cgroup_id_proto;
default:
return NULL;
}
-- 
2.9.5



[PATCH net-next V2 2/2] cls_flower: Fix comparing of old filter mask with new filter

2018-06-03 Thread Paul Blakey
We incorrectly compare the masks, and as a result we can't modify
an already existing rule.

Fix that by comparing correctly.

Fixes: 05cd271fd61a ("cls_flower: Support multiple masks per priority")
Reported-by: Vlad Buslov 
Reviewed-by: Roi Dayan 
Reviewed-by: Jiri Pirko 
Signed-off-by: Paul Blakey 
---

Changelog: v0 -> v2: rebased.

 net/sched/cls_flower.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 159efd9..2b5be42 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -877,7 +877,7 @@ static int fl_check_assign_mask(struct cls_fl_head *head,
return PTR_ERR(newmask);
 
fnew->mask = newmask;
-   } else if (fold && fold->mask == fnew->mask) {
+   } else if (fold && fold->mask != fnew->mask) {
return -EINVAL;
}
 
-- 
2.7.4



[PATCH net-next V2 1/2] cls_flower: Fix missing free of rhashtable

2018-06-03 Thread Paul Blakey
When destroying the instance, destroy the head rhashtable.

Fixes: 05cd271fd61a ("cls_flower: Support multiple masks per priority")
Reported-by: Vlad Buslov 
Reviewed-by: Roi Dayan 
Reviewed-by: Jiri Pirko 
Signed-off-by: Paul Blakey 
---

Changelog: v0 -> v2: rebased.

 net/sched/cls_flower.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 3786fea..159efd9 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -326,6 +326,8 @@ static void fl_destroy_sleepable(struct work_struct *work)
struct cls_fl_head *head = container_of(to_rcu_work(work),
struct cls_fl_head,
rwork);
+
+   rhashtable_destroy(&head->ht);
kfree(head);
module_put(THIS_MODULE);
 }
-- 
2.7.4



Re: [PATCH bpf-next v3 05/11] bpf: avoid retpoline for lookup/update/delete calls on maps

2018-06-03 Thread Jesper Dangaard Brouer
On Sat,  2 Jun 2018 23:06:35 +0200
Daniel Borkmann  wrote:

> Before:
> 
>   # bpftool p d x i 1

Could this please be changed to:

 # bpftool prog dump xlated id 1

I requested this before, but you seem to have missed my feedback...
This makes the command "self-documenting" and searchable by Google.


> 0: (bf) r2 = r10
> 1: (07) r2 += -8
> 2: (7a) *(u64 *)(r2 +0) = 0
> 3: (18) r1 = map[id:1]
> 5: (85) call __htab_map_lookup_elem#232656
> 6: (15) if r0 == 0x0 goto pc+4
> 7: (71) r1 = *(u8 *)(r0 +35)
> 8: (55) if r1 != 0x0 goto pc+1
> 9: (72) *(u8 *)(r0 +35) = 1
>10: (07) r0 += 56
>11: (15) if r0 == 0x0 goto pc+4
>12: (bf) r2 = r0
>13: (18) r1 = map[id:1]
>15: (85) call bpf_map_delete_elem#215008  <-- indirect call via helper
>16: (95) exit
> 



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer