Re: UDP packets arriving on wrong sockets

2018-08-02 Thread Andrew Cann
On Thu, Aug 02, 2018 at 11:21:41AM -0400, Willem de Bruijn wrote:
> You have two sockets bound to the same address and port? Is this using
> SO_REUSEPORT?

Yes, this is using SO_REUSEPORT.

My colleague wrote a python reproducer for this here:
https://gist.github.com/povilasb/53f1c802dbc2aca36a0ffa5b4cb95536

If you run server.py, then client.py, you should see packets arriving at the
opposite sockets about half the time. My kernel version is 4.14.51 (NixOS); my
colleague tested on two other machines, a Debian box on 4.9.0 and a Fedora 28
box on 4.17.2. We can reproduce this on every kernel we tested.
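
For readers who don't have the gist handy, a rough C equivalent of the
reproducer (an untested sketch: the loopback address, port 5000 and the
omitted error handling are mine, not taken from the gist):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

static int make_sock(void)
{
	struct sockaddr_in addr;
	int one = 1;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	/* Both sockets set SO_REUSEPORT and bind the same addr:port. */
	setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	addr.sin_port = htons(5000);
	bind(fd, (struct sockaddr *)&addr, sizeof(addr));
	return fd;
}

int main(void)
{
	struct pollfd pfd[2] = {
		{ .fd = make_sock(), .events = POLLIN },
		{ .fd = make_sock(), .events = POLLIN },
	};
	char buf[64];
	int i;

	/* Point client.py (or any UDP sender) at 127.0.0.1:5000 and
	 * watch which of the two sockets each datagram lands on.
	 */
	for (;;) {
		poll(pfd, 2, -1);
		for (i = 0; i < 2; i++)
			if (pfd[i].revents & POLLIN) {
				recv(pfd[i].fd, buf, sizeof(buf), 0);
				printf("datagram on socket %d\n", i);
			}
	}
}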

Thanks in advance for any help you can give :)

 - Andrew





Re: [PATCH net-next] tools: bpf: fix BTF code added twice to different trees

2018-08-02 Thread David Miller
From: Jakub Kicinski 
Date: Thu,  2 Aug 2018 19:30:27 -0700

> commit 38d5d3b3d5db ("bpf: Introduce BPF_ANNOTATE_KV_PAIR")
> 
> added to the bpf and net trees what
> 
> commit 92b57121ca79 ("bpf: btf: export btf types and name by offset from lib")
> 
> has already added to bpf-next/net-next, but in a slightly different
> location.  Remove the duplicates (to fix the build of libbpf).
> 
> Signed-off-by: Jakub Kicinski 

Aha, you beat me to it.

Applied, thanks!


[PATCH net-next] tools: bpf: fix BTF code added twice to different trees

2018-08-02 Thread Jakub Kicinski
commit 38d5d3b3d5db ("bpf: Introduce BPF_ANNOTATE_KV_PAIR")

added to the bpf and net trees what

commit 92b57121ca79 ("bpf: btf: export btf types and name by offset from lib")

has already added to bpf-next/net-next, but in a slightly different
location.  Remove the duplicates (to fix the build of libbpf).

Signed-off-by: Jakub Kicinski 
---
 tools/lib/bpf/btf.c | 17 -----------------
 tools/lib/bpf/btf.h |  1 -
 2 files changed, 18 deletions(-)

diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c
index 1622a309f169..09ecf8162f7a 100644
--- a/tools/lib/bpf/btf.c
+++ b/tools/lib/bpf/btf.c
@@ -37,14 +37,6 @@ struct btf {
int fd;
 };
 
-static const char *btf_name_by_offset(const struct btf *btf, __u32 offset)
-{
-   if (offset < btf->hdr->str_len)
-   return &btf->strings[offset];
-   else
-   return NULL;
-}
-
 static int btf_add_type(struct btf *btf, struct btf_type *t)
 {
if (btf->types_size - btf->nr_types < 2) {
@@ -401,12 +393,3 @@ const char *btf__name_by_offset(const struct btf *btf, __u32 offset)
else
return NULL;
 }
-
-const struct btf_type *btf__type_by_id(const struct btf *btf,
-  __u32 type_id)
-{
-   if (type_id > btf->nr_types)
-   return NULL;
-
-   return btf->types[type_id];
-}
diff --git a/tools/lib/bpf/btf.h b/tools/lib/bpf/btf.h
index dd8a86eab8ca..43c658ccfc2b 100644
--- a/tools/lib/bpf/btf.h
+++ b/tools/lib/bpf/btf.h
@@ -22,6 +22,5 @@ __s64 btf__resolve_size(const struct btf *btf, __u32 type_id);
 int btf__resolve_type(const struct btf *btf, __u32 type_id);
 int btf__fd(const struct btf *btf);
 const char *btf__name_by_offset(const struct btf *btf, __u32 offset);
-const struct btf_type *btf__type_by_id(const struct btf *btf, __u32 type_id);
 
 #endif
-- 
2.17.1



Re: [PATCH rdma-next 00/10] IPoIB uninit

2018-08-02 Thread Jason Gunthorpe
On Sun, Jul 29, 2018 at 11:34:50AM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky 
> 
> IP link was broken due to the changes in IPoIB for the rdma_netdev
> support after commit cd565b4b51e5 ("IB/IPoIB: Support acceleration options 
> callbacks").
> 
> This patchset restores IPoIB pkey creation and removal using rtnetlink.
> 
> It is a completely rewritten variant of the
> https://marc.info/?l=linux-rdma&m=151553425830918&w=2 patch series.
> 
> Thanks
> 
> Erez Shitrit (2):
>   IB/ipoib: Use cancel_delayed_work_sync for neigh-clean task
>   IB/ipoib: Make ipoib_neigh_hash_uninit fully fence its work
> 
> Jason Gunthorpe (8):
>   IB/ipoib: Get rid of IPOIB_FLAG_GOING_DOWN
>   IB/ipoib: Move all uninit code into ndo_uninit
>   IB/ipoib: Move init code to ndo_init
>   RDMA/netdev: Use priv_destructor for netdev cleanup
>   IB/ipoib: Get rid of the sysfs_mutex
>   IB/ipoib: Do not remove child devices from within the ndo_uninit
>   IB/ipoib: Maintain the child_intfs list from ndo_init/uninit
>   IB/ipoib: Consolidate checking of the proposed child interface

Applied to for-next, finally this bug is fixed..

I squashed these two patches though:

   IB/ipoib: Use cancel_delayed_work_sync for neigh-clean task
   IB/ipoib: Make ipoib_neigh_hash_uninit fully fence its work

Thanks,
Jason


RE: [PATCH net-next] net/tls: Calculate nsg for zerocopy path without skb_cow_data.

2018-08-02 Thread Vakul Garg



> -Original Message-
> From: Doron Roberts-Kedes [mailto:doro...@fb.com]
> Sent: Friday, August 3, 2018 6:00 AM
> To: David S . Miller 
> Cc: Dave Watson ; Vakul Garg
> ; Boris Pismenny ; Aviad
> Yehezkel ; netdev@vger.kernel.org; Doron
> Roberts-Kedes 
> Subject: [PATCH net-next] net/tls: Calculate nsg for zerocopy path without
> skb_cow_data.
> 
> decrypt_skb fails if the number of sg elements required to map is greater
> than MAX_SKB_FRAGS. As noted by Vakul Garg, nsg must always be
> calculated, but skb_cow_data adds unnecessary memcpy's for the zerocopy
> case.
> 
> The new function skb_nsg calculates the number of scatterlist elements
> required to map the skb without the extra overhead of skb_cow_data. This
> function mimics the structure of skb_to_sgvec.
> 
> Fixes: c46234ebb4d1 ("tls: RX path for ktls")
> Signed-off-by: Doron Roberts-Kedes 
> ---
>  net/tls/tls_sw.c | 89 ++--
>  1 file changed, 86 insertions(+), 3 deletions(-)
> 
> diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
> index ff3a6904a722..c62793601cfc 100644
> --- a/net/tls/tls_sw.c
> +++ b/net/tls/tls_sw.c
> @@ -43,6 +43,76 @@
> 
>  #define MAX_IV_SIZE  TLS_CIPHER_AES_GCM_128_IV_SIZE
> 
> +static int __skb_nsg(struct sk_buff *skb, int offset, int len,
> +unsigned int recursion_level)
> +{
> +int start = skb_headlen(skb);
> +int i, copy = start - offset;
> +struct sk_buff *frag_iter;
> +int elt = 0;
> +
> +if (unlikely(recursion_level >= 24))
> +return -EMSGSIZE;
> +
> +if (copy > 0) {
> +if (copy > len)
> +copy = len;
> +elt++;
> +if ((len -= copy) == 0)
> +return elt;
> +offset += copy;
> +}
> +
> +for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
> +int end;
> +
> +WARN_ON(start > offset + len);
> +
> +end = start + skb_frag_size(&skb_shinfo(skb)->frags[i]);
> +if ((copy = end - offset) > 0) {
> +if (copy > len)
> +copy = len;
> +elt++;
> +if (!(len -= copy))
> +return elt;
> +offset += copy;
> +}
> +start = end;
> +}
> +
> +skb_walk_frags(skb, frag_iter) {
> +int end, ret;
> +
> +WARN_ON(start > offset + len);
> +
> +end = start + frag_iter->len;
> +if ((copy = end - offset) > 0) {
> +
> +if (copy > len)
> +copy = len;
> +ret = __skb_nsg(frag_iter, offset - start, copy,
> + recursion_level + 1);
> +if (unlikely(ret < 0))
> +return ret;
> +elt += ret;
> +if ((len -= copy) == 0)
> +return elt;
> +offset += copy;
> +}
> +start = end;
> +}
> +BUG_ON(len);
> +return elt;
> +}
> +
> +/* Return the number of scatterlist elements required to completely map
> +the
> + * skb, or -EMSGSIZE if the recursion depth is exceeded.
> + */
> +static int skb_nsg(struct sk_buff *skb, int offset, int len) {
> + return __skb_nsg(skb, offset, len, 0); }
> +

These are generic functions and could be useful elsewhere too.
Should the above two functions be exported by skbuff.c?
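
If so, the move could look roughly like this (a sketch only; the placement
next to the other sg helpers and the EXPORT_SYMBOL are my illustration,
not something in the posted patch):

/* net/core/skbuff.c -- __skb_nsg() would stay static there. */
int skb_nsg(struct sk_buff *skb, int offset, int len)
{
	return __skb_nsg(skb, offset, len, 0);
}
EXPORT_SYMBOL(skb_nsg);

/* include/linux/skbuff.h */
int skb_nsg(struct sk_buff *skb, int offset, int len);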

>  static int tls_do_decryption(struct sock *sk,
>struct scatterlist *sgin,
>struct scatterlist *sgout,
> @@ -693,7 +763,7 @@ int decrypt_skb(struct sock *sk, struct sk_buff *skb,
>   struct scatterlist sgin_arr[MAX_SKB_FRAGS + 2];
>   struct scatterlist *sgin = &sgin_arr[0];
>   struct strp_msg *rxm = strp_msg(skb);
> - int ret, nsg = ARRAY_SIZE(sgin_arr);
> + int ret, nsg;
>   struct sk_buff *unused;
> 
>   ret = skb_copy_bits(skb, rxm->offset + TLS_HEADER_SIZE,
> @@ -704,10 +774,23 @@ int decrypt_skb(struct sock *sk, struct sk_buff *skb,
> 
>   memcpy(iv, tls_ctx->rx.iv, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
>   if (!sgout) {
> - nsg = skb_cow_data(skb, 0, &unused) + 1;
> + nsg = skb_cow_data(skb, 0, &unused);
> + } else {
> + nsg = skb_nsg(skb,
> +   rxm->offset + tls_ctx->rx.prepend_size,
> +   rxm->full_len - tls_ctx->rx.prepend_size);
> + if (nsg <= 0)
> + return nsg;
The comparison should be (nsg < 1); TLS forbids zero-sized records.

> + }
> +
> + // We need one extra for ctx->rx_aad_ciphertext
> + nsg++;
> +
> + if (nsg > ARRAY_SIZE(sgin_arr))
>   sgin = kmalloc_array(nsg, sizeof(*sgin), sk->sk_allocation);

[PATCH net-next] net/tls: Calculate nsg for zerocopy path without skb_cow_data.

2018-08-02 Thread Doron Roberts-Kedes
decrypt_skb fails if the number of sg elements required to map is
greater than MAX_SKB_FRAGS. As noted by Vakul Garg, nsg must always be
calculated, but skb_cow_data adds unnecessary memcpy's for the zerocopy
case.

The new function skb_nsg calculates the number of scatterlist elements
required to map the skb without the extra overhead of skb_cow_data. This
function mimics the structure of skb_to_sgvec.

Fixes: c46234ebb4d1 ("tls: RX path for ktls")
Signed-off-by: Doron Roberts-Kedes 
---
 net/tls/tls_sw.c | 89 ++--
 1 file changed, 86 insertions(+), 3 deletions(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index ff3a6904a722..c62793601cfc 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -43,6 +43,76 @@
 
 #define MAX_IV_SIZE	TLS_CIPHER_AES_GCM_128_IV_SIZE
 
+static int __skb_nsg(struct sk_buff *skb, int offset, int len,
+  unsigned int recursion_level)
+{
+int start = skb_headlen(skb);
+int i, copy = start - offset;
+struct sk_buff *frag_iter;
+int elt = 0;
+
+if (unlikely(recursion_level >= 24))
+return -EMSGSIZE;
+
+if (copy > 0) {
+if (copy > len)
+copy = len;
+elt++;
+if ((len -= copy) == 0)
+return elt;
+offset += copy;
+}
+
+for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+int end;
+
+WARN_ON(start > offset + len);
+
+end = start + skb_frag_size(&skb_shinfo(skb)->frags[i]);
+if ((copy = end - offset) > 0) {
+if (copy > len)
+copy = len;
+elt++;
+if (!(len -= copy))
+return elt;
+offset += copy;
+}
+start = end;
+}
+
+skb_walk_frags(skb, frag_iter) {
+int end, ret;
+
+WARN_ON(start > offset + len);
+
+end = start + frag_iter->len;
+if ((copy = end - offset) > 0) {
+
+if (copy > len)
+copy = len;
+ret = __skb_nsg(frag_iter, offset - start, copy,
+   recursion_level + 1);
+if (unlikely(ret < 0))
+return ret;
+elt += ret;
+if ((len -= copy) == 0)
+return elt;
+offset += copy;
+}
+start = end;
+}
+BUG_ON(len);
+return elt;
+}
+
+/* Return the number of scatterlist elements required to completely map the
+ * skb, or -EMSGSIZE if the recursion depth is exceeded.
+ */
+static int skb_nsg(struct sk_buff *skb, int offset, int len)
+{
+   return __skb_nsg(skb, offset, len, 0);
+}
+
 static int tls_do_decryption(struct sock *sk,
 struct scatterlist *sgin,
 struct scatterlist *sgout,
@@ -693,7 +763,7 @@ int decrypt_skb(struct sock *sk, struct sk_buff *skb,
struct scatterlist sgin_arr[MAX_SKB_FRAGS + 2];
struct scatterlist *sgin = &sgin_arr[0];
struct strp_msg *rxm = strp_msg(skb);
-   int ret, nsg = ARRAY_SIZE(sgin_arr);
+   int ret, nsg;
struct sk_buff *unused;
 
ret = skb_copy_bits(skb, rxm->offset + TLS_HEADER_SIZE,
@@ -704,10 +774,23 @@ int decrypt_skb(struct sock *sk, struct sk_buff *skb,
 
memcpy(iv, tls_ctx->rx.iv, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
if (!sgout) {
-   nsg = skb_cow_data(skb, 0, &unused) + 1;
+   nsg = skb_cow_data(skb, 0, &unused);
+   } else {
+   nsg = skb_nsg(skb,
+ rxm->offset + tls_ctx->rx.prepend_size,
+ rxm->full_len - tls_ctx->rx.prepend_size);
+   if (nsg <= 0)
+   return nsg;
+   }
+
+   // We need one extra for ctx->rx_aad_ciphertext
+   nsg++;
+
+   if (nsg > ARRAY_SIZE(sgin_arr))
sgin = kmalloc_array(nsg, sizeof(*sgin), sk->sk_allocation);
+
+   if (!sgout)
sgout = sgin;
-   }
 
sg_init_table(sgin, nsg);
sg_set_buf(&sgin[0], ctx->rx_aad_ciphertext, TLS_AAD_SPACE_SIZE);
-- 
2.17.1



Re: [PATCH net-next] net/tls: Always get number of sg entries for skb to be decrypted

2018-08-02 Thread Doron Roberts-Kedes
On Thu, Aug 02, 2018 at 09:50:58AM -0700, Dave Watson wrote:
> On 08/02/18 09:50 PM, Vakul Garg wrote:
> > Function decrypt_skb() made a bad assumption that number of sg entries
> > required for mapping skb to be decrypted would always be less than
> > MAX_SKB_FRAGS. The required count of sg entries for skb should always be
> > calculated. If they cannot fit in local array sgin_arr[], allocate them
> > from heap irrespective whether it is zero-copy case or otherwise. The
> > change also benefits the non-zero copy case as we could use sgin_arr[]
> > instead of always allocating sg entries from heap.
> > 
> > Signed-off-by: Vakul Garg 
> 
> Looks great, thanks.
> 
> Reviewed-by: Dave Watson 

I agree that this is a problem, but I'm not sure that this is the best
way to fix it. Calling skb_cow_data unconditionally is expensive. In my
benchmarks this patch caused a 5% CPU regression, all from the memcpy in
skb_cow_data, and a 15% regression in encrypted NBD IOPS. It is possible
to calculate the number of scatterlist elements required to map the skb
without invoking skb_cow_data. I'll have a patch up shortly.


[PATCH net-next v3] ipv6: defrag: drop non-last frags smaller than min mtu

2018-08-02 Thread Florian Westphal
don't bother with pathological cases; they only waste cycles.
IPv6 requires a minimum MTU of 1280 so we should never see fragments
smaller than this (except last frag).

v3: don't use awkward "-offset + len"
v2: drop IPv4 part, which added same check w. IPV4_MIN_MTU (68).
There were concerns that there could be even smaller frags
generated by intermediate nodes, e.g. on radio networks.

Cc: Peter Oskolkov 
Cc: Eric Dumazet 
Signed-off-by: Florian Westphal 
---
 net/ipv6/netfilter/nf_conntrack_reasm.c | 4 ++++
 net/ipv6/reassembly.c                   | 4 ++++
 2 files changed, 8 insertions(+)

diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 0610bdab721c..aed9766559c9 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -557,6 +557,10 @@ int nf_ct_frag6_gather(struct net *net, struct sk_buff *skb, u32 user)
hdr = ipv6_hdr(skb);
fhdr = (struct frag_hdr *)skb_transport_header(skb);
 
+   if (skb->len - skb_network_offset(skb) < IPV6_MIN_MTU &&
+   fhdr->frag_off & htons(IP6_MF))
+   return -EINVAL;
+
skb_orphan(skb);
fq = fq_find(net, fhdr->identification, user, hdr,
 skb->dev ? skb->dev->ifindex : 0);
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 6edd2ac8ae4b..a976b143463b 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -455,6 +455,10 @@ static int ipv6_frag_rcv(struct sk_buff *skb)
return 1;
}
 
+   if (skb->len - skb_network_offset(skb) < IPV6_MIN_MTU &&
+   fhdr->frag_off & htons(IP6_MF))
+   goto fail_hdr;
+
iif = skb->dev ? skb->dev->ifindex : 0;
fq = fq_find(net, fhdr->identification, hdr, iif);
if (fq) {
-- 
2.16.4



Re: [PATCH nf-next v2 1/1] ipv6: defrag: drop non-last frags smaller than min mtu

2018-08-02 Thread Stephen Hemminger
On Fri,  3 Aug 2018 02:05:39 +0200
Florian Westphal  wrote:

> + if (-skb_network_offset(skb) + skb->len < IPV6_MIN_MTU &&

Is there a reason to write the arithmetic expression in this order?
Why not:
if (skb->len - skb_network_offset(skb) < IPV6_MIN_MTU &&



Re: [PATCH v2 net-next 1/3] ip: discard IPv4 datagrams with overlapping segments.

2018-08-02 Thread Stephen Hemminger
On Thu,  2 Aug 2018 23:34:37 +
Peter Oskolkov  wrote:

> This behavior is required in IPv6, and there is little need
> to tolerate overlapping fragments in IPv4. This change
> simplifies the code and eliminates potential DDoS attack vectors.
> 
> Tested: ran ip_defrag selftest (not yet available upstream).
> 
> Suggested-by: David S. Miller 
> Signed-off-by: Peter Oskolkov 
> Signed-off-by: Eric Dumazet 
> Cc: Florian Westphal 

There are a couple of relevant RFCs:

RFC 1858 - Security Considerations for IP Fragment Filtering
RFC 5722 - Handling of Overlapping IPv6 Fragments

Acked-by: Stephen Hemminger 


[PATCH nf-next v2 1/1] ipv6: defrag: drop non-last frags smaller than min mtu

2018-08-02 Thread Florian Westphal
don't bother with pathological cases; they only waste cycles.
IPv6 requires a minimum MTU of 1280 so we should never see fragments
smaller than this (except last frag).

v2: drop IPv4 part, which added same check w. IPV4_MIN_MTU (68).
There were concerns that there could be even smaller frags
generated by intermediate nodes, e.g. on radio networks.

Cc: Peter Oskolkov 
Cc: Eric Dumazet 
Signed-off-by: Florian Westphal 
---
 net/ipv6/netfilter/nf_conntrack_reasm.c | 4 ++++
 net/ipv6/reassembly.c                   | 4 ++++
 2 files changed, 8 insertions(+)

diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 0610bdab721c..c121d534d321 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -557,6 +557,10 @@ int nf_ct_frag6_gather(struct net *net, struct sk_buff *skb, u32 user)
hdr = ipv6_hdr(skb);
fhdr = (struct frag_hdr *)skb_transport_header(skb);
 
+   if (-skb_network_offset(skb) + skb->len < IPV6_MIN_MTU &&
+   fhdr->frag_off & htons(IP6_MF))
+   return -EINVAL;
+
skb_orphan(skb);
fq = fq_find(net, fhdr->identification, user, hdr,
 skb->dev ? skb->dev->ifindex : 0);
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 6edd2ac8ae4b..ff00ada6128f 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -455,6 +455,10 @@ static int ipv6_frag_rcv(struct sk_buff *skb)
return 1;
}
 
+   if (-skb_network_offset(skb) + skb->len < IPV6_MIN_MTU &&
+   fhdr->frag_off & htons(IP6_MF))
+   goto fail_hdr;
+
iif = skb->dev ? skb->dev->ifindex : 0;
fq = fq_find(net, fhdr->identification, hdr, iif);
if (fq) {
-- 
2.16.4



Re: [PATCH net-next 1/3] ip: discard IPv4 datagrams with overlapping segments.

2018-08-02 Thread Stephen Hemminger
On Thu, 2 Aug 2018 16:33:55 -0700
Eric Dumazet  wrote:

> On 08/02/2018 03:54 PM, Stephen Hemminger wrote:
> > On Thu,  2 Aug 2018 22:45:58 +
> > Peter Oskolkov  wrote:
> >   
> >> diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
> >> index e5ebc83827ab..da1a144f1a51 100644
> >> --- a/include/uapi/linux/snmp.h
> >> +++ b/include/uapi/linux/snmp.h
> >> @@ -40,6 +40,7 @@ enum
> >>IPSTATS_MIB_REASMREQDS, /* ReasmReqds */
> >>IPSTATS_MIB_REASMOKS,   /* ReasmOKs */
> >>IPSTATS_MIB_REASMFAILS, /* ReasmFails */
> >> +  IPSTATS_MIB_REASM_OVERLAPS, /* ReasmOverlaps */
> >>IPSTATS_MIB_FRAGOKS,/* FragOKs */
> >>IPSTATS_MIB_FRAGFAILS,  /* FragFails */
> >>IPSTATS_MIB_FRAGCREATES,/* FragCreates */  
> > 
> > Inserting new entries in the middle of an enum means the numeric
> > values will change. Isn't this going to break userspace ABI?
> >   
> 
> I would argue the only things exported from the kernel are the Key:Val
> pairs in the /proc files.
> 
> Not sure why these enum are uapi.
> 
> Commit 46c2fa39877ed70415ee2b1acfb9129e956f6de4 added 
> LINUX_MIB_TCPFASTOPENBLACKHOLE in the middle,
> and really nobody complained.
> 
> 
> Commit 0604475119de5f80dc051a5db055c6a2a75bd542 added 
> LINUX_MIB_TCPMEMORYPRESSURESCHRONO
> in the middle as well.
> 
> I am pretty sure we should maintain locality of these counters to lower the
> number of dirtied cache lines, say under an IP frag DDoS ;)
> 

Agree with Eric; go ahead and keep things local.


Re: [PATCH net-next] inet: defrag: drop non-last frags smaller than min mtu

2018-08-02 Thread Eric Dumazet



On 08/02/2018 04:43 PM, Florian Westphal wrote:
> don't bother with pathological cases; they only waste cycles.
> IPv6 requires a minimum MTU of 1280 so we should never see fragments
> smaller than this (except last frag).
> 
> For IPv4, in practice, we could probably also adopt a higher limit,
> but for now use ipv4 min mtu (68).

...

> + if (-skb_network_offset(skb) + skb->len < IPV4_MIN_MTU &&
> + ip_hdr(skb)->frag_off & htons(IP_MF))
> + goto drop;
> +
>

I am not totally sure this is legit for IPv4.

Some intermediate nodes can try to be smart and could decide to further split 
fragments.

I am pretty sure I have seen this behavior on some radio environments :/

Eventually we could add a sysctl to allow an admin to set the threshold?
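
Something along these lines, perhaps (an untested sketch: the knob name
ipfrag_min_frag_size, the netns field and the wiring are all invented
here for illustration; none of it is in the posted patch):

/* net/ipv4/ip_fragment.c -- hypothetical entry appended to the
 * existing ip4_frags_ns_ctl_table[].
 */
static struct ctl_table ip4_frags_ns_ctl_table[] = {
	/* ... existing ipfrag_high_thresh/ipfrag_low_thresh entries ... */
	{
		.procname	= "ipfrag_min_frag_size",
		.data		= &init_net.ipv4.frags.min_frag_size,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec,
	},
	{ }
};

/* ip_defrag() would then compare against the tunable instead of the
 * IPV4_MIN_MTU constant:
 *
 *	if (skb->len - skb_network_offset(skb) < net->ipv4.frags.min_frag_size &&
 *	    ip_hdr(skb)->frag_off & htons(IP_MF))
 *		goto drop;
 */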


[PATCH net-next] inet: defrag: drop non-last frags smaller than min mtu

2018-08-02 Thread Florian Westphal
don't bother with pathological cases; they only waste cycles.
IPv6 requires a minimum MTU of 1280 so we should never see fragments
smaller than this (except last frag).

For IPv4, in practice, we could probably also adopt a higher limit,
but for now use ipv4 min mtu (68).

Cc: Peter Oskolkov 
Cc: Eric Dumazet 
Signed-off-by: Florian Westphal 
---
 net/ipv4/ip_fragment.c                  | 5 +++++
 net/ipv6/netfilter/nf_conntrack_reasm.c | 4 ++++
 net/ipv6/reassembly.c                   | 4 ++++
 3 files changed, 13 insertions(+)

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 8e9528ebaa8e..19aa10abc6ab 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -605,6 +605,10 @@ int ip_defrag(struct net *net, struct sk_buff *skb, u32 user)
int vif = l3mdev_master_ifindex_rcu(dev);
struct ipq *qp;
 
+   if (-skb_network_offset(skb) + skb->len < IPV4_MIN_MTU &&
+   ip_hdr(skb)->frag_off & htons(IP_MF))
+   goto drop;
+
__IP_INC_STATS(net, IPSTATS_MIB_REASMREQDS);
skb_orphan(skb);
 
@@ -622,6 +626,7 @@ int ip_defrag(struct net *net, struct sk_buff *skb, u32 user)
return ret;
}
 
+drop:
__IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
kfree_skb(skb);
return -ENOMEM;
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 0610bdab721c..c121d534d321 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -557,6 +557,10 @@ int nf_ct_frag6_gather(struct net *net, struct sk_buff *skb, u32 user)
hdr = ipv6_hdr(skb);
fhdr = (struct frag_hdr *)skb_transport_header(skb);
 
+   if (-skb_network_offset(skb) + skb->len < IPV6_MIN_MTU &&
+   fhdr->frag_off & htons(IP6_MF))
+   return -EINVAL;
+
skb_orphan(skb);
fq = fq_find(net, fhdr->identification, user, hdr,
 skb->dev ? skb->dev->ifindex : 0);
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 6edd2ac8ae4b..ff00ada6128f 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -455,6 +455,10 @@ static int ipv6_frag_rcv(struct sk_buff *skb)
return 1;
}
 
+   if (-skb_network_offset(skb) + skb->len < IPV6_MIN_MTU &&
+   fhdr->frag_off & htons(IP6_MF))
+   goto fail_hdr;
+
iif = skb->dev ? skb->dev->ifindex : 0;
fq = fq_find(net, fhdr->identification, hdr, iif);
if (fq) {
-- 
2.16.4



[PATCH v2 net-next 2/3] net: modify skb_rbtree_purge to return the truesize of all purged skbs.

2018-08-02 Thread Peter Oskolkov
Tested: see the next patch in the series.

Suggested-by: Eric Dumazet 
Signed-off-by: Peter Oskolkov 
Signed-off-by: Eric Dumazet 
Cc: Florian Westphal 

---
 include/linux/skbuff.h | 2 +-
 net/core/skbuff.c  | 6 +-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index fd3cb1b247df..47848367c816 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2585,7 +2585,7 @@ static inline void __skb_queue_purge(struct sk_buff_head *list)
kfree_skb(skb);
 }
 
-void skb_rbtree_purge(struct rb_root *root);
+unsigned int skb_rbtree_purge(struct rb_root *root);
 
 void *netdev_alloc_frag(unsigned int fragsz);
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 51b0a9126e12..8d574a88125d 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2858,23 +2858,27 @@ EXPORT_SYMBOL(skb_queue_purge);
 /**
  * skb_rbtree_purge - empty a skb rbtree
  * @root: root of the rbtree to empty
+ * Return value: the sum of truesizes of all purged skbs.
  *
 * Delete all buffers on an &sk_buff rbtree. Each buffer is removed from
  * the list and one reference dropped. This function does not take
  * any lock. Synchronization should be handled by the caller (e.g., TCP
  * out-of-order queue is protected by the socket lock).
  */
-void skb_rbtree_purge(struct rb_root *root)
+unsigned int skb_rbtree_purge(struct rb_root *root)
 {
struct rb_node *p = rb_first(root);
+   unsigned int sum = 0;
 
while (p) {
struct sk_buff *skb = rb_entry(p, struct sk_buff, rbnode);
 
p = rb_next(p);
rb_erase(&skb->rbnode, root);
+   sum += skb->truesize;
kfree_skb(skb);
}
+   return sum;
 }
 
 /**
-- 
2.18.0.597.ga71716f1ad-goog



[PATCH v2 net-next 1/3] ip: discard IPv4 datagrams with overlapping segments.

2018-08-02 Thread Peter Oskolkov
This behavior is required in IPv6, and there is little need
to tolerate overlapping fragments in IPv4. This change
simplifies the code and eliminates potential DDoS attack vectors.

Tested: ran ip_defrag selftest (not yet available upstream).

Suggested-by: David S. Miller 
Signed-off-by: Peter Oskolkov 
Signed-off-by: Eric Dumazet 
Cc: Florian Westphal 

---
 include/uapi/linux/snmp.h |  1 +
 net/ipv4/ip_fragment.c| 75 ++-
 net/ipv4/proc.c   |  1 +
 3 files changed, 21 insertions(+), 56 deletions(-)

diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index e5ebc83827ab..f80135e5feaa 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -56,6 +56,7 @@ enum
IPSTATS_MIB_ECT1PKTS,   /* InECT1Pkts */
IPSTATS_MIB_ECT0PKTS,   /* InECT0Pkts */
IPSTATS_MIB_CEPKTS, /* InCEPkts */
+   IPSTATS_MIB_REASM_OVERLAPS, /* ReasmOverlaps */
__IPSTATS_MIB_MAX
 };
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index d14d741fb05e..960bf5eab59f 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -277,6 +277,7 @@ static int ip_frag_reinit(struct ipq *qp)
 /* Add new segment to existing queue. */
 static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 {
+   struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
struct sk_buff *prev, *next;
struct net_device *dev;
unsigned int fragsize;
@@ -357,65 +358,23 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
}
 
 found:
-   /* We found where to put this one.  Check for overlap with
-* preceding fragment, and, if needed, align things so that
-* any overlaps are eliminated.
+   /* RFC5722, Section 4, amended by Errata ID : 3089
+*  When reassembling an IPv6 datagram, if
+*   one or more its constituent fragments is determined to be an
+*   overlapping fragment, the entire datagram (and any constituent
+*   fragments) MUST be silently discarded.
+*
+* We do the same here for IPv4.
 */
-   if (prev) {
-   int i = (prev->ip_defrag_offset + prev->len) - offset;
 
-   if (i > 0) {
-   offset += i;
-   err = -EINVAL;
-   if (end <= offset)
-   goto err;
-   err = -ENOMEM;
-   if (!pskb_pull(skb, i))
-   goto err;
-   if (skb->ip_summed != CHECKSUM_UNNECESSARY)
-   skb->ip_summed = CHECKSUM_NONE;
-   }
-   }
+   /* Is there an overlap with the previous fragment? */
+   if (prev &&
+   (prev->ip_defrag_offset + prev->len) > offset)
+   goto discard_qp;
 
-   err = -ENOMEM;
-
-   while (next && next->ip_defrag_offset < end) {
-   int i = end - next->ip_defrag_offset; /* overlap is 'i' bytes */
-
-   if (i < next->len) {
-   int delta = -next->truesize;
-
-   /* Eat head of the next overlapped fragment
-* and leave the loop. The next ones cannot overlap.
-*/
-   if (!pskb_pull(next, i))
-   goto err;
-   delta += next->truesize;
-   if (delta)
-   add_frag_mem_limit(qp->q.net, delta);
-   next->ip_defrag_offset += i;
-   qp->q.meat -= i;
-   if (next->ip_summed != CHECKSUM_UNNECESSARY)
-   next->ip_summed = CHECKSUM_NONE;
-   break;
-   } else {
-   struct sk_buff *free_it = next;
-
-   /* Old fragment is completely overridden with
-* new one drop it.
-*/
-   next = next->next;
-
-   if (prev)
-   prev->next = next;
-   else
-   qp->q.fragments = next;
-
-   qp->q.meat -= free_it->len;
-   sub_frag_mem_limit(qp->q.net, free_it->truesize);
-   kfree_skb(free_it);
-   }
-   }
+   /* Is there an overlap with the next fragment? */
+   if (next && next->ip_defrag_offset < end)
+   goto discard_qp;
 
/* Note : skb->ip_defrag_offset and skb->dev share the same location */
dev = skb->dev;
@@ -463,6 +422,10 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
skb_dst_drop(skb);
return -EINPROGRESS;
 
+discard_qp:
+   inet_frag_kill(&qp->q);
+   err = -EINVAL;

[PATCH v2 net-next 0/3] ip: Use rb trees for IP frag queue

2018-08-02 Thread Peter Oskolkov
This patchset
 * changes IPv4 defrag behavior to match that of IPv6: overlapping
   fragments now cause the whole IP datagram to be discarded (suggested
   by David Miller): there are no legitimate use cases for overlapping
   fragments;
 * changes IPv4 defrag queue from a list to a rb tree (suggested
   by Eric Dumazet): this change removes a potential attack vector.

Upcoming patches will contain similar changes for IPv6 frag queue,
as well as a comprehensive IP defrag self-test (temporarily delayed).

Peter Oskolkov (3):
  ip: discard IPv4 datagrams with overlapping segments.
  net: modify skb_rbtree_purge to return the truesize of all purged
skbs.
  ip: use rb trees for IP frag queue.

 include/linux/skbuff.h  |  11 +-
 include/net/inet_frag.h |   3 +-
 include/uapi/linux/snmp.h   |   1 +
 net/core/skbuff.c   |   6 +-
 net/ipv4/inet_fragment.c|  16 +-
 net/ipv4/ip_fragment.c  | 239 +++-
 net/ipv4/proc.c |   1 +
 net/ipv6/netfilter/nf_conntrack_reasm.c |   1 +
 net/ipv6/reassembly.c   |   1 +
 9 files changed, 139 insertions(+), 140 deletions(-)

-- 
2.18.0.597.ga71716f1ad-goog



Re: [PATCH net-next 1/3] ip: discard IPv4 datagrams with overlapping segments.

2018-08-02 Thread Eric Dumazet



On 08/02/2018 03:54 PM, Stephen Hemminger wrote:
> On Thu,  2 Aug 2018 22:45:58 +
> Peter Oskolkov  wrote:
> 
>> diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
>> index e5ebc83827ab..da1a144f1a51 100644
>> --- a/include/uapi/linux/snmp.h
>> +++ b/include/uapi/linux/snmp.h
>> @@ -40,6 +40,7 @@ enum
>>  IPSTATS_MIB_REASMREQDS, /* ReasmReqds */
>>  IPSTATS_MIB_REASMOKS,   /* ReasmOKs */
>>  IPSTATS_MIB_REASMFAILS, /* ReasmFails */
>> +IPSTATS_MIB_REASM_OVERLAPS, /* ReasmOverlaps */
>>  IPSTATS_MIB_FRAGOKS,/* FragOKs */
>>  IPSTATS_MIB_FRAGFAILS,  /* FragFails */
>>  IPSTATS_MIB_FRAGCREATES,/* FragCreates */
> 
> Inserting new entries in the middle of an enum means the numeric
> values will change. Isn't this going to break userspace ABI?
> 

I would argue the only things exported from the kernel are the Key:Val
pairs in the /proc files.

Not sure why these enum are uapi.

Commit 46c2fa39877ed70415ee2b1acfb9129e956f6de4 added 
LINUX_MIB_TCPFASTOPENBLACKHOLE in the middle,
and really nobody complained.


Commit 0604475119de5f80dc051a5db055c6a2a75bd542 added 
LINUX_MIB_TCPMEMORYPRESSURESCHRONO
in the middle as well.

I am pretty sure we should maintain locality of these counters to lower the
number of dirtied cache lines, say under an IP frag DDoS ;)




[PATCH v2 net-next 3/3] ip: use rb trees for IP frag queue.

2018-08-02 Thread Peter Oskolkov
Similar to TCP OOO RX queue, it makes sense to use rb trees to store
IP fragments, so that OOO fragments are inserted faster.

Tested:

- a follow-up patch contains a rather comprehensive ip defrag
  self-test (functional)
- ran neper `udp_stream -c -H  -F 100 -l 300 -T 20`:
netstat --statistics
Ip:
282078937 total packets received
0 forwarded
0 incoming packets discarded
946760 incoming packets delivered
18743456 requests sent out
101 fragments dropped after timeout
282077129 reassemblies required
944952 packets reassembled ok
262734239 packet reassembles failed
   (The numbers/stats above are somewhat better re:
reassemblies vs a kernel without this patchset. More
comprehensive performance testing TBD).

Reported-by: Jann Horn 
Reported-by: Juha-Matti Tilli 
Suggested-by: Eric Dumazet 
Signed-off-by: Peter Oskolkov 
Signed-off-by: Eric Dumazet 
Cc: Florian Westphal 

---
 include/linux/skbuff.h  |   9 +-
 include/net/inet_frag.h |   3 +-
 net/ipv4/inet_fragment.c|  16 ++-
 net/ipv4/ip_fragment.c  | 182 +---
 net/ipv6/netfilter/nf_conntrack_reasm.c |   1 +
 net/ipv6/reassembly.c   |   1 +
 6 files changed, 121 insertions(+), 91 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 47848367c816..7ebdf158a795 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -676,13 +676,16 @@ struct sk_buff {
 * UDP receive path is one user.
 */
unsigned long   dev_scratch;
-   int ip_defrag_offset;
};
};
-   struct rb_node  rbnode; /* used in netem & tcp stack */
+   struct rb_node  rbnode; /* used in netem, ip4 defrag, and tcp stack */
struct list_head    list;
};
-   struct sock *sk;
+
+   union {
+   struct sock *sk;
+   int ip_defrag_offset;
+   };
 
union {
ktime_t tstamp;
diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index f4272a29dc44..b86d14528188 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -75,7 +75,8 @@ struct inet_frag_queue {
struct timer_list   timer;
spinlock_t  lock;
refcount_t  refcnt;
-   struct sk_buff  *fragments;
+   struct sk_buff  *fragments;  /* Used in IPv6. */
+   struct rb_root  rb_fragments; /* Used in IPv4. */
struct sk_buff  *fragments_tail;
ktime_t stamp;
int len;
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index ccd140e4082d..6d258a5669e7 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -137,12 +137,16 @@ void inet_frag_destroy(struct inet_frag_queue *q)
fp = q->fragments;
nf = q->net;
f = nf->f;
-   while (fp) {
-   struct sk_buff *xp = fp->next;
-
-   sum_truesize += fp->truesize;
-   kfree_skb(fp);
-   fp = xp;
+   if (fp) {
+   do {
+   struct sk_buff *xp = fp->next;
+
+   sum_truesize += fp->truesize;
+   kfree_skb(fp);
+   fp = xp;
+   } while (fp);
+   } else {
+   sum_truesize = skb_rbtree_purge(&q->rb_fragments);
}
sum = sum_truesize + f->qsize;
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 960bf5eab59f..ffbf9135fd71 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -136,7 +136,7 @@ static void ip_expire(struct timer_list *t)
 {
struct inet_frag_queue *frag = from_timer(frag, t, timer);
const struct iphdr *iph;
-   struct sk_buff *head;
+   struct sk_buff *head = NULL;
struct net *net;
struct ipq *qp;
int err;
@@ -152,14 +152,31 @@ static void ip_expire(struct timer_list *t)
 
ipq_kill(qp);
__IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
-
-   head = qp->q.fragments;
-
__IP_INC_STATS(net, IPSTATS_MIB_REASMTIMEOUT);
 
-   if (!(qp->q.flags & INET_FRAG_FIRST_IN) || !head)
+   if (!(qp->q.flags & INET_FRAG_FIRST_IN))
goto out;
 
+   /* sk_buff::dev and sk_buff::rbnode are unionized. So we
+* pull the head out of the tree in order to be able to
+* deal with head->dev.
+*/
+   if (qp->q.fragments) {
+   head = qp->q.fragments;
+   qp->q.fragments = head->next;
+   } else {
+   head = skb_rb_first(&qp->q.rb_fragments);
+   if (!head)

Re: [**EXTERNAL**] Re: VRF with enslaved L3 enabled bridge

2018-08-02 Thread D'Souza, Nelson
Hi David,

Turns out the VRF bridge Rx issue is triggered by a docker install.

Docker makes the following sysctl changes:
  net.bridge.bridge-nf-call-arptables = 1
  net.bridge.bridge-nf-call-ip6tables = 1
  net.bridge.bridge-nf-call-iptables = 1 <<< exposes the ipv4 VRF Rx issue 
when a bridge is enslaved to a VRF

which causes packets flowing through all bridges to be subjected to netfilter 
rules. This is required for bridge net filtering when ip forwarding is enabled.

Please refer to 
https://github.com/docker/libnetwork/blob/master/drivers/bridge/setup_bridgenetfiltering.go#L53

Setting net.bridge.bridge-nf-call-iptables = 0 resolves the issue, but is not 
really a viable option given that bridge net filtering is a basic requirement 
in existing docker deployments.

It's not clear to me why this conf setting breaks local Rx delivery for a 
bridge enslaved to a VRF, because these packets would always be sent up by the 
bridge for IP netfilter processing.

This issue is easily reproducible on an Ubuntu 18.04.1 VM. Simply installing 
docker will cause pings running on test-vrf to fail. Clearing the sysctl conf 
restores Rx local delivery.

Thanks,
Nelson

On 7/27/18, 4:29 PM, "D'Souza, Nelson"  wrote:

David,

With Ubuntu 18.04.1 (kernel 4.15.0-29) pings sent out on test-vrf and br0 
are successful.

# uname -rv
4.15.0-29-generic #31-Ubuntu SMP Tue Jul 17 15:39:52 UTC 2018

# ping -c 1 -I test-vrf 172.16.2.2
ping: Warning: source address might be selected on device other than 
test-vrf.
PING 172.16.2.2 (172.16.2.2) from 172.16.1.1 test-vrf: 56(84) bytes of data.
64 bytes from 172.16.2.2: icmp_seq=1 ttl=64 time=0.050 ms

--- 172.16.2.2 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.050/0.050/0.050/0.000 ms

# ping -c 1 -I br0 172.16.2.2
PING 172.16.2.2 (172.16.2.2) from 172.16.1.1 br0: 56(84) bytes of data.
64 bytes from 172.16.2.2: icmp_seq=1 ttl=64 time=0.026 ms

--- 172.16.2.2 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.026/0.026/0.026/0.000 ms

However, with Ubuntu 17.10.1 (kernel 4.13.0-21) only pings on test-vrf are 
successful. Pings on br0 are not successful.
So it seems like there may be a change in versions after 4.13.0-21 that 
causes pings on br0 to pass.

Nelson

On 7/25/18, 5:35 PM, "D'Souza, Nelson"  wrote:

David, 

I tried out the commands on an Ubuntu 17.10.1 VM.
The pings on test-vrf are successful, but the pings on br0 are not 
successful.

# uname -rv  
4.13.0-21-generic #24-Ubuntu SMP Mon Dec 18 17:29:16 UTC 2017

 # lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:Ubuntu 17.10
Release:17.10
Codename:   artful

# ip rule  --> Note: it's missing the l3mdev rule
0:  from all lookup local 
32766:  from all lookup main 
32767:  from all lookup default

Ran the configs from a bash script vrf.sh

 # ./vrf.sh 
+ ip netns add foo
+ ip li add veth1 type veth peer name veth2
+ ip li set veth2 netns foo
+ ip -netns foo li set lo up
+ ip -netns foo li set veth2 up
+ ip -netns foo addr add 172.16.1.2/24 dev veth2
+ ip li add test-vrf type vrf table 123
+ ip li set test-vrf up
+ ip ro add vrf test-vrf unreachable default
+ ip li add br0 type bridge
+ ip li set veth1 master br0
+ ip li set veth1 up
+ ip li set br0 up
+ ip addr add dev br0 172.16.1.1/24
+ ip li set br0 master test-vrf
+ ip -netns foo addr add 172.16.2.2/32 dev lo
+ ip ro add vrf test-vrf 172.16.2.2/32 via 172.16.1.2

# ping -I test-vrf 172.16.2.2 -c 2  <<< successful on test-vrf
ping: Warning: source address might be selected on device other than 
test-vrf.
PING 172.16.2.2 (172.16.2.2) from 172.16.1.1 test-vrf: 56(84) bytes of 
data.
64 bytes from 172.16.2.2: icmp_seq=1 ttl=64 time=0.035 ms
64 bytes from 172.16.2.2: icmp_seq=2 ttl=64 time=0.045 ms

--- 172.16.2.2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1022ms
rtt min/avg/max/mdev = 0.035/0.040/0.045/0.005 ms

#ping -I br0 172.16.2.2 -c 2   <<< fails on br0
PING 172.16.2.2 (172.16.2.2) from 172.16.1.1 br0: 56(84) bytes of data.

--- 172.16.2.2 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1022ms

Please let me know if I should try a different version.

Nelson

On 7/24/18, 9:08 AM, "D'Souza, Nelson"  wrote:

It's 

Re: [PATCH net-next 1/3] ip: discard IPv4 datagrams with overlapping segments.

2018-08-02 Thread Stephen Hemminger
On Thu,  2 Aug 2018 22:45:58 +
Peter Oskolkov  wrote:

> diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
> index e5ebc83827ab..da1a144f1a51 100644
> --- a/include/uapi/linux/snmp.h
> +++ b/include/uapi/linux/snmp.h
> @@ -40,6 +40,7 @@ enum
>   IPSTATS_MIB_REASMREQDS, /* ReasmReqds */
>   IPSTATS_MIB_REASMOKS,   /* ReasmOKs */
>   IPSTATS_MIB_REASMFAILS, /* ReasmFails */
> + IPSTATS_MIB_REASM_OVERLAPS, /* ReasmOverlaps */
>   IPSTATS_MIB_FRAGOKS,/* FragOKs */
>   IPSTATS_MIB_FRAGFAILS,  /* FragFails */
>   IPSTATS_MIB_FRAGCREATES,/* FragCreates */

Inserting new entries in the middle of an enum means the numeric
values will change. Isn't this going to break userspace ABI?
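
Concretely (a toy illustration; the numeric values are made up, only the
shifting effect matters):

/* Inserting a name mid-enum renumbers every later enumerator. */
enum {
	IPSTATS_MIB_REASMFAILS = 7,	/* illustrative value */
	IPSTATS_MIB_REASM_OVERLAPS,	/* takes 8 here ... */
	IPSTATS_MIB_FRAGOKS,		/* ... pushing FragOKs from 8 to 9 */
	IPSTATS_MIB_FRAGFAILS,		/* and FragFails from 9 to 10, etc. */
};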


Re: [pull request][net-next 00/10] Mellanox, mlx5 and devlink updates 2018-07-31

2018-08-02 Thread Jakub Kicinski
On Thu, 2 Aug 2018 18:07:18 +0300, Eran Ben Elisha wrote:
> On 8/2/2018 4:40 AM, David Miller wrote:
> > From: Jakub Kicinski 
> > Date: Wed, 1 Aug 2018 17:00:47 -0700
> >   
> >> On Wed,  1 Aug 2018 14:52:45 -0700, Saeed Mahameed wrote:  
> >>> - According to the discussion outcome, we are keeping the congestion 
> >>> control
> >>>setting as mlx5 device specific for the current HW generation.  
> >>
> >> I still see queuing and marking based on queue level.  You want to add
> >> a Qdisc that will mirror your HW's behaviour to offload, if you really
> >> believe this is not a subset of RED, why not...  But devlink params?  
> > 
> > I totally agree, devlink seems like absolutely to wrong level and set
> > of interfaces to be doing this stuff.
> > 
> > I will not pull these changes in and I probably should have not
> > accepted the DCB changes from the other day and they were sneakily
> > leading up to this crap.
> > 
> > Sorry, please follow Jakub's lead as I think his approach makes much
> > more technical sense than using devlink for this.
> > 
> > Thanks.
> >   
> 
> Hi Dave,
> I would like to re-state that this feature was not meant to be a generic 
> one. This feature was added in order to resolve a HW bug which exists in
> a small portion of our devices. 

Would you mind describing the HW bug in more detail?  To an outside
reviewer it really looks like you're adding a feature.  What are you
working around?  Is the lack of full AQM on the PCIe side of the chip 
considered a bug?

> Those params will be used only on those current HWs and won't be in
> use for our future devices.

I'm glad that is your plan today; however, customers may get used to
the simple interface you're adding now.  This means the API you are
adding is effectively becoming an API other drivers may need to
implement to keep compatibility with someone's proprietary
orchestration.

> During the discussions, several alternatives were offered to be used by 
> various members of the community. These alternatives includes TC and 
> enhancements to PCI configuration tools.
> 
> Regarding the TC, from my perspective, this is not an option as:
> 1) The HW mechanism handles multiple functions and therefore cannot be 
> configured as a regular TC

Could you elaborate?  What are the multiple functions?  You seem to be
adding a knob to enable ECN marking and a knob for choosing between
some predefined slopes.

In what way would your solution not behave like a RED offload?

With TC offload you'd also get a well-defined set of statistics, I
presume right now you're planning on adding a set of ethtool -S
counters?

> 2) No PF + representors modeling can be applied here; this is a 
> MultiHost environment where one host is not aware of the other hosts, 
> and each is running on its own pci/driver. It is a device working mode 
> configuration.

Yes, the multihost part makes it less pleasant.  But this is a problem
we have to tackle separately, at some point.  It's not a center of
attention here.

> 3) The current HW W/A is very limited; maybe it has a similar algorithm 
> to WRED, but it is being used for a much simpler, different use case (pci bus 
> congestion).

No one is requesting full RED offload here..  if someone sets the
parameters you can't support you simply won't offload them.  And ignore
the parameters which only make sense in software terms.  Look at the
docs for mlxsw:

https://github.com/Mellanox/mlxsw/wiki/Queues-Management#offloading-red

It says "not offloaded" in a number of places.

> It cannot be compared to a standard TC capability (RED/WRED), and
> defining it as an offload fully controlled by the user will be a big 
> misuse. 

It's generally preferable to implement a subset of an existing well-defined
API than to create vendor knobs, hence hardly a misuse.

> (for example, drop rate cannot be configured)

I don't know what "configuring drop rate" means in the case of RED..

> Regarding the PCI config tools, there was a consensus that such a tool is 
> not acceptable as it is not a part of the PCI spec.

As I said, this has nothing to do with PCI being the transport.  The
port you're running over could be serial, SPI or anything else.  You
have congestion on a port of a device, that's a networking problem.

> Since module params/sysfs/debugfs/etc. are no longer acceptable, and 
> current drivers still need a way to apply configurations to the 
> device/driver that cannot use standard Linux tools or be shared by other 
> vendors, devlink params were developed (under the assumption that this 
> tool will be helpful for those needs, and those only).
>
> From my perspective, Devlink is the tool to configure the device for 
> handling such unexpected bugs, i.e. "PCIe buffer congestion handling 
> workaround".

Hm.  Are you calling it a bug because you had to work around a silicon
limitation in firmware?  Hm.  I'm very intrigued by the framing :)


[PATCH net-next 3/3] ip: use rb trees for IP frag queue.

2018-08-02 Thread Peter Oskolkov
Similar to TCP OOO RX queue, it makes sense to use rb trees to store
IP fragments, so that OOO fragments are inserted faster.

Tested:

- a follow-up patch contains a rather comprehensive ip defrag
  self-test (functional)
- ran neper `udp_stream -c -H  -F 100 -l 300 -T 20`:
netstat --statistics
Ip:
282078937 total packets received
0 forwarded
0 incoming packets discarded
946760 incoming packets delivered
18743456 requests sent out
101 fragments dropped after timeout
282077129 reassemblies required
944952 packets reassembled ok
262734239 packet reassembles failed
   (The numbers/stats above are somewhat better re:
reassemblies vs a kernel without this patchset. More
comprehensive performance testing TBD).

Reported-by: Jann Horn 
Reported-by: Juha-Matti Tilli 
Suggested-by: Eric Dumazet 
Signed-off-by: Peter Oskolkov 
Signed-off-by: Eric Dumazet 
Cc: Florian Westphal 
---
 include/linux/skbuff.h  |   9 +-
 include/net/inet_frag.h |   3 +-
 net/ipv4/inet_fragment.c|  16 ++-
 net/ipv4/ip_fragment.c  | 182 +---
 net/ipv6/netfilter/nf_conntrack_reasm.c |   1 +
 net/ipv6/reassembly.c   |   1 +
 6 files changed, 121 insertions(+), 91 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 47848367c816..7ebdf158a795 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -676,13 +676,16 @@ struct sk_buff {
 * UDP receive path is one user.
 */
unsigned long   dev_scratch;
-   int ip_defrag_offset;
};
};
-   struct rb_node  rbnode; /* used in netem & tcp stack */
+   struct rb_node  rbnode; /* used in netem, ip4 defrag, and tcp stack */
struct list_head    list;
};
-   struct sock *sk;
+
+   union {
+   struct sock *sk;
+   int ip_defrag_offset;
+   };
 
union {
ktime_t tstamp;
diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index f4272a29dc44..b86d14528188 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -75,7 +75,8 @@ struct inet_frag_queue {
struct timer_list   timer;
spinlock_t  lock;
refcount_t  refcnt;
-   struct sk_buff  *fragments;
+   struct sk_buff  *fragments;  /* Used in IPv6. */
+   struct rb_root  rb_fragments; /* Used in IPv4. */
struct sk_buff  *fragments_tail;
ktime_t stamp;
int len;
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index ccd140e4082d..6d258a5669e7 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -137,12 +137,16 @@ void inet_frag_destroy(struct inet_frag_queue *q)
fp = q->fragments;
nf = q->net;
f = nf->f;
-   while (fp) {
-   struct sk_buff *xp = fp->next;
-
-   sum_truesize += fp->truesize;
-   kfree_skb(fp);
-   fp = xp;
+   if (fp) {
+   do {
+   struct sk_buff *xp = fp->next;
+
+   sum_truesize += fp->truesize;
+   kfree_skb(fp);
+   fp = xp;
+   } while (fp);
+   } else {
+   sum_truesize = skb_rbtree_purge(&q->rb_fragments);
}
sum = sum_truesize + f->qsize;
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 960bf5eab59f..ffbf9135fd71 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -136,7 +136,7 @@ static void ip_expire(struct timer_list *t)
 {
struct inet_frag_queue *frag = from_timer(frag, t, timer);
const struct iphdr *iph;
-   struct sk_buff *head;
+   struct sk_buff *head = NULL;
struct net *net;
struct ipq *qp;
int err;
@@ -152,14 +152,31 @@ static void ip_expire(struct timer_list *t)
 
ipq_kill(qp);
__IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
-
-   head = qp->q.fragments;
-
__IP_INC_STATS(net, IPSTATS_MIB_REASMTIMEOUT);
 
-   if (!(qp->q.flags & INET_FRAG_FIRST_IN) || !head)
+   if (!(qp->q.flags & INET_FRAG_FIRST_IN))
goto out;
 
+   /* sk_buff::dev and sk_buff::rbnode are unionized. So we
+* pull the head out of the tree in order to be able to
+* deal with head->dev.
+*/
+   if (qp->q.fragments) {
+   head = qp->q.fragments;
+   qp->q.fragments = head->next;
+   } else {
+   head = skb_rb_first(&qp->q.rb_fragments);
+   if (!head)
+ 

[PATCH net-next 0/3] ip: Use rb trees for IP frag queue.

2018-08-02 Thread Peter Oskolkov
This patchset
 * changes IPv4 defrag behavior to match that of IPv6: overlapping
   fragments now cause the whole IP datagram to be discarded (suggested
   by David Miller): there are no legitimate use cases for overlapping
   fragments;
 * changes IPv4 defrag queue from a list to a rb tree (suggested
   by Eric Dumazet): this change removes a potential attack vector.

Upcoming patches will contain similar changes for IPv6 frag queue,
as well as a comprehensive IP defrag self-test (temporarily delayed).

Peter Oskolkov (3):
  ip: discard IPv4 datagrams with overlapping segments.
  net: modify skb_rbtree_purge to return the truesize of all purged
skbs.
  ip: use rb trees for IP frag queue.

 include/linux/skbuff.h  |  11 +-
 include/net/inet_frag.h |   3 +-
 include/uapi/linux/snmp.h   |   1 +
 net/core/skbuff.c   |   6 +-
 net/ipv4/inet_fragment.c|  16 +-
 net/ipv4/ip_fragment.c  | 239 +++-
 net/ipv4/proc.c |   1 +
 net/ipv6/netfilter/nf_conntrack_reasm.c |   1 +
 net/ipv6/reassembly.c   |   1 +
 9 files changed, 139 insertions(+), 140 deletions(-)

-- 
2.18.0.597.ga71716f1ad-goog



[PATCH net-next 2/3] net: modify skb_rbtree_purge to return the truesize of all purged skbs.

2018-08-02 Thread Peter Oskolkov
Suggested-by: Eric Dumazet 
Signed-off-by: Peter Oskolkov 
Signed-off-by: Eric Dumazet 
Cc: Florian Westphal 
---
 include/linux/skbuff.h | 2 +-
 net/core/skbuff.c  | 6 +-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index fd3cb1b247df..47848367c816 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2585,7 +2585,7 @@ static inline void __skb_queue_purge(struct sk_buff_head *list)
kfree_skb(skb);
 }
 
-void skb_rbtree_purge(struct rb_root *root);
+unsigned int skb_rbtree_purge(struct rb_root *root);
 
 void *netdev_alloc_frag(unsigned int fragsz);
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 51b0a9126e12..8d574a88125d 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2858,23 +2858,27 @@ EXPORT_SYMBOL(skb_queue_purge);
 /**
  * skb_rbtree_purge - empty a skb rbtree
  * @root: root of the rbtree to empty
+ * Return value: the sum of truesizes of all purged skbs.
  *
 * Delete all buffers on an &sk_buff rbtree. Each buffer is removed from
  * the list and one reference dropped. This function does not take
  * any lock. Synchronization should be handled by the caller (e.g., TCP
  * out-of-order queue is protected by the socket lock).
  */
-void skb_rbtree_purge(struct rb_root *root)
+unsigned int skb_rbtree_purge(struct rb_root *root)
 {
struct rb_node *p = rb_first(root);
+   unsigned int sum = 0;
 
while (p) {
struct sk_buff *skb = rb_entry(p, struct sk_buff, rbnode);
 
p = rb_next(p);
rb_erase(&skb->rbnode, root);
+   sum += skb->truesize;
kfree_skb(skb);
}
+   return sum;
 }
 
 /**
-- 
2.18.0.597.ga71716f1ad-goog



[PATCH net-next 1/3] ip: discard IPv4 datagrams with overlapping segments.

2018-08-02 Thread Peter Oskolkov
This behavior is required in IPv6, and there is little need
to tolerate overlapping fragments in IPv4. This change
simplifies the code and eliminates potential DDoS attack vectors.

Suggested-by: David S. Miller 
Signed-off-by: Peter Oskolkov 
Signed-off-by: Eric Dumazet 
Cc: Florian Westphal 
---
 include/uapi/linux/snmp.h |  1 +
 net/ipv4/ip_fragment.c| 75 ++-
 net/ipv4/proc.c   |  1 +
 3 files changed, 21 insertions(+), 56 deletions(-)

diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index e5ebc83827ab..da1a144f1a51 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -40,6 +40,7 @@ enum
IPSTATS_MIB_REASMREQDS, /* ReasmReqds */
IPSTATS_MIB_REASMOKS,   /* ReasmOKs */
IPSTATS_MIB_REASMFAILS, /* ReasmFails */
+   IPSTATS_MIB_REASM_OVERLAPS, /* ReasmOverlaps */
IPSTATS_MIB_FRAGOKS,/* FragOKs */
IPSTATS_MIB_FRAGFAILS,  /* FragFails */
IPSTATS_MIB_FRAGCREATES,/* FragCreates */
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index d14d741fb05e..960bf5eab59f 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -277,6 +277,7 @@ static int ip_frag_reinit(struct ipq *qp)
 /* Add new segment to existing queue. */
 static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 {
+   struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
struct sk_buff *prev, *next;
struct net_device *dev;
unsigned int fragsize;
@@ -357,65 +358,23 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
}
 
 found:
-   /* We found where to put this one.  Check for overlap with
-* preceding fragment, and, if needed, align things so that
-* any overlaps are eliminated.
+   /* RFC5722, Section 4, amended by Errata ID : 3089
+*  When reassembling an IPv6 datagram, if
+*   one or more its constituent fragments is determined to be an
+*   overlapping fragment, the entire datagram (and any constituent
+*   fragments) MUST be silently discarded.
+*
+* We do the same here for IPv4.
 */
-   if (prev) {
-   int i = (prev->ip_defrag_offset + prev->len) - offset;
 
-   if (i > 0) {
-   offset += i;
-   err = -EINVAL;
-   if (end <= offset)
-   goto err;
-   err = -ENOMEM;
-   if (!pskb_pull(skb, i))
-   goto err;
-   if (skb->ip_summed != CHECKSUM_UNNECESSARY)
-   skb->ip_summed = CHECKSUM_NONE;
-   }
-   }
+   /* Is there an overlap with the previous fragment? */
+   if (prev &&
+   (prev->ip_defrag_offset + prev->len) > offset)
+   goto discard_qp;
 
-   err = -ENOMEM;
-
-   while (next && next->ip_defrag_offset < end) {
-   int i = end - next->ip_defrag_offset; /* overlap is 'i' bytes */
-
-   if (i < next->len) {
-   int delta = -next->truesize;
-
-   /* Eat head of the next overlapped fragment
-* and leave the loop. The next ones cannot overlap.
-*/
-   if (!pskb_pull(next, i))
-   goto err;
-   delta += next->truesize;
-   if (delta)
-   add_frag_mem_limit(qp->q.net, delta);
-   next->ip_defrag_offset += i;
-   qp->q.meat -= i;
-   if (next->ip_summed != CHECKSUM_UNNECESSARY)
-   next->ip_summed = CHECKSUM_NONE;
-   break;
-   } else {
-   struct sk_buff *free_it = next;
-
-   /* Old fragment is completely overridden with
-* new one drop it.
-*/
-   next = next->next;
-
-   if (prev)
-   prev->next = next;
-   else
-   qp->q.fragments = next;
-
-   qp->q.meat -= free_it->len;
-   sub_frag_mem_limit(qp->q.net, free_it->truesize);
-   kfree_skb(free_it);
-   }
-   }
+   /* Is there an overlap with the next fragment? */
+   if (next && next->ip_defrag_offset < end)
+   goto discard_qp;
 
/* Note : skb->ip_defrag_offset and skb->dev share the same location */
dev = skb->dev;
@@ -463,6 +422,10 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)

Re: [PATCH net-next] net/socket: remove duplicated init code

2018-08-02 Thread David Miller
From: Matthieu Baerts 
Date: Thu,  2 Aug 2018 18:14:33 +0200

> This refactoring work was started by David Howells in cdfbabfb2f0c
> (net: Work around lockdep limitation in sockets that use sockets) but
> the exact same day in 581319c58600 (net/socket: use per af lockdep
> classes for sk queues), Paolo Abeni added new classes.
> 
> This reduces the amount of (nearly) duplicated code and eases the
> addition of new socket types.
> 
> Signed-off-by: Matthieu Baerts 

Nice cleanup, applied, thanks.

But:

> DISCLAIMER.
> This email and any files transmitted with it are confidential 
> and intended solely for the use of the individual or entity to whom they 
> are addressed. If you have received this email in error please notify the 
> system manager. This message contains confidential information and is 
> intended only for the individual named. If you are not the named addressee 
> you should not disseminate, distribute or copy this e-mail. Please notify 
> the sender immediately by e-mail if you have received this e-mail by 
> mistake and delete this e-mail from your system. If you are not the 
> intended recipient you are notified that disclosing, copying, distributing 
> or taking any action in reliance on the contents of this information is 
> strictly prohibited.

Please disable this footer for all future postings to this mailing
list; it is absolutely not appropriate.

Thank you.


Re: [PATCH net-next v2 1/3] qed: Add DCBX API - qed_dcbx_get_priority_tc()

2018-08-02 Thread David Miller
From: Denis Bolotin 
Date: Thu, 2 Aug 2018 11:12:49 +0300

> +int qed_dcbx_get_priority_tc(struct qed_hwfn *p_hwfn, u8 pri, u8 *p_tc)

Since the value range of the tc priority is 8-bit unsigned, you don't
need to return it by reference.

Simply return the value straight to the caller as an integer.

If it's negative, it has to be an error code.  Otherwise it is
the value in question.
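
For illustration, the convention being suggested looks like this (sketch only;
the lookup below is a placeholder, not the real qed driver code):

int qed_dcbx_get_priority_tc(struct qed_hwfn *p_hwfn, u8 pri)
{
	if (pri >= 8)				/* placeholder bound check */
		return -EINVAL;

	return hypothetical_tc_lookup(p_hwfn, pri);	/* returns 0..7 */
}

/* caller side: a negative return carries the errno */
int tc = qed_dcbx_get_priority_tc(p_hwfn, pri);
if (tc < 0)
	return tc;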

Thank you.


Re: [pull request][net-next 00/10] Mellanox, mlx5 and devlink updates 2018-07-31

2018-08-02 Thread Petr Machata
David Miller  writes:

> From: Jakub Kicinski 
> Date: Thu, 2 Aug 2018 10:11:12 -0700
>
>> On Thu, 02 Aug 2018 11:29:12 +0300, Petr Machata wrote:
>>> Could you please clarify your remark?
>> 
>> Oh, I think David meant the patches I was objecting to a while ago,
>> which were doing buffer configuration via the DCB API.
>
> Exactly.

Ah, OK then.

Thanks,
Petr


RE: Microsoft Ignite - 2018

2018-08-02 Thread Lisa Parker
Hello,

I hope you're doing well! 

 

I am following up with you on the below since I have not heard back from
you. Please let me know if you would like to get more information on the
same.

Looking forward to hearing from you.

Best Wishes,

Lisa

 

From: Lisa Parker [mailto:lisa.par...@pre-conferencelist.com] 
Sent: 30 July 2018 13:15
To: 'netdev@vger.kernel.org'
Subject: Microsoft Ignite - 2018

 

Hi,

 

I hope you and your team are well prepared for the event Microsoft
Ignite, held on 09/24/2018 - 09/28/2018.

 

I am checking in to see if you would like to drive more traffic to your booth.
If yes, we can help you with the current registered-attendee list with
complete contact information (including emails and phone numbers).

 

Please feel free to get back to me if you are interested in getting more
information on the total count and pricing options.

 

 

Best Wishes,

Lisa Parker

Events Specialist

 

 
If you are no longer interested, please reply back with "unsubscribe" as the
subject line

 

 



Re: [pull request][net-next 00/10] Mellanox, mlx5 and devlink updates 2018-07-31

2018-08-02 Thread David Miller
From: Jakub Kicinski 
Date: Thu, 2 Aug 2018 10:11:12 -0700

> On Thu, 02 Aug 2018 11:29:12 +0300, Petr Machata wrote:
>> Could you please clarify your remark?
> 
> Oh, I think David meant the patches I was objecting to a while ago,
> which were doing buffer configuration via the DCB API.

Exactly.


Re: [PATCH net-next] net: Fix coding style in skb_push()

2018-08-02 Thread David Miller
From: Ganesh Goudar 
Date: Thu,  2 Aug 2018 15:34:52 +0530

> Signed-off-by: Ganesh Goudar 

Applied.


RE: [PATCH net-next] net/tls: Mark the end in scatterlist table

2018-08-02 Thread Vakul Garg



> -Original Message-
> From: Dave Watson [mailto:davejwat...@fb.com]
> Sent: Thursday, August 2, 2018 10:47 PM
> To: Vakul Garg 
> Cc: netdev@vger.kernel.org; bor...@mellanox.com;
> avia...@mellanox.com; da...@davemloft.net
> Subject: Re: [PATCH net-next] net/tls: Mark the end in scatterlist table
> 
> On 08/02/18 05:05 PM, Vakul Garg wrote:
> > In case zerocopy_from_iter() fails, 'end' won't get marked.
> > So fallback path is fine.
> >
> > > Which codepath is calling sg_nents()?
> >
> > While testing my WIP implementation of combined dynamic memory
> > allocation for (aead_req || sgin || sgout || aad || iv), I was getting
> random kernel crashes.
> > To debug it I had inserted sg_nents() in my code. The KASAN then
> > immediately complained that sg_nents() went beyond the allocated
> memory for scatterlist.
> > This led me to find that scatterlist table end has not been marked.
> >
> 
> If this isn't causing KASAN issues for the existing code, it probably makes
> more sense to put in a future series with that WIP work then.

Isn't using an sg table with an unmarked end already bad, and shouldn't it be
fixed? Crypto hardware accelerator drivers could be broken by sg lists with an
unmarked end.
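
For context, the termination bit being discussed is managed with the existing
scatterlist helpers; roughly (illustrative fragment, not the tls_sw.c code):

/* while appending entry i to a table that may still grow: */
sg_unmark_end(&sg[i]);

/* ... and once the final entry n-1 has been filled: */
sg_mark_end(&sg[n - 1]);	/* lets sg_nents() and HW drivers find the end */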


RE: Security enhancement proposal for kernel TLS

2018-08-02 Thread Vakul Garg



> -Original Message-
> From: Dave Watson [mailto:davejwat...@fb.com]
> Sent: Thursday, August 2, 2018 2:17 AM
> To: Vakul Garg 
> Cc: netdev@vger.kernel.org; Peter Doliwa ; Boris
> Pismenny 
> Subject: Re: Security enhancement proposal for kernel TLS
> 
> On 07/31/18 10:45 AM, Vakul Garg wrote:
> > > > IIUC, with the upstream implementation of tls record layer in
> > > > kernel, the decryption of tls FINISHED message happens in kernel.
> > > > Therefore the keys are already being sent to kernel tls socket
> > > > before handshake is
> > > completed.
> > >
> > > This is incorrect.
> >
> > Let us first reach a common ground on this.
> >
> >  The kernel TLS implementation can decrypt only after setting the keys on
> the socket.
> > The TLS message 'finished' (which is encrypted) is received after receiving
> 'CCS'
> > message. After the user space  TLS library receives CCS message, it
> > sets the keys on kernel TLS socket. Therefore, the next message in the
> > socket receive queue which is TLS finished gets decrypted in kernel only.
> >
> > Please refer to following Boris's patch on openssl. The  commit log says:
> > " We choose to set this option at the earliest - just after CCS is 
> > complete".
> 
> I agree that Boris' patch does what you say it does - it sets keys immediately
> after CCS instead of after FINISHED message.  I disagree that the kernel tls
> implementation currently requires that specific ordering, nor do I think that 
> it
> should require that ordering.

The current kernel implementation assumes the record sequence number starts
from '0'. If keys have to be set after the FINISHED message, then the record
sequence number needs to be communicated from the user space TLS stack to the
kernel. IIRC, the sequence number is not part of the interface through which
the key is transferred.



Re: [PATCH net-next] net/tls: Mark the end in scatterlist table

2018-08-02 Thread Dave Watson
On 08/02/18 05:05 PM, Vakul Garg wrote:
> In case zerocopy_from_iter() fails, 'end' won't get marked.
> So fallback path is fine.
> 
> > Which codepath is calling sg_nents()?
> 
> While testing my WIP implementation of combined dynamic memory allocation for 
> (aead_req || sgin || sgout || aad || iv), I was getting random kernel crashes.
> To debug it I had inserted sg_nents() in my code. The KASAN then immediately
> complained that sg_nents() went beyond the allocated memory for scatterlist.
> This led me to find that scatterlist table end has not been marked.
> 

If this isn't causing KASAN issues for the existing code, it probably
makes more sense to put in a future series with that WIP work then.


Re: [pull request][net-next 00/10] Mellanox, mlx5 and devlink updates 2018-07-31

2018-08-02 Thread Jakub Kicinski
On Thu, 02 Aug 2018 11:29:12 +0300, Petr Machata wrote:
> David Miller  writes:
> 
> > From: Jakub Kicinski 
> > Date: Wed, 1 Aug 2018 17:00:47 -0700
> >  
> >> On Wed,  1 Aug 2018 14:52:45 -0700, Saeed Mahameed wrote:  
> >>> - According to the discussion outcome, we are keeping the congestion 
> >>> control
> >>>   setting as mlx5 device specific for the current HW generation.  
> >> 
> >> I still see queuing and marking based on queue level.  You want to add 
> >> a Qdisc that will mirror your HW's behaviour to offload, if you really
> >> believe this is not a subset of RED, why not...  But devlink params?  
> >
> > I totally agree, devlink seems like absolutely to wrong level and set
> > of interfaces to be doing this stuff.
> >
> > I will not pull these changes in and I probably should have not
> > accepted the DCB changes from the other day and they were sneakily
> > leading up to this crap.  
> 
> Are you talking about the recent additions of DCB helpers
> dcb_ieee_getapp_prio_dscp_mask_map() etc.?
> 
> If yes, I can assure there were no sneaky intentions at all. I'm at a
> loss to understand the relation to mlx5 team's decision to use devlink
> for congestion control configuration.
> 
> Could you please clarify your remark?

Oh, I think David meant the patches I was objecting to a while ago,
which were doing buffer configuration via the DCB API.


RE: [PATCH net-next] net/tls: Mark the end in scatterlist table

2018-08-02 Thread Vakul Garg



> -Original Message-
> From: Dave Watson [mailto:davejwat...@fb.com]
> Sent: Thursday, August 2, 2018 10:17 PM
> To: Vakul Garg 
> Cc: netdev@vger.kernel.org; bor...@mellanox.com;
> avia...@mellanox.com; da...@davemloft.net
> Subject: Re: [PATCH net-next] net/tls: Mark the end in scatterlist table
> 
> On 08/02/18 08:43 PM, Vakul Garg wrote:
> > Function zerocopy_from_iter() unmarks the 'end' in input sgtable while
> > adding new entries in it. The last entry in sgtable remained unmarked.
> > This results in KASAN error report on using apis like sg_nents().
> > Before returning, the function needs to mark the 'end' in the last
> > entry it adds.
> >
> > Signed-off-by: Vakul Garg 
> 
> Looks good to me, it looks like the fallback path should unmark the end
> appropriately.
> 
In case zerocopy_from_iter() fails, 'end' won't get marked.
So fallback path is fine.

> Which codepath is calling sg_nents()?

While testing my WIP implementation of combined dynamic memory allocation for 
(aead_req || sgin || sgout || aad || iv), I was getting random kernel crashes.
To debug it I had inserted sg_nents() in my code. The KASAN then immediately
complained that sg_nents() went beyond the allocated memory for scatterlist.
This led me to find that scatterlist table end has not been marked.

> Acked-by: Dave Watson 



Re: [PATCH net-next] net/tls: Always get number of sg entries for skb to be decrypted

2018-08-02 Thread Dave Watson
On 08/02/18 09:50 PM, Vakul Garg wrote:
> Function decrypt_skb() made a bad assumption that number of sg entries
> required for mapping skb to be decrypted would always be less than
> MAX_SKB_FRAGS. The required count of sg entries for skb should always be
> calculated. If they cannot fit in local array sgin_arr[], allocate them
> from the heap irrespective of whether it is the zero-copy case or otherwise. The
> change also benefits the non-zero copy case as we could use sgin_arr[]
> instead of always allocating sg entries from heap.
> 
> Signed-off-by: Vakul Garg 

Looks great, thanks.

Reviewed-by: Dave Watson 


Re: [PATCH net-next] net/tls: Mark the end in scatterlist table

2018-08-02 Thread Dave Watson
On 08/02/18 08:43 PM, Vakul Garg wrote:
> Function zerocopy_from_iter() unmarks the 'end' in input sgtable while
> adding new entries in it. The last entry in sgtable remained unmarked.
> This results in KASAN error report on using apis like sg_nents(). Before
> returning, the function needs to mark the 'end' in the last entry it
> adds.
> 
> Signed-off-by: Vakul Garg 

Looks good to me, it looks like the fallback path should unmark the
end appropriately.

Which codepath is calling sg_nents()? 

Acked-by: Dave Watson 



Re: [Query]: DSA Understanding

2018-08-02 Thread Florian Fainelli



On 08/02/2018 09:05 AM, Andrew Lunn wrote:
>> I don't see any Replies on the PC with tcpdump on the PC
> 
> So try ethtool -S on the PC. Any packets dropped because of errors?
> 
> Try turning off hardware checksums on the switch. ethtool -K.

Also make sure that cpsw is properly sending 64-byte (including FCS)
packets to the switch; some switches will just discard packets that
are runts.
-- 
Florian


[PATCH net-next] net/socket: remove duplicated init code

2018-08-02 Thread Matthieu Baerts
This refactoring work has been started by David Howells in cdfbabfb2f0c
(net: Work around lockdep limitation in sockets that use sockets) but
the exact same day in 581319c58600 (net/socket: use per af lockdep
classes for sk queues), Paolo Abeni added new classes.

This reduces the amount of (nearly) duplicated code and eases the
addition of new socket types.

Signed-off-by: Matthieu Baerts 
---
 net/core/sock.c | 51 +++--
 1 file changed, 3 insertions(+), 48 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 9c6ebbdfebf3..e31233f5ba39 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -250,58 +250,13 @@ static const char *const af_family_kern_clock_key_strings[AF_MAX+1] = {
_sock_locks("k-clock-")
 };
 static const char *const af_family_rlock_key_strings[AF_MAX+1] = {
-  "rlock-AF_UNSPEC", "rlock-AF_UNIX" , "rlock-AF_INET" ,
-  "rlock-AF_AX25"  , "rlock-AF_IPX"  , "rlock-AF_APPLETALK",
-  "rlock-AF_NETROM", "rlock-AF_BRIDGE"   , "rlock-AF_ATMPVC"   ,
-  "rlock-AF_X25"   , "rlock-AF_INET6", "rlock-AF_ROSE" ,
-  "rlock-AF_DECnet", "rlock-AF_NETBEUI"  , "rlock-AF_SECURITY" ,
-  "rlock-AF_KEY"   , "rlock-AF_NETLINK"  , "rlock-AF_PACKET"   ,
-  "rlock-AF_ASH"   , "rlock-AF_ECONET"   , "rlock-AF_ATMSVC"   ,
-  "rlock-AF_RDS"   , "rlock-AF_SNA"  , "rlock-AF_IRDA" ,
-  "rlock-AF_PPPOX" , "rlock-AF_WANPIPE"  , "rlock-AF_LLC"  ,
-  "rlock-27"   , "rlock-28"  , "rlock-AF_CAN"  ,
-  "rlock-AF_TIPC"  , "rlock-AF_BLUETOOTH", "rlock-AF_IUCV" ,
-  "rlock-AF_RXRPC" , "rlock-AF_ISDN" , "rlock-AF_PHONET"   ,
-  "rlock-AF_IEEE802154", "rlock-AF_CAIF" , "rlock-AF_ALG"  ,
-  "rlock-AF_NFC"   , "rlock-AF_VSOCK", "rlock-AF_KCM"  ,
-  "rlock-AF_QIPCRTR", "rlock-AF_SMC" , "rlock-AF_XDP"  ,
-  "rlock-AF_MAX"
+   _sock_locks("rlock-")
 };
 static const char *const af_family_wlock_key_strings[AF_MAX+1] = {
-  "wlock-AF_UNSPEC", "wlock-AF_UNIX" , "wlock-AF_INET" ,
-  "wlock-AF_AX25"  , "wlock-AF_IPX"  , "wlock-AF_APPLETALK",
-  "wlock-AF_NETROM", "wlock-AF_BRIDGE"   , "wlock-AF_ATMPVC"   ,
-  "wlock-AF_X25"   , "wlock-AF_INET6", "wlock-AF_ROSE" ,
-  "wlock-AF_DECnet", "wlock-AF_NETBEUI"  , "wlock-AF_SECURITY" ,
-  "wlock-AF_KEY"   , "wlock-AF_NETLINK"  , "wlock-AF_PACKET"   ,
-  "wlock-AF_ASH"   , "wlock-AF_ECONET"   , "wlock-AF_ATMSVC"   ,
-  "wlock-AF_RDS"   , "wlock-AF_SNA"  , "wlock-AF_IRDA" ,
-  "wlock-AF_PPPOX" , "wlock-AF_WANPIPE"  , "wlock-AF_LLC"  ,
-  "wlock-27"   , "wlock-28"  , "wlock-AF_CAN"  ,
-  "wlock-AF_TIPC"  , "wlock-AF_BLUETOOTH", "wlock-AF_IUCV" ,
-  "wlock-AF_RXRPC" , "wlock-AF_ISDN" , "wlock-AF_PHONET"   ,
-  "wlock-AF_IEEE802154", "wlock-AF_CAIF" , "wlock-AF_ALG"  ,
-  "wlock-AF_NFC"   , "wlock-AF_VSOCK", "wlock-AF_KCM"  ,
-  "wlock-AF_QIPCRTR", "wlock-AF_SMC" , "wlock-AF_XDP"  ,
-  "wlock-AF_MAX"
+   _sock_locks("wlock-")
 };
 static const char *const af_family_elock_key_strings[AF_MAX+1] = {
-  "elock-AF_UNSPEC", "elock-AF_UNIX" , "elock-AF_INET" ,
-  "elock-AF_AX25"  , "elock-AF_IPX"  , "elock-AF_APPLETALK",
-  "elock-AF_NETROM", "elock-AF_BRIDGE"   , "elock-AF_ATMPVC"   ,
-  "elock-AF_X25"   , "elock-AF_INET6", "elock-AF_ROSE" ,
-  "elock-AF_DECnet", "elock-AF_NETBEUI"  , "elock-AF_SECURITY" ,
-  "elock-AF_KEY"   , "elock-AF_NETLINK"  , "elock-AF_PACKET"   ,
-  "elock-AF_ASH"   , "elock-AF_ECONET"   , "elock-AF_ATMSVC"   ,
-  "elock-AF_RDS"   , "elock-AF_SNA"  , "elock-AF_IRDA" ,
-  "elock-AF_PPPOX" , "elock-AF_WANPIPE"  , "elock-AF_LLC"  ,
-  "elock-27"   , "elock-28"  , "elock-AF_CAN"  ,
-  "elock-AF_TIPC"  , "elock-AF_BLUETOOTH", "elock-AF_IUCV" ,
-  "elock-AF_RXRPC" , "elock-AF_ISDN" , "elock-AF_PHONET"   ,
-  "elock-AF_IEEE802154", "elock-AF_CAIF" , "elock-AF_ALG"  ,
-  "elock-AF_NFC"   , "elock-AF_VSOCK", "elock-AF_KCM"  ,
-  "elock-AF_QIPCRTR", "elock-AF_SMC" , "elock-AF_XDP"  ,
-  "elock-AF_MAX"
+   _sock_locks("elock-")
 };
 
 /*
-- 
2.17.1
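
For context, the _sock_locks() helper the patch reuses is already defined
earlier in net/core/sock.c; it expands a prefix into the per-family name
table, roughly (trimmed for brevity):

#define _sock_locks(x)						  \
	x "AF_UNSPEC",	x "AF_UNIX",	x "AF_INET",		  \
	/* ... one entry per address family, in AF_* order ... */ \
	x "AF_QIPCRTR",	x "AF_SMC",	x "AF_XDP",		  \
	x "AF_MAX"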


-- 


DISCLAIMER.
This email and any files transmitted with it are confidential 
and intended solely for the use of the individual or entity to whom they 
are addressed. If you have received this email in error please notify the 
system manager. This message contains confidential information and is 
intended only for the individual named. If you are not the named addressee 
you should not disseminate, distribute or copy this e-mail. Please notify 
the sender immediately by e-mail if you have received this e-mail by 
mistake and delete this e-mail from your system. If you are not the 
intended recipient you are notified that disclosing, copying, distributing 
or taking any action in reliance on the contents of this information is 
strictly prohibited.


[PATCH net] dccp: fix undefined behavior with 'cwnd' shift in ccid2_cwnd_restart()

2018-08-02 Thread Alexey Kodanev
Make sure that the value of "(now - hc->tx_lsndtime) / hc->tx_rto" is
properly limited when shifting 'u32 cwnd' with it, otherwise we can get:

[40850.963623] UBSAN: Undefined behaviour in net/dccp/ccids/ccid2.c:237:7
[40851.043858] shift exponent 67 is too large for 32-bit type 'unsigned int'
[40851.127163] CPU: 3 PID: 15940 Comm: netstress Tainted: GW   E 
4.18.0-rc7.x86_64 #1
...
[40851.377176] Call Trace:
[40851.408503]  dump_stack+0xf1/0x17b
[40851.451331]  ? show_regs_print_info+0x5/0x5
[40851.503555]  ubsan_epilogue+0x9/0x7c
[40851.548363]  __ubsan_handle_shift_out_of_bounds+0x25b/0x2b4
[40851.617109]  ? __ubsan_handle_load_invalid_value+0x18f/0x18f
[40851.686796]  ? xfrm4_output_finish+0x80/0x80
[40851.739827]  ? lock_downgrade+0x6d0/0x6d0
[40851.789744]  ? xfrm4_prepare_output+0x160/0x160
[40851.845912]  ? ip_queue_xmit+0x810/0x1db0
[40851.895845]  ? ccid2_hc_tx_packet_sent+0xd36/0x10a0 [dccp]
[40851.963530]  ccid2_hc_tx_packet_sent+0xd36/0x10a0 [dccp]
[40852.029063]  dccp_xmit_packet+0x1d3/0x720 [dccp]
[40852.086254]  dccp_write_xmit+0x116/0x1d0 [dccp]
[40852.142412]  dccp_sendmsg+0x428/0xb20 [dccp]
[40852.195454]  ? inet_dccp_listen+0x200/0x200 [dccp]
[40852.254833]  ? sched_clock+0x5/0x10
[40852.298508]  ? sched_clock+0x5/0x10
[40852.342194]  ? inet_create+0xdf0/0xdf0
[40852.388988]  sock_sendmsg+0xd9/0x160
...

Fixes: 113ced1f52e5 ("dccp ccid-2: Perform congestion-window validation")
Signed-off-by: Alexey Kodanev 
---
 net/dccp/ccids/ccid2.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/dccp/ccids/ccid2.c b/net/dccp/ccids/ccid2.c
index 2b75df4..2821180 100644
--- a/net/dccp/ccids/ccid2.c
+++ b/net/dccp/ccids/ccid2.c
@@ -234,7 +234,7 @@ static void ccid2_cwnd_restart(struct sock *sk, const u32 now)
 
/* don't reduce cwnd below the initial window (IW) */
restart_cwnd = min(cwnd, iwnd);
-   cwnd >>= (now - hc->tx_lsndtime) / hc->tx_rto;
+   cwnd >>= min((now - hc->tx_lsndtime) / hc->tx_rto, 31U);
hc->tx_cwnd = max(cwnd, restart_cwnd);
 
hc->tx_cwnd_stamp = now;
-- 
1.8.3.1
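
A standalone illustration of the undefined behaviour being fixed (not part of
the patch; plain user-space C):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint32_t cwnd = 40;
	uint32_t idle = 67;	/* e.g. (now - hc->tx_lsndtime) / hc->tx_rto */

	/* cwnd >>= idle;	   UB: shift exponent >= width of the type */
	cwnd >>= idle < 31 ? idle : 31;		/* clamped, always defined */
	printf("%u\n", cwnd);			/* prints 0 */
	return 0;
}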



Re: [Query]: DSA Understanding

2018-08-02 Thread Andrew Lunn
> I don't see any Replies on the PC with tcpdump on the PC

So try ethtool -S on the PC. Any packets dropped because of errors?

Try turning off hardware checksums on the switch. ethtool -K.

Andrew


Re: [PATCH net-next 7/9] net: stmmac: Integrate XGMAC into main driver flow

2018-08-02 Thread Andrew Lunn
> Looks like a plan. So, I will remove the adjust_link speed
> selection for > 1g and the C45 support. I will leave the SS
> selection in dwxgmac2_core_init (patch 2/9) as this makes no
> difference because only 1g will be selected for now. I will also
> clearly report the BW results for the 1g tests in the cover letter.
> 
> Later on I will add support for 10g once shipping arrives.
> 
> Looks okay?

Hi Jose

Yes, this is good.

Thanks
Andrew
 


Re: [PATCH net-next 7/9] net: stmmac: Integrate XGMAC into main driver flow

2018-08-02 Thread Jose Abreu
On 02-08-2018 15:36, Andrew Lunn wrote:
>> Sorry, I made a mistake. Where it reads SGMII in my reply I was
>> referring to XGMII.
> So you have XGMII between the MAC and the PHY. That should support
> 2.5G, 5G and 10G. What I don't know is whether you can also do 10/100/1000
> over XGMII.

According to the databook I can only do 1G/2.5G and 10G.

>
> How are you currently connecting your 1G PHY to the MAC? XGMII is a
> big parallel bus, where as SGMII is a small serial bus.

I will check with the HW team because I've no idea how this is
connected ...

>
> I would say, before this patchset goes anywhere, you need to test
> 10/100/1000/2.5G/10G, with at least one PHY.
>
> Alternatively, take out support for 2.5G/10G and C45, and post patches
> for just 1G and C22. That you can test and you know works. You can
> add the rest later.

Looks like a plan. So, I will remove the adjust_link speed
selection for > 1g and the C45 support. I will leave the SS
selection in dwxgmac2_core_init (patch 2/9) as this makes no
difference because only 1g will be selected for now. I will also
clearly report the BW results for the 1g tests in the cover letter.

Later on I will add support for 10g once shipping arrives.

Looks okay?

Thanks and Best Regards,
Jose Miguel Abreu

>
>   Andrew



Re: UDP packets arriving on wrong sockets

2018-08-02 Thread Eric Dumazet



On 08/02/2018 08:21 AM, Willem de Bruijn wrote:
> On Thu, Aug 2, 2018 at 9:21 AM Eric Dumazet  wrote:
>>
>>
>>
>> On 08/02/2018 02:05 AM, Andrew Cann wrote:
>>> Hi,
>>>
>>> I posted this on stackoverflow yesterday but I'm reposting it here since it 
>>> got
>>> no response. Original post: 
>>> https://stackoverflow.com/questions/51630337/udp-packets-arriving-on-wrong-sockets-on-linux
>>>
>>> I have two UDP sockets bound to the same address and connected to addresses 
>>> A
>>> and B. I have two more UDP sockets bound to A and B and not connected.
>>>
>>> This is what my /proc/net/udp looks like (trimmed for readability):
>>>
>>>   sl  local_address rem_address
>>>  3937: 0100007F:DD9C 0300007F:9910
>>>  3937: 0100007F:DD9C 0200007F:907D
> 
> You have two sockets bound to the same address and port? Is this using
> SO_REUSEPORT?


Yeah, time to add listen() and accept() support for UDP :)




Re: UDP packets arriving on wrong sockets

2018-08-02 Thread Willem de Bruijn
On Thu, Aug 2, 2018 at 9:21 AM Eric Dumazet  wrote:
>
>
>
> On 08/02/2018 02:05 AM, Andrew Cann wrote:
> > Hi,
> >
> > I posted this on stackoverflow yesterday but I'm reposting it here since it 
> > got
> > no response. Original post: 
> > https://stackoverflow.com/questions/51630337/udp-packets-arriving-on-wrong-sockets-on-linux
> >
> > I have two UDP sockets bound to the same address and connected to addresses 
> > A
> > and B. I have two more UDP sockets bound to A and B and not connected.
> >
> > This is what my /proc/net/udp looks like (trimmed for readability):
> >
> >   sl  local_address rem_address
> >  3937: 0100007F:DD9C 0300007F:9910
> >  3937: 0100007F:DD9C 0200007F:907D

You have two sockets bound to the same address and port? Is this using
SO_REUSEPORT?


Re: [pull request][net-next 00/10] Mellanox, mlx5 and devlink updates 2018-07-31

2018-08-02 Thread Eran Ben Elisha




On 8/2/2018 4:40 AM, David Miller wrote:

From: Jakub Kicinski 
Date: Wed, 1 Aug 2018 17:00:47 -0700


On Wed,  1 Aug 2018 14:52:45 -0700, Saeed Mahameed wrote:

- According to the discussion outcome, we are keeping the congestion control
   setting as mlx5 device specific for the current HW generation.


I still see queuing and marking based on queue level.  You want to add
a Qdisc that will mirror your HW's behaviour to offload, if you really
believe this is not a subset of RED, why not...  But devlink params?


I totally agree, devlink seems like absolutely to wrong level and set
of interfaces to be doing this stuff.

I will not pull these changes in and I probably should have not
accepted the DCB changes from the other day and they were sneakily
leading up to this crap.

Sorry, please follow Jakub's lead as I think his approach makes much
more technical sense than using devlink for this.

Thanks.



Hi Dave,
I would like to re-state that this feature was not meant to be a generic
one. This feature was added in order to resolve a HW bug which exists in
a small portion of our devices. Those params will be used only on those
current HWs and won't be used for our future devices.


During the discussions, several alternatives were offered by various
members of the community. These alternatives include TC and
enhancements to PCI configuration tools.


Regarding TC, from my perspective, this is not an option as:
1) The HW mechanism handles multiple functions and therefore cannot be
configured as a regular TC.
2) No PF + representors modeling can be applied here; this is a
MultiHost environment where one host is not aware of the other hosts,
and each is running on its own pci/driver. It is a device working-mode
configuration.
3) The current HW W/A is very limited; maybe it has a similar algorithm
to WRED, but it is being used for a much simpler, different use case (pci
bus congestion). It cannot be compared to a standard TC capability
(RED/WRED), and defining it as an offload fully controlled by the user
would be a big misuse (for example, the drop rate cannot be configured).


Regarding the PCI config tools, there was a consensus that such a tool is
not acceptable as it is not part of the PCI spec.


Since module param/sysfs/debugfs/etc are no longer acceptable, and
current drivers still need a way to do some configuration of the
device/driver which cannot use a standard Linux tool or be shared by other
vendors, devlink params were developed (under the assumption that this
tool will be helpful for those needs, and those only).


From my perspective, Devlink is the tool to configure the device for 
handling such unexpected bugs, i.e. the "PCIe buffer congestion handling 
workaround".


Thanks,
Eran



Re: IT Governance article

2018-08-02 Thread Reciprocity Labs
Hey Claire,

I'd be very interested to learn more.

Cheers,
Dwight

On Wed, Aug 1, 2018 at 1:50 PM, Claire Gagnon 
wrote:

> Hello Dwight,
>
> Thanks for getting in touch!
>
> We’d be happy to consider publishing your content on our site, however, we
> do charge a fee for this.
>
> Please let me know if you’re still interested.
>
> Regards,
>
> *Claire Gagnon*
> *News Editor*
> Agenda Daily
> 
>
>
> On Mon, 30 Jul 2018 at 21:02, Reciprocity Labs <
> reciprocitymarketingoutre...@gmail.com> wrote:
>
>> Greetings,
>>
>> My name is Dwight and I am reaching out to theagendadaily.com
>> on behalf of Reciprocity Labs.
>> We help companies with compliance and risk management.  Recently we've
>> helped educate people on combating fraud (COSO), data-security and the
>> handling of sensitive information (GDPR, HIPAA, etc) for enterprises and IT
>> governance professionals.
>>
>> When I came across http://theagendadaily.com/
>> I decided to reach out. We'd love to contribute an article to
>> theagendadaily.com.  Our aim is to educate your users with content specific to your site!
>>
>> If you have any guidelines for the article or another idea, send them
>> along!
>>
>> I would be happy to set up a time to chat over the phone to answer any
>> questions or further discuss.
>>
>> Thank you and I look forward to working together!
>>
>> Cheers,
>> Dwight
>> reciprocitymarketingoutre...@gmail.com
>>
>>
>


Re: [Query]: DSA Understanding

2018-08-02 Thread Lad, Prabhakar
Hi Andrew,

On Thu, Aug 2, 2018 at 3:45 PM, Andrew Lunn  wrote:
>> I have a PC connected to lan4 (ip = 169.254.126.126) and the PC ip is
>> 169.254.78.251,
>> but when I ping from PC to lan4 I get Destination Host Unreachable,
>> but where as I can see
>> that in the tcpdump log for lan4 it does reply back, but it doesn’t
>> reach the PC. Is there something I am missing
>> here?
>>
>> ~$ tcpdump -i lan4 -v
>> [  661.057166] device lan4 entered promiscuous mode
>> [  661.061814] device eth1 entered promiscuous mode
>
>> 07:40:21.254161 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has
>> VB4-SN tell tango-charlie.local, length 46
>> 07:40:21.254181 ARP, Ethernet (len 6), IPv4 (len 4), Reply
>> VB4-SN is-at c4:f3:12:08:fe:7f (oui Unknown), length 28
>
> Having names here does not help when you gave IP addresses above.
>
Sorry about that. VB4-SN is the switch device and
tango-charlie.local is the PC.

> Am I reading this correctly? The PC is ARPing the switch device. The
> switch device is replying?
>
Yes, the PC is ARPing the switch device, and from the tcpdump on
switch device lan4
I can see that it's trying to send a Reply.

~$ tcpdump -i lan4 -v
[ 1675.526326] device lan4 entered promiscuous mode
[ 1675.531503] device eth1 entered promiscuous mode
tcpdump: listening on lan4, link-type EN10MB (Ethernet), capture size
262144 bytes
07:57:06.133853 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has
VB4-SN tell tango-charlie.local, length 46
07:57:06.133893 ARP, Ethernet (len 6), IPv4 (len 4), Reply
VB4-SN is-at c4:f3:12:08:fe:7f (oui Unknown), length 28
07:57:07.151100 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has
VB4-SN tell tango-charlie.local, length 46
07:57:07.151133 ARP, Ethernet (len 6), IPv4 (len 4), Reply
VB4-SN is-at c4:f3:12:08:fe:7f (oui Unknown), length 28
07:57:07.703063 IP (tos 0x0, ttl 64, id 41824, offset 0, flags [DF],
proto UDP (17), length 224)
tango-charlie.local.netbios-dgm > 169.254.255.255.netbios-dgm: NBT
UDP PACKET(138)
07:57:07.804780 IP6 (flowlabel 0x42aaa, hlim 255, next-header UDP (17)
payload length: 54) VB4-SN.mdns > ff02::fb.mdns: [udp sum o)
07:57:07.804961 IP (tos 0x0, ttl 255, id 13070, offset 0, flags [DF],
proto UDP (17), length 74)
^CVB4-SN.mdns > 224.0.0.251.mdns: 0 PTR (QM)? 255.255.254.169.in-addr.arpa. (46)
[ 1689.046815] device lan4 left promiscuous mode
[ 1689.055592] device eth1 left promiscuous mode

7 packets captured
53 packets received by filter
40 packets dropped by kernel

> What does tcpdump on the PC show? Are the ARP replies getting to it?
> Is the PC dropping the ARP replies?
>
I don't see any Replies on the PC with tcpdump on the PC

> If the PC is dropping the ARP replies, take a look at the
> checksums. Wireshark is good at that.
>
Nor do I see anything in Wireshark.

> If the ARP replies are not making it to the PC, look at the switch
> statistics. ethtool -S lan4. Are the TX counts going up? Any error
> counts going up?
>
Yes, the TX counts are going up without any errors; the log follows:

~$ ethtool -S lan4
NIC statistics:
 tx_packets: 499
 tx_bytes: 37105
 rx_packets: 462
 rx_bytes: 34138
 rx_hi: 0
 rx_undersize: 0
 rx_fragments: 0
 rx_oversize: 0
 rx_jabbers: 0
 rx_symbol_err: 0
 rx_crc_err: 0
 rx_align_err: 0
 rx_mac_ctrl: 0
 rx_pause: 0
 rx_bcast: 402
 rx_mcast: 69
 rx_ucast: 0
 rx_64_or_less: 361
 rx_65_127: 45
 rx_128_255: 34
 rx_256_511: 31
 rx_512_1023: 0
 rx_1024_1522: 0
 rx_1523_2000: 0
 rx_2001: 0
 tx_hi: 0
 tx_late_col: 0
 tx_pause: 0
 tx_bcast: 14
 tx_mcast: 196
 tx_ucast: 0
 tx_deferred: 0
 tx_total_col: 0
 tx_exc_col: 0
 tx_single_col: 0
 tx_mult_col: 0
 rx_total: 43248
 tx_total: 33759
 rx_discards: 0
 tx_discards: 0

Cheers,
--Prabhakar


Re: [Query]: DSA Understanding

2018-08-02 Thread Andrew Lunn
> I have a PC connected to lan4 (ip = 169.254.126.126) and the PC ip is
> 169.254.78.251,
> but when I ping from PC to lan4 I get Destination Host Unreachable,
> but where as I can see
> that in the tcpdump log for lan4 it does reply back, but it doesn’t
> reach the PC. Is there something I am missing
> here?
> 
> ~$ tcpdump -i lan4 -v
> [  661.057166] device lan4 entered promiscuous mode
> [  661.061814] device eth1 entered promiscuous mode

> 07:40:21.254161 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has
> VB4-SN tell tango-charlie.local, length 46
> 07:40:21.254181 ARP, Ethernet (len 6), IPv4 (len 4), Reply
> VB4-SN is-at c4:f3:12:08:fe:7f (oui Unknown), length 28

Having names here does not help when you gave IP addresses above.

Am I reading this correctly? The PC is ARPing the switch device. The
switch device is replying?

What does tcpdump on the PC show? Are the ARP replies getting to it?
Is the PC dropping the ARP replies?

If the PC is dropping the ARP replies, take a look at the
checksums. Wireshark is good at that.

If the ARP replies are not making it to the PC, look at the switch
statistics. ethtool -S lan4. Are the TX counts going up? Any error
counts going up?

   Andrew


Re: [PATCH net-next 7/9] net: stmmac: Integrate XGMAC into main driver flow

2018-08-02 Thread Andrew Lunn
> Sorry, I made a mistake. Where it reads SGMII in my reply I was
> referring to XGMII.

So you have XGMII between the MAC and the PHY. That should support
2.5G, 5G and 10G. What I don't know is whether you can also do 10/100/1000
over XGMII.

How are you currently connecting your 1G PHY to the MAC? XGMII is a
big parallel bus, where as SGMII is a small serial bus.

I would say, before this patchset goes anywhere, you need to test
10/100/1000/2.5G/10G, with at least one PHY.

Alternatively, take out support for 2.5G/10G and C45, and post patches
for just 1G and C22. That you can test and you know works. You can
add the rest later.

Andrew


Re: [PATCH net-next 7/9] net: stmmac: Integrate XGMAC into main driver flow

2018-08-02 Thread Jose Abreu
Hi Andrew,

On 02-08-2018 15:03, Andrew Lunn wrote:
> On Thu, Aug 02, 2018 at 09:26:28AM +0100, Jose Abreu wrote:
>> Hi Andrew,
>>
>> Thanks for the review!
>>
>> On 01-08-2018 16:23, Andrew Lunn wrote:
 @@ -842,6 +863,12 @@ static void stmmac_adjust_link(struct net_device *dev)
new_state = true;
ctrl &= ~priv->hw->link.speed_mask;
switch (phydev->speed) {
 +  case SPEED_10000:
 +  ctrl |= priv->hw->link.speed10000;
 +  break;
 +  case SPEED_2500:
 +  ctrl |= priv->hw->link.speed2500;
 +  break;
case SPEED_1000:
ctrl |= priv->hw->link.speed1000;
break;
>>> Hi Jose
>>>
>>> What PHY did you test this with?
>> We had some shipping issues with the 10G phy so right now I'm
>> using a 1G phy ...
> Please add that to the commit message. It is useful for people to know
> this is untested above 1G, and so probably broken

Ok, will do.

>
>> I would expect that as MDIO is used in both
>> phys then phylib would take care of everything as long as I
>> specify in the DT the right interface (SGMII) ... Am I making a
>> wrong assumption?
>>
>>> 10G phys change the interface mode when the speed change. In general,
>>> 10/100/1000G copper uses SGMII. A 1G SFP optical module generally
>>> wants 1000Base-X. 2.5G wants 2500Base-X, 10G copper wants 10GKR, etc.
>>>
>>> So your adjust link callback needs to look at phydev->interface and
>>> reconfigure the MAC as requested.
>> Sorry, I'm not a phy expert but as long as I use MDIO shouldn't
>> this be transparent to MAC? I mean, there are no registers about
>> the interface to use in XGMAC2, there is only this speed
>> selection register that its implemented already in the
>> stmmac_adjust_link.
> MDIO is the control plane used to manage the PHY. But here we are
> talking about the data plane. As i said, the link between the MAC and
> PHY will need to change depending on what the PHY is doing. SGMII will
> work for 10/100/1000, but nothing above that. 

Sorry, I made a mistake. Where it reads SGMII in my reply I was
referring to XGMII.

> It could be this speed
> register also changes the SERDES configuration, but you really should
> confirm this and find out exactly what it is doing. There can be
> multiple ways of doing one speed, e.g. SGMII at 1G. So if the PHY
> wants you to do 1000Base-X and the MAC can only do SGMII, you need to
> be raising an error. phylink makes this simpler. It asks the MAC driver
> for all the modes it supports. It will then not ask the MAC to swap to
> something it does not support.

Ok. XGMII support is optional in the MAC so I will need to add a
check for that.

Thanks and Best Regards,
Jose Miguel Abreu

>
> I suggest you get the datasheet for the PHY you are expecting to get,
> once shipping is fixed. See what it says about its MAC side interface.
> You can also look at the Marvell 10G driver, e.g.
> mv3310_update_interface().
>
>   Andrew



Re: [PATCH net-next 5/9] net: stmmac: Add MDIO related functions for XGMAC2

2018-08-02 Thread Andrew Lunn
On Thu, Aug 02, 2018 at 09:36:41AM +0100, Jose Abreu wrote:
> Hi Andrew,
> 
> On 01-08-2018 16:08, Andrew Lunn wrote:
> > Hi Jose
> >
> >> +static int stmmac_xgmac2_mdio_read(struct stmmac_priv *priv, int phyaddr,
> >> + int phyreg)
> >> +{
> >> +  unsigned int mii_address = priv->hw->mii.addr;
> >> +  unsigned int mii_data = priv->hw->mii.data;
> >> +  u32 tmp, addr, value = MII_XGMAC_BUSY;
> >> +  int data;
> >> +
> >> +  if (phyreg & MII_ADDR_C45) {
> >> +  addr = ((phyreg >> 16) & 0x1f) << 21;
> >> +  addr |= (phyaddr << 16) | (phyreg & 0xffff);
> > Do you need to tell the hardware this is a C45 transfer? Normally an
> > extra bit needs setting somewhere.
> 
> The organization of addr reg is the following:
> DA [25:21] | PA [20:16] | RA [15:0]
> 
> DA is Device Address, PA is Port Address and RA is Register Address.

Hi Jose

Please read my question. That is not what I asked.

> >
> >> +  } else {
> >> +  if (phyaddr >= 4)
> >> +  return -ENODEV;
> > Can the MDIO bus be external? If so, is there a reason why there
> > cannot be a PHY at addresses > 4. So maybe there is an Ethernet
> > switch, which needs lots of addresses? And C45 can have devices > 4
> > but C22 cannot?
> 
> Only ports 0 to 3 can be configured as C22 ports, according to
> the databook.

Please add a comment about this. And EOPNOTSUP might be a better error
code.
 
> >> +  value |= MII_XGMAC_READ;
> >> +
> >> +  if (readl_poll_timeout(priv->ioaddr + mii_data, tmp,
> >> + !(tmp & MII_XGMAC_BUSY), 100, 10000))
> >> +  return -EBUSY;
> >> +
> >> +  writel(addr, priv->ioaddr + mii_address);
> >> +  writel(value, priv->ioaddr + mii_data);
> >> +
> >> +  if (readl_poll_timeout(priv->ioaddr + mii_data, tmp,
> >> + !(tmp & MII_XGMAC_BUSY), 100, 10000))
> >> +  return -EBUSY;
> >> +
> >> +  /* Read the data from the MII data register */
> >> +  data = (int)readl(priv->ioaddr + mii_data) & GENMASK(15, 0);
> > Is the cast needed? And why use GENMASK here, but not in all the other
> > places you have masks in this code?
> 
> The GENMASK is needed, notice how we set more values into
> mii_data (clk_csr, bit(18), cmd), this is not cleared by XGMAC2
> upon the completion of the operation ...

Again, please read my question. That is not what I asked.

Has C45 been tested? Does the 1G PHY you have support C45 transfers?

   Andrew
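
For reference, the DA/PA/RA packing described in the reply can be written as
(illustrative macros using kernel types, not taken from the driver):

#define XGMAC_DA(dev)	(((u32)(dev) & 0x1f) << 21)	/* device address   */
#define XGMAC_PA(port)	(((u32)(port) & 0x1f) << 16)	/* port address     */
#define XGMAC_RA(reg)	((u32)(reg) & 0xffff)		/* register address */

#define XGMAC_C45_ADDR(dev, port, reg) \
	(XGMAC_DA(dev) | XGMAC_PA(port) | XGMAC_RA(reg))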


Re: [PATCH v7 bpf-next 05/10] veth: Handle xdp_frames in xdp napi ring

2018-08-02 Thread Toshiaki Makita

On 18/08/02 (Thu) 22:53, Jesper Dangaard Brouer wrote:

On Thu, 2 Aug 2018 22:17:53 +0900
Toshiaki Makita  wrote:


On 18/08/02 (Thu) 20:45, Jesper Dangaard Brouer wrote:

On Thu,  2 Aug 2018 19:55:09 +0900
Toshiaki Makita  wrote:
   

+   headroom = frame->data - delta - (void *)frame;


Your calculation of headroom is still adding an assumption that
xdp_frame is located in the top of data area, that is unnecessary.

The headroom can be calculated as:

   headroom = sizeof(struct xdp_frame) + frame->headroom - delta;


Thanks. But I'm not sure I get what you are requesting.


I'm simply requesting you do not use the (void *)frame pointer address,
to calculate the headroom, as it can be calculated in another way.


I don't see difference, but ok I can change this calculation as you 
prefer a different way.



Supposing xdp_frame is not located in the top of data area, what ensures
that additional sizeof(struct xdp_frame) can be used?


The calculation in convert_to_xdp_frame() assures this.  If we later
add an xdp_frame that is not located in the top of data area, and want
to change the reserved headroom size, then we deal with it, and update
the code.


I just thought you are requesting the change so that we don't need to 
change this code even when convert_to_xdp_frame() is changed. Now I see 
my guess was wrong.


Will send v8.

Thanks,
Toshiaki Makita


Re: [PATCH net-next 7/9] net: stmmac: Integrate XGMAC into main driver flow

2018-08-02 Thread Andrew Lunn
On Thu, Aug 02, 2018 at 09:26:28AM +0100, Jose Abreu wrote:
> Hi Andrew,
> 
> Thanks for the review!
> 
> On 01-08-2018 16:23, Andrew Lunn wrote:
> >> @@ -842,6 +863,12 @@ static void stmmac_adjust_link(struct net_device *dev)
> >>new_state = true;
> >>ctrl &= ~priv->hw->link.speed_mask;
> >>switch (phydev->speed) {
> >> +  case SPEED_10000:
> >> +  ctrl |= priv->hw->link.speed10000;
> >> +  break;
> >> +  case SPEED_2500:
> >> +  ctrl |= priv->hw->link.speed2500;
> >> +  break;
> >>case SPEED_1000:
> >>ctrl |= priv->hw->link.speed1000;
> >>break;
> > Hi Jose
> >
> > What PHY did you test this with?
> 
> We had some shipping issues with the 10G phy so right now I'm
> using a 1G phy ...

Please add that to the commit message. It is useful for people to know
this is untested above 1G, and so probably broken

> I would expect that as MDIO is used in both
> phys then phylib would take care of everything as long as I
> specify in the DT the right interface (SGMII) ... Am I making a
> wrong assumption?
> 
> >
> > 10G phys change the interface mode when the speed change. In general,
> > 10/100/1000G copper uses SGMII. A 1G SFP optical module generally
> > wants 1000Base-X. 2.5G wants 2500Base-X, 10G copper wants 10GKR, etc.
> >
> > So your adjust link callback needs to look at phydev->interface and
> > reconfigure the MAC as requested.
> 
> Sorry, I'm not a phy expert but as long as I use MDIO shouldn't
> this be transparent to MAC? I mean, there are no registers about
> the interface to use in XGMAC2, there is only this speed
> selection register that its implemented already in the
> stmmac_adjust_link.

MDIO is the control plane used to manage the PHY. But here we are
talking about the data plane. As i said, the link between the MAC and
PHY will need to change depending on what the PHY is doing. SGMII will
work for 10/100/1000, but nothing above that. It could be this speed
register also changes the SERDES configuration, but you really should
confirm this and find out exactly what it is doing. There can be
multiple ways of doing one speed, e.g. SGMII at 1G. So if the PHY
wants you to do 1000Base-X and the MAC can only do SGMII, you need to
be raising an error. phylink makes this simpler. It asks the MAC driver
for all the modes it supports. It will then not ask the MAC to swap to
something it does not support.

I suggest you get the datasheet for the PHY you are expecting to get,
once shipping is fixed. See what it says about its MAC side interface.
You can also look at the Marvell 10G driver, e.g.
mv3310_update_interface().

Andrew
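
For illustration, the phylink hook Andrew refers to has roughly this shape
(a sketch assuming an XGMII-only MAC; the function name is made up and the
exact mode list would come from the databook):

static void foo_validate(struct net_device *ndev, unsigned long *supported,
			 struct phylink_link_state *state)
{
	__ETHTOOL_DECLARE_LINK_MODE_MASK(mask) = { 0, };

	/* advertise only what this MAC can really drive */
	if (state->interface == PHY_INTERFACE_MODE_XGMII) {
		phylink_set(mask, 1000baseT_Full);
		phylink_set(mask, 2500baseX_Full);
		phylink_set(mask, 10000baseKR_Full);
	}

	bitmap_and(supported, supported, mask, __ETHTOOL_LINK_MODE_MASK_NBITS);
	bitmap_and(state->advertising, state->advertising, mask,
		   __ETHTOOL_LINK_MODE_MASK_NBITS);
}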


Re: [PATCH v7 bpf-next 05/10] veth: Handle xdp_frames in xdp napi ring

2018-08-02 Thread Jesper Dangaard Brouer
On Thu, 2 Aug 2018 22:17:53 +0900
Toshiaki Makita  wrote:

> On 18/08/02 (Thu) 20:45, Jesper Dangaard Brouer wrote:
> > On Thu,  2 Aug 2018 19:55:09 +0900
> > Toshiaki Makita  wrote:
> >   
> >> +  headroom = frame->data - delta - (void *)frame;  
> > 
> > Your calculation of headroom is still adding an assumption that
> > xdp_frame is located in the top of data area, that is unnecessary.
> > 
> > The headroom can be calculated as:
> > 
> >   headroom = sizeof(struct xdp_frame) + frame->headroom - delta;  
> 
> Thanks. But I'm not sure I get what you are requesting.

I'm simply requesting you do not use the (void *)frame pointer address,
to calculate the headroom, as it can be calculated in another way.

> Supposing xdp_frame is not located in the top of data area, what ensures 
> that additional sizeof(struct xdp_frame) can be used?

The calculation in convert_to_xdp_frame() assures this.  If we later
add an xdp_frame that is not located in the top of data area, and want
to change the reserved headroom size, then we deal with it, and update
the code.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: UDP packets arriving on wrong sockets

2018-08-02 Thread Eric Dumazet



On 08/02/2018 06:20 AM, Eric Dumazet wrote:
> 
> Ideally you could give us a C reproducer, so that we can run it ourselves and 
> fix the kernel bug if there is one.
> 
> This C reproducer could be part of an official patch, adding a test in 
> tools/testing/selftests/net

Alternatively a test in Python would be accepted ;)



Re: UDP packets arriving on wrong sockets

2018-08-02 Thread Eric Dumazet



On 08/02/2018 02:05 AM, Andrew Cann wrote:
> Hi, 
> 
> I posted this on stackoverflow yesterday but I'm reposting it here since it 
> got
> no response. Original post: 
> https://stackoverflow.com/questions/51630337/udp-packets-arriving-on-wrong-sockets-on-linux
> 
> I have two UDP sockets bound to the same address and connected to addresses A
> and B. I have two more UDP sockets bound to A and B and not connected.
> 
> This is what my /proc/net/udp looks like (trimmed for readability):
> 
>   sl  local_address rem_address
>  3937: 0100007F:DD9C 0300007F:9910
>  3937: 0100007F:DD9C 0200007F:907D
> 16962: 0200007F:907D 00000000:0000
> 19157: 0300007F:9910 00000000:0000
> 
> According to connect(2): "If the socket sockfd is of type SOCK_DGRAM, then 
> addr
> is the address to which datagrams are sent by default, *and the only address
> from which datagrams are received*."
> 
> For some reason, my connected sockets are receiving packets that were destined
> for each other. eg: The UDP socket connected to A sends a message to A, A then
> sends a reply back. The UDP socket connected to B sends a message to B, B then
> sends a reply back. But the reply from A arrives at the socket connected to B
> and the reply from B arrives at the socket connected to A.
> 
> Why on earth would this be happening? Note that it happens randomly - 
> sometimes
> the replies arrive at the correct sockets and sometimes they don't. Is there
> any way to prevent this or any situation under which connect() is supposed to
> not work?
> 
> Any help explaining this would be hugely appreciated :)

Hi Andrew

Well, you should first give much more detail, as there are thousands of 
different UDP stacks out there.

Documentation/admin-guide/reporting-bugs.rst

...
[4.1.] Kernel version (from /proc/version): 
...

Ideally you could give us a C reproducer, so that we can run it ourselves and 
fix the kernel bug if there is one.

This C reproducer could be part of an official patch, adding a test in 
tools/testing/selftests/net

Thanks !
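
Along the lines of the reproducer being requested, a minimal user-space
sketch of the reported scenario (untested; loopback and port numbers are
arbitrary):

/* Two UDP sockets share 127.0.0.1:5000 via SO_REUSEPORT and are
 * connect()ed to peers A and B; each peer echoes a reply.  With correct
 * demux, each reply lands on the socket connected to that peer. */
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/socket.h>

static int udp_bound(uint16_t port, int reuse)
{
	struct sockaddr_in sa = {
		.sin_family = AF_INET,
		.sin_port = htons(port),
		.sin_addr.s_addr = htonl(INADDR_LOOPBACK),
	};
	int one = 1, fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (reuse)
		setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
	if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0)
		perror("bind");
	return fd;
}

static void udp_connect(int fd, uint16_t port)
{
	struct sockaddr_in sa = {
		.sin_family = AF_INET,
		.sin_port = htons(port),
		.sin_addr.s_addr = htonl(INADDR_LOOPBACK),
	};

	connect(fd, (struct sockaddr *)&sa, sizeof(sa));
}

int main(void)
{
	int to_a = udp_bound(5000, 1), to_b = udp_bound(5000, 1);
	int a = udp_bound(5001, 0), b = udp_bound(5002, 0);
	struct sockaddr_in peer;
	socklen_t plen = sizeof(peer);
	char buf[16];

	udp_connect(to_a, 5001);
	udp_connect(to_b, 5002);
	send(to_a, "ping-a", 6, 0);
	send(to_b, "ping-b", 6, 0);

	/* peers A and B echo to whoever sent the datagram */
	recvfrom(a, buf, sizeof(buf), 0, (struct sockaddr *)&peer, &plen);
	sendto(a, "pong-a", 6, 0, (struct sockaddr *)&peer, plen);
	plen = sizeof(peer);
	recvfrom(b, buf, sizeof(buf), 0, (struct sockaddr *)&peer, &plen);
	sendto(b, "pong-b", 6, 0, (struct sockaddr *)&peer, plen);

	/* a misdelivery shows up as crossed pongs (or a hang) here */
	recv(to_a, buf, sizeof(buf), 0);
	printf("to_a got %.6s (expect pong-a)\n", buf);
	recv(to_b, buf, sizeof(buf), 0);
	printf("to_b got %.6s (expect pong-b)\n", buf);
	return 0;
}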


Re: [PATCH v7 bpf-next 05/10] veth: Handle xdp_frames in xdp napi ring

2018-08-02 Thread Toshiaki Makita

On 18/08/02 (Thu) 20:45, Jesper Dangaard Brouer wrote:

On Thu,  2 Aug 2018 19:55:09 +0900
Toshiaki Makita  wrote:


+   headroom = frame->data - delta - (void *)frame;


Your calculation of headroom is still adding an assumption that
xdp_frame is located in the top of data area, that is unnecessary.

The headroom can be calculated as:

  headroom = sizeof(struct xdp_frame) + frame->headroom - delta;


Thanks. But I'm not sure I get what you are requesting.
Supposing xdp_frame is not located in the top of data area, what ensures 
that additional sizeof(struct xdp_frame) can be used?


Toshiaki Makita


Re: [bug report] net/mlx5e: Gather all XDP pre-requisite checks in a single function

2018-08-02 Thread Leon Romanovsky
+netdev and Saeed

On Thu, Aug 02, 2018 at 11:54:49AM +0300, Dan Carpenter wrote:
> Hello Tariq Toukan,
>
> The patch 0ec13877ce95: "net/mlx5e: Gather all XDP pre-requisite
> checks in a single function" from Mar 12, 2018, leads to the
> following Smatch warning:
>
>   drivers/net/ethernet/mellanox/mlx5/core/en_main.c:4284 mlx5e_xdp_set()
>   error: uninitialized symbol 'err'.
>
> drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>   4214  static int mlx5e_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
>   4215  {
>   4216  struct mlx5e_priv *priv = netdev_priv(netdev);
>   4217  struct bpf_prog *old_prog;
>   4218  bool reset, was_opened;
>   4219  int err;
> ^^^
> I always encourage people to remove unneeded initializers so that GCC
> can detect uninitialized variables, but it turns out that GCC misses a
> bunch of bugs?
>
>   4220  int i;
>   4221
>   4222  mutex_lock(&priv->state_lock);
>   4223
>   4224  if (prog) {
>   4225  err = mlx5e_xdp_allowed(priv, prog);
>   4226  if (err)
>   4227  goto unlock;
>   4228  }
>   4229
>   4230  was_opened = test_bit(MLX5E_STATE_OPENED, &priv->state);
>   4231  /* no need for full reset when exchanging programs */
>   4232  reset = (!priv->channels.params.xdp_prog || !prog);
>   4233
>   4234  if (was_opened && reset)
>   4235  mlx5e_close_locked(netdev);
>   4236  if (was_opened && !reset) {
>   4237  /* num_channels is invariant here, so we can take the
>   4238   * batched reference right upfront.
>   4239   */
>   4240  prog = bpf_prog_add(prog, priv->channels.num);
>   4241  if (IS_ERR(prog)) {
>   4242  err = PTR_ERR(prog);
>   4243  goto unlock;
>   4244  }
>   4245  }
>   4246
>   4247  /* exchange programs, extra prog reference we got from caller
>   4248   * as long as we don't fail from this point onwards.
>   4249   */
>   4250  old_prog = xchg(&priv->channels.params.xdp_prog, prog);
>   4251  if (old_prog)
>   4252  bpf_prog_put(old_prog);
>   4253
>   4254  if (reset) /* change RQ type according to priv->xdp_prog */
>   4255  mlx5e_set_rq_type(priv->mdev, &priv->channels.params);
>   4256
>   4257  if (was_opened && reset)
>   4258  mlx5e_open_locked(netdev);
>   4259
>   4260  if (!test_bit(MLX5E_STATE_OPENED, &priv->state) || reset)
>   4261  goto unlock;
> ^^^
> In the original code this was a success path.
>
>   4262
>   4263  /* exchanging programs w/o reset, we update ref counts on behalf
>   4264   * of the channels RQs here.
>   4265   */
>   4266  for (i = 0; i < priv->channels.num; i++) {
>   4267  struct mlx5e_channel *c = priv->channels.c[i];
>   4268
>   4269  clear_bit(MLX5E_RQ_STATE_ENABLED, &c->rq.state);
>   4270  napi_synchronize(&c->napi);
>   4271  /* prevent mlx5e_poll_rx_cq from accessing rq->xdp_prog */
>   4272
>   4273  old_prog = xchg(&c->rq.xdp_prog, prog);
>   4274
>   4275  set_bit(MLX5E_RQ_STATE_ENABLED, &c->rq.state);
>   4276  /* napi_schedule in case we have missed anything */
>   4277  napi_schedule(&c->napi);
>   4278
>   4279  if (old_prog)
>   4280  bpf_prog_put(old_prog);
>   4281  }
>   4282
>   4283  unlock:
>   4284  mutex_unlock(&priv->state_lock);
>   4285  return err;
> ^^
>   4286  }
>
> regards,
> dan carpenter
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
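
One minimal way to address the warning (sketch only, not a submitted fix)
would be to make the early exit report success explicitly, so 'err' is
assigned on every path that reaches 'unlock':

	if (!test_bit(MLX5E_STATE_OPENED, &priv->state) || reset) {
		err = 0;	/* nothing more to do: report success */
		goto unlock;
	}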


signature.asc
Description: PGP signature


[PATCH] ipv6: icmp: Updating pmtu for link local route

2018-08-02 Thread Georg Kohmann
When an ICMPV6_PKT_TOOBIG is received from a link local address, the pmtu will
be updated on a route with an arbitrary interface index. Subsequent packets
sent back to the same link local address may therefore end up not
considering the updated pmtu.

Current behavior breaks TAHI v6LC4.1.4 Reduce PMTU On-link. Referring to RFC
1981: Section 3: "Note that Path MTU Discovery must be performed even in
cases where a node "thinks" a destination is attached to the same link as
itself. In a situation such as when a neighboring router acts as proxy [ND]
for some destination, the destination can appear to be directly
connected but is in fact more than one hop away."

Use the interface index from the incoming ICMPV6_PKT_TOOBIG when updating
the pmtu.

Signed-off-by: Georg Kohmann 
---
 net/ipv6/icmp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c
index 3ae2fbe..211db37 100644
--- a/net/ipv6/icmp.c
+++ b/net/ipv6/icmp.c
@@ -92,7 +92,7 @@ static void icmpv6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
struct net *net = dev_net(skb->dev);
 
if (type == ICMPV6_PKT_TOOBIG)
-   ip6_update_pmtu(skb, net, info, 0, 0, sock_net_uid(net, NULL));
+   ip6_update_pmtu(skb, net, info, skb->dev->ifindex, 0, sock_net_uid(net, NULL));
else if (type == NDISC_REDIRECT)
ip6_redirect(skb, net, skb->dev->ifindex, 0,
 sock_net_uid(net, NULL));
-- 
2.10.2
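
For context, the oif argument the patch now fills in is the fourth parameter
of the helper; its prototype is roughly (from include/net/ip6_route.h):

void ip6_update_pmtu(struct sk_buff *skb, struct net *net, __be32 mtu,
		     int oif, __u32 mark, kuid_t uid);

Passing oif == 0, as the old code did, lets the route lookup pick an
arbitrary device for the cached PMTU entry.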



Re: [PATCH v7 bpf-next 05/10] veth: Handle xdp_frames in xdp napi ring

2018-08-02 Thread Jesper Dangaard Brouer
On Thu,  2 Aug 2018 19:55:09 +0900
Toshiaki Makita  wrote:

> + headroom = frame->data - delta - (void *)frame;

Your calculation of headroom is still adding an assumption that
xdp_frame is located in the top of data area, that is unnecessary.

The headroom can be calculated as:

 headroom = sizeof(struct xdp_frame) + frame->headroom - delta;

> + skb = veth_build_skb(frame, headroom, len, 0);
> + if (!skb) {
> + xdp_return_frame(frame);
> + goto err;
> + }
> +
> + xdp_scrub_frame(frame);

Thanks you for adding a xdp_scrub_frame() instead.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
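
For illustration, the two expressions agree exactly when the frame metadata
sits at the top of its own data area, as convert_to_xdp_frame() currently
arranges; a small demonstration (the struct below is a trimmed stand-in, not
the kernel's xdp_frame):

#include <assert.h>
#include <stddef.h>

struct xdp_frame {		/* trimmed stand-in */
	void *data;
	unsigned short len;
	unsigned short headroom;
};

int main(void)
{
	static char page[4096];
	struct xdp_frame *frame = (struct xdp_frame *)page;
	int delta = 32;		/* example head adjustment by an XDP program */

	frame->headroom = 192;
	frame->data = page + sizeof(*frame) + frame->headroom;

	assert((char *)frame->data - delta - (char *)frame ==
	       (ptrdiff_t)(sizeof(struct xdp_frame) + frame->headroom - delta));
	return 0;
}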


[PATCH net-next] net/tls: Always get number of sg entries for skb to be decrypted

2018-08-02 Thread Vakul Garg
Function decrypt_skb() made a bad assumption that number of sg entries
required for mapping skb to be decrypted would always be less than
MAX_SKB_FRAGS. The required count of sg entries for skb should always be
calculated. If they cannot fit in local array sgin_arr[], allocate them
from the heap irrespective of whether it is the zero-copy case or otherwise. The
change also benefits the non-zero copy case as we could use sgin_arr[]
instead of always allocating sg entries from heap.

Signed-off-by: Vakul Garg 
---

The said problem has been discussed with Dave Watson over mail list.

 net/tls/tls_sw.c | 16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index ff3a6904a722..e2cf7aebb877 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -693,7 +693,7 @@ int decrypt_skb(struct sock *sk, struct sk_buff *skb,
struct scatterlist sgin_arr[MAX_SKB_FRAGS + 2];
struct scatterlist *sgin = &sgin_arr[0];
struct strp_msg *rxm = strp_msg(skb);
-   int ret, nsg = ARRAY_SIZE(sgin_arr);
+   int ret, nsg;
struct sk_buff *unused;
 
ret = skb_copy_bits(skb, rxm->offset + TLS_HEADER_SIZE,
@@ -703,12 +703,20 @@ int decrypt_skb(struct sock *sk, struct sk_buff *skb,
return ret;
 
memcpy(iv, tls_ctx->rx.iv, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
-   if (!sgout) {
-   nsg = skb_cow_data(skb, 0, &unused) + 1;
+
+   /* If the required number of SG entries for the skb is more than
+* the sgin_arr elements, then dynamically allocate the sg table.
+*/
+   nsg = skb_cow_data(skb, 0, &unused) + 1;
+   if (nsg > ARRAY_SIZE(sgin_arr)) {
sgin = kmalloc_array(nsg, sizeof(*sgin), sk->sk_allocation);
-   sgout = sgin;
+   if (!sgin)
+   return -ENOMEM;
}
 
+   if (!sgout)
+   sgout = sgin;
+
sg_init_table(sgin, nsg);
sg_set_buf(&sgin[0], ctx->rx_aad_ciphertext, TLS_AAD_SPACE_SIZE);
 
-- 
2.13.6



Re: [PATCH bpf] xdp: add NULL pointer check in __xdp_return()

2018-08-02 Thread Björn Töpel
On Wed, 1 Aug 2018 at 22:25, Daniel Borkmann  wrote:
>
> On 08/01/2018 04:43 PM, Björn Töpel wrote:
> > On Wed, 1 Aug 2018 at 16:14, Jesper Dangaard Brouer wrote:
> >> On Mon, 23 Jul 2018 11:41:02 +0200
> >> Björn Töpel  wrote:
> >>
> >> diff --git a/net/core/xdp.c b/net/core/xdp.c
> >> index 9d1f220..1c12bc7 100644
> >> --- a/net/core/xdp.c
> >> +++ b/net/core/xdp.c
> >> @@ -345,7 +345,8 @@ static void __xdp_return(void *data, struct xdp_mem_info *mem, bool napi_direct,
> >>   rcu_read_lock();
> >>   /* mem->id is valid, checked in xdp_rxq_info_reg_mem_model() */
> >>   xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
> >> - xa->zc_alloc->free(xa->zc_alloc, handle);
> >> + if (xa)
> >> + xa->zc_alloc->free(xa->zc_alloc, handle);
> > hmm... It is not clear to me why the "!xa" case doesn't have to be handled?
> 
>  Thank you for reviewing!
> 
>  Returning a NULL pointer is a bug case, such as calling after
>  xdp_rxq_info_unreg() has been used,
>  so I think it can't be handled at that moment.
>  We could make __xdp_return() add a WARN_ON_ONCE() or
>  return an error code to the driver,
>  but I'm not sure if that is useful information.
> 
>  I might have misunderstood the scenario of MEM_TYPE_ZERO_COPY
>  because there is no use case of MEM_TYPE_ZERO_COPY yet.
> >>>
> >>> Taehee, again, sorry for the slow response and thanks for patch!
> >>>
> >>> If xa is NULL, the driver has a buggy/broken implementation. What
> >>> would be a proper way of dealing with this? BUG?
> >>
> >> Hmm... I don't like this kind of change to the hot-path code!
> >>
> >> You might not realize this, but adding BUG() and WARN_ON() to the code
> >> affects performance in ways you might not realize!  These macros get
> >> compiled down to an asm instruction called "ud2".  Seeing the "ud2"
> >> instruction causes the CPU's instruction cache prefetcher to stop.
> >> Thus, if some code ends up below this instruction, this will cause more
> >> i-cache misses.
> >>
> >> I don't know if xa==NULL is even possible, but if it is, then I think
> >> this is a result of a driver mem_reg API usage bug.  And the mem-reg
> >> API is full of WARNs and error messages, exactly to push these kinds of
> >> checks out of the fast path.  There is no need for a BUG() call, as
> >> dereferencing a NULL pointer will cause an OOPS that is easy to read and
> >> understand.
> >
> > Jesper, thanks for having a look! So, you're right that if xa==NULL
> > the driver is "broken/buggy" (as stated earlier!). I agree that
> > OOPSing on a NULL pointer is as good as a BUG!
> >
> > The applied patch adds a WARN_ON_ONCE, and I thought best practice was
> > that a buggy driver shouldn't crash the kernel... What is considered
> > best practices in these scenarios? *I'd* prefer an OOPS instead of
> > WARN_ON_ONCE, to catch that buggy driver. Again, that's me. I thought
> > that most people prefer not crashing, hence the patch. :-)
>
> In that case, lets send a revert for the patch with a proper analysis
> of why it is safe to omit the NULL check which should be placed as a
> comment right near the rhashtable_lookup().
>

I'll do that (as soon as I've double-checked so that I'm not lying)!


Björn

> Thanks,
> Daniel


[PATCH v7 bpf-next 10/10] veth: Support per queue XDP ring

2018-08-02 Thread Toshiaki Makita
Move the XDP and napi related fields from veth_priv to the newly
created veth_rq structure.

When xdp_frames are enqueued from ndo_xdp_xmit and XDP_TX, the rxq is
selected by the current cpu.

When skbs are enqueued from the peer device, the rxq is a one-to-one
mapping of its peer txq. This imposes the restriction that the number
of rxqs must not be less than the number of peer txqs, but leaves the
possibility of achieving bulk skb xmit in the future, because the txq
lock would make it possible to remove the rxq ptr_ring lock.

v3:
- Add extack messages.
- Fix array overrun in veth_xmit.

Signed-off-by: Toshiaki Makita 
---
 drivers/net/veth.c | 278 -
 1 file changed, 188 insertions(+), 90 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index a2ba1c0..0bb409b 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -42,20 +42,24 @@ struct pcpu_vstats {
struct u64_stats_sync   syncp;
 };
 
-struct veth_priv {
+struct veth_rq {
struct napi_struct  xdp_napi;
struct net_device   *dev;
struct bpf_prog __rcu   *xdp_prog;
-   struct bpf_prog *_xdp_prog;
-   struct net_device __rcu *peer;
-   atomic64_t  dropped;
struct xdp_mem_info xdp_mem;
-   unsignedrequested_headroom;
boolrx_notify_masked;
struct ptr_ring xdp_ring;
struct xdp_rxq_info xdp_rxq;
 };
 
+struct veth_priv {
+   struct net_device __rcu *peer;
+   atomic64_t  dropped;
+   struct bpf_prog *_xdp_prog;
+   struct veth_rq  *rq;
+   unsigned intrequested_headroom;
+};
+
 /*
  * ethtool interface
  */
@@ -144,19 +148,19 @@ static void veth_ptr_free(void *ptr)
kfree_skb(ptr);
 }
 
-static void __veth_xdp_flush(struct veth_priv *priv)
+static void __veth_xdp_flush(struct veth_rq *rq)
 {
/* Write ptr_ring before reading rx_notify_masked */
smp_mb();
-   if (!priv->rx_notify_masked) {
-   priv->rx_notify_masked = true;
-   napi_schedule(&priv->xdp_napi);
+   if (!rq->rx_notify_masked) {
+   rq->rx_notify_masked = true;
+   napi_schedule(&rq->xdp_napi);
}
 }
 
-static int veth_xdp_rx(struct veth_priv *priv, struct sk_buff *skb)
+static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
 {
-   if (unlikely(ptr_ring_produce(&priv->xdp_ring, skb))) {
+   if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb))) {
dev_kfree_skb_any(skb);
return NET_RX_DROP;
}
@@ -164,21 +168,22 @@ static int veth_xdp_rx(struct veth_priv *priv, struct 
sk_buff *skb)
return NET_RX_SUCCESS;
 }
 
-static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb, bool 
xdp)
+static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb,
+   struct veth_rq *rq, bool xdp)
 {
-   struct veth_priv *priv = netdev_priv(dev);
-
return __dev_forward_skb(dev, skb) ?: xdp ?
-   veth_xdp_rx(priv, skb) :
+   veth_xdp_rx(rq, skb) :
netif_rx(skb);
 }
 
 static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 {
struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
+   struct veth_rq *rq = NULL;
struct net_device *rcv;
int length = skb->len;
bool rcv_xdp = false;
+   int rxq;
 
rcu_read_lock();
rcv = rcu_dereference(priv->peer);
@@ -188,9 +193,15 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct 
net_device *dev)
}
 
rcv_priv = netdev_priv(rcv);
-   rcv_xdp = rcu_access_pointer(rcv_priv->xdp_prog);
+   rxq = skb_get_queue_mapping(skb);
+   if (rxq < rcv->real_num_rx_queues) {
+   rq = &rcv_priv->rq[rxq];
+   rcv_xdp = rcu_access_pointer(rq->xdp_prog);
+   if (rcv_xdp)
+   skb_record_rx_queue(skb, rxq);
+   }
 
-   if (likely(veth_forward_skb(rcv, skb, rcv_xdp) == NET_RX_SUCCESS)) {
+   if (likely(veth_forward_skb(rcv, skb, rq, rcv_xdp) == NET_RX_SUCCESS)) {
struct pcpu_vstats *stats = this_cpu_ptr(dev->vstats);
 
u64_stats_update_begin(&stats->syncp);
@@ -203,7 +214,7 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct 
net_device *dev)
}
 
if (rcv_xdp)
-   __veth_xdp_flush(rcv_priv);
+   __veth_xdp_flush(rq);
 
rcu_read_unlock();
 
@@ -278,12 +289,18 @@ static struct sk_buff *veth_build_skb(void *head, int 
headroom, int len,
return skb;
 }
 
+static int veth_select_rxq(struct net_device *dev)
+{
+   return smp_processor_id() % dev->real_num_rx_queues;
+}
+
 static int veth_xdp_xmit(struct net_device *dev, int n,
 struct xdp_frame **frames, u32 flags)
 {
struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
struct 

[PATCH v7 bpf-next 06/10] veth: Add ndo_xdp_xmit

2018-08-02 Thread Toshiaki Makita
This allows NIC's XDP to redirect packets to veth. The destination veth
device enqueues redirected packets to the napi ring of its peer, then
they are processed by XDP on its peer veth device.
This can be thought of as one XDP program calling another XDP program
via REDIRECT, when the peer enables driver XDP.

Note that when the peer veth device does not set driver xdp, redirected
packets will be dropped because the peer is not ready for NAPI.

v4:
- Don't use xdp_ok_fwd_dev() because checking IFF_UP is not necessary.
  Add comments about it and check only MTU.

v2:
- Drop the part converting xdp_frame into skb when XDP is not enabled.
- Implement bulk interface of ndo_xdp_xmit.
- Implement XDP_XMIT_FLUSH bit and drop ndo_xdp_flush.

Signed-off-by: Toshiaki Makita 
Acked-by: John Fastabend 
---
 drivers/net/veth.c | 51 +++
 1 file changed, 51 insertions(+)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 9993878..3e1582a 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -125,6 +126,11 @@ static void *veth_ptr_to_xdp(void *ptr)
return (void *)((unsigned long)ptr & ~VETH_XDP_FLAG);
 }
 
+static void *veth_xdp_to_ptr(void *ptr)
+{
+   return (void *)((unsigned long)ptr | VETH_XDP_FLAG);
+}
+
 static void veth_ptr_free(void *ptr)
 {
if (veth_is_xdp_frame(ptr))
@@ -267,6 +273,50 @@ static struct sk_buff *veth_build_skb(void *head, int 
headroom, int len,
return skb;
 }
 
+static int veth_xdp_xmit(struct net_device *dev, int n,
+struct xdp_frame **frames, u32 flags)
+{
+   struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
+   struct net_device *rcv;
+   unsigned int max_len;
+   int i, drops = 0;
+
+   if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
+   return -EINVAL;
+
+   rcv = rcu_dereference(priv->peer);
+   if (unlikely(!rcv))
+   return -ENXIO;
+
+   rcv_priv = netdev_priv(rcv);
+   /* Non-NULL xdp_prog ensures that xdp_ring is initialized on receive
+* side. This means an XDP program is loaded on the peer and the peer
+* device is up.
+*/
+   if (!rcu_access_pointer(rcv_priv->xdp_prog))
+   return -ENXIO;
+
+   max_len = rcv->mtu + rcv->hard_header_len + VLAN_HLEN;
+
+   spin_lock(&rcv_priv->xdp_ring.producer_lock);
+   for (i = 0; i < n; i++) {
+   struct xdp_frame *frame = frames[i];
+   void *ptr = veth_xdp_to_ptr(frame);
+
+   if (unlikely(frame->len > max_len ||
+__ptr_ring_produce(&rcv_priv->xdp_ring, ptr))) {
+   xdp_return_frame_rx_napi(frame);
+   drops++;
+   }
+   }
+   spin_unlock(&rcv_priv->xdp_ring.producer_lock);
+
+   if (flags & XDP_XMIT_FLUSH)
+   __veth_xdp_flush(rcv_priv);
+
+   return n - drops;
+}
+
 static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
struct xdp_frame *frame)
 {
@@ -767,6 +817,7 @@ static int veth_xdp(struct net_device *dev, struct 
netdev_bpf *xdp)
.ndo_features_check = passthru_features_check,
.ndo_set_rx_headroom= veth_set_rx_headroom,
.ndo_bpf= veth_xdp,
+   .ndo_xdp_xmit   = veth_xdp_xmit,
 };
 
 #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
-- 
1.8.3.1




[PATCH v7 bpf-next 07/10] bpf: Make redirect_info accessible from modules

2018-08-02 Thread Toshiaki Makita
We are going to add a kern_flags field in redirect_info for
kernel-internal use.
In order to avoid a function call to access the flags, make
redirect_info accessible from modules. Also, as it is now non-static,
add the prefix bpf_ to redirect_info.
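
After this change a module can reach the per-cpu state the same way the
core does; a minimal sketch:

	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);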

v6:
- Fix sparse warning around EXPORT_SYMBOL.

Signed-off-by: Toshiaki Makita 
---
 include/linux/filter.h | 10 ++
 net/core/filter.c  | 29 +++--
 2 files changed, 21 insertions(+), 18 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index c73dd73..4717af8 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -537,6 +537,16 @@ struct sk_msg_buff {
struct list_head list;
 };
 
+struct bpf_redirect_info {
+   u32 ifindex;
+   u32 flags;
+   struct bpf_map *map;
+   struct bpf_map *map_to_flush;
+   unsigned long   map_owner;
+};
+
+DECLARE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
+
 /* Compute the linear packet data range [data, data_end) which
  * will be accessed by various program types (cls_bpf, act_bpf,
  * lwt, ...). Subsystems allowing direct data access must (!)
diff --git a/net/core/filter.c b/net/core/filter.c
index 104d560..2766a55 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2080,19 +2080,12 @@ static int __bpf_redirect(struct sk_buff *skb, struct 
net_device *dev,
.arg3_type  = ARG_ANYTHING,
 };
 
-struct redirect_info {
-   u32 ifindex;
-   u32 flags;
-   struct bpf_map *map;
-   struct bpf_map *map_to_flush;
-   unsigned long   map_owner;
-};
-
-static DEFINE_PER_CPU(struct redirect_info, redirect_info);
+DEFINE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
+EXPORT_PER_CPU_SYMBOL_GPL(bpf_redirect_info);
 
 BPF_CALL_2(bpf_redirect, u32, ifindex, u64, flags)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 
if (unlikely(flags & ~(BPF_F_INGRESS)))
return TC_ACT_SHOT;
@@ -2105,7 +2098,7 @@ struct redirect_info {
 
 int skb_do_redirect(struct sk_buff *skb)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
struct net_device *dev;
 
dev = dev_get_by_index_rcu(dev_net(skb->dev), ri->ifindex);
@@ -3198,7 +3191,7 @@ static int __bpf_tx_xdp_map(struct net_device *dev_rx, 
void *fwd,
 
 void xdp_do_flush_map(void)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
struct bpf_map *map = ri->map_to_flush;
 
ri->map_to_flush = NULL;
@@ -3243,7 +3236,7 @@ static inline bool xdp_map_invalid(const struct bpf_prog 
*xdp_prog,
 static int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp,
   struct bpf_prog *xdp_prog)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
unsigned long map_owner = ri->map_owner;
struct bpf_map *map = ri->map;
u32 index = ri->ifindex;
@@ -3283,7 +3276,7 @@ static int xdp_do_redirect_map(struct net_device *dev, 
struct xdp_buff *xdp,
 int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
struct bpf_prog *xdp_prog)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
struct net_device *fwd;
u32 index = ri->ifindex;
int err;
@@ -3315,7 +3308,7 @@ static int xdp_do_generic_redirect_map(struct net_device 
*dev,
   struct xdp_buff *xdp,
   struct bpf_prog *xdp_prog)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
unsigned long map_owner = ri->map_owner;
struct bpf_map *map = ri->map;
u32 index = ri->ifindex;
@@ -3366,7 +3359,7 @@ static int xdp_do_generic_redirect_map(struct net_device 
*dev,
 int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
struct xdp_buff *xdp, struct bpf_prog *xdp_prog)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
u32 index = ri->ifindex;
struct net_device *fwd;
int err = 0;
@@ -3397,7 +3390,7 @@ int xdp_do_generic_redirect(struct net_device *dev, 
struct sk_buff *skb,
 
 BPF_CALL_2(bpf_xdp_redirect, u32, ifindex, u64, flags)
 {
-   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 
if (unlikely(flags))
return XDP_ABORTED;
@@ -3421,7 +3414,7 @@ int xdp_do_generic_redirect(struct net_device *dev, 
struct sk_buff *skb,
 BPF_CALL_4(bpf_xdp_redirect_map, struct bpf_map *, map, u32, ifindex, 

[PATCH v7 bpf-next 09/10] veth: Add XDP TX and REDIRECT

2018-08-02 Thread Toshiaki Makita
This allows further redirection of xdp_frames like

 NIC   -> veth--veth -> veth--veth
 (XDP)  (XDP) (XDP)

The intermediate XDP, redirecting packets from NIC to the other veth,
reuses xdp_mem_info from NIC so that page recycling of the NIC works on
the destination veth's XDP.
In this way return_frame is not fully guarded by NAPI, since another
NAPI handler on another cpu may use the same xdp_mem_info concurrently.
Thus disable napi_direct by xdp_set_return_frame_no_direct() during the
NAPI context.

v4:
- Use xdp_[set|clear]_return_frame_no_direct() instead of a flag in
  xdp_mem_info.

v3:
- Fix double free when veth_xdp_tx() returns a positive value.
- Convert xdp_xmit and xdp_redir variables into flags.

Signed-off-by: Toshiaki Makita 
---
 drivers/net/veth.c | 119 +
 1 file changed, 110 insertions(+), 9 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 3e1582a..a2ba1c0 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -32,6 +32,10 @@
 #define VETH_RING_SIZE 256
 #define VETH_XDP_HEADROOM  (XDP_PACKET_HEADROOM + NET_IP_ALIGN)
 
+/* Separating two types of XDP xmit */
+#define VETH_XDP_TXBIT(0)
+#define VETH_XDP_REDIR BIT(1)
+
 struct pcpu_vstats {
u64 packets;
u64 bytes;
@@ -45,6 +49,7 @@ struct veth_priv {
struct bpf_prog *_xdp_prog;
struct net_device __rcu *peer;
atomic64_t  dropped;
+   struct xdp_mem_info xdp_mem;
unsignedrequested_headroom;
boolrx_notify_masked;
struct ptr_ring xdp_ring;
@@ -317,10 +322,42 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
return n - drops;
 }
 
+static void veth_xdp_flush(struct net_device *dev)
+{
+   struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
+   struct net_device *rcv;
+
+   rcu_read_lock();
+   rcv = rcu_dereference(priv->peer);
+   if (unlikely(!rcv))
+   goto out;
+
+   rcv_priv = netdev_priv(rcv);
+   /* xdp_ring is initialized on receive side? */
+   if (unlikely(!rcu_access_pointer(rcv_priv->xdp_prog)))
+   goto out;
+
+   __veth_xdp_flush(rcv_priv);
+out:
+   rcu_read_unlock();
+}
+
+static int veth_xdp_tx(struct net_device *dev, struct xdp_buff *xdp)
+{
+   struct xdp_frame *frame = convert_to_xdp_frame(xdp);
+
+   if (unlikely(!frame))
+   return -EOVERFLOW;
+
+   return veth_xdp_xmit(dev, 1, &frame, 0);
+}
+
 static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
-   struct xdp_frame *frame)
+   struct xdp_frame *frame,
+   unsigned int *xdp_xmit)
 {
int len = frame->len, delta = 0;
+   struct xdp_frame orig_frame;
struct bpf_prog *xdp_prog;
unsigned int headroom;
struct sk_buff *skb;
@@ -344,6 +381,29 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_priv 
*priv,
delta = frame->data - xdp.data;
len = xdp.data_end - xdp.data;
break;
+   case XDP_TX:
+   orig_frame = *frame;
+   xdp.data_hard_start = frame;
+   xdp.rxq->mem = frame->mem;
+   if (unlikely(veth_xdp_tx(priv->dev, &xdp) < 0)) {
+   trace_xdp_exception(priv->dev, xdp_prog, act);
+   frame = &orig_frame;
+   goto err_xdp;
+   }
+   *xdp_xmit |= VETH_XDP_TX;
+   rcu_read_unlock();
+   goto xdp_xmit;
+   case XDP_REDIRECT:
+   orig_frame = *frame;
+   xdp.data_hard_start = frame;
+   xdp.rxq->mem = frame->mem;
+   if (xdp_do_redirect(priv->dev, &xdp, xdp_prog)) {
+   frame = &orig_frame;
+   goto err_xdp;
+   }
+   *xdp_xmit |= VETH_XDP_REDIR;
+   rcu_read_unlock();
+   goto xdp_xmit;
default:
bpf_warn_invalid_xdp_action(act);
case XDP_ABORTED:
@@ -368,12 +428,13 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_priv 
*priv,
 err_xdp:
rcu_read_unlock();
xdp_return_frame(frame);
-
+xdp_xmit:
return NULL;
 }
 
 static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
-   struct sk_buff *skb)
+   struct sk_buff *skb,
+   unsigned int *xdp_xmit)
 {
u32 pktlen, headroom, act, metalen;
void 

[PATCH v7 bpf-next 08/10] xdp: Helpers for disabling napi_direct of xdp_return_frame

2018-08-02 Thread Toshiaki Makita
We need some mechanism to disable napi_direct on calling
xdp_return_frame_rx_napi() from some context.
When veth gets support for XDP_REDIRECT, it will redirect packets which
are redirected from other devices. On redirection, veth will reuse the
xdp_mem_info of the redirection source device to make return_frame work.
But in this case .ndo_xdp_xmit() called from veth redirection uses
xdp_mem_info which is not guarded by NAPI, because the .ndo_xdp_xmit()
is not called directly from the rxq which owns the xdp_mem_info.

This approach introduces a flag in bpf_redirect_info to indicate that
napi_direct should be disabled even when the _rx_napi variant is used,
as well as helper functions to use it.

A NAPI handler who wants to use this flag needs to call
xdp_set_return_frame_no_direct() before processing packets, and call
xdp_clear_return_frame_no_direct() after xdp_do_flush_map() before
exiting NAPI.
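
A simplified sketch of that calling convention inside a NAPI poll
handler (modelled on the veth_poll() added later in this series; napi
completion handling elided):

	static int veth_poll(struct napi_struct *napi, int budget)
	{
		struct veth_priv *priv =
			container_of(napi, struct veth_priv, xdp_napi);
		int done;

		xdp_set_return_frame_no_direct();
		done = veth_xdp_rcv(priv, budget);
		xdp_do_flush_map();
		xdp_clear_return_frame_no_direct();

		return done;
	}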

v4:
- Use bpf_redirect_info for storing the flag instead of xdp_mem_info to
  avoid per-frame copy cost.

Signed-off-by: Toshiaki Makita 
---
 include/linux/filter.h | 25 +
 net/core/xdp.c |  6 --
 2 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 4717af8..2b072da 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -543,10 +543,14 @@ struct bpf_redirect_info {
struct bpf_map *map;
struct bpf_map *map_to_flush;
unsigned long   map_owner;
+   u32 kern_flags;
 };
 
 DECLARE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
 
+/* flags for bpf_redirect_info kern_flags */
+#define BPF_RI_F_RF_NO_DIRECT  BIT(0)  /* no napi_direct on return_frame */
+
 /* Compute the linear packet data range [data, data_end) which
  * will be accessed by various program types (cls_bpf, act_bpf,
  * lwt, ...). Subsystems allowing direct data access must (!)
@@ -775,6 +779,27 @@ static inline bool bpf_dump_raw_ok(void)
 struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
   const struct bpf_insn *patch, u32 len);
 
+static inline bool xdp_return_frame_no_direct(void)
+{
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+
+   return ri->kern_flags & BPF_RI_F_RF_NO_DIRECT;
+}
+
+static inline void xdp_set_return_frame_no_direct(void)
+{
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+
+   ri->kern_flags |= BPF_RI_F_RF_NO_DIRECT;
+}
+
+static inline void xdp_clear_return_frame_no_direct(void)
+{
+   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+
+   ri->kern_flags &= ~BPF_RI_F_RF_NO_DIRECT;
+}
+
 static inline int xdp_ok_fwd_dev(const struct net_device *fwd,
 unsigned int pktlen)
 {
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 5728538..3dd99e1 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -330,10 +330,12 @@ static void __xdp_return(void *data, struct xdp_mem_info 
*mem, bool napi_direct,
/* mem->id is valid, checked in xdp_rxq_info_reg_mem_model() */
xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
page = virt_to_head_page(data);
-   if (xa)
+   if (xa) {
+   napi_direct &= !xdp_return_frame_no_direct();
page_pool_put_page(xa->page_pool, page, napi_direct);
-   else
+   } else {
put_page(page);
+   }
rcu_read_unlock();
break;
case MEM_TYPE_PAGE_SHARED:
-- 
1.8.3.1




[PATCH v7 bpf-next 05/10] veth: Handle xdp_frames in xdp napi ring

2018-08-02 Thread Toshiaki Makita
This is preparation for XDP TX and ndo_xdp_xmit.
This allows the napi handler to handle xdp_frames through the xdp ring
as well as sk_buffs.
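
The skb/xdp_frame distinction in the ring relies on pointer tagging:
both pointer types are at least word aligned, so bit 0 is free to
carry the type. Condensed from the helpers in this patch (the
produce-side tagging arrives with ndo_xdp_xmit in a later patch):

	#define VETH_XDP_FLAG	BIT(0)

	/* produce side: tag xdp_frame pointers */
	ptr = (void *)((unsigned long)frame | VETH_XDP_FLAG);

	/* consume side: test and strip the tag */
	if ((unsigned long)ptr & VETH_XDP_FLAG)
		frame = veth_ptr_to_xdp(ptr);	/* clears bit 0 */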

v7:
- Use xdp_scrub_frame() instead of memset().

v3:
- Revert v2 change around rings and use a flag to differentiate skb and
  xdp_frame, since bulk skb xmit makes little performance difference
  for now.

v2:
- Use another ring instead of using flag to differentiate skb and
  xdp_frame. This approach makes bulk skb transmit possible in
  veth_xmit later.
- Clear xdp_frame fields in skb->head.
- Implement adjust_tail.

Signed-off-by: Toshiaki Makita 
Acked-by: John Fastabend 
---
 drivers/net/veth.c | 87 ++
 1 file changed, 82 insertions(+), 5 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 9edf104..9993878 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -22,12 +22,12 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 
 #define DRV_NAME   "veth"
 #define DRV_VERSION"1.0"
 
+#define VETH_XDP_FLAG  BIT(0)
 #define VETH_RING_SIZE 256
 #define VETH_XDP_HEADROOM  (XDP_PACKET_HEADROOM + NET_IP_ALIGN)
 
@@ -115,6 +115,24 @@ static void veth_get_ethtool_stats(struct net_device *dev,
 
 /* general routines */
 
+static bool veth_is_xdp_frame(void *ptr)
+{
+   return (unsigned long)ptr & VETH_XDP_FLAG;
+}
+
+static void *veth_ptr_to_xdp(void *ptr)
+{
+   return (void *)((unsigned long)ptr & ~VETH_XDP_FLAG);
+}
+
+static void veth_ptr_free(void *ptr)
+{
+   if (veth_is_xdp_frame(ptr))
+   xdp_return_frame(veth_ptr_to_xdp(ptr));
+   else
+   kfree_skb(ptr);
+}
+
 static void __veth_xdp_flush(struct veth_priv *priv)
 {
/* Write ptr_ring before reading rx_notify_masked */
@@ -249,6 +267,61 @@ static struct sk_buff *veth_build_skb(void *head, int 
headroom, int len,
return skb;
 }
 
+static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
+   struct xdp_frame *frame)
+{
+   int len = frame->len, delta = 0;
+   struct bpf_prog *xdp_prog;
+   unsigned int headroom;
+   struct sk_buff *skb;
+
+   rcu_read_lock();
+   xdp_prog = rcu_dereference(priv->xdp_prog);
+   if (likely(xdp_prog)) {
+   struct xdp_buff xdp;
+   u32 act;
+
+   xdp.data_hard_start = frame->data - frame->headroom;
+   xdp.data = frame->data;
+   xdp.data_end = frame->data + frame->len;
+   xdp.data_meta = frame->data - frame->metasize;
+   xdp.rxq = &priv->xdp_rxq;
+
+   act = bpf_prog_run_xdp(xdp_prog, );
+
+   switch (act) {
+   case XDP_PASS:
+   delta = frame->data - xdp.data;
+   len = xdp.data_end - xdp.data;
+   break;
+   default:
+   bpf_warn_invalid_xdp_action(act);
+   case XDP_ABORTED:
+   trace_xdp_exception(priv->dev, xdp_prog, act);
+   case XDP_DROP:
+   goto err_xdp;
+   }
+   }
+   rcu_read_unlock();
+
+   headroom = frame->data - delta - (void *)frame;
+   skb = veth_build_skb(frame, headroom, len, 0);
+   if (!skb) {
+   xdp_return_frame(frame);
+   goto err;
+   }
+
+   xdp_scrub_frame(frame);
+   skb->protocol = eth_type_trans(skb, priv->dev);
+err:
+   return skb;
+err_xdp:
+   rcu_read_unlock();
+   xdp_return_frame(frame);
+
+   return NULL;
+}
+
 static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
struct sk_buff *skb)
 {
@@ -359,12 +432,16 @@ static int veth_xdp_rcv(struct veth_priv *priv, int 
budget)
int i, done = 0;
 
for (i = 0; i < budget; i++) {
-   struct sk_buff *skb = __ptr_ring_consume(>xdp_ring);
+   void *ptr = __ptr_ring_consume(>xdp_ring);
+   struct sk_buff *skb;
 
-   if (!skb)
+   if (!ptr)
break;
 
-   skb = veth_xdp_rcv_skb(priv, skb);
+   if (veth_is_xdp_frame(ptr))
+   skb = veth_xdp_rcv_one(priv, veth_ptr_to_xdp(ptr));
+   else
+   skb = veth_xdp_rcv_skb(priv, ptr);
 
if (skb)
napi_gro_receive(&priv->xdp_napi, skb);
@@ -417,7 +494,7 @@ static void veth_napi_del(struct net_device *dev)
napi_disable(&priv->xdp_napi);
netif_napi_del(&priv->xdp_napi);
priv->rx_notify_masked = false;
-   ptr_ring_cleanup(&priv->xdp_ring, __skb_array_destroy_skb);
+   ptr_ring_cleanup(&priv->xdp_ring, veth_ptr_free);
 }
 
 static int veth_enable_xdp(struct net_device *dev)
-- 
1.8.3.1




[PATCH v7 bpf-next 03/10] veth: Avoid drops by oversized packets when XDP is enabled

2018-08-02 Thread Toshiaki Makita
Oversized packets, including GSO packets, can be dropped if XDP is
enabled on the receiver side, so don't send such packets from the peer.

Drop TSO and SCTP fragmentation features so that veth devices themselves
segment packets with XDP enabled. Also cap the MTU accordingly.
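
For a rough sense of the resulting cap (a sketch assuming 4 KiB pages,
ETH_HLEN of 14 for hard_header_len, XDP_PACKET_HEADROOM of 256 plus
NET_IP_ALIGN of 2, and SKB_DATA_ALIGN(sizeof(struct skb_shared_info))
of 320 on x86-64):

	max_mtu = 4096 - (256 + 2) - 14 - 320 = 3504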

v4:
- Don't auto-adjust MTU but cap max MTU.

Signed-off-by: Toshiaki Makita 
---
 drivers/net/veth.c | 47 +--
 1 file changed, 45 insertions(+), 2 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index d3b9f10..9edf104 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -543,6 +543,23 @@ static int veth_get_iflink(const struct net_device *dev)
return iflink;
 }
 
+static netdev_features_t veth_fix_features(struct net_device *dev,
+  netdev_features_t features)
+{
+   struct veth_priv *priv = netdev_priv(dev);
+   struct net_device *peer;
+
+   peer = rtnl_dereference(priv->peer);
+   if (peer) {
+   struct veth_priv *peer_priv = netdev_priv(peer);
+
+   if (peer_priv->_xdp_prog)
+   features &= ~NETIF_F_GSO_SOFTWARE;
+   }
+
+   return features;
+}
+
 static void veth_set_rx_headroom(struct net_device *dev, int new_hr)
 {
struct veth_priv *peer_priv, *priv = netdev_priv(dev);
@@ -572,6 +589,7 @@ static int veth_xdp_set(struct net_device *dev, struct 
bpf_prog *prog,
struct veth_priv *priv = netdev_priv(dev);
struct bpf_prog *old_prog;
struct net_device *peer;
+   unsigned int max_mtu;
int err;
 
old_prog = priv->_xdp_prog;
@@ -585,6 +603,15 @@ static int veth_xdp_set(struct net_device *dev, struct 
bpf_prog *prog,
goto err;
}
 
+   max_mtu = PAGE_SIZE - VETH_XDP_HEADROOM -
+ peer->hard_header_len -
+ SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+   if (peer->mtu > max_mtu) {
+   NL_SET_ERR_MSG_MOD(extack, "Peer MTU is too large to set XDP");
+   err = -ERANGE;
+   goto err;
+   }
+
if (dev->flags & IFF_UP) {
err = veth_enable_xdp(dev);
if (err) {
@@ -592,14 +619,29 @@ static int veth_xdp_set(struct net_device *dev, struct 
bpf_prog *prog,
goto err;
}
}
+
+   if (!old_prog) {
+   peer->hw_features &= ~NETIF_F_GSO_SOFTWARE;
+   peer->max_mtu = max_mtu;
+   }
}
 
if (old_prog) {
-   if (!prog && dev->flags & IFF_UP)
-   veth_disable_xdp(dev);
+   if (!prog) {
+   if (dev->flags & IFF_UP)
+   veth_disable_xdp(dev);
+
+   if (peer) {
+   peer->hw_features |= NETIF_F_GSO_SOFTWARE;
+   peer->max_mtu = ETH_MAX_MTU;
+   }
+   }
bpf_prog_put(old_prog);
}
 
+   if ((!!old_prog ^ !!prog) && peer)
+   netdev_update_features(peer);
+
return 0;
 err:
priv->_xdp_prog = old_prog;
@@ -644,6 +686,7 @@ static int veth_xdp(struct net_device *dev, struct 
netdev_bpf *xdp)
.ndo_poll_controller= veth_poll_controller,
 #endif
.ndo_get_iflink = veth_get_iflink,
+   .ndo_fix_features   = veth_fix_features,
.ndo_features_check = passthru_features_check,
.ndo_set_rx_headroom= veth_set_rx_headroom,
.ndo_bpf= veth_xdp,
-- 
1.8.3.1




[PATCH v7 bpf-next 04/10] xdp: Helper function to clear kernel pointers in xdp_frame

2018-08-02 Thread Toshiaki Makita
xdp_frame has kernel pointers which should not be readable from bpf
programs. When we want to reuse the xdp_frame region but it may be read
by bpf programs later, we can use this helper to clear the kernel
pointers. This is more efficient than calling memset() for the entire
struct.
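
The intended call site (as used by the veth patches in this series when
an xdp_frame's memory is handed to the stack as an skb) looks roughly
like:

	skb = veth_build_skb(frame, headroom, len, 0);
	...
	xdp_scrub_frame(frame);	/* frame header now lives in skb->head */
	skb->protocol = eth_type_trans(skb, priv->dev);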

Signed-off-by: Toshiaki Makita 
---
 include/net/xdp.h | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index fcb033f..76b9525 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -84,6 +84,13 @@ struct xdp_frame {
struct net_device *dev_rx; /* used by cpumap */
 };
 
+/* Clear kernel pointers in xdp_frame */
+static inline void xdp_scrub_frame(struct xdp_frame *frame)
+{
+   frame->data = NULL;
+   frame->dev_rx = NULL;
+}
+
 /* Convert xdp_buff to xdp_frame */
 static inline
 struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
-- 
1.8.3.1




[PATCH v7 bpf-next 02/10] veth: Add driver XDP

2018-08-02 Thread Toshiaki Makita
This is the basic implementation of veth driver XDP.

Incoming packets are sent from the peer veth device in the form of an
skb, so this is generally doing the same thing as generic XDP.

This itself is not so useful, but it is a starting point for
implementing other useful veth XDP features like TX and REDIRECT.

This introduces NAPI when XDP is enabled, because XDP now heavily
relies on NAPI context. Use ptr_ring to emulate the NIC ring. The Tx
function enqueues packets to the ring and the peer NAPI handler drains
the ring.

Currently only one ring is allocated for each veth device, so it does
not scale in a multiqueue environment. This can be resolved by
allocating rings on a per-queue basis later.

Note that when XDP is not loaded, NAPI is not used and netif_rx is used
instead, so this does not change the default behaviour.
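
Condensed, the skb path when the peer has XDP enabled is (names from
the diff below):

	/* sender side, in veth_xmit() towards the peer: */
	ptr_ring_produce(&rcv_priv->xdp_ring, skb);
	__veth_xdp_flush(rcv_priv);		/* napi_schedule() if needed */

	/* receiver side, in the NAPI handler: */
	skb = __ptr_ring_consume(&priv->xdp_ring);
	skb = veth_xdp_rcv_skb(priv, skb);	/* runs the XDP program */
	napi_gro_receive(&priv->xdp_napi, skb);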

v6:
- Check skb->len only when allocation is needed.
- Add __GFP_NOWARN to alloc_page() as it can be triggered by external
  events.

v3:
- Fix race on closing the device.
- Add extack messages in ndo_bpf.

v2:
- Squashed with the patch adding NAPI.
- Implement adjust_tail.
- Don't acquire consumer lock because it is guarded by NAPI.
- Make poll_controller noop since it is unnecessary.
- Register rxq_info on enabling XDP rather than on opening the device.

Signed-off-by: Toshiaki Makita 
---
 drivers/net/veth.c | 374 -
 1 file changed, 367 insertions(+), 7 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index a69ad39..d3b9f10 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -19,10 +19,18 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
+#include 
+#include 
 
 #define DRV_NAME   "veth"
 #define DRV_VERSION"1.0"
 
+#define VETH_RING_SIZE 256
+#define VETH_XDP_HEADROOM  (XDP_PACKET_HEADROOM + NET_IP_ALIGN)
+
 struct pcpu_vstats {
u64 packets;
u64 bytes;
@@ -30,9 +38,16 @@ struct pcpu_vstats {
 };
 
 struct veth_priv {
+   struct napi_struct  xdp_napi;
+   struct net_device   *dev;
+   struct bpf_prog __rcu   *xdp_prog;
+   struct bpf_prog *_xdp_prog;
struct net_device __rcu *peer;
atomic64_t  dropped;
unsignedrequested_headroom;
+   boolrx_notify_masked;
+   struct ptr_ring xdp_ring;
+   struct xdp_rxq_info xdp_rxq;
 };
 
 /*
@@ -98,11 +113,43 @@ static void veth_get_ethtool_stats(struct net_device *dev,
.get_link_ksettings = veth_get_link_ksettings,
 };
 
-static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
+/* general routines */
+
+static void __veth_xdp_flush(struct veth_priv *priv)
+{
+   /* Write ptr_ring before reading rx_notify_masked */
+   smp_mb();
+   if (!priv->rx_notify_masked) {
+   priv->rx_notify_masked = true;
+   napi_schedule(&priv->xdp_napi);
+   }
+}
+
+static int veth_xdp_rx(struct veth_priv *priv, struct sk_buff *skb)
+{
+   if (unlikely(ptr_ring_produce(&priv->xdp_ring, skb))) {
+   dev_kfree_skb_any(skb);
+   return NET_RX_DROP;
+   }
+
+   return NET_RX_SUCCESS;
+}
+
+static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb, bool 
xdp)
 {
struct veth_priv *priv = netdev_priv(dev);
+
+   return __dev_forward_skb(dev, skb) ?: xdp ?
+   veth_xdp_rx(priv, skb) :
+   netif_rx(skb);
+}
+
+static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+   struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
struct net_device *rcv;
int length = skb->len;
+   bool rcv_xdp = false;
 
rcu_read_lock();
rcv = rcu_dereference(priv->peer);
@@ -111,7 +158,10 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct 
net_device *dev)
goto drop;
}
 
-   if (likely(dev_forward_skb(rcv, skb) == NET_RX_SUCCESS)) {
+   rcv_priv = netdev_priv(rcv);
+   rcv_xdp = rcu_access_pointer(rcv_priv->xdp_prog);
+
+   if (likely(veth_forward_skb(rcv, skb, rcv_xdp) == NET_RX_SUCCESS)) {
struct pcpu_vstats *stats = this_cpu_ptr(dev->vstats);
 
u64_stats_update_begin(&stats->syncp);
@@ -122,14 +172,15 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct 
net_device *dev)
 drop:
atomic64_inc(&priv->dropped);
}
+
+   if (rcv_xdp)
+   __veth_xdp_flush(rcv_priv);
+
rcu_read_unlock();
+
return NETDEV_TX_OK;
 }
 
-/*
- * general routines
- */
-
 static u64 veth_stats_one(struct pcpu_vstats *result, struct net_device *dev)
 {
struct veth_priv *priv = netdev_priv(dev);
@@ -179,18 +230,254 @@ static void veth_set_multicast_list(struct net_device 
*dev)
 {
 }
 
+static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
+ int buflen)
+{
+   struct sk_buff 

[PATCH v7 bpf-next 01/10] net: Export skb_headers_offset_update

2018-08-02 Thread Toshiaki Makita
This is needed for veth XDP, which does a skb_copy_expand()-like operation.

v2:
- Drop skb_copy_header part because it has already been exported now.

Signed-off-by: Toshiaki Makita 
---
 include/linux/skbuff.h | 1 +
 net/core/skbuff.c  | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index fd3cb1b..f692968 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1035,6 +1035,7 @@ static inline struct sk_buff *alloc_skb_fclone(unsigned 
int size,
 }
 
 struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);
+void skb_headers_offset_update(struct sk_buff *skb, int off);
 int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask);
 struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t priority);
 void skb_copy_header(struct sk_buff *new, const struct sk_buff *old);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 266b954..f5670e6 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1291,7 +1291,7 @@ struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t 
gfp_mask)
 }
 EXPORT_SYMBOL(skb_clone);
 
-static void skb_headers_offset_update(struct sk_buff *skb, int off)
+void skb_headers_offset_update(struct sk_buff *skb, int off)
 {
/* Only adjust this if it actually is csum_start rather than csum */
if (skb->ip_summed == CHECKSUM_PARTIAL)
@@ -1305,6 +1305,7 @@ static void skb_headers_offset_update(struct sk_buff 
*skb, int off)
skb->inner_network_header += off;
skb->inner_mac_header += off;
 }
+EXPORT_SYMBOL(skb_headers_offset_update);
 
 void skb_copy_header(struct sk_buff *new, const struct sk_buff *old)
 {
-- 
1.8.3.1




[PATCH v7 bpf-next 00/10] veth: Driver XDP

2018-08-02 Thread Toshiaki Makita
This patch set introduces driver XDP for veth.
Basically this is used in conjunction with redirect action of another XDP
program.

  NIC ---> veth===veth
 (XDP) (redirect)(XDP)

In this case xdp_frame can be forwarded to the peer veth without
modification, so we can expect far better performance than generic XDP.


Envisioned use-cases


* Container managed XDP program
Container host redirects frames to containers by XDP redirect action, and
privileged containers can deploy their own XDP programs.

* XDP program cascading
Two or more XDP programs can be called for each packet by redirecting
xdp frames to veth.

* Internal interface for an XDP bridge
When using XDP redirection to create a virtual bridge, veth can be used
to create an internal interface for the bridge.


Implementation
--

This changeset is making use of NAPI to implement ndo_xdp_xmit and
XDP_TX/REDIRECT. This is mainly because XDP heavily relies on NAPI
context.
 - patch 1: Export a function needed for veth XDP.
 - patch 2-3: Basic implementation of veth XDP.
 - patch 4-6: Add ndo_xdp_xmit.
 - patch 7-9: Add XDP_TX and XDP_REDIRECT.
 - patch 10: Performance optimization for multi-queue env.


Tests and performance numbers
-

Tested with a simple XDP program which only redirects packets between
NIC and veth. I used i40e 25G NIC (XXV710) for the physical NIC. The
server has 20 Xeon Silver 2.20 GHz cores.

  pktgen --(wire)--> XXV710 (i40e) <--(XDP redirect)--> veth===veth (XDP)

The rightmost veth loads XDP progs and just does DROP or TX. The number
of packets is measured in the XDP progs. The leftmost pktgen sends
packets at 37.1 Mpps (almost 25G wire speed).

veth XDP action    Flows    Mpps
================================
DROP                   1    10.6
DROP                   2    21.2
DROP                 100    36.0
TX                     1     5.0
TX                     2    10.0
TX                   100    31.0

I also measured netperf TCP_STREAM, but performance was not so great due
to the lack of tx/rx checksum offload, TSO, etc.

  netperf <--(wire)--> XXV710 (i40e) <--(XDP redirect)--> veth===veth (XDP PASS)

Direction        Flows   Gbps
=============================
external->veth       1   20.8
external->veth       2   23.5
external->veth     100   23.6
veth->external       1    9.0
veth->external       2   17.8
veth->external     100   22.9

Also tested doing ifup/down or loading/unloading an XDP program
repeatedly while processing XDP packets, in order to check that
enabling/disabling NAPI works as expected, and found no problems.

v7:
- Introduce xdp_scrub_frame() to clear kernel pointers in xdp_frame and
  use it instead of memset().

v6:
- Check skb->len only if reallocation is needed.
- Add __GFP_NOWARN to alloc_page() since it can be triggered by external
  events.
- Fix sparse warning around EXPORT_SYMBOL.

v5:
- Fix broken SOBs.

v4:
- Don't adjust MTU automatically.
- Skip peer IFF_UP check on .ndo_xdp_xmit() because it is unnecessary.
  Add comments to explain that.
- Use redirect_info instead of xdp_mem_info for storing no_direct flag
  to avoid per packet copy cost.

v3:
- Drop skb bulk xmit patch since it makes little performance
  difference. The hotspot in TCP skb xmit at this point is checksum
  computation in skb_segment and packet copy on XDP_REDIRECT due to
  cloned/nonlinear skb.
- Fix race on closing device.
- Add extack messages in ndo_bpf.

v2:
- Squash NAPI patch with "Add driver XDP" patch.
- Remove conversion from xdp_frame to skb when NAPI is not enabled.
- Introduce per-queue XDP ring (patch 8).
- Introduce bulk skb xmit when XDP is enabled on the peer (patch 9).

Signed-off-by: Toshiaki Makita 

Toshiaki Makita (10):
  net: Export skb_headers_offset_update
  veth: Add driver XDP
  veth: Avoid drops by oversized packets when XDP is enabled
  xdp: Helper function to clear kernel pointers in xdp_frame
  veth: Handle xdp_frames in xdp napi ring
  veth: Add ndo_xdp_xmit
  bpf: Make redirect_info accessible from modules
  xdp: Helpers for disabling napi_direct of xdp_return_frame
  veth: Add XDP TX and REDIRECT
  veth: Support per queue XDP ring

 drivers/net/veth.c | 748 -
 include/linux/filter.h |  35 +++
 include/linux/skbuff.h |   1 +
 include/net/xdp.h  |   7 +
 net/core/filter.c  |  29 +-
 net/core/skbuff.c  |   3 +-
 net/core/xdp.c |   6 +-
 7 files changed, 799 insertions(+), 30 deletions(-)

-- 
1.8.3.1




[PATCH net-next] net: Fix coding style in skb_push()

2018-08-02 Thread Ganesh Goudar
Signed-off-by: Ganesh Goudar 
---
 net/core/skbuff.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 266b954..51b0a912 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1715,7 +1715,7 @@ void *skb_push(struct sk_buff *skb, unsigned int len)
 {
skb->data -= len;
skb->len  += len;
-   if (unlikely(skb->data<skb->head))
+   if (unlikely(skb->data < skb->head))
skb_under_panic(skb, len, __builtin_return_address(0));
return skb->data;
 }
-- 
2.1.0



[PATCH net-next] net/tls: Mark the end in scatterlist table

2018-08-02 Thread Vakul Garg
Function zerocopy_from_iter() unmarks the 'end' in input sgtable while
adding new entries in it. The last entry in sgtable remained unmarked.
This results in KASAN error report on using apis like sg_nents(). Before
returning, the function needs to mark the 'end' in the last entry it
adds.

Signed-off-by: Vakul Garg 
---
 net/tls/tls_sw.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index ff3a6904a722..83d67df33f0c 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -311,6 +311,9 @@ static int zerocopy_from_iter(struct sock *sk, struct 
iov_iter *from,
}
}
 
+   /* Mark the end in the last sg entry if newly added */
+   if (num_elem > *pages_used)
+   sg_mark_end(&to[num_elem - 1]);
 out:
if (rc)
iov_iter_revert(from, size - *size_used);
-- 
2.13.6



UDP packets arriving on wrong sockets

2018-08-02 Thread Andrew Cann
Hi, 

I posted this on stackoverflow yesterday but I'm reposting it here since it got
no response. Original post: 
https://stackoverflow.com/questions/51630337/udp-packets-arriving-on-wrong-sockets-on-linux

I have two UDP sockets bound to the same address and connected to addresses A
and B. I have two more UDP sockets bound to A and B and not connected.

This is what my /proc/net/udp looks like (trimmed for readability):

  sl  local_address rem_address
 3937: 017F:DD9C 037F:9910
 3937: 017F:DD9C 027F:907D
16962: 027F:907D :
19157: 037F:9910 :

According to connect(2): "If the socket sockfd is of type SOCK_DGRAM, then addr
is the address to which datagrams are sent by default, *and the only address
from which datagrams are received*."

For some reason, my connected sockets are receiving packets that were destined
for each other. eg: The UDP socket connected to A sends a message to A, A then
sends a reply back. The UDP socket connected to B sends a message to B, B then
sends a reply back. But the reply from A arrives at the socket connected to B
and the reply from B arrives at the socket connected to A.

Why on earth would this be happening? Note that it happens randomly - sometimes
the replies arrive at the correct sockets and sometimes they don't. Is there
any way to prevent this or any situation under which connect() is supposed to
not work?
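
In case it helps, here is a minimal C sketch of the connected side of
the setup (ports and addresses taken from the /proc/net/udp snippet
above; error handling omitted, so treat it as illustrative rather than
a polished reproducer):

	#include <arpa/inet.h>
	#include <sys/socket.h>

	static int make_connected(const char *raddr, int rport)
	{
		struct sockaddr_in local = { 0 }, remote = { 0 };
		int one = 1;
		int fd = socket(AF_INET, SOCK_DGRAM, 0);

		local.sin_family = AF_INET;
		local.sin_port = htons(56732);		/* 0xDD9C */
		inet_pton(AF_INET, "127.0.0.1", &local.sin_addr);
		setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
		bind(fd, (struct sockaddr *)&local, sizeof(local));

		remote.sin_family = AF_INET;
		remote.sin_port = htons(rport);
		inet_pton(AF_INET, raddr, &remote.sin_addr);
		connect(fd, (struct sockaddr *)&remote, sizeof(remote));
		return fd;
	}

	int main(void)
	{
		int to_a = make_connected("127.0.0.3", 39184);	/* 0x9910 */
		int to_b = make_connected("127.0.0.2", 36989);	/* 0x907D */
		char buf[64];

		send(to_a, "ping", 4, 0);
		send(to_b, "ping", 4, 0);
		/* Per connect(2), each socket should only ever see its own
		 * peer's reply; the report above is that the replies
		 * sometimes land on the opposite socket. */
		recv(to_a, buf, sizeof(buf), 0);
		recv(to_b, buf, sizeof(buf), 0);
		return 0;
	}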

Any help explaining this would be hugely appreciated :)

 - Andrew



signature.asc
Description: Digital signature


Re: [PATCH net-next 5/9] net: stmmac: Add MDIO related functions for XGMAC2

2018-08-02 Thread Jose Abreu
Hi Andrew,

On 01-08-2018 16:08, Andrew Lunn wrote:
> Hi Jose
>
>> +static int stmmac_xgmac2_mdio_read(struct stmmac_priv *priv, int phyaddr,
>> +   int phyreg)
>> +{
>> +unsigned int mii_address = priv->hw->mii.addr;
>> +unsigned int mii_data = priv->hw->mii.data;
>> +u32 tmp, addr, value = MII_XGMAC_BUSY;
>> +int data;
>> +
>> +if (phyreg & MII_ADDR_C45) {
>> +addr = ((phyreg >> 16) & 0x1f) << 21;
> >> +addr |= (phyaddr << 16) | (phyreg & 0xffff);
> Do you need to tell the hardware this is a C45 transfer? Normally an
> extra bit needs setting somewhere.

The organization of addr reg is the following:
DA [25:21] | PA [20:16] | RA [15:0]

DA is Device Address, PA is Port Address and RA is Register Address.
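
So, condensed, the address word is composed like this (a sketch of the
code quoted above, with the RA mask written out):

	addr  = ((phyreg >> 16) & 0x1f) << 21;	/* DA [25:21] */
	addr |= phyaddr << 16;			/* PA [20:16] */
	addr |= phyreg & 0xffff;		/* RA [15:0]  */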

>
>> +} else {
>> +if (phyaddr >= 4)
>> +return -ENODEV;
> Can the MDIO bus be external? If so, is there a reason why there
> cannot be a PHY at addresses > 4. So maybe there is an Ethernet
> switch, which needs lots of addresses? And C45 can have devices > 4
> but C22 cannot?

Only ports 0 to 3 can be configured as C22 ports, according to the
databook. This is for the MDIO bus through the controller.

>
>> +writel(~0x0, priv->ioaddr + 0x220);
>> +addr = (phyaddr << 16) | (phyreg & 0x1f);
>> +}
>> +
>> +value |= (priv->clk_csr << priv->hw->mii.clk_csr_shift)
>> +& priv->hw->mii.clk_csr_mask;
>> +value |= BIT(18);
> Please add a #define for this bit.

Ok.

>
>> +value |= MII_XGMAC_READ;
>> +
>> +if (readl_poll_timeout(priv->ioaddr + mii_data, tmp,
>> +   !(tmp & MII_XGMAC_BUSY), 100, 1))
>> +return -EBUSY;
>> +
>> +writel(addr, priv->ioaddr + mii_address);
>> +writel(value, priv->ioaddr + mii_data);
>> +
>> +if (readl_poll_timeout(priv->ioaddr + mii_data, tmp,
>> +   !(tmp & MII_XGMAC_BUSY), 100, 1))
>> +return -EBUSY;
>> +
>> +/* Read the data from the MII data register */
>> +data = (int)readl(priv->ioaddr + mii_data) & GENMASK(15, 0);
> Is the cast needed? And why use GENMASK here, but not in all the other
> places you have masks in this code?

The GENMASK is needed; notice how we set more values into
mii_data (clk_csr, BIT(18), cmd), and these are not cleared by XGMAC2
upon completion of the operation ...

>
>>  /**
>>   * stmmac_mdio_read
>>   * @bus: points to the mii_bus structure
>> @@ -59,6 +141,9 @@ static int stmmac_mdio_read(struct mii_bus *bus, int 
>> phyaddr, int phyreg)
>>  int data;
>>  u32 value = MII_BUSY;
>>  
>> +if (priv->plat->has_xgmac)
>> +return stmmac_xgmac2_mdio_read(priv, phyaddr, phyreg);
> It would be cleaner to instead do this in stmmac_mdio_register() when
> setting new_bus->read.

Makes sense! Thanks!

Thanks and Best Regards,
Jose Miguel Abreu

>
>   Andrew



Re: [pull request][net-next 00/10] Mellanox, mlx5 and devlink updates 2018-07-31

2018-08-02 Thread Petr Machata
David Miller  writes:

> From: Jakub Kicinski 
> Date: Wed, 1 Aug 2018 17:00:47 -0700
>
>> On Wed,  1 Aug 2018 14:52:45 -0700, Saeed Mahameed wrote:
>>> - According to the discussion outcome, we are keeping the congestion control
>>>   setting as mlx5 device specific for the current HW generation.
>> 
>> I still see queuing and marking based on queue level.  You want to add 
>> a Qdisc that will mirror your HW's behaviour to offload, if you really
>> believe this is not a subset of RED, why not...  But devlink params?
>
> I totally agree, devlink seems like absolutely to wrong level and set
> of interfaces to be doing this stuff.
>
> I will not pull these changes in and I probably should have not
> accepted the DCB changes from the other day and they were sneakily
> leading up to this crap.

Are you talking about the recent additions of DCB helpers
dcb_ieee_getapp_prio_dscp_mask_map() etc.?

If yes, I can assure there were no sneaky intentions at all. I'm at a
loss to understand the relation to mlx5 team's decision to use devlink
for congestion control configuration.

Could you please clarify your remark?

Thanks,
Petr


Re: [PATCH net-next 7/9] net: stmmac: Integrate XGMAC into main driver flow

2018-08-02 Thread Jose Abreu
Hi Andrew,

Thanks for the review!

On 01-08-2018 16:23, Andrew Lunn wrote:
>> @@ -842,6 +863,12 @@ static void stmmac_adjust_link(struct net_device *dev)
>>  new_state = true;
>>  ctrl &= ~priv->hw->link.speed_mask;
>>  switch (phydev->speed) {
>> +case SPEED_10000:
>> +ctrl |= priv->hw->link.speed10000;
>> +break;
>> +case SPEED_2500:
>> +ctrl |= priv->hw->link.speed2500;
>> +break;
>>  case SPEED_1000:
>>  ctrl |= priv->hw->link.speed1000;
>>  break;
> Hi Jose
>
> What PHY did you test this with?

We had some shipping issues with the 10G phy so right now I'm
using a 1G phy ... I would expect that as MDIO is used in both
phys then phylib would take care of everything as long as I
specify in the DT the right interface (SGMII) ... Am I making a
wrong assumption?

>
> 10G phys change the interface mode when the speed change. In general,
> 10/100/1000G copper uses SGMII. A 1G SFP optical module generally
> wants 1000Base-X. 2.5G wants 2500Base-X, 10G copper wants 10GKR, etc.
>
> So your adjust link callback needs to look at phydev->interface and
> reconfigure the MAC as requested.

Sorry, I'm not a phy expert, but as long as I use MDIO shouldn't
this be transparent to the MAC? I mean, there are no registers for
selecting the interface in XGMAC2; there is only this speed
selection register, which is already handled in
stmmac_adjust_link.

>
> You might also want to consider moving from phylib to phylink. It has
> a better interface for things like this, and makes support for SFP
> interfaces much easier. A MAC which supports 10G is likely to be used
> with SFPs...

Ok, I will take a look into it.

Thanks and Best Regards,
Jose Miguel Abreu

>
>  Andrew



[PATCH net-next v2 3/3] qed: Add Multi-TC RoCE support

2018-08-02 Thread Denis Bolotin
RoCE qps use a pair of physical queues (pq) received from the Queue Manager
(QM) - an offload queue (OFLD) and a low latency queue (LLT). The QM block
creates a pq for each TC, and allows RoCE qps to ask for a pq with a
specific TC. As a result, qps with different VLAN priorities can be mapped
to different TCs, and employ features such as PFC and ETS.
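
A sketch of how a qp with a given TC obtains its pair of pqs through
the helpers added in this patch:

	u16 ofld_pq = qed_get_cm_pq_idx_ofld_mtc(p_hwfn, tc);
	u16 llt_pq  = qed_get_cm_pq_idx_llt_mtc(p_hwfn, tc);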

Signed-off-by: Denis Bolotin 
Signed-off-by: Ariel Elior 
---
 drivers/net/ethernet/qlogic/qed/qed.h  |  10 ++-
 drivers/net/ethernet/qlogic/qed/qed_dev.c  | 104 +
 drivers/net/ethernet/qlogic/qed/qed_roce.c |  61 -
 3 files changed, 143 insertions(+), 32 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed.h 
b/drivers/net/ethernet/qlogic/qed/qed.h
index f916f13..a60e1c8 100644
--- a/drivers/net/ethernet/qlogic/qed/qed.h
+++ b/drivers/net/ethernet/qlogic/qed/qed.h
@@ -338,6 +338,9 @@ struct qed_hw_info {
u8  offload_tc;
booloffload_tc_set;
 
+   boolmulti_tc_roce_en;
+#define IS_QED_MULTI_TC_ROCE(p_hwfn) (((p_hwfn)->hw_info.multi_tc_roce_en))
+
u32 concrete_fid;
u16 opaque_fid;
u16 ovlan;
@@ -400,8 +403,8 @@ struct qed_qm_info {
u16 start_pq;
u8  start_vport;
u16  pure_lb_pq;
-   u16 offload_pq;
-   u16 low_latency_pq;
+   u16 first_ofld_pq;
+   u16 first_llt_pq;
u16 pure_ack_pq;
u16 ooo_pq;
u16 first_vf_pq;
@@ -882,11 +885,14 @@ void qed_set_fw_mac_addr(__le16 *fw_msb,
 #define PQ_FLAGS_OFLD   (BIT(5))
 #define PQ_FLAGS_VFS(BIT(6))
 #define PQ_FLAGS_LLT(BIT(7))
+#define PQ_FLAGS_MTC(BIT(8))
 
 /* physical queue index for cm context intialization */
 u16 qed_get_cm_pq_idx(struct qed_hwfn *p_hwfn, u32 pq_flags);
 u16 qed_get_cm_pq_idx_mcos(struct qed_hwfn *p_hwfn, u8 tc);
 u16 qed_get_cm_pq_idx_vf(struct qed_hwfn *p_hwfn, u16 vf);
+u16 qed_get_cm_pq_idx_ofld_mtc(struct qed_hwfn *p_hwfn, u8 tc);
+u16 qed_get_cm_pq_idx_llt_mtc(struct qed_hwfn *p_hwfn, u8 tc);
 
 #define QED_LEADING_HWFN(dev)   (>hwfns[0])
 
diff --git a/drivers/net/ethernet/qlogic/qed/qed_dev.c 
b/drivers/net/ethernet/qlogic/qed/qed_dev.c
index a8e7683..04f18bc 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_dev.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_dev.c
@@ -215,6 +215,8 @@ static u32 qed_get_pq_flags(struct qed_hwfn *p_hwfn)
break;
case QED_PCI_ETH_ROCE:
flags |= PQ_FLAGS_MCOS | PQ_FLAGS_OFLD | PQ_FLAGS_LLT;
+   if (IS_QED_MULTI_TC_ROCE(p_hwfn))
+   flags |= PQ_FLAGS_MTC;
break;
case QED_PCI_ETH_IWARP:
flags |= PQ_FLAGS_MCOS | PQ_FLAGS_ACK | PQ_FLAGS_OOO |
@@ -241,6 +243,16 @@ static u16 qed_init_qm_get_num_vfs(struct qed_hwfn *p_hwfn)
   p_hwfn->cdev->p_iov_info->total_vfs : 0;
 }
 
+static u8 qed_init_qm_get_num_mtc_tcs(struct qed_hwfn *p_hwfn)
+{
+   u32 pq_flags = qed_get_pq_flags(p_hwfn);
+
+   if (!(PQ_FLAGS_MTC & pq_flags))
+   return 1;
+
+   return qed_init_qm_get_num_tcs(p_hwfn);
+}
+
 #define NUM_DEFAULT_RLS 1
 
 static u16 qed_init_qm_get_num_pf_rls(struct qed_hwfn *p_hwfn)
@@ -282,8 +294,11 @@ static u16 qed_init_qm_get_num_pqs(struct qed_hwfn *p_hwfn)
   (!!(PQ_FLAGS_MCOS & pq_flags)) *
   qed_init_qm_get_num_tcs(p_hwfn) +
   (!!(PQ_FLAGS_LB & pq_flags)) + (!!(PQ_FLAGS_OOO & pq_flags)) +
-  (!!(PQ_FLAGS_ACK & pq_flags)) + (!!(PQ_FLAGS_OFLD & pq_flags)) +
-  (!!(PQ_FLAGS_LLT & pq_flags)) +
+  (!!(PQ_FLAGS_ACK & pq_flags)) +
+  (!!(PQ_FLAGS_OFLD & pq_flags)) *
+  qed_init_qm_get_num_mtc_tcs(p_hwfn) +
+  (!!(PQ_FLAGS_LLT & pq_flags)) *
+  qed_init_qm_get_num_mtc_tcs(p_hwfn) +
   (!!(PQ_FLAGS_VFS & pq_flags)) * qed_init_qm_get_num_vfs(p_hwfn);
 }
 
@@ -474,9 +489,9 @@ static u16 *qed_init_qm_get_idx_from_flags(struct qed_hwfn 
*p_hwfn,
case PQ_FLAGS_ACK:
return &qm_info->pure_ack_pq;
case PQ_FLAGS_OFLD:
-   return &qm_info->offload_pq;
+   return &qm_info->first_ofld_pq;
case PQ_FLAGS_LLT:
-   return &qm_info->low_latency_pq;
+   return &qm_info->first_llt_pq;
case PQ_FLAGS_VFS:
return &qm_info->first_vf_pq;
default:
@@ -525,6 +540,43 @@ u16 qed_get_cm_pq_idx_vf(struct qed_hwfn *p_hwfn, u16 vf)
return qed_get_cm_pq_idx(p_hwfn, PQ_FLAGS_VFS) + vf;
 }
 
+static u16 

Re: [Query]: DSA Understanding

2018-08-02 Thread Lad, Prabhakar
Hi Andrew,

Thank for your reply.

On Thu, Jul 26, 2018 at 4:39 PM, Andrew Lunn  wrote:
>> I am bit confused on how dsa needs to be actually working,
>> Q's
>> 1] should I be running a dhcp server on eth1 (where switch is connected)
>> so that devices connected on lan* devices get an ip ?
>
> Nope. You need eth1 up, but otherwise you do not use it. Use the lanX
> interfaces like normal Linux interfaces. Run your dhclient on lanX, etc.
>>
>> 2] From the device where switch is connected if the cpu port wants to send
>>any data to any other user ports lan* how do i do it (just open
>> socket on eth1 or lan*) ?
>
> Just treat the lanX interfaces as normal Linux interfaces.
>
I have some more queries on DSA.

I have managed to get the TI's cpsw slave1 connected to the ksz9897
Ethernet switch chip partially working.

I have a PC connected to lan4 (ip = 169.254.126.126) and the PC ip is
169.254.78.251, but when I ping from the PC to lan4 I get Destination
Host Unreachable. However, I can see in the tcpdump log for lan4 that
it does reply back, but the reply doesn't reach the PC. Is there
something I am missing here?

Log from the device on which switch is present:
===

~$ ifconfig
eth0  Link encap:Ethernet  HWaddr C4:F3:12:08:FE:7E
  UP BROADCAST MULTICAST  MTU:1500  Metric:1
  RX packets:0 errors:0 dropped:0 overruns:0 frame:0
  TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
  Interrupt:102

eth1  Link encap:Ethernet  HWaddr C4:F3:12:08:FE:7F
  inet6 addr: fe80::c6f3:12ff:fe08:fe7f%lo/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:436 errors:0 dropped:0 overruns:0 frame:0
  TX packets:516 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:37254 (36.3 KiB)  TX bytes:51585 (50.3 KiB)

lan4  Link encap:Ethernet  HWaddr C4:F3:12:08:FE:7F
  inet addr:169.254.126.126  Bcast:169.254.255.255  Mask:255.255.0.0
  inet6 addr: fe80::c6f3:12ff:fe08:fe7f%lo/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:436 errors:0 dropped:0 overruns:0 frame:0
  TX packets:444 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:28970 (28.2 KiB)  TX bytes:35214 (34.3 KiB)

loLink encap:Local Loopback
  inet addr:127.0.0.1  Mask:255.0.0.0
  inet6 addr: ::1%1/128 Scope:Host
  UP LOOPBACK RUNNING  MTU:65536  Metric:1
  RX packets:67 errors:0 dropped:0 overruns:0 frame:0
  TX packets:67 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:6618 (6.4 KiB)  TX bytes:6618 (6.4 KiB)

~$ tcpdump -i lan4 -v
[  661.057166] device lan4 entered promiscuous mode
[  661.061814] device eth1 entered promiscuous mode
tcpdump: listening on lan4, link-type EN10MB (Ethernet), capture size
262144 bytes
07:40:20.255355 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has
VB4-SN tell tango-charlie.local, length 46
07:40:20.255393 ARP, Ethernet (len 6), IPv4 (len 4), Reply
VB4-SN is-at c4:f3:12:08:fe:7f (oui Unknown), length 28
07:40:20.360936 IP6 (flowlabel 0x970aa, hlim 255, next-header UDP (17)
payload length: 53) VB4-SN.mdns > ff02::fb.mdns: [udp sum o)
07:40:20.361085 IP (tos 0x0, ttl 255, id 17259, offset 0, flags [DF],
proto UDP (17), length 73)
VB4-SN.mdns > 224.0.0.251.mdns: 0 PTR (QM)?
251.78.254.169.in-addr.arpa. (45)
07:40:20.361848 IP (tos 0x0, ttl 255, id 1808, offset 0, flags [DF],
proto UDP (17), length 100)
tango-charlie.local.mdns > 224.0.0.251.mdns: 0*- [0q] 1/0/0
251.78.254.169.in-addr.arpa. (Cache flush) PTR tango-charlie.local.
(72)
07:40:20.465933 IP6 (flowlabel 0x970aa, hlim 255, next-header UDP (17)
payload length: 98) VB4-SN.mdns > ff02::fb.mdns: [udp sum o)
07:40:20.466124 IP (tos 0x0, ttl 255, id 17288, offset 0, flags [DF],
proto UDP (17), length 118)
VB4-SN.mdns > 224.0.0.251.mdns: 0 PTR (QM)?
b.f.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.2.0.f.f.ip6.arpa.
(90)
07:40:21.254161 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has
VB4-SN tell tango-charlie.local, length 46
07:40:21.254181 ARP, Ethernet (len 6), IPv4 (len 4), Reply
VB4-SN is-at c4:f3:12:08:fe:7f (oui Unknown), length 28
07:40:25.470288 IP6 (flowlabel 0x970aa, hlim 255, next-header UDP (17)
payload length: 50) VB4-SN.mdns > ff02::fb.mdns: [udp sum o)
07:40:31.301929 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has
VB4-SN tell tango-charlie.local, length 46
07:40:31.301957 ARP, Ethernet (len 6), IPv4 (len 4), Reply
VB4-SN is-at c4:f3:12:08:fe:7f (oui Unknown), length 28
07:40:32.319104 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has
VB4-SN tell tango-charlie.local, length 

[PATCH net-next v2 2/3] qed: Add a flag which indicates if offload TC is set

2018-08-02 Thread Denis Bolotin
Distinguish an unset offload_tc from offload_tc 0, and add getters and
setters.

Signed-off-by: Denis Bolotin 
Signed-off-by: Ariel Elior 
---
 drivers/net/ethernet/qlogic/qed/qed.h  |  3 +++
 drivers/net/ethernet/qlogic/qed/qed_dcbx.c |  2 +-
 drivers/net/ethernet/qlogic/qed/qed_dev.c  | 32 +-
 drivers/net/ethernet/qlogic/qed/qed_mcp.c  |  3 ++-
 4 files changed, 33 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed.h 
b/drivers/net/ethernet/qlogic/qed/qed.h
index 1dfaccd..f916f13 100644
--- a/drivers/net/ethernet/qlogic/qed/qed.h
+++ b/drivers/net/ethernet/qlogic/qed/qed.h
@@ -336,6 +336,7 @@ struct qed_hw_info {
 */
u8 num_active_tc;
u8  offload_tc;
+   bool offload_tc_set;
 
u32 concrete_fid;
u16 opaque_fid;
@@ -921,4 +922,6 @@ void qed_get_protocol_stats(struct qed_dev *cdev,
 int qed_mfw_fill_tlv_data(struct qed_hwfn *hwfn,
  enum qed_mfw_tlv_type type,
  union qed_mfw_tlv_data *tlv_data);
+
+void qed_hw_info_set_offload_tc(struct qed_hw_info *p_info, u8 tc);
 #endif /* _QED_H */
diff --git a/drivers/net/ethernet/qlogic/qed/qed_dcbx.c 
b/drivers/net/ethernet/qlogic/qed/qed_dcbx.c
index 53c7be8..2c1c3c1 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_dcbx.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_dcbx.c
@@ -208,7 +208,7 @@ static bool qed_dcbx_roce_v2_tlv(u32 app_info_bitmap, u16 
proto_id, bool ieee)
 
/* QM reconf data */
if (p_info->personality == personality)
-   p_info->offload_tc = tc;
+   qed_hw_info_set_offload_tc(p_info, tc);
 }
 
 /* Update app protocol data and hw_info fields with the TLV info */
diff --git a/drivers/net/ethernet/qlogic/qed/qed_dev.c 
b/drivers/net/ethernet/qlogic/qed/qed_dev.c
index 6a0b46f..a8e7683 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_dev.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_dev.c
@@ -394,7 +394,25 @@ static void qed_init_qm_advance_vport(struct qed_hwfn 
*p_hwfn)
 /* defines for pq init */
 #define PQ_INIT_DEFAULT_WRR_GROUP   1
 #define PQ_INIT_DEFAULT_TC  0
-#define PQ_INIT_OFLD_TC (p_hwfn->hw_info.offload_tc)
+
+void qed_hw_info_set_offload_tc(struct qed_hw_info *p_info, u8 tc)
+{
+   p_info->offload_tc = tc;
+   p_info->offload_tc_set = true;
+}
+
+static bool qed_is_offload_tc_set(struct qed_hwfn *p_hwfn)
+{
+   return p_hwfn->hw_info.offload_tc_set;
+}
+
+static u32 qed_get_offload_tc(struct qed_hwfn *p_hwfn)
+{
+   if (qed_is_offload_tc_set(p_hwfn))
+   return p_hwfn->hw_info.offload_tc;
+
+   return PQ_INIT_DEFAULT_TC;
+}
 
 static void qed_init_qm_pq(struct qed_hwfn *p_hwfn,
   struct qed_qm_info *qm_info,
@@ -538,7 +556,8 @@ static void qed_init_qm_pure_ack_pq(struct qed_hwfn *p_hwfn)
return;
 
qed_init_qm_set_idx(p_hwfn, PQ_FLAGS_ACK, qm_info->num_pqs);
-   qed_init_qm_pq(p_hwfn, qm_info, PQ_INIT_OFLD_TC, PQ_INIT_SHARE_VPORT);
+   qed_init_qm_pq(p_hwfn, qm_info, qed_get_offload_tc(p_hwfn),
+  PQ_INIT_SHARE_VPORT);
 }
 
 static void qed_init_qm_offload_pq(struct qed_hwfn *p_hwfn)
@@ -549,7 +568,8 @@ static void qed_init_qm_offload_pq(struct qed_hwfn *p_hwfn)
return;
 
qed_init_qm_set_idx(p_hwfn, PQ_FLAGS_OFLD, qm_info->num_pqs);
-   qed_init_qm_pq(p_hwfn, qm_info, PQ_INIT_OFLD_TC, PQ_INIT_SHARE_VPORT);
+   qed_init_qm_pq(p_hwfn, qm_info, qed_get_offload_tc(p_hwfn),
+  PQ_INIT_SHARE_VPORT);
 }
 
 static void qed_init_qm_low_latency_pq(struct qed_hwfn *p_hwfn)
@@ -560,7 +580,8 @@ static void qed_init_qm_low_latency_pq(struct qed_hwfn 
*p_hwfn)
return;
 
qed_init_qm_set_idx(p_hwfn, PQ_FLAGS_LLT, qm_info->num_pqs);
-   qed_init_qm_pq(p_hwfn, qm_info, PQ_INIT_OFLD_TC, PQ_INIT_SHARE_VPORT);
+   qed_init_qm_pq(p_hwfn, qm_info, qed_get_offload_tc(p_hwfn),
+  PQ_INIT_SHARE_VPORT);
 }
 
 static void qed_init_qm_mcos_pqs(struct qed_hwfn *p_hwfn)
@@ -601,7 +622,8 @@ static void qed_init_qm_rl_pqs(struct qed_hwfn *p_hwfn)
 
qed_init_qm_set_idx(p_hwfn, PQ_FLAGS_RLS, qm_info->num_pqs);
for (pf_rls_idx = 0; pf_rls_idx < num_pf_rls; pf_rls_idx++)
-   qed_init_qm_pq(p_hwfn, qm_info, PQ_INIT_OFLD_TC, PQ_INIT_PF_RL);
+   qed_init_qm_pq(p_hwfn, qm_info, qed_get_offload_tc(p_hwfn),
+  PQ_INIT_PF_RL);
 }
 
 static void qed_init_qm_pq_params(struct qed_hwfn *p_hwfn)
diff --git a/drivers/net/ethernet/qlogic/qed/qed_mcp.c 
b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
index 8e4f60e..d89a0e2 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_mcp.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
@@ -1552,7 +1552,8 @@ void qed_mcp_read_ufp_config(struct qed_hwfn 
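
For readers outside the qed tree, here is a minimal standalone sketch of
the pattern this patch introduces: a bare u8 cannot distinguish "offload
TC is 0" from "offload TC was never configured", so the value is paired
with a validity flag and written only through a setter. This is
illustrative userspace C with hypothetical names, not code from the
driver:

	#include <stdbool.h>
	#include <stdio.h>

	#define DEFAULT_TC 0

	struct tc_info {
		unsigned char tc;	/* configured traffic class */
		bool tc_set;		/* true once tc has been set */
	};

	static void set_offload_tc(struct tc_info *info, unsigned char tc)
	{
		info->tc = tc;
		info->tc_set = true;
	}

	static unsigned char get_offload_tc(const struct tc_info *info)
	{
		/* Fall back to the default TC when nothing was configured. */
		return info->tc_set ? info->tc : DEFAULT_TC;
	}

	int main(void)
	{
		struct tc_info info = { 0 };

		/* Unset: the getter falls back to DEFAULT_TC. */
		printf("unset: %u\n", (unsigned)get_offload_tc(&info));

		/* An explicitly configured TC 0 is now distinguishable. */
		set_offload_tc(&info, 0);
		printf("set:   %u (tc_set=%d)\n",
		       (unsigned)get_offload_tc(&info), (int)info.tc_set);
		return 0;
	}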

[PATCH net-next v2 1/3] qed: Add DCBX API - qed_dcbx_get_priority_tc()

2018-08-02 Thread Denis Bolotin
The API receives a priority and looks for the TC it is mapped to in the
operational DCBX configuration.

Signed-off-by: Denis Bolotin 
Signed-off-by: Ariel Elior 
---
 drivers/net/ethernet/qlogic/qed/qed_dcbx.c | 19 +++
 drivers/net/ethernet/qlogic/qed/qed_dcbx.h |  1 +
 2 files changed, 20 insertions(+)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_dcbx.c 
b/drivers/net/ethernet/qlogic/qed/qed_dcbx.c
index d02e774..53c7be8 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_dcbx.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_dcbx.c
@@ -989,6 +989,25 @@ void qed_dcbx_set_pf_update_params(struct qed_dcbx_results 
*p_src,
qed_dcbx_update_protocol_data(p_dcb_data, p_src, DCBX_PROTOCOL_ETH);
 }
 
+int qed_dcbx_get_priority_tc(struct qed_hwfn *p_hwfn, u8 pri, u8 *p_tc)
+{
+   struct qed_dcbx_get *dcbx_info = &p_hwfn->p_dcbx_info->get;
+
+   if (pri >= QED_MAX_PFC_PRIORITIES) {
+   DP_ERR(p_hwfn, "Invalid priority %d\n", pri);
+   return -EINVAL;
+   }
+
+   if (!dcbx_info->operational.valid) {
+   DP_ERR(p_hwfn, "Dcbx parameters not available\n");
+   return -EINVAL;
+   }
+
+   *p_tc = dcbx_info->operational.params.ets_pri_tc_tbl[pri];
+
+   return 0;
+}
+
 #ifdef CONFIG_DCB
 static int qed_dcbx_query_params(struct qed_hwfn *p_hwfn,
 struct qed_dcbx_get *p_get,
diff --git a/drivers/net/ethernet/qlogic/qed/qed_dcbx.h 
b/drivers/net/ethernet/qlogic/qed/qed_dcbx.h
index 5feb90e..324244b 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_dcbx.h
+++ b/drivers/net/ethernet/qlogic/qed/qed_dcbx.h
@@ -123,4 +123,5 @@ int qed_dcbx_config_params(struct qed_hwfn *,
 void qed_dcbx_set_pf_update_params(struct qed_dcbx_results *p_src,
   struct pf_update_ramrod_data *p_dest);
 
+int qed_dcbx_get_priority_tc(struct qed_hwfn *p_hwfn, u8 pri, u8 *p_tc);
 #endif
-- 
1.8.3.1
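
A sketch of how a caller might consume the new API, based only on the
signature and return convention shown above (the caller code itself is
hypothetical and not part of this series):

	u8 tc;

	if (qed_dcbx_get_priority_tc(p_hwfn, pri, &tc)) {
		/* Invalid priority or no operational DCBX data yet;
		 * fall back to a default rather than using stale state.
		 */
		tc = 0;
	}
	/* tc now holds the TC that priority 'pri' maps to. */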



[PATCH net-next] rxrpc: Remove set but not used variable 'nowj'

2018-08-02 Thread David Howells
From: Wei Yongjun 

Fixes gcc '-Wunused-but-set-variable' warning:

net/rxrpc/proc.c: In function 'rxrpc_call_seq_show':
net/rxrpc/proc.c:66:29: warning:
 variable 'nowj' set but not used [-Wunused-but-set-variable]
  unsigned long timeout = 0, nowj;
 ^

Signed-off-by: Wei Yongjun 
Signed-off-by: David Howells 
---

 net/rxrpc/proc.c |3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/rxrpc/proc.c b/net/rxrpc/proc.c
index 163d05df339d..9805e3b85c36 100644
--- a/net/rxrpc/proc.c
+++ b/net/rxrpc/proc.c
@@ -63,7 +63,7 @@ static int rxrpc_call_seq_show(struct seq_file *seq, void *v)
struct rxrpc_peer *peer;
struct rxrpc_call *call;
struct rxrpc_net *rxnet = rxrpc_net(seq_file_net(seq));
-   unsigned long timeout = 0, nowj;
+   unsigned long timeout = 0;
rxrpc_seq_t tx_hard_ack, rx_hard_ack;
char lbuff[50], rbuff[50];
 
@@ -97,7 +97,6 @@ static int rxrpc_call_seq_show(struct seq_file *seq, void *v)
 
if (call->state != RXRPC_CALL_SERVER_PREALLOC) {
timeout = READ_ONCE(call->expect_rx_by);
-   nowj = jiffies;
timeout -= jiffies;
}
 



[PATCH net-next v2 0/3] qed: Add Multi-TC RoCE support

2018-08-02 Thread Denis Bolotin
Hi Dave,
This patch series adds support for multiple concurrent traffic classes for RoCE.
The first two patches enable the required parts of the driver to learn the TC
configuration, and the last one makes use of it to enable the feature.
Please consider applying this to net-next.

V1->V2:
---
Avoid allocation in qed_dcbx_get_priority_tc().
Move qed_dcbx_get_priority_tc() out of the CONFIG_DCB section since it no
longer calls qed_dcbx_query_params().

Denis Bolotin (3):
  qed: Add DCBX API - qed_dcbx_get_priority_tc()
  qed: Add a flag which indicates if offload TC is set
  qed: Add Multi-TC RoCE support

 drivers/net/ethernet/qlogic/qed/qed.h  |  13 ++-
 drivers/net/ethernet/qlogic/qed/qed_dcbx.c |  21 -
 drivers/net/ethernet/qlogic/qed/qed_dcbx.h |   1 +
 drivers/net/ethernet/qlogic/qed/qed_dev.c  | 128 +
 drivers/net/ethernet/qlogic/qed/qed_mcp.c  |   3 +-
 drivers/net/ethernet/qlogic/qed/qed_roce.c |  61 ++
 6 files changed, 192 insertions(+), 35 deletions(-)

-- 
1.8.3.1



Re: [pull request][net-next 00/10] Mellanox, mlx5 and devlink updates 2018-07-31

2018-08-02 Thread Jiri Pirko
Thu, Aug 02, 2018 at 01:13:20AM CEST, sae...@dev.mellanox.co.il wrote:

[...]

>>
>> So after looking over the patch set the one thing I would ask for in
>> this is some sort of documentation at a minimum. As a user I don't see
>> how you can expect someone to be able to use this when the names of
>> things are pretty cryptic and there is no real explanation anywhere if
>> you don't go through and read the patch description itself. When you
>> start adding driver-specific interfaces, you should at least start
>> adding vendor-specific documentation.
>>
>
>Sure, sounds like a great idea, something like:
>Documentation/networking/mlx5.txt and have a devlink section?
>or have a generic devlink doc and a mlx5 section in it?

I think that Documentation/networking/devlink/mlx5.txt would be good.
We can have generic param description in:
Documentation/networking/devlink/generic.txt
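
Something like the following layout, sketched here only to illustrate the
proposal (the generic/driver-specific split is the one suggested above):

	Documentation/networking/devlink/
	    generic.txt  - generic devlink param descriptions
	    mlx5.txt     - mlx5-specific params and their meaning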