Re: [PATCH net 0/2] fixes for XPS configuration after Symmetric queue selection

2018-11-29 Thread David Miller
From: Sabrina Dubroca 
Date: Thu, 29 Nov 2018 14:14:47 +0100

> This fixes some bugs introduced by the "Symmetric queue selection
> using XPS for Rx queues".
> 
> First, the refactoring of the cleanup function skipped resetting the
> queue's NUMA node under some conditions.
> 
> Second, the accounting on static keys for XPS and RXQS-XPS is
> unbalanced, so the static key for XPS won't actually disable itself,
> once enabled. The RXQS-XPS static key can actually be disabled by
> reconfiguring a device that didn't have RXQS-XPS configured at all.

Series applied and queued up for -stable, thanks.


Re: [PATCH] net: phy: sfp: correct location of SFP standards

2018-11-29 Thread Andrew Lunn
On Thu, Nov 29, 2018 at 12:00:05PM +0200, Baruch Siach wrote:
> SFP standards are now available from the SNIA (Storage Networking
> Industry Association) website.
> 
> Cc: Andrew Lunn 
> Cc: Florian Fainelli 
> Signed-off-by: Baruch Siach 

Reviewed-by: Andrew Lunn 

Andrew


Re: [PATCH bpf-next] tools/bpf: make libbpf _GNU_SOURCE friendly

2018-11-29 Thread Jakub Kicinski
On Thu, 29 Nov 2018 12:38:03 -0800, Yonghong Song wrote:
> diff --git a/tools/lib/bpf/libbpf_errno.c b/tools/lib/bpf/libbpf_errno.c
> index d83b17f8435c..286e497c50ec 100644
> --- a/tools/lib/bpf/libbpf_errno.c
> +++ b/tools/lib/bpf/libbpf_errno.c
> @@ -40,9 +40,19 @@ int libbpf_strerror(int err, char *buf, size_t size)
>   err = err > 0 ? err : -err;
>  
>   if (err < __LIBBPF_ERRNO__START) {
> +#ifdef _GNU_SOURCE
> + const char *ret_buf;
> +#endif
>   int ret;
>  
> +#ifdef _GNU_SOURCE
> + ret_buf = strerror_r(err, buf, size);
> + if (ret_buf != buf)
> + snprintf(buf, size, "%s", ret_buf);
> + ret = 0;
> +#else
>   ret = strerror_r(err, buf, size);
> +#endif
>   buf[size - 1] = '\0';
>   return ret;
>   }

That is kinda strange; the whole point of this file was to have the
non-GNU strerror_r. Would doing #undef _GNU_SOURCE at the top not work?
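
For reference, the two strerror_r() flavors differ only in return type,
which is what the assignment warning is about; a summary, not part of the
patch:

	/* XSI/POSIX version: fills buf, returns 0 or an error number */
	int strerror_r(int errnum, char *buf, size_t buflen);

	/* GNU version (with _GNU_SOURCE): may return a pointer to a static
	 * string and leave buf untouched
	 */
	char *strerror_r(int errnum, char *buf, size_t buflen);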


Re: [PATCH net-next] dpaa2-eth: Add "fall through" comments

2018-11-29 Thread David Miller
From: Ioana Ciocoi Radulescu 
Date: Thu, 29 Nov 2018 08:43:40 +

> Add comments in the switch statement for XDP action to indicate
> fallthrough is intended.
> 
> Signed-off-by: Ioana Radulescu 

Applied.


Re: [PATCH net-next] qede - Add a statistic for a case where driver drops tx packet due to memory allocation failure.

2018-11-29 Thread David Miller
From: Michael Shteinbok 
Date: Thu, 29 Nov 2018 10:50:19 +0200

> skb linearization can fail due to a memory allocation failure.
> In such a case, the driver drops the packet and used to print
> an error message. This patch replaces that error message with
> a dedicated statistic.
> 
> Signed-off-by: Michael Shteinbok 
> Signed-off-by: Ariel Elior 

Applied.


[PATCH net-next 0/2] Add mii_bus to ixgbe driver for dsa devs

2018-11-29 Thread Steve Douthit
Most dsa devices currently expect to get a pointer to a mii_bus from
platform data or device tree information.

The first patch exposes a mii_bus for ixgbe devices.

Currently the ixgbe driver only allows access to the probed PHY
address via ioctls. The second patch changes that behavior to allow
access to all addresses.  This should be useful for doing switch peeks &
pokes from userspace for development, test, or if there's no dsa driver
for a particular switch yet.
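
For illustration, a minimal userspace sketch of such a peek via the
standard MII ioctls (hypothetical helper, not part of this series; 'fd'
is an AF_INET/SOCK_DGRAM socket):

	#include <string.h>
	#include <sys/ioctl.h>
	#include <net/if.h>
	#include <linux/mii.h>
	#include <linux/sockios.h>

	static int mii_peek(int fd, const char *ifname, int phy_addr, int reg)
	{
		struct ifreq ifr;
		struct mii_ioctl_data *mii = (struct mii_ioctl_data *)&ifr.ifr_data;

		memset(&ifr, 0, sizeof(ifr));
		strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
		mii->phy_id = phy_addr;	/* any bus address once patch 2/2 is applied */
		mii->reg_num = reg;
		if (ioctl(fd, SIOCGMIIREG, &ifr) < 0)
			return -1;
		return mii->val_out;	/* 16-bit register value */
	}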

I've tested the clause 22 portion of this code on a board that has an
Intel C3xxx series SoC connected to a Marvell Peridot switch.  I don't
have any clause 45 devices or other ixgbe devices to test with.

Stephen Douthit (2):
  ixgbe: register a mdiobus
  ixgbe: use mii_bus to handle MII related ioctls

 drivers/net/ethernet/intel/ixgbe/ixgbe.h  |   2 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  35 ++--
 drivers/net/ethernet/intel/ixgbe/ixgbe_phy.c  | 192 ++
 drivers/net/ethernet/intel/ixgbe/ixgbe_phy.h  |   2 +
 4 files changed, 216 insertions(+), 15 deletions(-)

-- 
2.17.2



Re: [PATCH net-next 0/2] Add mii_bus to ixgbe driver for dsa devs

2018-11-29 Thread Jeff Kirsher
On Thu, 2018-11-29 at 18:54 +, Steve Douthit wrote:
> Most dsa devices currently expect to get a pointer to a mii_bus from
> platform data or device tree information.
> 
> The first patch exposes a mii_bus for ixgbe devices.
> 
> Currently the ixgbe driver only allows access to the probed PHY
> address via ioctls. The second patch changes that behavior to allow
> access to all addresses.  This should be useful for doing switch peeks &
> pokes from userspace for development, test, or if there's no dsa driver
> for a particular switch yet.
> 
> I've tested the clause 22 portion of this code on a board that has an
> Intel C3xxx series SoC connected to a Marvell Peridot switch.  I don't
> have any clause 45 devices or other ixgbe devices to test with.
> 
> Stephen Douthit (2):
>   ixgbe: register a mdiobus
>   ixgbe: use mii_bus to handle MII related ioctls
> 
>  drivers/net/ethernet/intel/ixgbe/ixgbe.h  |   2 +
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  35 ++--
>  drivers/net/ethernet/intel/ixgbe/ixgbe_phy.c  | 192 ++
>  drivers/net/ethernet/intel/ixgbe/ixgbe_phy.h  |   2 +
>  4 files changed, 216 insertions(+), 15 deletions(-)
> 

I will add these changes to my queue. Please remember to also send patches
against the drivers in drivers/net/ethernet/intel/* to the
intel-wired-...@lists.osuosl.org mailing list, so that our patchwork
project can track the patches.




Re: [PATCH net-next] tcp: remove loop to compute wscale

2018-11-29 Thread David Miller
From: Eric Dumazet 
Date: Thu, 29 Nov 2018 07:56:20 -0800

> We can remove the loop and conditional branches
> and compute wscale efficiently thanks to ilog2()
> 
> Signed-off-by: Eric Dumazet 

Applied.
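
For context, the idea behind the change is roughly the following sketch
(hypothetical helper; it assumes 'space' has already been clamped to the
window clamp, and it is not the exact hunk from the patch):

	#include <linux/kernel.h>
	#include <linux/log2.h>

	/* the old code halved space in a loop, bumping wscale while space
	 * exceeded 64K; ilog2() yields the same scale factor in constant time
	 */
	static inline u8 rcv_wscale_for(u32 space)
	{
		return clamp_t(int, ilog2(space) - 15, 0, 14);	/* 14 == TCP_MAX_WSCALE */
	}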


Re: [RFC PATCH] net: mvpp2: fix detection of 10G SFP modules

2018-11-29 Thread Florian Fainelli



On 11/29/2018 4:49 AM, Baruch Siach wrote:
> The mvpp2_phylink_validate() relies on the interface field of
> phylink_link_state to determine valid link modes. However, when called
> from phylink_sfp_module_insert() this field is not initialized. The
> default switch case then excludes 10G link modes. As a result, 10G SFP
> modules that are detected correctly end up limited to a maximum rate of
> 2.5G.
> 
> Catch the uninitialized PHY mode case, and allow 10G rates.
> 
> Cc: Maxime Chevallier 
> Cc: Antoine Tenart 
> Signed-off-by: Baruch Siach 
> ---
> Is that the right fix?

It would be a bit surprising if this were the right fix; you would
expect validate to be called once everything has been parsed
successfully from the SFP. Is that not the case here? If not, can you
find out what happens?
-- 
Florian


Re: [PATCH] net: phy: sfp: correct location of SFP standards

2018-11-29 Thread David Miller
From: Baruch Siach 
Date: Thu, 29 Nov 2018 12:00:05 +0200

> SFP standards are now available from the SNIA (Storage Networking
> Industry Association) website.
> 
> Cc: Andrew Lunn 
> Cc: Florian Fainelli 
> Signed-off-by: Baruch Siach 

Applied.


Re: [PATCHv2 net] sctp: hold transport before accessing its asoc in sctp_epaddr_lookup_transport

2018-11-29 Thread Neil Horman
On Thu, Nov 29, 2018 at 02:44:07PM +0800, Xin Long wrote:
> Without holding the transport while dereferencing its asoc, a
> use-after-free panic can be triggered in sctp_epaddr_lookup_transport.
> Note that a sock lock can't protect these transports, since they belong
> to other socks.
> 
> A similar fix as Commit bab1be79a516 ("sctp: hold transport
> before accessing its asoc in sctp_transport_get_next") is
> needed to hold the transport before accessing its asoc in
> sctp_epaddr_lookup_transport.
> 
> Note that this extra atomic operation is on the datapath,
> but as rhlist keeps the lists to a small size, it won't
> cause a noticeable performance hit.
> 
> v1->v2:
>   - improve the changelog.
> 
> Fixes: 7fda702f9315 ("sctp: use new rhlist interface on sctp transport 
> rhashtable")
> Reported-by: syzbot+aad231d51b1923158...@syzkaller.appspotmail.com
> Signed-off-by: Xin Long 
> ---
>  net/sctp/input.c | 10 --
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/net/sctp/input.c b/net/sctp/input.c
> index 5c36a99..ce7351c 100644
> --- a/net/sctp/input.c
> +++ b/net/sctp/input.c
> @@ -967,9 +967,15 @@ struct sctp_transport *sctp_epaddr_lookup_transport(
>   list = rhltable_lookup(&sctp_transport_hashtable, &arg,
>  sctp_hash_params);
>  
> - rhl_for_each_entry_rcu(t, tmp, list, node)
> - if (ep == t->asoc->ep)
> + rhl_for_each_entry_rcu(t, tmp, list, node) {
> + if (!sctp_transport_hold(t))
> + continue;
> + if (ep == t->asoc->ep) {
> + sctp_transport_put(t);
>   return t;
> + }
> + sctp_transport_put(t);
> + }
>  
>   return NULL;
>  }

Wait a second, what if we just added an rcu_head to the association structure
and changed the kfree call in sctp_association_destroy to a kfree_rcu call
instead?  That would force the actual freeing of the association to pass through
a grace period, during which any in-flight list traversal in
sctp_epaddr_lookup_transport could complete safely.  It's another two pointers
worth of space in the association, but I think that would be a worthwhile
tradeoff for not having to do N atomic adds/puts every time you wanted to
receive or send a frame.
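
A minimal sketch of that idea (hypothetical field name, not a tested patch):

	/* add an rcu_head to the association ... */
	struct sctp_association {
		/* ... existing members ... */
		struct rcu_head rcu;	/* the two extra pointers mentioned above */
	};

	/* ... and defer the final free past an RCU grace period */
	static void sctp_association_destroy(struct sctp_association *asoc)
	{
		/* ... existing teardown ... */
		kfree_rcu(asoc, rcu);	/* was: kfree(asoc) */
	}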

Neil

> -- 
> 2.1.0
> 
> 


Re: [PATCH] net: phy: sfp: correct store of detected link modes

2018-11-29 Thread David Miller
From: Baruch Siach 
Date: Thu, 29 Nov 2018 12:40:11 +0200

> The link modes that sfp_parse_support() detects are stored in the
> 'modes' bitmap. There is no reason to make an exception for 1000Base-PX
> or 1000Base-BX10.
> 
> Signed-off-by: Baruch Siach 

Applied and queued up for -stable.


Re: [PATCH iproute2] ss: add support for bytes_sent, bytes_retrans, dsack_dups and reord_seen

2018-11-29 Thread Stephen Hemminger
On Thu, 29 Nov 2018 02:27:54 -0800
Eric Dumazet  wrote:

> Wei Wang added these fields in linux-4.19
> 
> Tested:
> 
> ss -ti ...
> 
>   ts sack cubic wscale:8,8 rto:7 rtt:2.678/0.267 mss:1428 pmtu:1500
> rcvmss:536 advmss:1428 cwnd:91 ssthresh:65
> (*) bytes_sent:17470606104 bytes_retrans:2856
> bytes_acked:17470483297
> segs_out:12234320 segs_in:622983
> data_segs_out:12234318 send 388.2Mbps lastrcv:986784 lastack:1
> pacing_rate 465.8Mbps delivery_rate 162.7Mbps
> delivered:12234235 delivered_ce:3669056
> busy:986784ms unacked:84 retrans:0/2
> (*) dsack_dups:2
> rcv_space:14280 rcv_ssthresh:65535 notsent:2016336 minrtt:0.183
> 
> Signed-off-by: Eric Dumazet 
> Cc: Wei Wang 
> Cc: Yuchung Cheng 
> Cc: Neal Cardwell 
> Cc: Soheil Hassas Yeganeh 

Applied.

PS: ss still doesn't support JSON output; given the volume of output, it
would be good to have.


Re: [RFC PATCH] net: mvpp2: fix detection of 10G SFP modules

2018-11-29 Thread Baruch Siach
Hi Florian,

Florian Fainelli writes:
> On 11/29/2018 4:49 AM, Baruch Siach wrote:
>> The mvpp2_phylink_validate() relies on the interface field of
>> phylink_link_state to determine valid link modes. However, when called
>> from phylink_sfp_module_insert() this field is not initialized. The
>> default switch case then excludes 10G link modes. As a result, 10G SFP
>> modules that are detected correctly end up limited to a maximum rate of
>> 2.5G.
>>
>> Catch the uninitialized PHY mode case, and allow 10G rates.
>>
>> Cc: Maxime Chevallier 
>> Cc: Antoine Tenart 
>> Signed-off-by: Baruch Siach 
>> ---
>> Is that the right fix?
>
> It would be a bit surprising if this were the right fix; you would
> expect validate to be called once everything has been parsed
> successfully from the SFP. Is that not the case here? If not, can you
> find out what happens?

This is the code from phylink_sfp_module_insert() at
drivers/net/phy/phylink.c:

__ETHTOOL_DECLARE_LINK_MODE_MASK(support) = { 0, };
struct phylink_link_state config;
...

sfp_parse_support(pl->sfp_bus, id, support);
port = sfp_parse_port(pl->sfp_bus, id, support);

memset(&config, 0, sizeof(config));
linkmode_copy(config.advertising, support);
config.interface = PHY_INTERFACE_MODE_NA;
/* ... more 'config' fields initialization ... */
ret = phylink_validate(pl, support, &config);

The 'interface' field is not detected at this stage. I think it is not
meant to represent the detected information, but the actual
configuration.
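
So the RFC boils down to also allowing the 10G modes when the interface is
still PHY_INTERFACE_MODE_NA; a rough sketch of that switch in
mvpp2_phylink_validate() (illustrative only, the exact link modes set may
differ from the posted patch):

	switch (state->interface) {
	case PHY_INTERFACE_MODE_NA:	/* SFP just inserted, mode not known yet */
	case PHY_INTERFACE_MODE_10GKR:
		phylink_set(mask, 10000baseT_Full);
		phylink_set(mask, 10000baseSR_Full);
		break;
	default:
		/* existing lower-speed handling */
		break;
	}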

baruch

--
 http://baruch.siach.name/blog/  ~. .~   Tk Open Systems
=}ooO--U--Ooo{=
   - bar...@tkos.co.il - tel: +972.52.368.4656, http://www.tkos.co.il -


Re: [PATCH iproute2] ss: add support for bytes_sent, bytes_retrans, dsack_dups and reord_seen

2018-11-29 Thread Wei Wang
On Thu, Nov 29, 2018 at 2:28 AM Eric Dumazet  wrote:
>
> Wei Wang added these fields in linux-4.19
>
> Tested:
>
> ss -ti ...
>
> ts sack cubic wscale:8,8 rto:7 rtt:2.678/0.267 mss:1428 pmtu:1500
> rcvmss:536 advmss:1428 cwnd:91 ssthresh:65
> (*) bytes_sent:17470606104 bytes_retrans:2856
> bytes_acked:17470483297
> segs_out:12234320 segs_in:622983
> data_segs_out:12234318 send 388.2Mbps lastrcv:986784 lastack:1
> pacing_rate 465.8Mbps delivery_rate 162.7Mbps
> delivered:12234235 delivered_ce:3669056
> busy:986784ms unacked:84 retrans:0/2
> (*) dsack_dups:2
> rcv_space:14280 rcv_ssthresh:65535 notsent:2016336 minrtt:0.183
>
> Signed-off-by: Eric Dumazet 
> Cc: Wei Wang 
> Cc: Yuchung Cheng 
> Cc: Neal Cardwell 
> Cc: Soheil Hassas Yeganeh 
> ---

Acked-by: Wei Wang 

Thanks Eric.

>
>  misc/ss.c | 16 
>  1 file changed, 16 insertions(+)
>
> diff --git a/misc/ss.c b/misc/ss.c
> index 
> 3aa94f235085512510dca9fd597e8e37aaaf0fd3..3589ebedc5a0ab0615ba56f3df10d49198bed0d9
>  100644
> --- a/misc/ss.c
> +++ b/misc/ss.c
> @@ -819,6 +819,8 @@ struct tcpstat {
> unsigned intnot_sent;
> unsigned intdelivered;
> unsigned intdelivered_ce;
> +   unsigned intdsack_dups;
> +   unsigned intreord_seen;
> double  rcv_rtt;
> double  min_rtt;
> int rcv_space;
> @@ -826,6 +828,8 @@ struct tcpstat {
> unsigned long long  busy_time;
> unsigned long long  rwnd_limited;
> unsigned long long  sndbuf_limited;
> +   unsigned long long  bytes_sent;
> +   unsigned long long  bytes_retrans;
> boolhas_ts_opt;
> boolhas_sack_opt;
> boolhas_ecn_opt;
> @@ -2426,6 +2430,10 @@ static void tcp_stats_print(struct tcpstat *s)
> if (s->ssthresh)
> out(" ssthresh:%d", s->ssthresh);
>
> +   if (s->bytes_sent)
> +   out(" bytes_sent:%llu", s->bytes_sent);
> +   if (s->bytes_retrans)
> +   out(" bytes_retrans:%llu", s->bytes_retrans);
> if (s->bytes_acked)
> out(" bytes_acked:%llu", s->bytes_acked);
> if (s->bytes_received)
> @@ -2512,10 +2520,14 @@ static void tcp_stats_print(struct tcpstat *s)
> out(" lost:%u", s->lost);
> if (s->sacked && s->ss.state != SS_LISTEN)
> out(" sacked:%u", s->sacked);
> +   if (s->dsack_dups)
> +   out(" dsack_dups:%u", s->dsack_dups);
> if (s->fackets)
> out(" fackets:%u", s->fackets);
> if (s->reordering != 3)
> out(" reordering:%d", s->reordering);
> +   if (s->reord_seen)
> +   out(" reord_seen:%d", s->reord_seen);
> if (s->rcv_rtt)
> out(" rcv_rtt:%g", s->rcv_rtt);
> if (s->rcv_space)
> @@ -2837,6 +2849,10 @@ static void tcp_show_info(const struct nlmsghdr *nlh, 
> struct inet_diag_msg *r,
> s.sndbuf_limited = info->tcpi_sndbuf_limited;
> s.delivered = info->tcpi_delivered;
> s.delivered_ce = info->tcpi_delivered_ce;
> +   s.dsack_dups = info->tcpi_dsack_dups;
> +   s.reord_seen = info->tcpi_reord_seen;
> +   s.bytes_sent = info->tcpi_bytes_sent;
> +   s.bytes_retrans = info->tcpi_bytes_retrans;
> tcp_stats_print(&s);
> free(s.dctcp);
> free(s.bbr_info);
> --
> 2.20.0.rc0.387.gc7a69e6b6c-goog
>


[PATCH net-next v3 1/3] udp: msg_zerocopy

2018-11-29 Thread Willem de Bruijn
From: Willem de Bruijn 

Extend zerocopy to udp sockets. Allow setting sockopt SO_ZEROCOPY and
interpret flag MSG_ZEROCOPY.

This patch was previously part of the zerocopy RFC patchsets. Zerocopy
is not effective at small MTU. With segmentation offload building
larger datagrams, the benefit of page flipping outweighs the cost of
generating a completion notification.

tools/testing/selftests/net/msg_zerocopy.sh after applying follow-on
test patch and making skb_orphan_frags_rx same as skb_orphan_frags:

ipv4 udp -t 1
tx=191312 (11938 MB) txc=0 zc=n
rx=191312 (11938 MB)
ipv4 udp -z -t 1
tx=304507 (19002 MB) txc=304507 zc=y
rx=304507 (19002 MB)
ok
ipv6 udp -t 1
tx=174485 (10888 MB) txc=0 zc=n
rx=174485 (10888 MB)
ipv6 udp -z -t 1
tx=294801 (18396 MB) txc=294801 zc=y
rx=294801 (18396 MB)
ok
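
For reference, the userspace side follows the existing MSG_ZEROCOPY flow;
a minimal sketch (assumes a connected UDP socket 'fd'; error handling and
errqueue parsing omitted):

	#include <sys/socket.h>
	#include <linux/errqueue.h>

	int one = 1;

	setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
	send(fd, buf, len, MSG_ZEROCOPY);	/* pages stay pinned until completion */

	/* later: reap the completion notification from the error queue */
	recvmsg(fd, &msg, MSG_ERRQUEUE);	/* cmsg origin is SO_EE_ORIGIN_ZEROCOPY */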

Changes
  v1 -> v2
- Fixup reverse christmas tree violation
  v2 -> v3
- Split refcount avoidance optimization into separate patch
  - Fix refcount leak on error in fragmented case
(thanks to Paolo Abeni for pointing this one out!)
  - Fix refcount inc on zero

Signed-off-by: Willem de Bruijn 
---
 include/linux/skbuff.h |  1 +
 net/core/skbuff.c  |  6 ++
 net/core/sock.c|  5 -
 net/ipv4/ip_output.c   | 23 ++-
 net/ipv6/ip6_output.c  | 23 ++-
 5 files changed, 55 insertions(+), 3 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 73902acf2b71..04f52e719571 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -485,6 +485,7 @@ void sock_zerocopy_put_abort(struct ubuf_info *uarg);
 
 void sock_zerocopy_callback(struct ubuf_info *uarg, bool success);
 
+int skb_zerocopy_iter_dgram(struct sk_buff *skb, struct msghdr *msg, int len);
 int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
 struct msghdr *msg, int len,
 struct ubuf_info *uarg);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 02cd7ae3d0fb..435bac91d293 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1105,6 +1105,12 @@ EXPORT_SYMBOL_GPL(sock_zerocopy_put_abort);
 extern int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb,
   struct iov_iter *from, size_t length);
 
+int skb_zerocopy_iter_dgram(struct sk_buff *skb, struct msghdr *msg, int len)
+{
+   return __zerocopy_sg_from_iter(skb->sk, skb, &msg->msg_iter, len);
+}
+EXPORT_SYMBOL_GPL(skb_zerocopy_iter_dgram);
+
 int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
 struct msghdr *msg, int len,
 struct ubuf_info *uarg)
diff --git a/net/core/sock.c b/net/core/sock.c
index 6d7e189e3cd9..f5bb89785e47 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1018,7 +1018,10 @@ int sock_setsockopt(struct socket *sock, int level, int 
optname,
 
case SO_ZEROCOPY:
if (sk->sk_family == PF_INET || sk->sk_family == PF_INET6) {
-   if (sk->sk_protocol != IPPROTO_TCP)
+   if (!((sk->sk_type == SOCK_STREAM &&
+  sk->sk_protocol == IPPROTO_TCP) ||
+ (sk->sk_type == SOCK_DGRAM &&
+  sk->sk_protocol == IPPROTO_UDP)))
ret = -ENOTSUPP;
} else if (sk->sk_family != PF_RDS) {
ret = -ENOTSUPP;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index c09219e7f230..9602746d7175 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -867,6 +867,7 @@ static int __ip_append_data(struct sock *sk,
unsigned int flags)
 {
struct inet_sock *inet = inet_sk(sk);
+   struct ubuf_info *uarg = NULL;
struct sk_buff *skb;
 
struct ip_options *opt = cork->opt;
@@ -916,6 +917,19 @@ static int __ip_append_data(struct sock *sk,
(!exthdrlen || (rt->dst.dev->features & NETIF_F_HW_ESP_TX_CSUM)))
csummode = CHECKSUM_PARTIAL;
 
+   if (flags & MSG_ZEROCOPY && length && sock_flag(sk, SOCK_ZEROCOPY)) {
+   uarg = sock_zerocopy_realloc(sk, length, skb_zcopy(skb));
+   if (!uarg)
+   return -ENOBUFS;
+   if (rt->dst.dev->features & NETIF_F_SG &&
+   csummode == CHECKSUM_PARTIAL) {
+   paged = true;
+   } else {
+   uarg->zerocopy = 0;
+   skb_zcopy_set(skb, uarg);
+   }
+   }
+
cork->length += length;
 
/* So, what's going on in the loop below?
@@ -1005,6 +1019,7 @@ static int __ip_append_data(struct sock *sk,
cork->tx_flags = 0;
skb_shinfo(skb)->tskey = tskey;
tskey = 0;
+   skb_zcopy_set(skb, uarg);
 

[PATCH net-next v3 2/3] udp: elide zerocopy operation in hot path

2018-11-29 Thread Willem de Bruijn
From: Willem de Bruijn 

With MSG_ZEROCOPY, each skb holds a reference to a struct ubuf_info.
Release of its last reference triggers a completion notification.

The TCP stack in tcp_sendmsg_locked holds an extra ref independent of
the skbs, because it can build, send and free skbs within its loop,
possibly reaching refcount zero and freeing the ubuf_info too soon.

The UDP stack currently also takes this extra ref, but does not need
it as all skbs are sent after return from __ip(6)_append_data.

Avoid the extra refcount_inc and refcount_dec_and_test, and generally
the sock_zerocopy_put in the common path, by passing the initial
reference to the first skb.

This approach is taken instead of initializing the refcount to 0, as
that would generate error "refcount_t: increment on 0" on the
next skb_zcopy_set.

Signed-off-by: Willem de Bruijn 
---
 include/linux/skbuff.h | 12 
 net/core/skbuff.c  |  9 +
 net/ipv4/ip_output.c   | 10 +-
 net/ipv4/tcp.c |  2 +-
 net/ipv6/ip6_output.c  | 10 +-
 5 files changed, 24 insertions(+), 19 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 04f52e719571..75d50ab7997c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -481,7 +481,7 @@ static inline void sock_zerocopy_get(struct ubuf_info *uarg)
 }
 
 void sock_zerocopy_put(struct ubuf_info *uarg);
-void sock_zerocopy_put_abort(struct ubuf_info *uarg);
+void sock_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref);
 
 void sock_zerocopy_callback(struct ubuf_info *uarg, bool success);
 
@@ -1326,10 +1326,14 @@ static inline struct ubuf_info *skb_zcopy(struct 
sk_buff *skb)
return is_zcopy ? skb_uarg(skb) : NULL;
 }
 
-static inline void skb_zcopy_set(struct sk_buff *skb, struct ubuf_info *uarg)
+static inline void skb_zcopy_set(struct sk_buff *skb, struct ubuf_info *uarg,
+bool *have_ref)
 {
if (skb && uarg && !skb_zcopy(skb)) {
-   sock_zerocopy_get(uarg);
+   if (unlikely(have_ref && *have_ref))
+   *have_ref = false;
+   else
+   sock_zerocopy_get(uarg);
skb_shinfo(skb)->destructor_arg = uarg;
skb_shinfo(skb)->tx_flags |= SKBTX_ZEROCOPY_FRAG;
}
@@ -1374,7 +1378,7 @@ static inline void skb_zcopy_abort(struct sk_buff *skb)
struct ubuf_info *uarg = skb_zcopy(skb);
 
if (uarg) {
-   sock_zerocopy_put_abort(uarg);
+   sock_zerocopy_put_abort(uarg, false);
skb_shinfo(skb)->tx_flags &= ~SKBTX_ZEROCOPY_FRAG;
}
 }
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 435bac91d293..7cfc2144228a 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1089,7 +1089,7 @@ void sock_zerocopy_put(struct ubuf_info *uarg)
 }
 EXPORT_SYMBOL_GPL(sock_zerocopy_put);
 
-void sock_zerocopy_put_abort(struct ubuf_info *uarg)
+void sock_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref)
 {
if (uarg) {
struct sock *sk = skb_from_uarg(uarg)->sk;
@@ -1097,7 +1097,8 @@ void sock_zerocopy_put_abort(struct ubuf_info *uarg)
	atomic_dec(&sk->sk_zckey);
uarg->len--;
 
-   sock_zerocopy_put(uarg);
+   if (have_uref)
+   sock_zerocopy_put(uarg);
}
 }
 EXPORT_SYMBOL_GPL(sock_zerocopy_put_abort);
@@ -1137,7 +1138,7 @@ int skb_zerocopy_iter_stream(struct sock *sk, struct 
sk_buff *skb,
return err;
}
 
-   skb_zcopy_set(skb, uarg);
+   skb_zcopy_set(skb, uarg, NULL);
return skb->len - orig_len;
 }
 EXPORT_SYMBOL_GPL(skb_zerocopy_iter_stream);
@@ -1157,7 +1158,7 @@ static int skb_zerocopy_clone(struct sk_buff *nskb, 
struct sk_buff *orig,
if (skb_copy_ubufs(nskb, GFP_ATOMIC))
return -EIO;
}
-   skb_zcopy_set(nskb, skb_uarg(orig));
+   skb_zcopy_set(nskb, skb_uarg(orig), NULL);
}
return 0;
 }
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 9602746d7175..08ff04d12642 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -881,8 +881,8 @@ static int __ip_append_data(struct sock *sk,
int csummode = CHECKSUM_NONE;
struct rtable *rt = (struct rtable *)cork->dst;
unsigned int wmem_alloc_delta = 0;
+   bool paged, extra_uref;
u32 tskey = 0;
-   bool paged;
 
skb = skb_peek_tail(queue);
 
@@ -921,12 +921,13 @@ static int __ip_append_data(struct sock *sk,
uarg = sock_zerocopy_realloc(sk, length, skb_zcopy(skb));
if (!uarg)
return -ENOBUFS;
+   extra_uref = true;
if (rt->dst.dev->features & NETIF_F_SG &&
csummode == CHECKSUM_PARTIAL) {
paged = true;
} else {

[PATCH net-next v3 0/3] udp msg_zerocopy

2018-11-29 Thread Willem de Bruijn
From: Willem de Bruijn 

Enable MSG_ZEROCOPY for udp sockets

Patch 1/3 is the main patch, a rework of RFC patch
  http://patchwork.ozlabs.org/patch/899630/
  more details in the patch commit message

Patch 2/3 is an optimization to remove a branch from the UDP hot path
  and refcount_inc/refcount_dec_and_test pair when zerocopy is used.
  This used to be included in the first patch in v2.

Patch 3/3 runs the already existing udp zerocopy tests
  as part of kselftest

See also recent Linux Plumbers presentation
  
https://linuxplumbersconf.org/event/2/contributions/106/attachments/104/128/willemdebruijn-lpc2018-udpgso-presentation-20181113.pdf

Changes:
  v1 -> v2
- Fixup reverse christmas tree violation
  v2 -> v3
- Split refcount avoidance optimization into separate patch
  - Fix refcount leak on error in fragmented case
(thanks to Paolo Abeni for pointing this one out!)
  - Fix refcount inc on zero

Willem de Bruijn (3):
  udp: msg_zerocopy
  udp: elide zerocopy operation in hot path
  selftests: extend zerocopy tests to udp

 include/linux/skbuff.h  | 13 +++
 net/core/skbuff.c   | 15 +
 net/core/sock.c |  5 -
 net/ipv4/ip_output.c| 25 +++--
 net/ipv4/tcp.c  |  2 +-
 net/ipv6/ip6_output.c   | 25 +++--
 tools/testing/selftests/net/msg_zerocopy.c  |  3 ++-
 tools/testing/selftests/net/msg_zerocopy.sh |  2 ++
 tools/testing/selftests/net/udpgso_bench.sh |  3 +++
 9 files changed, 78 insertions(+), 15 deletions(-)

-- 
2.20.0.rc0.387.gc7a69e6b6c-goog



[PATCH net-next v3 3/3] selftests: extend zerocopy tests to udp

2018-11-29 Thread Willem de Bruijn
From: Willem de Bruijn 

Both msg_zerocopy and udpgso_bench have udp zerocopy variants.
Exercise these as part of the standard kselftest run.

With udp, msg_zerocopy has no control channel. Ensure that the
receiver exits after the sender by accounting for the initial
delay in starting them (in msg_zerocopy.sh).

Signed-off-by: Willem de Bruijn 
---
 tools/testing/selftests/net/msg_zerocopy.c  | 3 ++-
 tools/testing/selftests/net/msg_zerocopy.sh | 2 ++
 tools/testing/selftests/net/udpgso_bench.sh | 3 +++
 3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/net/msg_zerocopy.c 
b/tools/testing/selftests/net/msg_zerocopy.c
index 406cc70c571d..4b02933cab8a 100644
--- a/tools/testing/selftests/net/msg_zerocopy.c
+++ b/tools/testing/selftests/net/msg_zerocopy.c
@@ -651,12 +651,13 @@ static void do_flush_datagram(int fd, int type)
 
 static void do_rx(int domain, int type, int protocol)
 {
+   const int cfg_receiver_wait_ms = 400;
uint64_t tstop;
int fd;
 
fd = do_setup_rx(domain, type, protocol);
 
-   tstop = gettimeofday_ms() + cfg_runtime_ms;
+   tstop = gettimeofday_ms() + cfg_runtime_ms + cfg_receiver_wait_ms;
do {
if (type == SOCK_STREAM)
do_flush_tcp(fd);
diff --git a/tools/testing/selftests/net/msg_zerocopy.sh 
b/tools/testing/selftests/net/msg_zerocopy.sh
index c43c6debda06..825ffec85cea 100755
--- a/tools/testing/selftests/net/msg_zerocopy.sh
+++ b/tools/testing/selftests/net/msg_zerocopy.sh
@@ -25,6 +25,8 @@ readonly path_sysctl_mem="net.core.optmem_max"
 if [[ "$#" -eq "0" ]]; then
$0 4 tcp -t 1
$0 6 tcp -t 1
+   $0 4 udp -t 1
+   $0 6 udp -t 1
echo "OK. All tests passed"
exit 0
 fi
diff --git a/tools/testing/selftests/net/udpgso_bench.sh 
b/tools/testing/selftests/net/udpgso_bench.sh
index 0f0628613f81..5670a9ffd8eb 100755
--- a/tools/testing/selftests/net/udpgso_bench.sh
+++ b/tools/testing/selftests/net/udpgso_bench.sh
@@ -35,6 +35,9 @@ run_udp() {
 
echo "udp gso"
run_in_netns ${args} -S 0
+
+   echo "udp gso zerocopy"
+   run_in_netns ${args} -S 0 -z
 }
 
 run_tcp() {
-- 
2.20.0.rc0.387.gc7a69e6b6c-goog



Re: [PATCH net-next v3 1/3] udp: msg_zerocopy

2018-11-29 Thread Willem de Bruijn
On Thu, Nov 29, 2018 at 3:26 PM Willem de Bruijn
 wrote:
>
> From: Willem de Bruijn 
>
> Extend zerocopy to udp sockets. Allow setting sockopt SO_ZEROCOPY and
> interpret flag MSG_ZEROCOPY.
>
> This patch was previously part of the zerocopy RFC patchsets. Zerocopy
> is not effective at small MTU. With segmentation offload building
> larger datagrams, the benefit of page flipping outweighs the cost of
> generating a completion notification.
>
> tools/testing/selftests/net/msg_zerocopy.sh after applying follow-on
> test patch and making skb_orphan_frags_rx same as skb_orphan_frags:
>
> ipv4 udp -t 1
> tx=191312 (11938 MB) txc=0 zc=n
> rx=191312 (11938 MB)
> ipv4 udp -z -t 1
> tx=304507 (19002 MB) txc=304507 zc=y
> rx=304507 (19002 MB)
> ok
> ipv6 udp -t 1
> tx=174485 (10888 MB) txc=0 zc=n
> rx=174485 (10888 MB)
> ipv6 udp -z -t 1
> tx=294801 (18396 MB) txc=294801 zc=y
> rx=294801 (18396 MB)
> ok
>
> Changes
>   v1 -> v2
> - Fixup reverse christmas tree violation
>   v2 -> v3
> - Split refcount avoidance optimization into separate patch
>   - Fix refcount leak on error in fragmented case
> (thanks to Paolo Abeni for pointing this one out!)
>   - Fix refcount inc on zero

one additional change that I forgot to add to the changelog

  - test sock_flag SOCK_ZEROCOPY directly in __ip_append_data.
this is needed since commit 5cf4a8532c99 ("tcp: really ignore
MSG_ZEROCOPY if no SO_ZEROCOPY") did the same for tcp.


[PATCH net-next 2/2] ixgbe: use mii_bus to handle MII related ioctls

2018-11-29 Thread Steve Douthit
Use the mii_bus callbacks to address the entire clause 22/45 address
space.  This enables userspace to poke switch registers rather than only
the single probed PHY address.

Signed-off-by: Stephen Douthit 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 29 +--
 1 file changed, 14 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index c2141fe35685..345ecdddc749 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -8786,27 +8786,26 @@ static int
 ixgbe_mdio_read(struct net_device *netdev, int prtad, int devad, u16 addr)
 {
struct ixgbe_adapter *adapter = netdev_priv(netdev);
-   struct ixgbe_hw *hw = &adapter->hw;
-   u16 value;
-   int rc;
+   struct mii_bus *mii_bus = adapter->mii_bus;
+   int regnum = addr;
 
-   if (prtad != hw->phy.mdio.prtad)
-   return -EINVAL;
-   rc = hw->phy.ops.read_reg(hw, addr, devad, &value);
-   if (!rc)
-   rc = value;
-   return rc;
+   if (devad != MDIO_DEVAD_NONE)
+   regnum |= (devad << 16) | MII_ADDR_C45;
+
+   return mdiobus_read(mii_bus, prtad, regnum);
 }
 
 static int ixgbe_mdio_write(struct net_device *netdev, int prtad, int devad,
u16 addr, u16 value)
 {
struct ixgbe_adapter *adapter = netdev_priv(netdev);
-   struct ixgbe_hw *hw = &adapter->hw;
+   struct mii_bus *mii_bus = adapter->mii_bus;
+   int regnum = addr;
 
-   if (prtad != hw->phy.mdio.prtad)
-   return -EINVAL;
-   return hw->phy.ops.write_reg(hw, addr, devad, value);
+   if (devad != MDIO_DEVAD_NONE)
+   regnum |= (devad << 16) | MII_ADDR_C45;
+
+   return mdiobus_write(mii_bus, prtad, regnum, value);
 }
 
 static int ixgbe_ioctl(struct net_device *netdev, struct ifreq *req, int cmd)
@@ -8819,7 +8818,7 @@ static int ixgbe_ioctl(struct net_device *netdev, struct 
ifreq *req, int cmd)
case SIOCGHWTSTAMP:
return ixgbe_ptp_get_ts_config(adapter, req);
case SIOCGMIIPHY:
-   if (!adapter->hw.phy.ops.read_reg)
+   if (!adapter->mii_bus)
return -EOPNOTSUPP;
/* fall through */
default:
@@ -10796,7 +10795,7 @@ static int ixgbe_probe(struct pci_dev *pdev, const 
struct pci_device_id *ent)
/* ixgbe_identify_phy_generic will set prtad and mmds properly */
hw->phy.mdio.prtad = MDIO_PRTAD_NONE;
hw->phy.mdio.mmds = 0;
-   hw->phy.mdio.mode_support = MDIO_SUPPORTS_C45 | MDIO_EMULATE_C22;
+   hw->phy.mdio.mode_support = MDIO_SUPPORTS_C45 | MDIO_SUPPORTS_C22;
hw->phy.mdio.dev = netdev;
hw->phy.mdio.mdio_read = ixgbe_mdio_read;
hw->phy.mdio.mdio_write = ixgbe_mdio_write;
-- 
2.17.2



[PATCH net-next 1/2] ixgbe: register a mdiobus

2018-11-29 Thread Steve Douthit
Most dsa devices expect a 'struct mii_bus' pointer to talk to switches
via the MII interface.

Signed-off-by: Stephen Douthit 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h  |   2 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   6 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_phy.c  | 192 ++
 drivers/net/ethernet/intel/ixgbe/ixgbe_phy.h  |   2 +
 4 files changed, 202 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h 
b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 143bdd5ee2a0..08d85e336bd4 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -561,6 +562,7 @@ struct ixgbe_adapter {
struct net_device *netdev;
struct bpf_prog *xdp_prog;
struct pci_dev *pdev;
+   struct mii_bus *mii_bus;
 
unsigned long state;
 
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 49a4ea38eb07..c2141fe35685 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -39,6 +39,7 @@
 #include "ixgbe.h"
 #include "ixgbe_common.h"
 #include "ixgbe_dcb_82599.h"
+#include "ixgbe_phy.h"
 #include "ixgbe_sriov.h"
 #include "ixgbe_model.h"
 #include "ixgbe_txrx_common.h"
@@ -11120,6 +11121,9 @@ static int ixgbe_probe(struct pci_dev *pdev, const 
struct pci_device_id *ent)
IXGBE_LINK_SPEED_10GB_FULL | IXGBE_LINK_SPEED_1GB_FULL,
true);
 
+   if (ixgbe_mii_bus_init(hw))
+   e_err(probe, "failed to create mii_bus\n");
+
return 0;
 
 err_register:
@@ -11170,6 +11174,8 @@ static void ixgbe_remove(struct pci_dev *pdev)
	set_bit(__IXGBE_REMOVING, &adapter->state);
	cancel_work_sync(&adapter->service_task);
 
+   if (adapter->mii_bus)
+   mdiobus_unregister(adapter->mii_bus);
 
 #ifdef CONFIG_IXGBE_DCA
if (adapter->flags & IXGBE_FLAG_DCA_ENABLED) {
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_phy.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_phy.c
index 919a7af84b42..75db340963ad 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_phy.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_phy.c
@@ -658,6 +658,198 @@ s32 ixgbe_write_phy_reg_generic(struct ixgbe_hw *hw, u32 
reg_addr,
return status;
 }
 
+/**
+ *  ixgbe_msca_cmd - Write the command register and poll for completion/timeout
+ *  @hw: pointer to hardware structure
+ *  @cmd: command register value to write
+ **/
+static s32 ixgbe_msca_cmd(struct ixgbe_hw *hw, u32 cmd)
+{
+   u32 i;
+
+   IXGBE_WRITE_REG(hw, IXGBE_MSCA, cmd);
+
+   for (i = 0; i < IXGBE_MDIO_COMMAND_TIMEOUT; i++) {
+   udelay(10);
+   cmd = IXGBE_READ_REG(hw, IXGBE_MSCA);
+   if (!(cmd & IXGBE_MSCA_MDI_COMMAND))
+   return 0;
+   }
+
+   return -ETIMEDOUT;
+}
+
+/**
+ *  ixgbe_is_shared_mii - Use device_id to figure out if the MDIO bus is
+ *  shared between MACs.
+ *  The embedded x550(s) in the C3000 line of SoCs only have a single mii_bus
+ *  shared between all MACs, and that introduces some new mask flags that need
+ *  to be passed to the *_swfw_sync callbacks.
+ *  @hw: pointer to hardware structure
+ **/
+static bool ixgbe_is_shared_mii(struct ixgbe_hw *hw)
+{
+   switch (hw->device_id) {
+   case IXGBE_DEV_ID_X550EM_A_KR:
+   case IXGBE_DEV_ID_X550EM_A_KR_L:
+   case IXGBE_DEV_ID_X550EM_A_SFP_N:
+   case IXGBE_DEV_ID_X550EM_A_SGMII:
+   case IXGBE_DEV_ID_X550EM_A_SGMII_L:
+   case IXGBE_DEV_ID_X550EM_A_10G_T:
+   case IXGBE_DEV_ID_X550EM_A_SFP:
+   case IXGBE_DEV_ID_X550EM_A_1G_T:
+   case IXGBE_DEV_ID_X550EM_A_1G_T_L:
+   return true;
+   }
+
+   return false;
+}
+
+/**
+ *  ixgbe_mii_bus_read - Read a clause 22/45 register
+ *  @bus: pointer to mii_bus structure
+ *  @addr: address
+ *  @regnum: register number
+ **/
+static s32 ixgbe_mii_bus_read(struct mii_bus *bus, int addr, int regnum)
+{
+   struct ixgbe_adapter *adapter = (struct ixgbe_adapter *)bus->priv;
+   struct ixgbe_hw *hw = &adapter->hw;
+   u32 hwaddr, cmd, gssr = hw->phy.phy_semaphore_mask;
+   s32 data;
+
+   if (ixgbe_is_shared_mii(hw)) {
+   gssr |= IXGBE_GSSR_TOKEN_SM;
+   if (hw->bus.lan_id)
+   gssr |= IXGBE_GSSR_PHY1_SM;
+   else
+   gssr |= IXGBE_GSSR_PHY0_SM;
+   }
+
+   if (hw->mac.ops.acquire_swfw_sync(hw, gssr))
+   return -EBUSY;
+
+   hwaddr = addr << IXGBE_MSCA_PHY_ADDR_SHIFT;
+   if (regnum & MII_ADDR_C45) {
+   hwaddr |= regnum & GENMASK(21, 0);
+   cmd = hwaddr | IXGBE_MSCA_ADDR_CYCLE | IXGBE_MSCA_MDI_COMMAND;
+   } else {
+   hwaddr |= (regnum & GENMASK(5, 0)) << IXGBE_MSCA_DEV_TYPE_SHIFT;
+   cmd = hwaddr | 

Re: [PATCH net-next v3 2/3] udp: elide zerocopy operation in hot path

2018-11-29 Thread Willem de Bruijn
On Thu, Nov 29, 2018 at 3:26 PM Willem de Bruijn
 wrote:
>
> From: Willem de Bruijn 
>
> With MSG_ZEROCOPY, each skb holds a reference to a struct ubuf_info.
> Release of its last reference triggers a completion notification.
>
> The TCP stack in tcp_sendmsg_locked holds an extra ref independent of
> the skbs, because it can build, send and free skbs within its loop,
> possibly reaching refcount zero and freeing the ubuf_info too soon.
>
> The UDP stack currently also takes this extra ref, but does not need
> it as all skbs are sent after return from __ip(6)_append_data.
>
> Avoid the extra refcount_inc and refcount_dec_and_test, and generally
> the sock_zerocopy_put in the common path, by passing the initial
> reference to the first skb.
>
> This approach is taken instead of initializing the refcount to 0, as
> that would generate error "refcount_t: increment on 0" on the
> next skb_zcopy_set.
>
> Signed-off-by: Willem de Bruijn 
> ---

> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 9602746d7175..08ff04d12642 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -881,8 +881,8 @@ static int __ip_append_data(struct sock *sk,
> int csummode = CHECKSUM_NONE;
> struct rtable *rt = (struct rtable *)cork->dst;
> unsigned int wmem_alloc_delta = 0;
> +   bool paged, extra_uref;
> u32 tskey = 0;
> -   bool paged;
>
> skb = skb_peek_tail(queue);
>
> @@ -921,12 +921,13 @@ static int __ip_append_data(struct sock *sk,
> uarg = sock_zerocopy_realloc(sk, length, skb_zcopy(skb));
> if (!uarg)
> return -ENOBUFS;
> +   extra_uref = true;
> if (rt->dst.dev->features & NETIF_F_SG &&
> csummode == CHECKSUM_PARTIAL) {
> paged = true;
> } else {
> uarg->zerocopy = 0;
> -   skb_zcopy_set(skb, uarg);
> +   skb_zcopy_set(skb, uarg, &extra_uref);
> }
> }
>
> @@ -1019,7 +1020,7 @@ static int __ip_append_data(struct sock *sk,
> cork->tx_flags = 0;
> skb_shinfo(skb)->tskey = tskey;
> tskey = 0;
> -   skb_zcopy_set(skb, uarg);
> +   skb_zcopy_set(skb, uarg, &extra_uref);
>
> /*
>  *  Find where to start putting bytes.
> @@ -1123,13 +1124,12 @@ static int __ip_append_data(struct sock *sk,
>
> if (wmem_alloc_delta)
> refcount_add(wmem_alloc_delta, &sk->sk_wmem_alloc);
> -   sock_zerocopy_put(uarg);
> return 0;
>
>  error_efault:
> err = -EFAULT;
>  error:
> -   sock_zerocopy_put_abort(uarg);
> +   sock_zerocopy_put_abort(uarg, extra_uref);

I'll need another revision. Sorry for the spam.

In the draft patch I suggested that the skb_zcopy_set needs to be
moved below getfrag, so that the uarg is not freed on the only
error path that calls kfree_skb inside the main loop.

This is still needed. Else sock_zerocopy_put_abort here is
reached with a non-NULL, but already freed uarg.

Will send a v4. Will let this sit for at least a day in case anyone
else has comments.


[PATCH bpf-next] tools/bpf: make libbpf _GNU_SOURCE friendly

2018-11-29 Thread Yonghong Song
During porting libbpf to bcc, I got some warnings like below:
  ...
  [  2%] Building C object 
src/cc/CMakeFiles/bpf-shared.dir/libbpf/src/libbpf.c.o
  /home/yhs/work/bcc2/src/cc/libbpf/src/libbpf.c:12:0:
  warning: "_GNU_SOURCE" redefined [enabled by default]
   #define _GNU_SOURCE
  ...
  [  3%] Building C object 
src/cc/CMakeFiles/bpf-shared.dir/libbpf/src/libbpf_errno.c.o
  /home/yhs/work/bcc2/src/cc/libbpf/src/libbpf_errno.c: In function 
‘libbpf_strerror’:
  /home/yhs/work/bcc2/src/cc/libbpf/src/libbpf_errno.c:45:7:
  warning: assignment makes integer from pointer without a cast [enabled by 
default]
 ret = strerror_r(err, buf, size);
  ...

bcc is built with _GNU_SOURCE defined and this caused the above warnings.
This patch intends to make libbpf _GNU_SOURCE friendly by
  . defining _GNU_SOURCE if it is not already defined
  . handling the strerror_r return value differently when _GNU_SOURCE is
defined, since the GNU version of strerror_r has a different return value
than the non-GNU version.

Signed-off-by: Yonghong Song 
---
 tools/lib/bpf/libbpf.c   |  2 ++
 tools/lib/bpf/libbpf_errno.c | 10 ++
 2 files changed, 12 insertions(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index ed4212a4c5f9..59b748ebd15f 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -9,7 +9,9 @@
  * Copyright (C) 2017 Nicira, Inc.
  */
 
+#ifndef _GNU_SOURCE
 #define _GNU_SOURCE
+#endif
 #include 
 #include 
 #include 
diff --git a/tools/lib/bpf/libbpf_errno.c b/tools/lib/bpf/libbpf_errno.c
index d83b17f8435c..286e497c50ec 100644
--- a/tools/lib/bpf/libbpf_errno.c
+++ b/tools/lib/bpf/libbpf_errno.c
@@ -40,9 +40,19 @@ int libbpf_strerror(int err, char *buf, size_t size)
err = err > 0 ? err : -err;
 
if (err < __LIBBPF_ERRNO__START) {
+#ifdef _GNU_SOURCE
+   const char *ret_buf;
+#endif
int ret;
 
+#ifdef _GNU_SOURCE
+   ret_buf = strerror_r(err, buf, size);
+   if (ret_buf != buf)
+   snprintf(buf, size, "%s", ret_buf);
+   ret = 0;
+#else
ret = strerror_r(err, buf, size);
+#endif
buf[size - 1] = '\0';
return ret;
}
-- 
2.17.1



Re: [PATCH net-next,v4 05/12] flow_offload: add statistics retrieval infrastructure and use it

2018-11-29 Thread Jakub Kicinski
On Thu, 29 Nov 2018 03:22:24 +0100, Pablo Neira Ayuso wrote:
> This patch provides the flow_stats structure that acts as container for
> tc_cls_flower_offload, then we can use to restore the statistics on the
> existing TC actions. Hence, tcf_exts_stats_update() is not used from
> drivers anymore.
> 
> Signed-off-by: Pablo Neira Ayuso 

> diff --git a/include/net/flow_offload.h b/include/net/flow_offload.h
> index dabc819b6cc9..040c092c000a 100644
> --- a/include/net/flow_offload.h
> +++ b/include/net/flow_offload.h
> @@ -179,4 +179,18 @@ static inline bool flow_rule_match_key(const struct 
> flow_rule *rule,
>   return dissector_uses_key(rule->match.dissector, key);
>  }
>  
> +struct flow_stats {
> + u64 pkts;
> + u64 bytes;
> + u64 lastused;
> +};
> +
> +static inline void flow_stats_update(struct flow_stats *flow_stats,
> +  u64 pkts, u64 bytes, u64 lastused)
> +{
> + flow_stats->pkts= pkts;
> + flow_stats->bytes   = bytes;
> + flow_stats->lastused= lastused;
> +}
> +
>  #endif /* _NET_FLOW_OFFLOAD_H */
> diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
> index abb035f84321..a08c06e383db 100644
> --- a/include/net/pkt_cls.h
> +++ b/include/net/pkt_cls.h
> @@ -767,6 +767,7 @@ struct tc_cls_flower_offload {
>   enum tc_fl_command command;
>   unsigned long cookie;
>   struct flow_rule *rule;
> + struct flow_stats stats;
>   struct tcf_exts *exts;
>   u32 classid;
>  };
> diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
> index 8898943b8ee6..b88cf29aff7b 100644
> --- a/net/sched/cls_flower.c
> +++ b/net/sched/cls_flower.c
> @@ -432,6 +432,10 @@ static void fl_hw_update_stats(struct tcf_proto *tp, 
> struct cls_fl_filter *f)
>  
>   tc_setup_cb_call(block, &f->exts, TC_SETUP_CLSFLOWER,
>  &cls_flower, false);
> +
> + tcf_exts_stats_update(&f->exts, cls_flower.stats.bytes,
> +   cls_flower.stats.pkts,
> +   cls_flower.stats.lastused);
>  }
>  
>  static bool __fl_delete(struct tcf_proto *tp, struct cls_fl_filter *f,

Apart from the bug Venkat mentioned, I think this patch exposes a
potentially strange and unexpected TC-ism through to the abstracted API.
The stats for TC store the increment per update cycle, so flow->stats
will be equal to the diff between TC_CLSFLOWER_STATS calls.

Is this behaviour going to be subsystem-specific?  Looks not-so-clean.
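
To illustrate that TC-ism: a driver's stats dump is expected to report
only the delta since the previous dump, which TC then accumulates; a
hypothetical driver-side fragment (field names made up):

	/* report only the increment since the last TC_CLSFLOWER_STATS call */
	u64 pkts  = hw_pkts  - flow->last_pkts;
	u64 bytes = hw_bytes - flow->last_bytes;

	flow->last_pkts  = hw_pkts;
	flow->last_bytes = hw_bytes;
	flow_stats_update(&cls_flower->stats, pkts, bytes, jiffies);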


[PATCH RFC net-next] net: vlan/macvlan: count packets properly with gso

2018-11-29 Thread Debabrata Banerjee
Fix the packet count when using the vlan/macvlan drivers with gso. Without
this it is not possible to reconcile packet counts between underlying
devices and these virtual devices. Additionally, the output looks wrong on
its own, e.g. a device with an MTU of 1500 reporting 1 packet sent but
31856 bytes sent.

There are many other drivers that likely have a similar problem, although
it is not clear how many of those could be used with gso. Perhaps all
packet counting should be wrapped in a helper fn.
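
Such a helper could look roughly like this (hypothetical name, not part of
this RFC):

	/* count one packet per GSO segment so virtual-device stats line up
	 * with what the underlying device actually transmits
	 */
	static inline unsigned int skb_tx_pkt_count(const struct sk_buff *skb)
	{
		return max_t(u16, 1, skb_shinfo(skb)->gso_segs);
	}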

Signed-off-by: Debabrata Banerjee 
---
 drivers/net/macvlan.c | 2 +-
 net/8021q/vlan_core.c | 2 +-
 net/8021q/vlan_dev.c  | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index fc8d5f1ee1ad..15e67a87f202 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -566,7 +566,7 @@ static netdev_tx_t macvlan_start_xmit(struct sk_buff *skb,
 
pcpu_stats = this_cpu_ptr(vlan->pcpu_stats);
	u64_stats_update_begin(&pcpu_stats->syncp);
-   pcpu_stats->tx_packets++;
+   pcpu_stats->tx_packets += max_t(u16, 1, 
skb_shinfo(skb)->gso_segs);
pcpu_stats->tx_bytes += len;
	u64_stats_update_end(&pcpu_stats->syncp);
} else {
diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c
index a313165e7a67..e85f6665d0ed 100644
--- a/net/8021q/vlan_core.c
+++ b/net/8021q/vlan_core.c
@@ -62,7 +62,7 @@ bool vlan_do_receive(struct sk_buff **skbp)
rx_stats = this_cpu_ptr(vlan_dev_priv(vlan_dev)->vlan_pcpu_stats);
 
	u64_stats_update_begin(&rx_stats->syncp);
-   rx_stats->rx_packets++;
+   rx_stats->rx_packets += max_t(u16, 1, skb_shinfo(skb)->gso_segs);
rx_stats->rx_bytes += skb->len;
if (skb->pkt_type == PACKET_MULTICAST)
rx_stats->rx_multicast++;
diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
index b2d9c8f27cd7..b28e7535a0b9 100644
--- a/net/8021q/vlan_dev.c
+++ b/net/8021q/vlan_dev.c
@@ -135,7 +135,7 @@ static netdev_tx_t vlan_dev_hard_start_xmit(struct sk_buff 
*skb,
 
stats = this_cpu_ptr(vlan->vlan_pcpu_stats);
	u64_stats_update_begin(&stats->syncp);
-   stats->tx_packets++;
+   stats->tx_packets += max_t(u16, 1, skb_shinfo(skb)->gso_segs);
stats->tx_bytes += len;
	u64_stats_update_end(&stats->syncp);
} else {
-- 
2.19.2



Re: Invalid transport_offset with AF_PACKET socket

2018-11-29 Thread Saeed Mahameed
On Wed, Nov 28, 2018 at 3:10 AM Maxim Mikityanskiy  wrote:
>
> Hi Saeed,
>
> > Can you elaborate more, what NIC? what configuration ? what do you mean
> > by confusion, anyway please see below
>
> ConnectX-4, after running `mlnx_qos -i eth1 --trust dscp`, which sets inline
> mode 2 (MLX5_INLINE_MODE_IP). I'll explain what I mean by confusion below.
>
> > in mlx5 with ConnectX4 or Connext4-LX there is a requirement to copy at
> > least the ethernet header to the tx descriptor otherwise this might
> > cause the packet to be dropped, and for RAW sockets the skb headers
> > offsets are not set, but the latest mlx5 upstream driver would know how
> > to handle this, and copy the minmum amount required
> > please see:
> >
> > static inline u16 mlx5e_calc_min_inline(enum mlx5_inline_modes mode,
> >   struct sk_buff *skb)
>
> Yes, I know that, and what I do is debugging an issue with this function.
>
> >
> > it should default to:
> >
> >
> > case MLX5_INLINE_MODE_L2:
> >   default:
> >   hlen = mlx5e_skb_l2_header_offset(skb);
>
> The issue appears in MLX5_INLINE_MODE_IP. I haven't tested
> MLX5_INLINE_MODE_TCP_UDP yet, though.
>
> > So it should return at least 18 and not 14.
>
> Yes, the function does its best to return at least 18, but it silently expects
> skb_transport_offset to exceed 18. In normal conditions, it will be more than
> 18, because it will be at least 14 + 20. But in my case, when I send a packet
> via an AF_PACKET socket, skb_transport_offset returns 14 (which is nonsense),
> and the driver uses this value, causing the hardware to fail, because it's
> less than 18.
>

Got it, so even if you copy 18 bytes it is not sufficient! If the packet
is IPv4 or IPv6 and the inline mode is set to MLX5_INLINE_MODE_IP in the
vport context, you must copy the IP headers as well.

But what do you expect from an AF_PACKET socket? To parse each and every
packet and set skb_transport_offset?

> > We had some issues with this in older drivers, such as kernels 4.14/15, and
> > it depends on the use case, so I need some information first:
>
> No, it's not an old kernel. We actually have this bug in our internal bug
> tracking system, and I'm trying to resolve it.
>
> > 1. What Cards do you have ? (lspci)
>
> 03:00.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]
> 03:00.1 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]
> 81:00.0 Ethernet controller: Mellanox Technologies MT27520 Family [ConnectX-3 
> Pro]
>
> Testing with ConnectX-4.
>
> > 2. What kernel/driver version are you using ?
>
> I'm on net-next-mlx5, commit 66a4b5ef638a (the latest when I started the
> investigation).
>
> > 3. what is the current enum mlx5_inline_modes seen in
> > mlx5e_calc_min_inline or sq->min_inline_mode ?
>
> MLX5_INLINE_MODE_IP, as I said above.
>
> > 4. Firmware version ? (ethtool -i)
>
> 12.22.0238 (MT_2190110032)
>
> > can you share the packet format you are sending and seeing the bad
> > behavior with
>
> Here is the hexdump of the simplest packet that causes the problem when it's
> sent through AF_PACKET after `mlnx_qos -i eth1 --trust dscp`:
>
> : 11 22 33 44 55 66 77 88 99 aa bb cc 08 00 45 00
> 0010: 00 20 00 00 40 00 40 11 ae a5 c6 12 00 01 c6 12
> 0020: 00 02 00 00 4a 38 00 0c 29 82 61 62 63 64
>
> (Please ignore the wrong UDP checksum and non-existing MACs, it doesn't matter
> at all, I tested it with completely valid packets as well. The wrong UDP
> checksum is due to a bug in our internal pypacket utility).
>
> Thanks,
> Max


RE: [PATCH RFC net-next] net: vlan/macvlan: count packets properly with gso

2018-11-29 Thread Banerjee, Debabrata
> From: Eric Dumazet 
> On 11/29/2018 01:13 PM, Duyck, Alexander H wrote:
> >
> > Also your bytes are going to be totally messed up as well since the
> > headers are trimmed in the GSO frames. It might be worthwhile to just
> > have a branch based on skb_is_gso that sets the packets and bytes
> > based on the GSO values, and one that sets it for the default case.
> >
> Note that __dev_queue_xmit() is already doing that, no need to re-
> implement in each driver.
> 
> It calls qdisc_pkt_len_init(), meaning that drivers can use qdisc_skb_cb(skb)-
> >pkt_len instead of skb->len
> 
> (Qdisc layers should not modify qdisc_skb_cb(skb)->pkt_len )

This works. But rx bytes are still a problem. I haven't spotted anything quick 
for this.
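
For illustration, Eric's suggestion applied to the macvlan tx hunk from
the RFC would look roughly like this (assumes the skb already went through
__dev_queue_xmit(), which runs qdisc_pkt_len_init(); not a tested patch):

	pcpu_stats = this_cpu_ptr(vlan->pcpu_stats);
	u64_stats_update_begin(&pcpu_stats->syncp);
	pcpu_stats->tx_packets += max_t(u16, 1, skb_shinfo(skb)->gso_segs);
	pcpu_stats->tx_bytes += qdisc_skb_cb(skb)->pkt_len;	/* gso-aware bytes */
	u64_stats_update_end(&pcpu_stats->syncp);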


Re: [PATCH bpf-next v2] tools/bpf: make libbpf _GNU_SOURCE friendly

2018-11-29 Thread Daniel Borkmann
On 11/30/2018 12:47 AM, Jakub Kicinski wrote:
> On Thu, 29 Nov 2018 15:31:45 -0800, Yonghong Song wrote:
>> During porting libbpf to bcc, I got some warnings like below:
>>   ...
>>   [  2%] Building C object 
>> src/cc/CMakeFiles/bpf-shared.dir/libbpf/src/libbpf.c.o
>>   /home/yhs/work/bcc2/src/cc/libbpf/src/libbpf.c:12:0:
>>   warning: "_GNU_SOURCE" redefined [enabled by default]
>>#define _GNU_SOURCE
>>   ...
>>   [  3%] Building C object 
>> src/cc/CMakeFiles/bpf-shared.dir/libbpf/src/libbpf_errno.c.o
>>   /home/yhs/work/bcc2/src/cc/libbpf/src/libbpf_errno.c: In function 
>> ‘libbpf_strerror’:
>>   /home/yhs/work/bcc2/src/cc/libbpf/src/libbpf_errno.c:45:7:
>>   warning: assignment makes integer from pointer without a cast [enabled by 
>> default]
>>  ret = strerror_r(err, buf, size);
>>   ...
>>
>> bcc is built with _GNU_SOURCE defined and this caused the above warnings.
>> This patch intends to make libbpf _GNU_SOURCE friendly by
>>   . defining _GNU_SOURCE in libbpf.c if it is not already defined
>>   . undefining _GNU_SOURCE where the non-GNU version of strerror_r is expected.
>>
>> Signed-off-by: Yonghong Song 
> 
> Acked-by: Jakub Kicinski 
> 
> FWIW :)
> 

Applied, thanks.


Re: pull-request: bpf-next 2018-11-30

2018-11-29 Thread David Miller
From: Daniel Borkmann 
Date: Fri, 30 Nov 2018 03:01:38 +0100

> The following pull-request contains BPF updates for your *net-next* tree.
> 
> (Getting this out a bit earlier this time to pull in a dependency from bpf.)
> 
> The main changes are:
> 
> 1) Add libbpf ABI versioning and document API naming conventions
>as well as ABI versioning process, from Andrey.
> 
> 2) Add a new sk_msg_pop_data() helper for sk_msg based BPF
>programs that is used in conjunction with sk_msg_push_data()
>for adding / removing meta data to the msg data, from John.
> 
> 3) Optimize convert_bpf_ld_abs() for 0 offset and fix various
>lib and testsuite build failures on 32 bit, from David.
> 
> 4) Make BPF prog dump for !JIT identical to how we dump subprogs
>when JIT is in use, from Yonghong.
> 
> 5) Rename btf_get_from_id() to make it more conform with libbpf
>API naming conventions, from Martin.
> 
> 6) Add a missing BPF kselftest config item, from Naresh.
> 
> Please consider pulling these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git

Pulled, thanks Daniel.


[PATCHv2 bpf 2/2] bpf: Improve socket lookup reuseport documentation

2018-11-29 Thread Joe Stringer
Improve the wording around socket lookup for reuseport sockets, and
ensure that both bpf.h headers are in sync.

Signed-off-by: Joe Stringer 
---
 include/uapi/linux/bpf.h   | 4 
 tools/include/uapi/linux/bpf.h | 8 
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 38924b306e9f..b73d574356f4 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2203,6 +2203,8 @@ union bpf_attr {
  * **CONFIG_NET** configuration option.
  * Return
  * Pointer to *struct bpf_sock*, or NULL in case of failure.
+ * For sockets with reuseport option, the *struct bpf_sock*
+ * result is from reuse->socks[] using the hash of the tuple.
  *
  * struct bpf_sock *bpf_sk_lookup_udp(void *ctx, struct bpf_sock_tuple *tuple, 
u32 tuple_size, u64 netns, u64 flags)
  * Description
@@ -2237,6 +2239,8 @@ union bpf_attr {
  * **CONFIG_NET** configuration option.
  * Return
  * Pointer to *struct bpf_sock*, or NULL in case of failure.
+ * For sockets with reuseport option, the *struct bpf_sock*
+ * result is from reuse->socks[] using the hash of the tuple.
  *
  * int bpf_sk_release(struct bpf_sock *sk)
  * Description
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 465ad585c836..b73d574356f4 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2203,8 +2203,8 @@ union bpf_attr {
  * **CONFIG_NET** configuration option.
  * Return
  * Pointer to *struct bpf_sock*, or NULL in case of failure.
- * For sockets with reuseport option, *struct bpf_sock*
- * return is from reuse->socks[] using hash of the packet.
+ * For sockets with reuseport option, the *struct bpf_sock*
+ * result is from reuse->socks[] using the hash of the tuple.
  *
  * struct bpf_sock *bpf_sk_lookup_udp(void *ctx, struct bpf_sock_tuple *tuple, 
u32 tuple_size, u64 netns, u64 flags)
  * Description
@@ -2239,8 +2239,8 @@ union bpf_attr {
  * **CONFIG_NET** configuration option.
  * Return
  * Pointer to *struct bpf_sock*, or NULL in case of failure.
- * For sockets with reuseport option, *struct bpf_sock*
- * return is from reuse->socks[] using hash of the packet.
+ * For sockets with reuseport option, the *struct bpf_sock*
+ * result is from reuse->socks[] using the hash of the tuple.
  *
  * int bpf_sk_release(struct bpf_sock *sk)
  * Description
-- 
2.17.1



[PATCHv2 bpf 1/2] bpf: Support sk lookup in netns with id 0

2018-11-29 Thread Joe Stringer
David Ahern and Nicolas Dichtel report that the handling of the netns id
0 is incorrect for the BPF socket lookup helpers: rather than finding
the netns with id 0, it is resolving to the current netns. This renders
the netns_id 0 inaccessible.

To fix this, adjust the API for the netns to treat all negative s32
values as a lookup in the current netns, while any values with a
positive value in the signed 32-bit integer space would result in a
lookup for a socket in the netns corresponding to that id. As before, if
the netns with that ID does not exist, no socket will be found.
Furthermore, if any bits are set in the upper 32-bits, then no socket
will be found.

Signed-off-by: Joe Stringer 
---
 include/uapi/linux/bpf.h  | 35 ++---
 net/core/filter.c | 11 +++---
 tools/include/uapi/linux/bpf.h| 39 ---
 tools/testing/selftests/bpf/bpf_helpers.h |  4 +-
 .../selftests/bpf/test_sk_lookup_kern.c   | 18 -
 5 files changed, 63 insertions(+), 44 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 852dc17ab47a..38924b306e9f 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2170,7 +2170,7 @@ union bpf_attr {
  * Return
  * 0 on success, or a negative error in case of failure.
  *
- * struct bpf_sock *bpf_sk_lookup_tcp(void *ctx, struct bpf_sock_tuple *tuple, 
u32 tuple_size, u32 netns, u64 flags)
+ * struct bpf_sock *bpf_sk_lookup_tcp(void *ctx, struct bpf_sock_tuple *tuple, 
u32 tuple_size, u64 netns, u64 flags)
  * Description
  * Look for TCP socket matching *tuple*, optionally in a child
  * network namespace *netns*. The return value must be checked,
@@ -2187,12 +2187,14 @@ union bpf_attr {
  * **sizeof**\ (*tuple*\ **->ipv6**)
  * Look for an IPv6 socket.
  *
- * If the *netns* is zero, then the socket lookup table in the
- * netns associated with the *ctx* will be used. For the TC hooks,
- * this in the netns of the device in the skb. For socket hooks,
- * this in the netns of the socket. If *netns* is non-zero, then
- * it specifies the ID of the netns relative to the netns
- * associated with the *ctx*.
+ * If the *netns* is a negative signed 32-bit integer, then the
+ * socket lookup table in the netns associated with the *ctx* will
+ * be used. For the TC hooks, this is the netns of the device
+ * in the skb. For socket hooks, this is the netns of the socket.
+ * If *netns* is any other signed 32-bit value greater than or
+ * equal to zero then it specifies the ID of the netns relative to
+ * the netns associated with the *ctx*. *netns* values beyond the
+ * range of 32-bit integers are reserved for future use.
  *
  * All values for *flags* are reserved for future usage, and must
  * be left at zero.
@@ -2202,7 +2204,7 @@ union bpf_attr {
  * Return
  * Pointer to *struct bpf_sock*, or NULL in case of failure.
  *
- * struct bpf_sock *bpf_sk_lookup_udp(void *ctx, struct bpf_sock_tuple *tuple, 
u32 tuple_size, u32 netns, u64 flags)
+ * struct bpf_sock *bpf_sk_lookup_udp(void *ctx, struct bpf_sock_tuple *tuple, 
u32 tuple_size, u64 netns, u64 flags)
  * Description
  * Look for UDP socket matching *tuple*, optionally in a child
  * network namespace *netns*. The return value must be checked,
@@ -2219,12 +2221,14 @@ union bpf_attr {
  * **sizeof**\ (*tuple*\ **->ipv6**)
  * Look for an IPv6 socket.
  *
- * If the *netns* is zero, then the socket lookup table in the
- * netns associated with the *ctx* will be used. For the TC hooks,
- * this in the netns of the device in the skb. For socket hooks,
- * this in the netns of the socket. If *netns* is non-zero, then
- * it specifies the ID of the netns relative to the netns
- * associated with the *ctx*.
+ * If the *netns* is a negative signed 32-bit integer, then the
+ * socket lookup table in the netns associated with the *ctx* will
+ * be used. For the TC hooks, this is the netns of the device
+ * in the skb. For socket hooks, this is the netns of the socket.
+ * If *netns* is any other signed 32-bit value greater than or
+ * equal to zero then it specifies the ID of the netns relative to
+ * the netns associated with the *ctx*. *netns* values beyond the
+ * range of 32-bit integers are reserved for future use.
  *
  * All values for *flags* are reserved for future usage, and must
  * be left at zero.
@@ -2405,6 +2409,9 @@ enum bpf_func_id {
 /* 

Re: [Patch net v2] mlx5: fixup checksum for short ethernet frame padding

2018-11-29 Thread Saeed Mahameed
On Wed, Nov 28, 2018 at 8:09 PM Eric Dumazet  wrote:
>
> On Wed, Nov 28, 2018 at 7:53 PM Cong Wang  wrote:
> >
> > On Wed, Nov 28, 2018 at 7:50 PM Eric Dumazet  wrote:
> > >
> > > On Wed, Nov 28, 2018 at 7:40 PM Cong Wang  
> > > wrote:
> > > >
> > > > On Wed, Nov 28, 2018 at 4:07 PM Eric Dumazet  
> > > > wrote:
> > > > >
> > > > > A NIC is supposed to deliver frames, even the ones that 'seem' bad.
> > > >
> > > > A quick test shows this is not the case for mlx5.
> > > >
> > > > With the trafgen script you gave to me, with tot_len==40, the dest host
> > > > could receive all the packets. Changing tot_len to 80, tcpdump could no
> > > > longer see any packet. (Both sender and receiver are mlx5.)
> > > >
> > > > So, packets with tot_len > skb->len are clearly dropped before tcpdump
> > > > could see it, that is likely by mlx5 hardware.
> > >
> > > Or a router, or a switch.
> > >
> > > Are your two hosts connected back to back ?
> >
> > Both should be plugged into the same switch. I fail to see why a
> > switch would parse the IP header, as the packet is nothing of interest
> > to it, e.g. for IGMP snooping.
>
> Well, _something_ is dropping the frames.
> It can be mlx5, or something else.
>

mlx5 HW should deliver such packets.
just tested with scapy :
sender:

#scapy
p = Ether(dst="24:8a:07:b4:24:6e")/IP(dst="1.2.3.4", len = 1000)
sendp(p, iface = "p5p1")

receiver:
tcpdump: listening on p5p1, link-type EN10MB (Ethernet), capture size
262144 bytes
16:24:26.427563 80:18:44:e5:2f:c4 > 24:8a:07:b4:24:6e, ethertype IPv4
(0x0800), length 60: truncated-ip - 954 bytes missing! (tos 0x0, ttl
64, id 1, offset 0, flags [none], proto Options (0), length 1000)
10.20.2.212 > 1.2.3.4:  ip-proto-0 980


> Does ethtool -S show any increasing counter ?


Re: [PATCH v2 net] net: Prevent invalid access to skb->prev in __qdisc_drop_all

2018-11-29 Thread David Miller
From: Christoph Paasch 
Date: Thu, 29 Nov 2018 16:01:04 -0800

> __qdisc_drop_all() accesses skb->prev to get to the tail of the
> segment-list.
> 
> With commit 68d2f84a1368 ("net: gro: properly remove skb from list")
> the skb-list handling has been changed to set skb->next to NULL and set
> the list-poison on skb->prev.
> 
> With that change, __qdisc_drop_all() will panic when it tries to
> dereference skb->prev.
> 
> Since commit 992cba7e276d ("net: Add and use skb_list_del_init().")
> __list_del_entry is used, leaving skb->prev unchanged (thus,
> pointing to the list-head if it's the first skb of the list).
> This will make __qdisc_drop_all modify the next-pointer of the list-head
> and result in a panic later on:
  ...
> This patch makes sure that skb->prev is set to NULL when entering
> netem_enqueue.
> 
> Cc: Prashant Bhole 
> Cc: Tyler Hicks 
> Cc: Eric Dumazet 
> Fixes: 68d2f84a1368 ("net: gro: properly remove skb from list")
> Suggested-by: Eric Dumazet 
> Signed-off-by: Christoph Paasch 

Applied and queued up for -stable, thanks!


[PATCH net] tun: remove skb access after netif_receive_skb

2018-11-29 Thread Prashant Bhole
In tun.c skb->len was accessed while doing stats accounting after a
call to netif_receive_skb(). We cannot access the skb after this call
because buffers may be dropped.

The fix for this bug would be to store skb->len in a local variable and
then use it after netif_receive_skb(). IMO using xdp data size for
accounting bytes will be better because input for tun_xdp_one() is
xdp_buff.

Hence this patch:
- fixes a bug by removing skb access after netif_receive_skb()
- uses xdp data size for accounting bytes

[613.019057] BUG: KASAN: use-after-free in tun_sendmsg+0x77c/0xc50 [tun]
[613.021062] Read of size 4 at addr 8881da9ab7c0 by task vhost-1115/1155
[613.023073]
[613.024003] CPU: 0 PID: 1155 Comm: vhost-1115 Not tainted 4.20.0-rc3-vm+ #232
[613.026029] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.2-1ubuntu1 04/01/2014
[613.029116] Call Trace:
[613.031145]  dump_stack+0x5b/0x90
[613.032219]  print_address_description+0x6c/0x23c
[613.034156]  ? tun_sendmsg+0x77c/0xc50 [tun]
[613.036141]  kasan_report.cold.5+0x241/0x308
[613.038125]  tun_sendmsg+0x77c/0xc50 [tun]
[613.040109]  ? tun_get_user+0x1960/0x1960 [tun]
[613.042094]  ? __isolate_free_page+0x270/0x270
[613.045173]  vhost_tx_batch.isra.14+0xeb/0x1f0 [vhost_net]
[613.047127]  ? peek_head_len.part.13+0x90/0x90 [vhost_net]
[613.049096]  ? get_tx_bufs+0x5a/0x2c0 [vhost_net]
[613.051106]  ? vhost_enable_notify+0x2d8/0x420 [vhost]
[613.053139]  handle_tx_copy+0x2d0/0x8f0 [vhost_net]
[613.053139]  ? vhost_net_buf_peek+0x340/0x340 [vhost_net]
[613.053139]  ? __mutex_lock+0x8d9/0xb30
[613.053139]  ? finish_task_switch+0x8f/0x3f0
[613.053139]  ? handle_tx+0x32/0x120 [vhost_net]
[613.053139]  ? mutex_trylock+0x110/0x110
[613.053139]  ? finish_task_switch+0xcf/0x3f0
[613.053139]  ? finish_task_switch+0x240/0x3f0
[613.053139]  ? __switch_to_asm+0x34/0x70
[613.053139]  ? __switch_to_asm+0x40/0x70
[613.053139]  ? __schedule+0x506/0xf10
[613.053139]  handle_tx+0xc7/0x120 [vhost_net]
[613.053139]  vhost_worker+0x166/0x200 [vhost]
[613.053139]  ? vhost_dev_init+0x580/0x580 [vhost]
[613.053139]  ? __kthread_parkme+0x77/0x90
[613.053139]  ? vhost_dev_init+0x580/0x580 [vhost]
[613.053139]  kthread+0x1b1/0x1d0
[613.053139]  ? kthread_park+0xb0/0xb0
[613.053139]  ret_from_fork+0x35/0x40
[613.088705]
[613.088705] Allocated by task 1155:
[613.088705]  kasan_kmalloc+0xbf/0xe0
[613.088705]  kmem_cache_alloc+0xdc/0x220
[613.088705]  __build_skb+0x2a/0x160
[613.088705]  build_skb+0x14/0xc0
[613.088705]  tun_sendmsg+0x4f0/0xc50 [tun]
[613.088705]  vhost_tx_batch.isra.14+0xeb/0x1f0 [vhost_net]
[613.088705]  handle_tx_copy+0x2d0/0x8f0 [vhost_net]
[613.088705]  handle_tx+0xc7/0x120 [vhost_net]
[613.088705]  vhost_worker+0x166/0x200 [vhost]
[613.088705]  kthread+0x1b1/0x1d0
[613.088705]  ret_from_fork+0x35/0x40
[613.088705]
[613.088705] Freed by task 1155:
[613.088705]  __kasan_slab_free+0x12e/0x180
[613.088705]  kmem_cache_free+0xa0/0x230
[613.088705]  ip6_mc_input+0x40f/0x5a0
[613.088705]  ipv6_rcv+0xc9/0x1e0
[613.088705]  __netif_receive_skb_one_core+0xc1/0x100
[613.088705]  netif_receive_skb_internal+0xc4/0x270
[613.088705]  br_pass_frame_up+0x2b9/0x2e0
[613.088705]  br_handle_frame_finish+0x2fb/0x7a0
[613.088705]  br_handle_frame+0x30f/0x6c0
[613.088705]  __netif_receive_skb_core+0x61a/0x15b0
[613.088705]  __netif_receive_skb_one_core+0x8e/0x100
[613.088705]  netif_receive_skb_internal+0xc4/0x270
[613.088705]  tun_sendmsg+0x738/0xc50 [tun]
[613.088705]  vhost_tx_batch.isra.14+0xeb/0x1f0 [vhost_net]
[613.088705]  handle_tx_copy+0x2d0/0x8f0 [vhost_net]
[613.088705]  handle_tx+0xc7/0x120 [vhost_net]
[613.088705]  vhost_worker+0x166/0x200 [vhost]
[613.088705]  kthread+0x1b1/0x1d0
[613.088705]  ret_from_fork+0x35/0x40
[613.088705]
[613.088705] The buggy address belongs to the object at 8881da9ab740
[613.088705]  which belongs to the cache skbuff_head_cache of size 232

Fixes: 043d222f93ab ("tuntap: accept an array of XDP buffs through sendmsg()")
Reviewed-by: Toshiaki Makita 
Signed-off-by: Prashant Bhole 
---
 drivers/net/tun.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index e244f5d7512a..6e388792c0a8 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2385,6 +2385,7 @@ static int tun_xdp_one(struct tun_struct *tun,
   struct tun_file *tfile,
   struct xdp_buff *xdp, int *flush)
 {
+   unsigned int datasize = xdp->data_end - xdp->data;
struct tun_xdp_hdr *hdr = xdp->data_hard_start;
struct virtio_net_hdr *gso = &hdr->gso;
struct tun_pcpu_stats *stats;
@@ -2461,7 +2462,7 @@ static int tun_xdp_one(struct tun_struct *tun,
stats = get_cpu_ptr(tun->pcpu_stats);
u64_stats_update_begin(&stats->syncp);
stats->rx_packets++;
-   stats->rx_bytes += skb->len;
+   stats->rx_bytes += datasize;
u64_stats_update_end(&stats->syncp);
put_cpu_ptr(stats);
 
-- 
2.17.2




Re: [PATCH net] tun: remove skb access after netif_receive_skb

2018-11-29 Thread Jason Wang



On 2018/11/30 10:30 AM, Prashant Bhole wrote:

In tun.c skb->len was accessed while doing stats accounting after a
call to netif_receive_skb(). We cannot access the skb after this call
because buffers may be dropped.

The fix for this bug would be to store skb->len in a local variable and
then use it after netif_receive_skb(). IMO using xdp data size for
accounting bytes will be better because input for tun_xdp_one() is
xdp_buff.

Hence this patch:
- fixes a bug by removing skb access after netif_receive_skb()
- uses xdp data size for accounting bytes

[613.019057] BUG: KASAN: use-after-free in tun_sendmsg+0x77c/0xc50 [tun]
[613.021062] Read of size 4 at addr 8881da9ab7c0 by task vhost-1115/1155
[613.023073]
[613.024003] CPU: 0 PID: 1155 Comm: vhost-1115 Not tainted 4.20.0-rc3-vm+ #232
[613.026029] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.2-1ubuntu1 04/01/2014
[613.029116] Call Trace:
[613.031145]  dump_stack+0x5b/0x90
[613.032219]  print_address_description+0x6c/0x23c
[613.034156]  ? tun_sendmsg+0x77c/0xc50 [tun]
[613.036141]  kasan_report.cold.5+0x241/0x308
[613.038125]  tun_sendmsg+0x77c/0xc50 [tun]
[613.040109]  ? tun_get_user+0x1960/0x1960 [tun]
[613.042094]  ? __isolate_free_page+0x270/0x270
[613.045173]  vhost_tx_batch.isra.14+0xeb/0x1f0 [vhost_net]
[613.047127]  ? peek_head_len.part.13+0x90/0x90 [vhost_net]
[613.049096]  ? get_tx_bufs+0x5a/0x2c0 [vhost_net]
[613.051106]  ? vhost_enable_notify+0x2d8/0x420 [vhost]
[613.053139]  handle_tx_copy+0x2d0/0x8f0 [vhost_net]
[613.053139]  ? vhost_net_buf_peek+0x340/0x340 [vhost_net]
[613.053139]  ? __mutex_lock+0x8d9/0xb30
[613.053139]  ? finish_task_switch+0x8f/0x3f0
[613.053139]  ? handle_tx+0x32/0x120 [vhost_net]
[613.053139]  ? mutex_trylock+0x110/0x110
[613.053139]  ? finish_task_switch+0xcf/0x3f0
[613.053139]  ? finish_task_switch+0x240/0x3f0
[613.053139]  ? __switch_to_asm+0x34/0x70
[613.053139]  ? __switch_to_asm+0x40/0x70
[613.053139]  ? __schedule+0x506/0xf10
[613.053139]  handle_tx+0xc7/0x120 [vhost_net]
[613.053139]  vhost_worker+0x166/0x200 [vhost]
[613.053139]  ? vhost_dev_init+0x580/0x580 [vhost]
[613.053139]  ? __kthread_parkme+0x77/0x90
[613.053139]  ? vhost_dev_init+0x580/0x580 [vhost]
[613.053139]  kthread+0x1b1/0x1d0
[613.053139]  ? kthread_park+0xb0/0xb0
[613.053139]  ret_from_fork+0x35/0x40
[613.088705]
[613.088705] Allocated by task 1155:
[613.088705]  kasan_kmalloc+0xbf/0xe0
[613.088705]  kmem_cache_alloc+0xdc/0x220
[613.088705]  __build_skb+0x2a/0x160
[613.088705]  build_skb+0x14/0xc0
[613.088705]  tun_sendmsg+0x4f0/0xc50 [tun]
[613.088705]  vhost_tx_batch.isra.14+0xeb/0x1f0 [vhost_net]
[613.088705]  handle_tx_copy+0x2d0/0x8f0 [vhost_net]
[613.088705]  handle_tx+0xc7/0x120 [vhost_net]
[613.088705]  vhost_worker+0x166/0x200 [vhost]
[613.088705]  kthread+0x1b1/0x1d0
[613.088705]  ret_from_fork+0x35/0x40
[613.088705]
[613.088705] Freed by task 1155:
[613.088705]  __kasan_slab_free+0x12e/0x180
[613.088705]  kmem_cache_free+0xa0/0x230
[613.088705]  ip6_mc_input+0x40f/0x5a0
[613.088705]  ipv6_rcv+0xc9/0x1e0
[613.088705]  __netif_receive_skb_one_core+0xc1/0x100
[613.088705]  netif_receive_skb_internal+0xc4/0x270
[613.088705]  br_pass_frame_up+0x2b9/0x2e0
[613.088705]  br_handle_frame_finish+0x2fb/0x7a0
[613.088705]  br_handle_frame+0x30f/0x6c0
[613.088705]  __netif_receive_skb_core+0x61a/0x15b0
[613.088705]  __netif_receive_skb_one_core+0x8e/0x100
[613.088705]  netif_receive_skb_internal+0xc4/0x270
[613.088705]  tun_sendmsg+0x738/0xc50 [tun]
[613.088705]  vhost_tx_batch.isra.14+0xeb/0x1f0 [vhost_net]
[613.088705]  handle_tx_copy+0x2d0/0x8f0 [vhost_net]
[613.088705]  handle_tx+0xc7/0x120 [vhost_net]
[613.088705]  vhost_worker+0x166/0x200 [vhost]
[613.088705]  kthread+0x1b1/0x1d0
[613.088705]  ret_from_fork+0x35/0x40
[613.088705]
[613.088705] The buggy address belongs to the object at 8881da9ab740
[613.088705]  which belongs to the cache skbuff_head_cache of size 232

Fixes: 043d222f93ab ("tuntap: accept an array of XDP buffs through sendmsg()")
Reviewed-by: Toshiaki Makita 
Signed-off-by: Prashant Bhole 
---
  drivers/net/tun.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index e244f5d7512a..6e388792c0a8 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2385,6 +2385,7 @@ static int tun_xdp_one(struct tun_struct *tun,
   struct tun_file *tfile,
   struct xdp_buff *xdp, int *flush)
  {
+   unsigned int datasize = xdp->data_end - xdp->data;
struct tun_xdp_hdr *hdr = xdp->data_hard_start;
struct virtio_net_hdr *gso = &hdr->gso;
struct tun_pcpu_stats *stats;
@@ -2461,7 +2462,7 @@ static int tun_xdp_one(struct tun_struct *tun,
stats = get_cpu_ptr(tun->pcpu_stats);
u64_stats_update_begin(&stats->syncp);
stats->rx_packets++;
-   stats->rx_bytes += skb->len;
+   stats->rx_bytes += datasize;
u64_stats_update_end(&stats->syncp);
put_cpu_ptr(stats);
  




Re: [PATCH v2 net] net: Prevent invalid access to skb->prev in __qdisc_drop_all

2018-11-29 Thread Eric Dumazet



On 11/29/2018 04:01 PM, Christoph Paasch wrote:
> __qdisc_drop_all() accesses skb->prev to get to the tail of the
> segment-list.
> 
> With commit 68d2f84a1368 ("net: gro: properly remove skb from list")
> the skb-list handling has been changed to set skb->next to NULL and set
> the list-poison on skb->prev.
> 
> With that change, __qdisc_drop_all() will panic when it tries to
> dereference skb->prev.
> 

Reviewed-by: Eric Dumazet 

Thanks !



Re: [PATCH mlx5-next 00/13] Mellanox, mlx5 core driver events API

2018-11-29 Thread Saeed Mahameed
On Mon, Nov 26, 2018 at 2:39 PM Saeed Mahameed  wrote:
>
> Hi,
>
> This patchset is for mlx5-next shared branch, and will be applied there
> once the review is done.
>
> The main idea of this change is to define a flexible scalable and
> simpler high level mlx5 core APIs to forward device and driver events
> to upper level interface drivers e.g mlx5_ib and mlx5e netdevice driver.
>
> Patch #1, Driver events notifier API:
>
> Use atomic notifier chain to fire events to mlx5 core driver
> consumers (mlx5e/mlx5_ib) and provide dynamic mlx5 register/unregister
> notifier API to replace the internal mlx5_interface->event and totally
> remove it.
>
> Patch #2, Forward port events via the new notifier chain.
> Patch #3, Use the mlx5 events API in mlx5e netdevice.
> Patch #4, Forward all device events sent via mlx5_core_event (old API)
> to the new notifier chain, this will allow seamless transition to the
> new AP.
> Patch #5, Use the new events API in mlx5_IB
> Patch #6, remove old interface callback, mlx5_interface->event
> Patch #7,8,9, mlx5_ib to handle raw FW events as is rather than the
> software version of them, this will remove any need for unnecessary
> processing of FW events in the low level mlx5_core driver.
> Patch #10, Remove unnecessary processing of FW events in the low level
> mlx5_core driver, all events are handled on demand by mlx5_ib and/or
> mlx5e netdevice.
>
> patch #11,12, forward QP and SRQ events via the new notifier chain,
> will be needed by mlx5_ib driver.
>
> Patch #13, Debug patch for mlx5 events.
>
> Thanks,
> Saeed.
>
> ---
>
> Saeed Mahameed (13):
>   net/mlx5: Driver events notifier API
>   net/mlx5: Allow port change event to be forwarded to driver notifiers
> chain
>   net/mlx5e: Use the new mlx5 core notifier API
>   net/mlx5: Forward all mlx5 events to mlx5 notifiers chain
>   IB/mlx5: Use the new mlx5 core notifier API
>   net/mlx5: Remove unused events callback and logic
>   IB/mlx5: Handle raw port change event rather than the software version
>   net/mlx5: Allow forwarding event type general event as is
>   IB/mlx5: Handle raw delay drop general event
>   net/mlx5: Remove all deprecated software versions of FW events
>   net/mlx5: Forward QP/WorkQueues resource events
>   net/mlx5: Forward SRQ resource events
>   net/mlx5: Debug print for forwarded async events
>

Applied to mlx5-next.

Thanks,
Saeed.
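
For readers who have not seen the pattern described for patch #1, a rough
consumer-side sketch is below; it only shows the generic notifier_block
plumbing, and the mlx5_notifier_register()/unregister() names are inferred
from the cover letter rather than taken from the patches themselves.

/* Hedged sketch: an interface driver (mlx5e/mlx5_ib style consumer)
 * subscribing to core driver events via a notifier chain instead of the
 * old mlx5_interface->event callback.
 */
#include <linux/notifier.h>

static int my_event_handler(struct notifier_block *nb, unsigned long event,
			    void *data)
{
	/* 'event' identifies the (raw FW) event type, 'data' its payload */
	return NOTIFY_OK;
}

static struct notifier_block my_events_nb = {
	.notifier_call = my_event_handler,
};

/* At add time:    mlx5_notifier_register(mdev, &my_events_nb);   (assumed name)
 * At remove time: mlx5_notifier_unregister(mdev, &my_events_nb); (assumed name)
 */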


pull-request: bpf-next 2018-11-30

2018-11-29 Thread Daniel Borkmann
Hi David,

The following pull-request contains BPF updates for your *net-next* tree.

(Getting this out a bit earlier this time to pull in a dependency from bpf.)

The main changes are:

1) Add libbpf ABI versioning and document API naming conventions
   as well as ABI versioning process, from Andrey.

2) Add a new sk_msg_pop_data() helper for sk_msg based BPF
   programs that is used in conjunction with sk_msg_push_data()
   for adding / removing meta data to the msg data, from John.

3) Optimize convert_bpf_ld_abs() for 0 offset and fix various
   lib and testsuite build failures on 32 bit, from David.

4) Make BPF prog dump for !JIT identical to how we dump subprogs
   when JIT is in use, from Yonghong.

5) Rename btf_get_from_id() to make it more conform with libbpf
   API naming conventions, from Martin.

6) Add a missing BPF kselftest config item, from Naresh.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git

Thanks a lot!



The following changes since commit 4afe60a97ba6ffacc4d030b13653dc64099fea26:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next (2018-11-26 
13:08:17 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git 

for you to fetch changes up to b42699547fc9fb1057795bccc21a6445743a7fde:

  tools/bpf: make libbpf _GNU_SOURCE friendly (2018-11-30 02:41:02 +0100)


Alexei Starovoitov (2):
  Merge branch 'non-jit-btf-func_info'
  Merge branch 'libbpf-versioning-doc'

Andrey Ignatov (3):
  libbpf: Add version script for DSO
  libbpf: Verify versioned symbols
  libbpf: Document API and ABI conventions

Daniel Borkmann (1):
  Merge branch 'bpf-sk-msg-pop-data'

David Miller (2):
  bpf: Avoid unnecessary instruction in convert_bpf_ld_abs()
  bpf: Fix various lib and testsuite build failures on 32-bit.

John Fastabend (3):
  bpf: helper to pop data from messages
  bpf: add msg_pop_data helper to tools
  bpf: test_sockmap, add options for msg_pop_data() helper

Martin KaFai Lau (1):
  libbpf: Name changing for btf_get_from_id

Naresh Kamboju (1):
  selftests/bpf: add config fragment CONFIG_FTRACE_SYSCALLS

Yonghong Song (3):
  bpf: btf: support proper non-jit func info
  tools/bpf: change selftest test_btf for both jit and non-jit
  tools/bpf: make libbpf _GNU_SOURCE friendly

 include/linux/bpf.h |   6 +-
 include/linux/bpf_verifier.h|   1 -
 include/uapi/linux/bpf.h|  16 ++-
 kernel/bpf/core.c   |   3 +-
 kernel/bpf/syscall.c|  33 ++---
 kernel/bpf/verifier.c   |  55 +---
 net/core/filter.c   | 174 +++-
 net/ipv4/tcp_bpf.c  |  17 ++-
 net/tls/tls_sw.c|  11 +-
 tools/bpf/bpftool/map.c |   4 +-
 tools/bpf/bpftool/prog.c|   2 +-
 tools/include/uapi/linux/bpf.h  |  16 ++-
 tools/lib/bpf/Makefile  |  23 +++-
 tools/lib/bpf/README.rst| 139 +++
 tools/lib/bpf/btf.c |   4 +-
 tools/lib/bpf/btf.h |   2 +-
 tools/lib/bpf/libbpf.c  |   2 +
 tools/lib/bpf/libbpf.map| 121 
 tools/lib/bpf/libbpf_errno.c|   1 +
 tools/testing/selftests/bpf/bpf_helpers.h   |   2 +
 tools/testing/selftests/bpf/config  |   1 +
 tools/testing/selftests/bpf/test_btf.c  |  35 +
 tools/testing/selftests/bpf/test_progs.c|  10 +-
 tools/testing/selftests/bpf/test_sockmap.c  | 127 -
 tools/testing/selftests/bpf/test_sockmap_kern.h |  70 --
 25 files changed, 760 insertions(+), 115 deletions(-)
 create mode 100644 tools/lib/bpf/README.rst
 create mode 100644 tools/lib/bpf/libbpf.map


[PATCH RFC net-next v2] net: vlan/macvlan: count packets properly with gso

2018-11-29 Thread Debabrata Banerjee
Fix packet count when using vlan/macvlan drivers with gso. Without this it
is not possible to reconcile packet counts between underlying devices and
these virtual devices. Additionally, the output looks wrong even on its
own, e.g. a device with an MTU of 1500 reporting 1 packet sent but 31856
bytes sent.

There are many other drivers that likely have a similar problem, although
it is not clear how many of those could be used with gso. Perhaps all
packet counting should be wrapped in a helper fn.
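
One hypothetical shape for such a helper, purely as a sketch and not part
of this patch:

/* Hypothetical helper (names made up for illustration; would live next to
 * the other skbuff helpers): derive tx packet/byte counts that survive
 * GSO, so virtual devices and their lower devices can agree on what was
 * sent.
 */
static inline void skb_tx_stats(const struct sk_buff *skb,
				unsigned int *packets, unsigned int *bytes)
{
	*packets = skb_is_gso(skb) ? skb_shinfo(skb)->gso_segs : 1;
	/* pkt_len is filled in by qdisc_pkt_len_init() on transmit and
	 * includes the headers replicated into every segment.
	 */
	*bytes = qdisc_skb_cb(skb)->pkt_len;
}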

v2: bytes were also incorrect for gso skb's, fix tx as that is readily
available. Fix rx packets for macvlan.

Signed-off-by: Debabrata Banerjee 
---
 drivers/net/ethernet/intel/fm10k/fm10k_main.c |  4 ++--
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  4 ++--
 drivers/net/macvlan.c | 13 -
 drivers/net/macvtap.c |  2 +-
 include/linux/if_macvlan.h|  6 +++---
 net/8021q/vlan_core.c |  3 ++-
 net/8021q/vlan_dev.c  |  4 ++--
 7 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c 
b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
index 6fd15a734324..e39fad2d888f 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
@@ -431,8 +431,8 @@ static void fm10k_type_trans(struct fm10k_ring *rx_ring,
if (!l2_accel)
skb_record_rx_queue(skb, rx_ring->queue_index);
else
-   macvlan_count_rx(netdev_priv(dev), skb->len + ETH_HLEN, true,
-false);
+   macvlan_count_rx(netdev_priv(dev), skb_shinfo(skb)->gso_segs ?: 
1,
+skb->len + ETH_HLEN, true, false);
 
skb->protocol = eth_type_trans(skb, dev);
 }
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 49a4ea38eb07..474e72ec68b0 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1704,8 +1704,8 @@ void ixgbe_process_skb_fields(struct ixgbe_ring *rx_ring,
if (netif_is_ixgbe(dev))
skb_record_rx_queue(skb, rx_ring->queue_index);
else
-   macvlan_count_rx(netdev_priv(dev), skb->len + ETH_HLEN, true,
-false);
+   macvlan_count_rx(netdev_priv(dev), skb_shinfo(skb)->gso_segs ?: 
1,
+skb->len + ETH_HLEN, true, false);
 
skb->protocol = eth_type_trans(skb, dev);
 }
diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index fc8d5f1ee1ad..ab8743777897 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -289,7 +289,8 @@ static void macvlan_broadcast(struct sk_buff *skb,
nskb, vlan, eth,
mode == MACVLAN_MODE_BRIDGE) ?:
  netif_rx_ni(nskb);
-   macvlan_count_rx(vlan, skb->len + ETH_HLEN,
+   macvlan_count_rx(vlan, skb_shinfo(skb)->gso_segs ?: 1,
+skb->len + ETH_HLEN,
 err == NET_RX_SUCCESS, true);
}
}
@@ -418,7 +419,8 @@ static void macvlan_forward_source_one(struct sk_buff *skb,
nskb->pkt_type = PACKET_HOST;
 
ret = netif_rx(nskb);
-   macvlan_count_rx(vlan, len, ret == NET_RX_SUCCESS, false);
+   macvlan_count_rx(vlan, skb_shinfo(skb)->gso_segs ?: 1, len,
+ret == NET_RX_SUCCESS, false);
 }
 
 static void macvlan_forward_source(struct sk_buff *skb,
@@ -505,7 +507,8 @@ static rx_handler_result_t macvlan_handle_frame(struct 
sk_buff **pskb)
ret = NET_RX_SUCCESS;
handle_res = RX_HANDLER_ANOTHER;
 out:
-   macvlan_count_rx(vlan, len, ret == NET_RX_SUCCESS, false);
+   macvlan_count_rx(vlan, skb_shinfo(skb)->gso_segs ?: 1, len,
+ret == NET_RX_SUCCESS, false);
return handle_res;
 }
 
@@ -553,7 +556,7 @@ static netdev_tx_t macvlan_start_xmit(struct sk_buff *skb,
  struct net_device *dev)
 {
struct macvlan_dev *vlan = netdev_priv(dev);
-   unsigned int len = skb->len;
+   unsigned int len = qdisc_skb_cb(skb)->pkt_len;
int ret;
 
if (unlikely(netpoll_tx_running(dev)))
@@ -566,7 +569,7 @@ static netdev_tx_t macvlan_start_xmit(struct sk_buff *skb,
 
pcpu_stats = this_cpu_ptr(vlan->pcpu_stats);
u64_stats_update_begin(&pcpu_stats->syncp);
-   pcpu_stats->tx_packets++;
+   pcpu_stats->tx_packets += skb_shinfo(skb)->gso_segs ?: 1;
pcpu_stats->tx_bytes += len;
u64_stats_update_end(&pcpu_stats->syncp);
} else {
diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 

Re: [PATCH RFC net-next] net: vlan/macvlan: count packets properly with gso

2018-11-29 Thread Cong Wang
On Thu, Nov 29, 2018 at 1:46 PM Eric Dumazet  wrote:
>
>
>
> On 11/29/2018 01:13 PM, Duyck, Alexander H wrote:
>
> > Instead of just checking for the max it might make more sense to do a
> > check using skb_is_gso, and then if true use gso_segs, otherwise just
> > default to 1.
> >
> > Also your bytes are going to be totally messed up as well since the
> > headers are trimmed in the GSO frames. It might be worthwhile to just
> > have a branch based on skb_is_gso that sets the packets and bytes based
> > on the GSO values, and one that sets it for the default case.
> >
> Note that __dev_queue_xmit() is already doing that, no need
> to re-implement in each driver.
>
> It calls qdisc_pkt_len_init(), meaning that drivers can use
> qdisc_skb_cb(skb)->pkt_len instead of skb->len
>
> (Qdisc layers should not modify qdisc_skb_cb(skb)->pkt_len )

It should not modify it, but on the other hand it is not supposed to be
used outside the qdisc layer either. At the very least, you have to
audit each ->ndo_start_xmit() to see if it uses its own skb->cb
first.
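
For context, a heavily simplified paraphrase of what qdisc_pkt_len_init()
stores (the real code in net/core/dev.c also handles UDP GSO, SKB_GSO_DODGY
and header sanity checks):

/* Simplified sketch of qdisc_pkt_len_init() (net/core/dev.c): pkt_len is
 * skb->len plus the headers that will be replicated into every extra GSO
 * segment, making it a better "bytes on the wire" estimate than skb->len.
 */
static void qdisc_pkt_len_init_sketch(struct sk_buff *skb)
{
	const struct skb_shared_info *shinfo = skb_shinfo(skb);

	qdisc_skb_cb(skb)->pkt_len = skb->len;

	if (shinfo->gso_size) {
		/* MAC + network headers of one segment ... */
		unsigned int hdr_len = skb_transport_header(skb) -
				       skb_mac_header(skb);

		/* ... plus the transport header (assuming TCP GSO here) */
		hdr_len += tcp_hdrlen(skb);

		qdisc_skb_cb(skb)->pkt_len += (shinfo->gso_segs - 1) * hdr_len;
	}
}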


RE: [next-queue PATCH v1 1/2] igb: Change RXPBSIZE size when setting Qav mode

2018-11-29 Thread Brown, Aaron F
> From: netdev-ow...@vger.kernel.org [mailto:netdev-
> ow...@vger.kernel.org] On Behalf Of Vinicius Costa Gomes
> Sent: Friday, November 16, 2018 4:19 PM
> To: intel-wired-...@lists.osuosl.org
> Cc: Sanchez-Palencia, Jesus ; Kirsher,
> Jeffrey T ; netdev@vger.kernel.org;
> jesus.s.palen...@gmail.com
> Subject: [next-queue PATCH v1 1/2] igb: Change RXPBSIZE size when setting
> Qav mode
> 
> From: Jesus Sanchez-Palencia 
> 
> Section 4.5.9 of the datasheet says that the total size of all packet
> buffers combined (TxPB 0 + 1 + 2 + 3 + RxPB + BMC2OS + OS2BMC) must not
> exceed 60KB. Today we are configuring a total of 62KB, so reduce the
> RxPB from 32KB to 30KB in order to respect that.
> 
> The choice of changing RxPBSIZE here is mainly because it seems more
> correct to give more priority to the transmit packet buffers over the
> receiver ones when running in Qav mode. Also, the BMC2OS and OS2BMC
> sizes are already too short.
> 
> Signed-off-by: Jesus Sanchez-Palencia 
> ---
>  drivers/net/ethernet/intel/igb/e1000_defines.h | 1 +
>  drivers/net/ethernet/intel/igb/igb_main.c  | 2 +-
>  2 files changed, 2 insertions(+), 1 deletion(-)
> 

Tested-by: Aaron Brown 



RE: [Intel-wired-lan] [next-queue PATCH v1 2/2] Documentation: igb: Add a section about CBS

2018-11-29 Thread Brown, Aaron F
> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On
> Behalf Of Vinicius Costa Gomes
> Sent: Friday, November 16, 2018 4:19 PM
> To: intel-wired-...@lists.osuosl.org
> Cc: netdev@vger.kernel.org; jesus.s.palen...@gmail.com
> Subject: [Intel-wired-lan] [next-queue PATCH v1 2/2] Documentation: igb:
> Add a section about CBS
> 
> Add some pointers to the definition of the CBS algorithm, and some
> notes about the limits of its implementation in the i210 family of
> controllers.
> 
> Signed-off-by: Vinicius Costa Gomes 
> ---
>  Documentation/networking/igb.rst | 19 +++
>  1 file changed, 19 insertions(+)
> 

Tested-by: Aaron Brown 


[PATCH bpf-next 3/4] bpf: Make more use of 'any' alignment in test_verifier.c

2018-11-29 Thread David Miller


Use F_NEEDS_EFFICIENT_UNALIGNED_ACCESS in more tests where the
expected result is REJECT.

Signed-off-by: David S. Miller 
---
 tools/testing/selftests/bpf/test_verifier.c | 44 +
 1 file changed, 44 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_verifier.c 
b/tools/testing/selftests/bpf/test_verifier.c
index af145f5..d8dda16 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -1823,6 +1823,7 @@ static struct bpf_test tests[] = {
.errstr = "invalid bpf_context access",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_SK_MSG,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"direct packet read for SK_MSG",
@@ -2187,6 +2188,8 @@ static struct bpf_test tests[] = {
},
.errstr = "invalid bpf_context access",
.result = REJECT,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"check cb access: half, wrong type",
@@ -3249,6 +3252,7 @@ static struct bpf_test tests[] = {
.result = REJECT,
.errstr = "R0 invalid mem access 'inv'",
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"raw_stack: skb_load_bytes, spilled regs corruption 2",
@@ -3279,6 +3283,7 @@ static struct bpf_test tests[] = {
.result = REJECT,
.errstr = "R3 invalid mem access 'inv'",
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"raw_stack: skb_load_bytes, spilled regs + data",
@@ -3778,6 +3783,7 @@ static struct bpf_test tests[] = {
.errstr = "R2 invalid mem access 'inv'",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"direct packet access: test16 (arith on data_end)",
@@ -3961,6 +3967,7 @@ static struct bpf_test tests[] = {
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
.result = REJECT,
.errstr = "invalid access to packet, off=0 size=8, 
R5(id=1,off=0,r=0)",
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"direct packet access: test24 (x += pkt_ptr, 5)",
@@ -5117,6 +5124,7 @@ static struct bpf_test tests[] = {
.result = REJECT,
.errstr = "invalid access to map value, value_size=64 off=-2 
size=4",
.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"invalid cgroup storage access 5",
@@ -5233,6 +5241,7 @@ static struct bpf_test tests[] = {
.result = REJECT,
.errstr = "invalid access to map value, value_size=64 off=-2 
size=4",
.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"invalid per-cpu cgroup storage access 5",
@@ -7149,6 +7158,7 @@ static struct bpf_test tests[] = {
.errstr = "invalid mem access 'inv'",
.result = REJECT,
.result_unpriv = REJECT,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"map element value illegal alu op, 5",
@@ -7171,6 +7181,7 @@ static struct bpf_test tests[] = {
.fixup_map_hash_48b = { 3 },
.errstr = "R0 invalid mem access 'inv'",
.result = REJECT,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"map element value is preserved across register spilling",
@@ -9663,6 +9674,7 @@ static struct bpf_test tests[] = {
.errstr = "R1 offset is outside of the packet",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"XDP pkt read, pkt_end > pkt_data', good access",
@@ -9701,6 +9713,7 @@ static struct bpf_test tests[] = {
.errstr = "R1 offset is outside of the packet",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"XDP pkt read, pkt_end > pkt_data', bad access 2",
@@ -9719,6 +9732,7 @@ static struct bpf_test tests[] = {
.errstr = "R1 offset is outside of the packet",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,

[PATCH bpf-next 4/4] bpf: Apply F_NEEDS_EFFICIENT_UNALIGNED_ACCESS to more ACCEPT test cases.

2018-11-29 Thread David Miller


If a testcase has alignment problems but is expected to be ACCEPT,
verify it using F_NEEDS_EFFICIENT_UNALIGNED_ACCESS too.

Maybe in the future if we add some architecture specific code to elide
the unaligned memory access warnings during the test, we can execute
these as well.

Signed-off-by: David S. Miller 
---
 tools/testing/selftests/bpf/test_verifier.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_verifier.c 
b/tools/testing/selftests/bpf/test_verifier.c
index d8dda16..683bf0c 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -3886,6 +3886,7 @@ static struct bpf_test tests[] = {
},
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
.result = ACCEPT,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"direct packet access: test21 (x += pkt_ptr, 2)",
@@ -3911,6 +3912,7 @@ static struct bpf_test tests[] = {
},
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
.result = ACCEPT,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"direct packet access: test22 (x += pkt_ptr, 3)",
@@ -3941,6 +3943,7 @@ static struct bpf_test tests[] = {
},
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
.result = ACCEPT,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"direct packet access: test23 (x += pkt_ptr, 4)",
@@ -3993,6 +3996,7 @@ static struct bpf_test tests[] = {
},
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
.result = ACCEPT,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"direct packet access: test25 (marking on <, good access)",
@@ -7675,6 +7679,7 @@ static struct bpf_test tests[] = {
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
.retval = 0 /* csum_diff of 64-byte packet */,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"helper access to variable memory: size = 0 not allowed on NULL 
(!ARG_PTR_TO_MEM_OR_NULL)",
@@ -9637,6 +9642,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"XDP pkt read, pkt_data' > pkt_end, bad access 1",
@@ -9808,6 +9814,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"XDP pkt read, pkt_end < pkt_data', bad access 1",
@@ -9920,6 +9927,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"XDP pkt read, pkt_end >= pkt_data', bad access 1",
@@ -9977,6 +9985,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"XDP pkt read, pkt_data' <= pkt_end, bad access 1",
@@ -10089,6 +10098,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"XDP pkt read, pkt_meta' > pkt_data, bad access 1",
@@ -10260,6 +10270,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"XDP pkt read, pkt_data < pkt_meta', bad access 1",
@@ -10372,6 +10383,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"XDP pkt read, pkt_data >= pkt_meta', bad access 1",
@@ -10429,6 +10441,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"XDP pkt read, pkt_meta' <= pkt_data, bad access 1",
@@ -12348,6 +12361,7 @@ static struct bpf_test tests[] = {
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
.result = ACCEPT,
.retval = 1,
+   .flags = 

[PATCH bpf-next 2/4] bpf: Adjust F_NEEDS_EFFICIENT_UNALIGNED_ACCESS handling in test_verifier.c

2018-11-29 Thread David Miller


Make it set the flag argument to bpf_verify_program() which will relax
the alignment restrictions.

Now all such test cases will go properly through the verifier even on
inefficient unaligned access architectures.

On inefficient unaligned access architectures do not try to run such
programs, instead mark the test case as passing but annotate the
result similarly to how it is done now in the presence of this flag.

So, we get complete full coverage for all REJECT test cases, and at
least verifier level coverage for ACCEPT test cases.

Signed-off-by: David S. Miller 
---
 tools/testing/selftests/bpf/test_verifier.c | 34 +++--
 1 file changed, 18 insertions(+), 16 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_verifier.c 
b/tools/testing/selftests/bpf/test_verifier.c
index c3de824..af145f5 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -14200,7 +14200,7 @@ static int set_admin(bool admin)
 static void do_test_single(struct bpf_test *test, bool unpriv,
   int *passes, int *errors)
 {
-   int fd_prog, expected_ret, reject_from_alignment;
+   int fd_prog, expected_ret, alignment_prevented_execution;
int prog_len, prog_type = test->prog_type;
struct bpf_insn *prog = test->insns;
int map_fds[MAX_NR_MAPS];
@@ -14219,7 +14219,7 @@ static void do_test_single(struct bpf_test *test, bool 
unpriv,
 
fd_prog = bpf_verify_program(prog_type, prog, prog_len,
 test->flags & F_LOAD_WITH_STRICT_ALIGNMENT,
-0,
+test->flags & 
F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
 "GPL", 0, bpf_vlog, sizeof(bpf_vlog), 1);
 
expected_ret = unpriv && test->result_unpriv != UNDEF ?
@@ -14229,28 +14229,27 @@ static void do_test_single(struct bpf_test *test, 
bool unpriv,
expected_val = unpriv && test->retval_unpriv ?
   test->retval_unpriv : test->retval;
 
-   reject_from_alignment = fd_prog < 0 &&
-   (test->flags & 
F_NEEDS_EFFICIENT_UNALIGNED_ACCESS) &&
-   strstr(bpf_vlog, "misaligned");
-#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
-   if (reject_from_alignment) {
-   printf("FAIL\nFailed due to alignment despite having efficient 
unaligned access: '%s'!\n",
-  strerror(errno));
-   goto fail_log;
-   }
-#endif
+   alignment_prevented_execution = 0;
+
if (expected_ret == ACCEPT) {
-   if (fd_prog < 0 && !reject_from_alignment) {
+   if (fd_prog < 0) {
printf("FAIL\nFailed to load prog '%s'!\n",
   strerror(errno));
goto fail_log;
}
+#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+   if (fd_prog >= 0 &&
+   (test->flags & F_NEEDS_EFFICIENT_UNALIGNED_ACCESS)) {
+   alignment_prevented_execution = 1;
+   goto test_ok;
+   }
+#endif
} else {
if (fd_prog >= 0) {
printf("FAIL\nUnexpected success to load!\n");
goto fail_log;
}
-   if (!strstr(bpf_vlog, expected_err) && !reject_from_alignment) {
+   if (!strstr(bpf_vlog, expected_err)) {
printf("FAIL\nUnexpected error message!\n\tEXP: 
%s\n\tRES: %s\n",
  expected_err, bpf_vlog);
goto fail_log;
@@ -14278,9 +14277,12 @@ static void do_test_single(struct bpf_test *test, bool 
unpriv,
goto fail_log;
}
}
+#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+test_ok:
+#endif
(*passes)++;
-   printf("OK%s\n", reject_from_alignment ?
-  " (NOTE: reject due to unknown alignment)" : "");
+   printf("OK%s\n", alignment_prevented_execution ?
+  " (NOTE: not executed due to unknown alignment)" : "");
 close_fds:
close(fd_prog);
for (i = 0; i < MAX_NR_MAPS; i++)
-- 
2.1.2.532.g19b5d50



[PATCH bpf-next 0/4] bpf: Improve verifier test coverage on sparc64 et al.

2018-11-29 Thread David Miller


On sparc64 a ton of test cases in test_verifier.c fail because
the memory accesses in the test case are unaligned (or cannot
be proven to be aligned by the verifier).

Perhaps we can eventually try to (carefully) modify each test case
which has this problem to not use unaligned accesses but:

1) That is delicate work.

2) The changes might not fully respect the original
   intention of the testcase.

3) In some cases, such a transformation might not even
   be feasible at all.

So add an "any alignment" flag to tell the verifier to forcefully
disable its alignment checks completely.

test_verifier.c is then annotated to use this flag when necessary.

The presence of the flag in each test case is good documentation to
anyone who wants to actually tackle the job of eliminating the
unaligned memory accesses in the test cases.

I've also seen several weird things in test cases, like trying to
access __skb->mark in a packet buffer.

This gets rid of 104 test_verifier.c failures on sparc64.

Changes since RFC:

1) Only the admin can allow the relaxation of alignment restrictions
   on inefficient unaligned access architectures.

2) Use F_NEEDS_EFFICIENT_UNALIGNED_ACCESS instead of making a new
   flag.

3) Annotate in the output, when we have a test case that the verifier
   accepted but we did not try to execute because we are on an
   inefficient unaligned access platform.  Maybe with some arch
   machinery we can avoid this in the future.

Signed-off-by: David S. Miller 


[PATCH bpf-next 1/4] bpf: Add BPF_F_ANY_ALIGNMENT.

2018-11-29 Thread David Miller


Often we want to write test cases that check things like bad context
offset accesses.  And one way to do this is to use an odd offset on,
for example, a 32-bit load.

This unfortunately triggers the alignment checks first on platforms
that do not set CONFIG_EFFICIENT_UNALIGNED_ACCESS.  So the test
case sees the alignment failure rather than what it was testing for.

It is often not completely possible to respect the original intention
of the test, or even test the same exact thing, while solving the
alignment issue.

Another option could have been to check the alignment after the
context and other validations are performed by the verifier, but
that is a non-trivial change to the verifier.
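
As a rough illustration of how a test loader can ask for the relaxed mode
using only the uapi flag added below (error handling omitted; this is not
taken from the selftest changes in this series):

/* Hedged sketch: load a test program with BPF_F_ANY_ALIGNMENT set so the
 * verifier skips its alignment checks and exercises the later checks we
 * actually want to test.
 */
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

static int load_prog_any_alignment(const struct bpf_insn *insns, __u32 cnt)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.prog_type  = BPF_PROG_TYPE_SCHED_CLS;
	attr.insns      = (__u64)(unsigned long)insns;
	attr.insn_cnt   = cnt;
	attr.license    = (__u64)(unsigned long)"GPL";
	attr.prog_flags = BPF_F_ANY_ALIGNMENT;	/* new flag from this patch */

	return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
}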

Signed-off-by: David S. Miller 
---
 include/uapi/linux/bpf.h| 10 ++
 kernel/bpf/syscall.c|  7 ++-
 kernel/bpf/verifier.c   |  2 ++
 tools/include/uapi/linux/bpf.h  | 10 ++
 tools/lib/bpf/bpf.c | 12 +---
 tools/lib/bpf/bpf.h |  6 +++---
 tools/testing/selftests/bpf/test_align.c|  2 +-
 tools/testing/selftests/bpf/test_verifier.c |  1 +
 8 files changed, 42 insertions(+), 8 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 426b5c8..c9647ea 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -232,6 +232,16 @@ enum bpf_attach_type {
  */
 #define BPF_F_STRICT_ALIGNMENT (1U << 0)
 
+/* If BPF_F_ANY_ALIGNMENT is used in BPF_PROG_LOAD command, the
+ * verifier will allow any alignment whatsoever.  This bypasses
+ * what CONFIG_EFFICIENT_UNALIGNED_ACCESS would cause it to do.
+ * It is mostly used for testing when we want to validate the
+ * context and memory access aspects of the validator, but because
+ * of an unaligned access the alignment check would trigger before
+ * the one we are interested in.
+ */
+#define BPF_F_ANY_ALIGNMENT(1U << 1)
+
 /* when bpf_ldimm64->src_reg == BPF_PSEUDO_MAP_FD, bpf_ldimm64->imm == fd */
 #define BPF_PSEUDO_MAP_FD  1
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index cf5040f..cae65bb 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1450,9 +1450,14 @@ static int bpf_prog_load(union bpf_attr *attr)
if (CHECK_ATTR(BPF_PROG_LOAD))
return -EINVAL;
 
-   if (attr->prog_flags & ~BPF_F_STRICT_ALIGNMENT)
+   if (attr->prog_flags & ~(BPF_F_STRICT_ALIGNMENT | BPF_F_ANY_ALIGNMENT))
return -EINVAL;
 
+   if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
+   (attr->prog_flags & BPF_F_ANY_ALIGNMENT) &&
+   !capable(CAP_SYS_ADMIN))
+   return -EPERM;
+
/* copy eBPF program license from user space */
if (strncpy_from_user(license, u64_to_user_ptr(attr->license),
  sizeof(license) - 1) < 0)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 6dd4195..2fefeae 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -6350,6 +6350,8 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr 
*attr)
env->strict_alignment = !!(attr->prog_flags & BPF_F_STRICT_ALIGNMENT);
if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS))
env->strict_alignment = true;
+   if (attr->prog_flags & BPF_F_ANY_ALIGNMENT)
+   env->strict_alignment = false;
 
ret = replace_map_fd_with_map_ptr(env);
if (ret < 0)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 426b5c8..c9647ea 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -232,6 +232,16 @@ enum bpf_attach_type {
  */
 #define BPF_F_STRICT_ALIGNMENT (1U << 0)
 
+/* If BPF_F_ANY_ALIGNMENT is used in BPF_PROG_LOAD command, the
+ * verifier will allow any alignment whatsoever.  This bypasses
+ * what CONFIG_EFFICIENT_UNALIGNED_ACCESS would cause it to do.
+ * It is mostly used for testing when we want to validate the
+ * context and memory access aspects of the validator, but because
+ * of an unaligned access the alignment check would trigger before
+ * the one we are interested in.
+ */
+#define BPF_F_ANY_ALIGNMENT(1U << 1)
+
 /* when bpf_ldimm64->src_reg == BPF_PSEUDO_MAP_FD, bpf_ldimm64->imm == fd */
 #define BPF_PSEUDO_MAP_FD  1
 
diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 03f9bcc..d4f76d2 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -232,9 +232,11 @@ int bpf_load_program(enum bpf_prog_type type, const struct 
bpf_insn *insns,
 
 int bpf_verify_program(enum bpf_prog_type type, const struct bpf_insn *insns,
   size_t insns_cnt, int strict_alignment,
-  const char *license, __u32 kern_version,
-  char *log_buf, size_t log_buf_sz, int log_level)
+  int any_alignment, const char *license,
+  __u32 kern_version, char 

Re: [PATCH net-next v2 1/2] udp: msg_zerocopy

2018-11-29 Thread Paolo Abeni
Hi,

Thank you for the update!

On Wed, 2018-11-28 at 18:50 -0500, Willem de Bruijn wrote:
> I did revert to the basic implementation using an extra ref
> for the function call, similar to TCP, as you suggested.
> 
> On top of that as a separate optimization patch I have a
> variant that uses refcnt zero by replacing refcount_inc with
> refcount_set(.., refcount_read(..) + 1). Not very pretty.

If the skb/uarg is not shared (no other threads can touch the refcnt)
before ip*_append_data() completes, how about something like the
following (incremental diff on top of patch 1/2, untested, uncompiled,
just to give the idea):

---
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 04f52e719571..1e3d195ffdfb 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -480,6 +480,13 @@ static inline void sock_zerocopy_get(struct ubuf_info 
*uarg)
refcount_inc(&uarg->refcnt);
 }
 
+/* use only before uarg is actually shared */
+static inline void __sock_zerocopy_init(struct ubuf_info *uarg, int cnt)
+{
+   if (uarg)
+   refcount_set(&uarg->refcnt, cnt);
+}
+
 void sock_zerocopy_put(struct ubuf_info *uarg);
 void sock_zerocopy_put_abort(struct ubuf_info *uarg);
 
@@ -1326,13 +1333,20 @@ static inline struct ubuf_info *skb_zcopy(struct 
sk_buff *skb)
return is_zcopy ? skb_uarg(skb) : NULL;
 }
 
-static inline void skb_zcopy_set(struct sk_buff *skb, struct ubuf_info *uarg)
+static inline int __skb_zcopy_set(struct sk_buff *skb, struct ubuf_info *uarg)
 {
if (skb && uarg && !skb_zcopy(skb)) {
-   sock_zerocopy_get(uarg);
skb_shinfo(skb)->destructor_arg = uarg;
skb_shinfo(skb)->tx_flags |= SKBTX_ZEROCOPY_FRAG;
+   return 1;
}
+   return 0;
+}
+
+static inline void skb_zcopy_set(struct sk_buff *skb, struct ubuf_info *uarg)
+{
+   if (__skb_zcopy_set(skb, uarg))
+   sock_zerocopy_get(uarg);
 }
 
 static inline void skb_zcopy_set_nouarg(struct sk_buff *skb, void *val)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 2179ef84bb44..435bac91d293 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -957,7 +957,7 @@ struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, 
size_t size)
uarg->len = 1;
uarg->bytelen = size;
uarg->zerocopy = 1;
-   refcount_set(&uarg->refcnt, sk->sk_type == SOCK_STREAM ? 1 : 0);
+   refcount_set(&uarg->refcnt, 1);
sock_hold(sk);
 
return uarg;
@@ -1097,13 +1097,6 @@ void sock_zerocopy_put_abort(struct ubuf_info *uarg)
atomic_dec(&sk->sk_zckey);
uarg->len--;
 
-   /* Stream socks hold a ref for the syscall, as skbs can be sent
-* and freed inside the loop, dropping refcnt to 0 inbetween.
-* Datagrams do not need this, but sock_zerocopy_put expects it.
-*/
-   if (sk->sk_type != SOCK_STREAM && !refcount_read(&uarg->refcnt))
-   refcount_set(&uarg->refcnt, 1);
-
sock_zerocopy_put(uarg);
}
 }
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 7504da2f33d6..d3285613d87a 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -882,6 +882,7 @@ static int __ip_append_data(struct sock *sk,
struct rtable *rt = (struct rtable *)cork->dst;
unsigned int wmem_alloc_delta = 0;
u32 tskey = 0;
+   int uarg_refs = 0;
bool paged;
 
skb = skb_peek_tail(queue);
@@ -919,6 +920,7 @@ static int __ip_append_data(struct sock *sk,
 
if (flags & MSG_ZEROCOPY && length) {
uarg = sock_zerocopy_realloc(sk, length, skb_zcopy(skb));
+   uarg_refs = 1;
if (!uarg)
return -ENOBUFS;
if (rt->dst.dev->features & NETIF_F_SG &&
@@ -926,7 +928,7 @@ static int __ip_append_data(struct sock *sk,
paged = true;
} else {
uarg->zerocopy = 0;
-   skb_zcopy_set(skb, uarg);
+   uarg_refs += __skb_zcopy_set(skb, uarg);
}
}
 
@@ -1019,7 +1021,7 @@ static int __ip_append_data(struct sock *sk,
cork->tx_flags = 0;
skb_shinfo(skb)->tskey = tskey;
tskey = 0;
-   skb_zcopy_set(skb, uarg);
+   uarg_refs += __skb_zcopy_set(skb, uarg);
 
/*
 *  Find where to start putting bytes.
@@ -1121,6 +1123,7 @@ static int __ip_append_data(struct sock *sk,
length -= copy;
}
 
+   __sock_zerocopy_init(uarg, uarg_refs);
if (wmem_alloc_delta)
refcount_add(wmem_alloc_delta, &sk->sk_wmem_alloc);
return 0;
@@ -1128,6 +1131,7 @@ static int __ip_append_data(struct sock *sk,
 error_efault:
err = -EFAULT;
 error:
+   __sock_zerocopy_init(uarg, uarg_refs);

[PATCH net-next] dpaa2-eth: Add "fall through" comments

2018-11-29 Thread Ioana Ciocoi Radulescu
Add comments in the switch statement for XDP action to indicate
fallthrough is intended.

Signed-off-by: Ioana Radulescu 
---
 drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c 
b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
index be84171..1ca9a18 100644
--- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
+++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
@@ -322,8 +322,10 @@ static u32 run_xdp(struct dpaa2_eth_priv *priv,
break;
default:
bpf_warn_invalid_xdp_action(xdp_act);
+   /* fall through */
case XDP_ABORTED:
trace_xdp_exception(priv->net_dev, xdp_prog, xdp_act);
+   /* fall through */
case XDP_DROP:
xdp_release_buf(priv, ch, addr);
ch->stats.xdp_drop++;
-- 
2.7.4



[PATCH net-next] qede - Add a statistic for a case where driver drops tx packet due to memory allocation failure.

2018-11-29 Thread Michael Shteinbok
skb_linearization can fail due to a memory allocation failure. In such a
case the driver drops the packet, and it used to print an error message
as well. This patch replaces that error message with a dedicated statistic.

Signed-off-by: Michael Shteinbok 
Signed-off-by: Ariel Elior 
---
 drivers/net/ethernet/qlogic/qede/qede.h | 1 +
 drivers/net/ethernet/qlogic/qede/qede_ethtool.c | 1 +
 drivers/net/ethernet/qlogic/qede/qede_fp.c  | 4 ++--
 3 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qede/qede.h 
b/drivers/net/ethernet/qlogic/qede/qede.h
index de98a97..5084dc9 100644
--- a/drivers/net/ethernet/qlogic/qede/qede.h
+++ b/drivers/net/ethernet/qlogic/qede/qede.h
@@ -377,6 +377,7 @@ struct qede_tx_queue {
 
u64 xmit_pkts;
u64 stopped_cnt;
+   u64 tx_mem_alloc_err;
 
__le16 *hw_cons_ptr;
 
diff --git a/drivers/net/ethernet/qlogic/qede/qede_ethtool.c 
b/drivers/net/ethernet/qlogic/qede/qede_ethtool.c
index 8cbbd62..16331c6 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_ethtool.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_ethtool.c
@@ -73,6 +73,7 @@
 } qede_tqstats_arr[] = {
QEDE_TQSTAT(xmit_pkts),
QEDE_TQSTAT(stopped_cnt),
+   QEDE_TQSTAT(tx_mem_alloc_err),
 };
 
 #define QEDE_STAT_OFFSET(stat_name, type, base) \
diff --git a/drivers/net/ethernet/qlogic/qede/qede_fp.c 
b/drivers/net/ethernet/qlogic/qede/qede_fp.c
index 1a78027..bdf816f 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_fp.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_fp.c
@@ -1466,8 +1466,8 @@ netdev_tx_t qede_start_xmit(struct sk_buff *skb, struct 
net_device *ndev)
 #if ((MAX_SKB_FRAGS + 2) > ETH_TX_MAX_BDS_PER_NON_LSO_PACKET)
if (qede_pkt_req_lin(skb, xmit_type)) {
if (skb_linearize(skb)) {
-   DP_NOTICE(edev,
- "SKB linearization failed - silently dropping 
this SKB\n");
+   txq->tx_mem_alloc_err++;
+
dev_kfree_skb_any(skb);
return NETDEV_TX_OK;
}
-- 
1.8.3.1



[PATCH net] net: Prevent invalid access to skb->prev in __qdisc_drop_all

2018-11-29 Thread Christoph Paasch
There are places in the stack where we access skb->prev directly and
modify it, namely __qdisc_drop_all().

With commit 68d2f84a1368 ("net: gro: properly remove skb from list")
the skb-list handling has been changed to set skb->next to NULL and set
the list-poison on skb->prev.

With that change, __qdisc_drop_all() will panic when it tries to
dereference skb->prev.

Since commit 992cba7e276d ("net: Add and use skb_list_del_init().")
__list_del_entry is used, leaving skb->prev unchanged (thus,
pointing to the list-head if it's the first skb of the list).
This will make __qdisc_drop_all modify the next-pointer of the list-head
and result in a panic later on:

[   34.501053] general protection fault:  [#1] SMP KASAN PTI
[   34.501968] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.20.0-rc2.mptcp #108
[   34.502887] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
0.5.1 01/01/2011
[   34.504074] RIP: 0010:dev_gro_receive+0x343/0x1f90
[   34.504751] Code: e0 48 c1 e8 03 42 80 3c 30 00 0f 85 4a 1c 00 00 4d 8b 24 
24 4c 39 65 d0 0f 84 0a 04 00 00 49 8d 7c 24 38 48 89 f8 48 c1 e8 03 <42> 0f b6 
04 30 84 c0 74 08 3c 04
[   34.507060] RSP: 0018:8883af507930 EFLAGS: 00010202
[   34.507761] RAX: 0007 RBX: 8883970b2c80 RCX: 111072e165a6
[   34.508640] RDX: 111075867008 RSI: 8883ac338040 RDI: 0038
[   34.509493] RBP: 8883af5079d0 R08: 8883970b2d40 R09: 0062
[   34.510346] R10: 0034 R11:  R12: 
[   34.511215] R13:  R14: dc00 R15: 8883ac338008
[   34.512082] FS:  () GS:8883af50() 
knlGS:
[   34.513036] CS:  0010 DS:  ES:  CR0: 80050033
[   34.513741] CR2: 55ccc3e9d020 CR3: 0003abf32000 CR4: 06e0
[   34.514593] Call Trace:
[   34.514893]  
[   34.515157]  napi_gro_receive+0x93/0x150
[   34.515632]  receive_buf+0x893/0x3700
[   34.516094]  ? __netif_receive_skb+0x1f/0x1a0
[   34.516629]  ? virtnet_probe+0x1b40/0x1b40
[   34.517153]  ? __stable_node_chain+0x4d0/0x850
[   34.517684]  ? kfree+0x9a/0x180
[   34.518067]  ? __kasan_slab_free+0x171/0x190
[   34.518582]  ? detach_buf+0x1df/0x650
[   34.519061]  ? lapic_next_event+0x5a/0x90
[   34.519539]  ? virtqueue_get_buf_ctx+0x280/0x7f0
[   34.520093]  virtnet_poll+0x2df/0xd60
[   34.520533]  ? receive_buf+0x3700/0x3700
[   34.521027]  ? qdisc_watchdog_schedule_ns+0xd5/0x140
[   34.521631]  ? htb_dequeue+0x1817/0x25f0
[   34.522107]  ? sch_direct_xmit+0x142/0xf30
[   34.522595]  ? virtqueue_napi_schedule+0x26/0x30
[   34.523155]  net_rx_action+0x2f6/0xc50
[   34.523601]  ? napi_complete_done+0x2f0/0x2f0
[   34.524126]  ? kasan_check_read+0x11/0x20
[   34.524608]  ? _raw_spin_lock+0x7d/0xd0
[   34.525070]  ? _raw_spin_lock_bh+0xd0/0xd0
[   34.525563]  ? kvm_guest_apic_eoi_write+0x6b/0x80
[   34.526130]  ? apic_ack_irq+0x9e/0xe0
[   34.526567]  __do_softirq+0x188/0x4b5
[   34.527015]  irq_exit+0x151/0x180
[   34.527417]  do_IRQ+0xdb/0x150
[   34.527783]  common_interrupt+0xf/0xf
[   34.528223]  

This patch makes sure that skb->prev is also set to NULL when removing
it from the list.

The bug is in v4.19.x as well, but the patch can't be backported easily.
I can post a follow-up for that.

Cc: Prashant Bhole 
Cc: Tyler Hicks 
Fixes: 68d2f84a1368 ("net: gro: properly remove skb from list")
Signed-off-by: Christoph Paasch 
---
 include/linux/skbuff.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 0d1b2c3f127b..3bb3bfd390eb 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1373,6 +1373,7 @@ static inline void skb_zcopy_abort(struct sk_buff *skb)
 static inline void skb_mark_not_on_list(struct sk_buff *skb)
 {
skb->next = NULL;
+   skb->prev = NULL;
 }
 
 static inline void skb_list_del_init(struct sk_buff *skb)
-- 
2.16.2



Re: [PATCH net] net/sched: act_police: fix memory leak in case of invalid control action

2018-11-29 Thread Cong Wang
On Wed, Nov 28, 2018 at 9:44 AM Davide Caratti  wrote:
>
> when users set an invalid control action, kmemleak complains as follows:
...
> change tcf_police_init() to avoid leaking 'new' in case TCA_POLICE_RESULT
> contains TC_ACT_GOTO_CHAIN extended action.
>
> Fixes: c08f5ed5d625 ("net/sched: act_police: disallow 'goto chain' on 
> fallback control action")
> Reported-by: Dan Carpenter 
> Signed-off-by: Davide Caratti 

Acked-by: Cong Wang 


Re: [PATCH net] net: Prevent invalid access to skb->prev in __qdisc_drop_all

2018-11-29 Thread Eric Dumazet



On 11/29/2018 02:27 PM, Christoph Paasch wrote:
> There are places in the stack, where we access skb->prev directly and
> modify it. Namely, __qdisc_drop_all().
> 
> With commit 68d2f84a1368 ("net: gro: properly remove skb from list")
> the skb-list handling has been changed to set skb->next to NULL and set
> the list-poison on skb->prev.
> 
> With that change, __qdisc_drop_all() will panic when it tries to
> dereference skb->prev.
> 
> Since commit 992cba7e276d ("net: Add and use skb_list_del_init().")
> __list_del_entry is used, leaving skb->prev unchanged (thus,
> pointing to the list-head if it's the first skb of the list).
> This will make __qdisc_drop_all modify the next-pointer of the list-head
> and result in a panic later on:
> 
> [   34.501053] general protection fault:  [#1] SMP KASAN PTI
> [   34.501968] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.20.0-rc2.mptcp #108
> [   34.502887] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 0.5.1 01/01/2011
> [   34.504074] RIP: 0010:dev_gro_receive+0x343/0x1f90
> [   34.504751] Code: e0 48 c1 e8 03 42 80 3c 30 00 0f 85 4a 1c 00 00 4d 8b 24 
> 24 4c 39 65 d0 0f 84 0a 04 00 00 49 8d 7c 24 38 48 89 f8 48 c1 e8 03 <42> 0f 
> b6 04 30 84 c0 74 08 3c 04
> [   34.507060] RSP: 0018:8883af507930 EFLAGS: 00010202
> [   34.507761] RAX: 0007 RBX: 8883970b2c80 RCX: 
> 111072e165a6
> [   34.508640] RDX: 111075867008 RSI: 8883ac338040 RDI: 
> 0038
> [   34.509493] RBP: 8883af5079d0 R08: 8883970b2d40 R09: 
> 0062
> [   34.510346] R10: 0034 R11:  R12: 
> 
> [   34.511215] R13:  R14: dc00 R15: 
> 8883ac338008
> [   34.512082] FS:  () GS:8883af50() 
> knlGS:
> [   34.513036] CS:  0010 DS:  ES:  CR0: 80050033
> [   34.513741] CR2: 55ccc3e9d020 CR3: 0003abf32000 CR4: 
> 06e0
> [   34.514593] Call Trace:
> [   34.514893]  
> [   34.515157]  napi_gro_receive+0x93/0x150
> [   34.515632]  receive_buf+0x893/0x3700
> [   34.516094]  ? __netif_receive_skb+0x1f/0x1a0
> [   34.516629]  ? virtnet_probe+0x1b40/0x1b40
> [   34.517153]  ? __stable_node_chain+0x4d0/0x850
> [   34.517684]  ? kfree+0x9a/0x180
> [   34.518067]  ? __kasan_slab_free+0x171/0x190
> [   34.518582]  ? detach_buf+0x1df/0x650
> [   34.519061]  ? lapic_next_event+0x5a/0x90
> [   34.519539]  ? virtqueue_get_buf_ctx+0x280/0x7f0
> [   34.520093]  virtnet_poll+0x2df/0xd60
> [   34.520533]  ? receive_buf+0x3700/0x3700
> [   34.521027]  ? qdisc_watchdog_schedule_ns+0xd5/0x140
> [   34.521631]  ? htb_dequeue+0x1817/0x25f0
> [   34.522107]  ? sch_direct_xmit+0x142/0xf30
> [   34.522595]  ? virtqueue_napi_schedule+0x26/0x30
> [   34.523155]  net_rx_action+0x2f6/0xc50
> [   34.523601]  ? napi_complete_done+0x2f0/0x2f0
> [   34.524126]  ? kasan_check_read+0x11/0x20
> [   34.524608]  ? _raw_spin_lock+0x7d/0xd0
> [   34.525070]  ? _raw_spin_lock_bh+0xd0/0xd0
> [   34.525563]  ? kvm_guest_apic_eoi_write+0x6b/0x80
> [   34.526130]  ? apic_ack_irq+0x9e/0xe0
> [   34.526567]  __do_softirq+0x188/0x4b5
> [   34.527015]  irq_exit+0x151/0x180
> [   34.527417]  do_IRQ+0xdb/0x150
> [   34.527783]  common_interrupt+0xf/0xf
> [   34.528223]  
> 
> This patch makes sure that skb->prev is also set to NULL when removing
> it from the list.
> 
> The bug is in v4.19.x as well, but the patch can't be backported easily.
> I can post a follow-up for that.
> 
> Cc: Prashant Bhole 
> Cc: Tyler Hicks 
> Fixes: 68d2f84a1368 ("net: gro: properly remove skb from list")
> Signed-off-by: Christoph Paasch 
> ---
>  include/linux/skbuff.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 0d1b2c3f127b..3bb3bfd390eb 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -1373,6 +1373,7 @@ static inline void skb_zcopy_abort(struct sk_buff *skb)
>  static inline void skb_mark_not_on_list(struct sk_buff *skb)
>  {
>   skb->next = NULL;
> + skb->prev = NULL;
>  }
>  
>  static inline void skb_list_del_init(struct sk_buff *skb)
> 

skb_mark_not_on_list() is used in many places where we do not care about skb->prev

What about fixing netem instead ?




Re: [PATCH] udp: Allow to defer reception until connect() happened

2018-11-29 Thread Eric Dumazet
On Thu, Nov 29, 2018 at 2:47 PM Christoph Paasch  wrote:

> Indeed, the UDP-stack is not fully 4-tuple ready.
>
>
> What are your thoughts on getting it there?

This would require an additional lookup, and heavy-duty servers using
unconnected sockets would pay the price of an extra lookup for each
incoming packet.

DNS servers and QUIC servers would not like that, since they have better use
of a single (unconnected) UDP socket per cpu/thread.
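
(For readers not familiar with that pattern: it is usually built on
SO_REUSEPORT. A minimal userspace sketch, assuming IPv4 and with error
handling trimmed, follows; it is an illustration only, not code from any
of the servers mentioned.)

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* one of these per worker thread: all bind to the same port and the
 * kernel spreads incoming datagrams across them, no connect() needed */
static int udp_worker_socket(uint16_t port)
{
	struct sockaddr_in addr;
	int one = 1;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return -1;
	setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = htons(port);
	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		close(fd);
		return -1;
	}
	return fd;	/* stays unconnected: one recvfrom() loop per thread */
}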


>
> Should be doable by simply using a similar approach as TCP, no? Any caveats
> you see with that?
>
> Then, when one sets the "wait-for-connect"-flag we would add the socket to
> the hash-table only at connect()-time also addressing the cache-line miss
> you mentioned above.

Sure.


Re: [PATCH bpf-next v2] tools/bpf: make libbpf _GNU_SOURCE friendly

2018-11-29 Thread Jakub Kicinski
On Thu, 29 Nov 2018 15:31:45 -0800, Yonghong Song wrote:
> During porting libbpf to bcc, I got some warnings like below:
>   ...
>   [  2%] Building C object 
> src/cc/CMakeFiles/bpf-shared.dir/libbpf/src/libbpf.c.o
>   /home/yhs/work/bcc2/src/cc/libbpf/src/libbpf.c:12:0:
>   warning: "_GNU_SOURCE" redefined [enabled by default]
>#define _GNU_SOURCE
>   ...
>   [  3%] Building C object 
> src/cc/CMakeFiles/bpf-shared.dir/libbpf/src/libbpf_errno.c.o
>   /home/yhs/work/bcc2/src/cc/libbpf/src/libbpf_errno.c: In function 
> ‘libbpf_strerror’:
>   /home/yhs/work/bcc2/src/cc/libbpf/src/libbpf_errno.c:45:7:
>   warning: assignment makes integer from pointer without a cast [enabled by 
> default]
>  ret = strerror_r(err, buf, size);
>   ...
> 
> bcc is built with _GNU_SOURCE defined and this caused the above warning.
> This patch intends to make libbpf _GNU_SOURCE friendly by
>   . defining _GNU_SOURCE in libbpf.c if it is not already defined
>   . undefining _GNU_SOURCE in libbpf_errno.c, as the non-GNU version of strerror_r() is expected.
> 
> Signed-off-by: Yonghong Song 

Acked-by: Jakub Kicinski 

FWIW :)


Re: [Patch net v2] mlx5: fixup checksum for short ethernet frame padding

2018-11-29 Thread Saeed Mahameed
On Wed, Nov 28, 2018 at 4:05 PM Cong Wang  wrote:
>
> On Wed, Nov 28, 2018 at 3:57 PM Cong Wang  wrote:
> > But again, I kinda feel the hardware already does the sanity check,
> > otherwise we have much more serious trouble in mlx5e_lro_update_hdr()
> > which parses into TCP header.
> >
>
> While we are on this page, mlx5e_lro_update_hdr() incorrectly assumes
> TCP header is located right after struct iphdr, which is wrong if we could
> have IP options on this path.
>
> It could the hardware which already verified this corner case though.

yes HW will not do LRO in case of ip options, so no problem in
mlx5e_lro_update_hdr


Re: [PATCH] udp: Allow to defer reception until connect() happened

2018-11-29 Thread Christoph Paasch
Hello,

On 28/11/18 - 19:15:12, Eric Dumazet wrote:
> On Wed, Nov 28, 2018 at 7:09 PM Jana Iyengar  wrote:
> >
> >
> > On Wed, Nov 28, 2018 at 6:19 PM Eric Dumazet  wrote:
> >>
> >> On Wed, Nov 28, 2018 at 5:57 PM Christoph Paasch  wrote:
> >> >
> >> > There are use-cases where a host wants to use a UDP socket with a
> >> > specific 4-tuple. The way to do this is to bind() and then connect() the
> >> > socket. However, after the bind(), the socket starts receiving data even
> >> > if it does not match the intended 4-tuple. That is because after the
> >> > bind() UDP-socket will match in the lookup for all incoming UDP-traffic
> >> > that has the specific IP/port.
> >> >
> >> > This patch prevents any incoming traffic until the connect() system-call
> >> > is called whenever the app sets the UDP socket-option
> >> > UDP_WAIT_FOR_CONNECT.
> >>
> >> Please do not add something that could mislead applications writers to
> >> think UDP stack can scale.
> >
> >
> >> UDP stack does not have a full hash on 4-tuples, it means that
> >> incoming traffic on a 'shared port' has
> >> to scan a list of XXX sockets to find the best match ...
> >
> >
> >> Also you add another cache line miss in UDP lookup to access
> >> udp_sk()->wait_for_connect.

Fair enough. We could add the socket later to the hash (see below).

> >>
> >> recvfrom() can be used to filter whatever frame that came before the 
> >> connect()
> >
> >
> > I don't think I understand that argument -- connect() is supported for UDP 
> > sockets, and UDP sockets are being used for serving QUIC traffic. Are you 
> > suggesting that connect() never be used?
> 
> If the source port is not shared, Christoph patch is not needed.
> 
> If it is shared, then a whole can of worm is opened.
> 
> Trying to hack UDP stack while it is not fully 4-tuple ready is not
> going to fly.

Indeed, the UDP-stack is not fully 4-tuple ready.


What are your thoughts on getting it there?

Should be doable by simply using a similar approach as TCP, no? Any caveats
you see with that?

Then, when one sets the "wait-for-connect"-flag we would add the socket to
the hash-table only at connect()-time also addressing the cache-line miss
you mentioned above.
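
To spell out the userspace pattern this thread is about, a sketch of the
intended usage follows. UDP_WAIT_FOR_CONNECT is only the option proposed
by this patch (it does not exist upstream), the option level is assumed
here, and 'local'/'remote' are assumed to be filled-in sockaddr_in
structs:

	int fd = socket(AF_INET, SOCK_DGRAM, 0);
	int one = 1;

	/* proposed option: stay invisible to the lookup until connect() */
	setsockopt(fd, IPPROTO_UDP, UDP_WAIT_FOR_CONNECT, &one, sizeof(one));
	bind(fd, (struct sockaddr *)&local, sizeof(local));	  /* pins the local addr/port */
	connect(fd, (struct sockaddr *)&remote, sizeof(remote)); /* pins the full 4-tuple */
	/* only now should traffic matching this exact 4-tuple be delivered */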


Thanks,
Christoph




Re: [PATCH net] net: Prevent invalid access to skb->prev in __qdisc_drop_all

2018-11-29 Thread Christoph Paasch
On 29/11/18 - 14:44:44, Eric Dumazet wrote:
> 
> 
> On 11/29/2018 02:27 PM, Christoph Paasch wrote:
> > There are places in the stack, where we access skb->prev directly and
> > modify it. Namely, __qdisc_drop_all().
> > 
> > With commit 68d2f84a1368 ("net: gro: properly remove skb from list")
> > the skb-list handling has been changed to set skb->next to NULL and set
> > the list-poison on skb->prev.
> > 
> > With that change, __qdisc_drop_all() will panic when it tries to
> > dereference skb->prev.
> > 
> > Since commit 992cba7e276d ("net: Add and use skb_list_del_init().")
> > __list_del_entry is used, leaving skb->prev unchanged (thus,
> > pointing to the list-head if it's the first skb of the list).
> > This will make __qdisc_drop_all modify the next-pointer of the list-head
> > and result in a panic later on:
> > 
> > [   34.501053] general protection fault:  [#1] SMP KASAN PTI
> > [   34.501968] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.20.0-rc2.mptcp 
> > #108
> > [   34.502887] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> > 0.5.1 01/01/2011
> > [   34.504074] RIP: 0010:dev_gro_receive+0x343/0x1f90
> > [   34.504751] Code: e0 48 c1 e8 03 42 80 3c 30 00 0f 85 4a 1c 00 00 4d 8b 
> > 24 24 4c 39 65 d0 0f 84 0a 04 00 00 49 8d 7c 24 38 48 89 f8 48 c1 e8 03 
> > <42> 0f b6 04 30 84 c0 74 08 3c 04
> > [   34.507060] RSP: 0018:8883af507930 EFLAGS: 00010202
> > [   34.507761] RAX: 0007 RBX: 8883970b2c80 RCX: 
> > 111072e165a6
> > [   34.508640] RDX: 111075867008 RSI: 8883ac338040 RDI: 
> > 0038
> > [   34.509493] RBP: 8883af5079d0 R08: 8883970b2d40 R09: 
> > 0062
> > [   34.510346] R10: 0034 R11:  R12: 
> > 
> > [   34.511215] R13:  R14: dc00 R15: 
> > 8883ac338008
> > [   34.512082] FS:  () GS:8883af50() 
> > knlGS:
> > [   34.513036] CS:  0010 DS:  ES:  CR0: 80050033
> > [   34.513741] CR2: 55ccc3e9d020 CR3: 0003abf32000 CR4: 
> > 06e0
> > [   34.514593] Call Trace:
> > [   34.514893]  
> > [   34.515157]  napi_gro_receive+0x93/0x150
> > [   34.515632]  receive_buf+0x893/0x3700
> > [   34.516094]  ? __netif_receive_skb+0x1f/0x1a0
> > [   34.516629]  ? virtnet_probe+0x1b40/0x1b40
> > [   34.517153]  ? __stable_node_chain+0x4d0/0x850
> > [   34.517684]  ? kfree+0x9a/0x180
> > [   34.518067]  ? __kasan_slab_free+0x171/0x190
> > [   34.518582]  ? detach_buf+0x1df/0x650
> > [   34.519061]  ? lapic_next_event+0x5a/0x90
> > [   34.519539]  ? virtqueue_get_buf_ctx+0x280/0x7f0
> > [   34.520093]  virtnet_poll+0x2df/0xd60
> > [   34.520533]  ? receive_buf+0x3700/0x3700
> > [   34.521027]  ? qdisc_watchdog_schedule_ns+0xd5/0x140
> > [   34.521631]  ? htb_dequeue+0x1817/0x25f0
> > [   34.522107]  ? sch_direct_xmit+0x142/0xf30
> > [   34.522595]  ? virtqueue_napi_schedule+0x26/0x30
> > [   34.523155]  net_rx_action+0x2f6/0xc50
> > [   34.523601]  ? napi_complete_done+0x2f0/0x2f0
> > [   34.524126]  ? kasan_check_read+0x11/0x20
> > [   34.524608]  ? _raw_spin_lock+0x7d/0xd0
> > [   34.525070]  ? _raw_spin_lock_bh+0xd0/0xd0
> > [   34.525563]  ? kvm_guest_apic_eoi_write+0x6b/0x80
> > [   34.526130]  ? apic_ack_irq+0x9e/0xe0
> > [   34.526567]  __do_softirq+0x188/0x4b5
> > [   34.527015]  irq_exit+0x151/0x180
> > [   34.527417]  do_IRQ+0xdb/0x150
> > [   34.527783]  common_interrupt+0xf/0xf
> > [   34.528223]  
> > 
> > This patch makes sure that skb->prev is also set to NULL when removing
> > it from the list.
> > 
> > The bug is in v4.19.x as well, but the patch can't be backported easily.
> > I can post a follow-up for that.
> > 
> > Cc: Prashant Bhole 
> > Cc: Tyler Hicks 
> > Fixes: 68d2f84a1368 ("net: gro: properly remove skb from list")
> > Signed-off-by: Christoph Paasch 
> > ---
> >  include/linux/skbuff.h | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index 0d1b2c3f127b..3bb3bfd390eb 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -1373,6 +1373,7 @@ static inline void skb_zcopy_abort(struct sk_buff 
> > *skb)
> >  static inline void skb_mark_not_on_list(struct sk_buff *skb)
> >  {
> > skb->next = NULL;
> > +   skb->prev = NULL;
> >  }
> >  
> >  static inline void skb_list_del_init(struct sk_buff *skb)
> > 
> 
> skb_mark_not_on_list() is used in many place where we do not care of skb->prev
> 
> What about fixing netem instead ?

Yes, I have been looking at that and Alexey's patch which introduced the
access to skb->prev (cfr.: https://patchwork.ozlabs.org/patch/880717/).

But then I thought that setting skb->prev to NULL is a less risky approach for
-stable.


How would you go about fixing netem instead?

Because, from what I see we basically can enter netem_enqueue here with two
different "types" of skb's. The ones where skb->prev points to the tail of
the list 

Re: [PATCH RFC net-next] net: vlan/macvlan: count packets properly with gso

2018-11-29 Thread Eric Dumazet



On 11/29/2018 01:13 PM, Duyck, Alexander H wrote:

> Instead of just checking for the max it might make more sense to do a
> check using skb_is_gso, and then if true use gso_segs, otherwise just
> default to 1.
> 
> Also your bytes are going to be totally messed up as well since the
> headers are trimmed in the GSO frames. It might be worthwhile to just
> have a branch based on skb_is_gso that sets the packets and bytes based
> on the GSO values, and one that sets it for the default case.
> 
Note that __dev_queue_xmit() is already doing that, no need
to re-implement in each driver.

It calls qdisc_pkt_len_init(), meaning that drivers can use 
qdisc_skb_cb(skb)->pkt_len instead of skb->len

(Qdisc layers should not modify qdisc_skb_cb(skb)->pkt_len )
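
A sketch of what that looks like from a driver's point of view
(illustration only, not a tested patch; it assumes the skb went through
__dev_queue_xmit(), so qdisc_pkt_len_init() has already run):

#include <linux/if_vlan.h>
#include <net/sch_generic.h>

/* struct vlan_pcpu_stats as in include/linux/if_vlan.h */
static void vlan_count_tx(struct vlan_pcpu_stats *stats, struct sk_buff *skb)
{
	u64_stats_update_begin(&stats->syncp);
	/* pkt_len already accounts for the replicated headers of every segment */
	stats->tx_bytes += qdisc_skb_cb(skb)->pkt_len;
	stats->tx_packets += skb_is_gso(skb) ? skb_shinfo(skb)->gso_segs : 1;
	u64_stats_update_end(&stats->syncp);
}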


Re: [PATCH net-next 0/2] Add mii_bus to ixgbe driver for dsa devs

2018-11-29 Thread Jeff Kirsher
On Thu, 2018-11-29 at 22:19 +, Steve Douthit wrote:
> On 11/29/18 2:03 PM, Jeff Kirsher wrote:
> > On Thu, 2018-11-29 at 18:54 +, Steve Douthit wrote:
> > > Most dsa devices currently expect to get a pointer to a mii_bus from
> > > platform data of device tree information.
> > > 
> > > The first patch exposes a mii_bus for ixgbe devices.
> > > 
> > > Currently the ixgbe driver only allows accesses to the probed PHY
> > > address via ioctls. The second patch changes that behavior to allow
> > > access to all addresses.  This should be useful for doing switch
> > > peeks &
> > > pokes from userspace for development, test, or if theres no dsa
> > > driver
> > > for a particular switch yet.
> > > 
> > > I've tested the clause 22 portion of this code on a board that has an
> > > Intel C3xxx series SoC connected to a Marvell Peridot switch.  I
> > > don't
> > > have any clause 45 devices or other ixgbe devices to test with.
> > > 
> > > Stephen Douthit (2):
> > >ixgbe: register a mdiobus
> > >ixgbe: use mii_bus to handle MII related ioctls
> > > 
> > >   drivers/net/ethernet/intel/ixgbe/ixgbe.h  |   2 +
> > >   drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  35 ++--
> > >   drivers/net/ethernet/intel/ixgbe/ixgbe_phy.c  | 192
> > > ++
> > >   drivers/net/ethernet/intel/ixgbe/ixgbe_phy.h  |   2 +
> > >   4 files changed, 216 insertions(+), 15 deletions(-)
> > > 
> > 
> > I will add these changes to my queue, please remember to send patches
> > against the drivers in drivers/net/ethernet/intel/* to
> > intel-wired-...@lists.osuosl.org  mailing list as well, so that our
> > patchwork project can track the patches.
> 
> Sorry about that, I'll add it to the CC list for the next rev.
> 
> I started adding another board to the platform driver that needs this
> code and found a bug.  The new board has more lanes enabled via soft
> straps.  That caused sysfs to complain because I ended up with duplicate
> bus IDs from this code:
> 
> + /* Use the position of the device in the PCI hierarchy as the id */
> + snprintf(bus->id, MII_BUS_ID_SIZE, "%s-mdio-%s", ixgbe_driver_name,
> +  pci_name(pdev->bus->self));
> 
> Please drop this series for now.

Ok, it has been dropped and I will wait for v2.

> Thinking about this some more I probably need to ensure that only a
> single mii_bus is registered out of all the discovered x550em_a devices.
> More generally there's not always going to be a 1:1 mapping between MACs
> and MIIs on these devices and the code should handle that.
> 
> I think for the next rev I'll only register a single mii_bus for the
> lowest numbered bus:device.function x500em_a device, and skip mii_bus
> setup entirely for other device IDs.
> 
> Let me know if there's another/better way to tackle that.

I will bring this up with our 10GbE driver lead, he may have ideas.  Any
patches sent to intel-wired-...@lists.osuosl.org mailing list will be seen
by our 10GbE driver developers, so they can provide feedback.




Re: [PATCH net] net: Prevent invalid access to skb->prev in __qdisc_drop_all

2018-11-29 Thread David Miller
From: Eric Dumazet 
Date: Thu, 29 Nov 2018 15:09:18 -0800

> diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
> index 
> 2c38e3d0792468162ee0dc4137f1400160ab9276..22cd46a600576f286803536d45875cd9d537cdca
>  100644
> --- a/net/sched/sch_netem.c
> +++ b/net/sched/sch_netem.c
> @@ -431,6 +431,9 @@ static int netem_enqueue(struct sk_buff *skb, struct 
> Qdisc *sch,
> int count = 1;
> int rc = NET_XMIT_SUCCESS;
>  
> +   /* Do not fool qdisc_drop_all() */
> +   skb->prev = NULL;
> +
> /* Random duplication */
> if (q->duplicate && q->duplicate >= get_crandom(&q->dup_cor))
> ++count;

If this works I definitely prefer it to making the entire stack pay the
price to fix this crash.


Re: [PATCH net] net: Prevent invalid access to skb->prev in __qdisc_drop_all

2018-11-29 Thread Christoph Paasch
On Thu, Nov 29, 2018 at 3:54 PM David Miller  wrote:
>
> From: Eric Dumazet 
> Date: Thu, 29 Nov 2018 15:09:18 -0800
>
> > diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
> > index 
> > 2c38e3d0792468162ee0dc4137f1400160ab9276..22cd46a600576f286803536d45875cd9d537cdca
> >  100644
> > --- a/net/sched/sch_netem.c
> > +++ b/net/sched/sch_netem.c
> > @@ -431,6 +431,9 @@ static int netem_enqueue(struct sk_buff *skb, struct 
> > Qdisc *sch,
> > int count = 1;
> > int rc = NET_XMIT_SUCCESS;
> >
> > +   /* Do not fool qdisc_drop_all() */
> > +   skb->prev = NULL;
> > +
> > /* Random duplication */
> > if (q->duplicate && q->duplicate >= get_crandom(&q->dup_cor))
> > ++count;
>
> If this works I definitely prefer it to making the entire stack pay the
> price to fix this crash.

Yes, I tried it out and it works.


Christoph


Re: [Patch net] mlx5: fixup checksum for ethernet padding

2018-11-29 Thread kbuild test robot
Hi Cong,

I love your patch! Perhaps something to improve:

[auto build test WARNING on net/master]

url:
https://github.com/0day-ci/linux/commits/Cong-Wang/mlx5-fixup-checksum-for-ethernet-padding/20181130-014928
config: ia64-allmodconfig (attached as .config)
compiler: ia64-linux-gcc (GCC) 8.1.0
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=8.1.0 make.cross ARCH=ia64 

Note: it may well be a FALSE warning. FWIW you are at least aware of it now.
http://gcc.gnu.org/wiki/Better_Uninitialized_Warnings

All warnings (new ones prefixed by >>):

   drivers/net/ethernet/mellanox/mlx5/core/en_rx.c: In function 
'mlx5e_csum_padding':
>> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:740:6: warning: 'pad_offset' 
>> may be used uninitialized in this function [-Wmaybe-uninitialized]
 u32 pad_offset, pad_len;
 ^~

vim +/pad_offset +740 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c

   734  
   735  static void mlx5e_csum_padding(struct sk_buff *skb, int network_depth,
   736 __be16 proto, bool has_fcs)
   737  {
   738  u32 frame_len = has_fcs ? skb->len - ETH_FCS_LEN : skb->len;
   739  void *ip_p = skb->data + network_depth;
 > 740  u32 pad_offset, pad_len;
   741  void *pad;
   742  
   743  if (likely(frame_len > ETH_ZLEN))
   744  return;
   745  
   746  if (proto == htons(ETH_P_IP)) {
   747  struct iphdr *ipv4 = ip_p;
   748  
   749  pad_offset =  network_depth + 
be16_to_cpu(ipv4->tot_len);
   750  } else if (proto == htons(ETH_P_IPV6)) {
   751  struct ipv6hdr *ipv6 = ip_p;
   752  
   753  pad_offset = network_depth + sizeof(struct ipv6hdr) +
   754   be16_to_cpu(ipv6->payload_len);
   755  }
   756  
   757  pad = skb->data + pad_offset;
   758  pad_len = frame_len - pad_offset;
   759  
   760  skb->csum = csum_block_add(skb->csum, csum_partial(pad, 
pad_len, 0),
   761 pad_offset);
   762  }
   763  
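
The warning looks legitimate for the sketch as posted: pad_offset is only
assigned in the IPv4 and IPv6 branches, so a frame that is neither would
use it uninitialized. One minimal way to address that (an illustration,
not part of Cong's patch) is an early return in a final else branch:

	} else if (proto == htons(ETH_P_IPV6)) {
		struct ipv6hdr *ipv6 = ip_p;

		pad_offset = network_depth + sizeof(struct ipv6hdr) +
			     be16_to_cpu(ipv6->payload_len);
	} else {
		return;		/* neither IPv4 nor IPv6: leave the csum alone */
	}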

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




Re: [RFC PATCH 1/4] indirect call wrappers: helpers to speed-up indirect calls of builtin

2018-11-29 Thread Eric Dumazet



On 11/29/2018 03:00 PM, Paolo Abeni wrote:
> This header define a bunch of helpers that allow avoiding the
> retpoline overhead when calling builtin functions via function pointers.
> It boils down to explicitly comparing the function pointers to
> known builtin functions and eventually invoke directly the latter.
> 
> The macro defined here implement the boilerplate for the above schema
> and will be used by the next patches.
> 
> Suggested-by: Eric Dumazet 
> Signed-off-by: Paolo Abeni 
> ---
>  include/linux/indirect_call_wrapper.h | 77 +++
>  1 file changed, 77 insertions(+)
>  create mode 100644 include/linux/indirect_call_wrapper.h
> 
> diff --git a/include/linux/indirect_call_wrapper.h 
> b/include/linux/indirect_call_wrapper.h
> new file mode 100644
> index ..57e82b4a166d
> --- /dev/null
> +++ b/include/linux/indirect_call_wrapper.h
> @@ -0,0 +1,77 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_INDIRECT_CALL_WRAPPER_H
> +#define _LINUX_INDIRECT_CALL_WRAPPER_H
> +
> +#ifdef CONFIG_RETPOLINE
> +
> +/*
> + * INDIRECT_CALL_$NR - wrapper for indirect calls with $NR known builtin
> + *  @f: function pointer
> + *  @name: base name for builtin functions, see INDIRECT_CALLABLE_DECLARE_$NR
> + *  @__VA_ARGS__: arguments for @f
> + *
> + * Avoid retpoline overhead for known builtin, checking @f vs each of them 
> and
> + * eventually invoking directly the builtin function. Fallback to the 
> indirect
> + * call
> + */
> +#define INDIRECT_CALL_1(f, name, ...)
> \
> + ({  \
> + f == name ## 1 ? name ## 1(__VA_ARGS__) :   \

  likely(f == name ## 1) ? ...

> +  f(__VA_ARGS__);\
> + })
> +#define INDIRECT_CALL_2(f, name, ...)
> \
> + ({  \
> + f == name ## 2 ? name ## 2(__VA_ARGS__) :   \

 likely(f == name ## 2) ? ...


> +  INDIRECT_CALL_1(f, name, __VA_ARGS__); \
> + })
> +
>
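
i.e. with both hints applied, the helpers would read roughly:

#define INDIRECT_CALL_1(f, name, ...)					\
	({								\
		likely(f == name ## 1) ? name ## 1(__VA_ARGS__) :	\
					 f(__VA_ARGS__);		\
	})
#define INDIRECT_CALL_2(f, name, ...)					\
	({								\
		likely(f == name ## 2) ? name ## 2(__VA_ARGS__) :	\
					 INDIRECT_CALL_1(f, name, __VA_ARGS__); \
	})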


[PATCH bpf-next v2] tools/bpf: make libbpf _GNU_SOURCE friendly

2018-11-29 Thread Yonghong Song
During porting libbpf to bcc, I got some warnings like below:
  ...
  [  2%] Building C object 
src/cc/CMakeFiles/bpf-shared.dir/libbpf/src/libbpf.c.o
  /home/yhs/work/bcc2/src/cc/libbpf/src/libbpf.c:12:0:
  warning: "_GNU_SOURCE" redefined [enabled by default]
   #define _GNU_SOURCE
  ...
  [  3%] Building C object 
src/cc/CMakeFiles/bpf-shared.dir/libbpf/src/libbpf_errno.c.o
  /home/yhs/work/bcc2/src/cc/libbpf/src/libbpf_errno.c: In function 
‘libbpf_strerror’:
  /home/yhs/work/bcc2/src/cc/libbpf/src/libbpf_errno.c:45:7:
  warning: assignment makes integer from pointer without a cast [enabled by 
default]
 ret = strerror_r(err, buf, size);
  ...

bcc is built with _GNU_SOURCE defined and this caused the above warning.
This patch intends to make libbpf _GNU_SOURCE friendly by
  . defining _GNU_SOURCE in libbpf.c if it is not already defined
  . undefining _GNU_SOURCE in libbpf_errno.c, as the non-GNU version of strerror_r() is expected.

Signed-off-by: Yonghong Song 
---
 tools/lib/bpf/libbpf.c   | 2 ++
 tools/lib/bpf/libbpf_errno.c | 1 +
 2 files changed, 3 insertions(+)

Changelog:
 v1 -> v2:
   . undef _GNU_SOURCE instead of multiversioning strerror_r,
 suggested by Jakub.

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index ed4212a4c5f9..59b748ebd15f 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -9,7 +9,9 @@
  * Copyright (C) 2017 Nicira, Inc.
  */
 
+#ifndef _GNU_SOURCE
 #define _GNU_SOURCE
+#endif
 #include 
 #include 
 #include 
diff --git a/tools/lib/bpf/libbpf_errno.c b/tools/lib/bpf/libbpf_errno.c
index d83b17f8435c..4343e40588c6 100644
--- a/tools/lib/bpf/libbpf_errno.c
+++ b/tools/lib/bpf/libbpf_errno.c
@@ -7,6 +7,7 @@
  * Copyright (C) 2017 Nicira, Inc.
  */
 
+#undef _GNU_SOURCE
 #include 
 #include 
 
-- 
2.17.1



Re: Freeze when using ipheth+IPsec+IPv6

2018-11-29 Thread Kees Cook
On Wed, Jun 6, 2018 at 1:21 AM Yves-Alexis Perez  wrote:
>
>
> On Tue, Jun 05, 2018 at 10:54:51AM +0200, Yves-Alexis Perez wrote:
> > Hi,
> >
> > since some kernels releases (I didn't test thorougly but at least 4.16
> > and 4.17) I have regular freezes in certain situations on my laptop.
> >
> > It seems to happen when I:
> >
> > - tether using my iPhone (involving ipheth)
> > - mount an IPsec tunnel over IPv4
> > - run evolution to fetch my mail (IMAP traffic over IPv6 inside the IPv4
> >   IPsec tunnel)
> >
> > When I do that, the interface seems to freeze. Last time the mouse was
> > still moving so the kernel didn't completely crash, but the UI was
> > completely irresponsive. I managed to get the attached log from
> > /sys/fs/pstore with refcount_t stuff pointing to an underflow.
>
> Today I had a different behavior. Again same situation (ipheth, IPsec
> tunnel, refresh of the LKML folder in Evolution). The kernel didn't
> crash/freeze but I had multiple (33309 actually) "recvmsg bug:
> copied..." traces like this one:
>
>
> [ 1555.957599] [ cut here ]
> [ 1555.957619] recvmsg bug: copied ABEA08B2 seq 1 rcvnxt ABEA0DCE fl 0
> [ 1555.957805] WARNING: CPU: 3 PID: 2177 at 
> /home/corsac/projets/linux/linux/net/ipv4/tcp.c:1850 tcp_recvmsg+0x610/0xb40

(I'm going through ancient email while I try to catch up from travel...)

Did you ever solve this?

-Kees

> [ 1555.957813] Modules linked in: esp4 xfrm6_mode_tunnel xfrm4_mode_tunnel 
> bnep ipheth rtsx_pci_sdmmc snd_hda_codec_realtek iwlmvm snd_hda_codec_generic 
> snd_hda_codec_hdmi snd_hda_intel iwlwifi snd_hda_codec snd_hwdep rtsx_pci 
> snd_hda_core snd_pcm thinkpad_acpi efivarfs input_leds
> [ 1555.957895] CPU: 3 PID: 2177 Comm: pool Tainted: GT 4.17.0 
> #22
> [ 1555.957902] Hardware name: LENOVO 20CMCTO1WW/20CMCTO1WW, BIOS N10ET48W 
> (1.27 ) 09/12/2017
> [ 1555.957922] RIP: 0010:tcp_recvmsg+0x610/0xb40
> [ 1555.957927] RSP: 0018:b77e010f7cf8 EFLAGS: 00010282
> [ 1555.957932] RAX:  RBX: abea08b2 RCX: 
> 0006
> [ 1555.957935] RDX: 0007 RSI: 0086 RDI: 
> a37a8dd95610
> [ 1555.957939] RBP: b77e010f7db8 R08: 03b4 R09: 
> 0004
> [ 1555.957942] R10: a37a3b1180c8 R11: 0001 R12: 
> a37a81d40e00
> [ 1555.957945] R13: a37a3b118000 R14: a37a3b118524 R15: 
> 
> [ 1555.957951] FS:  738f795c0700() GS:a37a8dd8() 
> knlGS:
> [ 1555.957954] CS:  0010 DS:  ES:  CR0: 80050033
> [ 1555.957957] CR2: 738f0879a028 CR3: 00024200c006 CR4: 
> 003606e0
> [ 1555.957964] Call Trace:
> [ 1555.957996]  inet_recvmsg+0x5c/0x110
> [ 1555.958017]  __sys_recvfrom+0xf2/0x160
> [ 1555.958030]  __x64_sys_recvfrom+0x1f/0x30
> [ 1555.958039]  do_syscall_64+0x72/0x1c0
> [ 1555.958048]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 1555.958053] RIP: 0033:0x73901a71deae
> [ 1555.958056] RSP: 002b:738f795bee50 EFLAGS: 0246 ORIG_RAX: 
> 002d
> [ 1555.958060] RAX: ffda RBX: 0028 RCX: 
> 73901a71deae
> [ 1555.958063] RDX: 0404 RSI: 738f087955a7 RDI: 
> 0028
> [ 1555.958066] RBP:  R08:  R09: 
> 
> [ 1555.958068] R10:  R11: 0246 R12: 
> 738f087955a7
> [ 1555.958071] R13: 0404 R14:  R15: 
> 
> [ 1555.958075] Code: e9 33 fd ff ff 4c 89 e0 41 8b 8d 20 05 00 00 89 de 48 c7 
> c7 10 47 05 ae 48 89 85 48 ff ff ff 44 8b 85 70 ff ff ff e8 80 0d 93 ff <0f> 
> 0b 48 8b 85 48 ff ff ff e9 ed fd ff ff 41 8b 8d 20 05 00 00
> [ 1555.958180] ---[ end trace e7da03c87ec51f13 ]---
>
> (complete log available but it seems that only R08 is changing between
> these traces)
>
> Followed by a "recvmsg bug 2:":
>
> [ 1563.657991] [ cut here ]
> [ 1563.657992] recvmsg bug 2: copied ABEA08B2 seq 6A7E3970 rcvnxt ABECA5EE fl > 0
> [ 1563.658002] WARNING: CPU: 1 PID: 2177 at 
> /home/corsac/projets/linux/linux/net/ipv4/tcp.c:1864 tcp_recvmsg+0x647/0xb40
> [ 1563.658002] Modules linked in: esp4 xfrm6_mode_tunnel xfrm4_mode_tunnel 
> bnep ipheth rtsx_pci_sdmmc snd_hda_codec_realtek iwlmvm snd_hda_codec_generic 
> snd_hda_codec_hdmi snd_hda_intel iwlwifi snd_hda_codec snd_hwdep rtsx_pci 
> snd_hda_core snd_pcm thinkpad_acpi efivarfs input_leds
> [ 1563.658016] CPU: 1 PID: 2177 Comm: pool Tainted: GW   T 4.17.0 
> #22
> [ 1563.658017] Hardware name: LENOVO 20CMCTO1WW/20CMCTO1WW, BIOS N10ET48W 
> (1.27 ) 09/12/2017
> [ 1563.658019] RIP: 0010:tcp_recvmsg+0x647/0xb40
> [ 1563.658020] RSP: 0018:b77e010f7cf8 EFLAGS: 00010282
> [ 1563.658022] RAX:  RBX: 416bcf42 RCX: 
> 0006
> [ 1563.658023] RDX: 0007 RSI: 0086 RDI: 
> a37a8dc95610
> [ 

Re: [PATCH net] net: Prevent invalid access to skb->prev in __qdisc_drop_all

2018-11-29 Thread David Miller
From: Christoph Paasch 
Date: Thu, 29 Nov 2018 15:45:19 -0800

> I can resubmit a patch.

Please do after testing.


Re: [Patch net] mlx5: fix get_ip_proto()

2018-11-29 Thread Saeed Mahameed
On Thu, Nov 29, 2018 at 5:13 AM Tariq Toukan  wrote:
>
>
>
> On 29/11/2018 1:04 AM, Cong Wang wrote:
> > IP header is not necessarily located right after struct ethhdr,
> > there could be multiple 802.1Q headers in between, this is why
> > we call __vlan_get_protocol().
> >
> > Fixes: fe1dc069990c ("net/mlx5e: don't set CHECKSUM_COMPLETE on SCTP 
> > packets")
> > Cc: Alaa Hleihel 
> > Cc: Or Gerlitz 
> > Cc: Saeed Mahameed 
> > Signed-off-by: Cong Wang 

Acked-by: Saeed Mahameed 


Re: [PATCH RFC net-next] net: vlan/macvlan: count packets properly with gso

2018-11-29 Thread Duyck, Alexander H
On Thu, 2018-11-29 at 15:58 -0500, Debabrata Banerjee wrote:
> Fix packet count when using vlan/macvlan drivers with gso. Without this it
> is not possible to reconcile packet counts between underlying devices and
> these virtual devices. Additionally, the output looks wrong in a standalone
> way i.e. device MTU of 1500, 1 packet sent, 31856 bytes sent.
> 
> There are many other drivers that likely have a similar problem, although
> it is not clear how many of those could be used with gso. Perhaps all
> packet counting should be wrapped in a helper fn.
> 
> Signed-off-by: Debabrata Banerjee 
> ---
>  drivers/net/macvlan.c | 2 +-
>  net/8021q/vlan_core.c | 2 +-
>  net/8021q/vlan_dev.c  | 2 +-
>  3 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
> index fc8d5f1ee1ad..15e67a87f202 100644
> --- a/drivers/net/macvlan.c
> +++ b/drivers/net/macvlan.c
> @@ -566,7 +566,7 @@ static netdev_tx_t macvlan_start_xmit(struct sk_buff *skb,
>  
>   pcpu_stats = this_cpu_ptr(vlan->pcpu_stats);
>   u64_stats_update_begin(&pcpu_stats->syncp);
> - pcpu_stats->tx_packets++;
> + pcpu_stats->tx_packets += max_t(u16, 1, 
> skb_shinfo(skb)->gso_segs);
>   pcpu_stats->tx_bytes += len;
>   u64_stats_update_end(&pcpu_stats->syncp);
>   } else {
> diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c
> index a313165e7a67..e85f6665d0ed 100644
> --- a/net/8021q/vlan_core.c
> +++ b/net/8021q/vlan_core.c
> @@ -62,7 +62,7 @@ bool vlan_do_receive(struct sk_buff **skbp)
>   rx_stats = this_cpu_ptr(vlan_dev_priv(vlan_dev)->vlan_pcpu_stats);
>  
>   u64_stats_update_begin(&rx_stats->syncp);
> - rx_stats->rx_packets++;
> + rx_stats->rx_packets += max_t(u16, 1, skb_shinfo(skb)->gso_segs);
>   rx_stats->rx_bytes += skb->len;
>   if (skb->pkt_type == PACKET_MULTICAST)
>   rx_stats->rx_multicast++;
> diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
> index b2d9c8f27cd7..b28e7535a0b9 100644
> --- a/net/8021q/vlan_dev.c
> +++ b/net/8021q/vlan_dev.c
> @@ -135,7 +135,7 @@ static netdev_tx_t vlan_dev_hard_start_xmit(struct 
> sk_buff *skb,
>  
>   stats = this_cpu_ptr(vlan->vlan_pcpu_stats);
>   u64_stats_update_begin(&stats->syncp);
> - stats->tx_packets++;
> + stats->tx_packets += max_t(u16, 1, skb_shinfo(skb)->gso_segs);
>   stats->tx_bytes += len;
>   u64_stats_update_end(&stats->syncp);
>   } else {

Instead of just checking for the max it might make more sense to do a
check using skb_is_gso, and then if true use gso_segs, otherwise just
default to 1.

Also your bytes are going to be totally messed up as well since the
headers are trimmed in the GSO frames. It might be worthwhile to just
have a branch based on skb_is_gso that sets the packets and bytes based
on the GSO values, and one that sets it for the default case.
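
A rough, untested sketch of that shape for the macvlan hunk (the byte
math below only covers the TCP GSO case and needs linux/tcp.h for
tcp_hdrlen(); a complete version would want the same header accounting
qdisc_pkt_len_init() already does):

	if (skb_is_gso(skb)) {
		unsigned int segs = skb_shinfo(skb)->gso_segs;
		unsigned int hdrs = skb_transport_offset(skb) + tcp_hdrlen(skb);

		pcpu_stats->tx_packets += segs;
		/* each extra segment carries its own copy of the headers */
		pcpu_stats->tx_bytes += len + (segs - 1) * hdrs;
	} else {
		pcpu_stats->tx_packets++;
		pcpu_stats->tx_bytes += len;
	}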




Re: [PATCH net-next 0/2] Add mii_bus to ixgbe driver for dsa devs

2018-11-29 Thread Steve Douthit
On 11/29/18 2:03 PM, Jeff Kirsher wrote:
> On Thu, 2018-11-29 at 18:54 +, Steve Douthit wrote:
>> Most dsa devices currently expect to get a pointer to a mii_bus from
>> platform data of device tree information.
>>
>> The first patch exposes a mii_bus for ixgbe devices.
>>
>> Currently the ixgbe driver only allows accesses to the probed PHY
>> address via ioctls. The second patch changes that behavior to allow
>> access to all addresses.  This should be useful for doing switch peeks &
>> pokes from userspace for development, test, or if theres no dsa driver
>> for a particular switch yet.
>>
>> I've tested the clause 22 portion of this code on a board that has an
>> Intel C3xxx series SoC connected to a Marvell Peridot switch.  I don't
>> have any clause 45 devices or other ixgbe devices to test with.
>>
>> Stephen Douthit (2):
>>ixgbe: register a mdiobus
>>ixgbe: use mii_bus to handle MII related ioctls
>>
>>   drivers/net/ethernet/intel/ixgbe/ixgbe.h  |   2 +
>>   drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  35 ++--
>>   drivers/net/ethernet/intel/ixgbe/ixgbe_phy.c  | 192 ++
>>   drivers/net/ethernet/intel/ixgbe/ixgbe_phy.h  |   2 +
>>   4 files changed, 216 insertions(+), 15 deletions(-)
>>
> 
> I will add these changes to my queue, please remember to send patches
> against the drivers in drivers/net/ethernet/intel/* to
> intel-wired-...@lists.osuosl.org  mailing list as well, so that our
> patchwork project can track the patches.

Sorry about that, I'll add it to the CC list for the next rev.

I started adding another board to the platform driver that needs this
code and found a bug.  The new board has more lanes enabled via soft
straps.  That caused sysfs to complain because I ended up with duplicate
bus IDs from this code:

+   /* Use the position of the device in the PCI hierarchy as the id */
+   snprintf(bus->id, MII_BUS_ID_SIZE, "%s-mdio-%s", ixgbe_driver_name,
+pci_name(pdev->bus->self));

Please drop this series for now.

Thinking about this some more I probably need to ensure that only a
single mii_bus is registered out of all the discovered x550em_a devices.
More generally there's not always going to be a 1:1 mapping between MACs
and MIIs on these devices and the code should handle that.

I think for the next rev I'll only register a single mii_bus for the
lowest numbered bus:device.function x500em_a device, and skip mii_bus
setup entirely for other device IDs.

Let me know if there's another/better way to tackle that.
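
For what it's worth, one way to keep the id unique per registered bus
(just a sketch of the idea, not necessarily what v2 will do) is to key it
off the PCI function that owns the bus rather than its parent bridge:

	/* pdev is unique per PCI function; pdev->bus->self is shared */
	snprintf(bus->id, MII_BUS_ID_SIZE, "%s-mdio-%s",
		 ixgbe_driver_name, pci_name(pdev));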


[RFC PATCH 2/4] net: use indirect call wrappers at GRO network layer

2018-11-29 Thread Paolo Abeni
This avoids an indirect call in the L3 GRO receive path, both
for ipv4 and ipv6, provided the latter is not compiled as a module.

Note that when IPv6 is compiled as builtin, it will be checked first,
so we have a single additional compare for the more common path.

Signed-off-by: Paolo Abeni 
---
 include/net/inet_common.h |  2 ++
 net/core/dev.c| 10 --
 net/ipv4/af_inet.c|  4 
 net/ipv6/ip6_offload.c|  4 
 4 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/net/inet_common.h b/include/net/inet_common.h
index 3ca969cbd161..56e7592811ea 100644
--- a/include/net/inet_common.h
+++ b/include/net/inet_common.h
@@ -2,6 +2,8 @@
 #ifndef _INET_COMMON_H
 #define _INET_COMMON_H
 
+#include 
+
 extern const struct proto_ops inet_stream_ops;
 extern const struct proto_ops inet_dgram_ops;
 
diff --git a/net/core/dev.c b/net/core/dev.c
index f69b2fcdee40..619f5902600f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -145,6 +145,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "net-sysfs.h"
 
@@ -5306,6 +5307,7 @@ static void flush_all_backlogs(void)
put_online_cpus();
 }
 
+INDIRECT_CALLABLE_DECLARE_2(int, network_gro_complete, struct sk_buff *, int);
 static int napi_gro_complete(struct sk_buff *skb)
 {
struct packet_offload *ptype;
@@ -5325,7 +5327,8 @@ static int napi_gro_complete(struct sk_buff *skb)
if (ptype->type != type || !ptype->callbacks.gro_complete)
continue;
 
-   err = ptype->callbacks.gro_complete(skb, 0);
+   err = INDIRECT_CALL_INET(ptype->callbacks.gro_complete,
+network_gro_complete, skb, 0);
break;
}
rcu_read_unlock();
@@ -5472,6 +5475,8 @@ static void gro_flush_oldest(struct list_head *head)
napi_gro_complete(oldest);
 }
 
+INDIRECT_CALLABLE_DECLARE_2(struct sk_buff *, network_gro_receive,
+   struct list_head *, struct sk_buff *);
 static enum gro_result dev_gro_receive(struct napi_struct *napi, struct 
sk_buff *skb)
 {
u32 hash = skb_get_hash_raw(skb) & (GRO_HASH_BUCKETS - 1);
@@ -5521,7 +5526,8 @@ static enum gro_result dev_gro_receive(struct napi_struct 
*napi, struct sk_buff
NAPI_GRO_CB(skb)->csum_valid = 0;
}
 
-   pp = ptype->callbacks.gro_receive(gro_head, skb);
+   pp = INDIRECT_CALL_INET(ptype->callbacks.gro_receive,
+   network_gro_receive, gro_head, skb);
break;
}
rcu_read_unlock();
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 326c422c22f8..04ab7ebd6e9b 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1505,6 +1505,8 @@ struct sk_buff *inet_gro_receive(struct list_head *head, 
struct sk_buff *skb)
return pp;
 }
 EXPORT_SYMBOL(inet_gro_receive);
+INDIRECT_CALLABLE(inet_gro_receive, 1, struct sk_buff *, network_gro_receive,
+ struct list_head *, struct sk_buff *);
 
 static struct sk_buff *ipip_gro_receive(struct list_head *head,
struct sk_buff *skb)
@@ -1589,6 +1591,8 @@ int inet_gro_complete(struct sk_buff *skb, int nhoff)
return err;
 }
 EXPORT_SYMBOL(inet_gro_complete);
+INDIRECT_CALLABLE(inet_gro_complete, 1, int, network_gro_complete,
+ struct sk_buff *, int);
 
 static int ipip_gro_complete(struct sk_buff *skb, int nhoff)
 {
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index 70f525c33cb6..a1c2bfb2ce0d 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -270,6 +270,8 @@ static struct sk_buff *ipv6_gro_receive(struct list_head 
*head,
 
return pp;
 }
+INDIRECT_CALLABLE(ipv6_gro_receive, 2, struct sk_buff *, network_gro_receive,
+ struct list_head *, struct sk_buff *);
 
 static struct sk_buff *sit_ip6ip6_gro_receive(struct list_head *head,
  struct sk_buff *skb)
@@ -327,6 +329,8 @@ static int ipv6_gro_complete(struct sk_buff *skb, int nhoff)
 
return err;
 }
+INDIRECT_CALLABLE(ipv6_gro_complete, 2, int, network_gro_complete,
+ struct sk_buff *, int);
 
 static int sit_gro_complete(struct sk_buff *skb, int nhoff)
 {
-- 
2.19.2



Re: [PATCH net] net: Prevent invalid access to skb->prev in __qdisc_drop_all

2018-11-29 Thread Eric Dumazet



On 11/29/2018 02:55 PM, Christoph Paasch wrote:
> On 29/11/18 - 14:44:44, Eric Dumazet wrote:
>>
>>
>> On 11/29/2018 02:27 PM, Christoph Paasch wrote:
>>> There are places in the stack, where we access skb->prev directly and
>>> modify it. Namely, __qdisc_drop_all().
>>>
>>> With commit 68d2f84a1368 ("net: gro: properly remove skb from list")
>>> the skb-list handling has been changed to set skb->next to NULL and set
>>> the list-poison on skb->prev.
>>>
>>> With that change, __qdisc_drop_all() will panic when it tries to
>>> dereference skb->prev.
>>>
>>> Since commit 992cba7e276d ("net: Add and use skb_list_del_init().")
>>> __list_del_entry is used, leaving skb->prev unchanged (thus,
>>> pointing to the list-head if it's the first skb of the list).
>>> This will make __qdisc_drop_all modify the next-pointer of the list-head
>>> and result in a panic later on:
>>>
>>> [   34.501053] general protection fault:  [#1] SMP KASAN PTI
>>> [   34.501968] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.20.0-rc2.mptcp 
>>> #108
>>> [   34.502887] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>>> 0.5.1 01/01/2011
>>> [   34.504074] RIP: 0010:dev_gro_receive+0x343/0x1f90
>>> [   34.504751] Code: e0 48 c1 e8 03 42 80 3c 30 00 0f 85 4a 1c 00 00 4d 8b 
>>> 24 24 4c 39 65 d0 0f 84 0a 04 00 00 49 8d 7c 24 38 48 89 f8 48 c1 e8 03 
>>> <42> 0f b6 04 30 84 c0 74 08 3c 04
>>> [   34.507060] RSP: 0018:8883af507930 EFLAGS: 00010202
>>> [   34.507761] RAX: 0007 RBX: 8883970b2c80 RCX: 
>>> 111072e165a6
>>> [   34.508640] RDX: 111075867008 RSI: 8883ac338040 RDI: 
>>> 0038
>>> [   34.509493] RBP: 8883af5079d0 R08: 8883970b2d40 R09: 
>>> 0062
>>> [   34.510346] R10: 0034 R11:  R12: 
>>> 
>>> [   34.511215] R13:  R14: dc00 R15: 
>>> 8883ac338008
>>> [   34.512082] FS:  () GS:8883af50() 
>>> knlGS:
>>> [   34.513036] CS:  0010 DS:  ES:  CR0: 80050033
>>> [   34.513741] CR2: 55ccc3e9d020 CR3: 0003abf32000 CR4: 
>>> 06e0
>>> [   34.514593] Call Trace:
>>> [   34.514893]  
>>> [   34.515157]  napi_gro_receive+0x93/0x150
>>> [   34.515632]  receive_buf+0x893/0x3700
>>> [   34.516094]  ? __netif_receive_skb+0x1f/0x1a0
>>> [   34.516629]  ? virtnet_probe+0x1b40/0x1b40
>>> [   34.517153]  ? __stable_node_chain+0x4d0/0x850
>>> [   34.517684]  ? kfree+0x9a/0x180
>>> [   34.518067]  ? __kasan_slab_free+0x171/0x190
>>> [   34.518582]  ? detach_buf+0x1df/0x650
>>> [   34.519061]  ? lapic_next_event+0x5a/0x90
>>> [   34.519539]  ? virtqueue_get_buf_ctx+0x280/0x7f0
>>> [   34.520093]  virtnet_poll+0x2df/0xd60
>>> [   34.520533]  ? receive_buf+0x3700/0x3700
>>> [   34.521027]  ? qdisc_watchdog_schedule_ns+0xd5/0x140
>>> [   34.521631]  ? htb_dequeue+0x1817/0x25f0
>>> [   34.522107]  ? sch_direct_xmit+0x142/0xf30
>>> [   34.522595]  ? virtqueue_napi_schedule+0x26/0x30
>>> [   34.523155]  net_rx_action+0x2f6/0xc50
>>> [   34.523601]  ? napi_complete_done+0x2f0/0x2f0
>>> [   34.524126]  ? kasan_check_read+0x11/0x20
>>> [   34.524608]  ? _raw_spin_lock+0x7d/0xd0
>>> [   34.525070]  ? _raw_spin_lock_bh+0xd0/0xd0
>>> [   34.525563]  ? kvm_guest_apic_eoi_write+0x6b/0x80
>>> [   34.526130]  ? apic_ack_irq+0x9e/0xe0
>>> [   34.526567]  __do_softirq+0x188/0x4b5
>>> [   34.527015]  irq_exit+0x151/0x180
>>> [   34.527417]  do_IRQ+0xdb/0x150
>>> [   34.527783]  common_interrupt+0xf/0xf
>>> [   34.528223]  
>>>
>>> This patch makes sure that skb->prev is also set to NULL when removing
>>> it from the list.
>>>
>>> The bug is in v4.19.x as well, but the patch can't be backported easily.
>>> I can post a follow-up for that.
>>>
>>> Cc: Prashant Bhole 
>>> Cc: Tyler Hicks 
>>> Fixes: 68d2f84a1368 ("net: gro: properly remove skb from list")
>>> Signed-off-by: Christoph Paasch 
>>> ---
>>>  include/linux/skbuff.h | 1 +
>>>  1 file changed, 1 insertion(+)
>>>
>>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>>> index 0d1b2c3f127b..3bb3bfd390eb 100644
>>> --- a/include/linux/skbuff.h
>>> +++ b/include/linux/skbuff.h
>>> @@ -1373,6 +1373,7 @@ static inline void skb_zcopy_abort(struct sk_buff 
>>> *skb)
>>>  static inline void skb_mark_not_on_list(struct sk_buff *skb)
>>>  {
>>> skb->next = NULL;
>>> +   skb->prev = NULL;
>>>  }
>>>  
>>>  static inline void skb_list_del_init(struct sk_buff *skb)
>>>
>>
>> skb_mark_not_on_list() is used in many place where we do not care of 
>> skb->prev
>>
>> What about fixing netem instead ?
> 
> Yes, I have been looking at that and Alexey's patch which introduced the
> access to skb->prev (cfr.: https://patchwork.ozlabs.org/patch/880717/).
> 
> But then I thought that setting skb->prev to NULL is a less risky approach for
> -stable.
> 
> 
> How would you go about fixing netem instead?
> 
> Because, from what I see we basically can enter netem_enqueue here with two
> different 

Re: [PATCH bpf-next] tools/bpf: make libbpf _GNU_SOURCE friendly

2018-11-29 Thread Yonghong Song


On 11/29/18 1:00 PM, Jakub Kicinski wrote:
> On Thu, 29 Nov 2018 12:38:03 -0800, Yonghong Song wrote:
>> diff --git a/tools/lib/bpf/libbpf_errno.c b/tools/lib/bpf/libbpf_errno.c
>> index d83b17f8435c..286e497c50ec 100644
>> --- a/tools/lib/bpf/libbpf_errno.c
>> +++ b/tools/lib/bpf/libbpf_errno.c
>> @@ -40,9 +40,19 @@ int libbpf_strerror(int err, char *buf, size_t size)
>>  err = err > 0 ? err : -err;
>>   
>>  if (err < __LIBBPF_ERRNO__START) {
>> +#ifdef _GNU_SOURCE
>> +const char *ret_buf;
>> +#endif
>>  int ret;
>>   
>> +#ifdef _GNU_SOURCE
>> +ret_buf = strerror_r(err, buf, size);
>> +if (ret_buf != buf)
>> +snprintf(buf, size, "%s", ret_buf);
>> +ret = 0;
>> +#else
>>  ret = strerror_r(err, buf, size);
>> +#endif
>>  buf[size - 1] = '\0';
>>  return ret;
>>  }
> 
> That is kinda strange, the whole point for this file was to have
> non-GNU strerror_r, would doing #undef _GNU_SOURCE at the top not work?

Thanks for the suggestion, as I did not know the history of this file.
Yes, #undef _GNU_SOURCE works. Will send a revision shortly.
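
For reference, the two prototypes glibc can expose, which is what the
#undef is really selecting between:

/* XSI-compliant version, what libbpf_errno.c is written against;
 * always fills buf and returns an int error code: */
int strerror_r(int errnum, char *buf, size_t buflen);

/* GNU version, chosen when _GNU_SOURCE is defined before <string.h>;
 * may ignore buf and return a pointer to an internal static string: */
char *strerror_r(int errnum, char *buf, size_t buflen);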


Re: [PATCH net] net: Prevent invalid access to skb->prev in __qdisc_drop_all

2018-11-29 Thread Christoph Paasch
On 29/11/18 - 15:09:18, Eric Dumazet wrote:
> 
> 
> On 11/29/2018 02:55 PM, Christoph Paasch wrote:
> > On 29/11/18 - 14:44:44, Eric Dumazet wrote:
> >>
> >>
> >> On 11/29/2018 02:27 PM, Christoph Paasch wrote:
> >>> There are places in the stack, where we access skb->prev directly and
> >>> modify it. Namely, __qdisc_drop_all().
> >>>
> >>> With commit 68d2f84a1368 ("net: gro: properly remove skb from list")
> >>> the skb-list handling has been changed to set skb->next to NULL and set
> >>> the list-poison on skb->prev.
> >>>
> >>> With that change, __qdisc_drop_all() will panic when it tries to
> >>> dereference skb->prev.
> >>>
> >>> Since commit 992cba7e276d ("net: Add and use skb_list_del_init().")
> >>> __list_del_entry is used, leaving skb->prev unchanged (thus,
> >>> pointing to the list-head if it's the first skb of the list).
> >>> This will make __qdisc_drop_all modify the next-pointer of the list-head
> >>> and result in a panic later on:
> >>>
> >>> [   34.501053] general protection fault:  [#1] SMP KASAN PTI
> >>> [   34.501968] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.20.0-rc2.mptcp 
> >>> #108
> >>> [   34.502887] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
> >>> BIOS 0.5.1 01/01/2011
> >>> [   34.504074] RIP: 0010:dev_gro_receive+0x343/0x1f90
> >>> [   34.504751] Code: e0 48 c1 e8 03 42 80 3c 30 00 0f 85 4a 1c 00 00 4d 
> >>> 8b 24 24 4c 39 65 d0 0f 84 0a 04 00 00 49 8d 7c 24 38 48 89 f8 48 c1 e8 
> >>> 03 <42> 0f b6 04 30 84 c0 74 08 3c 04
> >>> [   34.507060] RSP: 0018:8883af507930 EFLAGS: 00010202
> >>> [   34.507761] RAX: 0007 RBX: 8883970b2c80 RCX: 
> >>> 111072e165a6
> >>> [   34.508640] RDX: 111075867008 RSI: 8883ac338040 RDI: 
> >>> 0038
> >>> [   34.509493] RBP: 8883af5079d0 R08: 8883970b2d40 R09: 
> >>> 0062
> >>> [   34.510346] R10: 0034 R11:  R12: 
> >>> 
> >>> [   34.511215] R13:  R14: dc00 R15: 
> >>> 8883ac338008
> >>> [   34.512082] FS:  () GS:8883af50() 
> >>> knlGS:
> >>> [   34.513036] CS:  0010 DS:  ES:  CR0: 80050033
> >>> [   34.513741] CR2: 55ccc3e9d020 CR3: 0003abf32000 CR4: 
> >>> 06e0
> >>> [   34.514593] Call Trace:
> >>> [   34.514893]  
> >>> [   34.515157]  napi_gro_receive+0x93/0x150
> >>> [   34.515632]  receive_buf+0x893/0x3700
> >>> [   34.516094]  ? __netif_receive_skb+0x1f/0x1a0
> >>> [   34.516629]  ? virtnet_probe+0x1b40/0x1b40
> >>> [   34.517153]  ? __stable_node_chain+0x4d0/0x850
> >>> [   34.517684]  ? kfree+0x9a/0x180
> >>> [   34.518067]  ? __kasan_slab_free+0x171/0x190
> >>> [   34.518582]  ? detach_buf+0x1df/0x650
> >>> [   34.519061]  ? lapic_next_event+0x5a/0x90
> >>> [   34.519539]  ? virtqueue_get_buf_ctx+0x280/0x7f0
> >>> [   34.520093]  virtnet_poll+0x2df/0xd60
> >>> [   34.520533]  ? receive_buf+0x3700/0x3700
> >>> [   34.521027]  ? qdisc_watchdog_schedule_ns+0xd5/0x140
> >>> [   34.521631]  ? htb_dequeue+0x1817/0x25f0
> >>> [   34.522107]  ? sch_direct_xmit+0x142/0xf30
> >>> [   34.522595]  ? virtqueue_napi_schedule+0x26/0x30
> >>> [   34.523155]  net_rx_action+0x2f6/0xc50
> >>> [   34.523601]  ? napi_complete_done+0x2f0/0x2f0
> >>> [   34.524126]  ? kasan_check_read+0x11/0x20
> >>> [   34.524608]  ? _raw_spin_lock+0x7d/0xd0
> >>> [   34.525070]  ? _raw_spin_lock_bh+0xd0/0xd0
> >>> [   34.525563]  ? kvm_guest_apic_eoi_write+0x6b/0x80
> >>> [   34.526130]  ? apic_ack_irq+0x9e/0xe0
> >>> [   34.526567]  __do_softirq+0x188/0x4b5
> >>> [   34.527015]  irq_exit+0x151/0x180
> >>> [   34.527417]  do_IRQ+0xdb/0x150
> >>> [   34.527783]  common_interrupt+0xf/0xf
> >>> [   34.528223]  
> >>>
> >>> This patch makes sure that skb->prev is also set to NULL when removing
> >>> it from the list.
> >>>
> >>> The bug is in v4.19.x as well, but the patch can't be backported easily.
> >>> I can post a follow-up for that.
> >>>
> >>> Cc: Prashant Bhole 
> >>> Cc: Tyler Hicks 
> >>> Fixes: 68d2f84a1368 ("net: gro: properly remove skb from list")
> >>> Signed-off-by: Christoph Paasch 
> >>> ---
> >>>  include/linux/skbuff.h | 1 +
> >>>  1 file changed, 1 insertion(+)
> >>>
> >>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> >>> index 0d1b2c3f127b..3bb3bfd390eb 100644
> >>> --- a/include/linux/skbuff.h
> >>> +++ b/include/linux/skbuff.h
> >>> @@ -1373,6 +1373,7 @@ static inline void skb_zcopy_abort(struct sk_buff 
> >>> *skb)
> >>>  static inline void skb_mark_not_on_list(struct sk_buff *skb)
> >>>  {
> >>>   skb->next = NULL;
> >>> + skb->prev = NULL;
> >>>  }
> >>>  
> >>>  static inline void skb_list_del_init(struct sk_buff *skb)
> >>>
> >>
> >> skb_mark_not_on_list() is used in many place where we do not care of 
> >> skb->prev
> >>
> >> What about fixing netem instead ?
> > 
> > Yes, I have been looking at that and Alexey's patch which introduced the
> > access to skb->prev (cfr.: 

Re: [RFC PATCH] net: mvpp2: fix detection of 10G SFP modules

2018-11-29 Thread Russell King - ARM Linux
On Thu, Nov 29, 2018 at 11:31:23AM -0800, Florian Fainelli wrote:
> 
> 
> On 11/29/2018 4:49 AM, Baruch Siach wrote:
> > The mvpp2_phylink_validate() relies on the interface field of
> > phylink_link_state to determine valid link modes. However, when called
> > from phylink_sfp_module_insert() this field in not initialized. The
> > default switch case then excludes 10G link modes. This allows 10G SFP
> > modules that are detected correctly to be configured at max rate of
> > 2.5G.
> > 
> > Catch the uninitialized PHY mode case, and allow 10G rates.
> > 
> > Cc: Maxime Chevallier 
> > Cc: Antoine Tenart 
> > Signed-off-by: Baruch Siach 
> > ---
> > Is that the right fix?
> 
> It would be a bit surprising that this is the right fix, you would
> expect validate to be called once everything has been parsed
> successfully from the SFP, is not that the case here? If not, can you
> find out what happens?

Two calls are made - the first with PHY_INTERFACE_MODE_NA to
determine what the advertising link mode may be, and then again
once the interface mode has been selected from the advertising mask.

Why?

Consider a 4.3Mbps fiberchannel SFP plugged into a 1G-only MAC.
If we did it as a single pass, we would end up passing an
interface mode of 2500BASEX first time around which is illogical.
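
So on that first pass the driver has to treat PHY_INTERFACE_MODE_NA as
"whatever the port could do". Roughly the shape of the change under
discussion (a sketch reconstructed from the cover text, not the actual
diff):

	switch (state->interface) {
	case PHY_INTERFACE_MODE_NA:	/* first pass: mode not selected yet */
	case PHY_INTERFACE_MODE_10GKR:
	case PHY_INTERFACE_MODE_XAUI:
		phylink_set(mask, 10000baseKR_Full);
		phylink_set(mask, 10000baseSR_Full);
		phylink_set(mask, 10000baseCR_Full);
		/* fall through so the lower rates stay valid as well */
	default:
		phylink_set(mask, 10baseT_Full);
		phylink_set(mask, 100baseT_Full);
		phylink_set(mask, 1000baseT_Full);
		break;
	}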

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up


Re: [Patch net-next] net: explain __skb_checksum_complete() with comments

2018-11-29 Thread David Miller
From: Cong Wang 
Date: Mon, 26 Nov 2018 09:31:26 -0800

> Cc: Herbert Xu 
> Signed-off-by: Cong Wang 

Applied, thanks Cong.


[RFC PATCH 3/4] net: use indirect call wrapper at GRO transport layer

2018-11-29 Thread Paolo Abeni
This avoids an indirect call in the receive path for TCP and UDP
packets. TCP takes precedence over UDP, so that we have a single
additional conditional in the common case.

Signed-off-by: Paolo Abeni 
---
 include/net/inet_common.h |  7 +++
 net/ipv4/af_inet.c| 11 +--
 net/ipv4/tcp_offload.c|  5 +
 net/ipv4/udp_offload.c|  5 +
 net/ipv6/ip6_offload.c| 10 --
 net/ipv6/tcpv6_offload.c  |  5 +
 net/ipv6/udp_offload.c|  5 +
 7 files changed, 44 insertions(+), 4 deletions(-)

diff --git a/include/net/inet_common.h b/include/net/inet_common.h
index 56e7592811ea..667bb8247f9a 100644
--- a/include/net/inet_common.h
+++ b/include/net/inet_common.h
@@ -56,4 +56,11 @@ static inline void inet_ctl_sock_destroy(struct sock *sk)
sock_release(sk->sk_socket);
 }
 
+#define indirect_call_gro_receive(name, cb, head, skb) \
+({ \
+   unlikely(gro_recursion_inc_test(skb)) ? \
+   NAPI_GRO_CB(skb)->flush |= 1, NULL :\
+   INDIRECT_CALL_2(cb, name, head, skb);   \
+})
+
 #endif
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 04ab7ebd6e9b..774f183f56e3 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1385,6 +1385,8 @@ struct sk_buff *inet_gso_segment(struct sk_buff *skb,
 }
 EXPORT_SYMBOL(inet_gso_segment);
 
+INDIRECT_CALLABLE_DECLARE_2(struct sk_buff *, transport4_gro_receive,
+   struct list_head *head, struct sk_buff *skb);
 struct sk_buff *inet_gro_receive(struct list_head *head, struct sk_buff *skb)
 {
const struct net_offload *ops;
@@ -1494,7 +1496,8 @@ struct sk_buff *inet_gro_receive(struct list_head *head, 
struct sk_buff *skb)
skb_gro_pull(skb, sizeof(*iph));
skb_set_transport_header(skb, skb_gro_offset(skb));
 
-   pp = call_gro_receive(ops->callbacks.gro_receive, head, skb);
+   pp = indirect_call_gro_receive(transport4_gro_receive,
+  ops->callbacks.gro_receive, head, skb);
 
 out_unlock:
rcu_read_unlock();
@@ -1558,6 +1561,8 @@ int inet_recv_error(struct sock *sk, struct msghdr *msg, 
int len, int *addr_len)
return -EINVAL;
 }
 
+INDIRECT_CALLABLE_DECLARE_2(int, transport4_gro_complete, struct sk_buff *skb,
+   int);
 int inet_gro_complete(struct sk_buff *skb, int nhoff)
 {
__be16 newlen = htons(skb->len - nhoff);
@@ -1583,7 +1588,9 @@ int inet_gro_complete(struct sk_buff *skb, int nhoff)
 * because any hdr with option will have been flushed in
 * inet_gro_receive().
 */
-   err = ops->callbacks.gro_complete(skb, nhoff + sizeof(*iph));
+   err = INDIRECT_CALL_2(ops->callbacks.gro_complete,
+ transport4_gro_complete, skb,
+ nhoff + sizeof(*iph));
 
 out_unlock:
rcu_read_unlock();
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index 870b0a335061..3d5dfac4cd1b 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -10,6 +10,7 @@
  * TCPv4 GSO/GRO support
  */
 
+#include <linux/indirect_call_wrapper.h>
 #include 
 #include 
 #include 
@@ -317,6 +318,8 @@ static struct sk_buff *tcp4_gro_receive(struct list_head 
*head, struct sk_buff *
 
return tcp_gro_receive(head, skb);
 }
+INDIRECT_CALLABLE(tcp4_gro_receive, 2, struct sk_buff *, 
transport4_gro_receive,
+ struct list_head *, struct sk_buff *);
 
 static int tcp4_gro_complete(struct sk_buff *skb, int thoff)
 {
@@ -332,6 +335,8 @@ static int tcp4_gro_complete(struct sk_buff *skb, int thoff)
 
return tcp_gro_complete(skb);
 }
+INDIRECT_CALLABLE(tcp4_gro_complete, 2, int, transport4_gro_complete,
+ struct sk_buff *, int);
 
 static const struct net_offload tcpv4_offload = {
.callbacks = {
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 0646d61f4fa8..c3c5b237c8e0 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include <linux/indirect_call_wrapper.h>
 
 static struct sk_buff *__skb_udp_tunnel_segment(struct sk_buff *skb,
netdev_features_t features,
@@ -477,6 +478,8 @@ static struct sk_buff *udp4_gro_receive(struct list_head 
*head,
NAPI_GRO_CB(skb)->flush = 1;
return NULL;
 }
+INDIRECT_CALLABLE(udp4_gro_receive, 1, struct sk_buff *, 
transport4_gro_receive,
+ struct list_head *, struct sk_buff *);
 
 static int udp_gro_complete_segment(struct sk_buff *skb)
 {
@@ -536,6 +539,8 @@ static int udp4_gro_complete(struct sk_buff *skb, int nhoff)
 
return udp_gro_complete(skb, nhoff, udp4_lib_lookup_skb);
 }
+INDIRECT_CALLABLE(udp4_gro_complete, 1, int, transport4_gro_complete,
+ struct sk_buff *, int);
 
 static const struct net_offload udpv4_offload = {
.callbacks = {
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index a1c2bfb2ce0d..eeca4164a155 100644
--- 

[RFC PATCH 4/4] udp: use indirect call wrapper for GRO socket lookup

2018-11-29 Thread Paolo Abeni
This avoids another indirect call for UDP GRO. Again, the test
for the IPv6 variant is performed first.

Signed-off-by: Paolo Abeni 
---
 net/ipv4/udp.c | 2 ++
 net/ipv4/udp_offload.c | 6 --
 net/ipv6/udp.c | 2 ++
 3 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index aff2a8e99e01..9ea851f47598 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -544,6 +544,8 @@ struct sock *udp4_lib_lookup_skb(struct sk_buff *skb,
return __udp4_lib_lookup_skb(skb, sport, dport, &udp_table);
 }
 EXPORT_SYMBOL_GPL(udp4_lib_lookup_skb);
+INDIRECT_CALLABLE(udp4_lib_lookup_skb, 1, struct sock *, udp_lookup,
+ struct sk_buff *skb, __be16 sport, __be16 dport);
 
 /* Must be called under rcu_read_lock().
  * Does increment socket refcount.
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index c3c5b237c8e0..0ccd2aa1ab98 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -392,6 +392,8 @@ static struct sk_buff *udp_gro_receive_segment(struct 
list_head *head,
return NULL;
 }
 
+INDIRECT_CALLABLE_DECLARE_2(struct sock *, udp_lookup, struct sk_buff *skb,
+   __be16 sport, __be16 dport);
 struct sk_buff *udp_gro_receive(struct list_head *head, struct sk_buff *skb,
struct udphdr *uh, udp_lookup_t lookup)
 {
@@ -403,7 +405,7 @@ struct sk_buff *udp_gro_receive(struct list_head *head, 
struct sk_buff *skb,
struct sock *sk;
 
rcu_read_lock();
-   sk = (*lookup)(skb, uh->source, uh->dest);
+   sk = INDIRECT_CALL_INET(lookup, udp_lookup, skb, uh->source, uh->dest);
if (!sk)
goto out_unlock;
 
@@ -505,7 +507,7 @@ int udp_gro_complete(struct sk_buff *skb, int nhoff,
uh->len = newlen;
 
rcu_read_lock();
-   sk = (*lookup)(skb, uh->source, uh->dest);
+   sk = INDIRECT_CALL_INET(lookup, udp_lookup, skb, uh->source, uh->dest);
if (sk && udp_sk(sk)->gro_enabled) {
err = udp_gro_complete_segment(skb);
} else if (sk && udp_sk(sk)->gro_complete) {
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 09cba4cfe31f..616f374760d1 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -282,6 +282,8 @@ struct sock *udp6_lib_lookup_skb(struct sk_buff *skb,
inet6_sdif(skb), &udp_table, skb);
 }
 EXPORT_SYMBOL_GPL(udp6_lib_lookup_skb);
+INDIRECT_CALLABLE(udp6_lib_lookup_skb, 2, struct sock *, udp_lookup,
+ struct sk_buff *skb, __be16 sport, __be16 dport);
 
 /* Must be called under rcu_read_lock().
  * Does increment socket refcount.
-- 
2.19.2



[RFC PATCH 1/4] indirect call wrappers: helpers to speed-up indirect calls of builtin

2018-11-29 Thread Paolo Abeni
This header defines a bunch of helpers that allow avoiding the
retpoline overhead when calling builtin functions via function pointers.
It boils down to explicitly comparing the function pointers to
known builtin functions and eventually invoking the latter directly.

The macros defined here implement the boilerplate for the above schema
and will be used by the next patches.

Suggested-by: Eric Dumazet 
Signed-off-by: Paolo Abeni 
---
 include/linux/indirect_call_wrapper.h | 77 +++
 1 file changed, 77 insertions(+)
 create mode 100644 include/linux/indirect_call_wrapper.h

diff --git a/include/linux/indirect_call_wrapper.h 
b/include/linux/indirect_call_wrapper.h
new file mode 100644
index ..57e82b4a166d
--- /dev/null
+++ b/include/linux/indirect_call_wrapper.h
@@ -0,0 +1,77 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_INDIRECT_CALL_WRAPPER_H
+#define _LINUX_INDIRECT_CALL_WRAPPER_H
+
+#ifdef CONFIG_RETPOLINE
+
+/*
+ * INDIRECT_CALL_$NR - wrapper for indirect calls with $NR known builtin
+ *  @f: function pointer
+ *  @name: base name for builtin functions, see INDIRECT_CALLABLE_DECLARE_$NR
+ *  @__VA_ARGS__: arguments for @f
+ *
+ * Avoid retpoline overhead for known builtin, checking @f vs each of them and
+ * eventually invoking directly the builtin function. Fallback to the indirect
+ * call
+ */
+#define INDIRECT_CALL_1(f, name, ...)  \
+   ({  \
+   f == name ## 1 ? name ## 1(__VA_ARGS__) :   \
+f(__VA_ARGS__);\
+   })
+#define INDIRECT_CALL_2(f, name, ...)  \
+   ({  \
+   f == name ## 2 ? name ## 2(__VA_ARGS__) :   \
+INDIRECT_CALL_1(f, name, __VA_ARGS__); \
+   })
+
+/*
+ * INDIRECT_CALLABLE_DECLARE_$NR - declare $NR known builtin for
+ * INDIRECT_CALL_$NR usage
+ *  @type: return type for the builtin function
+ *  @name: base name for builtin functions, the full list is generated 
appending
+ *the numbers in the 1..@NR range
+ *  @__VA_ARGS__: arguments type list for the builtin function
+ *
+ * Builtin with higher $NR will be checked first by INDIRECT_CALL_$NR
+ */
+#define INDIRECT_CALLABLE_DECLARE_1(type, name, ...)   \
+   extern type name ## 1(__VA_ARGS__)
+#define INDIRECT_CALLABLE_DECLARE_2(type, name, ...)   \
+   extern type name ## 2(__VA_ARGS__); \
+   INDIRECT_CALLABLE_DECLARE_1(type, name, __VA_ARGS__)
+
+/*
+ * INDIRECT_CALLABLE - allow usage of a builtin function from INDIRECT_CALL_$NR
+ *  @f: builtin function name
+ *  @nr: id associated with this builtin, higher values will be checked first 
by
+ *  INDIRECT_CALL_$NR
+ *  @type: function return type
+ *  @name: base name used by INDIRECT_CALL_ to access the builtin list
+ *  @__VA_ARGS__: arguments type list for @f
+ */
+#define INDIRECT_CALLABLE(f, nr, type, name, ...)  \
+   __alias(f) type name ## nr(__VA_ARGS__)
+
+#else
+#define INDIRECT_CALL_1(f, name, ...) f(__VA_ARGS__)
+#define INDIRECT_CALL_2(f, name, ...) f(__VA_ARGS__)
+#define INDIRECT_CALLABLE_DECLARE_1(type, name, ...)
+#define INDIRECT_CALLABLE_DECLARE_2(type, name, ...)
+#define INDIRECT_CALLABLE(f, nr, type, name, ...)
+#endif
+
+/*
+ * We can use INDIRECT_CALL_$NR for ipv6 related functions only if ipv6 is
+ * builtin, this macro simplify dealing with indirect calls with only ipv4/ipv6
+ * alternatives
+ */
+#if IS_BUILTIN(CONFIG_IPV6)
+#define INDIRECT_CALL_INET INDIRECT_CALL_2
+#elif IS_ENABLED(CONFIG_INET)
+#define INDIRECT_CALL_INET INDIRECT_CALL_1
+#else
+#define INDIRECT_CALL_INET(...)
+#endif
+
+#endif
-- 
2.19.2



[RFC PATCH 0/4] net: mitigate retpoline overhead

2018-11-29 Thread Paolo Abeni
The spectre v2 counter-measures, aka retpolines, are a source of measurable
overhead[1]. We can partially address that, when the function pointer refers to
a builtin symbol, by resorting to a list of tests vs well-known builtin
functions and direct calls.
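
In its simplest form the idea is just the following (illustration only;
'known_cb' is a placeholder name, not a real kernel symbol):

	if (f == known_cb)
		ret = known_cb(skb);	/* direct call, no retpoline thunk */
	else
		ret = f(skb);		/* indirect call through the retpoline */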

This may lead to some uglification around the indirect calls. In netconf 2018
Eric Dumazet described a technique to hide the most relevant part of the needed
boilerplate with some macro help.

This series is a [re-]implementation of such idea, exposing the introduced
helpers in a new header file. They are later leveraged to avoid the indirect
call overhead in the GRO path, when possible.

Overall this gives > 10% performance improvement for the UDP GRO benchmark,
and a smaller but measurable improvement for TCP syn flood.

The added infra can be used in follow-up patches to cope with retpoline overhead
in other points of the networking stack (e.g. at the qdisc layer) and possibly
even in other subsystems.

Paolo Abeni (4):
  indirect call wrappers: helpers to speed-up indirect calls of builtin
  net: use indirect call wrappers at GRO network layer
  net: use indirect call wrapper at GRO transport layer
  udp: use indirect call wrapper for GRO socket lookup

 include/linux/indirect_call_wrapper.h | 77 +++
 include/net/inet_common.h |  9 
 net/core/dev.c| 10 +++-
 net/ipv4/af_inet.c| 15 +-
 net/ipv4/tcp_offload.c|  5 ++
 net/ipv4/udp.c|  2 +
 net/ipv4/udp_offload.c| 11 +++-
 net/ipv6/ip6_offload.c| 14 -
 net/ipv6/tcpv6_offload.c  |  5 ++
 net/ipv6/udp.c|  2 +
 net/ipv6/udp_offload.c|  5 ++
 11 files changed, 147 insertions(+), 8 deletions(-)
 create mode 100644 include/linux/indirect_call_wrapper.h

-- 
2.19.2



[PATCH v2 net] net: Prevent invalid access to skb->prev in __qdisc_drop_all

2018-11-29 Thread Christoph Paasch
__qdisc_drop_all() accesses skb->prev to get to the tail of the
segment-list.

With commit 68d2f84a1368 ("net: gro: properly remove skb from list")
the skb-list handling has been changed to set skb->next to NULL and set
the list-poison on skb->prev.

With that change, __qdisc_drop_all() will panic when it tries to
dereference skb->prev.

Since commit 992cba7e276d ("net: Add and use skb_list_del_init().")
__list_del_entry is used, leaving skb->prev unchanged (thus,
pointing to the list-head if it's the first skb of the list).
This will make __qdisc_drop_all modify the next-pointer of the list-head
and result in a panic later on:

[   34.501053] general protection fault:  [#1] SMP KASAN PTI
[   34.501968] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.20.0-rc2.mptcp #108
[   34.502887] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
0.5.1 01/01/2011
[   34.504074] RIP: 0010:dev_gro_receive+0x343/0x1f90
[   34.504751] Code: e0 48 c1 e8 03 42 80 3c 30 00 0f 85 4a 1c 00 00 4d 8b 24 
24 4c 39 65 d0 0f 84 0a 04 00 00 49 8d 7c 24 38 48 89 f8 48 c1 e8 03 <42> 0f b6 
04 30 84 c0 74 08 3c 04
[   34.507060] RSP: 0018:8883af507930 EFLAGS: 00010202
[   34.507761] RAX: 0007 RBX: 8883970b2c80 RCX: 111072e165a6
[   34.508640] RDX: 111075867008 RSI: 8883ac338040 RDI: 0038
[   34.509493] RBP: 8883af5079d0 R08: 8883970b2d40 R09: 0062
[   34.510346] R10: 0034 R11:  R12: 
[   34.511215] R13:  R14: dc00 R15: 8883ac338008
[   34.512082] FS:  () GS:8883af50() 
knlGS:
[   34.513036] CS:  0010 DS:  ES:  CR0: 80050033
[   34.513741] CR2: 55ccc3e9d020 CR3: 0003abf32000 CR4: 06e0
[   34.514593] Call Trace:
[   34.514893]  
[   34.515157]  napi_gro_receive+0x93/0x150
[   34.515632]  receive_buf+0x893/0x3700
[   34.516094]  ? __netif_receive_skb+0x1f/0x1a0
[   34.516629]  ? virtnet_probe+0x1b40/0x1b40
[   34.517153]  ? __stable_node_chain+0x4d0/0x850
[   34.517684]  ? kfree+0x9a/0x180
[   34.518067]  ? __kasan_slab_free+0x171/0x190
[   34.518582]  ? detach_buf+0x1df/0x650
[   34.519061]  ? lapic_next_event+0x5a/0x90
[   34.519539]  ? virtqueue_get_buf_ctx+0x280/0x7f0
[   34.520093]  virtnet_poll+0x2df/0xd60
[   34.520533]  ? receive_buf+0x3700/0x3700
[   34.521027]  ? qdisc_watchdog_schedule_ns+0xd5/0x140
[   34.521631]  ? htb_dequeue+0x1817/0x25f0
[   34.522107]  ? sch_direct_xmit+0x142/0xf30
[   34.522595]  ? virtqueue_napi_schedule+0x26/0x30
[   34.523155]  net_rx_action+0x2f6/0xc50
[   34.523601]  ? napi_complete_done+0x2f0/0x2f0
[   34.524126]  ? kasan_check_read+0x11/0x20
[   34.524608]  ? _raw_spin_lock+0x7d/0xd0
[   34.525070]  ? _raw_spin_lock_bh+0xd0/0xd0
[   34.525563]  ? kvm_guest_apic_eoi_write+0x6b/0x80
[   34.526130]  ? apic_ack_irq+0x9e/0xe0
[   34.526567]  __do_softirq+0x188/0x4b5
[   34.527015]  irq_exit+0x151/0x180
[   34.527417]  do_IRQ+0xdb/0x150
[   34.527783]  common_interrupt+0xf/0xf
[   34.528223]  

This patch makes sure that skb->prev is set to NULL when entering
netem_enqueue.
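
For reference, __qdisc_drop_all() relies on skb->prev to reach the tail of
the segment list; roughly (paraphrased from include/net/sch_generic.h, not
a verbatim copy):

	static inline void __qdisc_drop_all(struct sk_buff *skb,
					    struct sk_buff **to_free)
	{
		if (skb->prev)
			skb->prev->next = *to_free;	/* the write that goes wrong */
		else
			skb->next = *to_free;
		*to_free = skb;
	}

With skb->prev left pointing at the GRO list head, that first branch
scribbles over the list head's next pointer, which is the corruption that
shows up in the trace above.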

Cc: Prashant Bhole 
Cc: Tyler Hicks 
Cc: Eric Dumazet 
Fixes: 68d2f84a1368 ("net: gro: properly remove skb from list")
Suggested-by: Eric Dumazet 
Signed-off-by: Christoph Paasch 
---

Notes:
v2: - Set skb->prev to NULL in netem_enqueue instead.

 net/sched/sch_netem.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index 2c38e3d07924..22cd46a60057 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -431,6 +431,9 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc 
*sch,
int count = 1;
int rc = NET_XMIT_SUCCESS;
 
+   /* Do not fool qdisc_drop_all() */
+   skb->prev = NULL;
+
/* Random duplication */
if (q->duplicate && q->duplicate >= get_crandom(&q->dup_cor))
++count;
-- 
2.16.2



Re: [PATCHv2 net] sctp: hold transport before accessing its asoc in sctp_epaddr_lookup_transport

2018-11-29 Thread Xin Long
On Fri, Nov 30, 2018 at 5:52 AM Neil Horman  wrote:
>
> On Thu, Nov 29, 2018 at 02:44:07PM +0800, Xin Long wrote:
> > Without holding transport to dereference its asoc, a use after
> > free panic can be caused in sctp_epaddr_lookup_transport. Note
> > that a sock lock can't protect these transports that belong to
> > other socks.
> >
> > A similar fix as Commit bab1be79a516 ("sctp: hold transport
> > before accessing its asoc in sctp_transport_get_next") is
> > needed to hold the transport before accessing its asoc in
> > sctp_epaddr_lookup_transport.
> >
> > Note that this extra atomic operation is on the datapath,
> > but as rhlist keeps the lists to a small size, it won't
> > see a noticeable performance hurt.
> >
> > v1->v2:
> >   - improve the changelog.
> >
> > Fixes: 7fda702f9315 ("sctp: use new rhlist interface on sctp transport 
> > rhashtable")
> > Reported-by: syzbot+aad231d51b1923158...@syzkaller.appspotmail.com
> > Signed-off-by: Xin Long 
> > ---
> >  net/sctp/input.c | 10 --
> >  1 file changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/net/sctp/input.c b/net/sctp/input.c
> > index 5c36a99..ce7351c 100644
> > --- a/net/sctp/input.c
> > +++ b/net/sctp/input.c
> > @@ -967,9 +967,15 @@ struct sctp_transport *sctp_epaddr_lookup_transport(
> >   list = rhltable_lookup(&sctp_transport_hashtable, &arg,
> >  sctp_hash_params);
> >
> > - rhl_for_each_entry_rcu(t, tmp, list, node)
> > - if (ep == t->asoc->ep)
> > + rhl_for_each_entry_rcu(t, tmp, list, node) {
> > + if (!sctp_transport_hold(t))
> > + continue;
> > + if (ep == t->asoc->ep) {
> > + sctp_transport_put(t);
> >   return t;
> > + }
> > + sctp_transport_put(t);
> > + }
> >
> >   return NULL;
> >  }
>
> Wait a second, what if we just added an rcu_head to the association structure
> and changed the kfree call in sctp_association_destroy to a kfree_rcu call
> instead?  That would force the actual freeing of the association to pass 
> through
> a grace period, during which any in flight list traversal in
> sctp_epaddr_lookup_transport could complete safely.  It's another two pointers
We discussed this in last thread:
https://www.spinics.net/lists/netdev/msg535191.html

It will cause closed sk to linger longer.

> worth of space in the association, but I think that would be a worthwhile
> tradeoff for not having to do N atomic adds/puts every time you wanted to
> receive or send a frame.
N is not a big value, as rhlist itself keeps the lists to a small size.

>
> Neil
>
> > --
> > 2.1.0
> >
> >


[PATCH 0/3]: net: dsa: mt7530: support MT7530 in the MT7621 SoC

2018-11-29 Thread gerg
I have been working towards supporting the MT7530 switch as used in the
MediaTek MT7621 SoC. Unlike the MediaTek MT7623 the MT7621 is built around
a dual core MIPS CPU architecture. But underneath it is what appears to
be the same 7530 switch.

The following 3 patches are more of an RFC than anything. They allow
use of the mt7530 dsa driver on this device - though with some issues
still to resolve. The primary change required is to not use the 7623
specific clock and regulator setup - none of that applies when using
the 7621 (and maybe other devices?). The other change required is to
set the 7530 MFC register CPU port number and enable bit.

The unresolved issues I still have appear to be more related to the
MT7621 ethernet driver (drivers/staging/mt7621-eth/*). I am hoping
someone might have some ideas on these. I don't really have any good
documentation on the ethernet devices on the 7621, so I am kind of
working in the dark here.

1. TX packets are not getting an IP header checksum via the normal
   off-loaded checksumming when in DSA mode. I have to switch off
   NETIF_F_IP_CSUM, so the software stack generates the checksum.
   That checksum offloading works ok when not using the 7530 DSA driver.

2. Maximal sized RX packets get silently dropped. So receive side packets
   that are large (perfect case is the all-but-last packets in a fragmented
   larger packet) appear to be dropped at the mt7621 ethernet MAC level.
   The 7530 MIB switch register counters show receive packets at the physical
   switch port side and at the CPU switch port - but I get no packets
   received or errors in the 7621 ethernet MAC. If I set the mtu of the
   server at the other end a little smaller (a few bytes is enough) then
   I get all the packets through. It seems like the DSA/VLAN tag bytes
   are causing a too large packet to get silently dropped somewhere.

Any thoughts greatly appreciated...

 Documentation/devicetree/bindings/net/dsa/mt7530.txt |5 +
 drivers/net/dsa/mt7530.c |   95 ---
 drivers/net/dsa/mt7530.h |5 +
 3 files changed, 70 insertions(+), 35 deletions(-)




Re: [PATCH net] tun: remove skb access after netif_receive_skb

2018-11-29 Thread Toshiaki Makita
On 2018/11/30 11:40, Jason Wang wrote:
> On 2018/11/30 10:30 AM, Prashant Bhole wrote:
>> In tun.c skb->len was accessed while doing stats accounting after a
>> call to netif_receive_skb. We can not access skb after this call
>> because buffers may be dropped.
>>
>> The fix for this bug would be to store skb->len in local variable and
>> then use it after netif_receive_skb(). IMO using xdp data size for
>> accounting bytes will be better because input for tun_xdp_one() is
>> xdp_buff.
>>
>> Hence this patch:
>> - fixes a bug by removing skb access after netif_receive_skb()
>> - uses xdp data size for accounting bytes
...
>> Fixes: 043d222f93ab ("tuntap: accept an array of XDP buffs through
>> sendmsg()")
>> Reviewed-by: Toshiaki Makita 
>> Signed-off-by: Prashant Bhole 
>> ---
>>   drivers/net/tun.c | 3 ++-
>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>> index e244f5d7512a..6e388792c0a8 100644
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -2385,6 +2385,7 @@ static int tun_xdp_one(struct tun_struct *tun,
>>  struct tun_file *tfile,
>>  struct xdp_buff *xdp, int *flush)
>>   {
>> +    unsigned int datasize = xdp->data_end - xdp->data;
>>   struct tun_xdp_hdr *hdr = xdp->data_hard_start;
>>   struct virtio_net_hdr *gso = >gso;
>>   struct tun_pcpu_stats *stats;
>> @@ -2461,7 +2462,7 @@ static int tun_xdp_one(struct tun_struct *tun,
>>   stats = get_cpu_ptr(tun->pcpu_stats);
>>   u64_stats_update_begin(&stats->syncp);
>>   stats->rx_packets++;
>> -    stats->rx_bytes += skb->len;
>> +    stats->rx_bytes += datasize;
>>   u64_stats_update_end(&stats->syncp);
>>   put_cpu_ptr(stats);
>>   
> 
> 
> Good catch, but you probably need to calculate the datasize after XDP
> processing since it may modify the packet length.

(+CC David Ahern who may be interested in this area.)

I'd rather think we should calculate it before XDP.
I checked several drivers behavior. mlx5, bnxt and qede use hardware
counters for rx bytes, which means the size is calculated before XDP.
nfp calculates it in software but before XDP. On the other hand, Intel
drivers use skb->len. So currently the bytes counters do not look
consistent, but I think calculation before XDP is more common.

-- 
Toshiaki Makita



Re: [PATCHv2 net] sctp: check and update stream->out_curr when allocating stream_out

2018-11-29 Thread Xin Long
On Thu, Nov 29, 2018 at 11:39 PM Neil Horman  wrote:
>
> On Thu, Nov 29, 2018 at 02:42:56PM +0800, Xin Long wrote:
> > Now when using stream reconfig to add out streams, stream->out
> > will get re-allocated, and all old streams' information will
> > be copied to the new ones and the old ones will be freed.
> >
> > So without stream->out_curr updated, next time when trying to
> > send from stream->out_curr stream, a panic would be caused.
> >
> > This patch is to check and update stream->out_curr when
> > allocating stream_out.
> >
> > v1->v2:
> >   - define fa_index() to get elem index from stream->out_curr.
> >
> > Fixes: 5e32a431 ("sctp: introduce stream scheduler foundations")
> > Reported-by: Ying Xu 
> > Reported-by: syzbot+e33a3a138267ca119...@syzkaller.appspotmail.com
> > Signed-off-by: Xin Long 
> > ---
> >  net/sctp/stream.c | 20 
> >  1 file changed, 20 insertions(+)
> >
> > diff --git a/net/sctp/stream.c b/net/sctp/stream.c
> > index 3892e76..30e7809 100644
> > --- a/net/sctp/stream.c
> > +++ b/net/sctp/stream.c
> > @@ -84,6 +84,19 @@ static void fa_zero(struct flex_array *fa, size_t index, 
> > size_t count)
> >   }
> >  }
> >
> > +static size_t fa_index(struct flex_array *fa, void *elem, size_t count)
> > +{
> > + size_t index = 0;
> > +
> > + while (count--) {
> > + if (elem == flex_array_get(fa, index))
> > + break;
> > + index++;
> > + }
> > +
> > + return index;
> > +}
> > +
> >  /* Migrates chunks from stream queues to new stream queues if needed,
> >   * but not across associations. Also, removes those chunks to streams
> >   * higher than the new max.
> > @@ -147,6 +160,13 @@ static int sctp_stream_alloc_out(struct sctp_stream 
> > *stream, __u16 outcnt,
> >
> >   if (stream->out) {
> >   fa_copy(out, stream->out, 0, min(outcnt, stream->outcnt));
> > + if (stream->out_curr) {
> > + size_t index = fa_index(stream->out, stream->out_curr,
> > + stream->outcnt);
> > +
> > + BUG_ON(index == stream->outcnt);
> > + stream->out_curr = flex_array_get(out, index);
> > + }
> >   fa_free(stream->out);
> >   }
> >
> > --
> > 2.1.0
> >
> >
>
> This is the sort of thing I'm talking about. It's a little more code, but if 
> you
> augment the flex_array api like this, you can perform a resize operation on 
> your
> existing flex array, and you can avoid all the copying, and need to update
> pointers maintained outside the array.  Note this code isn't tested at all, 
> but
> it's close to what I think should work.
>
>
> diff --git a/include/linux/flex_array.h b/include/linux/flex_array.h
> index b94fa61b51fb..7fa1f27a91b5 100644
> --- a/include/linux/flex_array.h
> +++ b/include/linux/flex_array.h
> @@ -73,6 +73,8 @@ struct flex_array {
>  struct flex_array *flex_array_alloc(int element_size, unsigned int total,
> gfp_t flags);
>
> +struct flex_array *flex_array_resize(struct flex_array *fa, unsigned int 
> total, gfp_t flags);
> +
>  /**
>   * flex_array_prealloc() - Ensures that memory for the elements indexed in 
> the
>   * range defined by start and nr_elements has been allocated.
> diff --git a/lib/flex_array.c b/lib/flex_array.c
> index 2eed22fa507c..f8d54af3891b 100644
> --- a/lib/flex_array.c
> +++ b/lib/flex_array.c
> @@ -109,6 +109,7 @@ struct flex_array *flex_array_alloc(int element_size, 
> unsigned int total,
> ret->total_nr_elements = total;
> ret->elems_per_part = elems_per_part;
> ret->reciprocal_elems = reciprocal_elems;
> +   ret->elements_used = 0;
> if (elements_fit_in_base(ret) && !(flags & __GFP_ZERO))
> memset(&ret->parts[0], FLEX_ARRAY_FREE,
> FLEX_ARRAY_BASE_BYTES_LEFT);
> @@ -116,6 +117,53 @@ struct flex_array *flex_array_alloc(int element_size, 
> unsigned int total,
>  }
>  EXPORT_SYMBOL(flex_array_alloc);
>
> +static int flex_array_last_element_index(struct flex_array *fa)
> +{
> +   struct flex_array_part *part;
> +   int part_nr;
> +   int i,j;
> +
> +   if (elements_fit_in_base(fa)) {
> +   part = (struct flex_array_part *)&fa->parts[0];
> +   for (i = fa->elems_per_part; i >= 0; i--)
> +   if (part->elements[i] != FLEX_ARRAY_FREE)
> +   return i;
> +   }
> +
> +   i = fa->total_nr_elements;
> +   for (part_nr = 0; part_nr < FLEX_ARRAY_NR_BASE_PTRS; part_nr++) {
> +   part = fa->parts[part_nr];
> +   if (!part) {
> +   i -= fa->elems_per_part;
> +   continue;
> +   }
> +   for (j = fa->elems_per_part; j >= 0; j--, i--)
> +   if (part->elements[j] != FLEX_ARRAY_FREE)
> +   goto out;
> +   

[PATCH 3/3] dt-bindings: net: dsa: add new bindings MT7530

2018-11-29 Thread gerg
From: Greg Ungerer 

Add descriptive entries for the new bindings introduced to support the
MT7530 implementation in the MediaTek MT7621 SoC.

New bindings added for:

  mediatek,no-clock-regulator
  mediatek,mfc-has-cpuport

Signed-off-by: Greg Ungerer 
---
 Documentation/devicetree/bindings/net/dsa/mt7530.txt | 5 +
 1 file changed, 5 insertions(+)

diff --git a/Documentation/devicetree/bindings/net/dsa/mt7530.txt 
b/Documentation/devicetree/bindings/net/dsa/mt7530.txt
index aa3527f71fdc..549ad7c338ea 100644
--- a/Documentation/devicetree/bindings/net/dsa/mt7530.txt
+++ b/Documentation/devicetree/bindings/net/dsa/mt7530.txt
@@ -9,6 +9,11 @@ Required properties:
 - mediatek,mcm: Boolean; if defined, indicates that either MT7530 is the part
on multi-chip module belong to MT7623A has or the remotely standalone
chip as the function MT7623N reference board provided for.
+- mediatek,no-clock-regulator: Boolean; if defined, indicates that the MT7530
+   is in a system-on-chip that does not require clock or regulator
+   control setup (for example the MT7621).
+- mediatek,mfc-has-cpuport: Boolean; if defined, indicates that the MT7530
+   has an MFC register that has a CPU PORT field and ENABLE bit
 - core-supply: Phandle to the regulator node necessary for the core power.
 - io-supply: Phandle to the regulator node necessary for the I/O power.
See Documentation/devicetree/bindings/regulator/mt6323-regulator.txt
-- 
2.17.1



[PATCH 1/3] net: dsa: mt7530: make clock/regulator setup optional

2018-11-29 Thread gerg
From: Greg Ungerer 

At least one device that contains the MediaTek MT7530 switch module
does not need the clock and regulator setup parts of the MT7530 DSA
driver. That setup looks to be very specific to the MT7623.

The MediaTek MT7621 SoC device contains a 7530 switch, but its MIPS CPU
cores and overall architecture do not have any of the same clock or
regulator setup as the MT7623.

Create a new devicetree tag, mediatek,no-clock-regulator, that controls
whether we should setup and use the clocks and regulators specific to
some uses of the 7530.

Signed-off-by: Greg Ungerer 
---
 drivers/net/dsa/mt7530.c | 86 
 drivers/net/dsa/mt7530.h |  1 +
 2 files changed, 52 insertions(+), 35 deletions(-)

diff --git a/drivers/net/dsa/mt7530.c b/drivers/net/dsa/mt7530.c
index a5de9bffe5be..532240c4bef9 100644
--- a/drivers/net/dsa/mt7530.c
+++ b/drivers/net/dsa/mt7530.c
@@ -625,14 +625,16 @@ static void mt7530_adjust_link(struct dsa_switch *ds, int 
port,
dev_dbg(priv->dev, "phy-mode for master device = %x\n",
phydev->interface);
 
-   /* Setup TX circuit incluing relevant PAD and driving */
-   mt7530_pad_clk_setup(ds, phydev->interface);
-
-   /* Setup RX circuit, relevant PAD and driving on the host
-* which must be placed after the setup on the device side is
-* all finished.
-*/
-   mt7623_pad_clk_setup(ds);
+   if (priv->clkreg) {
+   /* Setup TX circuit incluing relevant PAD and driving */
+   mt7530_pad_clk_setup(ds, phydev->interface);
+
+   /* Setup RX circuit, relevant PAD and driving on the
+* host which must be placed after the setup on the
+* device side is all finished.
+*/
+   mt7623_pad_clk_setup(ds);
+   }
} else {
u16 lcl_adv = 0, rmt_adv = 0;
u8 flowctrl;
@@ -1219,24 +1221,27 @@ mt7530_setup(struct dsa_switch *ds)
 * as two netdev instances.
 */
dn = ds->ports[MT7530_CPU_PORT].master->dev.of_node->parent;
-   priv->ethernet = syscon_node_to_regmap(dn);
-   if (IS_ERR(priv->ethernet))
-   return PTR_ERR(priv->ethernet);
 
-   regulator_set_voltage(priv->core_pwr, 1000000, 1000000);
-   ret = regulator_enable(priv->core_pwr);
-   if (ret < 0) {
-   dev_err(priv->dev,
-   "Failed to enable core power: %d\n", ret);
-   return ret;
-   }
+   if (priv->clkreg) {
+   priv->ethernet = syscon_node_to_regmap(dn);
+   if (IS_ERR(priv->ethernet))
+   return PTR_ERR(priv->ethernet);
+
+   regulator_set_voltage(priv->core_pwr, 1000000, 1000000);
+   ret = regulator_enable(priv->core_pwr);
+   if (ret < 0) {
+   dev_err(priv->dev,
+   "Failed to enable core power: %d\n", ret);
+   return ret;
+   }
 
-   regulator_set_voltage(priv->io_pwr, 3300000, 3300000);
-   ret = regulator_enable(priv->io_pwr);
-   if (ret < 0) {
-   dev_err(priv->dev, "Failed to enable io pwr: %d\n",
-   ret);
-   return ret;
+   regulator_set_voltage(priv->io_pwr, 3300000, 3300000);
+   ret = regulator_enable(priv->io_pwr);
+   if (ret < 0) {
+   dev_err(priv->dev, "Failed to enable io pwr: %d\n",
+   ret);
+   return ret;
+   }
}
 
/* Reset whole chip through gpio pin or memory-mapped registers for
@@ -1268,10 +1273,12 @@ mt7530_setup(struct dsa_switch *ds)
return -ENODEV;
}
 
-   /* Reset the switch through internal reset */
-   mt7530_write(priv, MT7530_SYS_CTRL,
-SYS_CTRL_PHY_RST | SYS_CTRL_SW_RST |
-SYS_CTRL_REG_RST);
+   if (priv->clkreg) {
+   /* Reset the switch through internal reset */
+   mt7530_write(priv, MT7530_SYS_CTRL,
+SYS_CTRL_PHY_RST | SYS_CTRL_SW_RST |
+SYS_CTRL_REG_RST);
+   }
 
/* Enable Port 6 only; P5 as GMAC5 which currently is not supported */
val = mt7530_read(priv, MT7530_MHWTRAP);
@@ -1356,13 +1363,22 @@ mt7530_probe(struct mdio_device *mdiodev)
}
}
 
-   priv->core_pwr = devm_regulator_get(>dev, "core");
-   if (IS_ERR(priv->core_pwr))
-   return PTR_ERR(priv->core_pwr);
-
-   priv->io_pwr = devm_regulator_get(>dev, "io");
-   if (IS_ERR(priv->io_pwr))
-   return PTR_ERR(priv->io_pwr);
+   /* Use mediatek,no-clock-regulator to 

[PATCH 2/3] net: dsa: mt7530: optional setting CPU field in MFC register

2018-11-29 Thread gerg
From: Greg Ungerer 

Some versions of the MediaTek MT7530 switch have a CPU port number and
enable bit in their MFC register. The MT7530 instance on the MediaTek
MT7621 SoC is one that does for example. The switch will not work
without these fields being correctly setup on these devices.

Create a new devicetree tag, mediatek,mfc-has-cpuport, that signifies
that this device needs the CPU port field and enable bit set when the
port is enabled.

Signed-off-by: Greg Ungerer 
---
 drivers/net/dsa/mt7530.c | 9 +
 drivers/net/dsa/mt7530.h | 4 
 2 files changed, 13 insertions(+)

diff --git a/drivers/net/dsa/mt7530.c b/drivers/net/dsa/mt7530.c
index 532240c4bef9..e41ada47233a 100644
--- a/drivers/net/dsa/mt7530.c
+++ b/drivers/net/dsa/mt7530.c
@@ -689,6 +689,10 @@ mt7530_cpu_port_enable(struct mt7530_priv *priv,
/* Unknown unicast frame fordwarding to the cpu port */
mt7530_set(priv, MT7530_MFC, UNU_FFP(BIT(port)));
 
+   /* Set CPU port number */
+   if (priv->mfccpu)
+   mt7530_rmw(priv, MT7530_MFC, CPU_MASK, CPU_EN | CPU_PORT(port));
+
/* CPU port gets connected to all user ports of
 * the switch
 */
@@ -1380,6 +1384,11 @@ mt7530_probe(struct mdio_device *mdiodev)
 "MT7530 with no CLOCK or REGULATOR control\n");
}
 
+   /* Use mediatek,mfc-has-cpuport to distinguish hardware where the MFC
+* register has a CPU port number field setting.
+*/
+   priv->mfccpu = of_property_read_bool(dn, "mediatek,mfc-has-cpuport");
+
/* Not MCM that indicates switch works as the remote standalone
 * integrated circuit so the GPIO pin would be used to complete
 * the reset, otherwise memory-mapped register accessing used
diff --git a/drivers/net/dsa/mt7530.h b/drivers/net/dsa/mt7530.h
index ee97944c4507..cbc0725c4258 100644
--- a/drivers/net/dsa/mt7530.h
+++ b/drivers/net/dsa/mt7530.h
@@ -36,6 +36,9 @@
 #define  UNM_FFP(x)(((x) & 0xff) << 16)
 #define  UNU_FFP(x)(((x) & 0xff) << 8)
 #define  UNU_FFP_MASK  UNU_FFP(~0)
+#define  CPU_ENBIT(7)
+#define  CPU_PORT(x)   ((x) << 4)
+#define  CPU_MASK  (0xf << 4)
 
 /* Registers for address table access */
 #define MT7530_ATA10x74
@@ -432,6 +435,7 @@ struct mt7530_priv {
struct gpio_desc*reset;
boolmcm;
boolclkreg;
+   boolmfccpu;
 
struct mt7530_port  ports[MT7530_NUM_PORTS];
/* protect among processes for registers access*/
-- 
2.17.1



RE: Invalid transport_offset with AF_PACKET socket

2018-11-29 Thread Maxim Mikityanskiy
> > > So it should return at least 18 and not 14.
> >
> > Yes, the function does its best to return at least 18, but it silently
> > expects skb_transport_offset to exceed 18. In normal conditions, it will
> > be more than 18, because it will be at least 14 + 20. But in my case,
> > when I send a packet via an AF_PACKET socket, skb_transport_offset
> > returns 14 (which is nonsense), and the driver uses this value, causing
> > the hardware to fail, because it's less than 18.
> >
>
> Got it, so even if you copy 18 it is not sufficient ! if the packet is
> ipv4 or ipv6 and the inline mode is set to MLX5_INLINE_MODE_IP in the
> vport context you must copy the IP headers as well !

Yes, I know that. That's why the driver uses skb_transport_offset - to include
the network header, but as skb_transport_offset returns a wrong offset, the IP
header doesn't get included. The thing is that with 14 bytes the driver even
fails to send a packet, and with 18 bytes and missing L3 header it seems to send
it, but it doesn't matter much, of course, we should pass the L3 header as well.

> but what do you expect from AF_PACKET socket ? to parse each and every
> packet and set skb_transport_offset ?

Not exactly.

First, AF_PACKET already parses each packet, and it even sets the transport
offset correctly if I specify the protocol in the third parameter of the
socket() call:

socket(AF_PACKET, SOCK_RAW, htons(0x0800));

So, if called in a specific way, the AF_SOCKET does exactly that - it parses
each packet and sets the transport offset, but if the protocol is not specified
(0) in the third parameter, it fails to parse correctly.
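
For contrast, the failing case being described is simply (illustration):

	/* protocol 0: af_packet has nothing to parse the payload with, so the
	 * offset later read via skb_transport_offset() stays at the L2 header
	 * length (14). */
	int fd = socket(AF_PACKET, SOCK_RAW, 0);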

Second, I don't expect that. It's absolutely OK if AF_PACKET tells the driver
that it failed to parse the packet, then the driver can guess the network
protocol and restart parsing itself. The driver knows the format of its L2
header and can look at the necessary field.

However, if AF_PACKET could ask the driver to guess the protocol, it would be a
little bit faster, because this way skb_probe_transport_header is only called
once, and the field lookup in the L2 header is O(1).


[PATCH net] bpf: uninitialized variables in test code

2018-11-29 Thread Dan Carpenter
Smatch complains that if bpf_test_run() fails with -ENOMEM at the
begining then the "duration" is uninitialized.  We then copy the
unintialized variables to the user inside the bpf_test_finish()
function.  The functions require CAP_SYS_ADMIN so it's not really an
information leak.

Fixes: 1cf1cae963c2 ("bpf: introduce BPF_PROG_TEST_RUN command")
Signed-off-by: Dan Carpenter 
---
 net/bpf/test_run.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index c89c22c49015..49304192a031 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -114,7 +114,7 @@ int bpf_prog_test_run_skb(struct bpf_prog *prog, const 
union bpf_attr *kattr,
bool is_l2 = false, is_direct_pkt_access = false;
u32 size = kattr->test.data_size_in;
u32 repeat = kattr->test.repeat;
-   u32 retval, duration;
+   u32 retval, duration = 0;
int hh_len = ETH_HLEN;
struct sk_buff *skb;
struct sock *sk;
@@ -196,7 +196,7 @@ int bpf_prog_test_run_xdp(struct bpf_prog *prog, const 
union bpf_attr *kattr,
u32 repeat = kattr->test.repeat;
struct netdev_rx_queue *rxqueue;
struct xdp_buff xdp = {};
-   u32 retval, duration;
+   u32 retval, duration = 0;
void *data;
int ret;
 
-- 
2.11.0



[PATCH] net: phy: sfp: correct store of detected link modes

2018-11-29 Thread Baruch Siach
The link modes that sfp_parse_support() detects are stored in the
'modes' bitmap. There is no reason to make an exception for 1000Base-PX
or 1000Base-BX10.

Signed-off-by: Baruch Siach 
---
 drivers/net/phy/sfp-bus.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/phy/sfp-bus.c b/drivers/net/phy/sfp-bus.c
index 83060fb349f4..ad9db652874d 100644
--- a/drivers/net/phy/sfp-bus.c
+++ b/drivers/net/phy/sfp-bus.c
@@ -162,7 +162,7 @@ void sfp_parse_support(struct sfp_bus *bus, const struct 
sfp_eeprom_id *id,
/* 1000Base-PX or 1000Base-BX10 */
if ((id->base.e_base_px || id->base.e_base_bx10) &&
br_min <= 1300 && br_max >= 1200)
-   phylink_set(support, 1000baseX_Full);
+   phylink_set(modes, 1000baseX_Full);
 
/* For active or passive cables, select the link modes
 * based on the bit rates and the cable compliance bytes.
-- 
2.19.2



[PATCH iproute2] ss: add support for bytes_sent, bytes_retrans, dsack_dups and reord_seen

2018-11-29 Thread Eric Dumazet
Wei Wang added these fields in linux-4.19

Tested:

ss -ti ...

ts sack cubic wscale:8,8 rto:7 rtt:2.678/0.267 mss:1428 pmtu:1500
rcvmss:536 advmss:1428 cwnd:91 ssthresh:65
(*) bytes_sent:17470606104 bytes_retrans:2856
bytes_acked:17470483297
segs_out:12234320 segs_in:622983
data_segs_out:12234318 send 388.2Mbps lastrcv:986784 lastack:1
pacing_rate 465.8Mbps delivery_rate 162.7Mbps
delivered:12234235 delivered_ce:3669056
busy:986784ms unacked:84 retrans:0/2
(*) dsack_dups:2
rcv_space:14280 rcv_ssthresh:65535 notsent:2016336 minrtt:0.183

Signed-off-by: Eric Dumazet 
Cc: Wei Wang 
Cc: Yuchung Cheng 
Cc: Neal Cardwell 
Cc: Soheil Hassas Yeganeh 
---
 misc/ss.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/misc/ss.c b/misc/ss.c
index 
3aa94f235085512510dca9fd597e8e37aaaf0fd3..3589ebedc5a0ab0615ba56f3df10d49198bed0d9
 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -819,6 +819,8 @@ struct tcpstat {
unsigned intnot_sent;
unsigned intdelivered;
unsigned intdelivered_ce;
+   unsigned intdsack_dups;
+   unsigned intreord_seen;
double  rcv_rtt;
double  min_rtt;
int rcv_space;
@@ -826,6 +828,8 @@ struct tcpstat {
unsigned long long  busy_time;
unsigned long long  rwnd_limited;
unsigned long long  sndbuf_limited;
+   unsigned long long  bytes_sent;
+   unsigned long long  bytes_retrans;
boolhas_ts_opt;
boolhas_sack_opt;
boolhas_ecn_opt;
@@ -2426,6 +2430,10 @@ static void tcp_stats_print(struct tcpstat *s)
if (s->ssthresh)
out(" ssthresh:%d", s->ssthresh);
 
+   if (s->bytes_sent)
+   out(" bytes_sent:%llu", s->bytes_sent);
+   if (s->bytes_retrans)
+   out(" bytes_retrans:%llu", s->bytes_retrans);
if (s->bytes_acked)
out(" bytes_acked:%llu", s->bytes_acked);
if (s->bytes_received)
@@ -2512,10 +2520,14 @@ static void tcp_stats_print(struct tcpstat *s)
out(" lost:%u", s->lost);
if (s->sacked && s->ss.state != SS_LISTEN)
out(" sacked:%u", s->sacked);
+   if (s->dsack_dups)
+   out(" dsack_dups:%u", s->dsack_dups);
if (s->fackets)
out(" fackets:%u", s->fackets);
if (s->reordering != 3)
out(" reordering:%d", s->reordering);
+   if (s->reord_seen)
+   out(" reord_seen:%d", s->reord_seen);
if (s->rcv_rtt)
out(" rcv_rtt:%g", s->rcv_rtt);
if (s->rcv_space)
@@ -2837,6 +2849,10 @@ static void tcp_show_info(const struct nlmsghdr *nlh, 
struct inet_diag_msg *r,
s.sndbuf_limited = info->tcpi_sndbuf_limited;
s.delivered = info->tcpi_delivered;
s.delivered_ce = info->tcpi_delivered_ce;
+   s.dsack_dups = info->tcpi_dsack_dups;
+   s.reord_seen = info->tcpi_reord_seen;
+   s.bytes_sent = info->tcpi_bytes_sent;
+   s.bytes_retrans = info->tcpi_bytes_retrans;
tcp_stats_print();
free(s.dctcp);
free(s.bbr_info);
-- 
2.20.0.rc0.387.gc7a69e6b6c-goog



Re: [PATCH net-next] tun: implement carrier change

2018-11-29 Thread Nicolas Dichtel
On 28/11/2018 at 22:48, Andrew Lunn wrote:
> On Wed, Nov 28, 2018 at 07:12:56PM +0100, Nicolas Dichtel wrote:
>> The userspace may need to control the carrier state.
> 
> Hi Nicolas
Hi Andrew,

> 
> Could you explain your user case a bit more.
> 
> Are you running a routing daemon on top of the interface, and want it
> to reroute when the carrier goes down?
Sort of, we have a daemon that monitors the app and may re-route the traffic to
a secondary app if needed.


Regards,
Nicolas


[PATCH] selftests/bpf: add config fragment CONFIG_NF_NAT_IPV6

2018-11-29 Thread Naresh Kamboju
CONFIG_NF_NAT_IPV6=y is required for the bpf test_sockmap test case.
It fixes:
ip6tables v1.6.1: can't initialize ip6tables table `nat': Table does
not exist

Signed-off-by: Naresh Kamboju 
---
 tools/testing/selftests/bpf/config | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/testing/selftests/bpf/config 
b/tools/testing/selftests/bpf/config
index 37f947ec44ed..d7076cf04a9d 100644
--- a/tools/testing/selftests/bpf/config
+++ b/tools/testing/selftests/bpf/config
@@ -23,3 +23,4 @@ CONFIG_LWTUNNEL=y
 CONFIG_BPF_STREAM_PARSER=y
 CONFIG_XDP_SOCKETS=y
 CONFIG_FTRACE_SYSCALLS=y
+CONFIG_NF_NAT_IPV6=y
-- 
2.17.1



[PATCH] net: phy: sfp: correct location of SFP standards

2018-11-29 Thread Baruch Siach
SFP standards are now available from the SNIA (Storage Networking
Industry Association) website.

Cc: Andrew Lunn 
Cc: Florian Fainelli 
Signed-off-by: Baruch Siach 
---
 include/linux/sfp.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/sfp.h b/include/linux/sfp.h
index d37518e89db2..d9d9de3fcf8e 100644
--- a/include/linux/sfp.h
+++ b/include/linux/sfp.h
@@ -224,7 +224,7 @@ struct sfp_eeprom_ext {
  *
  * See the SFF-8472 specification and related documents for the definition
  * of these structure members. This can be obtained from
- * ftp://ftp.seagate.com/sff
+ * https://www.snia.org/technology-communities/sff/specifications
  */
 struct sfp_eeprom_id {
struct sfp_eeprom_base base;
-- 
2.19.2



Re: [PATCH AUTOSEL 4.19 15/68] batman-adv: Expand merged fragment buffer for full packet

2018-11-29 Thread Sergei Shtylyov
Hello!

On 11/29/2018 08:55 AM, Sasha Levin wrote:

> From: Sven Eckelmann 
> 
> [ Upstream commit d7d8bbb40a5b1f682ee6589e212934f4c6b8ad60 ]
> 
> The complete size ("total_size") of the fragmented packet is stored in the
> fragment header and in the size of the fragment chain. When the fragments
> are ready for merge, the skbuff's tail of the first fragment is expanded to
> have enough room after the data pointer for at least total_size. This means
> that it gets expanded by total_size - first_skb->len.
> 
> But this is ignoring the fact that after expanding the buffer, the fragment
> header is pulled by from this buffer. Assuming that the tailroom of the

   Pulled by what?

> buffer was already 0, the buffer after the data pointer of the skbuff is
> now only total_size - len(fragment_header) large. When the merge function
> is then processing the remaining fragments, the code to copy the data over
> to the merged skbuff will cause an skb_over_panic when it tries to actually
> put enough data to fill the total_size bytes of the packet.
> 
> The size of the skb_pull must therefore also be taken into account when the
> buffer's tailroom is expanded.
> 
> Fixes: 610bfc6bc99b ("batman-adv: Receive fragmented packets and merge")
> Reported-by: Martin Weinelt 
> Co-authored-by: Linus Lüssing 
> Signed-off-by: Sven Eckelmann 
> Signed-off-by: Simon Wunderlich 
> Signed-off-by: Sasha Levin 
[...]

MBR, Sergei


Re: [PATCH] net: phy: sfp: correct store of detected link modes

2018-11-29 Thread Russell King - ARM Linux
On Thu, Nov 29, 2018 at 12:40:11PM +0200, Baruch Siach wrote:
> The link modes that sfp_parse_support() detects are stored in the
> 'modes' bitmap. There is no reason to make an exception for 1000Base-PX
> or 1000Base-BX10.

I think you may be carrying some local patch, have an incorrect merge,
or maybe there's a patch in -next which changed this.

Mainline has:

if (bitmap_empty(modes, __ETHTOOL_LINK_MODE_MASK_NBITS)) {
/* If the encoding and bit rate allows 1000baseX */
if (id->base.encoding == SFP_ENCODING_8B10B && br_nom &&
br_min <= 1300 && br_max >= 1200)
phylink_set(modes, 1000baseX_Full);
}

but your patch changes that phylink_set() from:

phylink_set(support, 1000baseX_Full);

to:

phylink_set(modes, 1000baseX_Full);

which in the context of what's in mainline doesn't make sense.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up


[iproute PATCH] ssfilter: Fix for inverted last expression

2018-11-29 Thread Phil Sutter
When fixing the shift/reduce conflicts, the possibility to invert the last
expression by prefixing it with '!' or 'not' was accidentally removed.

Fix this by allowing for expr to be an inverted expr so that any
reference to it in exprlist accepts the inverted prefix.

Reported-by: Eric Dumazet 
Fixes: b2038cc0b2403 ("ssfilter: Eliminate shift/reduce conflicts")
Signed-off-by: Phil Sutter 
---
 misc/ssfilter.y | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/misc/ssfilter.y b/misc/ssfilter.y
index 0413dddaa7584..a901ae753a284 100644
--- a/misc/ssfilter.y
+++ b/misc/ssfilter.y
@@ -54,10 +54,6 @@ null:   /* NOTHING */ { $$ = NULL; }
 ;
 
 exprlist: expr
-| '!' expr
-{
-$$ = alloc_node(SSF_NOT, $2);
-}
 | exprlist '|' expr
 {
 $$ = alloc_node(SSF_OR, $1);
@@ -83,6 +79,10 @@ expr:'(' exprlist ')'
{
$$ = $2;
}
+   | '!' expr
+   {
+   $$ = alloc_node(SSF_NOT, $2);
+   }
| DCOND eq HOSTCOND
 {
$$ = alloc_node(SSF_DCOND, $3);
-- 
2.19.0



Re: [PATCH] selftests/bpf: add config fragment CONFIG_NF_NAT_IPV6

2018-11-29 Thread Paolo Abeni
Hi,

On Thu, 2018-11-29 at 15:01 +0530, Naresh Kamboju wrote:
> CONFIG_NF_NAT_IPV6=y is required for bpf test_sockmap test case
> Fixes,
> ip6tables v1.6.1: can't initialize ip6tables table `nat': Table does
> not exist
> 
> Signed-off-by: Naresh Kamboju 
> ---
>  tools/testing/selftests/bpf/config | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/tools/testing/selftests/bpf/config 
> b/tools/testing/selftests/bpf/config
> index 37f947ec44ed..d7076cf04a9d 100644
> --- a/tools/testing/selftests/bpf/config
> +++ b/tools/testing/selftests/bpf/config
> @@ -23,3 +23,4 @@ CONFIG_LWTUNNEL=y
>  CONFIG_BPF_STREAM_PARSER=y
>  CONFIG_XDP_SOCKETS=y
>  CONFIG_FTRACE_SYSCALLS=y
> +CONFIG_NF_NAT_IPV6=y

AFAIK, the selftest Kconfig infra does not pull dependent CONFIG items 
('depends on ...') automatically: if the bpf test_sockmap test needs
CONFIG_NF_NAT_IPV6, you should also include the non-trivial chain of deps
up to whatever is currently explicitly requested.

On the other side, the self-tests already pull CONFIG_NF_NAT_IPV6=m for
'net' related tests, so if you run:

make kselftest-merge

before compiling the kernel for self-tests (which is AFAIK a mandatory
step) you should not get the reported error. Did you performed the
above step?

Cheers,

Paolo


