Re: [RFC PATCH v3 04/10] ip: factor out protocol delivery helper

2018-10-31 Thread Subash Abhinov Kasiviswanathan

On 2018-10-30 11:24, Paolo Abeni wrote:

So that we can re-use it at the UDP lavel in a later patch



Hi Paolo

Minor queries -
Should it be "level" instead of "lavel"? Similar comment for the ipv6
patch as well.


Signed-off-by: Paolo Abeni 
---
 net/ipv4/ip_input.c | 73 ++---
 1 file changed, 36 insertions(+), 37 deletions(-)

diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 35a786c0aaa0..72250b4e466d 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -188,51 +188,50 @@ bool ip_call_ra_chain(struct sk_buff *skb)
return false;
 }

-static int ip_local_deliver_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
+void ip_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int protocol)


Would it be better if this function were declared in include/net/ip.h &
include/net/ipv6.h rather than in net/ipv4/udp.c & net/ipv6/udp.c, as in
the patch "udp: cope with UDP GRO packet misdirection"?

diff --git a/include/net/ip.h b/include/net/ip.h
index 72593e1..3d7fdb4 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -717,4 +717,6 @@ static inline void ip_cmsg_recv(struct msghdr *msg, struct sk_buff *skb)
 
 int rtm_getroute_parse_ip_proto(struct nlattr *attr, u8 *ip_proto,
				struct netlink_ext_ack *extack);
 
+void ip_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int proto);
+
 #endif /* _IP_H */
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 8296505..4d4d2cfe 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -1101,4 +1101,8 @@ int ipv6_sock_mc_join_ssm(struct sock *sk, int ifindex,
			  const struct in6_addr *addr, unsigned int mode);
 
 int ipv6_sock_mc_drop(struct sock *sk, int ifindex,
		      const struct in6_addr *addr);
+
+void ip6_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int nexthdr,
+			      bool have_final);
+
 #endif /* _NET_IPV6_H */

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


Re: [PATCH net v6] net/ipv6: Add anycast addresses to a global hashtable

2018-10-31 Thread Stephen Hemminger
On Thu,  1 Nov 2018 00:14:38 +
Jeff Barnhill <0xeff...@gmail.com> wrote:

> diff --git a/include/net/addrconf.h b/include/net/addrconf.h
> index 14b789a123e7..799af1a037d1 100644
> --- a/include/net/addrconf.h
> +++ b/include/net/addrconf.h
> @@ -317,6 +317,8 @@ bool ipv6_chk_acast_addr(struct net *net, struct net_device *dev,
>const struct in6_addr *addr);
>  bool ipv6_chk_acast_addr_src(struct net *net, struct net_device *dev,
>const struct in6_addr *addr);
> +int anycast_init(void);
> +void anycast_cleanup(void);

One minor nit that should be fixed.


To avoid any potential naming conflicts, please prefix all ipv6 global symbols
with ipv6_
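
A hypothetical illustration of the requested rename (names invented here
purely to show the prefix, not taken from the patch):

	/* hypothetical names, shown only to illustrate the ipv6_ prefix */
	int ipv6_anycast_init(void);
	void ipv6_anycast_cleanup(void);

with the corresponding callers in net/ipv6/af_inet6.c updated to match.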



Re: [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets

2018-10-31 Thread Christoph Paasch
On 31/10/18 - 17:53:22, Eric Dumazet wrote:
> On 10/31/2018 04:26 PM, Christoph Paasch wrote:
> > Implementations of Quic might want to create a separate socket for each
> > Quic-connection by creating a connected UDP-socket.
> > 
> 
> Nice proposal, but I doubt a QUIC server can afford having one UDP socket per 
> connection ?
> 
> It would add a huge overhead in term of memory usage in the kernel,
> and lots of epoll events to manage (say a QUIC server with one million flows, 
> receiving
> very few packets per second per flow)
> 
> Maybe you could elaborate on the need of having one UDP socket per connection.

I'll let Leif chime in on that, as the ask came from him. Leif & his team are
implementing Quic in the Apache Traffic Server.


One advantage I can see is that it would allow benefiting from fq pacing, as
one could simply set sk_pacing_rate on the socket. That way there is no need
to implement the pacing in user space anymore.
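
(A rough sketch, not from this patchset: from user space sk_pacing_rate is
capped via the SO_MAX_PACING_RATE socket option, which sch_fq then honours
on the connected socket.)

	/* sketch: ask sch_fq to pace this connected UDP socket */
	unsigned int rate = 12500000;	/* bytes per second, ~100 Mbit/s */
	if (setsockopt(fd, SOL_SOCKET, SO_MAX_PACING_RATE, &rate, sizeof(rate)) < 0)
		perror("SO_MAX_PACING_RATE");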


> > To achieve that on the server-side, a "master-socket" needs to wait for
> > incoming new connections and then creates a new socket that will be a
> > connected UDP-socket. To create that latter one, the server needs to
> > first bind() and then connect(). However, after the bind() the server
> > might already receive traffic on that new socket that is unrelated to the
> > Quic-connection at hand. Only after the connect() a full 4-tuple match
> > is happening. So, one can't really create this kind of a server that has
> > a connected UDP-socket per Quic connection.
> > 
> > So, what is needed is an "atomic bind & connect" that basically
> > prevents any incoming traffic until the connect() call has been issued
> > at which point the full 4-tuple is known.
> > 
> > 
> > This patchset implements this functionality and exposes a socket-option
> > to do this.
> > 
> > Usage would be:
> > 
> > int fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
> > 
> > int val = 1;
> > setsockopt(fd, SOL_SOCKET, SO_DELAYED_BIND, &val, sizeof(val));
> > 
> > bind(fd, (struct sockaddr *)&src, sizeof(src));
> > 
> > /* At this point, incoming traffic will never match on this socket */
> > 
> > connect(fd, (struct sockaddr *)&dst, sizeof(dst));
> > 
> > /* Only now incoming traffic will reach the socket */
> > 
> > 
> > 
> > There is literally an infinite number of ways on how to implement it,
> > which is why I first send it out as an RFC. With this approach here I
> > chose the least invasive one, just preventing the match on the incoming
> > path.
> > 
> > 
> > The reason for choosing a SOL_SOCKET socket-option and not at the
> > SOL_UDP-level is because that functionality actually could be useful for
> > other protocols as well. E.g., TCP wants to better use the full 4-tuple 
> > space
> > by binding to the source-IP and the destination-IP at the same time.
> 
> Passive TCP flows can not benefit from this idea.
> 
> Active TCP flows can already do that, I do not really understand what you are 
> suggesting.

What we had here is that we wanted to let a server initiate more than 64K
connections *while* binding also to a source-IP.
With TCP the bind() would then pick a source-port and we ended up hitting the
64K limit. If we could do an atomic "bind + connect", source-port selection
could ensure that the 4-tuple is unique.

Or has something changed in recent times that allows to use the 4-tuple
matching when doing this with TCP?


Christoph



Re: [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets

2018-10-31 Thread Eric Dumazet



On 10/31/2018 10:08 PM, Eric Dumazet wrote:

> Our plan is to use EDT model for UDP packets, so that we can
> still use one (not connected) UDP socket per cpu/thread.
> 
> We added in linux-4.20 the EDT model for TCP, and I intend to add the 
> remaining part for sch_fq for 4.21.
> 
> UDP can use an ancillary message (SCM_TXTIME) to attach to the skb (which can 
> be a GSO btw) a tstamp,
> and pacing will happen just fine.
> 

List of EDT patches in reverse order if you want to take a look:

3f80e08f40cdb308589a49077c87632fa4508b21 tcp: add tcp_reset_xmit_timer() helper
4c16128b6271e70c8743178e90cccee147858503 net: loopback: clear skb->tstamp before netif_rx()
79861919b8896e14b8e5707242721f2312c57ae4 tcp: fix TCP_REPAIR xmit queue setup
825e1c523d5000f067a1614e4a66bb282a2d373c tcp: cdg: use tcp high resolution clock cache
864e5c090749448e879e86bec06ee396aa2c19c5 tcp: optimize tcp internal pacing
7baf33bdac37da65ddce3adf4daa8c7805802174 net_sched: sch_fq: no longer use skb_is_tcp_pure_ack()
a7a2563064e963bc5e3f39f533974f2730c0ff56 tcp: mitigate scheduling jitter in EDT pacing model
76a9ebe811fb3d0605cb084f1ae6be5610541865 net: extend sk_pacing_rate to unsigned long
5f6188a8003d080e3753b8f14f4a5a2325ae1ff6 tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh
fb420d5d91c1274d5966917725e71f27ed092a85 tcp/fq: move back to CLOCK_MONOTONIC
90caf67b01fabdd51b6cdeeb23b29bf73901df90 net_sched: sch_fq: remove dead code dealing with retransmits
c092dd5f4a7f4e4dbbcc8cf2e50b516bf07e432f tcp: switch tcp_internal_pacing() to tcp_wstamp_ns
ab408b6dc7449c0f791e9e5f8de72fa7428584f2 tcp: switch tcp and sch_fq to new earliest departure time model
fd2bca2aa7893586887b2370e90e85bd0abc805e tcp: switch internal pacing timer to CLOCK_TAI
d3edd06ea8ea9e03de6567fda80b8be57e21a537 tcp: provide earliest departure time in skb->tstamp
9799ccb0e984a5c1311b22a212e7ff96e8b736de tcp: add tcp_wstamp_ns socket field
142537e419234c396890a22806b8644dce21b132 net_sched: sch_fq: switch to CLOCK_TAI
2fd66ffba50716fc5ab481c48db643af3bda2276 tcp: introduce tcp_skb_timestamp_us() helper
72b0094f918294e6cb8cf5c3b4520d928fbb1a57 tcp: switch tcp_clock_ns() to CLOCK_TAI base


Re: [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets

2018-10-31 Thread Eric Dumazet



On 10/31/2018 08:50 PM, Christoph Paasch wrote:
> On 31/10/18 - 17:53:22, Eric Dumazet wrote:
>> On 10/31/2018 04:26 PM, Christoph Paasch wrote:
>>> Implementations of Quic might want to create a separate socket for each
>>> Quic-connection by creating a connected UDP-socket.
>>>
>>
>> Nice proposal, but I doubt a QUIC server can afford having one UDP socket 
>> per connection ?
>>
>> It would add a huge overhead in term of memory usage in the kernel,
>> and lots of epoll events to manage (say a QUIC server with one million 
>> flows, receiving
>> very few packets per second per flow)
>>
>> Maybe you could elaborate on the need of having one UDP socket per 
>> connection.
> 
> I let Leif chime in on that as the ask came from him. Leif & his team are
> implementing Quic in the Apache Traffic Server.
> 
> 
> One advantage I can see is that it would allow to benefit from fq_pacing as
> one could set sk_pacing_rate simply on the socket. That way there is no need
> to implement the pacing in the user-space anymore.

Our plan is to use the EDT model for UDP packets, so that we can
still use one (not connected) UDP socket per cpu/thread.

We added the EDT model for TCP in linux-4.20, and I intend to add the
remaining part for sch_fq in 4.21.

UDP can use an ancillary message (SCM_TXTIME) to attach a tstamp to the skb
(which can be a GSO packet, btw), and pacing will happen just fine.
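
For illustration only (not from this thread), a minimal sketch of that usage,
assuming uapi headers that provide SO_TXTIME/SCM_TXTIME and struct sock_txtime
(linux >= 4.19); the helper names below are made up:

#include <linux/net_tstamp.h>	/* struct sock_txtime */
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>

/* Enable SO_TXTIME once on the (unconnected) UDP socket. */
static int enable_txtime(int fd)
{
	struct sock_txtime cfg = { .clockid = CLOCK_TAI, .flags = 0 };

	return setsockopt(fd, SOL_SOCKET, SO_TXTIME, &cfg, sizeof(cfg));
}

/* Send one datagram with an earliest departure time (ns, CLOCK_TAI). */
static ssize_t send_at(int fd, const void *buf, size_t len,
		       const struct sockaddr_in *dst, uint64_t txtime_ns)
{
	char cbuf[CMSG_SPACE(sizeof(uint64_t))] = { 0 };
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	struct msghdr msg = {
		.msg_name	= (void *)dst,
		.msg_namelen	= sizeof(*dst),
		.msg_iov	= &iov,
		.msg_iovlen	= 1,
		.msg_control	= cbuf,
		.msg_controllen	= sizeof(cbuf),
	};
	struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);

	cm->cmsg_level = SOL_SOCKET;
	cm->cmsg_type  = SCM_TXTIME;
	cm->cmsg_len   = CMSG_LEN(sizeof(txtime_ns));
	memcpy(CMSG_DATA(cm), &txtime_ns, sizeof(txtime_ns));

	return sendmsg(fd, &msg, 0);
}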



Re: [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets

2018-10-31 Thread Christoph Paasch



> On Oct 31, 2018, at 10:04 PM, Eric Dumazet  wrote:
> 
> 
> 
>> On 10/31/2018 08:50 PM, Christoph Paasch wrote:
>> 
>> What we had here is that we wanted to let a server initiate more than 64K
>> connections *while* binding also to a source-IP.
>> With TCP the bind() would then pick a source-port and we ended up hitting the
>> 64K limit. If we could do an atomic "bind + connect", source-port selection
>> could ensure that the 4-tuple is unique.
>> 
>> Or has something changed in recent times that allows to use the 4-tuple
>> matching when doing this with TCP?
> 
> 
> Well, yes, although it is not really recent (this came with linux-4.2)
> 
> You can now bind to an address only, and let the sport being automatically 
> chosen at connect()

Oh, I didn't know about this socket option. Thanks :-)


Christoph

> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=90c337da1524863838658078ec34241f45d8394d
> 


Re: [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets

2018-10-31 Thread Eric Dumazet



On 10/31/2018 08:50 PM, Christoph Paasch wrote:

> What we had here is that we wanted to let a server initiate more than 64K
> connections *while* binding also to a source-IP.
> With TCP the bind() would then pick a source-port and we ended up hitting the
> 64K limit. If we could do an atomic "bind + connect", source-port selection
> could ensure that the 4-tuple is unique.
> 
> Or has something changed in recent times that allows to use the 4-tuple
> matching when doing this with TCP?


Well, yes, although it is not really recent (this came with linux-4.2)

You can now bind to an address only, and let the source port be chosen
automatically at connect():

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=90c337da1524863838658078ec34241f45d8394d
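
A minimal sketch of that pattern, assuming the option referred to above is
IP_BIND_ADDRESS_NO_PORT and a libc that exposes it via <netinet/in.h>; the
helper name is made up here:

#include <netinet/in.h>
#include <sys/socket.h>

/* Bind to a source address without reserving a local port; the port is only
 * picked at connect() time, so uniqueness is per 4-tuple instead of being
 * limited to ~64K source ports per local address. */
static int connect_from(int fd, const struct sockaddr_in *src,
			const struct sockaddr_in *dst)
{
	struct sockaddr_in laddr = *src;
	int one = 1;

	laddr.sin_port = 0;	/* do not reserve a port at bind() time */
	if (setsockopt(fd, IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT,
		       &one, sizeof(one)) < 0)
		return -1;
	if (bind(fd, (const struct sockaddr *)&laddr, sizeof(laddr)) < 0)
		return -1;
	return connect(fd, (const struct sockaddr *)dst, sizeof(*dst));
}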



Re: Kernel 4.19 network performance - forwarding/routing normal users traffic

2018-10-31 Thread David Ahern
On 10/31/18 3:57 PM, Paweł Staszewski wrote:
> Hi
> 
> So maybee someone will be interested how linux kernel handles normal
> traffic (not pktgen :) )
> 
> 
> Server HW configuration:
> 
> CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
> 
> NIC's: 2x 100G Mellanox ConnectX-4 (connected to x16 pcie 8GT)
> 
> 
> Server software:
> 
> FRR - as routing daemon
> 
> enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS binded to local numa
> node)
> 
> enp175s0f1 (100G) - 343 vlans to clients (28 RSS binded to local numa node)
> 
> 
> Maximum traffic that server can handle:
> 
> Bandwidth
> 
>  bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
>   input: /proc/net/dev type: rate
>   \ iface              Rx               Tx            Total
>   ============================================================
>    enp175s0f1:      28.51 Gb/s       37.24 Gb/s       65.74 Gb/s
>    enp175s0f0:      38.07 Gb/s       28.44 Gb/s       66.51 Gb/s
>   ------------------------------------------------------------
>         total:      66.58 Gb/s       65.67 Gb/s      132.25 Gb/s
> 
> 
> Packets per second:
> 
>  bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
>   input: /proc/net/dev type: rate
>   - iface              Rx                   Tx                Total
>   ===================================================================
>    enp175s0f1:      5248589.00 P/s       3486617.75 P/s       8735207.00 P/s
>    enp175s0f0:      3557944.25 P/s       5232516.00 P/s       8790460.00 P/s
>   -------------------------------------------------------------------
>         total:      8806533.00 P/s       8719134.00 P/s      17525668.00 P/s
> 
> 
> After reaching that limits nics on the upstream side (more RX traffic)
> start to drop packets
> 
> 
> I just dont understand that server can't handle more bandwidth
> (~40Gbit/s is limit where all cpu's are 100% util) - where pps on RX
> side are increasing.
> 
> Was thinking that maybee reached some pcie x16 limit - but x16 8GT is
> 126Gbit - and also when testing with pktgen i can reach more bw and pps
> (like 4x more comparing to normal internet traffic)
> 
> And wondering if there is something that can be improved here.

This is mainly a forwarding use case? Seems so based on the perf report.
I suspect forwarding with XDP would show pretty good improvement. You
need the vlan changes I have queued up though.



Re: [PATCH net v5] net/ipv6: Add anycast addresses to a global hashtable

2018-10-31 Thread David Ahern
On 10/31/18 6:02 PM, Jeff Barnhill wrote:
> I'll follow this email with a new patch using ifacaddr6 instead of
> creating a new struct. I ended up using fib6_nh.nh_dev to get the net,
> instead of adding a back pointer to idev.  It seems that idev was
> recently removed in lieu of this, so if this is incorrect, please let
> me know. Hopefully, I got the locking correct.

That's correct. Make sure that the anycast code can not be accessed for
reject routes which will not have a device set. Should be ok, but double
check.


[PATCH v1 net] net: dsa: microchip: initialize mutex before use

2018-10-31 Thread Tristram.Ha
From: Tristram Ha 

Initialize the mutexes before use.  This avoids a kernel complaint when
CONFIG_DEBUG_LOCK_ALLOC is enabled.

Fixes: b987e98e50ab90e5 ("dsa: add DSA switch driver for Microchip KSZ9477")
Signed-off-by: Tristram Ha 
---
v1
- Remove comment

 drivers/net/dsa/microchip/ksz_common.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/net/dsa/microchip/ksz_common.c b/drivers/net/dsa/microchip/ksz_common.c
index 54e0ca6..86b6464 100644
--- a/drivers/net/dsa/microchip/ksz_common.c
+++ b/drivers/net/dsa/microchip/ksz_common.c
@@ -1117,11 +1117,6 @@ static int ksz_switch_init(struct ksz_device *dev)
 {
int i;
 
-   mutex_init(&dev->reg_mutex);
-   mutex_init(&dev->stats_mutex);
-   mutex_init(&dev->alu_mutex);
-   mutex_init(&dev->vlan_mutex);
-
dev->ds->ops = &ksz_switch_ops;
 
for (i = 0; i < ARRAY_SIZE(ksz_switch_chips); i++) {
@@ -1206,6 +1201,11 @@ int ksz_switch_register(struct ksz_device *dev)
if (dev->pdata)
dev->chip_id = dev->pdata->chip_id;
 
+   mutex_init(&dev->reg_mutex);
+   mutex_init(&dev->stats_mutex);
+   mutex_init(&dev->alu_mutex);
+   mutex_init(&dev->vlan_mutex);
+
if (ksz_switch_detect(dev))
return -EINVAL;
 
-- 
1.9.1



Re: [PATCH iproute2 net-next 0/3] ss: Allow selection of columns to be displayed

2018-10-31 Thread David Ahern


[ sorry, too many distractions and I forgot to respond ]

On 10/30/18 11:34 AM, Stefano Brivio wrote:
> On Tue, 30 Oct 2018 10:34:45 -0600
> David Ahern  wrote:
> 
>> A more flexible approach is to use format strings to allow users to
>> customize the output order and whitespace as well. So for ss and your
>> column list (winging it here):
>>
>> netid  = %N
>> state  = %S
>> recv Q = %Qr
>> send Q = %Qs
>> local address  = %Al
>> lport port = %Pl
>> remote address = %Ar
>> remote port= %Pr
>> process data   = %p
>> ...
>>
>> then a format string could be: "%S  %Qr %Qs  %Al:%Pl %Ar:%Pr  %p\n"
> 
> I like the idea indeed, but I see two issues with ss:
> 
> - the current column abstraction is rather lightweight, things are
>   already buffered in the defined column order so we don't have to jump
>   back and forth in the buffer while rendering. Doing that needs some
>   extra care to avoid a performance hit, but it's probably doable, I
>   can put that on my to-do list

The ss command is always a pain; it's much better in newer releases but
I am always having to adjust terminal width.

> 
> - how would you model automatic spacing in a format string? Should we
>   support width specifiers? Disable automatic spacing if a format
>   string is given? It might even make sense to allow partial automatic

Follow the format string for whitespace and order, and
yes, on the width specifiers if possible.

>   spacing with a special character in the format string, that is:
> 
>   "%S.%Qr.%Qs  %Al:%Pl %Ar:%Pr  %p\n"
> 
>   would mean "align everything to the right, distribute remaining
>   whitespace between %S, %Qr and %Qs". But it looks rather complicated
>   at a glance.
> 

My concern here is that once this goes in for one command, the others in
iproute2 need to follow suit - meaning the same syntax style for all
commands. Given that, I'd prefer we get a reasonable consensus on a syntax
that will work across commands -- ss, ip, tc. If it is as simple as
column names with a fixed order, that is fine, but proper consideration
should be given to the impact.

The 'ip' syntax for example gets ugly quick with the various link types
and their options. We don't need to allow every detail of each link type
to be customized, but there are common attributes for links (e.g., mtu,
ifindex, link flags, lladdr), addresses, and link types such as bridges
and bonds where we can improve the amount of data thrown at a user -- a
better, more customizable version of what the brief option targeted.




Re: [PATCH net] openvswitch: Fix push/pop ethernet validation

2018-10-31 Thread David Miller
From: Jaime Caamaño Ruiz 
Date: Wed, 31 Oct 2018 18:52:03 +0100

> When there are both pop and push ethernet header actions among the
> actions to be applied to a packet, an unexpected EINVAL (Invalid
> argument) error is obtained. This is due to mac_proto not being reset
> correctly when those actions are validated.
> 
> Reported-at:
> https://mail.openvswitch.org/pipermail/ovs-discuss/2018-October/047554.html
> Fixes: 91820da6ae85 ("openvswitch: add Ethernet push and pop actions")
> Signed-off-by: Jaime Caamaño Ruiz 

Applied and queued up for -stable.


Re: [net 0/8][pull request] Intel Wired LAN Driver Updates 2018-10-31

2018-10-31 Thread David Miller
From: Jeff Kirsher 
Date: Wed, 31 Oct 2018 12:42:46 -0700

> This series contains a various collection of fixes.
 ...
> The following are changes since commit 
> a6b3a3fa042343e29ffaf9169f5ba3c819d4f9a2:
>   net: mvpp2: Fix affinity hint allocation
> and are available in the git repository at:
>   git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue 10GbE

Pulled, thanks Jeff.


Re: [PATCH bpf 2/4] bpf: don't set id on after map lookup with ptr_to_map_val return

2018-10-31 Thread Roman Gushchin
On Thu, Nov 01, 2018 at 12:05:53AM +0100, Daniel Borkmann wrote:
> In the verifier there is no such semantics where registers with
> PTR_TO_MAP_VALUE type have an id assigned to them. This is only
> used in PTR_TO_MAP_VALUE_OR_NULL and later on nullified once the
> test against NULL has been pattern matched and type transformed
> into PTR_TO_MAP_VALUE.
> 
> Fixes: 3e6a4b3e0289 ("bpf/verifier: introduce BPF_PTR_TO_MAP_VALUE")
> Signed-off-by: Daniel Borkmann 
> Cc: Roman Gushchin 
> Acked-by: Alexei Starovoitov 

Looks good to me.
Acked-by: Roman Gushchin 

Thanks!


Re: [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets

2018-10-31 Thread Eric Dumazet



On 10/31/2018 04:26 PM, Christoph Paasch wrote:
> Implementations of Quic might want to create a separate socket for each
> Quic-connection by creating a connected UDP-socket.
> 

Nice proposal, but I doubt a QUIC server can afford having one UDP socket per
connection.

It would add a huge overhead in terms of memory usage in the kernel,
and lots of epoll events to manage (say a QUIC server with one million flows,
receiving very few packets per second per flow).

Maybe you could elaborate on the need for one UDP socket per connection.

> To achieve that on the server-side, a "master-socket" needs to wait for
> incoming new connections and then creates a new socket that will be a
> connected UDP-socket. To create that latter one, the server needs to
> first bind() and then connect(). However, after the bind() the server
> might already receive traffic on that new socket that is unrelated to the
> Quic-connection at hand. Only after the connect() a full 4-tuple match
> is happening. So, one can't really create this kind of a server that has
> a connected UDP-socket per Quic connection.
> 
> So, what is needed is an "atomic bind & connect" that basically
> prevents any incoming traffic until the connect() call has been issued
> at which point the full 4-tuple is known.
> 
> 
> This patchset implements this functionality and exposes a socket-option
> to do this.
> 
> Usage would be:
> 
> int fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
> 
> int val = 1;
> setsockopt(fd, SOL_SOCKET, SO_DELAYED_BIND, &val, sizeof(val));
> 
> bind(fd, (struct sockaddr *)&src, sizeof(src));
> 
>   /* At this point, incoming traffic will never match on this socket */
> 
> connect(fd, (struct sockaddr *)&dst, sizeof(dst));
> 
>   /* Only now incoming traffic will reach the socket */
> 
> 
> 
> There is literally an infinite number of ways on how to implement it,
> which is why I first send it out as an RFC. With this approach here I
> chose the least invasive one, just preventing the match on the incoming
> path.
> 
> 
> The reason for choosing a SOL_SOCKET socket-option and not at the
> SOL_UDP-level is because that functionality actually could be useful for
> other protocols as well. E.g., TCP wants to better use the full 4-tuple space
> by binding to the source-IP and the destination-IP at the same time.

Passive TCP flows can not benefit from this idea.

Active TCP flows can already do that; I do not really understand what you are
suggesting.






Re: pull-request: bpf 2018-11-01

2018-10-31 Thread David Miller
From: Daniel Borkmann 
Date: Thu,  1 Nov 2018 01:28:41 +0100

> The following pull-request contains BPF updates for your *net* tree.
> 
> The main changes are:
> 
> 1) Fix tcp_bpf_recvmsg() to return -EAGAIN instead of 0 in non-blocking
>case when no data is available yet, from John.
> 
> 2) Fix a compilation error in libbpf_attach_type_by_name() when compiled
>with clang 3.8, from Andrey.
> 
> 3) Fix a partial copy of map pointer on scalar alu and remove id
>generation for RET_PTR_TO_MAP_VALUE return types, from Daniel.
> 
> 4) Add unlimited memlock limit for kernel selftest's flow_dissector_load
>program, from Yonghong.
> 
> 5) Fix ping for some BPF shell based kselftests where distro does not
>ship "ping -6" anymore, from Li.
> 
> Please consider pulling these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Pulled, thanks Daniel.


pull-request: bpf 2018-11-01

2018-10-31 Thread Daniel Borkmann
Hi David,

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix tcp_bpf_recvmsg() to return -EAGAIN instead of 0 in non-blocking
   case when no data is available yet, from John.

2) Fix a compilation error in libbpf_attach_type_by_name() when compiled
   with clang 3.8, from Andrey.

3) Fix a partial copy of map pointer on scalar alu and remove id
   generation for RET_PTR_TO_MAP_VALUE return types, from Daniel.

4) Add unlimited memlock limit for kernel selftest's flow_dissector_load
   program, from Yonghong.

5) Fix ping for some BPF shell based kselftests where distro does not
   ship "ping -6" anymore, from Li.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Thanks a lot!



The following changes since commit a6b3a3fa042343e29ffaf9169f5ba3c819d4f9a2:

  net: mvpp2: Fix affinity hint allocation (2018-10-30 11:34:41 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git 

for you to fetch changes up to dfeb8f4c9692fd5e6c3eef19c2e4ae5338dbdb01:

  Merge branch 'verifier-fixes' (2018-10-31 16:53:18 -0700)


Alexei Starovoitov (1):
  Merge branch 'verifier-fixes'

Andrey Ignatov (1):
  libbpf: Fix compile error in libbpf_attach_type_by_name

Daniel Borkmann (4):
  bpf: fix partial copy of map_ptr when dst is scalar
  bpf: don't set id on after map lookup with ptr_to_map_val return
  bpf: add various test cases to test_verifier
  bpf: test make sure to run unpriv test cases in test_verifier

John Fastabend (1):
  bpf: tcp_bpf_recvmsg should return EAGAIN when nonblocking and no data

Li Zhijian (1):
  kselftests/bpf: use ping6 as the default ipv6 ping binary if it exists

Yonghong Song (1):
  tools/bpf: add unlimited rlimit for flow_dissector_load

 include/linux/bpf_verifier.h  |   3 +
 kernel/bpf/verifier.c |  21 +-
 net/ipv4/tcp_bpf.c|   1 +
 tools/lib/bpf/libbpf.c|  13 +-
 tools/testing/selftests/bpf/flow_dissector_load.c |   2 +
 tools/testing/selftests/bpf/test_skb_cgroup_id.sh |   3 +-
 tools/testing/selftests/bpf/test_sock_addr.sh |   3 +-
 tools/testing/selftests/bpf/test_verifier.c   | 321 +++---
 8 files changed, 319 insertions(+), 48 deletions(-)


[PATCH net v6] net/ipv6: Add anycast addresses to a global hashtable

2018-10-31 Thread Jeff Barnhill
icmp6_send() function is expensive on systems with a large number of
interfaces. Every time it’s called, it has to verify that the source
address does not correspond to an existing anycast address by looping
through every device and every anycast address on the device.  This can
result in significant delays for a CPU when there are a large number of
neighbors and ND timers are frequently timing out and calling
neigh_invalidate().

Add anycast addresses to a global hashtable to allow quick searching for
matching anycast addresses.  This is based on inet6_addr_lst in addrconf.c.

Signed-off-by: Jeff Barnhill <0xeff...@gmail.com>
---
 include/net/addrconf.h |  2 ++
 include/net/if_inet6.h |  2 ++
 net/ipv6/af_inet6.c|  5 
 net/ipv6/anycast.c | 80 +++---
 4 files changed, 85 insertions(+), 4 deletions(-)

diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index 14b789a123e7..799af1a037d1 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -317,6 +317,8 @@ bool ipv6_chk_acast_addr(struct net *net, struct net_device *dev,
			 const struct in6_addr *addr);
 bool ipv6_chk_acast_addr_src(struct net *net, struct net_device *dev,
 const struct in6_addr *addr);
+int anycast_init(void);
+void anycast_cleanup(void);
 
 /* Device notifier */
 int register_inet6addr_notifier(struct notifier_block *nb);
diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h
index d7578cf49c3a..c9c78c15bce0 100644
--- a/include/net/if_inet6.h
+++ b/include/net/if_inet6.h
@@ -146,10 +146,12 @@ struct ifacaddr6 {
struct in6_addr aca_addr;
struct fib6_info*aca_rt;
struct ifacaddr6*aca_next;
+   struct hlist_node   aca_addr_lst;
int aca_users;
refcount_t  aca_refcnt;
unsigned long   aca_cstamp;
unsigned long   aca_tstamp;
+   struct rcu_head rcu;
 };
 
 #defineIFA_HOSTIPV6_ADDR_LOOPBACK
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 3f4d61017a69..ddc8a6dbfba2 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -1001,6 +1001,9 @@ static int __init inet6_init(void)
err = ip6_flowlabel_init();
if (err)
goto ip6_flowlabel_fail;
+   err = anycast_init();
+   if (err)
+   goto anycast_fail;
err = addrconf_init();
if (err)
goto addrconf_fail;
@@ -1091,6 +1094,8 @@ static int __init inet6_init(void)
 ipv6_exthdrs_fail:
addrconf_cleanup();
 addrconf_fail:
+   anycast_cleanup();
+anycast_fail:
ip6_flowlabel_cleanup();
 ip6_flowlabel_fail:
ndisc_late_cleanup();
diff --git a/net/ipv6/anycast.c b/net/ipv6/anycast.c
index 4e0ff7031edd..f6c4c8ac184c 100644
--- a/net/ipv6/anycast.c
+++ b/net/ipv6/anycast.c
@@ -44,8 +44,22 @@
 
 #include 
 
+#define IN6_ADDR_HSIZE_SHIFT   8
+#define IN6_ADDR_HSIZE BIT(IN6_ADDR_HSIZE_SHIFT)
+/* anycast address hash table
+ */
+static struct hlist_head inet6_acaddr_lst[IN6_ADDR_HSIZE];
+static DEFINE_SPINLOCK(acaddr_hash_lock);
+
static int ipv6_dev_ac_dec(struct net_device *dev, const struct in6_addr *addr);
 
+static u32 inet6_acaddr_hash(struct net *net, const struct in6_addr *addr)
+{
+   u32 val = ipv6_addr_hash(addr) ^ net_hash_mix(net);
+
+   return hash_32(val, IN6_ADDR_HSIZE_SHIFT);
+}
+
 /*
  * socket join an anycast group
  */
@@ -204,16 +218,39 @@ void ipv6_sock_ac_close(struct sock *sk)
rtnl_unlock();
 }
 
+static void ipv6_add_acaddr_hash(struct net *net, struct ifacaddr6 *aca)
+{
+   unsigned int hash = inet6_acaddr_hash(net, &aca->aca_addr);
+
+   spin_lock(&acaddr_hash_lock);
+   hlist_add_head_rcu(&aca->aca_addr_lst, &inet6_acaddr_lst[hash]);
+   spin_unlock(&acaddr_hash_lock);
+}
+
+static void ipv6_del_acaddr_hash(struct ifacaddr6 *aca)
+{
+   spin_lock(&acaddr_hash_lock);
+   hlist_del_init_rcu(&aca->aca_addr_lst);
+   spin_unlock(&acaddr_hash_lock);
+}
+
 static void aca_get(struct ifacaddr6 *aca)
 {
refcount_inc(&aca->aca_refcnt);
 }
 
+static void aca_free_rcu(struct rcu_head *h)
+{
+   struct ifacaddr6 *aca = container_of(h, struct ifacaddr6, rcu);
+
+   fib6_info_release(aca->aca_rt);
+   kfree(aca);
+}
+
 static void aca_put(struct ifacaddr6 *ac)
 {
if (refcount_dec_and_test(&ac->aca_refcnt)) {
-   fib6_info_release(ac->aca_rt);
-   kfree(ac);
+   call_rcu(&ac->rcu, aca_free_rcu);
}
 }
 
@@ -229,6 +266,7 @@ static struct ifacaddr6 *aca_alloc(struct fib6_info *f6i,
aca->aca_addr = *addr;
fib6_info_hold(f6i);
aca->aca_rt = f6i;
+   INIT_HLIST_NODE(&aca->aca_addr_lst);
aca->aca_users = 1;
/* aca_tstamp should be updated upon changes */
aca->aca_cstamp = aca->aca_tstamp = jiffies;
@@ -285,6 +323

Re: [PATCH bpf 0/4] BPF fixes and tests

2018-10-31 Thread Alexei Starovoitov
On Thu, Nov 01, 2018 at 12:05:51AM +0100, Daniel Borkmann wrote:
> The series contains two fixes in BPF core and test cases. For details
> please see individual patches. Thanks!
> 
> Daniel Borkmann (4):
>   bpf: fix partial copy of map_ptr when dst is scalar
>   bpf: don't set id on after map lookup with ptr_to_map_val return
>   bpf: add various test cases to test_verifier
>   bpf: test make sure to run unpriv test cases in test_verifier
> 
>  include/linux/bpf_verifier.h|   3 +
>  kernel/bpf/verifier.c   |  21 +-
>  tools/testing/selftests/bpf/test_verifier.c | 321 
> +---
>  3 files changed, 305 insertions(+), 40 deletions(-)

Applied to bpf tree, Thanks

... and we achieved a very nice milestone... crossed 1000 tests in test_verifier :)

Summary: 1012 PASSED, 0 SKIPPED, 0 FAILED



Re: [PATCH net v5] net/ipv6: Add anycast addresses to a global hashtable

2018-10-31 Thread Jeff Barnhill
I'll follow this email with a new patch using ifacaddr6 instead of
creating a new struct. I ended up using fib6_nh.nh_dev to get the net,
instead of adding a back pointer to idev.  It seems that idev was
recently removed in lieu of this, so if this is incorrect, please let
me know. Hopefully, I got the locking correct.
Thanks,
Jeff
On Tue, Oct 30, 2018 at 7:19 PM David Miller  wrote:
>
> From: David Ahern 
> Date: Tue, 30 Oct 2018 16:06:46 -0600
>
> > or make the table per namespace.
>
> This will increase namespace create/destroy cost, so I'd rather not
> for something like this.


[PATCH bpf 0/4] BPF fixes and tests

2018-10-31 Thread Daniel Borkmann
The series contains two fixes in BPF core and test cases. For details
please see individual patches. Thanks!

Daniel Borkmann (4):
  bpf: fix partial copy of map_ptr when dst is scalar
  bpf: don't set id on after map lookup with ptr_to_map_val return
  bpf: add various test cases to test_verifier
  bpf: test make sure to run unpriv test cases in test_verifier

 include/linux/bpf_verifier.h|   3 +
 kernel/bpf/verifier.c   |  21 +-
 tools/testing/selftests/bpf/test_verifier.c | 321 +---
 3 files changed, 305 insertions(+), 40 deletions(-)

-- 
2.9.5



[PATCH bpf 1/4] bpf: fix partial copy of map_ptr when dst is scalar

2018-10-31 Thread Daniel Borkmann
ALU operations on pointers such as scalar_reg += map_value_ptr are
handled in adjust_ptr_min_max_vals(). Problem is however that map_ptr
and range in the register state share a union, so transferring state
through dst_reg->range = ptr_reg->range is just buggy as any new
map_ptr in the dst_reg is then truncated (or null) for subsequent
checks. Fix this by adding a raw member and use it for copying state
over to dst_reg.

Fixes: f1174f77b50c ("bpf/verifier: rework value tracking")
Signed-off-by: Daniel Borkmann 
Cc: Edward Cree 
Acked-by: Alexei Starovoitov 
---
 include/linux/bpf_verifier.h |  3 +++
 kernel/bpf/verifier.c| 10 ++
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 9e8056e..d93e897 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -51,6 +51,9 @@ struct bpf_reg_state {
 *   PTR_TO_MAP_VALUE_OR_NULL
 */
struct bpf_map *map_ptr;
+
+   /* Max size from any of the above. */
+   unsigned long raw;
};
/* Fixed part of pointer offset, pointer types only */
s32 off;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 171a2c8..774fa40 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3046,7 +3046,7 @@ static int adjust_ptr_min_max_vals(struct 
bpf_verifier_env *env,
dst_reg->umax_value = umax_ptr;
dst_reg->var_off = ptr_reg->var_off;
dst_reg->off = ptr_reg->off + smin_val;
-   dst_reg->range = ptr_reg->range;
+   dst_reg->raw = ptr_reg->raw;
break;
}
/* A new variable offset is created.  Note that off_reg->off
@@ -3076,10 +3076,11 @@ static int adjust_ptr_min_max_vals(struct 
bpf_verifier_env *env,
}
dst_reg->var_off = tnum_add(ptr_reg->var_off, off_reg->var_off);
dst_reg->off = ptr_reg->off;
+   dst_reg->raw = ptr_reg->raw;
if (reg_is_pkt_pointer(ptr_reg)) {
dst_reg->id = ++env->id_gen;
/* something was added to pkt_ptr, set range to zero */
-   dst_reg->range = 0;
+   dst_reg->raw = 0;
}
break;
case BPF_SUB:
@@ -3108,7 +3109,7 @@ static int adjust_ptr_min_max_vals(struct 
bpf_verifier_env *env,
dst_reg->var_off = ptr_reg->var_off;
dst_reg->id = ptr_reg->id;
dst_reg->off = ptr_reg->off - smin_val;
-   dst_reg->range = ptr_reg->range;
+   dst_reg->raw = ptr_reg->raw;
break;
}
/* A new variable offset is created.  If the subtrahend is known
@@ -3134,11 +3135,12 @@ static int adjust_ptr_min_max_vals(struct 
bpf_verifier_env *env,
}
dst_reg->var_off = tnum_sub(ptr_reg->var_off, off_reg->var_off);
dst_reg->off = ptr_reg->off;
+   dst_reg->raw = ptr_reg->raw;
if (reg_is_pkt_pointer(ptr_reg)) {
dst_reg->id = ++env->id_gen;
/* something was added to pkt_ptr, set range to zero */
if (smin_val < 0)
-   dst_reg->range = 0;
+   dst_reg->raw = 0;
}
break;
case BPF_AND:
-- 
2.9.5



[PATCH bpf 2/4] bpf: don't set id on after map lookup with ptr_to_map_val return

2018-10-31 Thread Daniel Borkmann
In the verifier there is no such semantics where registers with
PTR_TO_MAP_VALUE type have an id assigned to them. This is only
used in PTR_TO_MAP_VALUE_OR_NULL and later on nullified once the
test against NULL has been pattern matched and type transformed
into PTR_TO_MAP_VALUE.

Fixes: 3e6a4b3e0289 ("bpf/verifier: introduce BPF_PTR_TO_MAP_VALUE")
Signed-off-by: Daniel Borkmann 
Cc: Roman Gushchin 
Acked-by: Alexei Starovoitov 
---
 kernel/bpf/verifier.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 774fa40..1971ca32 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2852,10 +2852,6 @@ static int check_helper_call(struct bpf_verifier_env 
*env, int func_id, int insn
regs[BPF_REG_0].type = NOT_INIT;
} else if (fn->ret_type == RET_PTR_TO_MAP_VALUE_OR_NULL ||
   fn->ret_type == RET_PTR_TO_MAP_VALUE) {
-   if (fn->ret_type == RET_PTR_TO_MAP_VALUE)
-   regs[BPF_REG_0].type = PTR_TO_MAP_VALUE;
-   else
-   regs[BPF_REG_0].type = PTR_TO_MAP_VALUE_OR_NULL;
/* There is no offset yet applied, variable or fixed */
mark_reg_known_zero(env, regs, BPF_REG_0);
/* remember map_ptr, so that check_map_access()
@@ -2868,7 +2864,12 @@ static int check_helper_call(struct bpf_verifier_env 
*env, int func_id, int insn
return -EINVAL;
}
regs[BPF_REG_0].map_ptr = meta.map_ptr;
-   regs[BPF_REG_0].id = ++env->id_gen;
+   if (fn->ret_type == RET_PTR_TO_MAP_VALUE) {
+   regs[BPF_REG_0].type = PTR_TO_MAP_VALUE;
+   } else {
+   regs[BPF_REG_0].type = PTR_TO_MAP_VALUE_OR_NULL;
+   regs[BPF_REG_0].id = ++env->id_gen;
+   }
} else if (fn->ret_type == RET_PTR_TO_SOCKET_OR_NULL) {
int id = acquire_reference_state(env, insn_idx);
if (id < 0)
-- 
2.9.5



[PATCH bpf 4/4] bpf: test make sure to run unpriv test cases in test_verifier

2018-10-31 Thread Daniel Borkmann
Right now unprivileged tests are never executed as a BPF test run,
only loaded. Allow for running them as well so that we can check
the outcome and probe for regressions.

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 tools/testing/selftests/bpf/test_verifier.c | 71 -
 1 file changed, 40 insertions(+), 31 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_verifier.c 
b/tools/testing/selftests/bpf/test_verifier.c
index 4c7445d..6f61df6 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -76,7 +76,7 @@ struct bpf_test {
int fixup_percpu_cgroup_storage[MAX_FIXUPS];
const char *errstr;
const char *errstr_unpriv;
-   uint32_t retval;
+   uint32_t retval, retval_unpriv;
enum {
UNDEF,
ACCEPT,
@@ -3084,6 +3084,8 @@ static struct bpf_test tests[] = {
.fixup_prog1 = { 2 },
.result = ACCEPT,
.retval = 42,
+   /* Verifier rewrite for unpriv skips tail call here. */
+   .retval_unpriv = 2,
},
{
"stack pointer arithmetic",
@@ -14149,6 +14151,33 @@ static void do_test_fixup(struct bpf_test *test, enum 
bpf_map_type prog_type,
}
 }
 
+static int set_admin(bool admin)
+{
+   cap_t caps;
+   const cap_value_t cap_val = CAP_SYS_ADMIN;
+   int ret = -1;
+
+   caps = cap_get_proc();
+   if (!caps) {
+   perror("cap_get_proc");
+   return -1;
+   }
+   if (cap_set_flag(caps, CAP_EFFECTIVE, 1, &cap_val,
+   admin ? CAP_SET : CAP_CLEAR)) {
+   perror("cap_set_flag");
+   goto out;
+   }
+   if (cap_set_proc(caps)) {
+   perror("cap_set_proc");
+   goto out;
+   }
+   ret = 0;
+out:
+   if (cap_free(caps))
+   perror("cap_free");
+   return ret;
+}
+
 static void do_test_single(struct bpf_test *test, bool unpriv,
   int *passes, int *errors)
 {
@@ -14157,6 +14186,7 @@ static void do_test_single(struct bpf_test *test, bool 
unpriv,
struct bpf_insn *prog = test->insns;
int map_fds[MAX_NR_MAPS];
const char *expected_err;
+   uint32_t expected_val;
uint32_t retval;
int i, err;
 
@@ -14176,6 +14206,8 @@ static void do_test_single(struct bpf_test *test, bool 
unpriv,
   test->result_unpriv : test->result;
expected_err = unpriv && test->errstr_unpriv ?
   test->errstr_unpriv : test->errstr;
+   expected_val = unpriv && test->retval_unpriv ?
+  test->retval_unpriv : test->retval;
 
reject_from_alignment = fd_prog < 0 &&
(test->flags & 
F_NEEDS_EFFICIENT_UNALIGNED_ACCESS) &&
@@ -14209,16 +14241,20 @@ static void do_test_single(struct bpf_test *test, 
bool unpriv,
__u8 tmp[TEST_DATA_LEN << 2];
__u32 size_tmp = sizeof(tmp);
 
+   if (unpriv)
+   set_admin(true);
err = bpf_prog_test_run(fd_prog, 1, test->data,
sizeof(test->data), tmp, &size_tmp,
&retval, NULL);
+   if (unpriv)
+   set_admin(false);
if (err && errno != 524/*ENOTSUPP*/ && errno != EPERM) {
printf("Unexpected bpf_prog_test_run error\n");
goto fail_log;
}
-   if (!err && retval != test->retval &&
-   test->retval != POINTER_VALUE) {
-   printf("FAIL retval %d != %d\n", retval, test->retval);
+   if (!err && retval != expected_val &&
+   expected_val != POINTER_VALUE) {
+   printf("FAIL retval %d != %d\n", retval, expected_val);
goto fail_log;
}
}
@@ -14261,33 +14297,6 @@ static bool is_admin(void)
return (sysadmin == CAP_SET);
 }
 
-static int set_admin(bool admin)
-{
-   cap_t caps;
-   const cap_value_t cap_val = CAP_SYS_ADMIN;
-   int ret = -1;
-
-   caps = cap_get_proc();
-   if (!caps) {
-   perror("cap_get_proc");
-   return -1;
-   }
-   if (cap_set_flag(caps, CAP_EFFECTIVE, 1, &cap_val,
-   admin ? CAP_SET : CAP_CLEAR)) {
-   perror("cap_set_flag");
-   goto out;
-   }
-   if (cap_set_proc(caps)) {
-   perror("cap_set_proc");
-   goto out;
-   }
-   ret = 0;
-out:
-   if (cap_free(caps))
-   perror("cap_free");
-   return ret;
-}
-
 static void get_unpriv_disabled()
 {
char buf[2];
-- 
2.9.5



[PATCH bpf 3/4] bpf: add various test cases to test_verifier

2018-10-31 Thread Daniel Borkmann
Add some more map related test cases to test_verifier kselftest
to improve test coverage. Summary: 1012 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 tools/testing/selftests/bpf/test_verifier.c | 250 
 1 file changed, 250 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_verifier.c 
b/tools/testing/selftests/bpf/test_verifier.c
index 36f3d30..4c7445d 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -6455,6 +6455,256 @@ static struct bpf_test tests[] = {
.prog_type = BPF_PROG_TYPE_TRACEPOINT,
},
{
+   "map access: known scalar += value_ptr",
+   .insns = {
+   BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+   BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+   BPF_LD_MAP_FD(BPF_REG_1, 0),
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+BPF_FUNC_map_lookup_elem),
+   BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 3),
+   BPF_MOV64_IMM(BPF_REG_1, 4),
+   BPF_ALU64_REG(BPF_ADD, BPF_REG_1, BPF_REG_0),
+   BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_1, 0),
+   BPF_MOV64_IMM(BPF_REG_0, 1),
+   BPF_EXIT_INSN(),
+   },
+   .fixup_map_array_48b = { 3 },
+   .result = ACCEPT,
+   .retval = 1,
+   },
+   {
+   "map access: value_ptr += known scalar",
+   .insns = {
+   BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+   BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+   BPF_LD_MAP_FD(BPF_REG_1, 0),
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+BPF_FUNC_map_lookup_elem),
+   BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 3),
+   BPF_MOV64_IMM(BPF_REG_1, 4),
+   BPF_ALU64_REG(BPF_ADD, BPF_REG_0, BPF_REG_1),
+   BPF_LDX_MEM(BPF_B, BPF_REG_1, BPF_REG_0, 0),
+   BPF_MOV64_IMM(BPF_REG_0, 1),
+   BPF_EXIT_INSN(),
+   },
+   .fixup_map_array_48b = { 3 },
+   .result = ACCEPT,
+   .retval = 1,
+   },
+   {
+   "map access: unknown scalar += value_ptr",
+   .insns = {
+   BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+   BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+   BPF_LD_MAP_FD(BPF_REG_1, 0),
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+BPF_FUNC_map_lookup_elem),
+   BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 4),
+   BPF_LDX_MEM(BPF_B, BPF_REG_1, BPF_REG_0, 0),
+   BPF_ALU64_IMM(BPF_AND, BPF_REG_1, 0xf),
+   BPF_ALU64_REG(BPF_ADD, BPF_REG_1, BPF_REG_0),
+   BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_1, 0),
+   BPF_MOV64_IMM(BPF_REG_0, 1),
+   BPF_EXIT_INSN(),
+   },
+   .fixup_map_array_48b = { 3 },
+   .result = ACCEPT,
+   .retval = 1,
+   },
+   {
+   "map access: value_ptr += unknown scalar",
+   .insns = {
+   BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+   BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+   BPF_LD_MAP_FD(BPF_REG_1, 0),
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+BPF_FUNC_map_lookup_elem),
+   BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 4),
+   BPF_LDX_MEM(BPF_B, BPF_REG_1, BPF_REG_0, 0),
+   BPF_ALU64_IMM(BPF_AND, BPF_REG_1, 0xf),
+   BPF_ALU64_REG(BPF_ADD, BPF_REG_0, BPF_REG_1),
+   BPF_LDX_MEM(BPF_B, BPF_REG_1, BPF_REG_0, 0),
+   BPF_MOV64_IMM(BPF_REG_0, 1),
+   BPF_EXIT_INSN(),
+   },
+   .fixup_map_array_48b = { 3 },
+   .result = ACCEPT,
+   .retval = 1,
+   },
+   {
+   "map access: value_ptr += value_ptr",
+   .insns = {
+   BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+   BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+   BPF_LD_MAP_FD(BPF_REG_1, 0),
+

[RFC 1/2] net: Add new socket-option SO_DELAYED_BIND

2018-10-31 Thread Christoph Paasch
And store it as a flag in the sk_flags.

Signed-off-by: Christoph Paasch 
---
 arch/alpha/include/uapi/asm/socket.h  |  2 ++
 arch/ia64/include/uapi/asm/socket.h   |  2 ++
 arch/mips/include/uapi/asm/socket.h   |  2 ++
 arch/parisc/include/uapi/asm/socket.h |  2 ++
 arch/s390/include/uapi/asm/socket.h   |  2 ++
 arch/sparc/include/uapi/asm/socket.h  |  2 ++
 arch/xtensa/include/uapi/asm/socket.h |  2 ++
 include/net/sock.h|  1 +
 include/uapi/asm-generic/socket.h |  2 ++
 net/core/sock.c   | 21 +
 10 files changed, 38 insertions(+)

diff --git a/arch/alpha/include/uapi/asm/socket.h 
b/arch/alpha/include/uapi/asm/socket.h
index 065fb372e355..add6aca13b53 100644
--- a/arch/alpha/include/uapi/asm/socket.h
+++ b/arch/alpha/include/uapi/asm/socket.h
@@ -115,4 +115,6 @@
 #define SO_TXTIME  61
 #define SCM_TXTIME SO_TXTIME
 
+#define SO_DELAYED_BIND62
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/ia64/include/uapi/asm/socket.h 
b/arch/ia64/include/uapi/asm/socket.h
index c872c4e6bafb..98a86f406601 100644
--- a/arch/ia64/include/uapi/asm/socket.h
+++ b/arch/ia64/include/uapi/asm/socket.h
@@ -117,4 +117,6 @@
 #define SO_TXTIME  61
 #define SCM_TXTIME SO_TXTIME
 
+#define SO_DELAYED_BIND62
+
 #endif /* _ASM_IA64_SOCKET_H */
diff --git a/arch/mips/include/uapi/asm/socket.h 
b/arch/mips/include/uapi/asm/socket.h
index 71370fb3ceef..f84bd74d58ee 100644
--- a/arch/mips/include/uapi/asm/socket.h
+++ b/arch/mips/include/uapi/asm/socket.h
@@ -126,4 +126,6 @@
 #define SO_TXTIME  61
 #define SCM_TXTIME SO_TXTIME
 
+#define SO_DELAYED_BIND62
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/parisc/include/uapi/asm/socket.h 
b/arch/parisc/include/uapi/asm/socket.h
index 061b9cf2a779..8fe20a7abf6e 100644
--- a/arch/parisc/include/uapi/asm/socket.h
+++ b/arch/parisc/include/uapi/asm/socket.h
@@ -107,4 +107,6 @@
 #define SO_TXTIME  0x4036
 #define SCM_TXTIME SO_TXTIME
 
+#define SO_DELAYED_BIND0x4037
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/s390/include/uapi/asm/socket.h 
b/arch/s390/include/uapi/asm/socket.h
index 39d901476ee5..c00b10909a72 100644
--- a/arch/s390/include/uapi/asm/socket.h
+++ b/arch/s390/include/uapi/asm/socket.h
@@ -114,4 +114,6 @@
 #define SO_TXTIME  61
 #define SCM_TXTIME SO_TXTIME
 
+#define SO_DELAYED_BIND62
+
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/sparc/include/uapi/asm/socket.h 
b/arch/sparc/include/uapi/asm/socket.h
index 7ea35e5601b6..0825db0c9f46 100644
--- a/arch/sparc/include/uapi/asm/socket.h
+++ b/arch/sparc/include/uapi/asm/socket.h
@@ -104,6 +104,8 @@
 #define SO_TXTIME  0x003f
 #define SCM_TXTIME SO_TXTIME
 
+#define SO_DELAYED_BIND0x0040
+
 /* Security levels - as per NRL IPv6 - don't actually do anything */
 #define SO_SECURITY_AUTHENTICATION 0x5001
 #define SO_SECURITY_ENCRYPTION_TRANSPORT   0x5002
diff --git a/arch/xtensa/include/uapi/asm/socket.h 
b/arch/xtensa/include/uapi/asm/socket.h
index 1de07a7f7680..cd4d91e982d5 100644
--- a/arch/xtensa/include/uapi/asm/socket.h
+++ b/arch/xtensa/include/uapi/asm/socket.h
@@ -119,4 +119,6 @@
 #define SO_TXTIME  61
 #define SCM_TXTIME SO_TXTIME
 
+#define SO_DELAYED_BIND62
+
 #endif /* _XTENSA_SOCKET_H */
diff --git a/include/net/sock.h b/include/net/sock.h
index f665d74ae509..16fbe54cf519 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -801,6 +801,7 @@ enum sock_flags {
SOCK_RCU_FREE, /* wait rcu grace period in sk_destruct() */
SOCK_TXTIME,
SOCK_XDP, /* XDP is attached */
+   SOCK_DELAYED_BIND,
 };
 
 #define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << 
SOCK_TIMESTAMPING_RX_SOFTWARE))
diff --git a/include/uapi/asm-generic/socket.h 
b/include/uapi/asm-generic/socket.h
index a12692e5f7a8..653f1f65a311 100644
--- a/include/uapi/asm-generic/socket.h
+++ b/include/uapi/asm-generic/socket.h
@@ -110,4 +110,6 @@
 #define SO_TXTIME  61
 #define SCM_TXTIME SO_TXTIME
 
+#define SO_DELAYED_BIND62
+
 #endif /* __ASM_GENERIC_SOCKET_H */
diff --git a/net/core/sock.c b/net/core/sock.c
index 6fcc4bc07d19..343baa820cf2 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1047,6 +1047,23 @@ int sock_setsockopt(struct socket *sock, int level, int 
optname,
}
break;
 
+   case SO_DELAYED_BIND:
+   if (sk->sk_family == PF_INET || sk->sk_family == PF_INET6) {
+   if (sk->sk_protocol != IPPROTO_UDP)
+   ret = -ENOTSUPP;
+   } else {
+   ret = -ENOTSUPP;
+   }
+
+   if (!ret) {
+   if (val < 0 || val > 1)
+   

[RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets

2018-10-31 Thread Christoph Paasch
Implementations of Quic might want to create a separate socket for each
Quic-connection by creating a connected UDP-socket.

To achieve that on the server-side, a "master-socket" needs to wait for
incoming new connections and then creates a new socket that will be a
connected UDP-socket. To create that latter one, the server needs to
first bind() and then connect(). However, after the bind() the server
might already receive traffic on that new socket that is unrelated to the
Quic-connection at hand. Only after the connect() a full 4-tuple match
is happening. So, one can't really create this kind of a server that has
a connected UDP-socket per Quic connection.

So, what is needed is an "atomic bind & connect" that basically
prevents any incoming traffic until the connect() call has been issued
at which point the full 4-tuple is known.


This patchset implements this functionality and exposes a socket-option
to do this.

Usage would be:

int fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);

int val = 1;
setsockopt(fd, SOL_SOCKET, SO_DELAYED_BIND, &val, sizeof(val));

bind(fd, (struct sockaddr *)&src, sizeof(src));

/* At this point, incoming traffic will never match on this socket */

connect(fd, (struct sockaddr *)&dst, sizeof(dst));

/* Only now incoming traffic will reach the socket */



There is literally an infinite number of ways on how to implement it,
which is why I first send it out as an RFC. With this approach here I
chose the least invasive one, just preventing the match on the incoming
path.


The reason for choosing a SOL_SOCKET socket-option and not at the
SOL_UDP-level is because that functionality actually could be useful for
other protocols as well. E.g., TCP wants to better use the full 4-tuple space
by binding to the source-IP and the destination-IP at the same time.


Feedback is very welcome!


Christoph Paasch (2):
  net: Add new socket-option SO_DELAYED_BIND
  udp: Support SO_DELAYED_BIND

 arch/alpha/include/uapi/asm/socket.h  |  2 ++
 arch/ia64/include/uapi/asm/socket.h   |  2 ++
 arch/mips/include/uapi/asm/socket.h   |  2 ++
 arch/parisc/include/uapi/asm/socket.h |  2 ++
 arch/s390/include/uapi/asm/socket.h   |  2 ++
 arch/sparc/include/uapi/asm/socket.h  |  2 ++
 arch/xtensa/include/uapi/asm/socket.h |  2 ++
 include/net/sock.h|  1 +
 include/uapi/asm-generic/socket.h |  2 ++
 net/core/sock.c   | 21 +
 net/ipv4/datagram.c   |  1 +
 net/ipv4/udp.c|  3 +++
 12 files changed, 42 insertions(+)

-- 
2.16.2



[RFC 2/2] udp: Support SO_DELAYED_BIND

2018-10-31 Thread Christoph Paasch
For UDP, there is only a single socket-hash table, the udptable.

We want to prevent incoming segments from matching on this socket when
SO_DELAYED_BIND is set. Thus, when computing the score for unconnected
sockets, we simply prevent the match as long as the flag is set.

Signed-off-by: Christoph Paasch 
---
 net/ipv4/datagram.c | 1 +
 net/ipv4/udp.c  | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/net/ipv4/datagram.c b/net/ipv4/datagram.c
index 300921417f89..9bf0e0d2ea33 100644
--- a/net/ipv4/datagram.c
+++ b/net/ipv4/datagram.c
@@ -78,6 +78,7 @@ int __ip4_datagram_connect(struct sock *sk, struct sockaddr 
*uaddr, int addr_len
inet->inet_id = jiffies;
 
sk_dst_set(sk, &rt->dst);
+   sock_reset_flag(sk, SOCK_DELAYED_BIND);
err = 0;
 out:
return err;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index ca3ed931f2a9..fb55f925342b 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -408,6 +408,9 @@ static int compute_score(struct sock *sk, struct net *net,
score += 4;
}
 
+   if (sock_flag(sk, SOCK_DELAYED_BIND))
+   return -1;
+
if (sk->sk_incoming_cpu == raw_smp_processor_id())
score++;
return score;
-- 
2.16.2



Re: Kernel 4.19 network performance - forwarding/routing normal users traffic

2018-10-31 Thread Paweł Staszewski




On 31.10.2018 at 23:20, Paweł Staszewski wrote:



On 31.10.2018 at 23:09, Eric Dumazet wrote:


On 10/31/2018 02:57 PM, Paweł Staszewski wrote:

Hi

So maybee someone will be interested how linux kernel handles normal 
traffic (not pktgen :) )



Server HW configuration:

CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz

NIC's: 2x 100G Mellanox ConnectX-4 (connected to x16 pcie 8GT)


Server software:

FRR - as routing daemon

enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS binded to local 
numa node)


enp175s0f1 (100G) - 343 vlans to clients (28 RSS binded to local 
numa node)



Maximum traffic that server can handle:

Bandwidth

  bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
   input: /proc/net/dev type: rate
   \ iface              Rx               Tx            Total
   ============================================================
    enp175s0f1:      28.51 Gb/s       37.24 Gb/s       65.74 Gb/s
    enp175s0f0:      38.07 Gb/s       28.44 Gb/s       66.51 Gb/s
   ------------------------------------------------------------
         total:      66.58 Gb/s       65.67 Gb/s      132.25 Gb/s



Packets per second:

  bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
   input: /proc/net/dev type: rate
   - iface              Rx                   Tx                Total
   ===================================================================
    enp175s0f1:      5248589.00 P/s       3486617.75 P/s       8735207.00 P/s
    enp175s0f0:      3557944.25 P/s       5232516.00 P/s       8790460.00 P/s
   -------------------------------------------------------------------
         total:      8806533.00 P/s       8719134.00 P/s      17525668.00 P/s



After reaching that limits nics on the upstream side (more RX 
traffic) start to drop packets



I just dont understand that server can't handle more bandwidth 
(~40Gbit/s is limit where all cpu's are 100% util) - where pps on RX 
side are increasing.


Was thinking that maybee reached some pcie x16 limit - but x16 8GT 
is 126Gbit - and also when testing with pktgen i can reach more bw 
and pps (like 4x more comparing to normal internet traffic)


And wondering if there is something that can be improved here.



Some more informations / counters / stats and perf top below:

Perf top flame graph:

https://uploadfiles.io/7zo6u



System configuration(long):


cat /sys/devices/system/node/node1/cpulist
14-27,42-55
cat /sys/class/net/enp175s0f0/device/numa_node
1
cat /sys/class/net/enp175s0f1/device/numa_node
1





ip -s -d link ls dev enp175s0f0
6: enp175s0f0:  mtu 1500 qdisc mq 
state UP mode DEFAULT group default qlen 8192
 link/ether 0c:c4:7a:d8:5d:1c brd ff:ff:ff:ff:ff:ff promiscuity 
0 addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 
65536 gso_max_segs 65535

 RX: bytes  packets  errors  dropped overrun mcast
 184142375840858 141347715974 2   2806325 0 85050528
 TX: bytes  packets  errors  dropped carrier collsns
 99270697277430 172227994003 0   0   0   0

  ip -s -d link ls dev enp175s0f1
7: enp175s0f1:  mtu 1500 qdisc mq 
state UP mode DEFAULT group default qlen 8192
 link/ether 0c:c4:7a:d8:5d:1d brd ff:ff:ff:ff:ff:ff promiscuity 
0 addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 
65536 gso_max_segs 65535

 RX: bytes  packets  errors  dropped overrun mcast
 99686284170801 173507590134 61  669685  0 100304421
 TX: bytes  packets  errors  dropped carrier collsns
 184435107970545 142383178304 0   0   0   0


./softnet.sh
cpu  total    dropped   squeezed  collision    rps flow_limit




    PerfTop:  108490 irqs/sec  kernel:99.6%  exact:  0.0% [4000Hz 
cycles],  (all, 56 CPUs)
--- 



 26.78%  [kernel]   [k] queued_spin_lock_slowpath

This is highly suspect.

A call graph (perf record -a -g sleep 1; perf report --stdio) would 
tell what is going on.

perf report:
https://ufile.io/rqp0h





With that many TX/RX queues, I would expect you to not use RPS/RFS, 
and have a 1/1 RX/TX mapping,

so I do not know what could cause such spinlock contention.






And yes, there is no RPS/RFS - just 1/1 RX/TX and affinity mapping on the 
local CPUs for the network controller's 28 RX+TX queues per NIC.






Re: Kernel 4.19 network performance - forwarding/routing normal users traffic

2018-10-31 Thread Paweł Staszewski




W dniu 31.10.2018 o 23:09, Eric Dumazet pisze:


On 10/31/2018 02:57 PM, Paweł Staszewski wrote:

Hi

So maybe someone will be interested in how the Linux kernel handles normal traffic 
(not pktgen :) )


Server HW configuration:

CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz

NIC's: 2x 100G Mellanox ConnectX-4 (connected to x16 pcie 8GT)


Server software:

FRR - as routing daemon

enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS queues bound to the local numa node)

enp175s0f1 (100G) - 343 vlans to clients (28 RSS queues bound to the local numa node)


Maximum traffic that server can handle:

Bandwidth

  bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
   input: /proc/net/dev type: rate
   \ iface   Rx Tx    Total
==
    enp175s0f1:  28.51 Gb/s   37.24 Gb/s   65.74 
Gb/s
    enp175s0f0:  38.07 Gb/s   28.44 Gb/s   66.51 
Gb/s
--
     total:  66.58 Gb/s   65.67 Gb/s  132.25 
Gb/s


Packets per second:

  bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
   input: /proc/net/dev type: rate
   - iface   Rx Tx    Total
==
    enp175s0f1:  5248589.00 P/s   3486617.75 P/s 8735207.00 P/s
    enp175s0f0:  3557944.25 P/s   5232516.00 P/s 8790460.00 P/s
--
     total:  8806533.00 P/s   8719134.00 P/s 17525668.00 P/s


After reaching those limits, the NICs on the upstream side (more RX traffic) start to 
drop packets


I just don't understand why the server can't handle more bandwidth (~40Gbit/s is the 
limit where all CPUs are at 100% utilization) - while pps on the RX side keep increasing.

Was thinking that maybe it reached some PCIe x16 limit - but x16 8GT is 126Gbit - 
and also when testing with pktgen I can reach more bw and pps (like 4x more 
compared to normal internet traffic)

And wondering if there is something that can be improved here.



Some more informations / counters / stats and perf top below:

Perf top flame graph:

https://uploadfiles.io/7zo6u



System configuration(long):


cat /sys/devices/system/node/node1/cpulist
14-27,42-55
cat /sys/class/net/enp175s0f0/device/numa_node
1
cat /sys/class/net/enp175s0f1/device/numa_node
1





ip -s -d link ls dev enp175s0f0
6: enp175s0f0:  mtu 1500 qdisc mq state UP 
mode DEFAULT group default qlen 8192
     link/ether 0c:c4:7a:d8:5d:1c brd ff:ff:ff:ff:ff:ff promiscuity 0 
addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536 
gso_max_segs 65535
     RX: bytes  packets  errors  dropped overrun mcast
     184142375840858 141347715974 2   2806325 0   85050528
     TX: bytes  packets  errors  dropped carrier collsns
     99270697277430 172227994003 0   0   0   0

  ip -s -d link ls dev enp175s0f1
7: enp175s0f1:  mtu 1500 qdisc mq state UP 
mode DEFAULT group default qlen 8192
     link/ether 0c:c4:7a:d8:5d:1d brd ff:ff:ff:ff:ff:ff promiscuity 0 
addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536 
gso_max_segs 65535
     RX: bytes  packets  errors  dropped overrun mcast
     99686284170801 173507590134 61  669685  0   100304421
     TX: bytes  packets  errors  dropped carrier collsns
     184435107970545 142383178304 0   0   0   0


./softnet.sh
cpu  total    dropped   squeezed  collision    rps flow_limit




    PerfTop:  108490 irqs/sec  kernel:99.6%  exact:  0.0% [4000Hz cycles],  
(all, 56 CPUs)
---

     26.78%  [kernel]   [k] queued_spin_lock_slowpath

This is highly suspect.

A call graph (perf record -a -g sleep 1; perf report --stdio) would tell what 
is going on.

perf report:
https://ufile.io/rqp0h





With that many TX/RX queues, I would expect you to not use RPS/RFS, and have a 
1/1 RX/TX mapping,
so I do not know what could cause such spinlock contention.







Re: Kernel 4.19 network performance - forwarding/routing normal users traffic

2018-10-31 Thread Eric Dumazet



On 10/31/2018 02:57 PM, Paweł Staszewski wrote:
> Hi
> 
> So maybe someone will be interested in how the Linux kernel handles normal traffic 
> (not pktgen :) )
> 
> 
> Server HW configuration:
> 
> CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
> 
> NIC's: 2x 100G Mellanox ConnectX-4 (connected to x16 pcie 8GT)
> 
> 
> Server software:
> 
> FRR - as routing daemon
> 
> enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS queues bound to the local numa node)
> 
> enp175s0f1 (100G) - 343 vlans to clients (28 RSS queues bound to the local numa node)
> 
> 
> Maximum traffic that server can handle:
> 
> Bandwidth
> 
>  bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
>   input: /proc/net/dev type: rate
>   \ iface   Rx Tx    Total
> ==
>    enp175s0f1:  28.51 Gb/s   37.24 Gb/s   65.74 
> Gb/s
>    enp175s0f0:  38.07 Gb/s   28.44 Gb/s   66.51 
> Gb/s
> --
>     total:  66.58 Gb/s   65.67 Gb/s  132.25 
> Gb/s
> 
> 
> Packets per second:
> 
>  bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
>   input: /proc/net/dev type: rate
>   - iface   Rx Tx    Total
> ==
>    enp175s0f1:  5248589.00 P/s   3486617.75 P/s 8735207.00 P/s
>    enp175s0f0:  3557944.25 P/s   5232516.00 P/s 8790460.00 P/s
> --
>     total:  8806533.00 P/s   8719134.00 P/s 17525668.00 P/s
> 
> 
> After reaching those limits, the NICs on the upstream side (more RX traffic) start 
> to drop packets
> 
> 
> I just don't understand why the server can't handle more bandwidth (~40Gbit/s is the 
> limit where all CPUs are at 100% utilization) - while pps on the RX side keep increasing.
> 
> Was thinking that maybe it reached some PCIe x16 limit - but x16 8GT is 126Gbit 
> - and also when testing with pktgen I can reach more bw and pps (like 4x more 
> compared to normal internet traffic)
> 
> And wondering if there is something that can be improved here.
> 
> 
> 
> Some more informations / counters / stats and perf top below:
> 
> Perf top flame graph:
> 
> https://uploadfiles.io/7zo6u
> 
> 
> 
> System configuration(long):
> 
> 
> cat /sys/devices/system/node/node1/cpulist
> 14-27,42-55
> cat /sys/class/net/enp175s0f0/device/numa_node
> 1
> cat /sys/class/net/enp175s0f1/device/numa_node
> 1
> 
> 
> 
> 
> 
> ip -s -d link ls dev enp175s0f0
> 6: enp175s0f0:  mtu 1500 qdisc mq state UP 
> mode DEFAULT group default qlen 8192
>     link/ether 0c:c4:7a:d8:5d:1c brd ff:ff:ff:ff:ff:ff promiscuity 0 
> addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536 
> gso_max_segs 65535
>     RX: bytes  packets  errors  dropped overrun mcast
>     184142375840858 141347715974 2   2806325 0   85050528
>     TX: bytes  packets  errors  dropped carrier collsns
>     99270697277430 172227994003 0   0   0   0
> 
>  ip -s -d link ls dev enp175s0f1
> 7: enp175s0f1:  mtu 1500 qdisc mq state UP 
> mode DEFAULT group default qlen 8192
>     link/ether 0c:c4:7a:d8:5d:1d brd ff:ff:ff:ff:ff:ff promiscuity 0 
> addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536 
> gso_max_segs 65535
>     RX: bytes  packets  errors  dropped overrun mcast
>     99686284170801 173507590134 61  669685  0   100304421
>     TX: bytes  packets  errors  dropped carrier collsns
>     184435107970545 142383178304 0   0   0   0
> 
> 
> ./softnet.sh
> cpu  total    dropped   squeezed  collision    rps flow_limit
> 
> 
> 
> 
>    PerfTop:  108490 irqs/sec  kernel:99.6%  exact:  0.0% [4000Hz cycles],  
> (all, 56 CPUs)
> ---
> 
>     26.78%  [kernel]   [k] queued_spin_lock_slowpath

This is highly suspect.

A call graph (perf record -a -g sleep 1; perf report --stdio) would tell what 
is going on.

With that many TX/RX queues, I would expect you to not use RPS/RFS, and have a 
1/1 RX/TX mapping,
so I do not know what could cause such spinlock contention.




RE: [RFC PATCH 4/4] ixgbe: add support for extended PHC gettime

2018-10-31 Thread Keller, Jacob E



> -Original Message-
> From: Richard Cochran [mailto:richardcoch...@gmail.com]
> Sent: Wednesday, October 31, 2018 2:17 PM
> To: Miroslav Lichvar 
> Cc: Keller, Jacob E ; netdev@vger.kernel.org; 
> intel-wired-
> l...@lists.osuosl.org
> Subject: Re: [RFC PATCH 4/4] ixgbe: add support for extended PHC gettime
> 
> On Wed, Oct 31, 2018 at 03:49:35PM +0100, Miroslav Lichvar wrote:
> >
> > How about separating the PHC timestamp from the ptp_system_timestamp
> > structure and use NULL to indicate we don't want to read the system
> > clock? A gettimex64(ptp, ts, NULL) call would be equal to
> > gettime64(ptp, ts).
> 
> Doesn't sound too bad to me.
> 
> Thanks,
> Richard

Yep, this seems fine to me as well.

Regards,
Jake
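
For concreteness, a sketch of what the convention agreed above could look like
in a driver callback: gettimex64() takes an optional struct
ptp_system_timestamp, and a NULL pointer means "PHC time only", i.e. plain
gettime64() behaviour. Everything prefixed foo_ is a placeholder, and the
pre_ts/post_ts field names are assumptions about the struct layout rather than
something taken from the RFC.

static int foo_phc_gettimex64(struct ptp_clock_info *ptp,
			      struct timespec64 *ts,
			      struct ptp_system_timestamp *sts)
{
	struct foo_adapter *adapter = container_of(ptp, struct foo_adapter,
						   ptp_caps);
	u64 ns;

	/* NULL sts: the caller only wants the PHC reading. */
	if (sts)
		ktime_get_real_ts64(&sts->pre_ts);	/* system time just before */
	ns = foo_read_systim(adapter);			/* placeholder PHC register read */
	if (sts)
		ktime_get_real_ts64(&sts->post_ts);	/* system time just after */

	*ts = ns_to_timespec64(ns);
	return 0;
}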


Re: [PATCH bpf] libbpf: Fix compile error in libbpf_attach_type_by_name

2018-10-31 Thread Daniel Borkmann
On 10/31/2018 09:49 PM, Arnaldo Carvalho de Melo wrote:
> Em Wed, Oct 31, 2018 at 12:57:18PM -0700, Andrey Ignatov escreveu:
>> Arnaldo Carvalho de Melo reported build error in libbpf when clang
>> version 3.8.1-24 (tags/RELEASE_381/final) is used:
>>
>> libbpf.c:2201:36: error: comparison of constant -22 with expression of
>> type 'const enum bpf_attach_type' is always false
>> [-Werror,-Wtautological-constant-out-of-range-compare]
>> if (section_names[i].attach_type == -EINVAL)
>>  ^  ~~~
>> 1 error generated.
>>
>> Fix the error by keeping "is_attachable" property of a program in a
>> separate struct field instead of trying to use attach_type itself.
> 
> Thanks, now it builds in all the previously failing systems:
> 
> # export PERF_TARBALL=http://192.168.86.4/perf/perf-4.19.0.tar.xz
> # dm debian:9 fedora:25 fedora:26 fedora:27 ubuntu:16.04 ubuntu:17.10
>1 debian:9: Ok   gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516   
>clang version 3.8.1-24 (tags/RELEASE_381/final)
>2 fedora:25   : Ok   gcc (GCC) 6.4.1 20170727 (Red Hat 6.4.1-1)
>clang version 3.9.1 (tags/RELEASE_391/final)
>3 fedora:26   : Ok   gcc (GCC) 7.3.1 20180130 (Red Hat 7.3.1-2)
>clang version 4.0.1 (tags/RELEASE_401/final)
>4 fedora:27   : Ok   gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-6)
>clang version 5.0.2 (tags/RELEASE_502/final)
>5 ubuntu:16.04: Ok   gcc (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 
> 20160609  clang version 3.8.0-2ubuntu4 (tags/RELEASE_380/final)
>6 ubuntu:17.10: Ok   gcc (Ubuntu 7.2.0-8ubuntu3.2) 7.2.0   
>clang version 4.0.1-6 (tags/RELEASE_401/final)
> #
> 
> Tested-by: Arnaldo Carvalho de Melo 

Thanks everyone, applied to bpf tree.


Re: Ethernet on my CycloneV broke since 4.9.124

2018-10-31 Thread Dinh Nguyen
Hi Clement,

On 10/31/2018 10:36 AM, Clément Péron wrote:
> Hi Dinh,
> 
> On Wed, 31 Oct 2018 at 15:42, Dinh Nguyen  wrote:
>>
>> Hi Clement,
>>
>> On 10/31/2018 08:01 AM, Clément Péron wrote:
>>> Hi,
>>>
>>> The patch "net: stmmac: socfpga: add additional ocp reset line for
>>> Stratix10" introduce in 4.9.124 broke the ethernet on my CycloneV
>>> board.
>>>
>>> When I boot I have this issue:
>>>
>>> socfpga-dwmac ff702000.ethernet: error getting reset control of ocp -2
>>> socfpga-dwmac: probe of ff702000.ethernet failed with error -2
>>>
>>> Reverting commit 6f37f7b62baa6a71d7f3f298acb64de51275e724 fixes the 
>>> issue.
>>>
>>
>> Are you sure? I just booted v4.9.124 and did not see any errors. The
>> error should not appear because the commit is using
>> devm_reset_control_get_optional().
> 
> I'm booting 4.9.130 actually. I agree with you that
> devm_reset_control_get_optional should not fail, but checking other
> usages of this helper
> https://elixir.bootlin.com/linux/v4.9.135/source/drivers/i2c/busses/i2c-mv64xxx.c#L824
> https://elixir.bootlin.com/linux/v4.9.135/source/drivers/crypto/sunxi-ss/sun4i-ss-core.c#L259
> shows that they don't check for errors except for PROBE_DEFER
> 

I made a mistake, I was booting linux-next. I am seeing the error with
v4.9.124. It's due to this commit not getting backported:

"bb475230b8e59a reset: make optional functions really optional"

I have backported the patch and it is available here if you would like to take a
look:

https://git.kernel.org/pub/scm/linux/kernel/git/dinguyen/linux.git/log/?h=v4.9.124_optional_reset

Dinh


Re: Latest net-next kernel 4.19.0+

2018-10-31 Thread Paweł Staszewski




W dniu 30.10.2018 o 15:16, Eric Dumazet pisze:


On 10/30/2018 01:09 AM, Paweł Staszewski wrote:


W dniu 30.10.2018 o 08:29, Eric Dumazet pisze:

On 10/29/2018 11:09 PM, Dimitris Michailidis wrote:


Indeed this is a bug. I would expect it to produce frequent errors
though as many odd-length
packets would trigger it. Do you have RXFCS? Regardless, how
frequently do you see the problem?


Old kernels (before 88078d98d1bb) were simply resetting ip_summed to 
CHECKSUM_NONE

And before your fix (commit d55bef5059dd057bd), mlx5 bug was canceling the bug 
you fixed.

So we now need to also fix mlx5.

And of course use skb_header_pointer() in mlx5e_get_fcs() as I mentioned 
earlier,
plus __get_unaligned_cpu32() as you hinted.





No RXFCS

And this trace appears really frequently, like once per 3-4 seconds,
like below:
[28965.776864] vlan1490: hw csum failure

Might be vlan related.

Can you first check this :

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 
94224c22ecc310a87b6715051e335446f29bec03..6f4bfebf0d9a3ae7567062abb3ea6532b3aaf3d6
 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -789,13 +789,8 @@ static inline void mlx5e_handle_csum(struct net_device 
*netdev,
 skb->ip_summed = CHECKSUM_COMPLETE;
 skb->csum = csum_unfold((__force __sum16)cqe->check_sum);
 if (network_depth > ETH_HLEN)
-   /* CQE csum is calculated from the IP header and does
-* not cover VLAN headers (if present). This will add
-* the checksum manually.
-*/
-   skb->csum = csum_partial(skb->data + ETH_HLEN,
-network_depth - ETH_HLEN,
-skb->csum);
+   /* Temporary debugging */
+   skb->ip_summed = CHECKSUM_NONE;
 if (unlikely(netdev->features & NETIF_F_RXFCS))
 skb->csum = csum_add(skb->csum,
  (__force 
__wsum)mlx5e_get_fcs(skb));




Ok thanks - will try it.




Re: Latest net-next kernel 4.19.0+

2018-10-31 Thread Paweł Staszewski




W dniu 31.10.2018 o 22:05, Saeed Mahameed pisze:

On Tue, 2018-10-30 at 10:32 -0700, Cong Wang wrote:

On Tue, Oct 30, 2018 at 7:16 AM Eric Dumazet 
wrote:



On 10/30/2018 01:09 AM, Paweł Staszewski wrote:


W dniu 30.10.2018 o 08:29, Eric Dumazet pisze:

On 10/29/2018 11:09 PM, Dimitris Michailidis wrote:


Indeed this is a bug. I would expect it to produce frequent
errors
though as many odd-length
packets would trigger it. Do you have RXFCS? Regardless, how
frequently do you see the problem?


Old kernels (before 88078d98d1bb) were simply resetting
ip_summed to CHECKSUM_NONE

And before your fix (commit d55bef5059dd057bd), mlx5 bug was
canceling the bug you fixed.

So we now need to also fix mlx5.

And of course use skb_header_pointer() in mlx5e_get_fcs() as I
mentioned earlier,
plus __get_unaligned_cpu32() as you hinted.





No RXFCS


Same with Pawel, RXFCS is disabled by default.



And this trace appears really frequently, like once per 3-4 seconds,
like below:
[28965.776864] vlan1490: hw csum failure

Might be vlan related.

Hi Pawel, is the vlan stripping offload disabled or enabled in your
case ?

To verify:
ethtool -k <interface> | grep rx-vlan-offload
rx-vlan-offload: on
To set:
ethtool -K <interface> rxvlan on/off

Enabled:
ethtool -k enp175s0f0
Features for enp175s0f0:
rx-checksumming: on
tx-checksumming: on
    tx-checksum-ipv4: on
    tx-checksum-ip-generic: off [fixed]
    tx-checksum-ipv6: on
    tx-checksum-fcoe-crc: off [fixed]
    tx-checksum-sctp: off [fixed]
scatter-gather: on
    tx-scatter-gather: on
    tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
    tx-tcp-segmentation: on
    tx-tcp-ecn-segmentation: off [fixed]
    tx-tcp-mangleid-segmentation: off
    tx-tcp6-segmentation: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: on
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: off [fixed]
tx-ipxip6-segmentation: off [fixed]
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: on
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: on
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off
rx-all: off
tx-vlan-stag-hw-insert: on
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: on [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: on
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: on
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]




if the vlan offload is off then it will trigger the mlx5e vlan csum
adjustment code pointed out by Eric.

Anyhow, it should work in both cases, but I am trying to narrow down
the possibilities.

Also could it be a double tagged packet ?

no double tagged packets there






Unlike Pawel's case, we don't use vlan at all, maybe this is why we
see
it much less frequently than Pawel.

Also, it is probably not specific to mlx5, as there is another report
which
is probably a non-mlx5 driver.


Cong, how often does this happen? Can you somehow verify whether the
problematic packet has extra end padding after the IP payload?

It would be cool if we had a feature in the kernel to store such an SKB in
memory when such an issue occurs, and let the user dump it later (via
tcpdump) and send the dump to the vendor for debugging so we could just
replay and see what happens.


Thanks.




Re: Latest net-next kernel 4.19.0+

2018-10-31 Thread Cong Wang
On Wed, Oct 31, 2018 at 2:05 PM Saeed Mahameed  wrote:
>
> Cong, How often does this happen ? can you some how verify if the
> problematic packet has extra end padding after the ip payload ?

For us, we need 10+ hours to get one warning. This is also
why we never capture the packet that causes this warning.


>
> It would be cool if we had a feature in kernel to store such SKB in
> memory when such issue occurs, and let the user dump it later (via
> tcpdump) and send the dump to the vendor for debug so we could just
> replay and see what happens.
>

Yeah, the warning kinda sucks, it tells almost nothing; the SKB
should be dumped upon this warning.


Re: [RFC PATCH 4/4] ixgbe: add support for extended PHC gettime

2018-10-31 Thread Richard Cochran
On Wed, Oct 31, 2018 at 03:49:35PM +0100, Miroslav Lichvar wrote:
> 
> How about separating the PHC timestamp from the ptp_system_timestamp
> structure and use NULL to indicate we don't want to read the system
> clock? A gettimex64(ptp, ts, NULL) call would be equal to
> gettime64(ptp, ts).

Doesn't sound too bad to me.

Thanks,
Richard


Re: Latest net-next kernel 4.19.0+

2018-10-31 Thread Saeed Mahameed
On Tue, 2018-10-30 at 10:32 -0700, Cong Wang wrote:
> On Tue, Oct 30, 2018 at 7:16 AM Eric Dumazet 
> wrote:
> > 
> > 
> > 
> > On 10/30/2018 01:09 AM, Paweł Staszewski wrote:
> > > 
> > > 
> > > W dniu 30.10.2018 o 08:29, Eric Dumazet pisze:
> > > > 
> > > > On 10/29/2018 11:09 PM, Dimitris Michailidis wrote:
> > > > 
> > > > > Indeed this is a bug. I would expect it to produce frequent
> > > > > errors
> > > > > though as many odd-length
> > > > > packets would trigger it. Do you have RXFCS? Regardless, how
> > > > > frequently do you see the problem?
> > > > > 
> > > > 
> > > > Old kernels (before 88078d98d1bb) were simply resetting
> > > > ip_summed to CHECKSUM_NONE
> > > > 
> > > > And before your fix (commit d55bef5059dd057bd), mlx5 bug was
> > > > canceling the bug you fixed.
> > > > 
> > > > So we now need to also fix mlx5.
> > > > 
> > > > And of course use skb_header_pointer() in mlx5e_get_fcs() as I
> > > > mentioned earlier,
> > > > plus __get_unaligned_cpu32() as you hinted.
> > > > 
> > > > 
> > > > 
> > > > 
> > > 
> > > No RXFCS
> 
> 
> Same with Pawel, RXFCS is disabled by default.
> 
> 
> > > 
> > > And this trace appears really frequently, like once per 3-4 seconds,
> > > like below:
> > > [28965.776864] vlan1490: hw csum failure
> > 
> > Might be vlan related.
> 

Hi Pawel, is the vlan stripping offload disabled or enabled in your
case ? 

To verify:
ethtool -k <interface> | grep rx-vlan-offload
rx-vlan-offload: on
To set:
ethtool -K <interface> rxvlan on/off

if the vlan offload is off then it will trigger the mlx5e vlan csum
adjustment code pointed out by Eric.

Anyhow, it should work in both cases, but I am trying to narrow down
the possibilities. 

Also could it be a double tagged packet ?


> Unlike Pawel's case, we don't use vlan at all, maybe this is why we
> see
> it much less frequently than Pawel.
> 
> Also, it is probably not specific to mlx5, as there is another report
> which
> is probably a non-mlx5 driver.
> 

Cong, how often does this happen? Can you somehow verify whether the
problematic packet has extra end padding after the IP payload?

It would be cool if we had a feature in the kernel to store such an SKB in
memory when such an issue occurs, and let the user dump it later (via
tcpdump) and send the dump to the vendor for debugging so we could just
replay and see what happens.

> Thanks.


Re: [PATCH bpf] libbpf: Fix compile error in libbpf_attach_type_by_name

2018-10-31 Thread Arnaldo Carvalho de Melo
Em Wed, Oct 31, 2018 at 12:57:18PM -0700, Andrey Ignatov escreveu:
> Arnaldo Carvalho de Melo reported build error in libbpf when clang
> version 3.8.1-24 (tags/RELEASE_381/final) is used:
> 
> libbpf.c:2201:36: error: comparison of constant -22 with expression of
> type 'const enum bpf_attach_type' is always false
> [-Werror,-Wtautological-constant-out-of-range-compare]
> if (section_names[i].attach_type == -EINVAL)
>  ^  ~~~
> 1 error generated.
> 
> Fix the error by keeping "is_attachable" property of a program in a
> separate struct field instead of trying to use attach_type itself.

Thanks, now it builds in all the previously failing systems:

# export PERF_TARBALL=http://192.168.86.4/perf/perf-4.19.0.tar.xz
# dm debian:9 fedora:25 fedora:26 fedora:27 ubuntu:16.04 ubuntu:17.10
   1 debian:9: Ok   gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516 
 clang version 3.8.1-24 (tags/RELEASE_381/final)
   2 fedora:25   : Ok   gcc (GCC) 6.4.1 20170727 (Red Hat 6.4.1-1)  
 clang version 3.9.1 (tags/RELEASE_391/final)
   3 fedora:26   : Ok   gcc (GCC) 7.3.1 20180130 (Red Hat 7.3.1-2)  
 clang version 4.0.1 (tags/RELEASE_401/final)
   4 fedora:27   : Ok   gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-6)  
 clang version 5.0.2 (tags/RELEASE_502/final)
   5 ubuntu:16.04: Ok   gcc (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609 
 clang version 3.8.0-2ubuntu4 (tags/RELEASE_380/final)
   6 ubuntu:17.10: Ok   gcc (Ubuntu 7.2.0-8ubuntu3.2) 7.2.0 
 clang version 4.0.1-6 (tags/RELEASE_401/final)
#

Tested-by: Arnaldo Carvalho de Melo 

I also have it tentatively applied to my perf/urgent branch, that I'll
push upstream soon.

- Arnaldo
 
> Fixes: commit 956b620fcf0b ("libbpf: Introduce libbpf_attach_type_by_name")
> Reported-by: Arnaldo Carvalho de Melo 
> Signed-off-by: Andrey Ignatov 
> ---
>  tools/lib/bpf/libbpf.c | 13 +++--
>  1 file changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index b607be7236d3..d6e62e90e8d4 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -2084,19 +2084,19 @@ void bpf_program__set_expected_attach_type(struct 
> bpf_program *prog,
>   prog->expected_attach_type = type;
>  }
>  
> -#define BPF_PROG_SEC_IMPL(string, ptype, eatype, atype) \
> - { string, sizeof(string) - 1, ptype, eatype, atype }
> +#define BPF_PROG_SEC_IMPL(string, ptype, eatype, is_attachable, atype) \
> + { string, sizeof(string) - 1, ptype, eatype, is_attachable, atype }
>  
>  /* Programs that can NOT be attached. */
> -#define BPF_PROG_SEC(string, ptype) BPF_PROG_SEC_IMPL(string, ptype, 0, 
> -EINVAL)
> +#define BPF_PROG_SEC(string, ptype) BPF_PROG_SEC_IMPL(string, ptype, 0, 0, 0)
>  
>  /* Programs that can be attached. */
>  #define BPF_APROG_SEC(string, ptype, atype) \
> - BPF_PROG_SEC_IMPL(string, ptype, 0, atype)
> + BPF_PROG_SEC_IMPL(string, ptype, 0, 1, atype)
>  
>  /* Programs that must specify expected attach type at load time. */
>  #define BPF_EAPROG_SEC(string, ptype, eatype) \
> - BPF_PROG_SEC_IMPL(string, ptype, eatype, eatype)
> + BPF_PROG_SEC_IMPL(string, ptype, eatype, 1, eatype)
>  
>  /* Programs that can be attached but attach type can't be identified by 
> section
>   * name. Kept for backward compatibility.
> @@ -2108,6 +2108,7 @@ static const struct {
>   size_t len;
>   enum bpf_prog_type prog_type;
>   enum bpf_attach_type expected_attach_type;
> + int is_attachable;
>   enum bpf_attach_type attach_type;
>  } section_names[] = {
>   BPF_PROG_SEC("socket",  BPF_PROG_TYPE_SOCKET_FILTER),
> @@ -2198,7 +2199,7 @@ int libbpf_attach_type_by_name(const char *name,
>   for (i = 0; i < ARRAY_SIZE(section_names); i++) {
>   if (strncmp(name, section_names[i].sec, section_names[i].len))
>   continue;
> - if (section_names[i].attach_type == -EINVAL)
> + if (!section_names[i].is_attachable)
>   return -EINVAL;
>   *attach_type = section_names[i].attach_type;
>   return 0;
> -- 
> 2.17.1


Re: [PATCH net] openvswitch: Fix push/pop ethernet validation

2018-10-31 Thread Gregory Rose

On 10/31/2018 10:52 AM, Jaime Caamaño Ruiz wrote:

When there are both pop and push ethernet header actions among the
actions to be applied to a packet, an unexpected EINVAL (Invalid
argument) error is obtained. This is due to mac_proto not being reset
correctly when those actions are validated.

Reported-at:
https://mail.openvswitch.org/pipermail/ovs-discuss/2018-October/047554.html
Fixes: 91820da6ae85 ("openvswitch: add Ethernet push and pop actions")
Signed-off-by: Jaime Caamaño Ruiz 
---
  net/openvswitch/flow_netlink.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index a70097ecf33c..865ecef68196 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -3030,7 +3030,7 @@ static int __ovs_nla_copy_actions(struct net *net, const 
struct nlattr *attr,
 * is already present */
if (mac_proto != MAC_PROTO_NONE)
return -EINVAL;
-   mac_proto = MAC_PROTO_NONE;
+   mac_proto = MAC_PROTO_ETHERNET;
break;
  
  		case OVS_ACTION_ATTR_POP_ETH:

@@ -3038,7 +3038,7 @@ static int __ovs_nla_copy_actions(struct net *net, const 
struct nlattr *attr,
return -EINVAL;
if (vlan_tci & htons(VLAN_TAG_PRESENT))
return -EINVAL;
-   mac_proto = MAC_PROTO_ETHERNET;
+   mac_proto = MAC_PROTO_NONE;
break;
  
  		case OVS_ACTION_ATTR_PUSH_NSH:


Thanks Jaime!

Tested-by: Greg Rose 
Reviewed-by: Greg Rose 



Re: [PATCH net 0/4] mlxsw: Enable minimum shaper on MC TCs

2018-10-31 Thread David Miller
From: Ido Schimmel 
Date: Wed, 31 Oct 2018 09:56:41 +

> Petr says:
> 
> An MC-aware mode was introduced in commit 7b8195306694 ("mlxsw:
> spectrum: Configure MC-aware mode on mlxsw ports"). In MC-aware mode,
> BUM traffic gets a special treatment by being assigned to a separate set
> of traffic classes 8..15. Pairs of TCs 0 and 8, 1 and 9, etc., are then
> configured to strictly prioritize the lower-numbered ones. The intention
> is to prevent BUM traffic from flooding the switch and pushing out all UC
> traffic, which would otherwise happen, and instead give UC traffic
> precedence.
> 
> However strictly prioritizing UC traffic has the effect that UC overload
> pushes out all BUM traffic, such as legitimate ARP queries. These
> packets are kept in queues for a while, but under sustained UC overload,
> their lifetime eventually expires and these packets are dropped. That is
> detrimental to network performance as well.
> 
> In this patchset, MC TCs (8..15) are configured with minimum shaper of
> 200Mbps (a minimum permitted value) to allow a trickle of necessary
> control traffic to get through.
> 
> First in patch #1, the QEEC register is extended with fields necessary
> to configure the minimum shaper.
> 
> In patch #2, minimum shaper is enabled on TCs 8..15.
> 
> In patches #3 and #4, first the MC-awareness test is tweaked to support
> the minimum shaper, and then a new test is introduced to test that MC
> traffic behaves well under UC overload.

Series applied, thanks.


[PATCH bpf] libbpf: Fix compile error in libbpf_attach_type_by_name

2018-10-31 Thread Andrey Ignatov
Arnaldo Carvalho de Melo reported build error in libbpf when clang
version 3.8.1-24 (tags/RELEASE_381/final) is used:

libbpf.c:2201:36: error: comparison of constant -22 with expression of
type 'const enum bpf_attach_type' is always false
[-Werror,-Wtautological-constant-out-of-range-compare]
if (section_names[i].attach_type == -EINVAL)
 ^  ~~~
1 error generated.

Fix the error by keeping "is_attachable" property of a program in a
separate struct field instead of trying to use attach_type itself.

Fixes: commit 956b620fcf0b ("libbpf: Introduce libbpf_attach_type_by_name")
Reported-by: Arnaldo Carvalho de Melo 
Signed-off-by: Andrey Ignatov 
---
 tools/lib/bpf/libbpf.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index b607be7236d3..d6e62e90e8d4 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -2084,19 +2084,19 @@ void bpf_program__set_expected_attach_type(struct 
bpf_program *prog,
prog->expected_attach_type = type;
 }
 
-#define BPF_PROG_SEC_IMPL(string, ptype, eatype, atype) \
-   { string, sizeof(string) - 1, ptype, eatype, atype }
+#define BPF_PROG_SEC_IMPL(string, ptype, eatype, is_attachable, atype) \
+   { string, sizeof(string) - 1, ptype, eatype, is_attachable, atype }
 
 /* Programs that can NOT be attached. */
-#define BPF_PROG_SEC(string, ptype) BPF_PROG_SEC_IMPL(string, ptype, 0, 
-EINVAL)
+#define BPF_PROG_SEC(string, ptype) BPF_PROG_SEC_IMPL(string, ptype, 0, 0, 0)
 
 /* Programs that can be attached. */
 #define BPF_APROG_SEC(string, ptype, atype) \
-   BPF_PROG_SEC_IMPL(string, ptype, 0, atype)
+   BPF_PROG_SEC_IMPL(string, ptype, 0, 1, atype)
 
 /* Programs that must specify expected attach type at load time. */
 #define BPF_EAPROG_SEC(string, ptype, eatype) \
-   BPF_PROG_SEC_IMPL(string, ptype, eatype, eatype)
+   BPF_PROG_SEC_IMPL(string, ptype, eatype, 1, eatype)
 
 /* Programs that can be attached but attach type can't be identified by section
  * name. Kept for backward compatibility.
@@ -2108,6 +2108,7 @@ static const struct {
size_t len;
enum bpf_prog_type prog_type;
enum bpf_attach_type expected_attach_type;
+   int is_attachable;
enum bpf_attach_type attach_type;
 } section_names[] = {
BPF_PROG_SEC("socket",  BPF_PROG_TYPE_SOCKET_FILTER),
@@ -2198,7 +2199,7 @@ int libbpf_attach_type_by_name(const char *name,
for (i = 0; i < ARRAY_SIZE(section_names); i++) {
if (strncmp(name, section_names[i].sec, section_names[i].len))
continue;
-   if (section_names[i].attach_type == -EINVAL)
+   if (!section_names[i].is_attachable)
return -EINVAL;
*attach_type = section_names[i].attach_type;
return 0;
-- 
2.17.1
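
For readers wondering why the comparison was rejected at all: the enum's value
range contains only non-negative enumerators, so a constant like -22 can never
compare equal to it, and the reported clang 3.8 turns that into a hard error
via -Wtautological-constant-out-of-range-compare under -Werror. A stand-alone
reproduction using a stand-in enum (not libbpf's real types) looks like this:

#include <errno.h>

enum demo_attach_type {			/* stand-in for enum bpf_attach_type */
	DEMO_CGROUP_INET_INGRESS,
	DEMO_CGROUP_INET_EGRESS,
};

struct demo_sec_def {
	const char *sec;
	enum demo_attach_type attach_type;
};

static const struct demo_sec_def demo_section_names[] = {
	{ "cgroup/skb", DEMO_CGROUP_INET_INGRESS },
};

int demo_is_attachable(int i)
{
	/* clang: comparison of constant -22 with expression of type
	 * 'const enum demo_attach_type' is always false
	 */
	return demo_section_names[i].attach_type == -EINVAL;
}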



Re: [PATCH net] net: dsa: microchip: initialize mutex before use

2018-10-31 Thread David Miller
From: 
Date: Tue, 30 Oct 2018 16:45:49 -0700

> @@ -1206,6 +1201,12 @@ int ksz_switch_register(struct ksz_device *dev)
>   if (dev->pdata)
>   dev->chip_id = dev->pdata->chip_id;
>  
> + /* mutex is used in next function call. */
> + mutex_init(&dev->reg_mutex);
> + mutex_init(&dev->stats_mutex);
> + mutex_init(&dev->alu_mutex);
> + mutex_init(&dev->vlan_mutex);
> +

Please remove this comment, as per Andrew Lunn's feedback.


[net 5/8] fm10k: bump driver version to match out-of-tree release

2018-10-31 Thread Jeff Kirsher
From: Jacob Keller 

The upstream and out-of-tree drivers are once again at comparable
functionality. It's been a while since we updated the upstream driver
version, so bump it now.

Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/fm10k/fm10k_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c 
b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
index 503bbc017792..5b2a50e5798f 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
@@ -11,7 +11,7 @@
 
 #include "fm10k.h"
 
-#define DRV_VERSION"0.23.4-k"
+#define DRV_VERSION"0.26.1-k"
 #define DRV_SUMMARY"Intel(R) Ethernet Switch Host Interface Driver"
 const char fm10k_driver_version[] = DRV_VERSION;
 char fm10k_driver_name[] = "fm10k";
-- 
2.17.2



[net 6/8] ixgbe/ixgbevf: fix XFRM_ALGO dependency

2018-10-31 Thread Jeff Kirsher
Based on the original work from Arnd Bergmann.

When XFRM_ALGO is not enabled, the new ixgbe IPsec code produces a
link error:

drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.o: In function 
`ixgbe_ipsec_vf_add_sa':
ixgbe_ipsec.c:(.text+0x1266): undefined reference to `xfrm_aead_get_byname'

Simply selecting XFRM_ALGO from here causes circular dependencies, so
to fix it, we probably want this slightly more complex solution that is
similar to what other drivers with XFRM offload do:

A separate Kconfig symbol now controls whether we include the IPsec
offload code. To keep the old behavior, this is left as 'default y'. The
dependency in XFRM_OFFLOAD still causes a circular dependency but is
not actually needed because this symbol is not user visible, so removing
that dependency on top makes it all work.

CC: Arnd Bergmann 
CC: Shannon Nelson 
Fixes: eda0333ac293 ("ixgbe: add VF IPsec management")
Signed-off-by: Jeff Kirsher 
Tested-by: Andrew Bowers 
---
 drivers/net/ethernet/intel/Kconfig | 18 ++
 drivers/net/ethernet/intel/ixgbe/Makefile  |  2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe.h   |  8 
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  |  6 +++---
 drivers/net/ethernet/intel/ixgbevf/Makefile|  2 +-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h   |  4 ++--
 .../net/ethernet/intel/ixgbevf/ixgbevf_main.c  |  2 +-
 net/xfrm/Kconfig   |  1 -
 8 files changed, 30 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/intel/Kconfig 
b/drivers/net/ethernet/intel/Kconfig
index fd3373d82a9e..59e1bc0f609e 100644
--- a/drivers/net/ethernet/intel/Kconfig
+++ b/drivers/net/ethernet/intel/Kconfig
@@ -200,6 +200,15 @@ config IXGBE_DCB
 
  If unsure, say N.
 
+config IXGBE_IPSEC
+   bool "IPSec XFRM cryptography-offload acceleration"
+   depends on IXGBE
+   depends on XFRM_OFFLOAD
+   default y
+   select XFRM_ALGO
+   ---help---
+ Enable support for IPSec offload in ixgbe.ko
+
 config IXGBEVF
tristate "Intel(R) 10GbE PCI Express Virtual Function Ethernet support"
depends on PCI_MSI
@@ -217,6 +226,15 @@ config IXGBEVF
  will be called ixgbevf.  MSI-X interrupt support is required
  for this driver to work correctly.
 
+config IXGBEVF_IPSEC
+   bool "IPSec XFRM cryptography-offload acceleration"
+   depends on IXGBEVF
+   depends on XFRM_OFFLOAD
+   default y
+   select XFRM_ALGO
+   ---help---
+ Enable support for IPSec offload in ixgbevf.ko
+
 config I40E
tristate "Intel(R) Ethernet Controller XL710 Family support"
imply PTP_1588_CLOCK
diff --git a/drivers/net/ethernet/intel/ixgbe/Makefile 
b/drivers/net/ethernet/intel/ixgbe/Makefile
index ca6b0c458e4a..4fb0d9e3f2da 100644
--- a/drivers/net/ethernet/intel/ixgbe/Makefile
+++ b/drivers/net/ethernet/intel/ixgbe/Makefile
@@ -17,4 +17,4 @@ ixgbe-$(CONFIG_IXGBE_DCB) +=  ixgbe_dcb.o ixgbe_dcb_82598.o \
 ixgbe-$(CONFIG_IXGBE_HWMON) += ixgbe_sysfs.o
 ixgbe-$(CONFIG_DEBUG_FS) += ixgbe_debugfs.o
 ixgbe-$(CONFIG_FCOE:m=y) += ixgbe_fcoe.o
-ixgbe-$(CONFIG_XFRM_OFFLOAD) += ixgbe_ipsec.o
+ixgbe-$(CONFIG_IXGBE_IPSEC) += ixgbe_ipsec.o
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h 
b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index ec1b87cc4410..143bdd5ee2a0 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -769,9 +769,9 @@ struct ixgbe_adapter {
 #define IXGBE_RSS_KEY_SIZE 40  /* size of RSS Hash Key in bytes */
u32 *rss_key;
 
-#ifdef CONFIG_XFRM_OFFLOAD
+#ifdef CONFIG_IXGBE_IPSEC
struct ixgbe_ipsec *ipsec;
-#endif /* CONFIG_XFRM_OFFLOAD */
+#endif /* CONFIG_IXGBE_IPSEC */
 
/* AF_XDP zero-copy */
struct xdp_umem **xsk_umems;
@@ -1008,7 +1008,7 @@ void ixgbe_store_key(struct ixgbe_adapter *adapter);
 void ixgbe_store_reta(struct ixgbe_adapter *adapter);
 s32 ixgbe_negotiate_fc(struct ixgbe_hw *hw, u32 adv_reg, u32 lp_reg,
   u32 adv_sym, u32 adv_asm, u32 lp_sym, u32 lp_asm);
-#ifdef CONFIG_XFRM_OFFLOAD
+#ifdef CONFIG_IXGBE_IPSEC
 void ixgbe_init_ipsec_offload(struct ixgbe_adapter *adapter);
 void ixgbe_stop_ipsec_offload(struct ixgbe_adapter *adapter);
 void ixgbe_ipsec_restore(struct ixgbe_adapter *adapter);
@@ -1036,5 +1036,5 @@ static inline int ixgbe_ipsec_vf_add_sa(struct 
ixgbe_adapter *adapter,
u32 *mbuf, u32 vf) { return -EACCES; }
 static inline int ixgbe_ipsec_vf_del_sa(struct ixgbe_adapter *adapter,
u32 *mbuf, u32 vf) { return -EACCES; }
-#endif /* CONFIG_XFRM_OFFLOAD */
+#endif /* CONFIG_IXGBE_IPSEC */
 #endif /* _IXGBE_H_ */
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 0049a2becd7e..113b38e0defb 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@

[net 4/8] fm10k: add missing device IDs to the upstream driver

2018-10-31 Thread Jeff Kirsher
From: Jacob Keller 

The device IDs for the Ethernet SDI Adapter devices were never added to
the upstream driver. The IDs are already in the pci.ids database, and
are supported by the out-of-tree driver.

Add the device IDs now, so that the upstream driver can recognize and
load these devices.

Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/fm10k/fm10k_pci.c  | 2 ++
 drivers/net/ethernet/intel/fm10k/fm10k_type.h | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c 
b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
index 02345d381303..e49fb51d3613 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
@@ -23,6 +23,8 @@ static const struct fm10k_info *fm10k_info_tbl[] = {
  */
 static const struct pci_device_id fm10k_pci_tbl[] = {
{ PCI_VDEVICE(INTEL, FM10K_DEV_ID_PF), fm10k_device_pf },
+   { PCI_VDEVICE(INTEL, FM10K_DEV_ID_SDI_FM10420_QDA2), fm10k_device_pf },
+   { PCI_VDEVICE(INTEL, FM10K_DEV_ID_SDI_FM10420_DA2), fm10k_device_pf },
{ PCI_VDEVICE(INTEL, FM10K_DEV_ID_VF), fm10k_device_vf },
/* required last entry */
{ 0, }
diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_type.h 
b/drivers/net/ethernet/intel/fm10k/fm10k_type.h
index 3e608e493f9d..9fb9fca375e3 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_type.h
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_type.h
@@ -15,6 +15,8 @@ struct fm10k_hw;
 
 #define FM10K_DEV_ID_PF0x15A4
 #define FM10K_DEV_ID_VF0x15A5
+#define FM10K_DEV_ID_SDI_FM10420_QDA2  0x15D0
+#define FM10K_DEV_ID_SDI_FM10420_DA2   0x15D5
 
 #define FM10K_MAX_QUEUES   256
 #define FM10K_MAX_QUEUES_PF128
-- 
2.17.2



[net 3/8] fm10k: ensure completer aborts are marked as non-fatal after a resume

2018-10-31 Thread Jeff Kirsher
From: Jacob Keller 

VF drivers can trigger PCIe completer aborts any time they read a queue
that they don't own. Even in nominal circumstances, it is not possible
to prevent the VF driver from reading queues it doesn't own. VF drivers
may attempt to read queues they previously owned, but which they no longer
do due to a PF reset.

Normally these completer aborts aren't an issue. However, on some
platforms these trigger machine check errors. This is true even if we
lower their severity from fatal to non-fatal. Indeed, we already have
code for lowering the severity.

We could attempt to mask these errors conditionally around resets, which
is the most common time they would occur. However this would essentially
be a race between the PF and VF drivers, and we may still occasionally
see machine check exceptions on these strictly configured platforms.

Instead, mask the errors entirely any time we resume VFs. By doing so,
we prevent the completer aborts from being sent to the parent PCIe
device, and thus these strict platforms will not upgrade them into
machine check errors.

Additionally, we don't lose any information by masking these errors,
because we'll still report VFs which attempt to access queues via the
FUM_BAD_VF_QACCESS errors.

Without this change, on platforms where completer aborts cause machine
check exceptions, the VF reading queues it doesn't own could crash the
host system. Masking the completer abort prevents this, so we should
mask it for good, and not just around a PCIe reset. Otherwise malicious
or misconfigured VFs could cause the host system to crash.

Because we are masking the error entirely, there is little reason to
also keep setting the severity bit, so that code is also removed.

Signed-off-by: Jacob Keller 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/fm10k/fm10k_iov.c | 48 
 1 file changed, 28 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_iov.c 
b/drivers/net/ethernet/intel/fm10k/fm10k_iov.c
index 74160c2095ee..5d4f1761dc0c 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_iov.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_iov.c
@@ -303,6 +303,28 @@ void fm10k_iov_suspend(struct pci_dev *pdev)
}
 }
 
+static void fm10k_mask_aer_comp_abort(struct pci_dev *pdev)
+{
+   u32 err_mask;
+   int pos;
+
+   pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ERR);
+   if (!pos)
+   return;
+
+   /* Mask the completion abort bit in the ERR_UNCOR_MASK register,
+* preventing the device from reporting these errors to the upstream
+* PCIe root device. This avoids bringing down platforms which upgrade
+* non-fatal completer aborts into machine check exceptions. Completer
+* aborts can occur whenever a VF reads a queue it doesn't own.
+*/
+   pci_read_config_dword(pdev, pos + PCI_ERR_UNCOR_MASK, &err_mask);
+   err_mask |= PCI_ERR_UNC_COMP_ABORT;
+   pci_write_config_dword(pdev, pos + PCI_ERR_UNCOR_MASK, err_mask);
+
+   mmiowb();
+}
+
 int fm10k_iov_resume(struct pci_dev *pdev)
 {
struct fm10k_intfc *interface = pci_get_drvdata(pdev);
@@ -318,6 +340,12 @@ int fm10k_iov_resume(struct pci_dev *pdev)
if (!iov_data)
return -ENOMEM;
 
+   /* Lower severity of completer abort error reporting as
+* the VFs can trigger this any time they read a queue
+* that they don't own.
+*/
+   fm10k_mask_aer_comp_abort(pdev);
+
/* allocate hardware resources for the VFs */
hw->iov.ops.assign_resources(hw, num_vfs, num_vfs);
 
@@ -461,20 +489,6 @@ void fm10k_iov_disable(struct pci_dev *pdev)
fm10k_iov_free_data(pdev);
 }
 
-static void fm10k_disable_aer_comp_abort(struct pci_dev *pdev)
-{
-   u32 err_sev;
-   int pos;
-
-   pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ERR);
-   if (!pos)
-   return;
-
-   pci_read_config_dword(pdev, pos + PCI_ERR_UNCOR_SEVER, &err_sev);
-   err_sev &= ~PCI_ERR_UNC_COMP_ABORT;
-   pci_write_config_dword(pdev, pos + PCI_ERR_UNCOR_SEVER, err_sev);
-}
-
 int fm10k_iov_configure(struct pci_dev *pdev, int num_vfs)
 {
int current_vfs = pci_num_vf(pdev);
@@ -496,12 +510,6 @@ int fm10k_iov_configure(struct pci_dev *pdev, int num_vfs)
 
/* allocate VFs if not already allocated */
if (num_vfs && num_vfs != current_vfs) {
-   /* Disable completer abort error reporting as
-* the VFs can trigger this any time they read a queue
-* that they don't own.
-*/
-   fm10k_disable_aer_comp_abort(pdev);
-
err = pci_enable_sriov(pdev, num_vfs);
if (err) {
dev_err(&pdev->dev,
-- 
2.17.2



[net 0/8][pull request] Intel Wired LAN Driver Updates 2018-10-31

2018-10-31 Thread Jeff Kirsher
This series contains a various collection of fixes.

Miroslav Lichvar, from Red Hat (or should I say IBM now?), updates the PHC
timecounter interval for igb so that it gets updated at least once
every 550 seconds.

Ngai-Mint provides a fix for fm10k to prevent a soft lockup or system
crash by adding a new condition to determine if the SM mailbox is in the
correct state before proceeding.

Jake provides several fm10k fixes. The first one marks completer aborts as
non-fatal, since on some platforms they trigger machine check errors when a
completer abort occurs.  He also added missing device IDs to the in-kernel
driver and, due to the recent fixes, bumped the driver version.

I (Jeff Kirsher) fixed an XFRM_ALGO dependency for both ixgbe and
ixgbevf.  This fix was based on the original work from Arnd Bergmann,
which only fixed ixgbe.

Mitch provides a fix for i40e/avf to update the status codes, which
resolves a mismatch between i40e and the iavf driver,
which also supports the ice LAN driver.

Radoslaw fixes ixgbe, where the driver was logging a message about
spoofed packets being detected when the VF is restarted with a different MAC
address.

The following are changes since commit a6b3a3fa042343e29ffaf9169f5ba3c819d4f9a2:
  net: mvpp2: Fix affinity hint allocation
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue 10GbE

Jacob Keller (3):
  fm10k: ensure completer aborts are marked as non-fatal after a resume
  fm10k: add missing device IDs to the upstream driver
  fm10k: bump driver version to match out-of-tree release

Jeff Kirsher (1):
  ixgbe/ixgbevf: fix XFRM_ALGO dependency

Miroslav Lichvar (1):
  igb: shorten maximum PHC timecounter update interval

Mitch Williams (1):
  i40e: Update status codes

Ngai-Mint Kwan (1):
  fm10k: fix SM mailbox full condition

Radoslaw Tyl (1):
  ixgbe: fix MAC anti-spoofing filter after VFLR

 drivers/net/ethernet/intel/Kconfig| 18 +++
 drivers/net/ethernet/intel/fm10k/fm10k_iov.c  | 51 +++
 drivers/net/ethernet/intel/fm10k/fm10k_main.c |  2 +-
 drivers/net/ethernet/intel/fm10k/fm10k_pci.c  |  2 +
 drivers/net/ethernet/intel/fm10k/fm10k_type.h |  2 +
 .../ethernet/intel/i40e/i40e_virtchnl_pf.c|  2 +-
 drivers/net/ethernet/intel/igb/igb_ptp.c  |  8 ++-
 drivers/net/ethernet/intel/ixgbe/Makefile |  2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe.h  |  8 +--
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  6 +--
 .../net/ethernet/intel/ixgbe/ixgbe_sriov.c|  4 +-
 drivers/net/ethernet/intel/ixgbevf/Makefile   |  2 +-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  |  4 +-
 .../net/ethernet/intel/ixgbevf/ixgbevf_main.c |  2 +-
 include/linux/avf/virtchnl.h  | 12 +++--
 net/xfrm/Kconfig  |  1 -
 16 files changed, 85 insertions(+), 41 deletions(-)

-- 
2.17.2



[net 8/8] ixgbe: fix MAC anti-spoofing filter after VFLR

2018-10-31 Thread Jeff Kirsher
From: Radoslaw Tyl 

This change resolves a driver bug where the driver is logging a
message that says "Spoofed packets detected". This can occur on the PF
(host) when a VF has VLAN+MACVLAN enabled and is re-started with a
different MAC address.

MAC and VLAN anti-spoofing filters are to be enabled together.

Signed-off-by: Radoslaw Tyl 
Tested-by: Andrew Bowers 
Acked-by: Piotr Skajewski 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
index af25a8fffeb8..5dacfc870259 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
@@ -722,8 +722,10 @@ static inline void ixgbe_vf_reset_event(struct 
ixgbe_adapter *adapter, u32 vf)
ixgbe_set_vmvir(adapter, vfinfo->pf_vlan,
adapter->default_up, vf);
 
-   if (vfinfo->spoofchk_enabled)
+   if (vfinfo->spoofchk_enabled) {
hw->mac.ops.set_vlan_anti_spoofing(hw, true, vf);
+   hw->mac.ops.set_mac_anti_spoofing(hw, true, vf);
+   }
}
 
/* reset multicast table array for vf */
-- 
2.17.2



[net 1/8] igb: shorten maximum PHC timecounter update interval

2018-10-31 Thread Jeff Kirsher
From: Miroslav Lichvar 

The timecounter needs to be updated at least once per ~550 seconds in
order to avoid a 40-bit SYSTIM timestamp being misinterpreted as an old
timestamp.

Since commit 500462a9d ("timers: Switch to a non-cascading wheel"),
scheduling of delayed work seems to be less accurate and a requested
delay of 540 seconds may actually be longer than 550 seconds. Shorten
the delay to 480 seconds to be sure the timecounter is updated in time.

This fixes an issue with HW timestamps on 82580/I350/I354 being off by
~1100 seconds for few seconds every ~9 minutes.

Cc: Jacob Keller 
Cc: Richard Cochran 
Cc: Thomas Gleixner 
Signed-off-by: Miroslav Lichvar 
Tested-by: Aaron Brown 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/igb/igb_ptp.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_ptp.c 
b/drivers/net/ethernet/intel/igb/igb_ptp.c
index 9f4d700e09df..29ced6b74d36 100644
--- a/drivers/net/ethernet/intel/igb/igb_ptp.c
+++ b/drivers/net/ethernet/intel/igb/igb_ptp.c
@@ -51,9 +51,15 @@
  *
  * The 40 bit 82580 SYSTIM overflows every
  *   2^40 * 10^-9 /  60  = 18.3 minutes.
+ *
+ * SYSTIM is converted to real time using a timecounter. As
+ * timecounter_cyc2time() allows old timestamps, the timecounter
+ * needs to be updated at least once per half of the SYSTIM interval.
+ * Scheduling of delayed work is not very accurate, so we aim for 8
+ * minutes to be sure the actual interval is shorter than 9.16 minutes.
  */
 
-#define IGB_SYSTIM_OVERFLOW_PERIOD (HZ * 60 * 9)
+#define IGB_SYSTIM_OVERFLOW_PERIOD (HZ * 60 * 8)
 #define IGB_PTP_TX_TIMEOUT (HZ * 15)
 #define INCPERIOD_82576BIT(E1000_TIMINCA_16NS_SHIFT)
 #define INCVALUE_82576_MASKGENMASK(E1000_TIMINCA_16NS_SHIFT - 1, 0)
-- 
2.17.2
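
For reference, the arithmetic behind the new value (rounding the figures used
in the patch comment above):

  2^40 ns              = 1099.5 s  (~18.3 min, the SYSTIM wrap period)
  half of that         =  549.8 s  (the update interval must stay below this)
  old delay, HZ*60*9   =  540   s  -> only ~10 s of scheduling slack
  new delay, HZ*60*8   =  480   s  -> ~70 s of slack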



Re: [Patch V5 net 00/11] Bugfix for the HNS3 driver

2018-10-31 Thread David Miller
From: Huazhong Tan 
Date: Tue, 30 Oct 2018 21:50:42 +0800

> This patch series include bugfix for the HNS3 ethernet
> controller driver.
> 
> Change log:
> V4->V5:
>   Fixes comments from Joe Perches & Sergei Shtylyov
> V3->V4:
>   Fixes comments from Sergei Shtylyov
> V2->V3:
>   Fixes comments from Sergei Shtylyov
> V1->V2:
>   Fixes the compilation break reported by kbuild test robot
>   http://patchwork.ozlabs.org/patch/989818/

Series applied.


[net 2/8] fm10k: fix SM mailbox full condition

2018-10-31 Thread Jeff Kirsher
From: Ngai-Mint Kwan 

Current condition will always incorrectly report a full SM mailbox if an
IES API application is not running. Due to this, the
"fm10k_service_task" will be infinitely queued into the driver's
workqueue. This, in turn, will cause a "kworker" thread to report 100%
CPU utilization and might cause "soft lockup" events or system crashes.

To fix this issue, a new condition is added to determine if the SM
mailbox is in the correct state of FM10K_STATE_OPEN before proceeding.
In other words, an instance of the IES API must be running. If there is,
the remainder of the flow stays the same which is to determine if the SM
mailbox capacity has been exceeded or not and take appropriate action.

Signed-off-by: Ngai-Mint Kwan 
Signed-off-by: Jacob Keller 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/fm10k/fm10k_iov.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_iov.c 
b/drivers/net/ethernet/intel/fm10k/fm10k_iov.c
index e707d717012f..74160c2095ee 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_iov.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_iov.c
@@ -244,7 +244,8 @@ s32 fm10k_iov_mbx(struct fm10k_intfc *interface)
}
 
/* guarantee we have free space in the SM mailbox */
-   if (!hw->mbx.ops.tx_ready(&hw->mbx, FM10K_VFMBX_MSG_MTU)) {
+   if (hw->mbx.state == FM10K_STATE_OPEN &&
+   !hw->mbx.ops.tx_ready(&hw->mbx, FM10K_VFMBX_MSG_MTU)) {
/* keep track of how many times this occurs */
interface->hw_sm_mbx_full++;
 
-- 
2.17.2



[net 7/8] i40e: Update status codes

2018-10-31 Thread Jeff Kirsher
From: Mitch Williams 

Add a few new status codes which will be used by the ice driver, and
rename a few to make them more consistent. Error codes are mapped to
similar values as in i40e_status.h, so as to be compatible with older
VF drivers not using this status enum.

Signed-off-by: Mitch Williams 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c |  2 +-
 include/linux/avf/virtchnl.h   | 12 +---
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c 
b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
index 81b0e1f8d14b..ac5698ed0b11 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
@@ -3674,7 +3674,7 @@ int i40e_vc_process_vf_msg(struct i40e_pf *pf, s16 vf_id, 
u32 v_opcode,
dev_err(&pf->pdev->dev, "Invalid message from VF %d, opcode %d, 
len %d\n",
local_vf_id, v_opcode, msglen);
switch (ret) {
-   case VIRTCHNL_ERR_PARAM:
+   case VIRTCHNL_STATUS_ERR_PARAM:
return -EPERM;
default:
return -EINVAL;
diff --git a/include/linux/avf/virtchnl.h b/include/linux/avf/virtchnl.h
index 2c9756bd9c4c..b2488055fd1d 100644
--- a/include/linux/avf/virtchnl.h
+++ b/include/linux/avf/virtchnl.h
@@ -62,13 +62,19 @@
 /* Error Codes */
 enum virtchnl_status_code {
VIRTCHNL_STATUS_SUCCESS = 0,
-   VIRTCHNL_ERR_PARAM  = -5,
+   VIRTCHNL_STATUS_ERR_PARAM   = -5,
+   VIRTCHNL_STATUS_ERR_NO_MEMORY   = -18,
VIRTCHNL_STATUS_ERR_OPCODE_MISMATCH = -38,
VIRTCHNL_STATUS_ERR_CQP_COMPL_ERROR = -39,
VIRTCHNL_STATUS_ERR_INVALID_VF_ID   = -40,
-   VIRTCHNL_STATUS_NOT_SUPPORTED   = -64,
+   VIRTCHNL_STATUS_ERR_ADMIN_QUEUE_ERROR   = -53,
+   VIRTCHNL_STATUS_ERR_NOT_SUPPORTED   = -64,
 };
 
+/* Backward compatibility */
+#define VIRTCHNL_ERR_PARAM VIRTCHNL_STATUS_ERR_PARAM
+#define VIRTCHNL_STATUS_NOT_SUPPORTED VIRTCHNL_STATUS_ERR_NOT_SUPPORTED
+
 #define VIRTCHNL_LINK_SPEED_100MB_SHIFT0x1
 #define VIRTCHNL_LINK_SPEED_1000MB_SHIFT   0x2
 #define VIRTCHNL_LINK_SPEED_10GB_SHIFT 0x3
@@ -831,7 +837,7 @@ virtchnl_vc_validate_vf_msg(struct virtchnl_version_info 
*ver, u32 v_opcode,
case VIRTCHNL_OP_EVENT:
case VIRTCHNL_OP_UNKNOWN:
default:
-   return VIRTCHNL_ERR_PARAM;
+   return VIRTCHNL_STATUS_ERR_PARAM;
}
/* few more checks */
if (err_msg_format || valid_len != msglen)
-- 
2.17.2



Re: [PATCH net] net/mlx5e: fix csum adjustments caused by RXFCS

2018-10-31 Thread David Miller
From: Eric Dumazet 
Date: Tue, 30 Oct 2018 00:57:25 -0700

> As shown by Dmitris, we need to use csum_block_add() instead of csum_add()
> when adding the FCS contribution to skb csum.
> 
> Before 4.18 (more exactly commit 88078d98d1bb "net: pskb_trim_rcsum()
> and CHECKSUM_COMPLETE are friends"), the whole skb csum was thrown away,
> so RXFCS changes were ignored.
> 
> Then before commit d55bef5059dd ("net: fix pskb_trim_rcsum_slow() with
> odd trim offset") both mlx5 and pskb_trim_rcsum_slow() bugs were canceling
> each other.
> 
> Now we fixed pskb_trim_rcsum_slow() we need to fix mlx5.
> 
> Note that this patch also rewrites mlx5e_get_fcs() to :
> 
> - Use skb_header_pointer() instead of reinventing it.
> - Use __get_unaligned_cpu32() to avoid possible non aligned accesses
>   as Dmitris pointed out.
> 
> Fixes: 902a545904c7 ("net/mlx5e: When RXFCS is set, add FCS data into 
> checksum calculation")
> Reported-by: Paweł Staszewski 
> Signed-off-by: Eric Dumazet 

Applied and queued up for -stable.
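
For anyone puzzling over why csum_block_add() is needed here rather than a
plain csum_add(): the one's-complement sum of a block depends on whether the
block starts at an even or odd byte offset in the parent buffer. The small
userspace model below (plain 16-bit folding, not the kernel's __wsum types or
helpers) appends a 4-byte "FCS" at odd offset 7; only the byte-swapped
combination reproduces the checksum of the whole buffer.

#include <stdint.h>
#include <stdio.h>

/* Fold a 32-bit one's-complement accumulator down to 16 bits. */
static uint16_t fold(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* RFC 1071 style sum: big-endian 16-bit words, odd tail padded with 0. */
static uint16_t csum16(const uint8_t *p, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)((p[i] << 8) | p[i + 1]);
	if (len & 1)
		sum += (uint32_t)(p[len - 1] << 8);
	return fold(sum);
}

/* What csum_block_add() models: swap the block's bytes if the block
 * starts at an odd offset in the parent buffer, then add.
 */
static uint16_t block_add(uint16_t sum, uint16_t sum2, size_t offset)
{
	if (offset & 1)
		sum2 = (uint16_t)((sum2 << 8) | (sum2 >> 8));
	return fold((uint32_t)sum + sum2);
}

int main(void)
{
	/* 7 payload bytes followed by a 4-byte "FCS" at odd offset 7. */
	uint8_t buf[] = { 1, 2, 3, 4, 5, 6, 7, 0x12, 0x34, 0x56, 0x78 };
	uint16_t whole = csum16(buf, sizeof(buf));
	uint16_t head  = csum16(buf, 7);
	uint16_t fcs   = csum16(buf + 7, 4);

	printf("whole sum    : %04x\n", whole);                              /* bc74 */
	printf("csum_add     : %04x (wrong)\n", fold((uint32_t)head + fcs)); /* 78b8 */
	printf("block_add(7) : %04x\n", block_add(head, fcs, 7));            /* bc74 */
	return 0;
}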


Re: [PATCH 2/2] net: drop a space before tabs

2018-10-31 Thread David Miller
From: Bo YU 
Date: Mon, 29 Oct 2018 23:42:10 -0400

> Fix a warning from checkpatch.pl:'please no space before tabs'
> in include/net/af_unix.h
> 
> Signed-off-by: Bo YU 

Applied.


Re: [PATCH 1/2] net: add an identifier name for 'struct sock *'

2018-10-31 Thread David Miller
From: Bo YU 
Date: Mon, 29 Oct 2018 23:42:09 -0400

> Fix a warning from checkpatch:
> function definition argument 'struct sock *' should also have an
> identifier name in include/net/af_unix.h.
> 
> Signed-off-by: Bo YU 

Applied.


Re: [Patch net] net: make pskb_trim_rcsum_slow() robust

2018-10-31 Thread David Miller
From: Cong Wang 
Date: Mon, 29 Oct 2018 17:35:15 -0700

> Most callers of pskb_trim_rcsum() simply drops the skb when
> it fails, however, ip_check_defrag() still continues to pass
> the skb up to stack. In that case, we should restore its previous
> csum if __pskb_trim() fails.
> 
> Found this during code review.
> 
> Fixes: 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are 
> friends")
> Cc: Eric Dumazet 
> Signed-off-by: Cong Wang 

I kind of agree with Eric that we should make all callers, including
ip_check_defrag(), fail just as with any memory allocation failure.


[PATCH iproute2 v2] Use libbsd for strlcpy if available

2018-10-31 Thread Luca Boccassi
If libc does not provide strlcpy, check for libbsd with pkg-config to
avoid relying on the inline version.

Signed-off-by: Luca Boccassi 
---
Changed from -include /usr/include/bsd/string.h hack to HAVE_LIBBSD
and proper includes in each file that uses strlcpy.
The hack causes a compiler warning as ip/ipnetns.c defines
_ATFILE_SOURCE without a value, but system headers use 1, so there's
a mismatch.

 configure | 11 +--
 genl/ctrl.c   |  3 +++
 ip/iplink.c   |  3 +++
 ip/ipnetns.c  |  3 +++
 ip/iproute_lwtunnel.c |  3 +++
 ip/ipvrf.c|  3 +++
 ip/ipxfrm.c   |  3 +++
 ip/tunnel.c   |  3 +++
 ip/xfrm_state.c   |  3 +++
 lib/bpf.c |  3 +++
 lib/fs.c  |  3 +++
 lib/inet_proto.c  |  3 +++
 misc/ss.c |  3 +++
 tc/em_ipset.c |  3 +++
 tc/m_pedit.c  |  3 +++
 15 files changed, 51 insertions(+), 2 deletions(-)

diff --git a/configure b/configure
index 744d6282..c5655978 100755
--- a/configure
+++ b/configure
@@ -330,8 +330,15 @@ EOF
 then
echo "no"
 else
-   echo 'CFLAGS += -DNEED_STRLCPY' >>$CONFIG
-   echo "yes"
+   if ${PKG_CONFIG} libbsd --exists
+   then
+   echo 'CFLAGS += -DHAVE_LIBBSD' `${PKG_CONFIG} libbsd --cflags` 
>>$CONFIG
+   echo 'LDLIBS +=' `${PKG_CONFIG} libbsd --libs` >> $CONFIG
+   echo "no"
+   else
+   echo 'CFLAGS += -DNEED_STRLCPY' >>$CONFIG
+   echo "yes"
+   fi
 fi
 rm -f $TMPDIR/strtest.c $TMPDIR/strtest
 }
diff --git a/genl/ctrl.c b/genl/ctrl.c
index 616a..fef6aaa9 100644
--- a/genl/ctrl.c
+++ b/genl/ctrl.c
@@ -18,6 +18,9 @@
 #include 
 #include 
 #include 
+#ifdef HAVE_LIBBSD
+#include <bsd/string.h>
+#endif
 
 #include "utils.h"
 #include "genl_utils.h"
diff --git a/ip/iplink.c b/ip/iplink.c
index b5519201..067f5409 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -24,6 +24,9 @@
 #include 
 #include 
 #include 
+#ifdef HAVE_LIBBSD
+#include <bsd/string.h>
+#endif
 #include 
 #include 
 #include 
diff --git a/ip/ipnetns.c b/ip/ipnetns.c
index 0eac18cf..da019d76 100644
--- a/ip/ipnetns.c
+++ b/ip/ipnetns.c
@@ -8,6 +8,9 @@
 #include 
 #include 
 #include 
+#ifdef HAVE_LIBBSD
+#include <bsd/string.h>
+#endif
 #include 
 #include 
 #include 
diff --git a/ip/iproute_lwtunnel.c b/ip/iproute_lwtunnel.c
index 8f497015..2285bc1d 100644
--- a/ip/iproute_lwtunnel.c
+++ b/ip/iproute_lwtunnel.c
@@ -16,6 +16,9 @@
 #include 
 #include 
 #include 
+#ifdef HAVE_LIBBSD
+#include <bsd/string.h>
+#endif
 #include 
 #include 
 #include 
diff --git a/ip/ipvrf.c b/ip/ipvrf.c
index 8a6b7f97..8572b4f2 100644
--- a/ip/ipvrf.c
+++ b/ip/ipvrf.c
@@ -21,6 +21,9 @@
 #include 
 #include 
 #include 
+#ifdef HAVE_LIBBSD
+#include <bsd/string.h>
+#endif
 #include 
 #include 
 #include 
diff --git a/ip/ipxfrm.c b/ip/ipxfrm.c
index 17ab4abe..b02f30a6 100644
--- a/ip/ipxfrm.c
+++ b/ip/ipxfrm.c
@@ -28,6 +28,9 @@
 #include 
 #include 
 #include 
+#ifdef HAVE_LIBBSD
+#include <bsd/string.h>
+#endif
 #include 
 #include 
 #include 
diff --git a/ip/tunnel.c b/ip/tunnel.c
index d0d55f37..73abb2e2 100644
--- a/ip/tunnel.c
+++ b/ip/tunnel.c
@@ -24,6 +24,9 @@
 
 #include 
 #include 
+#ifdef HAVE_LIBBSD
+#include <bsd/string.h>
+#endif
 #include 
 #include 
 #include 
diff --git a/ip/xfrm_state.c b/ip/xfrm_state.c
index e8c01746..18e0c6fa 100644
--- a/ip/xfrm_state.c
+++ b/ip/xfrm_state.c
@@ -27,6 +27,9 @@
 #include 
 #include 
 #include 
+#ifdef HAVE_LIBBSD
+#include <bsd/string.h>
+#endif
 #include 
 #include "utils.h"
 #include "xfrm.h"
diff --git a/lib/bpf.c b/lib/bpf.c
index 45f279fa..35d7c45a 100644
--- a/lib/bpf.c
+++ b/lib/bpf.c
@@ -15,6 +15,9 @@
 #include 
 #include 
 #include 
+#ifdef HAVE_LIBBSD
+#include <bsd/string.h>
+#endif
 #include 
 #include 
 #include 
diff --git a/lib/fs.c b/lib/fs.c
index 86efd4ed..af36bea0 100644
--- a/lib/fs.c
+++ b/lib/fs.c
@@ -20,6 +20,9 @@
 #include 
 #include 
 #include 
+#ifdef HAVE_LIBBSD
+#include <bsd/string.h>
+#endif
 #include 
 #include 
 
diff --git a/lib/inet_proto.c b/lib/inet_proto.c
index 0836a4c9..b379d8f8 100644
--- a/lib/inet_proto.c
+++ b/lib/inet_proto.c
@@ -18,6 +18,9 @@
 #include 
 #include 
 #include 
+#ifdef HAVE_LIBBSD
+#include <bsd/string.h>
+#endif
 
 #include "rt_names.h"
 #include "utils.h"
diff --git a/misc/ss.c b/misc/ss.c
index 4d12fb5d..c472fbd9 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -19,6 +19,9 @@
 #include 
 #include 
 #include 
+#ifdef HAVE_LIBBSD
+#include <bsd/string.h>
+#endif
 #include 
 #include 
 #include 
diff --git a/tc/em_ipset.c b/tc/em_ipset.c
index 48b287f5..550b2101 100644
--- a/tc/em_ipset.c
+++ b/tc/em_ipset.c
@@ -20,6 +20,9 @@
 #include 
 #include 
 #include 
+#ifdef HAVE_LIBBSD
+#include <bsd/string.h>
+#endif
 #include 
 #include 
 
diff --git a/tc/m_pedit.c b/tc/m_pedit.c
index 2aeb56d9..baacc80d 100644
--- a/tc/m_pedit.c
+++ b/tc/m_pedit.c
@@ -23,6 +23,9 @@
 #include 
 #include 
 #include 
+#ifdef HAVE_LIBBSD
+#include <bsd/string.h>
+#endif
 #include 
 #include "utils.h"
 #include "tc_util.h"
-- 
2.19.1



Re: [PATCH iproute2] Use libbsd for strlcpy if available

2018-10-31 Thread Luca Boccassi
On Wed, 2018-10-31 at 08:09 -0700, Stephen Hemminger wrote:
> On Mon, 29 Oct 2018 10:46:50 +
> Luca Boccassi  wrote:
> 
> > If libc does not provide strlcpy check for libbsd with pkg-config
> > to
> > avoid relying on inline version.
> > 
> > Signed-off-by: Luca Boccassi 
> > ---
> > This allows distro maintainers to be able to choose to reduce
> > duplication and let this code be maintained in one place, in the
> > external library.
> > 
> 
> I like the idea, but it causes warnings on Debian testing, and maybe
> other distros.
> 
> ipnetns.c:2: warning: "_ATFILE_SOURCE" redefined
>  #define _ATFILE_SOURCE
>  
> In file included from /usr/include/x86_64-linux-gnu/bits/libc-header-
> start.h:33,
>  from /usr/include/string.h:26,
>  from /usr/include/bsd/string.h:30,
>  from :
> /usr/include/features.h:326: note: this is the location of the
> previous definition
>  # define _ATFILE_SOURCE 1
> 
> 
> Please figure out how to handle this and resubmit.  SUSE open build
> service might
> also work to test multiple distro's

Ah, I missed that. That happens because features.h defines _ATFILE_SOURCE
to 1, but ip/ipnetns.c defines it without a value. According to the
spec, either way doesn't change the result.

This happens because of the quick hack of using -include
/usr/include/bsd/string.h, which avoided having to add the include
manually everywhere strlcpy is used, now and in the future. But it has
side effects like this.

So I'll send v2 with a less hacky fix, which means defining HAVE_LIBBSD
in configure and adding #ifdef HAVE_LIBBSD / #include <bsd/string.h>
guards in every file. It also means that this needs to be done for every
future use of strlcpy, or the build with libbsd will break.

If you or David prefer the hacky way, I can instead send a v3 that does
the quick hack, and also changes _ATFILE_SOURCE to 1 so that there is
no complaint from the compiler, as the values will be the same.

-- 
Kind regards,
Luca Boccassi

signature.asc
Description: This is a digitally signed message part


[PATCH net] openvswitch: Fix push/pop ethernet validation

2018-10-31 Thread Jaime Caamaño Ruiz
When there are both pop and push ethernet header actions among the
actions to be applied to a packet, an unexpected EINVAL (Invalid
argument) error is obtained. This is due to mac_proto not being reset
correctly when those actions are validated.

Reported-at:
https://mail.openvswitch.org/pipermail/ovs-discuss/2018-October/047554.html
Fixes: 91820da6ae85 ("openvswitch: add Ethernet push and pop actions")
Signed-off-by: Jaime Caamaño Ruiz 
---
 net/openvswitch/flow_netlink.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index a70097ecf33c..865ecef68196 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -3030,7 +3030,7 @@ static int __ovs_nla_copy_actions(struct net *net, const 
struct nlattr *attr,
 * is already present */
if (mac_proto != MAC_PROTO_NONE)
return -EINVAL;
-   mac_proto = MAC_PROTO_NONE;
+   mac_proto = MAC_PROTO_ETHERNET;
break;
 
case OVS_ACTION_ATTR_POP_ETH:
@@ -3038,7 +3038,7 @@ static int __ovs_nla_copy_actions(struct net *net, const 
struct nlattr *attr,
return -EINVAL;
if (vlan_tci & htons(VLAN_TAG_PRESENT))
return -EINVAL;
-   mac_proto = MAC_PROTO_ETHERNET;
+   mac_proto = MAC_PROTO_NONE;
break;
 
case OVS_ACTION_ATTR_PUSH_NSH:
-- 
2.16.4



Re: [PATCH net-next 0/8] net: sched: pie: align PIE implementation with RFC 8033

2018-10-31 Thread David Miller


net-next is closed, please resubmit this when net-next opens back up.

Thank you.


Re: libbpf build failure on debian:9 with clang

2018-10-31 Thread Andrey Ignatov
Arnaldo Carvalho de Melo  [Wed, 2018-10-31 07:12 -0700]:
>   1740.66 debian:9  : FAIL gcc (Debian 
> 6.3.0-18+deb9u1) 6.3.0 20170516
> 
> The failure was with clang tho:
> 
> clang version 3.8.1-24 (tags/RELEASE_381/final)
> 
> With:
> 
>   gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1) 
> 
> it built without any warnings/errors.
> 
>   CC   /tmp/build/perf/libbpf.o
> libbpf.c:2201:36: error: comparison of constant -22 with expression of type 
> 'const enum bpf_attach_type' is always false 
> [-Werror,-Wtautological-constant-out-of-range-compare]
> if (section_names[i].attach_type == -EINVAL)
>  ^  ~~~
> 1 error generated.

Hi Arnaldo,

I have a clang version that I can reproduce this error with. I'm working
on a patch.

Thank you for the report!
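
Just to sketch the direction (not necessarily the final fix): the warning
goes away if the comparison does not happen on the enum type itself, or if
attachability is tracked separately; the is_attachable field below is
hypothetical, not an existing libbpf member.

	/* a) compare through a plain int instead of the enum type */
	if ((int)section_names[i].attach_type == -EINVAL)
		return -EINVAL;

	/* b) or track attachability in a separate flag, so the enum member
	 * never needs to hold an error value (field name made up here)
	 */
	if (!section_names[i].is_attachable)
		return -EINVAL;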


>   CC   /tmp/build/perf/help.o
> mv: cannot stat '/tmp/build/perf/.libbpf.o.tmp': No such file or directory
> /git/linux/tools/build/Makefile.build:96: recipe for target 
> '/tmp/build/perf/libbpf.o' failed
> make[4]: *** [/tmp/build/perf/libbpf.o] Error 1
> 
> This is the cset:
> 
> commit 956b620fcf0b64de403cd26a56bc41e6e4826ea6
> Author: Andrey Ignatov 
> Date:   Wed Sep 26 15:24:53 2018 -0700
> 
> libbpf: Introduce libbpf_attach_type_by_name
> 
> 
> 
> Tests are continuing, so far:
> 
>143.53 alpine:3.4: Ok   gcc (Alpine 5.3.0) 5.3.0
>258.62 alpine:3.5: Ok   gcc (Alpine 6.2.1) 6.2.1 
> 20160822
>351.62 alpine:3.6: Ok   gcc (Alpine 6.3.0) 6.3.0
>451.68 alpine:3.7: Ok   gcc (Alpine 6.4.0) 6.4.0
>549.38 alpine:3.8: Ok   gcc (Alpine 6.4.0) 6.4.0
>679.07 alpine:edge   : Ok   gcc (Alpine 6.4.0) 6.4.0
>763.35 amazonlinux:1 : Ok   gcc (GCC) 4.8.5 20150623 
> (Red Hat 4.8.5-28)
>859.65 amazonlinux:2 : Ok   gcc (GCC) 7.3.1 20180303 
> (Red Hat 7.3.1-5)
>947.39 android-ndk:r12b-arm  : Ok   arm-linux-androideabi-gcc 
> (GCC) 4.9.x 20150123 (prerelease)
>   1050.64 android-ndk:r15c-arm  : Ok   arm-linux-androideabi-gcc 
> (GCC) 4.9.x 20150123 (prerelease)
>   1128.75 centos:5  : Ok   gcc (GCC) 4.1.2 20080704 
> (Red Hat 4.1.2-55)
>   1233.26 centos:6  : Ok   gcc (GCC) 4.4.7 20120313 
> (Red Hat 4.4.7-23)
>   1343.16 centos:7  : Ok   gcc (GCC) 4.8.5 20150623 
> (Red Hat 4.8.5-28)
>   1473.61 clearlinux:latest : FAIL gcc (Clear Linux OS for 
> Intel Architecture) 8.2.1 20180502
>   1545.56 debian:7  : Ok   gcc (Debian 4.7.2-5) 4.7.2
>   1645.53 debian:8  : Ok   gcc (Debian 
> 4.9.2-10+deb8u1) 4.9.2
>   1740.66 debian:9  : FAIL gcc (Debian 
> 6.3.0-18+deb9u1) 6.3.0 20170516
>   18   113.19 debian:experimental   : Ok   gcc (Debian 8.2.0-8) 8.2.0
>   1941.48 debian:experimental-x-arm64   : Ok   aarch64-linux-gnu-gcc 
> (Debian 8.2.0-7) 8.2.0
>   2041.51 debian:experimental-x-mips: Ok   mips-linux-gnu-gcc (Debian 
> 8.2.0-7) 8.2.0
>   2140.09 debian:experimental-x-mips64  : Ok   mips64-linux-gnuabi64-gcc 
> (Debian 8.1.0-12) 8.1.0
>   2242.17 debian:experimental-x-mipsel  : Ok   mipsel-linux-gnu-gcc 
> (Debian 8.2.0-7) 8.2.0
>   2340.02 fedora:20 : Ok   gcc (GCC) 4.8.3 20140911 
> (Red Hat 4.8.3-7)
>   2445.47 fedora:21 : Ok   gcc (GCC) 4.9.2 20150212 
> (Red Hat 4.9.2-6)
>   2541.64 fedora:22 : Ok   gcc (GCC) 5.3.1 20160406 
> (Red Hat 5.3.1-6)
>   2643.60 fedora:23 : Ok   gcc (GCC) 5.3.1 20160406 
> (Red Hat 5.3.1-6)
>   2744.04 fedora:24 : Ok   gcc (GCC) 6.3.1 20161221 
> (Red Hat 6.3.1-1)
>   2837.21 fedora:24-x-ARC-uClibc: Ok   arc-linux-gcc (ARCompact 
> ISA Linux uClibc toolchain 2017.09-rc2) 7.1.1 20170710
> 
> The problem with clearlinux is unrelated:
> 
> clang-7: error: unknown argument: '-fno-semantic-interposition'
> clang-7: error: unsupported argument '4' to option 'flto='
> clang-7: error: optimization flag '-ffat-lto-objects' is not supported 
> [-Werror,-Wignored-optimization-argument]
> 

-- 
Andrey Ignatov


Re: [PATCH net] rtnetlink: invoke 'cb->done' destructor before 'cb->args' reset

2018-10-31 Thread David Ahern
On 10/31/18 10:55 AM, David Ahern wrote:
> I think the simplest fix for 4.20 is to break the loop if ret is non-0 -
> restore the previous behavior. 

that is the only recourse. It has to bail if ret is non-0. Do you want
to send a patch with that fix?


RE: [RFC PATCH 3/4] igb: add support for extended PHC gettime

2018-10-31 Thread Keller, Jacob E
> -Original Message-
> From: Miroslav Lichvar [mailto:mlich...@redhat.com]
> Sent: Wednesday, October 31, 2018 2:40 AM
> To: Richard Cochran 
> Cc: netdev@vger.kernel.org; intel-wired-...@lists.osuosl.org; Keller, Jacob E
> 
> Subject: Re: [RFC PATCH 3/4] igb: add support for extended PHC gettime
> 
> On Tue, Oct 30, 2018 at 07:29:16PM -0700, Richard Cochran wrote:
> > On Fri, Oct 26, 2018 at 06:27:41PM +0200, Miroslav Lichvar wrote:
> > > +static int igb_ptp_gettimex(struct ptp_clock_info *ptp,
> > > + struct ptp_system_timestamp *sts)
> > > +{
> > > + struct igb_adapter *igb = container_of(ptp, struct igb_adapter,
> > > +ptp_caps);
> > > + struct e1000_hw *hw = &igb->hw;
> > > + unsigned long flags;
> > > + u32 lo, hi;
> > > + u64 ns;
> > > +
> > > + spin_lock_irqsave(&igb->tmreg_lock, flags);
> > > +
> > > + /* 82576 doesn't have SYSTIMR */
> > > + if (igb->hw.mac.type == e1000_82576) {
> >
> > Instead of if/then/else, can't you follow the pattern of providing
> > different function flavors ...
> 
> I can. I was just trying to minimize the amount of triplicated code.
> In the next version I'll add a patch to deprecate the old gettime
> functions, as Jacob suggested, and replace them with the extended
> versions, so the amount of code will not change that much.
> 

Excellent.

-Jake

> Thanks,
> 
> --
> Miroslav Lichvar


RE: [RFC PATCH 4/4] ixgbe: add support for extended PHC gettime

2018-10-31 Thread Keller, Jacob E



> -Original Message-
> From: Richard Cochran [mailto:richardcoch...@gmail.com]
> Sent: Wednesday, October 31, 2018 7:40 AM
> To: Miroslav Lichvar 
> Cc: Keller, Jacob E ; netdev@vger.kernel.org; 
> intel-wired-
> l...@lists.osuosl.org
> Subject: Re: [RFC PATCH 4/4] ixgbe: add support for extended PHC gettime
> 
> On Mon, Oct 29, 2018 at 02:31:09PM +0100, Miroslav Lichvar wrote:
> > I think there could be a flag in ptp_system_timestamp, or a parameter
> > of gettimex64(), which would enable/disable reading of the system
> > clock.
> 
> I'm not a fan of functions that change their behavior based on flags
> in their input parameters.
> 
> Thanks,
> Richard

Neither am I. I do however want to find a solution that avoids having drivers 
needlessly duplicate almost the same functionality.

Thanks,
Jake


Re: [PATCH net] rtnetlink: invoke 'cb->done' destructor before 'cb->args' reset

2018-10-31 Thread David Ahern
On 10/31/18 12:42 AM, Alexey Kodanev wrote:
> cb->args[2] can store the pointer to the struct fib6_walker,
> allocated in inet6_dump_fib(). On the next loop iteration in
> rtnl_dump_all(), 'memset(&cb, 0, sizeof(cb->args))' can reset
> that pointer, leaking the memory [1].
> 
> Fix it by calling cb->done, if it is set, before filling 'cb->args'
> with zeros.
> 
> Looks like the recent changes in rtnl_dump_all() contributed to
> the appearance of this kmemleak [1], commit c63586dc9b3e ("net:
> rtnl_dump_all needs to propagate error from dumpit function")
> breaks the loop only on an error now.
> 
...

It is more efficient to keep going.

I think the simplest fix for 4.20 is to break the loop if ret is non-0 -
restore the previous behavior. For net-next I think the done callback is
not needed for ipv6; I think there is a simpler way to do it.
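
Roughly like this, i.e. stop iterating over the per-family dumpit handlers
on the first error instead of resetting cb->args and moving on. This is
only a simplified sketch of the idea, not the exact rtnl_dump_all() code
(start_idx and dumpit stand in for the real bookkeeping):

	for (idx = 1; idx <= RTNL_FAMILY_MAX; idx++) {
		/* ... look up this family's dumpit handler, skip if none ... */
		if (idx > start_idx)
			memset(&cb->args[0], 0, sizeof(cb->args));
		ret = dumpit(skb, cb);
		if (ret)
			break;	/* bail out, as before commit c63586dc9b3e */
	}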


Re: [PATCH net-next 6/8] net: sched: pie: add mechanism to set PIE active/inactive

2018-10-31 Thread Stephen Hemminger
On Wed, 31 Oct 2018 21:49:30 +0530
Leslie Monis  wrote:

> diff --git a/net/sched/sch_pie.c b/net/sched/sch_pie.c
> index c84e91e..b68b367 100644
> --- a/net/sched/sch_pie.c
> +++ b/net/sched/sch_pie.c
> @@ -57,6 +57,7 @@ struct pie_vars {
>   psched_time_t dq_tstamp;/* drain rate */
>   u32 avg_dq_rate;/* bytes per pschedtime tick,scaled */
>   u32 qlen_old;   /* in bytes */
> + bool active;/* inactive/active */
>  };

Current Linux best practice is to not use bool for true/false values
in a structure. This is because the size of bool is not obvious and
can cause padding.

Recommend using u8 instead.
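
I.e. something like this (sketch, only the tail of pie_vars shown):

	struct pie_vars {
		/* ... earlier fields unchanged ... */
		u32 avg_dq_rate;	/* bytes per pschedtime tick, scaled */
		u32 qlen_old;		/* in bytes */
		u8 active;		/* 0 = inactive, 1 = active */
	};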


Re: [PATCH net-next 5/8] net: sched: pie: add more conditions to auto-tune alpha and beta

2018-10-31 Thread Stephen Hemminger
On Wed, 31 Oct 2018 21:49:29 +0530
Leslie Monis  wrote:

> From: "Mohit P. Tahiliani" 
> 
> The update in drop probability depends on the parameters
> alpha and beta, which in turn reflect the current congestion
> level. However, the previous if-else cases were recommended
> when the supported bandwidth was up to 12 Mbps but, current
> data links support a much higher bandwidth, and the
> requirement for more bandwidth is in never-ending demand.
> Hence, RFC 8033 suggests using more if-else cases for better
> fine-tuning of parameters alpha and beta in order to control
> the congestion as much as possible.
> 
> Signed-off-by: Mohit P. Tahiliani 
> Signed-off-by: Dhaval Khandla 
> Signed-off-by: Hrishikesh Hiraskar 
> Signed-off-by: Manish Kumar B 
> Signed-off-by: Sachin D. Patil 
> Signed-off-by: Leslie Monis 
> ---
>  net/sched/sch_pie.c | 26 +++---
>  1 file changed, 23 insertions(+), 3 deletions(-)
> 
> diff --git a/net/sched/sch_pie.c b/net/sched/sch_pie.c
> index f4e189a..c84e91e 100644
> --- a/net/sched/sch_pie.c
> +++ b/net/sched/sch_pie.c
> @@ -343,10 +343,30 @@ static void calculate_probability(struct Qdisc *sch)
>* appropriately 2) scaling down by 16 to come to 0-2 range.
>* Please see paper for details.
>*
> -  * We scale alpha and beta differently depending on whether we are in
> -  * light, medium or high dropping mode.
> +  * We scale alpha and beta differently depending on how heavy the
> +  * congestion is.
>*/
> - if (q->vars.prob < MAX_PROB / 100) {
> + if (q->vars.prob < MAX_PROB / 1000000) {
> + alpha =
> + (q->params.alpha * (MAX_PROB / PSCHED_TICKS_PER_SEC)) >> 15;
> + beta =
> + (q->params.beta * (MAX_PROB / PSCHED_TICKS_PER_SEC)) >> 15;
> + } else if (q->vars.prob < MAX_PROB / 100000) {
> + alpha =
> + (q->params.alpha * (MAX_PROB / PSCHED_TICKS_PER_SEC)) >> 13;
> + beta =
> + (q->params.beta * (MAX_PROB / PSCHED_TICKS_PER_SEC)) >> 13;
> + } else if (q->vars.prob < MAX_PROB / 10000) {
> + alpha =
> + (q->params.alpha * (MAX_PROB / PSCHED_TICKS_PER_SEC)) >> 11;
> + beta =
> + (q->params.beta * (MAX_PROB / PSCHED_TICKS_PER_SEC)) >> 11;
> + } else if (q->vars.prob < MAX_PROB / 1000) {
> + alpha =
> + (q->params.alpha * (MAX_PROB / PSCHED_TICKS_PER_SEC)) >> 9;
> + beta =
> + (q->params.beta * (MAX_PROB / PSCHED_TICKS_PER_SEC)) >> 9;
> + } else if (q->vars.prob < MAX_PROB / 100) {
>   alpha =
>   (q->params.alpha * (MAX_PROB / PSCHED_TICKS_PER_SEC)) >> 7;
>   beta =


Seems like the if/else chain is getting long in the tail. Maybe a loop
or table driven approach would be clearer.
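
For illustration only, a table-driven variant could look roughly like this
(thresholds and shifts taken from the patch; the table and variable names
are made up):

	static const struct {
		u32 prob_div;	/* compare prob against MAX_PROB / prob_div */
		int shift;	/* scale alpha and beta down by this many bits */
	} pie_scale_table[] = {
		{ 1000000, 15 },
		{  100000, 13 },
		{   10000, 11 },
		{    1000,  9 },
		{     100,  7 },
		{      10,  5 },
	};

	/* in calculate_probability(): */
	int i, shift = 4;	/* default, when prob >= MAX_PROB / 10 */

	for (i = 0; i < ARRAY_SIZE(pie_scale_table); i++) {
		if (q->vars.prob < MAX_PROB / pie_scale_table[i].prob_div) {
			shift = pie_scale_table[i].shift;
			break;
		}
	}
	alpha = (q->params.alpha * (MAX_PROB / PSCHED_TICKS_PER_SEC)) >> shift;
	beta = (q->params.beta * (MAX_PROB / PSCHED_TICKS_PER_SEC)) >> shift;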


Re: [PATCH net-next 7/8] net: sched: pie: add derandomization mechanism

2018-10-31 Thread Stephen Hemminger
On Wed, 31 Oct 2018 21:49:31 +0530
Leslie Monis  wrote:

> diff --git a/net/sched/sch_pie.c b/net/sched/sch_pie.c
> index b68b367..88e605c 100644
> --- a/net/sched/sch_pie.c
> +++ b/net/sched/sch_pie.c
> @@ -58,6 +58,7 @@ struct pie_vars {
>   u32 avg_dq_rate;/* bytes per pschedtime tick,scaled */
>   u32 qlen_old;   /* in bytes */
>   bool active;/* inactive/active */
> + u64 accu_prob;  /* accumulated drop probability */
>  };
>  

Although putting it at the end seems like a natural place, it creates
holes in the C structure. My recommendation would be to put the new
field after dq_tstamp.
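
I.e. roughly this layout (sketch, remaining fields unchanged; u8 for
active as suggested for patch 6/8):

	struct pie_vars {
		/* ... earlier fields unchanged ... */
		psched_time_t dq_tstamp;	/* drain rate */
		u64 accu_prob;			/* accumulated drop probability */
		u32 avg_dq_rate;		/* bytes per pschedtime tick, scaled */
		u32 qlen_old;			/* in bytes */
		u8 active;			/* inactive/active */
	};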


Re: [PATCH net-next 0/8] net: sched: pie: align PIE implementation with RFC 8033

2018-10-31 Thread Stephen Hemminger
On Wed, 31 Oct 2018 21:49:24 +0530
Leslie Monis  wrote:

> The current implementation of PIE queueing discipline is according to an IETF
> draft [http://tools.ietf.org/html/draft-pan-aqm-pie-00] and the paper
> [PIE: A Lightweight Control Scheme to Address the Bufferbloat Problem].
> However, a lot of necessary modifications and enhancements have been proposed
> in RFC 8033, which have not yet been incorporated in the source code of Linux
> kernel. The following series of patches helps in achieving the same.
> 
> This patch series includes:
> 
> 1. Change the value of QUEUE_THRESHOLD
> 2. Change the default value of pie_params->target
> 3. Change the default value of pie_params->tupdate
> 4. Change the initial value of pie_vars->burst_time
> 5. Add more conditions to auto-tune alpha and beta
> 6. Add mechanism to set PIE active/inactive
> 7. Add a derandomization mechanism
> 8. Update references
> 
> Mohit P. Tahiliani (8):
>   net: sched: pie: change value of QUEUE_THRESHOLD
>   net: sched: pie: change default value of pie_params->target
>   net: sched: pie: change default value of pie_params->tupdate
>   net: sched: pie: change initial value of pie_vars->burst_time
>   net: sched: pie: add more conditions to auto-tune alpha and beta
>   net: sched: pie: add mechanism to set PIE active/inactive
>   net: sched: pie: add derandomization mechanism
>   net: sched: pie: update references
> 
>  net/sched/sch_pie.c | 77 
> +++--
>  1 file changed, 63 insertions(+), 14 deletions(-)
> 

Did you do performance tests? Often the RFC is out of date and
the actual values are better than those in the standard.


[PATCH net-next 8/8] net: sched: pie: update references

2018-10-31 Thread Leslie Monis
From: "Mohit P. Tahiliani" 

RFC 8033 replaces the IETF draft for PIE

Signed-off-by: Mohit P. Tahiliani 
Signed-off-by: Dhaval Khandla 
Signed-off-by: Hrishikesh Hiraskar 
Signed-off-by: Manish Kumar B 
Signed-off-by: Sachin D. Patil 
Signed-off-by: Leslie Monis 
---
 net/sched/sch_pie.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sched/sch_pie.c b/net/sched/sch_pie.c
index 88e605c..708e8ad 100644
--- a/net/sched/sch_pie.c
+++ b/net/sched/sch_pie.c
@@ -17,7 +17,7 @@
  * University of Oslo, Norway.
  *
  * References:
- * IETF draft submission: http://tools.ietf.org/html/draft-pan-aqm-pie-00
+ * RFC 8033: https://tools.ietf.org/html/rfc8033
  * IEEE  Conference on High Performance Switching and Routing 2013 :
  * "PIE: A * Lightweight Control Scheme to Address the Bufferbloat Problem"
  */
-- 
2.7.4



[PATCH net-next 7/8] net: sched: pie: add derandomization mechanism

2018-10-31 Thread Leslie Monis
From: "Mohit P. Tahiliani" 

Random dropping of packets to achieve latency control may
introduce outlier situations where packets are dropped too
close to each other or too far from each other. This can
cause the real drop percentage to temporarily deviate from
the intended drop probability. In certain scenarios, such
as a small number of simultaneous TCP flows, these
deviations can cause significant swings in link
utilization and queuing latency. RFC 8033 suggests using a
derandomization mechanism to avoid these deviations.

Signed-off-by: Mohit P. Tahiliani 
Signed-off-by: Dhaval Khandla 
Signed-off-by: Hrishikesh Hiraskar 
Signed-off-by: Manish Kumar B 
Signed-off-by: Sachin D. Patil 
Signed-off-by: Leslie Monis 
---
 net/sched/sch_pie.c | 17 -
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/net/sched/sch_pie.c b/net/sched/sch_pie.c
index b68b367..88e605c 100644
--- a/net/sched/sch_pie.c
+++ b/net/sched/sch_pie.c
@@ -58,6 +58,7 @@ struct pie_vars {
u32 avg_dq_rate;/* bytes per pschedtime tick,scaled */
u32 qlen_old;   /* in bytes */
bool active;/* inactive/active */
+   u64 accu_prob;  /* accumulated drop probability */
 };
 
 /* statistics gathering */
@@ -96,6 +97,7 @@ static void pie_vars_init(struct pie_vars *vars)
/* default of 150 ms in pschedtime */
vars->burst_time = PSCHED_NS2TICKS(150 * NSEC_PER_MSEC);
vars->active = true;
+   vars->accu_prob = 0;
 }
 
 static bool drop_early(struct Qdisc *sch, u32 packet_size)
@@ -130,9 +132,21 @@ static bool drop_early(struct Qdisc *sch, u32 packet_size)
else
local_prob = q->vars.prob;
 
+   if (local_prob == 0)
+   q->vars.accu_prob = 0;
+
+   q->vars.accu_prob += local_prob;
+
+   if (q->vars.accu_prob < (MAX_PROB / 100) * 85)
+   return false;
+   if (q->vars.accu_prob >= ((u64)MAX_PROB * 17) / 2)
+   return true;
+
rnd = prandom_u32();
-   if (rnd < local_prob)
+   if (rnd < local_prob) {
+   q->vars.accu_prob = 0;
return true;
+   }
 
return false;
 }
@@ -181,6 +195,7 @@ static int pie_qdisc_enqueue(struct sk_buff *skb, struct 
Qdisc *sch,
 
 out:
q->stats.dropped++;
+   q->vars.accu_prob = 0;
return qdisc_drop(skb, sch, to_free);
 }
 
-- 
2.7.4



[PATCH net-next 5/8] net: sched: pie: add more conditions to auto-tune alpha and beta

2018-10-31 Thread Leslie Monis
From: "Mohit P. Tahiliani" 

The update in drop probability depends on the parameters
alpha and beta, which in turn reflect the current congestion
level. However, the previous if-else cases were tuned for
links of up to 12 Mbps, whereas current data links support
much higher bandwidths and the demand for bandwidth keeps
growing.
Hence, RFC 8033 suggests using more if-else cases for better
fine-tuning of parameters alpha and beta in order to control
the congestion as much as possible.

Signed-off-by: Mohit P. Tahiliani 
Signed-off-by: Dhaval Khandla 
Signed-off-by: Hrishikesh Hiraskar 
Signed-off-by: Manish Kumar B 
Signed-off-by: Sachin D. Patil 
Signed-off-by: Leslie Monis 
---
 net/sched/sch_pie.c | 26 +++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/net/sched/sch_pie.c b/net/sched/sch_pie.c
index f4e189a..c84e91e 100644
--- a/net/sched/sch_pie.c
+++ b/net/sched/sch_pie.c
@@ -343,10 +343,30 @@ static void calculate_probability(struct Qdisc *sch)
 * appropriately 2) scaling down by 16 to come to 0-2 range.
 * Please see paper for details.
 *
-* We scale alpha and beta differently depending on whether we are in
-* light, medium or high dropping mode.
+* We scale alpha and beta differently depending on how heavy the
+* congestion is.
 */
-   if (q->vars.prob < MAX_PROB / 100) {
+   if (q->vars.prob < MAX_PROB / 1000000) {
+   alpha =
+   (q->params.alpha * (MAX_PROB / PSCHED_TICKS_PER_SEC)) >> 15;
+   beta =
+   (q->params.beta * (MAX_PROB / PSCHED_TICKS_PER_SEC)) >> 15;
+   } else if (q->vars.prob < MAX_PROB / 100000) {
+   alpha =
+   (q->params.alpha * (MAX_PROB / PSCHED_TICKS_PER_SEC)) >> 13;
+   beta =
+   (q->params.beta * (MAX_PROB / PSCHED_TICKS_PER_SEC)) >> 13;
+   } else if (q->vars.prob < MAX_PROB / 10000) {
+   alpha =
+   (q->params.alpha * (MAX_PROB / PSCHED_TICKS_PER_SEC)) >> 11;
+   beta =
+   (q->params.beta * (MAX_PROB / PSCHED_TICKS_PER_SEC)) >> 11;
+   } else if (q->vars.prob < MAX_PROB / 1000) {
+   alpha =
+   (q->params.alpha * (MAX_PROB / PSCHED_TICKS_PER_SEC)) >> 9;
+   beta =
+   (q->params.beta * (MAX_PROB / PSCHED_TICKS_PER_SEC)) >> 9;
+   } else if (q->vars.prob < MAX_PROB / 100) {
alpha =
(q->params.alpha * (MAX_PROB / PSCHED_TICKS_PER_SEC)) >> 7;
beta =
-- 
2.7.4



[PATCH net-next 6/8] net: sched: pie: add mechanism to set PIE active/inactive

2018-10-31 Thread Leslie Monis
From: "Mohit P. Tahiliani" 

To overcome unnecessary packet drops due to a spurious
uptick in queuing latency caused by fluctuations in a
network, PIE can choose to be active only when the queue
occupancy is over a certain threshold. RFC 8033 suggests
the value of this threshold be 1/3 of the tail drop
threshold. PIE becomes inactive when the congestion ends
i.e., when the drop probability reaches 0, and the current
and previous latency samples are all below half of the
target queue delay.

Signed-off-by: Mohit P. Tahiliani 
Signed-off-by: Dhaval Khandla 
Signed-off-by: Hrishikesh Hiraskar 
Signed-off-by: Manish Kumar B 
Signed-off-by: Sachin D. Patil 
Signed-off-by: Leslie Monis 
---
 net/sched/sch_pie.c | 22 ++
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/net/sched/sch_pie.c b/net/sched/sch_pie.c
index c84e91e..b68b367 100644
--- a/net/sched/sch_pie.c
+++ b/net/sched/sch_pie.c
@@ -57,6 +57,7 @@ struct pie_vars {
psched_time_t dq_tstamp;/* drain rate */
u32 avg_dq_rate;/* bytes per pschedtime tick,scaled */
u32 qlen_old;   /* in bytes */
+   bool active;/* inactive/active */
 };
 
 /* statistics gathering */
@@ -94,6 +95,7 @@ static void pie_vars_init(struct pie_vars *vars)
vars->avg_dq_rate = 0;
/* default of 150 ms in pschedtime */
vars->burst_time = PSCHED_NS2TICKS(150 * NSEC_PER_MSEC);
+   vars->active = true;
 }
 
 static bool drop_early(struct Qdisc *sch, u32 packet_size)
@@ -141,12 +143,23 @@ static int pie_qdisc_enqueue(struct sk_buff *skb, struct 
Qdisc *sch,
struct pie_sched_data *q = qdisc_priv(sch);
bool enqueue = false;
 
+   if (!q->vars.active && qdisc_qlen(sch) >= sch->limit / 3) {
+   /* If the queue occupancy is over 1/3 of the tail drop
+* threshold, turn on PIE.
+*/
+   pie_vars_init(&q->vars);
+   q->vars.prob = 0;
+   q->vars.qdelay_old = 0;
+   q->vars.dq_count = 0;
+   q->vars.dq_tstamp = psched_get_time();
+   }
+
if (unlikely(qdisc_qlen(sch) >= sch->limit)) {
q->stats.overlimit++;
goto out;
}
 
-   if (!drop_early(sch, skb->len)) {
+   if (!q->vars.active || !drop_early(sch, skb->len)) {
enqueue = true;
} else if (q->params.ecn && (q->vars.prob <= MAX_PROB / 10) &&
   INET_ECN_set_ce(skb)) {
@@ -431,7 +444,7 @@ static void calculate_probability(struct Qdisc *sch)
q->vars.qdelay = qdelay;
q->vars.qlen_old = qlen;
 
-   /* We restart the measurement cycle if the following conditions are met
+   /* We turn off PIE if the following conditions are met
 * 1. If the delay has been low for 2 consecutive Tupdate periods
 * 2. Calculated drop probability is zero
 * 3. We have atleast one estimate for the avg_dq_rate ie.,
@@ -441,7 +454,7 @@ static void calculate_probability(struct Qdisc *sch)
(q->vars.qdelay_old < q->params.target / 2) &&
q->vars.prob == 0 &&
q->vars.avg_dq_rate > 0)
-   pie_vars_init(&q->vars);
+   q->vars.active = false;
 }
 
 static void pie_timer(struct timer_list *t)
@@ -451,7 +464,8 @@ static void pie_timer(struct timer_list *t)
spinlock_t *root_lock = qdisc_lock(qdisc_root_sleeping(sch));
 
spin_lock(root_lock);
-   calculate_probability(sch);
+   if (q->vars.active)
+   calculate_probability(sch);
 
/* reset the timer to fire after 'tupdate'. tupdate is in jiffies. */
if (q->params.tupdate)
-- 
2.7.4



[PATCH net-next 4/8] net: sched: pie: change initial value of pie_vars->burst_time

2018-10-31 Thread Leslie Monis
From: "Mohit P. Tahiliani" 

RFC 8033 suggests an initial value of 150 milliseconds for
the maximum time allowed for a burst of packets instead of
100 milliseconds.

Signed-off-by: Mohit P. Tahiliani 
Signed-off-by: Dhaval Khandla 
Signed-off-by: Hrishikesh Hiraskar 
Signed-off-by: Manish Kumar B 
Signed-off-by: Sachin D. Patil 
Signed-off-by: Leslie Monis 
---
 net/sched/sch_pie.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/sched/sch_pie.c b/net/sched/sch_pie.c
index 9912d9cc..f4e189a 100644
--- a/net/sched/sch_pie.c
+++ b/net/sched/sch_pie.c
@@ -92,8 +92,8 @@ static void pie_vars_init(struct pie_vars *vars)
 {
vars->dq_count = DQCOUNT_INVALID;
vars->avg_dq_rate = 0;
-   /* default of 100 ms in pschedtime */
-   vars->burst_time = PSCHED_NS2TICKS(100 * NSEC_PER_MSEC);
+   /* default of 150 ms in pschedtime */
+   vars->burst_time = PSCHED_NS2TICKS(150 * NSEC_PER_MSEC);
 }
 
 static bool drop_early(struct Qdisc *sch, u32 packet_size)
-- 
2.7.4



[PATCH net-next 3/8] net: sched: pie: change default value of pie_params->tupdate

2018-10-31 Thread Leslie Monis
From: "Mohit P. Tahiliani" 

RFC 8033 suggests a default value of 15 milliseconds for
the update interval instead of 30 milliseconds.

Signed-off-by: Mohit P. Tahiliani 
Signed-off-by: Dhaval Khandla 
Signed-off-by: Hrishikesh Hiraskar 
Signed-off-by: Manish Kumar B 
Signed-off-by: Sachin D. Patil 
Signed-off-by: Leslie Monis 
---
 net/sched/sch_pie.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sched/sch_pie.c b/net/sched/sch_pie.c
index c5d6d6b..9912d9cc 100644
--- a/net/sched/sch_pie.c
+++ b/net/sched/sch_pie.c
@@ -81,7 +81,7 @@ static void pie_params_init(struct pie_params *params)
 {
params->alpha = 2;
params->beta = 20;
-   params->tupdate = usecs_to_jiffies(30 * USEC_PER_MSEC); /* 30 ms */
+   params->tupdate = usecs_to_jiffies(15 * USEC_PER_MSEC); /* 15 ms */
params->limit = 1000;   /* default of 1000 packets */
params->target = PSCHED_NS2TICKS(15 * NSEC_PER_MSEC);   /* 15 ms */
params->ecn = false;
-- 
2.7.4



[PATCH net-next 2/8] net: sched: pie: change default value of pie_params->target

2018-10-31 Thread Leslie Monis
From: "Mohit P. Tahiliani" 

RFC 8033 suggests a default value of 15 milliseconds for
the target queue delay instead of 20 milliseconds.

Signed-off-by: Mohit P. Tahiliani 
Signed-off-by: Dhaval Khandla 
Signed-off-by: Hrishikesh Hiraskar 
Signed-off-by: Manish Kumar B 
Signed-off-by: Sachin D. Patil 
Signed-off-by: Leslie Monis 
---
 net/sched/sch_pie.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sched/sch_pie.c b/net/sched/sch_pie.c
index 56c9e4d..c5d6d6b 100644
--- a/net/sched/sch_pie.c
+++ b/net/sched/sch_pie.c
@@ -83,7 +83,7 @@ static void pie_params_init(struct pie_params *params)
params->beta = 20;
params->tupdate = usecs_to_jiffies(30 * USEC_PER_MSEC); /* 30 ms */
params->limit = 1000;   /* default of 1000 packets */
-   params->target = PSCHED_NS2TICKS(20 * NSEC_PER_MSEC);   /* 20 ms */
+   params->target = PSCHED_NS2TICKS(15 * NSEC_PER_MSEC);   /* 15 ms */
params->ecn = false;
params->bytemode = false;
 }
-- 
2.7.4



[PATCH net-next 1/8] net: sched: pie: change value of QUEUE_THRESHOLD

2018-10-31 Thread Leslie Monis
From: "Mohit P. Tahiliani" 

RFC 8033 recommends a value of 16384 bytes for the queue
threshold instead of 10000 bytes.

Signed-off-by: Mohit P. Tahiliani 
Signed-off-by: Dhaval Khandla 
Signed-off-by: Hrishikesh Hiraskar 
Signed-off-by: Manish Kumar B 
Signed-off-by: Sachin D. Patil 
Signed-off-by: Leslie Monis 
---
 net/sched/sch_pie.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sched/sch_pie.c b/net/sched/sch_pie.c
index d142937..56c9e4d 100644
--- a/net/sched/sch_pie.c
+++ b/net/sched/sch_pie.c
@@ -31,7 +31,7 @@
 #include 
 #include 
 
-#define QUEUE_THRESHOLD 10000
+#define QUEUE_THRESHOLD 16384  /* 16KB i.e. 2^14 bytes */
 #define DQCOUNT_INVALID -1
 #define MAX_PROB  0xffffffff
 #define PIE_SCALE 8
-- 
2.7.4



[PATCH net-next 0/8] net: sched: pie: align PIE implementation with RFC 8033

2018-10-31 Thread Leslie Monis
The current implementation of PIE queueing discipline is according to an IETF
draft [http://tools.ietf.org/html/draft-pan-aqm-pie-00] and the paper
[PIE: A Lightweight Control Scheme to Address the Bufferbloat Problem].
However, a lot of necessary modifications and enhancements have been proposed
in RFC 8033, which have not yet been incorporated in the source code of Linux
kernel. The following series of patches helps in achieving the same.

This patch series includes:

1. Change the value of QUEUE_THRESHOLD
2. Change the default value of pie_params->target
3. Change the default value of pie_params->tupdate
4. Change the initial value of pie_vars->burst_time
5. Add more conditions to auto-tune alpha and beta
6. Add mechanism to set PIE active/inactive
7. Add a derandomization mechanism
8. Update references

Mohit P. Tahiliani (8):
  net: sched: pie: change value of QUEUE_THRESHOLD
  net: sched: pie: change default value of pie_params->target
  net: sched: pie: change default value of pie_params->tupdate
  net: sched: pie: change initial value of pie_vars->burst_time
  net: sched: pie: add more conditions to auto-tune alpha and beta
  net: sched: pie: add mechanism to set PIE active/inactive
  net: sched: pie: add derandomization mechanism
  net: sched: pie: update references

 net/sched/sch_pie.c | 77 +++--
 1 file changed, 63 insertions(+), 14 deletions(-)

-- 
2.7.4



[PATCH v2 net 2/3] net: do not abort bulk send on BQL status

2018-10-31 Thread Eric Dumazet
Before calling dev_hard_start_xmit(), upper layers tried
to cook optimal skb list based on BQL budget.

Problem is that GSO packets can end up consuming more than
the BQL budget.

Breaking the loop is not useful, since requeued packets
are ahead of any packets still in the qdisc.

It is also more expensive, since next TX completion will
push these packets later, while skbs are not in cpu caches.

It is also a behavior difference with TSO packets, that can
break the BQL limit by a large amount.

Note that drivers should use __netdev_tx_sent_queue()
in order to have optimal xmit_more support, and avoid
useless atomic operations as shown in the following patch.

Signed-off-by: Eric Dumazet 
---
 net/core/dev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 
77d43ae2a7bbe1267f8430d5c35637d1984f463c..0ffcbdd55fa9ee545c807f2ed3fc178830e3075a
 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3272,7 +3272,7 @@ struct sk_buff *dev_hard_start_xmit(struct sk_buff 
*first, struct net_device *de
}
 
skb = next;
-   if (netif_xmit_stopped(txq) && skb) {
+   if (netif_tx_queue_stopped(txq) && skb) {
rc = NETDEV_TX_BUSY;
break;
}
-- 
2.19.1.930.g4563a0d9d0-goog



[PATCH v2 net 3/3] net/mlx4_en: use __netdev_tx_sent_queue()

2018-10-31 Thread Eric Dumazet
doorbell only depends on xmit_more and netif_tx_queue_stopped()

Using __netdev_tx_sent_queue() avoids messing with BQL stop flag,
and is more generic.

This patch increases performance on GSO workload by keeping
doorbells to the minimum required.

Signed-off-by: Eric Dumazet 
Cc: Tariq Toukan 
---
 drivers/net/ethernet/mellanox/mlx4/en_tx.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c 
b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 
1857ee0f0871d48285a6d3711f7c3e9a1e08a05f..6f5153afcab4dfc331c099da854c54f1b9500887
 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -1006,7 +1006,6 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct 
net_device *dev)
ring->packets++;
}
ring->bytes += tx_info->nr_bytes;
-   netdev_tx_sent_queue(ring->tx_queue, tx_info->nr_bytes);
AVG_PERF_COUNTER(priv->pstats.tx_pktsz_avg, skb->len);
 
if (tx_info->inl)
@@ -1044,7 +1043,10 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct 
net_device *dev)
netif_tx_stop_queue(ring->tx_queue);
ring->queue_stopped++;
}
-   send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue);
+
+   send_doorbell = __netdev_tx_sent_queue(ring->tx_queue,
+  tx_info->nr_bytes,
+  skb->xmit_more);
 
real_size = (real_size / 16) & 0x3f;
 
-- 
2.19.1.930.g4563a0d9d0-goog



[PATCH v2 net 1/3] net: bql: add __netdev_tx_sent_queue()

2018-10-31 Thread Eric Dumazet
When qdisc_run() tries to use BQL budget to bulk-dequeue a batch
of packets, GSO can later transform this list into another list
of skbs, and each skb is sent to the device ndo_start_xmit(),
one at a time, with skb->xmit_more set to one for every skb
but the last.

Problem is that very often, BQL limit is hit in the middle of
the packet train, forcing dev_hard_start_xmit() to stop the
bulk send and requeue the end of the list.

BQL role is to avoid head of line blocking, making sure
a qdisc can deliver high priority packets before low priority ones.

But there is no way requeued packets can be bypassed by fresh
packets in the qdisc.

Aborting the bulk send increases TX softirqs, and hot cache
lines (after skb_segment()) are wasted.

Note that for TSO packets, we never split a packet in the middle
because of BQL limit being hit.

Drivers should be able to update BQL counters without
flipping/caring about BQL status, if the current skb
has xmit_more set.

Upper layers are ultimately responsible to stop sending another
packet train when BQL limit is hit.

Code template in a driver might look like the following :

send_doorbell = __netdev_tx_sent_queue(tx_queue, nr_bytes, 
skb->xmit_more);

Note that using __netdev_tx_sent_queue() is not mandatory,
since the following patch will change dev_hard_start_xmit()
to not care about BQL status.

But it is highly recommended so that the full benefits of xmit_more
can be reached (fewer doorbells sent, and fewer atomic operations as well).

Signed-off-by: Eric Dumazet 
---
 include/linux/netdevice.h | 20 
 1 file changed, 20 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 
dc1d9ed33b3192e9406b17c3107b3235b28ff1b9..857f8abf7b91bc79731873fc8f68e31f6bff4d03
 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3190,6 +3190,26 @@ static inline void netdev_tx_sent_queue(struct 
netdev_queue *dev_queue,
 #endif
 }
 
+/* Variant of netdev_tx_sent_queue() for drivers that are aware
+ * that they should not test BQL status themselves.
+ * We do want to change __QUEUE_STATE_STACK_XOFF only for the last
+ * skb of a batch.
+ * Returns true if the doorbell must be used to kick the NIC.
+ */
+static inline bool __netdev_tx_sent_queue(struct netdev_queue *dev_queue,
+ unsigned int bytes,
+ bool xmit_more)
+{
+   if (xmit_more) {
+#ifdef CONFIG_BQL
+   dql_queued(&dev_queue->dql, bytes);
+#endif
+   return netif_tx_queue_stopped(dev_queue);
+   }
+   netdev_tx_sent_queue(dev_queue, bytes);
+   return true;
+}
+
 /**
  * netdev_sent_queue - report the number of bytes queued to hardware
  * @dev: network device
-- 
2.19.1.930.g4563a0d9d0-goog



[PATCH v2 net 0/3] net: bql: better deal with GSO

2018-10-31 Thread Eric Dumazet
While BQL bulk dequeue works well for TSO packets, it is
not very efficient as soon as GSO is involved.

On a GSO only workload (UDP or TCP), this patch series
can save about 8 % of cpu cycles on a 40Gbit mlx4 NIC,
by keeping optimal batching, and avoiding expensive
doorbells, qdisc requeues and reschedules.

This patch series :

- Add __netdev_tx_sent_queue() so that drivers
  can implement efficient BQL and xmit_more support.

- Implement a work around in dev_hard_start_xmit()
  for drivers not using __netdev_tx_sent_queue()

- changes mlx4 to use __netdev_tx_sent_queue()

v2: Tariq and Willem feedback addressed.
added __netdev_tx_sent_queue() (Willem suggestion)


Eric Dumazet (3):
  net: bql: add __netdev_tx_sent_queue()
  net: do not abort bulk send on BQL status
  net/mlx4_en: use __netdev_tx_sent_queue()

 drivers/net/ethernet/mellanox/mlx4/en_tx.c |  6 --
 include/linux/netdevice.h  | 20 
 net/core/dev.c |  2 +-
 3 files changed, 25 insertions(+), 3 deletions(-)

-- 
2.19.1.930.g4563a0d9d0-goog



Re: Ethernet on my CycloneV broke since 4.9.124

2018-10-31 Thread Clément Péron
Hi Dinh,

On Wed, 31 Oct 2018 at 15:42, Dinh Nguyen  wrote:
>
> Hi Clement,
>
> On 10/31/2018 08:01 AM, Clément Péron wrote:
> > Hi,
> >
> > The patch "net: stmmac: socfpga: add additional ocp reset line for
> > Stratix10" introduce in 4.9.124 broke the ethernet on my CycloneV
> > board.
> >
> > When I boot i have this issue :
> >
> > socfpga-dwmac ff702000.ethernet: error getting reset control of ocp -2
> > socfpga-dwmac: probe of ff702000.ethernet failed with error -2
> >
> > Reverting the commit : 6f37f7b62baa6a71d7f3f298acb64de51275e724 fix the 
> > issue.
> >
>
> Are you sure? I just booted v4.9.124 and did not see any errors. The
> error should not appear because the commit is using
> devm_reset_control_get_optional().

I'm actually booting 4.9.130. I agree with you that
devm_reset_control_get_optional() should not fail, but checking other
usages of this helper
https://elixir.bootlin.com/linux/v4.9.135/source/drivers/i2c/busses/i2c-mv64xxx.c#L824
https://elixir.bootlin.com/linux/v4.9.135/source/drivers/crypto/sunxi-ss/sun4i-ss-core.c#L259
shows that they don't check for errors except for -EPROBE_DEFER.
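
For reference, the pattern in those drivers is roughly the following (a
sketch close to the sun4i-ss one, not the exact code):

	rstc = devm_reset_control_get_optional(dev, NULL);
	if (IS_ERR(rstc)) {
		if (PTR_ERR(rstc) == -EPROBE_DEFER)
			return -EPROBE_DEFER;
		rstc = NULL;	/* optional reset not present: continue without it */
	}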

Thanks,
Clement

>
> Dinh


Re: [PATCH iproute2] bpf: check map symbol type properly with newer llvm compiler

2018-10-31 Thread Stephen Hemminger
On Mon, 29 Oct 2018 15:32:03 -0700
Yonghong Song  wrote:

> With llvm 7.0 or earlier, the map symbol type is STT_NOTYPE.
>   -bash-4.4$ cat t.c
>   __attribute__((section("maps"))) int g;
>   -bash-4.4$ clang -target bpf -O2 -c t.c
>   -bash-4.4$ readelf -s t.o
> 
>   Symbol table '.symtab' contains 2 entries:
>  Num:Value  Size TypeBind   Vis  Ndx Name
>0:  0 NOTYPE  LOCAL  DEFAULT  UND
>1:  0 NOTYPE  GLOBAL DEFAULT3 g
>   -bash-4.4$
> 
> The following llvm commit enables BPF target to generate
> proper symbol type and size.
>   commit bf6ec206615b9718869d48b4e5400d0c6e3638dd
>   Author: Yonghong Song 
>   Date:   Wed Sep 19 16:04:13 2018 +
> 
>   [bpf] Symbol sizes and types in object file
> 
>   Clang-compiled object files currently don't include the symbol sizes and
>   types.  Some tools however need that information.  For example, 
> ctfconvert
>   uses that information to generate FreeBSD's CTF representation from ELF
>   files.
>   With this patch, symbol sizes and types are included in object files.
> 
>   Signed-off-by: Paul Chaignon 
>   Reported-by: Yutaro Hayakawa 
> 
> Hence, for llvm 8.0.0 (currently trunk), symbol type will be not NOTYPE, but 
> OBJECT.
>   -bash-4.4$ clang -target bpf -O2 -c t.c
>   -bash-4.4$ readelf -s t.o
> 
>   Symbol table '.symtab' contains 3 entries:
>  Num:Value  Size TypeBind   Vis  Ndx Name
>0:  0 NOTYPE  LOCAL  DEFAULT  UND
>1:  0 FILELOCAL  DEFAULT  ABS t.c
>2:  4 OBJECT  GLOBAL DEFAULT3 g
> -bash-4.4$
> 
> This patch makes sure bpf library accepts both NOTYPE and OBJECT types
> of global map symbols.
> 
> Signed-off-by: Yonghong Song 


Looks good, applied.


Re: [PATCH iproute] ss: Actually print left delimiter for columns

2018-10-31 Thread Stephen Hemminger
On Mon, 29 Oct 2018 23:04:25 +0100
Stefano Brivio  wrote:

> While rendering columns, we use a local variable to keep track of the
> field currently being printed, without touching current_field, which is
> used for buffering.
> 
> Use the right pointer to access the left delimiter for the current column,
> instead of always printing the left delimiter for the last buffered field,
> which is usually an empty string.
> 
> This fixes an issue especially visible on narrow terminals, where some
> columns might be displayed without separation.
> 
> Reported-by: YoyPa 
> Fixes: 691bd854bf4a ("ss: Buffer raw fields first, then render them as a 
> table")
> Signed-off-by: Stefano Brivio 
> Tested-by: YoyPa 

Looks good, applied.


Re: [PATCH 3/4] net: macb: Add pm runtime support

2018-10-31 Thread Harini Katakam
Hi Andrew,
On Wed, Oct 31, 2018 at 8:24 PM Andrew Lunn  wrote:
>
> On Wed, Oct 31, 2018 at 09:10:22AM +0530, Harini Katakam wrote:
> > From: Harini Katakam 
> >
> > Add runtime pm functions and move clock handling there.
> > If device is suspended and not a wake device, then return from
> > mdio read/write functions without performing any action because
> > the clocks are not active.
> >
> > Signed-off-by: Shubhrajyoti Datta 
> > Signed-off-by: Harini Katakam 
> > ---
> > Changes from RFC:
> > Updated pm get sync/put sync calls.
> > Removed unecessary clk up in mdio helpers.
>
> This last bit has me worried.
>
> The MDIO bus is a shared bus with a life of its own.  You can have
> multiple PHYs and switches on it. The PHYs can for a different
> Ethernet MAC. Switch drivers will expect to be able to address the
> switch when the interface is down.
>
> The FEC driver did something similar for a while. I had to make MDIO
> read/write runtime PM aware, otherwise i could not access the Ethernet
> switch hanging of its MDIO bus.

Yes, I understand. But I thought I'd handle it when we have a separate
MDIO bus driver. As of now, with macb, I do not see a way, for ex.,
for MAC1 to access MDIO0 when MAC0 is suspended.
However, I will work a bit more on this solution.

With the clk up in the MDIO helpers, I've noticed that there is at least
one PHY status poll that ends up bringing the clocks up after the MAC has
been suspended. Of course, the PHY is suspended immediately after that,
but it just delays the full power-down process.
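
For reference, the kind of runtime-PM-aware accessor Andrew describes
would look roughly like this (only a sketch; macb_mdio_read_hw() is a
stand-in for the existing register access, not actual macb or FEC code):

	static int macb_mdio_read(struct mii_bus *bus, int mii_id, int regnum)
	{
		struct macb *bp = bus->priv;
		int status;

		status = pm_runtime_get_sync(&bp->pdev->dev);
		if (status < 0) {
			pm_runtime_put_noidle(&bp->pdev->dev);
			return status;
		}

		/* clocks are guaranteed to be on here */
		status = macb_mdio_read_hw(bp, mii_id, regnum);

		pm_runtime_mark_last_busy(&bp->pdev->dev);
		pm_runtime_put_autosuspend(&bp->pdev->dev);
		return status;
	}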

Regards,
Harini


Re: [PATCH iproute2] Use libbsd for strlcpy if available

2018-10-31 Thread Stephen Hemminger
On Mon, 29 Oct 2018 10:46:50 +
Luca Boccassi  wrote:

> If libc does not provide strlcpy check for libbsd with pkg-config to
> avoid relying on inline version.
> 
> Signed-off-by: Luca Boccassi 
> ---
> This allows distro maintainers to be able to choose to reduce
> duplication and let this code be maintained in one place, in the
> external library.
> 

I like the idea, but it causes warnings on Debian testing, and maybe other 
distros.

ipnetns.c:2: warning: "_ATFILE_SOURCE" redefined
 #define _ATFILE_SOURCE
 
In file included from /usr/include/x86_64-linux-gnu/bits/libc-header-start.h:33,
 from /usr/include/string.h:26,
 from /usr/include/bsd/string.h:30,
 from :
/usr/include/features.h:326: note: this is the location of the previous 
definition
 # define _ATFILE_SOURCE 1


Please figure out how to handle this and resubmit.  SUSE open build service 
might
also work to test multiple distro's


Re: [RFC PATCH 4/4] ixgbe: add support for extended PHC gettime

2018-10-31 Thread Miroslav Lichvar
On Wed, Oct 31, 2018 at 07:40:03AM -0700, Richard Cochran wrote:
> On Mon, Oct 29, 2018 at 02:31:09PM +0100, Miroslav Lichvar wrote:
> > I think there could be a flag in ptp_system_timestamp, or a parameter
> > of gettimex64(), which would enable/disable reading of the system
> > clock.
> 
> I'm not a fan of functions that change their behavior based on flags
> in their input parameters.

How about separating the PHC timestamp from the ptp_system_timestamp
structure and use NULL to indicate we don't want to read the system
clock? A gettimex64(ptp, ts, NULL) call would be equal to
gettime64(ptp, ts).

struct ptp_system_timestamp {
struct timespec64 pre_ts;
struct timespec64 post_ts;
};

int (*gettimex64)(struct ptp_clock_info *ptp, struct timespec64 *ts,
  struct ptp_system_timestamp *sts);
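
A driver side could then look roughly like this (illustrative sketch only;
foo_read_phc() is a stand-in for the driver's existing hardware clock read):

	static int foo_gettimex64(struct ptp_clock_info *ptp, struct timespec64 *ts,
				  struct ptp_system_timestamp *sts)
	{
		u64 ns;

		if (sts)
			ktime_get_real_ts64(&sts->pre_ts);
		ns = foo_read_phc(ptp);		/* hardware clock read */
		if (sts)
			ktime_get_real_ts64(&sts->post_ts);

		*ts = ns_to_timespec64(ns);
		return 0;
	}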

-- 
Miroslav Lichvar


Re: Ethernet on my CycloneV broke since 4.9.124

2018-10-31 Thread Dinh Nguyen
Hi Clement,

On 10/31/2018 08:01 AM, Clément Péron wrote:
> Hi,
> 
> The patch "net: stmmac: socfpga: add additional ocp reset line for
> Stratix10" introduce in 4.9.124 broke the ethernet on my CycloneV
> board.
> 
> When I boot i have this issue :
> 
> socfpga-dwmac ff702000.ethernet: error getting reset control of ocp -2
> socfpga-dwmac: probe of ff702000.ethernet failed with error -2
> 
> Reverting the commit : 6f37f7b62baa6a71d7f3f298acb64de51275e724 fix the issue.
> 

Are you sure? I just booted v4.9.124 and did not see any errors. The
error should not appear because the commit is using
devm_reset_control_get_optional().

Dinh


Re: [RFC PATCH 4/4] ixgbe: add support for extended PHC gettime

2018-10-31 Thread Richard Cochran
On Mon, Oct 29, 2018 at 02:31:09PM +0100, Miroslav Lichvar wrote:
> I think there could be a flag in ptp_system_timestamp, or a parameter
> of gettimex64(), which would enable/disable reading of the system
> clock.

I'm not a fan of functions that change their behavior based on flags
in their input parameters.

Thanks,
Richard


libbpf build failure on debian:9 with clang

2018-10-31 Thread Arnaldo Carvalho de Melo
  17  40.66 debian:9  : FAIL gcc (Debian 6.3.0-18+deb9u1) 
6.3.0 20170516

The failure was with clang tho:

clang version 3.8.1-24 (tags/RELEASE_381/final)

With:

  gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1) 

it built without any warnings/errors.

  CC   /tmp/build/perf/libbpf.o
libbpf.c:2201:36: error: comparison of constant -22 with expression of type 
'const enum bpf_attach_type' is always false 
[-Werror,-Wtautological-constant-out-of-range-compare]
if (section_names[i].attach_type == -EINVAL)
 ^  ~~~
1 error generated.
  CC   /tmp/build/perf/help.o
mv: cannot stat '/tmp/build/perf/.libbpf.o.tmp': No such file or directory
/git/linux/tools/build/Makefile.build:96: recipe for target 
'/tmp/build/perf/libbpf.o' failed
make[4]: *** [/tmp/build/perf/libbpf.o] Error 1

This is the cset:

commit 956b620fcf0b64de403cd26a56bc41e6e4826ea6
Author: Andrey Ignatov 
Date:   Wed Sep 26 15:24:53 2018 -0700

libbpf: Introduce libbpf_attach_type_by_name



Tests are continuing, so far:

   1  43.53 alpine:3.4: Ok   gcc (Alpine 5.3.0) 5.3.0
   2  58.62 alpine:3.5: Ok   gcc (Alpine 6.2.1) 6.2.1 
20160822
   3  51.62 alpine:3.6: Ok   gcc (Alpine 6.3.0) 6.3.0
   4  51.68 alpine:3.7: Ok   gcc (Alpine 6.4.0) 6.4.0
   5  49.38 alpine:3.8: Ok   gcc (Alpine 6.4.0) 6.4.0
   6  79.07 alpine:edge   : Ok   gcc (Alpine 6.4.0) 6.4.0
   7  63.35 amazonlinux:1 : Ok   gcc (GCC) 4.8.5 20150623 
(Red Hat 4.8.5-28)
   8  59.65 amazonlinux:2 : Ok   gcc (GCC) 7.3.1 20180303 
(Red Hat 7.3.1-5)
   9  47.39 android-ndk:r12b-arm  : Ok   arm-linux-androideabi-gcc 
(GCC) 4.9.x 20150123 (prerelease)
  10  50.64 android-ndk:r15c-arm  : Ok   arm-linux-androideabi-gcc 
(GCC) 4.9.x 20150123 (prerelease)
  11  28.75 centos:5  : Ok   gcc (GCC) 4.1.2 20080704 
(Red Hat 4.1.2-55)
  12  33.26 centos:6  : Ok   gcc (GCC) 4.4.7 20120313 
(Red Hat 4.4.7-23)
  13  43.16 centos:7  : Ok   gcc (GCC) 4.8.5 20150623 
(Red Hat 4.8.5-28)
  14  73.61 clearlinux:latest : FAIL gcc (Clear Linux OS for 
Intel Architecture) 8.2.1 20180502
  15  45.56 debian:7  : Ok   gcc (Debian 4.7.2-5) 4.7.2
  16  45.53 debian:8  : Ok   gcc (Debian 4.9.2-10+deb8u1) 
4.9.2
  17  40.66 debian:9  : FAIL gcc (Debian 6.3.0-18+deb9u1) 
6.3.0 20170516
  18   113.19 debian:experimental   : Ok   gcc (Debian 8.2.0-8) 8.2.0
  19  41.48 debian:experimental-x-arm64   : Ok   aarch64-linux-gnu-gcc 
(Debian 8.2.0-7) 8.2.0
  20  41.51 debian:experimental-x-mips: Ok   mips-linux-gnu-gcc (Debian 
8.2.0-7) 8.2.0
  21  40.09 debian:experimental-x-mips64  : Ok   mips64-linux-gnuabi64-gcc 
(Debian 8.1.0-12) 8.1.0
  22  42.17 debian:experimental-x-mipsel  : Ok   mipsel-linux-gnu-gcc (Debian 
8.2.0-7) 8.2.0
  23  40.02 fedora:20 : Ok   gcc (GCC) 4.8.3 20140911 
(Red Hat 4.8.3-7)
  24  45.47 fedora:21 : Ok   gcc (GCC) 4.9.2 20150212 
(Red Hat 4.9.2-6)
  25  41.64 fedora:22 : Ok   gcc (GCC) 5.3.1 20160406 
(Red Hat 5.3.1-6)
  26  43.60 fedora:23 : Ok   gcc (GCC) 5.3.1 20160406 
(Red Hat 5.3.1-6)
  27  44.04 fedora:24 : Ok   gcc (GCC) 6.3.1 20161221 
(Red Hat 6.3.1-1)
  28  37.21 fedora:24-x-ARC-uClibc: Ok   arc-linux-gcc (ARCompact ISA 
Linux uClibc toolchain 2017.09-rc2) 7.1.1 20170710

The problem with clearlinux is unrelated:

clang-7: error: unknown argument: '-fno-semantic-interposition'
clang-7: error: unsupported argument '4' to option 'flto='
clang-7: error: optimization flag '-ffat-lto-objects' is not supported 
[-Werror,-Wignored-optimization-argument]



Re: [PATCH net 3/3] net/mlx4_en: use netdev_tx_sent_queue_more()

2018-10-31 Thread Eric Dumazet
On Wed, Oct 31, 2018 at 4:37 AM Tariq Toukan  wrote:
>
>
>
> On 30/10/2018 1:25 AM, Eric Dumazet wrote:
> > This patch has two changes :
> >
> > 1) Use netdev_tx_sent_queue_more() for skbs with xmit_more
> > This avoids mangling BQL status, since we only need to
> > take care of it for the last skb of the batch.
> >
> > 2) doorbel only depends on xmit_more and netif_tx_queue_stopped()
> >
> >While not strictly necessary after 1), it is more consistent
> >this way.
> >
> > Signed-off-by: Eric Dumazet 
> > Cc: Tariq Toukan 
> > ---
> >   drivers/net/ethernet/mellanox/mlx4/en_tx.c | 10 --
> >   1 file changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c 
> > b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > index 
> > 1857ee0f0871d48285a6d3711f7c3e9a1e08a05f..3acce02ade6a115881ecd72e4710e332d3f380cb
> >  100644
> > --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > @@ -1006,7 +1006,6 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct 
> > net_device *dev)
> >   ring->packets++;
> >   }
> >   ring->bytes += tx_info->nr_bytes;
> > - netdev_tx_sent_queue(ring->tx_queue, tx_info->nr_bytes);
> >   AVG_PERF_COUNTER(priv->pstats.tx_pktsz_avg, skb->len);
> >
> >   if (tx_info->inl)
> > @@ -1044,7 +1043,14 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct 
> > net_device *dev)
> >   netif_tx_stop_queue(ring->tx_queue);
> >   ring->queue_stopped++;
> >   }
> > - send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue);
> > +
> > + if (skb->xmit_more) {
> > + netdev_tx_sent_queue_more(ring->tx_queue, tx_info->nr_bytes);
> > + send_doorbell = netif_tx_queue_stopped(ring->tx_queue);
> > + } else {
> > + netdev_tx_sent_queue(ring->tx_queue, tx_info->nr_bytes);
> > + send_doorbell = true;
> > + }
> >
> >   real_size = (real_size / 16) & 0x3f;
> >
> >
>
> The drivers' code template would be nicer if we unified the two functions
> netdev_tx_sent_queue/netdev_tx_sent_queue_more into a single one with a
> parameter.
>
> Currently, all drivers that want to benefit from this optimization
> will have to repeat these if/else blocks.

I can add a helper, sure, but I cannot change drivers that I am not
able to test.

So I cannot change the existing helper.

This patch series shows the problem and fixes one driver; a common
helper can be added when a second driver is updated, there is no hurry.


Re: [PATCH net 1/3] net: bql: add netdev_tx_sent_queue_more() helper

2018-10-31 Thread Eric Dumazet
On Wed, Oct 31, 2018 at 4:30 AM Tariq Toukan  wrote:
>
>
>
> On 30/10/2018 1:25 AM, Eric Dumazet wrote:
> > When qdisc_run() tries to use BQL budget to bulk-dequeue a batch
> > of packets, GSO can later transform this list in another list
> > of skbs, and each skb is sent to device ndo_start_xmit(),
> > one at a time, with skb->xmit_more being set to one but
> > for last skb.
> >
> > Problem is that very often, BQL limit is hit in the middle of
> > the packet train, forcing dev_hard_start_xmit() to stop the
> > bulk send and requeue the end of the list.
> >
> > BQL role is to avoid head of line blocking, making sure
> > a qdisc can deliver high priority packets before low priority ones.
> >
> > But there is no way requeued packets can be bypassed by fresh
> > packets in the qdisc.
> >
> > Aborting the bulk send increases TX softirqs, and hot cache
> > lines (after skb_segment()) are wasted.
> >
> > Note that for TSO packets, we never split a packet in the middle
> > because of BQL limit being hit.
> >
> > Drivers should be able to update BQL counters without
> > flipping/caring about BQL status, if the current skb
> > has xmit_more set.
> >
> > Upper layers are ultimately responsible to stop sending another
> > packet train when BQL limit is hit.
> >
> > Code template in a driver might look like the following :
> >
> >   if (skb->xmit_more) {
> >   netdev_tx_sent_queue_more(tx_queue, nr_bytes);
> >   send_doorbell = netif_tx_queue_stopped(tx_queue);
> >   } else {
> >   netdev_tx_sent_queue(tx_queue, nr_bytes);
> >   send_doorbell = true;
> >   }
> >
>
> Hi Eric,
> Nice optimization.
>
> I thought of another way of implementing it, by just extending the
> existing netdev_tx_sent_queue function with a new xmit_more parameter
> that the driver passes.
> Something like this:
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index d837dad24b4c..feb9cbcb5759 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -3129,12 +3129,12 @@ static inline void
> netdev_txq_bql_complete_prefetchw(struct netdev_queue *dev_qu
>   }
>
>   static inline void netdev_tx_sent_queue(struct netdev_queue *dev_queue,
> -   unsigned int bytes)
> +   unsigned int bytes, bool more)
>   {
>   #ifdef CONFIG_BQL
>  dql_queued(&dev_queue->dql, bytes);
>
> -   if (likely(dql_avail(&dev_queue->dql) >= 0))
> +   if (more || likely(dql_avail(&dev_queue->dql) >= 0))
>  return;
>
>  set_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state);
>
>
> This unifies and simplifies both the stack and the driver code, as the newly
> suggested function netdev_tx_sent_queue_more becomes a special case
> of the existing one.

Yes I thought of this, but it is too risky/invasive and I prefer
adding an opt-in mechanism
so that patch authors can test their patch.

Then when/if all drivers are updated, we can remove the legacy stuff
or rename everything.

mlx4 was the only NIC I could reasonably test myself.

Another suggestion from Willem is to add the following helper, returning
a boolean (doorbell needed)

+static inline bool __netdev_tx_sent_queue(struct netdev_queue *dev_queue,
+ unsigned int bytes, bool xmit_more)
+{
+   if (xmit_more) {
+   netdev_tx_sent_queue_more(dev_queue, bytes);
+   return netif_tx_queue_stopped(dev_queue);
+   } else {
+   netdev_tx_sent_queue(dev_queue, bytes);
+   return true;
+   }
+}
+

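To make the driver-side effect of such a helper concrete, here is a minimal
sketch of the tail of an ndo_start_xmit() routine, assuming a
__netdev_tx_sent_queue() with the semantics shown above ends up in
<linux/netdevice.h>; the function name example_xmit_tail() and its parameters
are purely illustrative and do not come from any real driver:

#include <linux/netdevice.h>	/* netdev_tx_t, __netdev_tx_sent_queue() (assumed) */
#include <linux/skbuff.h>	/* struct sk_buff */

static netdev_tx_t example_xmit_tail(struct sk_buff *skb,
				     struct netdev_queue *txq,
				     unsigned int nr_bytes)
{
	bool send_doorbell;

	/* One call covers both cases: for skbs with xmit_more set it only
	 * accounts the bytes and requests a doorbell when the queue is
	 * already stopped; for the last skb of the batch it does the full
	 * BQL accounting and always requests a doorbell.
	 */
	send_doorbell = __netdev_tx_sent_queue(txq, nr_bytes, skb->xmit_more);

	if (send_doorbell) {
		/* ring the hardware doorbell here */
	}

	return NETDEV_TX_OK;
}

With such a helper, the if/else template from the mlx4 patch in the sibling
thread collapses into a single call at the doorbell decision point.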

Re: [PATCH net] net/mlx4_en: add a missing include

2018-10-31 Thread Abdul Haleem
On Tue, 2018-10-30 at 00:18 -0700, Eric Dumazet wrote:
> Abdul Haleem reported a build error on ppc :
> 
> drivers/net/ethernet/mellanox/mlx4/en_rx.c:582:18: warning: `struct
> iphdr` declared inside parameter list [enabled by default]
>struct iphdr *iph)
>   ^
> drivers/net/ethernet/mellanox/mlx4/en_rx.c:582:18: warning: its scope is
> only this definition or declaration, which is probably not what you want
> [enabled by default]
> drivers/net/ethernet/mellanox/mlx4/en_rx.c: In function
> get_fixed_ipv4_csum:
> drivers/net/ethernet/mellanox/mlx4/en_rx.c:586:20: error: dereferencing
> pointer to incomplete type
>   __u8 ipproto = iph->protocol;
> ^
> 
> Fixes: 55469bc6b577 ("drivers: net: remove  inclusion when 
> not needed")
> Signed-off-by: Eric Dumazet 
> Reported-by: Abdul Haleem 
> ---
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c 
> b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 
> 5a6d0919533d6e0e619927abd753c5d07ed95dac..db00bf1c23f5ad31d64652ddc8bee32e2e7534c8
>  100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -43,6 +43,7 @@
>  #include 
>  #include 
> 
> +#include 
>  #if IS_ENABLED(CONFIG_IPV6)
>  #include 
>  #endif

Tested-by: Abdul Haleem 
-- 
Regard's

Abdul Haleem
IBM Linux Technology Centre





Re: [PATCH v4.14-stable] sch_netem: restore skb->dev after dequeuing from the rbtree

2018-10-31 Thread Sasha Levin

On Tue, Oct 30, 2018 at 12:12:51PM -0700, Eduardo Valentin wrote:

Greg,

On Thu, Oct 18, 2018 at 03:43:48PM -0700, David Miller wrote:

From: Christoph Paasch 
Date: Thu, 18 Oct 2018 13:38:40 -0700

> Upstream commit bffa72cf7f9d ("net: sk_buff rbnode reorg") got
> backported as commit 6b921536f170 ("net: sk_buff rbnode reorg") into the
> v4.14.x-tree.
>
> However, the backport does not include the changes in sch_netem.c
>
> We need these, as otherwise the skb->dev pointer is not set when
> dequeueing from the netem rbtree, resulting in a panic:
 ...
> Fixes: 6b921536f170 ("net: sk_buff rbnode reorg")
> Cc: Stephen Hemminger 
> Cc: Eric Dumazet 
> Cc: Soheil Hassas Yeganeh 
> Cc: Wei Wang 
> Cc: Willem de Bruijn 
> Signed-off-by: Christoph Paasch 
> ---
>
> Notes:
> This patch should only make it into v4.14-stable as that's the only 
branch where
> the offending commit has been backported to.

Greg, please queue up.


Are you planning to queue this one?

Looks to me like it was a miss in the backport.

It seems that the backport touched different files and missed the change
to net/sched/sch_netem.c. So, to me, even if this patch does not strictly
follow the stable rules, since it is not a patch that exists upstream, it
seems to be a needed change, even if it is specific to stable linux-4.14.y.


I've queued this patch for 4.14.

--
Thanks,
Sasha
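
For background on why the missing sch_netem.c hunk causes a panic: after the
rbnode reorg, skb->rbnode overlays the skb->next/prev/dev fields, so an skb
pulled out of netem's time-ordered rbtree comes back with a clobbered
skb->dev. A minimal sketch of the kind of fix-up the netem_dequeue() path
needs (an illustrative fragment only; the exact context in the 4.14 backport
may differ):

	struct sk_buff *skb = rb_entry(p, struct sk_buff, rbnode);

	rb_erase(p, &q->t_root);

	/* skb->dev shares storage with skb->rbnode after the reorg,
	 * so restore it before handing the skb back to the stack.
	 */
	skb->dev = qdisc_dev(sch);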


Ethernet on my CycloneV broke since 4.9.124

2018-10-31 Thread Clément Péron
Hi,

The patch "net: stmmac: socfpga: add additional ocp reset line for
Stratix10" introduce in 4.9.124 broke the ethernet on my CycloneV
board.

When I boot, I get this error:

socfpga-dwmac ff702000.ethernet: error getting reset control of ocp -2
socfpga-dwmac: probe of ff702000.ethernet failed with error -2

Reverting commit 6f37f7b62baa6a71d7f3f298acb64de51275e724 fixes the issue.

Thanks,
Clement
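
The -2 in the probe log is -ENOENT: the new code requests an "ocp" reset line
that the CycloneV device tree does not describe, so probe now fails on boards
that used to work. One possible direction (a sketch only, not the actual fix;
it assumes the extra reset is genuinely optional on CycloneV, that the kernel's
optional reset API returns NULL for a missing DT entry, and that the field and
reset names below match the Stratix10 patch, which may not be the case) is to
look the reset up with the optional API:

	/* Sketch for socfpga_dwmac_probe(): a missing "stmmaceth-ocp" DT
	 * entry yields a NULL reset control instead of -ENOENT, so boards
	 * without it keep probing.
	 */
	dwmac->stmmac_ocp_rst = devm_reset_control_get_optional(dev, "stmmaceth-ocp");
	if (IS_ERR(dwmac->stmmac_ocp_rst))
		return PTR_ERR(dwmac->stmmac_ocp_rst);

	/* reset_control_deassert() is a no-op on a NULL reset control. */
	reset_control_deassert(dwmac->stmmac_ocp_rst);

Alternatively, the CycloneV device tree could be extended to describe the ocp
reset line; either way, the regression points at the non-optional reset lookup.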


Re: [PATCH net] net: sched: Remove TCA_OPTIONS from policy

2018-10-31 Thread Marco Berizzi
> On 26 October 2018 at 20:19, Cong Wang  wrote:
> 
> On Fri, Oct 26, 2018 at 4:35 AM Marco Berizzi  wrote:
> 
> > Apologies for bothering you again.
> > I applied your patch to 4.19, but after issuing this
> > command:
> > 
> > root@Calimero:~# tc qdisc add dev eth0 root handle 1:0 hfsc default 1
> > root@Calimero:~# ping 10.81.104.1
> > PING 10.81.104.1 (10.81.104.1) 56(84) bytes of data.
> > ^C
> > --- 10.81.104.1 ping statistics ---
> > 2 packets transmitted, 0 received, 100% packet loss, time 1001ms
> > 
> > I'm losing ipv4 connectivity.
> > If I remove the qdisc everything is going to work again:
> 
> Did this really work before?
> 
> You specify a default class without adding it, so the packets are dropped.
> 
> How would you expect this to work? :)

:-) yes indeed. Apologies for the noise.


  1   2   >