Re: [RFC PATCH v3 04/10] ip: factor out protocol delivery helper
On 2018-10-30 11:24, Paolo Abeni wrote:
> So that we can re-use it at the UDP lavel in a later patch

Hi Paolo,

Minor queries - Should it be "level" instead of "lavel"? Similar comment for the ipv6 patch as well.

> Signed-off-by: Paolo Abeni
> ---
>  net/ipv4/ip_input.c | 73 ++---
>  1 file changed, 36 insertions(+), 37 deletions(-)
>
> diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
> index 35a786c0aaa0..72250b4e466d 100644
> --- a/net/ipv4/ip_input.c
> +++ b/net/ipv4/ip_input.c
> @@ -188,51 +188,50 @@ bool ip_call_ra_chain(struct sk_buff *skb)
>  	return false;
>  }
>
> -static int ip_local_deliver_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
> +void ip_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int protocol)

Would it be better if this function was declared in include/net/ip.h & include/net/ipv6.h, rather than in net/ipv4/udp.c & net/ipv6/udp.c as in the patch "udp: cope with UDP GRO packet misdirection"?

diff --git a/include/net/ip.h b/include/net/ip.h
index 72593e1..3d7fdb4 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -717,4 +717,6 @@ static inline void ip_cmsg_recv(struct msghdr *msg, struct sk_buff *skb)
 int rtm_getroute_parse_ip_proto(struct nlattr *attr, u8 *ip_proto,
 				struct netlink_ext_ack *extack);
+void ip_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int proto);
+
 #endif /* _IP_H */
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 8296505..4d4d2cfe 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -1101,4 +1101,8 @@ int ipv6_sock_mc_join_ssm(struct sock *sk, int ifindex,
 			  const struct in6_addr *addr, unsigned int mode);
 int ipv6_sock_mc_drop(struct sock *sk, int ifindex,
 		      const struct in6_addr *addr);
+
+void ip6_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int nexthdr,
+			      bool have_final);
+
 #endif /* _NET_IPV6_H */

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
Re: [PATCH net v6] net/ipv6: Add anycast addresses to a global hashtable
On Thu, 1 Nov 2018 00:14:38 + Jeff Barnhill <0xeff...@gmail.com> wrote:
> diff --git a/include/net/addrconf.h b/include/net/addrconf.h
> index 14b789a123e7..799af1a037d1 100644
> --- a/include/net/addrconf.h
> +++ b/include/net/addrconf.h
> @@ -317,6 +317,8 @@ bool ipv6_chk_acast_addr(struct net *net, struct net_device *dev,
>  			 const struct in6_addr *addr);
>  bool ipv6_chk_acast_addr_src(struct net *net, struct net_device *dev,
>  			     const struct in6_addr *addr);
> +int anycast_init(void);
> +void anycast_cleanup(void);

One minor nit that should be fixed: to avoid any potential naming conflicts, please prefix all ipv6 global symbols with ipv6_
Re: [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets
On 31/10/18 - 17:53:22, Eric Dumazet wrote: > On 10/31/2018 04:26 PM, Christoph Paasch wrote: > > Implementations of Quic might want to create a separate socket for each > > Quic-connection by creating a connected UDP-socket. > > > > Nice proposal, but I doubt a QUIC server can afford having one UDP socket per > connection ? > > It would add a huge overhead in term of memory usage in the kernel, > and lots of epoll events to manage (say a QUIC server with one million flows, > receiving > very few packets per second per flow) > > Maybe you could elaborate on the need of having one UDP socket per connection. I let Leif chime in on that as the ask came from him. Leif & his team are implementing Quic in the Apache Traffic Server. One advantage I can see is that it would allow to benefit from fq_pacing as one could set sk_pacing_rate simply on the socket. That way there is no need to implement the pacing in the user-space anymore. > > To achieve that on the server-side, a "master-socket" needs to wait for > > incoming new connections and then creates a new socket that will be a > > connected UDP-socket. To create that latter one, the server needs to > > first bind() and then connect(). However, after the bind() the server > > might already receive traffic on that new socket that is unrelated to the > > Quic-connection at hand. Only after the connect() a full 4-tuple match > > is happening. So, one can't really create this kind of a server that has > > a connected UDP-socket per Quic connection. > > > > So, what is needed is an "atomic bind & connect" that basically > > prevents any incoming traffic until the connect() call has been issued > > at which point the full 4-tuple is known. > > > > > > This patchset implements this functionality and exposes a socket-option > > to do this. 
> > > > Usage would be: > > > > int fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP); > > > > int val = 1; > > setsockopt(fd, SOL_SOCKET, SO_DELAYED_BIND, &val, sizeof(val)); > > > > bind(fd, (struct sockaddr *)&src, sizeof(src)); > > > > /* At this point, incoming traffic will never match on this socket */ > > > > connect(fd, (struct sockaddr *)&dst, sizeof(dst)); > > > > /* Only now incoming traffic will reach the socket */ > > > > > > > > There is literally an infinite number of ways on how to implement it, > > which is why I first send it out as an RFC. With this approach here I > > chose the least invasive one, just preventing the match on the incoming > > path. > > > > > > The reason for choosing a SOL_SOCKET socket-option and not at the > > SOL_UDP-level is because that functionality actually could be useful for > > other protocols as well. E.g., TCP wants to better use the full 4-tuple > > space > > by binding to the source-IP and the destination-IP at the same time. > > Passive TCP flows can not benefit from this idea. > > Active TCP flows can already do that, I do not really understand what you are > suggesting. What we had here is that we wanted to let a server initiate more than 64K connections *while* binding also to a source-IP. With TCP the bind() would then pick a source-port and we ended up hitting the 64K limit. If we could do an atomic "bind + connect", source-port selection could ensure that the 4-tuple is unique. Or has something changed in recent times that allows to use the 4-tuple matching when doing this with TCP? Christoph
Re: [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets
On 10/31/2018 10:08 PM, Eric Dumazet wrote:
> Our plan is to use EDT model for UDP packets, so that we can
> still use one (not connected) UDP socket per cpu/thread.
>
> We added in linux-4.20 the EDT model for TCP, and I intend to add the
> remaining part for sch_fq for 4.21.
>
> UDP can use an ancillary message (SCM_TXTIME) to attach to the skb (which can
> be a GSO btw) a tstamp, and pacing will happen just fine.

List of EDT patches in reverse order if you want to take a look:

3f80e08f40cdb308589a49077c87632fa4508b21 tcp: add tcp_reset_xmit_timer() helper
4c16128b6271e70c8743178e90cccee147858503 net: loopback: clear skb->tstamp before netif_rx()
79861919b8896e14b8e5707242721f2312c57ae4 tcp: fix TCP_REPAIR xmit queue setup
825e1c523d5000f067a1614e4a66bb282a2d373c tcp: cdg: use tcp high resolution clock cache
864e5c090749448e879e86bec06ee396aa2c19c5 tcp: optimize tcp internal pacing
7baf33bdac37da65ddce3adf4daa8c7805802174 net_sched: sch_fq: no longer use skb_is_tcp_pure_ack()
a7a2563064e963bc5e3f39f533974f2730c0ff56 tcp: mitigate scheduling jitter in EDT pacing model
76a9ebe811fb3d0605cb084f1ae6be5610541865 net: extend sk_pacing_rate to unsigned long
5f6188a8003d080e3753b8f14f4a5a2325ae1ff6 tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh
fb420d5d91c1274d5966917725e71f27ed092a85 tcp/fq: move back to CLOCK_MONOTONIC
90caf67b01fabdd51b6cdeeb23b29bf73901df90 net_sched: sch_fq: remove dead code dealing with retransmits
c092dd5f4a7f4e4dbbcc8cf2e50b516bf07e432f tcp: switch tcp_internal_pacing() to tcp_wstamp_ns
ab408b6dc7449c0f791e9e5f8de72fa7428584f2 tcp: switch tcp and sch_fq to new earliest departure time model
fd2bca2aa7893586887b2370e90e85bd0abc805e tcp: switch internal pacing timer to CLOCK_TAI
d3edd06ea8ea9e03de6567fda80b8be57e21a537 tcp: provide earliest departure time in skb->tstamp
9799ccb0e984a5c1311b22a212e7ff96e8b736de tcp: add tcp_wstamp_ns socket field
142537e419234c396890a22806b8644dce21b132 net_sched: sch_fq: switch to CLOCK_TAI
2fd66ffba50716fc5ab481c48db643af3bda2276 tcp: introduce tcp_skb_timestamp_us() helper
72b0094f918294e6cb8cf5c3b4520d928fbb1a57 tcp: switch tcp_clock_ns() to CLOCK_TAI base
Re: [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets
On 10/31/2018 08:50 PM, Christoph Paasch wrote: > On 31/10/18 - 17:53:22, Eric Dumazet wrote: >> On 10/31/2018 04:26 PM, Christoph Paasch wrote: >>> Implementations of Quic might want to create a separate socket for each >>> Quic-connection by creating a connected UDP-socket. >>> >> >> Nice proposal, but I doubt a QUIC server can afford having one UDP socket >> per connection ? >> >> It would add a huge overhead in term of memory usage in the kernel, >> and lots of epoll events to manage (say a QUIC server with one million >> flows, receiving >> very few packets per second per flow) >> >> Maybe you could elaborate on the need of having one UDP socket per >> connection. > > I let Leif chime in on that as the ask came from him. Leif & his team are > implementing Quic in the Apache Traffic Server. > > > One advantage I can see is that it would allow to benefit from fq_pacing as > one could set sk_pacing_rate simply on the socket. That way there is no need > to implement the pacing in the user-space anymore. Our plan is to use EDT model for UDP packets, so that we can still use one (not connected) UDP socket per cpu/thread. We added in linux-4.20 the EDT model for TCP, and I intend to add the remaining part for sch_fq for 4.21. UDP can use an ancillary message (SCM_TXTIME) to attach to the skb (which can be a GSO btw) a tstamp, and pacing will happen just fine.
Re: [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets
> On Oct 31, 2018, at 10:04 PM, Eric Dumazet wrote:
>
>> On 10/31/2018 08:50 PM, Christoph Paasch wrote:
>>
>> What we had here is that we wanted to let a server initiate more than 64K
>> connections *while* binding also to a source-IP.
>> With TCP the bind() would then pick a source-port and we ended up hitting the
>> 64K limit. If we could do an atomic "bind + connect", source-port selection
>> could ensure that the 4-tuple is unique.
>>
>> Or has something changed in recent times that allows to use the 4-tuple
>> matching when doing this with TCP?
>
> Well, yes, although it is not really recent (this came with linux-4.2)
>
> You can now bind to an address only, and let the sport being automatically
> chosen at connect()

Oh, I didn't know about this socket option. Thanks :-)

Christoph

> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=90c337da1524863838658078ec34241f45d8394d
Re: [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets
On 10/31/2018 08:50 PM, Christoph Paasch wrote:
> What we had here is that we wanted to let a server initiate more than 64K
> connections *while* binding also to a source-IP.
> With TCP the bind() would then pick a source-port and we ended up hitting the
> 64K limit. If we could do an atomic "bind + connect", source-port selection
> could ensure that the 4-tuple is unique.
>
> Or has something changed in recent times that allows to use the 4-tuple
> matching when doing this with TCP?

Well, yes, although it is not really recent (this came with linux-4.2).

You can now bind to an address only, and let the source port be automatically
chosen at connect():

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=90c337da1524863838658078ec34241f45d8394d
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
On 10/31/18 3:57 PM, Paweł Staszewski wrote:
> Hi
>
> So maybe someone will be interested how the linux kernel handles normal
> traffic (not pktgen :) )
>
> Server HW configuration:
>
> CPU: Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
> NICs: 2x 100G Mellanox ConnectX-4 (connected to x16 pcie 8GT)
>
> Server software:
>
> FRR - as routing daemon
> enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS binded to local numa node)
> enp175s0f1 (100G) - 343 vlans to clients (28 RSS binded to local numa node)
>
> Maximum traffic that server can handle:
>
> Bandwidth:
>
> bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
> input: /proc/net/dev type: rate
>   iface             Rx          Tx        Total
>   ==============================================
>   enp175s0f1: 28.51 Gb/s  37.24 Gb/s   65.74 Gb/s
>   enp175s0f0: 38.07 Gb/s  28.44 Gb/s   66.51 Gb/s
>   ----------------------------------------------
>   total:      66.58 Gb/s  65.67 Gb/s  132.25 Gb/s
>
> Packets per second:
>
> bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
> input: /proc/net/dev type: rate
>   iface                 Rx              Tx            Total
>   ==========================================================
>   enp175s0f1: 5248589.00 P/s  3486617.75 P/s   8735207.00 P/s
>   enp175s0f0: 3557944.25 P/s  5232516.00 P/s   8790460.00 P/s
>   ----------------------------------------------------------
>   total:      8806533.00 P/s  8719134.00 P/s  17525668.00 P/s
>
> After reaching those limits the nics on the upstream side (more RX traffic)
> start to drop packets.
>
> I just don't understand why the server can't handle more bandwidth
> (~40Gbit/s is the limit where all cpus are 100% util) - where pps on the RX
> side are increasing.
>
> Was thinking that maybe I reached some pcie x16 limit - but x16 8GT is
> 126Gbit - and also when testing with pktgen I can reach more bw and pps
> (like 4x more compared to normal internet traffic)
>
> And wondering if there is something that can be improved here.

This is mainly a forwarding use case? Seems so based on the perf report.

I suspect forwarding with XDP would show pretty good improvement. You need
the vlan changes I have queued up though.
Re: [PATCH net v5] net/ipv6: Add anycast addresses to a global hashtable
On 10/31/18 6:02 PM, Jeff Barnhill wrote: > I'll follow this email with a new patch using ifacaddr6 instead of > creating a new struct. I ended up using fib6_nh.nh_dev to get the net, > instead of adding a back pointer to idev. It seems that idev was > recently removed in lieu of this, so if this is incorrect, please let > me know. Hopefully, I got the locking correct. That's correct. Make sure that the anycast code can not be accessed for reject routes which will not have a device set. Should be ok, but double check.
[PATCH v1 net] net: dsa: microchip: initialize mutex before use
From: Tristram Ha

Initialize mutex before use. Avoid kernel complaint when
CONFIG_DEBUG_LOCK_ALLOC is enabled.

Fixes: b987e98e50ab90e5 ("dsa: add DSA switch driver for Microchip KSZ9477")
Signed-off-by: Tristram Ha
---
v1
- Remove comment

 drivers/net/dsa/microchip/ksz_common.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/net/dsa/microchip/ksz_common.c b/drivers/net/dsa/microchip/ksz_common.c
index 54e0ca6..86b6464 100644
--- a/drivers/net/dsa/microchip/ksz_common.c
+++ b/drivers/net/dsa/microchip/ksz_common.c
@@ -1117,11 +1117,6 @@ static int ksz_switch_init(struct ksz_device *dev)
 {
 	int i;
 
-	mutex_init(&dev->reg_mutex);
-	mutex_init(&dev->stats_mutex);
-	mutex_init(&dev->alu_mutex);
-	mutex_init(&dev->vlan_mutex);
-
 	dev->ds->ops = &ksz_switch_ops;
 
 	for (i = 0; i < ARRAY_SIZE(ksz_switch_chips); i++) {
@@ -1206,6 +1201,11 @@ int ksz_switch_register(struct ksz_device *dev)
 	if (dev->pdata)
 		dev->chip_id = dev->pdata->chip_id;
 
+	mutex_init(&dev->reg_mutex);
+	mutex_init(&dev->stats_mutex);
+	mutex_init(&dev->alu_mutex);
+	mutex_init(&dev->vlan_mutex);
+
 	if (ksz_switch_detect(dev))
 		return -EINVAL;
-- 
1.9.1
Re: [PATCH iproute2 net-next 0/3] ss: Allow selection of columns to be displayed
[ sorry, too many distractions and I forgot to respond ] On 10/30/18 11:34 AM, Stefano Brivio wrote: > On Tue, 30 Oct 2018 10:34:45 -0600 > David Ahern wrote: > >> A more flexible approach is to use format strings to allow users to >> customize the output order and whitespace as well. So for ss and your >> column list (winging it here): >> >> netid = %N >> state = %S >> recv Q = %Qr >> send Q = %Qs >> local address = %Al >> lport port = %Pl >> remote address = %Ar >> remote port= %Pr >> process data = %p >> ... >> >> then a format string could be: "%S %Qr %Qs %Al:%Pl %Ar:%Pr %p\n" > > I like the idea indeed, but I see two issues with ss: > > - the current column abstraction is rather lightweight, things are > already buffered in the defined column order so we don't have to jump > back and forth in the buffer while rendering. Doing that needs some > extra care to avoid a performance hit, but it's probably doable, I > can put that on my to-do list The ss command is always a pain; it's much better in newer releases but I am always having to adjust terminal width. > > - how would you model automatic spacing in a format string? Should we > support width specifiers? Disable automatic spacing if a format > string is given? It might even make sense to allow partial automatic Follow the format string for whitespace and order, and yes, on the width specifiers if possible. > spacing with a special character in the format string, that is: > > "%S.%Qr.%Qs %Al:%Pl %Ar:%Pr %p\n" > > would mean "align everything to the right, distribute remaining > whitespace between %S, %Qr and %Qs". But it looks rather complicated > at a glance. > My concern here is that once this goes in for 1 command, the others in iproute2 need to follow suit - meaning same syntax style for all commands. Given that I'd prefer we get a reasonable consensus on syntax that will work across commands -- ss, ip, tc. 
If it is as simple as column names with a fixed order, that is fine but just give proper consideration given the impact. The 'ip' syntax for example gets ugly quick with the various link types and their options. We don't need to allow every detail of each link type to be customized, but there are common attributes for links (e.g., mtu, ifindex, link flags, lladdr), addresses, and link types such as bridges and bonds where we can improve the amount of data thrown at a user -- a better, more customizable version of what the brief option targeted.
Re: [PATCH net] openvswitch: Fix push/pop ethernet validation
From: Jaime Caamaño Ruiz Date: Wed, 31 Oct 2018 18:52:03 +0100 > When there are both pop and push ethernet header actions among the > actions to be applied to a packet, an unexpected EINVAL (Invalid > argument) error is obtained. This is due to mac_proto not being reset > correctly when those actions are validated. > > Reported-at: > https://mail.openvswitch.org/pipermail/ovs-discuss/2018-October/047554.html > Fixes: 91820da6ae85 ("openvswitch: add Ethernet push and pop actions") > Signed-off-by: Jaime Caamaño Ruiz Applied and queued up for -stable.
Re: [net 0/8][pull request] Intel Wired LAN Driver Updates 2018-10-31
From: Jeff Kirsher
Date: Wed, 31 Oct 2018 12:42:46 -0700

> This series contains a various collection of fixes.
 ...
> The following are changes since commit a6b3a3fa042343e29ffaf9169f5ba3c819d4f9a2:
>   net: mvpp2: Fix affinity hint allocation
> and are available in the git repository at:
>   git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue 10GbE

Pulled, thanks Jeff.
Re: [PATCH bpf 2/4] bpf: don't set id on after map lookup with ptr_to_map_val return
On Thu, Nov 01, 2018 at 12:05:53AM +0100, Daniel Borkmann wrote: > In the verifier there is no such semantics where registers with > PTR_TO_MAP_VALUE type have an id assigned to them. This is only > used in PTR_TO_MAP_VALUE_OR_NULL and later on nullified once the > test against NULL has been pattern matched and type transformed > into PTR_TO_MAP_VALUE. > > Fixes: 3e6a4b3e0289 ("bpf/verifier: introduce BPF_PTR_TO_MAP_VALUE") > Signed-off-by: Daniel Borkmann > Cc: Roman Gushchin > Acked-by: Alexei Starovoitov Looks good to me. Acked-by: Roman Gushchin Thanks!
Re: [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets
On 10/31/2018 04:26 PM, Christoph Paasch wrote: > Implementations of Quic might want to create a separate socket for each > Quic-connection by creating a connected UDP-socket. > Nice proposal, but I doubt a QUIC server can afford having one UDP socket per connection ? It would add a huge overhead in term of memory usage in the kernel, and lots of epoll events to manage (say a QUIC server with one million flows, receiving very few packets per second per flow) Maybe you could elaborate on the need of having one UDP socket per connection. > To achieve that on the server-side, a "master-socket" needs to wait for > incoming new connections and then creates a new socket that will be a > connected UDP-socket. To create that latter one, the server needs to > first bind() and then connect(). However, after the bind() the server > might already receive traffic on that new socket that is unrelated to the > Quic-connection at hand. Only after the connect() a full 4-tuple match > is happening. So, one can't really create this kind of a server that has > a connected UDP-socket per Quic connection. > > So, what is needed is an "atomic bind & connect" that basically > prevents any incoming traffic until the connect() call has been issued > at which point the full 4-tuple is known. > > > This patchset implements this functionality and exposes a socket-option > to do this. > > Usage would be: > > int fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP); > > int val = 1; > setsockopt(fd, SOL_SOCKET, SO_DELAYED_BIND, &val, sizeof(val)); > > bind(fd, (struct sockaddr *)&src, sizeof(src)); > > /* At this point, incoming traffic will never match on this socket */ > > connect(fd, (struct sockaddr *)&dst, sizeof(dst)); > > /* Only now incoming traffic will reach the socket */ > > > > There is literally an infinite number of ways on how to implement it, > which is why I first send it out as an RFC. 
With this approach here I > chose the least invasive one, just preventing the match on the incoming > path. > > > The reason for choosing a SOL_SOCKET socket-option and not at the > SOL_UDP-level is because that functionality actually could be useful for > other protocols as well. E.g., TCP wants to better use the full 4-tuple space > by binding to the source-IP and the destination-IP at the same time. Passive TCP flows can not benefit from this idea. Active TCP flows can already do that, I do not really understand what you are suggesting.
Re: pull-request: bpf 2018-11-01
From: Daniel Borkmann Date: Thu, 1 Nov 2018 01:28:41 +0100 > The following pull-request contains BPF updates for your *net* tree. > > The main changes are: > > 1) Fix tcp_bpf_recvmsg() to return -EAGAIN instead of 0 in non-blocking >case when no data is available yet, from John. > > 2) Fix a compilation error in libbpf_attach_type_by_name() when compiled >with clang 3.8, from Andrey. > > 3) Fix a partial copy of map pointer on scalar alu and remove id >generation for RET_PTR_TO_MAP_VALUE return types, from Daniel. > > 4) Add unlimited memlock limit for kernel selftest's flow_dissector_load >program, from Yonghong. > > 5) Fix ping for some BPF shell based kselftests where distro does not >ship "ping -6" anymore, from Li. > > Please consider pulling these changes from: > > git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git Pulled, thanks Daniel.
pull-request: bpf 2018-11-01
Hi David,

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix tcp_bpf_recvmsg() to return -EAGAIN instead of 0 in non-blocking
   case when no data is available yet, from John.

2) Fix a compilation error in libbpf_attach_type_by_name() when compiled
   with clang 3.8, from Andrey.

3) Fix a partial copy of map pointer on scalar alu and remove id
   generation for RET_PTR_TO_MAP_VALUE return types, from Daniel.

4) Add unlimited memlock limit for kernel selftest's flow_dissector_load
   program, from Yonghong.

5) Fix ping for some BPF shell based kselftests where distro does not
   ship "ping -6" anymore, from Li.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Thanks a lot!

The following changes since commit a6b3a3fa042343e29ffaf9169f5ba3c819d4f9a2:

  net: mvpp2: Fix affinity hint allocation (2018-10-30 11:34:41 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

for you to fetch changes up to dfeb8f4c9692fd5e6c3eef19c2e4ae5338dbdb01:

  Merge branch 'verifier-fixes' (2018-10-31 16:53:18 -0700)

Alexei Starovoitov (1):
      Merge branch 'verifier-fixes'

Andrey Ignatov (1):
      libbpf: Fix compile error in libbpf_attach_type_by_name

Daniel Borkmann (4):
      bpf: fix partial copy of map_ptr when dst is scalar
      bpf: don't set id on after map lookup with ptr_to_map_val return
      bpf: add various test cases to test_verifier
      bpf: test make sure to run unpriv test cases in test_verifier

John Fastabend (1):
      bpf: tcp_bpf_recvmsg should return EAGAIN when nonblocking and no data

Li Zhijian (1):
      kselftests/bpf: use ping6 as the default ipv6 ping binary if it exists

Yonghong Song (1):
      tools/bpf: add unlimited rlimit for flow_dissector_load

 include/linux/bpf_verifier.h                      |   3 +
 kernel/bpf/verifier.c                             |  21 +-
 net/ipv4/tcp_bpf.c                                |   1 +
 tools/lib/bpf/libbpf.c                            |  13 +-
 tools/testing/selftests/bpf/flow_dissector_load.c |   2 +
 tools/testing/selftests/bpf/test_skb_cgroup_id.sh |   3 +-
 tools/testing/selftests/bpf/test_sock_addr.sh     |   3 +-
 tools/testing/selftests/bpf/test_verifier.c       | 321 +++---
 8 files changed, 319 insertions(+), 48 deletions(-)
[PATCH net v6] net/ipv6: Add anycast addresses to a global hashtable
icmp6_send() function is expensive on systems with a large number of
interfaces. Every time it's called, it has to verify that the source
address does not correspond to an existing anycast address by looping
through every device and every anycast address on the device. This can
result in significant delays for a CPU when there are a large number of
neighbors and ND timers are frequently timing out and calling
neigh_invalidate().

Add anycast addresses to a global hashtable to allow quick searching for
matching anycast addresses. This is based on inet6_addr_lst in addrconf.c.

Signed-off-by: Jeff Barnhill <0xeff...@gmail.com>
---
 include/net/addrconf.h |  2 ++
 include/net/if_inet6.h |  2 ++
 net/ipv6/af_inet6.c    |  5
 net/ipv6/anycast.c     | 80 +++---
 4 files changed, 85 insertions(+), 4 deletions(-)

diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index 14b789a123e7..799af1a037d1 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -317,6 +317,8 @@ bool ipv6_chk_acast_addr(struct net *net, struct net_device *dev,
 			 const struct in6_addr *addr);
 bool ipv6_chk_acast_addr_src(struct net *net, struct net_device *dev,
 			     const struct in6_addr *addr);
+int anycast_init(void);
+void anycast_cleanup(void);
 
 /* Device notifier */
 int register_inet6addr_notifier(struct notifier_block *nb);
diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h
index d7578cf49c3a..c9c78c15bce0 100644
--- a/include/net/if_inet6.h
+++ b/include/net/if_inet6.h
@@ -146,10 +146,12 @@ struct ifacaddr6 {
 	struct in6_addr		aca_addr;
 	struct fib6_info	*aca_rt;
 	struct ifacaddr6	*aca_next;
+	struct hlist_node	aca_addr_lst;
 	int			aca_users;
 	refcount_t		aca_refcnt;
 	unsigned long		aca_cstamp;
 	unsigned long		aca_tstamp;
+	struct rcu_head		rcu;
 };
 
 #define	IFA_HOST	IPV6_ADDR_LOOPBACK
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 3f4d61017a69..ddc8a6dbfba2 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -1001,6 +1001,9 @@ static int __init inet6_init(void)
 	err = ip6_flowlabel_init();
 	if (err)
 		goto ip6_flowlabel_fail;
+	err = anycast_init();
+	if (err)
+		goto anycast_fail;
 	err = addrconf_init();
 	if (err)
 		goto addrconf_fail;
@@ -1091,6 +1094,8 @@ static int __init inet6_init(void)
 ipv6_exthdrs_fail:
 	addrconf_cleanup();
 addrconf_fail:
+	anycast_cleanup();
+anycast_fail:
 	ip6_flowlabel_cleanup();
 ip6_flowlabel_fail:
 	ndisc_late_cleanup();
diff --git a/net/ipv6/anycast.c b/net/ipv6/anycast.c
index 4e0ff7031edd..f6c4c8ac184c 100644
--- a/net/ipv6/anycast.c
+++ b/net/ipv6/anycast.c
@@ -44,8 +44,22 @@
 
 #include
 
+#define IN6_ADDR_HSIZE_SHIFT	8
+#define IN6_ADDR_HSIZE		BIT(IN6_ADDR_HSIZE_SHIFT)
+
+/* anycast address hash table
+ */
+static struct hlist_head inet6_acaddr_lst[IN6_ADDR_HSIZE];
+static DEFINE_SPINLOCK(acaddr_hash_lock);
+
 static int ipv6_dev_ac_dec(struct net_device *dev, const struct in6_addr *addr);
 
+static u32 inet6_acaddr_hash(struct net *net, const struct in6_addr *addr)
+{
+	u32 val = ipv6_addr_hash(addr) ^ net_hash_mix(net);
+
+	return hash_32(val, IN6_ADDR_HSIZE_SHIFT);
+}
+
 /*
  *	socket join an anycast group
  */
@@ -204,16 +218,39 @@ void ipv6_sock_ac_close(struct sock *sk)
 	rtnl_unlock();
 }
 
+static void ipv6_add_acaddr_hash(struct net *net, struct ifacaddr6 *aca)
+{
+	unsigned int hash = inet6_acaddr_hash(net, &aca->aca_addr);
+
+	spin_lock(&acaddr_hash_lock);
+	hlist_add_head_rcu(&aca->aca_addr_lst, &inet6_acaddr_lst[hash]);
+	spin_unlock(&acaddr_hash_lock);
+}
+
+static void ipv6_del_acaddr_hash(struct ifacaddr6 *aca)
+{
+	spin_lock(&acaddr_hash_lock);
+	hlist_del_init_rcu(&aca->aca_addr_lst);
+	spin_unlock(&acaddr_hash_lock);
+}
+
 static void aca_get(struct ifacaddr6 *aca)
 {
 	refcount_inc(&aca->aca_refcnt);
 }
 
+static void aca_free_rcu(struct rcu_head *h)
+{
+	struct ifacaddr6 *aca = container_of(h, struct ifacaddr6, rcu);
+
+	fib6_info_release(aca->aca_rt);
+	kfree(aca);
+}
+
 static void aca_put(struct ifacaddr6 *ac)
 {
 	if (refcount_dec_and_test(&ac->aca_refcnt)) {
-		fib6_info_release(ac->aca_rt);
-		kfree(ac);
+		call_rcu(&ac->rcu, aca_free_rcu);
 	}
 }
 
@@ -229,6 +266,7 @@ static struct ifacaddr6 *aca_alloc(struct fib6_info *f6i,
 	aca->aca_addr = *addr;
 	fib6_info_hold(f6i);
 	aca->aca_rt = f6i;
+	INIT_HLIST_NODE(&aca->aca_addr_lst);
 	aca->aca_users = 1;
 	/* aca_tstamp should be updated upon changes */
 	aca->aca_cstamp = aca->aca_tstamp = jiffies;
@@ -285,6 +323
Re: [PATCH bpf 0/4] BPF fixes and tests
On Thu, Nov 01, 2018 at 12:05:51AM +0100, Daniel Borkmann wrote: > The series contains two fixes in BPF core and test cases. For details > please see individual patches. Thanks! > > Daniel Borkmann (4): > bpf: fix partial copy of map_ptr when dst is scalar > bpf: don't set id on after map lookup with ptr_to_map_val return > bpf: add various test cases to test_verifier > bpf: test make sure to run unpriv test cases in test_verifier > > include/linux/bpf_verifier.h| 3 + > kernel/bpf/verifier.c | 21 +- > tools/testing/selftests/bpf/test_verifier.c | 321 > +--- > 3 files changed, 305 insertions(+), 40 deletions(-) Applied to bpf tree, Thanks ... and we achieved very nice milestone... crossed 1000 tests in test_verifier :) Summary: 1012 PASSED, 0 SKIPPED, 0 FAILED
Re: [PATCH net v5] net/ipv6: Add anycast addresses to a global hashtable
I'll follow this email with a new patch using ifacaddr6 instead of creating a new struct. I ended up using fib6_nh.nh_dev to get the net, instead of adding a back pointer to idev. It seems that idev was recently removed in favor of this, so if this is incorrect, please let me know. Hopefully, I got the locking correct.

Thanks,
Jeff

On Tue, Oct 30, 2018 at 7:19 PM David Miller wrote:
>
> From: David Ahern
> Date: Tue, 30 Oct 2018 16:06:46 -0600
>
> > or make the table per namespace.
>
> This will increase namespace create/destroy cost, so I'd rather not
> for something like this.
[PATCH bpf 0/4] BPF fixes and tests
The series contains two fixes in BPF core and test cases. For details please see individual patches. Thanks! Daniel Borkmann (4): bpf: fix partial copy of map_ptr when dst is scalar bpf: don't set id on after map lookup with ptr_to_map_val return bpf: add various test cases to test_verifier bpf: test make sure to run unpriv test cases in test_verifier include/linux/bpf_verifier.h| 3 + kernel/bpf/verifier.c | 21 +- tools/testing/selftests/bpf/test_verifier.c | 321 +--- 3 files changed, 305 insertions(+), 40 deletions(-) -- 2.9.5
[PATCH bpf 1/4] bpf: fix partial copy of map_ptr when dst is scalar
ALU operations on pointers such as scalar_reg += map_value_ptr are handled in adjust_ptr_min_max_vals(). Problem is however that map_ptr and range in the register state share a union, so transferring state through dst_reg->range = ptr_reg->range is just buggy as any new map_ptr in the dst_reg is then truncated (or null) for subsequent checks. Fix this by adding a raw member and use it for copying state over to dst_reg. Fixes: f1174f77b50c ("bpf/verifier: rework value tracking") Signed-off-by: Daniel Borkmann Cc: Edward Cree Acked-by: Alexei Starovoitov --- include/linux/bpf_verifier.h | 3 +++ kernel/bpf/verifier.c| 10 ++ 2 files changed, 9 insertions(+), 4 deletions(-) diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h index 9e8056e..d93e897 100644 --- a/include/linux/bpf_verifier.h +++ b/include/linux/bpf_verifier.h @@ -51,6 +51,9 @@ struct bpf_reg_state { * PTR_TO_MAP_VALUE_OR_NULL */ struct bpf_map *map_ptr; + + /* Max size from any of the above. */ + unsigned long raw; }; /* Fixed part of pointer offset, pointer types only */ s32 off; diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 171a2c8..774fa40 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -3046,7 +3046,7 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env, dst_reg->umax_value = umax_ptr; dst_reg->var_off = ptr_reg->var_off; dst_reg->off = ptr_reg->off + smin_val; - dst_reg->range = ptr_reg->range; + dst_reg->raw = ptr_reg->raw; break; } /* A new variable offset is created. 
Note that off_reg->off @@ -3076,10 +3076,11 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env, } dst_reg->var_off = tnum_add(ptr_reg->var_off, off_reg->var_off); dst_reg->off = ptr_reg->off; + dst_reg->raw = ptr_reg->raw; if (reg_is_pkt_pointer(ptr_reg)) { dst_reg->id = ++env->id_gen; /* something was added to pkt_ptr, set range to zero */ - dst_reg->range = 0; + dst_reg->raw = 0; } break; case BPF_SUB: @@ -3108,7 +3109,7 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env, dst_reg->var_off = ptr_reg->var_off; dst_reg->id = ptr_reg->id; dst_reg->off = ptr_reg->off - smin_val; - dst_reg->range = ptr_reg->range; + dst_reg->raw = ptr_reg->raw; break; } /* A new variable offset is created. If the subtrahend is known @@ -3134,11 +3135,12 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env, } dst_reg->var_off = tnum_sub(ptr_reg->var_off, off_reg->var_off); dst_reg->off = ptr_reg->off; + dst_reg->raw = ptr_reg->raw; if (reg_is_pkt_pointer(ptr_reg)) { dst_reg->id = ++env->id_gen; /* something was added to pkt_ptr, set range to zero */ if (smin_val < 0) - dst_reg->range = 0; + dst_reg->raw = 0; } break; case BPF_AND: -- 2.9.5
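The truncation described in the commit message is easy to reproduce outside the verifier. A minimal user-space sketch (hypothetical `reg_state` struct, not the kernel's `bpf_reg_state`) of why copying only the 32-bit `range` view of the union loses the upper bits of a 64-bit `map_ptr`, and why routing the copy through a max-width `raw` member fixes it:

```c
/* Hypothetical miniature of a register state whose map_ptr and range
 * share a union, mirroring the situation fixed in this patch. */
struct reg_state {
	union {
		void *map_ptr;       /* 64-bit on LP64 targets */
		unsigned int range;  /* 32-bit packet-pointer range */
		unsigned long raw;   /* max size of any member above */
	};
};

/* Buggy transfer: copies only the 32-bit 'range' view, so a map_ptr
 * stored in the same union comes out truncated (or null). */
static void transfer_buggy(struct reg_state *dst,
			   const struct reg_state *src)
{
	dst->range = src->range;
}

/* Fixed transfer: copy through 'raw', which is wide enough to carry
 * every member of the union intact. */
static void transfer_fixed(struct reg_state *dst,
			   const struct reg_state *src)
{
	dst->raw = src->raw;
}
```

On LP64 Linux, `range` is half the width of `map_ptr`, so the buggy path silently hands subsequent checks a truncated pointer — exactly the failure mode the patch's `raw` member avoids.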
[PATCH bpf 2/4] bpf: don't set id on after map lookup with ptr_to_map_val return
In the verifier there is no such semantics where registers with PTR_TO_MAP_VALUE type have an id assigned to them. This is only used in PTR_TO_MAP_VALUE_OR_NULL and later on nullified once the test against NULL has been pattern matched and type transformed into PTR_TO_MAP_VALUE. Fixes: 3e6a4b3e0289 ("bpf/verifier: introduce BPF_PTR_TO_MAP_VALUE") Signed-off-by: Daniel Borkmann Cc: Roman Gushchin Acked-by: Alexei Starovoitov --- kernel/bpf/verifier.c | 11 ++- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 774fa40..1971ca32 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -2852,10 +2852,6 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn regs[BPF_REG_0].type = NOT_INIT; } else if (fn->ret_type == RET_PTR_TO_MAP_VALUE_OR_NULL || fn->ret_type == RET_PTR_TO_MAP_VALUE) { - if (fn->ret_type == RET_PTR_TO_MAP_VALUE) - regs[BPF_REG_0].type = PTR_TO_MAP_VALUE; - else - regs[BPF_REG_0].type = PTR_TO_MAP_VALUE_OR_NULL; /* There is no offset yet applied, variable or fixed */ mark_reg_known_zero(env, regs, BPF_REG_0); /* remember map_ptr, so that check_map_access() @@ -2868,7 +2864,12 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn return -EINVAL; } regs[BPF_REG_0].map_ptr = meta.map_ptr; - regs[BPF_REG_0].id = ++env->id_gen; + if (fn->ret_type == RET_PTR_TO_MAP_VALUE) { + regs[BPF_REG_0].type = PTR_TO_MAP_VALUE; + } else { + regs[BPF_REG_0].type = PTR_TO_MAP_VALUE_OR_NULL; + regs[BPF_REG_0].id = ++env->id_gen; + } } else if (fn->ret_type == RET_PTR_TO_SOCKET_OR_NULL) { int id = acquire_reference_state(env, insn_idx); if (id < 0) -- 2.9.5
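The semantics the fix restores can be sketched as a tiny state assignment (hypothetical types and function; the real logic lives in check_helper_call() and the later NULL-check pattern matching): only the `_OR_NULL` return type gets an id, which the verifier uses to nullify every copy of the register once one of them is tested against NULL.

```c
enum ret_type { RET_PTR_TO_MAP_VALUE, RET_PTR_TO_MAP_VALUE_OR_NULL };
enum reg_type { PTR_TO_MAP_VALUE, PTR_TO_MAP_VALUE_OR_NULL };

struct reg { enum reg_type type; unsigned int id; };

static unsigned int id_gen;

/* Sketch of the fixed return-register setup: an id is handed out only
 * for the _OR_NULL case, since only those registers are later
 * pattern-matched against NULL and type-transformed. */
static void set_helper_ret(struct reg *r, enum ret_type t)
{
	if (t == RET_PTR_TO_MAP_VALUE) {
		r->type = PTR_TO_MAP_VALUE;
		r->id = 0;            /* no id for plain map values */
	} else {
		r->type = PTR_TO_MAP_VALUE_OR_NULL;
		r->id = ++id_gen;     /* id ties register copies together */
	}
}
```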
[PATCH bpf 4/4] bpf: test make sure to run unpriv test cases in test_verifier
Right now unprivileged tests are never executed as a BPF test run, only loaded. Allow for running them as well so that we can check the outcome and probe for regressions. Signed-off-by: Daniel Borkmann Acked-by: Alexei Starovoitov --- tools/testing/selftests/bpf/test_verifier.c | 71 - 1 file changed, 40 insertions(+), 31 deletions(-) diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c index 4c7445d..6f61df6 100644 --- a/tools/testing/selftests/bpf/test_verifier.c +++ b/tools/testing/selftests/bpf/test_verifier.c @@ -76,7 +76,7 @@ struct bpf_test { int fixup_percpu_cgroup_storage[MAX_FIXUPS]; const char *errstr; const char *errstr_unpriv; - uint32_t retval; + uint32_t retval, retval_unpriv; enum { UNDEF, ACCEPT, @@ -3084,6 +3084,8 @@ static struct bpf_test tests[] = { .fixup_prog1 = { 2 }, .result = ACCEPT, .retval = 42, + /* Verifier rewrite for unpriv skips tail call here. */ + .retval_unpriv = 2, }, { "stack pointer arithmetic", @@ -14149,6 +14151,33 @@ static void do_test_fixup(struct bpf_test *test, enum bpf_map_type prog_type, } } +static int set_admin(bool admin) +{ + cap_t caps; + const cap_value_t cap_val = CAP_SYS_ADMIN; + int ret = -1; + + caps = cap_get_proc(); + if (!caps) { + perror("cap_get_proc"); + return -1; + } + if (cap_set_flag(caps, CAP_EFFECTIVE, 1, &cap_val, + admin ? 
CAP_SET : CAP_CLEAR)) { + perror("cap_set_flag"); + goto out; + } + if (cap_set_proc(caps)) { + perror("cap_set_proc"); + goto out; + } + ret = 0; +out: + if (cap_free(caps)) + perror("cap_free"); + return ret; +} + static void do_test_single(struct bpf_test *test, bool unpriv, int *passes, int *errors) { @@ -14157,6 +14186,7 @@ static void do_test_single(struct bpf_test *test, bool unpriv, struct bpf_insn *prog = test->insns; int map_fds[MAX_NR_MAPS]; const char *expected_err; + uint32_t expected_val; uint32_t retval; int i, err; @@ -14176,6 +14206,8 @@ static void do_test_single(struct bpf_test *test, bool unpriv, test->result_unpriv : test->result; expected_err = unpriv && test->errstr_unpriv ? test->errstr_unpriv : test->errstr; + expected_val = unpriv && test->retval_unpriv ? + test->retval_unpriv : test->retval; reject_from_alignment = fd_prog < 0 && (test->flags & F_NEEDS_EFFICIENT_UNALIGNED_ACCESS) && @@ -14209,16 +14241,20 @@ static void do_test_single(struct bpf_test *test, bool unpriv, __u8 tmp[TEST_DATA_LEN << 2]; __u32 size_tmp = sizeof(tmp); + if (unpriv) + set_admin(true); err = bpf_prog_test_run(fd_prog, 1, test->data, sizeof(test->data), tmp, &size_tmp, &retval, NULL); + if (unpriv) + set_admin(false); if (err && errno != 524/*ENOTSUPP*/ && errno != EPERM) { printf("Unexpected bpf_prog_test_run error\n"); goto fail_log; } - if (!err && retval != test->retval && - test->retval != POINTER_VALUE) { - printf("FAIL retval %d != %d\n", retval, test->retval); + if (!err && retval != expected_val && + expected_val != POINTER_VALUE) { + printf("FAIL retval %d != %d\n", retval, expected_val); goto fail_log; } } @@ -14261,33 +14297,6 @@ static bool is_admin(void) return (sysadmin == CAP_SET); } -static int set_admin(bool admin) -{ - cap_t caps; - const cap_value_t cap_val = CAP_SYS_ADMIN; - int ret = -1; - - caps = cap_get_proc(); - if (!caps) { - perror("cap_get_proc"); - return -1; - } - if (cap_set_flag(caps, CAP_EFFECTIVE, 1, &cap_val, - admin ? 
CAP_SET : CAP_CLEAR)) { - perror("cap_set_flag"); - goto out; - } - if (cap_set_proc(caps)) { - perror("cap_set_proc"); - goto out; - } - ret = 0; -out: - if (cap_free(caps)) - perror("cap_free"); - return ret; -} - static void get_unpriv_disabled() { char buf[2]; -- 2.9.5
[PATCH bpf 3/4] bpf: add various test cases to test_verifier
Add some more map related test cases to test_verifier kselftest to improve test coverage. Summary: 1012 PASSED, 0 SKIPPED, 0 FAILED Signed-off-by: Daniel Borkmann Acked-by: Alexei Starovoitov --- tools/testing/selftests/bpf/test_verifier.c | 250 1 file changed, 250 insertions(+) diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c index 36f3d30..4c7445d 100644 --- a/tools/testing/selftests/bpf/test_verifier.c +++ b/tools/testing/selftests/bpf/test_verifier.c @@ -6455,6 +6455,256 @@ static struct bpf_test tests[] = { .prog_type = BPF_PROG_TYPE_TRACEPOINT, }, { + "map access: known scalar += value_ptr", + .insns = { + BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_LD_MAP_FD(BPF_REG_1, 0), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, +BPF_FUNC_map_lookup_elem), + BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 3), + BPF_MOV64_IMM(BPF_REG_1, 4), + BPF_ALU64_REG(BPF_ADD, BPF_REG_1, BPF_REG_0), + BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_1, 0), + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_EXIT_INSN(), + }, + .fixup_map_array_48b = { 3 }, + .result = ACCEPT, + .retval = 1, + }, + { + "map access: value_ptr += known scalar", + .insns = { + BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_LD_MAP_FD(BPF_REG_1, 0), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, +BPF_FUNC_map_lookup_elem), + BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 3), + BPF_MOV64_IMM(BPF_REG_1, 4), + BPF_ALU64_REG(BPF_ADD, BPF_REG_0, BPF_REG_1), + BPF_LDX_MEM(BPF_B, BPF_REG_1, BPF_REG_0, 0), + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_EXIT_INSN(), + }, + .fixup_map_array_48b = { 3 }, + .result = ACCEPT, + .retval = 1, + }, + { + "map access: unknown scalar += value_ptr", + .insns = { + BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_LD_MAP_FD(BPF_REG_1, 0), + 
BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, +BPF_FUNC_map_lookup_elem), + BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 4), + BPF_LDX_MEM(BPF_B, BPF_REG_1, BPF_REG_0, 0), + BPF_ALU64_IMM(BPF_AND, BPF_REG_1, 0xf), + BPF_ALU64_REG(BPF_ADD, BPF_REG_1, BPF_REG_0), + BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_1, 0), + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_EXIT_INSN(), + }, + .fixup_map_array_48b = { 3 }, + .result = ACCEPT, + .retval = 1, + }, + { + "map access: value_ptr += unknown scalar", + .insns = { + BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_LD_MAP_FD(BPF_REG_1, 0), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, +BPF_FUNC_map_lookup_elem), + BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 4), + BPF_LDX_MEM(BPF_B, BPF_REG_1, BPF_REG_0, 0), + BPF_ALU64_IMM(BPF_AND, BPF_REG_1, 0xf), + BPF_ALU64_REG(BPF_ADD, BPF_REG_0, BPF_REG_1), + BPF_LDX_MEM(BPF_B, BPF_REG_1, BPF_REG_0, 0), + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_EXIT_INSN(), + }, + .fixup_map_array_48b = { 3 }, + .result = ACCEPT, + .retval = 1, + }, + { + "map access: value_ptr += value_ptr", + .insns = { + BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_LD_MAP_FD(BPF_REG_1, 0), +
[RFC 1/2] net: Add new socket-option SO_DELAYED_BIND
And store it as a flag in the sk_flags. Signed-off-by: Christoph Paasch --- arch/alpha/include/uapi/asm/socket.h | 2 ++ arch/ia64/include/uapi/asm/socket.h | 2 ++ arch/mips/include/uapi/asm/socket.h | 2 ++ arch/parisc/include/uapi/asm/socket.h | 2 ++ arch/s390/include/uapi/asm/socket.h | 2 ++ arch/sparc/include/uapi/asm/socket.h | 2 ++ arch/xtensa/include/uapi/asm/socket.h | 2 ++ include/net/sock.h| 1 + include/uapi/asm-generic/socket.h | 2 ++ net/core/sock.c | 21 + 10 files changed, 38 insertions(+) diff --git a/arch/alpha/include/uapi/asm/socket.h b/arch/alpha/include/uapi/asm/socket.h index 065fb372e355..add6aca13b53 100644 --- a/arch/alpha/include/uapi/asm/socket.h +++ b/arch/alpha/include/uapi/asm/socket.h @@ -115,4 +115,6 @@ #define SO_TXTIME 61 #define SCM_TXTIME SO_TXTIME +#define SO_DELAYED_BIND62 + #endif /* _UAPI_ASM_SOCKET_H */ diff --git a/arch/ia64/include/uapi/asm/socket.h b/arch/ia64/include/uapi/asm/socket.h index c872c4e6bafb..98a86f406601 100644 --- a/arch/ia64/include/uapi/asm/socket.h +++ b/arch/ia64/include/uapi/asm/socket.h @@ -117,4 +117,6 @@ #define SO_TXTIME 61 #define SCM_TXTIME SO_TXTIME +#define SO_DELAYED_BIND62 + #endif /* _ASM_IA64_SOCKET_H */ diff --git a/arch/mips/include/uapi/asm/socket.h b/arch/mips/include/uapi/asm/socket.h index 71370fb3ceef..f84bd74d58ee 100644 --- a/arch/mips/include/uapi/asm/socket.h +++ b/arch/mips/include/uapi/asm/socket.h @@ -126,4 +126,6 @@ #define SO_TXTIME 61 #define SCM_TXTIME SO_TXTIME +#define SO_DELAYED_BIND62 + #endif /* _UAPI_ASM_SOCKET_H */ diff --git a/arch/parisc/include/uapi/asm/socket.h b/arch/parisc/include/uapi/asm/socket.h index 061b9cf2a779..8fe20a7abf6e 100644 --- a/arch/parisc/include/uapi/asm/socket.h +++ b/arch/parisc/include/uapi/asm/socket.h @@ -107,4 +107,6 @@ #define SO_TXTIME 0x4036 #define SCM_TXTIME SO_TXTIME +#define SO_DELAYED_BIND0x4037 + #endif /* _UAPI_ASM_SOCKET_H */ diff --git a/arch/s390/include/uapi/asm/socket.h b/arch/s390/include/uapi/asm/socket.h index 
39d901476ee5..c00b10909a72 100644 --- a/arch/s390/include/uapi/asm/socket.h +++ b/arch/s390/include/uapi/asm/socket.h @@ -114,4 +114,6 @@ #define SO_TXTIME 61 #define SCM_TXTIME SO_TXTIME +#define SO_DELAYED_BIND62 + #endif /* _ASM_SOCKET_H */ diff --git a/arch/sparc/include/uapi/asm/socket.h b/arch/sparc/include/uapi/asm/socket.h index 7ea35e5601b6..0825db0c9f46 100644 --- a/arch/sparc/include/uapi/asm/socket.h +++ b/arch/sparc/include/uapi/asm/socket.h @@ -104,6 +104,8 @@ #define SO_TXTIME 0x003f #define SCM_TXTIME SO_TXTIME +#define SO_DELAYED_BIND0x0040 + /* Security levels - as per NRL IPv6 - don't actually do anything */ #define SO_SECURITY_AUTHENTICATION 0x5001 #define SO_SECURITY_ENCRYPTION_TRANSPORT 0x5002 diff --git a/arch/xtensa/include/uapi/asm/socket.h b/arch/xtensa/include/uapi/asm/socket.h index 1de07a7f7680..cd4d91e982d5 100644 --- a/arch/xtensa/include/uapi/asm/socket.h +++ b/arch/xtensa/include/uapi/asm/socket.h @@ -119,4 +119,6 @@ #define SO_TXTIME 61 #define SCM_TXTIME SO_TXTIME +#define SO_DELAYED_BIND62 + #endif /* _XTENSA_SOCKET_H */ diff --git a/include/net/sock.h b/include/net/sock.h index f665d74ae509..16fbe54cf519 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -801,6 +801,7 @@ enum sock_flags { SOCK_RCU_FREE, /* wait rcu grace period in sk_destruct() */ SOCK_TXTIME, SOCK_XDP, /* XDP is attached */ + SOCK_DELAYED_BIND, }; #define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE)) diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h index a12692e5f7a8..653f1f65a311 100644 --- a/include/uapi/asm-generic/socket.h +++ b/include/uapi/asm-generic/socket.h @@ -110,4 +110,6 @@ #define SO_TXTIME 61 #define SCM_TXTIME SO_TXTIME +#define SO_DELAYED_BIND62 + #endif /* __ASM_GENERIC_SOCKET_H */ diff --git a/net/core/sock.c b/net/core/sock.c index 6fcc4bc07d19..343baa820cf2 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1047,6 +1047,23 @@ int sock_setsockopt(struct 
socket *sock, int level, int optname, } break; + case SO_DELAYED_BIND: + if (sk->sk_family == PF_INET || sk->sk_family == PF_INET6) { + if (sk->sk_protocol != IPPROTO_UDP) + ret = -ENOTSUPP; + } else { + ret = -ENOTSUPP; + } + + if (!ret) { + if (val < 0 || val > 1) +
[RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets
Implementations of Quic might want to create a separate socket for each Quic-connection by creating a connected UDP-socket. To achieve that on the server-side, a "master-socket" needs to wait for incoming new connections and then create a new socket that will be a connected UDP-socket. To create that latter one, the server needs to first bind() and then connect(). However, after the bind() the server might already receive traffic on that new socket that is unrelated to the Quic-connection at hand. Only after the connect() does a full 4-tuple match happen. So, one can't really create this kind of a server that has a connected UDP-socket per Quic connection. What is needed is an "atomic bind & connect" that prevents any incoming traffic until the connect() call has been issued, at which point the full 4-tuple is known. This patchset implements this functionality and exposes a socket-option to do this. Usage would be: int fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP); int val = 1; setsockopt(fd, SOL_SOCKET, SO_DELAYED_BIND, &val, sizeof(val)); bind(fd, (struct sockaddr *)&src, sizeof(src)); /* At this point, incoming traffic will never match on this socket */ connect(fd, (struct sockaddr *)&dst, sizeof(dst)); /* Only now incoming traffic will reach the socket */ There are many ways to implement this, which is why I am first sending it out as an RFC. With the approach here I chose the least invasive one, just preventing the match on the incoming path. The reason for choosing a SOL_SOCKET socket-option rather than one at the SOL_UDP level is that the functionality could actually be useful for other protocols as well. E.g., TCP wants to better use the full 4-tuple space by binding to the source-IP and the destination-IP at the same time. Feedback is very welcome! 
Christoph Paasch (2): net: Add new socket-option SO_DELAYED_BIND udp: Support SO_DELAYED_BIND arch/alpha/include/uapi/asm/socket.h | 2 ++ arch/ia64/include/uapi/asm/socket.h | 2 ++ arch/mips/include/uapi/asm/socket.h | 2 ++ arch/parisc/include/uapi/asm/socket.h | 2 ++ arch/s390/include/uapi/asm/socket.h | 2 ++ arch/sparc/include/uapi/asm/socket.h | 2 ++ arch/xtensa/include/uapi/asm/socket.h | 2 ++ include/net/sock.h| 1 + include/uapi/asm-generic/socket.h | 2 ++ net/core/sock.c | 21 + net/ipv4/datagram.c | 1 + net/ipv4/udp.c| 3 +++ 12 files changed, 42 insertions(+) -- 2.16.2
[RFC 2/2] udp: Support SO_DELAYED_BIND
For UDP, there is only a single socket-hash table, the udptable. We want to prevent incoming segments from matching on this socket when SO_DELAYED_BIND is set. Thus, when computing the score for unconnected sockets, we simply prevent the match as long as the flag is set. Signed-off-by: Christoph Paasch --- net/ipv4/datagram.c | 1 + net/ipv4/udp.c | 3 +++ 2 files changed, 4 insertions(+) diff --git a/net/ipv4/datagram.c b/net/ipv4/datagram.c index 300921417f89..9bf0e0d2ea33 100644 --- a/net/ipv4/datagram.c +++ b/net/ipv4/datagram.c @@ -78,6 +78,7 @@ int __ip4_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len inet->inet_id = jiffies; sk_dst_set(sk, &rt->dst); + sock_reset_flag(sk, SOCK_DELAYED_BIND); err = 0; out: return err; diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index ca3ed931f2a9..fb55f925342b 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -408,6 +408,9 @@ static int compute_score(struct sock *sk, struct net *net, score += 4; } + if (sock_flag(sk, SOCK_DELAYED_BIND)) + return -1; + if (sk->sk_incoming_cpu == raw_smp_processor_id()) score++; return score; -- 2.16.2
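The effect of the hook can be sketched with a simplified, pure-function scorer (hypothetical struct and field names; the real compute_score() matches on the full socket hash state): a socket with the delayed-bind flag set never produces a positive score, so unconnected lookups skip it until connect() clears the flag.

```c
/* Hypothetical reduction of UDP socket scoring. */
struct sock_sketch {
	int port_match;    /* local port matches the segment */
	int addr_match;    /* bound local address matches    */
	int delayed_bind;  /* SOCK_DELAYED_BIND flag set     */
	int cpu_match;     /* sk_incoming_cpu == current CPU */
};

static int compute_score_sketch(const struct sock_sketch *sk)
{
	int score = 0;

	if (!sk->port_match)
		return -1;
	if (sk->addr_match)
		score += 4;
	/* The patch's addition: refuse the match outright while the
	 * socket is in the bind()-but-not-yet-connect()ed window. */
	if (sk->delayed_bind)
		return -1;
	if (sk->cpu_match)
		score++;
	return score;
}
```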
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 31.10.2018 o 23:20, Paweł Staszewski pisze: W dniu 31.10.2018 o 23:09, Eric Dumazet pisze: On 10/31/2018 02:57 PM, Paweł Staszewski wrote: Hi So maybee someone will be interested how linux kernel handles normal traffic (not pktgen :) ) Server HW configuration: CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz NIC's: 2x 100G Mellanox ConnectX-4 (connected to x16 pcie 8GT) Server software: FRR - as routing daemon enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS binded to local numa node) enp175s0f1 (100G) - 343 vlans to clients (28 RSS binded to local numa node) Maximum traffic that server can handle: Bandwidth bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate \ iface Rx Tx Total == enp175s0f1: 28.51 Gb/s 37.24 Gb/s 65.74 Gb/s enp175s0f0: 38.07 Gb/s 28.44 Gb/s 66.51 Gb/s -- total: 66.58 Gb/s 65.67 Gb/s 132.25 Gb/s Packets per second: bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate - iface Rx Tx Total == enp175s0f1: 5248589.00 P/s 3486617.75 P/s 8735207.00 P/s enp175s0f0: 3557944.25 P/s 5232516.00 P/s 8790460.00 P/s -- total: 8806533.00 P/s 8719134.00 P/s 17525668.00 P/s After reaching that limits nics on the upstream side (more RX traffic) start to drop packets I just dont understand that server can't handle more bandwidth (~40Gbit/s is limit where all cpu's are 100% util) - where pps on RX side are increasing. Was thinking that maybee reached some pcie x16 limit - but x16 8GT is 126Gbit - and also when testing with pktgen i can reach more bw and pps (like 4x more comparing to normal internet traffic) And wondering if there is something that can be improved here. 
Some more informations / counters / stats and perf top below: Perf top flame graph: https://uploadfiles.io/7zo6u System configuration(long): cat /sys/devices/system/node/node1/cpulist 14-27,42-55 cat /sys/class/net/enp175s0f0/device/numa_node 1 cat /sys/class/net/enp175s0f1/device/numa_node 1 ip -s -d link ls dev enp175s0f0 6: enp175s0f0: mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 8192 link/ether 0c:c4:7a:d8:5d:1c brd ff:ff:ff:ff:ff:ff promiscuity 0 addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536 gso_max_segs 65535 RX: bytes packets errors dropped overrun mcast 184142375840858 141347715974 2 2806325 0 85050528 TX: bytes packets errors dropped carrier collsns 99270697277430 172227994003 0 0 0 0 ip -s -d link ls dev enp175s0f1 7: enp175s0f1: mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 8192 link/ether 0c:c4:7a:d8:5d:1d brd ff:ff:ff:ff:ff:ff promiscuity 0 addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536 gso_max_segs 65535 RX: bytes packets errors dropped overrun mcast 99686284170801 173507590134 61 669685 0 100304421 TX: bytes packets errors dropped carrier collsns 184435107970545 142383178304 0 0 0 0 ./softnet.sh cpu total dropped squeezed collision rps flow_limit PerfTop: 108490 irqs/sec kernel:99.6% exact: 0.0% [4000Hz cycles], (all, 56 CPUs) --- 26.78% [kernel] [k] queued_spin_lock_slowpath This is highly suspect. A call graph (perf record -a -g sleep 1; perf report --stdio) would tell what is going on. perf report: https://ufile.io/rqp0h With that many TX/RX queues, I would expect you to not use RPS/RFS, and have a 1/1 RX/TX mapping, so I do not know what could request a spinlock contention. And yes there is no RPF/RFS - just 1/1 RX/TX and affinity mapping on local cpu for the network controller for 28 RX+TX queues per nic .
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 31.10.2018 o 23:09, Eric Dumazet pisze: On 10/31/2018 02:57 PM, Paweł Staszewski wrote: Hi So maybee someone will be interested how linux kernel handles normal traffic (not pktgen :) ) Server HW configuration: CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz NIC's: 2x 100G Mellanox ConnectX-4 (connected to x16 pcie 8GT) Server software: FRR - as routing daemon enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS binded to local numa node) enp175s0f1 (100G) - 343 vlans to clients (28 RSS binded to local numa node) Maximum traffic that server can handle: Bandwidth bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate \ iface Rx Tx Total == enp175s0f1: 28.51 Gb/s 37.24 Gb/s 65.74 Gb/s enp175s0f0: 38.07 Gb/s 28.44 Gb/s 66.51 Gb/s -- total: 66.58 Gb/s 65.67 Gb/s 132.25 Gb/s Packets per second: bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate - iface Rx Tx Total == enp175s0f1: 5248589.00 P/s 3486617.75 P/s 8735207.00 P/s enp175s0f0: 3557944.25 P/s 5232516.00 P/s 8790460.00 P/s -- total: 8806533.00 P/s 8719134.00 P/s 17525668.00 P/s After reaching that limits nics on the upstream side (more RX traffic) start to drop packets I just dont understand that server can't handle more bandwidth (~40Gbit/s is limit where all cpu's are 100% util) - where pps on RX side are increasing. Was thinking that maybee reached some pcie x16 limit - but x16 8GT is 126Gbit - and also when testing with pktgen i can reach more bw and pps (like 4x more comparing to normal internet traffic) And wondering if there is something that can be improved here. 
Some more informations / counters / stats and perf top below: Perf top flame graph: https://uploadfiles.io/7zo6u System configuration(long): cat /sys/devices/system/node/node1/cpulist 14-27,42-55 cat /sys/class/net/enp175s0f0/device/numa_node 1 cat /sys/class/net/enp175s0f1/device/numa_node 1 ip -s -d link ls dev enp175s0f0 6: enp175s0f0: mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 8192 link/ether 0c:c4:7a:d8:5d:1c brd ff:ff:ff:ff:ff:ff promiscuity 0 addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536 gso_max_segs 65535 RX: bytes packets errors dropped overrun mcast 184142375840858 141347715974 2 2806325 0 85050528 TX: bytes packets errors dropped carrier collsns 99270697277430 172227994003 0 0 0 0 ip -s -d link ls dev enp175s0f1 7: enp175s0f1: mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 8192 link/ether 0c:c4:7a:d8:5d:1d brd ff:ff:ff:ff:ff:ff promiscuity 0 addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536 gso_max_segs 65535 RX: bytes packets errors dropped overrun mcast 99686284170801 173507590134 61 669685 0 100304421 TX: bytes packets errors dropped carrier collsns 184435107970545 142383178304 0 0 0 0 ./softnet.sh cpu total dropped squeezed collision rps flow_limit PerfTop: 108490 irqs/sec kernel:99.6% exact: 0.0% [4000Hz cycles], (all, 56 CPUs) --- 26.78% [kernel] [k] queued_spin_lock_slowpath This is highly suspect. A call graph (perf record -a -g sleep 1; perf report --stdio) would tell what is going on. perf report: https://ufile.io/rqp0h With that many TX/RX queues, I would expect you to not use RPS/RFS, and have a 1/1 RX/TX mapping, so I do not know what could request a spinlock contention.
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
On 10/31/2018 02:57 PM, Paweł Staszewski wrote: > Hi > > So maybee someone will be interested how linux kernel handles normal traffic > (not pktgen :) ) > > > Server HW configuration: > > CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz > > NIC's: 2x 100G Mellanox ConnectX-4 (connected to x16 pcie 8GT) > > > Server software: > > FRR - as routing daemon > > enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS binded to local numa node) > > enp175s0f1 (100G) - 343 vlans to clients (28 RSS binded to local numa node) > > > Maximum traffic that server can handle: > > Bandwidth > > bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help > input: /proc/net/dev type: rate > \ iface Rx Tx Total > == > enp175s0f1: 28.51 Gb/s 37.24 Gb/s 65.74 > Gb/s > enp175s0f0: 38.07 Gb/s 28.44 Gb/s 66.51 > Gb/s > -- > total: 66.58 Gb/s 65.67 Gb/s 132.25 > Gb/s > > > Packets per second: > > bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help > input: /proc/net/dev type: rate > - iface Rx Tx Total > == > enp175s0f1: 5248589.00 P/s 3486617.75 P/s 8735207.00 P/s > enp175s0f0: 3557944.25 P/s 5232516.00 P/s 8790460.00 P/s > -- > total: 8806533.00 P/s 8719134.00 P/s 17525668.00 P/s > > > After reaching that limits nics on the upstream side (more RX traffic) start > to drop packets > > > I just dont understand that server can't handle more bandwidth (~40Gbit/s is > limit where all cpu's are 100% util) - where pps on RX side are increasing. > > Was thinking that maybee reached some pcie x16 limit - but x16 8GT is 126Gbit > - and also when testing with pktgen i can reach more bw and pps (like 4x more > comparing to normal internet traffic) > > And wondering if there is something that can be improved here. 
> > > > Some more informations / counters / stats and perf top below: > > Perf top flame graph: > > https://uploadfiles.io/7zo6u > > > > System configuration(long): > > > cat /sys/devices/system/node/node1/cpulist > 14-27,42-55 > cat /sys/class/net/enp175s0f0/device/numa_node > 1 > cat /sys/class/net/enp175s0f1/device/numa_node > 1 > > > > > > ip -s -d link ls dev enp175s0f0 > 6: enp175s0f0: mtu 1500 qdisc mq state UP > mode DEFAULT group default qlen 8192 > link/ether 0c:c4:7a:d8:5d:1c brd ff:ff:ff:ff:ff:ff promiscuity 0 > addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536 > gso_max_segs 65535 > RX: bytes packets errors dropped overrun mcast > 184142375840858 141347715974 2 2806325 0 85050528 > TX: bytes packets errors dropped carrier collsns > 99270697277430 172227994003 0 0 0 0 > > ip -s -d link ls dev enp175s0f1 > 7: enp175s0f1: mtu 1500 qdisc mq state UP > mode DEFAULT group default qlen 8192 > link/ether 0c:c4:7a:d8:5d:1d brd ff:ff:ff:ff:ff:ff promiscuity 0 > addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536 > gso_max_segs 65535 > RX: bytes packets errors dropped overrun mcast > 99686284170801 173507590134 61 669685 0 100304421 > TX: bytes packets errors dropped carrier collsns > 184435107970545 142383178304 0 0 0 0 > > > ./softnet.sh > cpu total dropped squeezed collision rps flow_limit > > > > > PerfTop: 108490 irqs/sec kernel:99.6% exact: 0.0% [4000Hz cycles], > (all, 56 CPUs) > --- > > 26.78% [kernel] [k] queued_spin_lock_slowpath This is highly suspect. A call graph (perf record -a -g sleep 1; perf report --stdio) would tell what is going on. With that many TX/RX queues, I would expect you to not use RPS/RFS, and have a 1/1 RX/TX mapping, so I do not know what could request a spinlock contention.
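The PCIe ceiling mentioned in the report can be checked with a quick back-of-the-envelope calculation: a Gen3 (8 GT/s) x16 link with 128b/130b line encoding tops out at roughly 126 Gbit/s, matching the figure quoted above, so the link itself is well above the observed ~66 Gbit/s per port. A small sketch of the arithmetic:

```c
/* Back-of-the-envelope PCIe raw bandwidth estimate.
 * Gen3 (8 GT/s) uses 128b/130b line encoding. */
static double pcie_gbps(int lanes, double gt_per_s)
{
	return lanes * gt_per_s * 128.0 / 130.0;
}
```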
RE: [RFC PATCH 4/4] ixgbe: add support for extended PHC gettime
> -Original Message- > From: Richard Cochran [mailto:richardcoch...@gmail.com] > Sent: Wednesday, October 31, 2018 2:17 PM > To: Miroslav Lichvar > Cc: Keller, Jacob E ; netdev@vger.kernel.org; > intel-wired- > l...@lists.osuosl.org > Subject: Re: [RFC PATCH 4/4] ixgbe: add support for extended PHC gettime > > On Wed, Oct 31, 2018 at 03:49:35PM +0100, Miroslav Lichvar wrote: > > > > How about separating the PHC timestamp from the ptp_system_timestamp > > structure and use NULL to indicate we don't want to read the system > > clock? A gettimex64(ptp, ts, NULL) call would be equal to > > gettime64(ptp, ts). > > Doesn't sound too bad to me. > > Thanks, > Richard Yep, this seems fine to me as well. Regards, Jake
Re: [PATCH bpf] libbpf: Fix compile error in libbpf_attach_type_by_name
On 10/31/2018 09:49 PM, Arnaldo Carvalho de Melo wrote: > Em Wed, Oct 31, 2018 at 12:57:18PM -0700, Andrey Ignatov escreveu: >> Arnaldo Carvalho de Melo reported build error in libbpf when clang >> version 3.8.1-24 (tags/RELEASE_381/final) is used: >> >> libbpf.c:2201:36: error: comparison of constant -22 with expression of >> type 'const enum bpf_attach_type' is always false >> [-Werror,-Wtautological-constant-out-of-range-compare] >> if (section_names[i].attach_type == -EINVAL) >> ^ ~~~ >> 1 error generated. >> >> Fix the error by keeping "is_attachable" property of a program in a >> separate struct field instead of trying to use attach_type itself. > > Thanks, now it builds in all the previously failing systems: > > # export PERF_TARBALL=http://192.168.86.4/perf/perf-4.19.0.tar.xz > # dm debian:9 fedora:25 fedora:26 fedora:27 ubuntu:16.04 ubuntu:17.10 >1 debian:9: Ok gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516 >clang version 3.8.1-24 (tags/RELEASE_381/final) >2 fedora:25 : Ok gcc (GCC) 6.4.1 20170727 (Red Hat 6.4.1-1) >clang version 3.9.1 (tags/RELEASE_391/final) >3 fedora:26 : Ok gcc (GCC) 7.3.1 20180130 (Red Hat 7.3.1-2) >clang version 4.0.1 (tags/RELEASE_401/final) >4 fedora:27 : Ok gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-6) >clang version 5.0.2 (tags/RELEASE_502/final) >5 ubuntu:16.04: Ok gcc (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 > 20160609 clang version 3.8.0-2ubuntu4 (tags/RELEASE_380/final) >6 ubuntu:17.10: Ok gcc (Ubuntu 7.2.0-8ubuntu3.2) 7.2.0 >clang version 4.0.1-6 (tags/RELEASE_401/final) > # > > Tested-by: Arnaldo Carvalho de Melo Thanks everyone, applied to bpf tree.
Re: Ethernet on my CycloneV broke since 4.9.124
Hi Clement, On 10/31/2018 10:36 AM, Clément Péron wrote: > Hi Dinh, > > On Wed, 31 Oct 2018 at 15:42, Dinh Nguyen wrote: >> >> Hi Clement, >> >> On 10/31/2018 08:01 AM, Clément Péron wrote: >>> Hi, >>> >>> The patch "net: stmmac: socfpga: add additional ocp reset line for >>> Stratix10" introduce in 4.9.124 broke the ethernet on my CycloneV >>> board. >>> >>> When I boot i have this issue : >>> >>> socfpga-dwmac ff702000.ethernet: error getting reset control of ocp -2 >>> socfpga-dwmac: probe of ff702000.ethernet failed with error -2 >>> >>> Reverting the commit : 6f37f7b62baa6a71d7f3f298acb64de51275e724 fix the >>> issue. >>> >> >> Are you sure? I just booted v4.9.124 and did not see any errors. The >> error should not appear because the commit is using >> devm_reset_control_get_optional(). > > I'm booting on 4.9.130 actually, Agree with you that > devm_reset_control_get_optional should not failed but checking other > usage of this helper > https://elixir.bootlin.com/linux/v4.9.135/source/drivers/i2c/busses/i2c-mv64xxx.c#L824 > https://elixir.bootlin.com/linux/v4.9.135/source/drivers/crypto/sunxi-ss/sun4i-ss-core.c#L259 > Show that they don't check for errors except for PROBE_DEFER > I made a mistake, I was booting linux-next. I am seeing the error with v4.9.124. It's due to this commit not getting backported: "bb475230b8e59a reset: make optional functions really optional" I have backported the patch and is available here if you like to take a look: https://git.kernel.org/pub/scm/linux/kernel/git/dinguyen/linux.git/log/?h=v4.9.124_optional_reset Dinh
Re: Latest net-next kernel 4.19.0+
W dniu 30.10.2018 o 15:16, Eric Dumazet pisze: On 10/30/2018 01:09 AM, Paweł Staszewski wrote: W dniu 30.10.2018 o 08:29, Eric Dumazet pisze: On 10/29/2018 11:09 PM, Dimitris Michailidis wrote: Indeed this is a bug. I would expect it to produce frequent errors though as many odd-length packets would trigger it. Do you have RXFCS? Regardless, how frequently do you see the problem? Old kernels (before 88078d98d1bb) were simply resetting ip_summed to CHECKSUM_NONE And before your fix (commit d55bef5059dd057bd), mlx5 bug was canceling the bug you fixed. So we now need to also fix mlx5. And of course use skb_header_pointer() in mlx5e_get_fcs() as I mentioned earlier, plus __get_unaligned_cpu32() as you hinted. No RXFCS And this trace is rly frequently like once per 3/4 seconds like below: [28965.776864] vlan1490: hw csum failure Might be vlan related. Can you first check this : diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c index 94224c22ecc310a87b6715051e335446f29bec03..6f4bfebf0d9a3ae7567062abb3ea6532b3aaf3d6 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c @@ -789,13 +789,8 @@ static inline void mlx5e_handle_csum(struct net_device *netdev, skb->ip_summed = CHECKSUM_COMPLETE; skb->csum = csum_unfold((__force __sum16)cqe->check_sum); if (network_depth > ETH_HLEN) - /* CQE csum is calculated from the IP header and does -* not cover VLAN headers (if present). This will add -* the checksum manually. -*/ - skb->csum = csum_partial(skb->data + ETH_HLEN, -network_depth - ETH_HLEN, -skb->csum); + /* Temporary debugging */ + skb->ip_summed = CHECKSUM_NONE; if (unlikely(netdev->features & NETIF_F_RXFCS)) skb->csum = csum_add(skb->csum, (__force __wsum)mlx5e_get_fcs(skb)); Ok thanks - will try it.
Re: Latest net-next kernel 4.19.0+
W dniu 31.10.2018 o 22:05, Saeed Mahameed pisze: On Tue, 2018-10-30 at 10:32 -0700, Cong Wang wrote: On Tue, Oct 30, 2018 at 7:16 AM Eric Dumazet wrote: On 10/30/2018 01:09 AM, Paweł Staszewski wrote: W dniu 30.10.2018 o 08:29, Eric Dumazet pisze: On 10/29/2018 11:09 PM, Dimitris Michailidis wrote: Indeed this is a bug. I would expect it to produce frequent errors though as many odd-length packets would trigger it. Do you have RXFCS? Regardless, how frequently do you see the problem? Old kernels (before 88078d98d1bb) were simply resetting ip_summed to CHECKSUM_NONE And before your fix (commit d55bef5059dd057bd), mlx5 bug was canceling the bug you fixed. So we now need to also fix mlx5. And of course use skb_header_pointer() in mlx5e_get_fcs() as I mentioned earlier, plus __get_unaligned_cpu32() as you hinted. No RXFCS Same with Pawel, RXFCS is disabled by default. And this trace is rly frequently like once per 3/4 seconds like below: [28965.776864] vlan1490: hw csum failure Might be vlan related. Hi Pawel, is the vlan stripping offload disabled or enabled in your case ? 
To verify: ethtool -k | grep rx-vlan-offload rx-vlan-offload: on To set: ethtool -K rxvlan on/off Enabled: ethtool -k enp175s0f0 Features for enp175s0f0: rx-checksumming: on tx-checksumming: on tx-checksum-ipv4: on tx-checksum-ip-generic: off [fixed] tx-checksum-ipv6: on tx-checksum-fcoe-crc: off [fixed] tx-checksum-sctp: off [fixed] scatter-gather: on tx-scatter-gather: on tx-scatter-gather-fraglist: off [fixed] tcp-segmentation-offload: on tx-tcp-segmentation: on tx-tcp-ecn-segmentation: off [fixed] tx-tcp-mangleid-segmentation: off tx-tcp6-segmentation: on udp-fragmentation-offload: off generic-segmentation-offload: on generic-receive-offload: on large-receive-offload: off [fixed] rx-vlan-offload: on tx-vlan-offload: on ntuple-filters: off receive-hashing: on highdma: on [fixed] rx-vlan-filter: on vlan-challenged: off [fixed] tx-lockless: off [fixed] netns-local: off [fixed] tx-gso-robust: off [fixed] tx-fcoe-segmentation: off [fixed] tx-gre-segmentation: on tx-gre-csum-segmentation: on tx-ipxip4-segmentation: off [fixed] tx-ipxip6-segmentation: off [fixed] tx-udp_tnl-segmentation: on tx-udp_tnl-csum-segmentation: on tx-gso-partial: on tx-sctp-segmentation: off [fixed] tx-esp-segmentation: off [fixed] tx-udp-segmentation: on fcoe-mtu: off [fixed] tx-nocache-copy: off loopback: off [fixed] rx-fcs: off rx-all: off tx-vlan-stag-hw-insert: on rx-vlan-stag-hw-parse: off [fixed] rx-vlan-stag-filter: on [fixed] l2-fwd-offload: off [fixed] hw-tc-offload: on esp-hw-offload: off [fixed] esp-tx-csum-hw-offload: off [fixed] rx-udp_tunnel-port-offload: on tls-hw-tx-offload: off [fixed] tls-hw-rx-offload: off [fixed] rx-gro-hw: off [fixed] tls-hw-record: off [fixed] if the vlan offload is off then it will trigger the mlx5e vlan csum adjustment code pointed out by Eric. Anyhow, it should work in both cases, but i am trying to narrow down the possibilities. Also could it be a double tagged packet ? 
no double tagged packets there Unlike Pawel's case, we don't use vlan at all, maybe this is why we see it much less frequently than Pawel. Also, it is probably not specific to mlx5, as there is another report which is probably a non-mlx5 driver. Cong, How often does this happen ? can you some how verify if the problematic packet has extra end padding after the ip payload ? It would be cool if we had a feature in kernel to store such SKB in memory when such issue occurs, and let the user dump it later (via tcpdump) and send the dump to the vendor for debug so we could just replay and see what happens. Thanks.
Re: Latest net-next kernel 4.19.0+
On Wed, Oct 31, 2018 at 2:05 PM Saeed Mahameed wrote: > > Cong, How often does this happen ? can you some how verify if the > problematic packet has extra end padding after the ip payload ? For us, we need 10+ hours to get one warning. This is also why we never capture the packet that causes this warning. > > It would be cool if we had a feature in kernel to store such SKB in > memory when such issue occurs, and let the user dump it later (via > tcpdump) and send the dump to the vendor for debug so we could just > replay and see what happens. > Yeah, the warning kinda sucks, it tells almost nothing, the SKB should be dumped up on this warning.
Re: [RFC PATCH 4/4] ixgbe: add support for extended PHC gettime
On Wed, Oct 31, 2018 at 03:49:35PM +0100, Miroslav Lichvar wrote: > > How about separating the PHC timestamp from the ptp_system_timestamp > structure and use NULL to indicate we don't want to read the system > clock? A gettimex64(ptp, ts, NULL) call would be equal to > gettime64(ptp, ts). Doesn't sound too bad to me. Thanks, Richard
Re: Latest net-next kernel 4.19.0+
On Tue, 2018-10-30 at 10:32 -0700, Cong Wang wrote: > On Tue, Oct 30, 2018 at 7:16 AM Eric Dumazet > wrote: > > > > > > > > On 10/30/2018 01:09 AM, Paweł Staszewski wrote: > > > > > > > > > W dniu 30.10.2018 o 08:29, Eric Dumazet pisze: > > > > > > > > On 10/29/2018 11:09 PM, Dimitris Michailidis wrote: > > > > > > > > > Indeed this is a bug. I would expect it to produce frequent > > > > > errors > > > > > though as many odd-length > > > > > packets would trigger it. Do you have RXFCS? Regardless, how > > > > > frequently do you see the problem? > > > > > > > > > > > > > Old kernels (before 88078d98d1bb) were simply resetting > > > > ip_summed to CHECKSUM_NONE > > > > > > > > And before your fix (commit d55bef5059dd057bd), mlx5 bug was > > > > canceling the bug you fixed. > > > > > > > > So we now need to also fix mlx5. > > > > > > > > And of course use skb_header_pointer() in mlx5e_get_fcs() as I > > > > mentioned earlier, > > > > plus __get_unaligned_cpu32() as you hinted. > > > > > > > > > > > > > > > > > > > > > > No RXFCS > > > Same with Pawel, RXFCS is disabled by default. > > > > > > > > And this trace is rly frequently like once per 3/4 seconds > > > like below: > > > [28965.776864] vlan1490: hw csum failure > > > > Might be vlan related. > Hi Pawel, is the vlan stripping offload disabled or enabled in your case ? To verify: ethtool -k | grep rx-vlan-offload rx-vlan-offload: on To set: ethtool -K rxvlan on/off if the vlan offload is off then it will trigger the mlx5e vlan csum adjustment code pointed out by Eric. Anyhow, it should work in both cases, but i am trying to narrow down the possibilities. Also could it be a double tagged packet ? > Unlike Pawel's case, we don't use vlan at all, maybe this is why we > see > it much less frequently than Pawel. > > Also, it is probably not specific to mlx5, as there is another report > which > is probably a non-mlx5 driver. > Cong, How often does this happen ? 
can you some how verify if the problematic packet has extra end padding after the ip payload ? It would be cool if we had a feature in kernel to store such SKB in memory when such issue occurs, and let the user dump it later (via tcpdump) and send the dump to the vendor for debug so we could just replay and see what happens. > Thanks.
Re: [PATCH bpf] libbpf: Fix compile error in libbpf_attach_type_by_name
Em Wed, Oct 31, 2018 at 12:57:18PM -0700, Andrey Ignatov escreveu: > Arnaldo Carvalho de Melo reported build error in libbpf when clang > version 3.8.1-24 (tags/RELEASE_381/final) is used: > > libbpf.c:2201:36: error: comparison of constant -22 with expression of > type 'const enum bpf_attach_type' is always false > [-Werror,-Wtautological-constant-out-of-range-compare] > if (section_names[i].attach_type == -EINVAL) > ^ ~~~ > 1 error generated. > > Fix the error by keeping "is_attachable" property of a program in a > separate struct field instead of trying to use attach_type itself. Thanks, now it builds in all the previously failing systems: # export PERF_TARBALL=http://192.168.86.4/perf/perf-4.19.0.tar.xz # dm debian:9 fedora:25 fedora:26 fedora:27 ubuntu:16.04 ubuntu:17.10 1 debian:9: Ok gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516 clang version 3.8.1-24 (tags/RELEASE_381/final) 2 fedora:25 : Ok gcc (GCC) 6.4.1 20170727 (Red Hat 6.4.1-1) clang version 3.9.1 (tags/RELEASE_391/final) 3 fedora:26 : Ok gcc (GCC) 7.3.1 20180130 (Red Hat 7.3.1-2) clang version 4.0.1 (tags/RELEASE_401/final) 4 fedora:27 : Ok gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-6) clang version 5.0.2 (tags/RELEASE_502/final) 5 ubuntu:16.04: Ok gcc (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609 clang version 3.8.0-2ubuntu4 (tags/RELEASE_380/final) 6 ubuntu:17.10: Ok gcc (Ubuntu 7.2.0-8ubuntu3.2) 7.2.0 clang version 4.0.1-6 (tags/RELEASE_401/final) # Tested-by: Arnaldo Carvalho de Melo I also have it tentatively applied to my perf/urgent branch, that I'll push upstream soon. 
- Arnaldo > Fixes: commit 956b620fcf0b ("libbpf: Introduce libbpf_attach_type_by_name") > Reported-by: Arnaldo Carvalho de Melo > Signed-off-by: Andrey Ignatov > --- > tools/lib/bpf/libbpf.c | 13 +++-- > 1 file changed, 7 insertions(+), 6 deletions(-) > > diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c > index b607be7236d3..d6e62e90e8d4 100644 > --- a/tools/lib/bpf/libbpf.c > +++ b/tools/lib/bpf/libbpf.c > @@ -2084,19 +2084,19 @@ void bpf_program__set_expected_attach_type(struct > bpf_program *prog, > prog->expected_attach_type = type; > } > > -#define BPF_PROG_SEC_IMPL(string, ptype, eatype, atype) \ > - { string, sizeof(string) - 1, ptype, eatype, atype } > +#define BPF_PROG_SEC_IMPL(string, ptype, eatype, is_attachable, atype) \ > + { string, sizeof(string) - 1, ptype, eatype, is_attachable, atype } > > /* Programs that can NOT be attached. */ > -#define BPF_PROG_SEC(string, ptype) BPF_PROG_SEC_IMPL(string, ptype, 0, > -EINVAL) > +#define BPF_PROG_SEC(string, ptype) BPF_PROG_SEC_IMPL(string, ptype, 0, 0, 0) > > /* Programs that can be attached. */ > #define BPF_APROG_SEC(string, ptype, atype) \ > - BPF_PROG_SEC_IMPL(string, ptype, 0, atype) > + BPF_PROG_SEC_IMPL(string, ptype, 0, 1, atype) > > /* Programs that must specify expected attach type at load time. */ > #define BPF_EAPROG_SEC(string, ptype, eatype) \ > - BPF_PROG_SEC_IMPL(string, ptype, eatype, eatype) > + BPF_PROG_SEC_IMPL(string, ptype, eatype, 1, eatype) > > /* Programs that can be attached but attach type can't be identified by > section > * name. Kept for backward compatibility. 
> @@ -2108,6 +2108,7 @@ static const struct { > size_t len; > enum bpf_prog_type prog_type; > enum bpf_attach_type expected_attach_type; > + int is_attachable; > enum bpf_attach_type attach_type; > } section_names[] = { > BPF_PROG_SEC("socket", BPF_PROG_TYPE_SOCKET_FILTER), > @@ -2198,7 +2199,7 @@ int libbpf_attach_type_by_name(const char *name, > for (i = 0; i < ARRAY_SIZE(section_names); i++) { > if (strncmp(name, section_names[i].sec, section_names[i].len)) > continue; > - if (section_names[i].attach_type == -EINVAL) > + if (!section_names[i].is_attachable) > return -EINVAL; > *attach_type = section_names[i].attach_type; > return 0; > -- > 2.17.1
Re: [PATCH net] openvswitch: Fix push/pop ethernet validation
On 10/31/2018 10:52 AM, Jaime Caamaño Ruiz wrote: When there are both pop and push ethernet header actions among the actions to be applied to a packet, an unexpected EINVAL (Invalid argument) error is obtained. This is due to mac_proto not being reset correctly when those actions are validated. Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2018-October/047554.html Fixes: 91820da6ae85 ("openvswitch: add Ethernet push and pop actions") Signed-off-by: Jaime Caamaño Ruiz --- net/openvswitch/flow_netlink.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c index a70097ecf33c..865ecef68196 100644 --- a/net/openvswitch/flow_netlink.c +++ b/net/openvswitch/flow_netlink.c @@ -3030,7 +3030,7 @@ static int __ovs_nla_copy_actions(struct net *net, const struct nlattr *attr, * is already present */ if (mac_proto != MAC_PROTO_NONE) return -EINVAL; - mac_proto = MAC_PROTO_NONE; + mac_proto = MAC_PROTO_ETHERNET; break; case OVS_ACTION_ATTR_POP_ETH: @@ -3038,7 +3038,7 @@ static int __ovs_nla_copy_actions(struct net *net, const struct nlattr *attr, return -EINVAL; if (vlan_tci & htons(VLAN_TAG_PRESENT)) return -EINVAL; - mac_proto = MAC_PROTO_ETHERNET; + mac_proto = MAC_PROTO_NONE; break; case OVS_ACTION_ATTR_PUSH_NSH: Thanks Jaime! Tested-by: Greg Rose Reviewed-by: Greg Rose
Re: [PATCH net 0/4] mlxsw: Enable minimum shaper on MC TCs
From: Ido Schimmel Date: Wed, 31 Oct 2018 09:56:41 + > Petr says: > > An MC-aware mode was introduced in commit 7b8195306694 ("mlxsw: > spectrum: Configure MC-aware mode on mlxsw ports"). In MC-aware mode, > BUM traffic gets a special treatment by being assigned to a separate set > of traffic classes 8..15. Pairs of TCs 0 and 8, 1 and 9, etc., are then > configured to strictly prioritize the lower-numbered ones. The intention > is to prevent BUM traffic from flooding the switch and push out all UC > traffic, which would otherwise happen, and instead give UC traffic > precedence. > > However strictly prioritizing UC traffic has the effect that UC overload > pushes out all BUM traffic, such as legitimate ARP queries. These > packets are kept in queues for a while, but under sustained UC overload, > their lifetime eventually expires and these packets are dropped. That is > detrimental to network performance as well. > > In this patchset, MC TCs (8..15) are configured with minimum shaper of > 200Mbps (a minimum permitted value) to allow a trickle of necessary > control traffic to get through. > > First in patch #1, the QEEC register is extended with fields necessary > to configure the minimum shaper. > > In patch #2, minimum shaper is enabled on TCs 8..15. > > In patches #3 and #4, first the MC-awareness test is tweaked to support > the minimum shaper, and then a new test is introduced to test that MC > traffic behaves well under UC overload. Series applied, thanks.
[PATCH bpf] libbpf: Fix compile error in libbpf_attach_type_by_name
Arnaldo Carvalho de Melo reported build error in libbpf when clang version 3.8.1-24 (tags/RELEASE_381/final) is used: libbpf.c:2201:36: error: comparison of constant -22 with expression of type 'const enum bpf_attach_type' is always false [-Werror,-Wtautological-constant-out-of-range-compare] if (section_names[i].attach_type == -EINVAL) ^ ~~~ 1 error generated. Fix the error by keeping "is_attachable" property of a program in a separate struct field instead of trying to use attach_type itself. Fixes: commit 956b620fcf0b ("libbpf: Introduce libbpf_attach_type_by_name") Reported-by: Arnaldo Carvalho de Melo Signed-off-by: Andrey Ignatov --- tools/lib/bpf/libbpf.c | 13 +++-- 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index b607be7236d3..d6e62e90e8d4 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -2084,19 +2084,19 @@ void bpf_program__set_expected_attach_type(struct bpf_program *prog, prog->expected_attach_type = type; } -#define BPF_PROG_SEC_IMPL(string, ptype, eatype, atype) \ - { string, sizeof(string) - 1, ptype, eatype, atype } +#define BPF_PROG_SEC_IMPL(string, ptype, eatype, is_attachable, atype) \ + { string, sizeof(string) - 1, ptype, eatype, is_attachable, atype } /* Programs that can NOT be attached. */ -#define BPF_PROG_SEC(string, ptype) BPF_PROG_SEC_IMPL(string, ptype, 0, -EINVAL) +#define BPF_PROG_SEC(string, ptype) BPF_PROG_SEC_IMPL(string, ptype, 0, 0, 0) /* Programs that can be attached. */ #define BPF_APROG_SEC(string, ptype, atype) \ - BPF_PROG_SEC_IMPL(string, ptype, 0, atype) + BPF_PROG_SEC_IMPL(string, ptype, 0, 1, atype) /* Programs that must specify expected attach type at load time. */ #define BPF_EAPROG_SEC(string, ptype, eatype) \ - BPF_PROG_SEC_IMPL(string, ptype, eatype, eatype) + BPF_PROG_SEC_IMPL(string, ptype, eatype, 1, eatype) /* Programs that can be attached but attach type can't be identified by section * name. Kept for backward compatibility. 
@@ -2108,6 +2108,7 @@ static const struct { size_t len; enum bpf_prog_type prog_type; enum bpf_attach_type expected_attach_type; + int is_attachable; enum bpf_attach_type attach_type; } section_names[] = { BPF_PROG_SEC("socket", BPF_PROG_TYPE_SOCKET_FILTER), @@ -2198,7 +2199,7 @@ int libbpf_attach_type_by_name(const char *name, for (i = 0; i < ARRAY_SIZE(section_names); i++) { if (strncmp(name, section_names[i].sec, section_names[i].len)) continue; - if (section_names[i].attach_type == -EINVAL) + if (!section_names[i].is_attachable) return -EINVAL; *attach_type = section_names[i].attach_type; return 0; -- 2.17.1
Re: [PATCH net] net: dsa: microchip: initialize mutex before use
From: Date: Tue, 30 Oct 2018 16:45:49 -0700 > @@ -1206,6 +1201,12 @@ int ksz_switch_register(struct ksz_device *dev) > if (dev->pdata) > dev->chip_id = dev->pdata->chip_id; > > + /* mutex is used in next function call. */ > + mutex_init(&dev->reg_mutex); > + mutex_init(&dev->stats_mutex); > + mutex_init(&dev->alu_mutex); > + mutex_init(&dev->vlan_mutex); > + Please remove this comment, as per Andrew Lunn's feedback.
[net 5/8] fm10k: bump driver version to match out-of-tree release
From: Jacob Keller The upstream and out-of-tree drivers are once again at comparable functionality. It's been a while since we updated the upstream driver version, so bump it now. Signed-off-by: Jacob Keller Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/fm10k/fm10k_main.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c b/drivers/net/ethernet/intel/fm10k/fm10k_main.c index 503bbc017792..5b2a50e5798f 100644 --- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c +++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c @@ -11,7 +11,7 @@ #include "fm10k.h" -#define DRV_VERSION"0.23.4-k" +#define DRV_VERSION"0.26.1-k" #define DRV_SUMMARY"Intel(R) Ethernet Switch Host Interface Driver" const char fm10k_driver_version[] = DRV_VERSION; char fm10k_driver_name[] = "fm10k"; -- 2.17.2
[net 6/8] ixgbe/ixgbevf: fix XFRM_ALGO dependency
Based on the original work from Arnd Bergmann. When XFRM_ALGO is not enabled, the new ixgbe IPsec code produces a link error: drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.o: In function `ixgbe_ipsec_vf_add_sa': ixgbe_ipsec.c:(.text+0x1266): undefined reference to `xfrm_aead_get_byname' Simply selecting XFRM_ALGO from here causes circular dependencies, so to fix it, we probably want this slightly more complex solution that is similar to what other drivers with XFRM offload do: A separate Kconfig symbol now controls whether we include the IPsec offload code. To keep the old behavior, this is left as 'default y'. The dependency in XFRM_OFFLOAD still causes a circular dependency but is not actually needed because this symbol is not user visible, so removing that dependency on top makes it all work. CC: Arnd Bergmann CC: Shannon Nelson Fixes: eda0333ac293 ("ixgbe: add VF IPsec management") Signed-off-by: Jeff Kirsher Tested-by: Andrew Bowers --- drivers/net/ethernet/intel/Kconfig | 18 ++ drivers/net/ethernet/intel/ixgbe/Makefile | 2 +- drivers/net/ethernet/intel/ixgbe/ixgbe.h | 8 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 6 +++--- drivers/net/ethernet/intel/ixgbevf/Makefile| 2 +- drivers/net/ethernet/intel/ixgbevf/ixgbevf.h | 4 ++-- .../net/ethernet/intel/ixgbevf/ixgbevf_main.c | 2 +- net/xfrm/Kconfig | 1 - 8 files changed, 30 insertions(+), 13 deletions(-) diff --git a/drivers/net/ethernet/intel/Kconfig b/drivers/net/ethernet/intel/Kconfig index fd3373d82a9e..59e1bc0f609e 100644 --- a/drivers/net/ethernet/intel/Kconfig +++ b/drivers/net/ethernet/intel/Kconfig @@ -200,6 +200,15 @@ config IXGBE_DCB If unsure, say N. 
+config IXGBE_IPSEC + bool "IPSec XFRM cryptography-offload acceleration" + depends on IXGBE + depends on XFRM_OFFLOAD + default y + select XFRM_ALGO + ---help--- + Enable support for IPSec offload in ixgbe.ko + config IXGBEVF tristate "Intel(R) 10GbE PCI Express Virtual Function Ethernet support" depends on PCI_MSI @@ -217,6 +226,15 @@ config IXGBEVF will be called ixgbevf. MSI-X interrupt support is required for this driver to work correctly. +config IXGBEVF_IPSEC + bool "IPSec XFRM cryptography-offload acceleration" + depends on IXGBEVF + depends on XFRM_OFFLOAD + default y + select XFRM_ALGO + ---help--- + Enable support for IPSec offload in ixgbevf.ko + config I40E tristate "Intel(R) Ethernet Controller XL710 Family support" imply PTP_1588_CLOCK diff --git a/drivers/net/ethernet/intel/ixgbe/Makefile b/drivers/net/ethernet/intel/ixgbe/Makefile index ca6b0c458e4a..4fb0d9e3f2da 100644 --- a/drivers/net/ethernet/intel/ixgbe/Makefile +++ b/drivers/net/ethernet/intel/ixgbe/Makefile @@ -17,4 +17,4 @@ ixgbe-$(CONFIG_IXGBE_DCB) += ixgbe_dcb.o ixgbe_dcb_82598.o \ ixgbe-$(CONFIG_IXGBE_HWMON) += ixgbe_sysfs.o ixgbe-$(CONFIG_DEBUG_FS) += ixgbe_debugfs.o ixgbe-$(CONFIG_FCOE:m=y) += ixgbe_fcoe.o -ixgbe-$(CONFIG_XFRM_OFFLOAD) += ixgbe_ipsec.o +ixgbe-$(CONFIG_IXGBE_IPSEC) += ixgbe_ipsec.o diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h index ec1b87cc4410..143bdd5ee2a0 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h @@ -769,9 +769,9 @@ struct ixgbe_adapter { #define IXGBE_RSS_KEY_SIZE 40 /* size of RSS Hash Key in bytes */ u32 *rss_key; -#ifdef CONFIG_XFRM_OFFLOAD +#ifdef CONFIG_IXGBE_IPSEC struct ixgbe_ipsec *ipsec; -#endif /* CONFIG_XFRM_OFFLOAD */ +#endif /* CONFIG_IXGBE_IPSEC */ /* AF_XDP zero-copy */ struct xdp_umem **xsk_umems; @@ -1008,7 +1008,7 @@ void ixgbe_store_key(struct ixgbe_adapter *adapter); void ixgbe_store_reta(struct ixgbe_adapter *adapter); s32 
ixgbe_negotiate_fc(struct ixgbe_hw *hw, u32 adv_reg, u32 lp_reg, u32 adv_sym, u32 adv_asm, u32 lp_sym, u32 lp_asm); -#ifdef CONFIG_XFRM_OFFLOAD +#ifdef CONFIG_IXGBE_IPSEC void ixgbe_init_ipsec_offload(struct ixgbe_adapter *adapter); void ixgbe_stop_ipsec_offload(struct ixgbe_adapter *adapter); void ixgbe_ipsec_restore(struct ixgbe_adapter *adapter); @@ -1036,5 +1036,5 @@ static inline int ixgbe_ipsec_vf_add_sa(struct ixgbe_adapter *adapter, u32 *mbuf, u32 vf) { return -EACCES; } static inline int ixgbe_ipsec_vf_del_sa(struct ixgbe_adapter *adapter, u32 *mbuf, u32 vf) { return -EACCES; } -#endif /* CONFIG_XFRM_OFFLOAD */ +#endif /* CONFIG_IXGBE_IPSEC */ #endif /* _IXGBE_H_ */ diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index 0049a2becd7e..113b38e0defb 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@
[net 4/8] fm10k: add missing device IDs to the upstream driver
From: Jacob Keller The device IDs for the Ethernet SDI Adapter devices were never added to the upstream driver. The IDs are already in the pci.ids database, and are supported by the out-of-tree driver. Add the device IDs now, so that the upstream driver can recognize and load these devices. Signed-off-by: Jacob Keller Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/fm10k/fm10k_pci.c | 2 ++ drivers/net/ethernet/intel/fm10k/fm10k_type.h | 2 ++ 2 files changed, 4 insertions(+) diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c index 02345d381303..e49fb51d3613 100644 --- a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c +++ b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c @@ -23,6 +23,8 @@ static const struct fm10k_info *fm10k_info_tbl[] = { */ static const struct pci_device_id fm10k_pci_tbl[] = { { PCI_VDEVICE(INTEL, FM10K_DEV_ID_PF), fm10k_device_pf }, + { PCI_VDEVICE(INTEL, FM10K_DEV_ID_SDI_FM10420_QDA2), fm10k_device_pf }, + { PCI_VDEVICE(INTEL, FM10K_DEV_ID_SDI_FM10420_DA2), fm10k_device_pf }, { PCI_VDEVICE(INTEL, FM10K_DEV_ID_VF), fm10k_device_vf }, /* required last entry */ { 0, } diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_type.h b/drivers/net/ethernet/intel/fm10k/fm10k_type.h index 3e608e493f9d..9fb9fca375e3 100644 --- a/drivers/net/ethernet/intel/fm10k/fm10k_type.h +++ b/drivers/net/ethernet/intel/fm10k/fm10k_type.h @@ -15,6 +15,8 @@ struct fm10k_hw; #define FM10K_DEV_ID_PF0x15A4 #define FM10K_DEV_ID_VF0x15A5 +#define FM10K_DEV_ID_SDI_FM10420_QDA2 0x15D0 +#define FM10K_DEV_ID_SDI_FM10420_DA2 0x15D5 #define FM10K_MAX_QUEUES 256 #define FM10K_MAX_QUEUES_PF128 -- 2.17.2
[net 3/8] fm10k: ensure completer aborts are marked as non-fatal after a resume
From: Jacob Keller VF drivers can trigger PCIe completer aborts any time they read a queue that they don't own. Even in nominal circumstances, it is not possible to prevent the VF driver from reading queues it doesn't own. VF drivers may attempt to read queues it previously owned, but which it no longer does due to a PF reset. Normally these completer aborts aren't an issue. However, on some platforms these trigger machine check errors. This is true even if we lower their severity from fatal to non-fatal. Indeed, we already have code for lowering the severity. We could attempt to mask these errors conditionally around resets, which is the most common time they would occur. However this would essentially be a race between the PF and VF drivers, and we may still occasionally see machine check exceptions on these strictly configured platforms. Instead, mask the errors entirely any time we resume VFs. By doing so, we prevent the completer aborts from being sent to the parent PCIe device, and thus these strict platforms will not upgrade them into machine check errors. Additionally, we don't lose any information by masking these errors, because we'll still report VFs which attempt to access queues via the FUM_BAD_VF_QACCESS errors. Without this change, on platforms where completer aborts cause machine check exceptions, the VF reading queues it doesn't own could crash the host system. Masking the completer abort prevents this, so we should mask it for good, and not just around a PCIe reset. Otherwise malicious or misconfigured VFs could cause the host system to crash. Because we are masking the error entirely, there is little reason to also keep setting the severity bit, so that code is also removed. 
Signed-off-by: Jacob Keller Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/fm10k/fm10k_iov.c | 48 1 file changed, 28 insertions(+), 20 deletions(-) diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_iov.c b/drivers/net/ethernet/intel/fm10k/fm10k_iov.c index 74160c2095ee..5d4f1761dc0c 100644 --- a/drivers/net/ethernet/intel/fm10k/fm10k_iov.c +++ b/drivers/net/ethernet/intel/fm10k/fm10k_iov.c @@ -303,6 +303,28 @@ void fm10k_iov_suspend(struct pci_dev *pdev) } } +static void fm10k_mask_aer_comp_abort(struct pci_dev *pdev) +{ + u32 err_mask; + int pos; + + pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ERR); + if (!pos) + return; + + /* Mask the completion abort bit in the ERR_UNCOR_MASK register, +* preventing the device from reporting these errors to the upstream +* PCIe root device. This avoids bringing down platforms which upgrade +* non-fatal completer aborts into machine check exceptions. Completer +* aborts can occur whenever a VF reads a queue it doesn't own. +*/ + pci_read_config_dword(pdev, pos + PCI_ERR_UNCOR_MASK, &err_mask); + err_mask |= PCI_ERR_UNC_COMP_ABORT; + pci_write_config_dword(pdev, pos + PCI_ERR_UNCOR_MASK, err_mask); + + mmiowb(); +} + int fm10k_iov_resume(struct pci_dev *pdev) { struct fm10k_intfc *interface = pci_get_drvdata(pdev); @@ -318,6 +340,12 @@ int fm10k_iov_resume(struct pci_dev *pdev) if (!iov_data) return -ENOMEM; + /* Lower severity of completer abort error reporting as +* the VFs can trigger this any time they read a queue +* that they don't own. 
+*/ + fm10k_mask_aer_comp_abort(pdev); + /* allocate hardware resources for the VFs */ hw->iov.ops.assign_resources(hw, num_vfs, num_vfs); @@ -461,20 +489,6 @@ void fm10k_iov_disable(struct pci_dev *pdev) fm10k_iov_free_data(pdev); } -static void fm10k_disable_aer_comp_abort(struct pci_dev *pdev) -{ - u32 err_sev; - int pos; - - pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ERR); - if (!pos) - return; - - pci_read_config_dword(pdev, pos + PCI_ERR_UNCOR_SEVER, &err_sev); - err_sev &= ~PCI_ERR_UNC_COMP_ABORT; - pci_write_config_dword(pdev, pos + PCI_ERR_UNCOR_SEVER, err_sev); -} - int fm10k_iov_configure(struct pci_dev *pdev, int num_vfs) { int current_vfs = pci_num_vf(pdev); @@ -496,12 +510,6 @@ int fm10k_iov_configure(struct pci_dev *pdev, int num_vfs) /* allocate VFs if not already allocated */ if (num_vfs && num_vfs != current_vfs) { - /* Disable completer abort error reporting as -* the VFs can trigger this any time they read a queue -* that they don't own. -*/ - fm10k_disable_aer_comp_abort(pdev); - err = pci_enable_sriov(pdev, num_vfs); if (err) { dev_err(&pdev->dev, -- 2.17.2
[net 0/8][pull request] Intel Wired LAN Driver Updates 2018-10-31
This series contains a collection of various fixes. Miroslav Lichvar from Red Hat (or should I say IBM now?) updates the PHC timecounter interval for igb so that it gets updated at least once every 550 seconds. Ngai-Mint provides a fix for fm10k to prevent a soft lockup or system crash by adding a new condition to determine if the SM mailbox is in the correct state before proceeding. Jake provides several fm10k fixes: the first marks completer aborts as non-fatal, since on some platforms they trigger machine check errors; he also added missing device IDs to the in-kernel driver and, due to the recent fixes, bumped the driver version. I (Jeff Kirsher) fixed an XFRM_ALGO dependency for both ixgbe and ixgbevf. This fix was based on the original work from Arnd Bergmann, which only fixed ixgbe. Mitch provides a fix for i40e/avf to update the status codes, which resolves a mismatch between i40e and the iavf driver, which also supports the ice LAN driver. Radoslaw fixes ixgbe, where the driver was logging a message about spoofed packets detected when a VF is re-started with a different MAC address. 
The following are changes since commit a6b3a3fa042343e29ffaf9169f5ba3c819d4f9a2: net: mvpp2: Fix affinity hint allocation and are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue 10GbE Jacob Keller (3): fm10k: ensure completer aborts are marked as non-fatal after a resume fm10k: add missing device IDs to the upstream driver fm10k: bump driver version to match out-of-tree release Jeff Kirsher (1): ixgbe/ixgbevf: fix XFRM_ALGO dependency Miroslav Lichvar (1): igb: shorten maximum PHC timecounter update interval Mitch Williams (1): i40e: Update status codes Ngai-Mint Kwan (1): fm10k: fix SM mailbox full condition Radoslaw Tyl (1): ixgbe: fix MAC anti-spoofing filter after VFLR drivers/net/ethernet/intel/Kconfig| 18 +++ drivers/net/ethernet/intel/fm10k/fm10k_iov.c | 51 +++ drivers/net/ethernet/intel/fm10k/fm10k_main.c | 2 +- drivers/net/ethernet/intel/fm10k/fm10k_pci.c | 2 + drivers/net/ethernet/intel/fm10k/fm10k_type.h | 2 + .../ethernet/intel/i40e/i40e_virtchnl_pf.c| 2 +- drivers/net/ethernet/intel/igb/igb_ptp.c | 8 ++- drivers/net/ethernet/intel/ixgbe/Makefile | 2 +- drivers/net/ethernet/intel/ixgbe/ixgbe.h | 8 +-- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 6 +-- .../net/ethernet/intel/ixgbe/ixgbe_sriov.c| 4 +- drivers/net/ethernet/intel/ixgbevf/Makefile | 2 +- drivers/net/ethernet/intel/ixgbevf/ixgbevf.h | 4 +- .../net/ethernet/intel/ixgbevf/ixgbevf_main.c | 2 +- include/linux/avf/virtchnl.h | 12 +++-- net/xfrm/Kconfig | 1 - 16 files changed, 85 insertions(+), 41 deletions(-) -- 2.17.2
[net 8/8] ixgbe: fix MAC anti-spoofing filter after VFLR
From: Radoslaw Tyl This change resolves a driver bug where the driver is logging a message that says "Spoofed packets detected". This can occur on the PF (host) when a VF has VLAN+MACVLAN enabled and is re-started with a different MAC address. MAC and VLAN anti-spoofing filters are to be enabled together. Signed-off-by: Radoslaw Tyl Tested-by: Andrew Bowers Acked-by: Piotr Skajewski Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c index af25a8fffeb8..5dacfc870259 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c @@ -722,8 +722,10 @@ static inline void ixgbe_vf_reset_event(struct ixgbe_adapter *adapter, u32 vf) ixgbe_set_vmvir(adapter, vfinfo->pf_vlan, adapter->default_up, vf); - if (vfinfo->spoofchk_enabled) + if (vfinfo->spoofchk_enabled) { hw->mac.ops.set_vlan_anti_spoofing(hw, true, vf); + hw->mac.ops.set_mac_anti_spoofing(hw, true, vf); + } } /* reset multicast table array for vf */ -- 2.17.2
[net 1/8] igb: shorten maximum PHC timecounter update interval
From: Miroslav Lichvar The timecounter needs to be updated at least once per ~550 seconds in order to avoid a 40-bit SYSTIM timestamp being misinterpreted as an old timestamp. Since commit 500462a9d ("timers: Switch to a non-cascading wheel"), scheduling of delayed work seems to be less accurate and a requested delay of 540 seconds may actually be longer than 550 seconds. Shorten the delay to 480 seconds to be sure the timecounter is updated in time. This fixes an issue with HW timestamps on 82580/I350/I354 being off by ~1100 seconds for a few seconds every ~9 minutes. Cc: Jacob Keller Cc: Richard Cochran Cc: Thomas Gleixner Signed-off-by: Miroslav Lichvar Tested-by: Aaron Brown Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/igb/igb_ptp.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/intel/igb/igb_ptp.c b/drivers/net/ethernet/intel/igb/igb_ptp.c index 9f4d700e09df..29ced6b74d36 100644 --- a/drivers/net/ethernet/intel/igb/igb_ptp.c +++ b/drivers/net/ethernet/intel/igb/igb_ptp.c @@ -51,9 +51,15 @@ * * The 40 bit 82580 SYSTIM overflows every * 2^40 * 10^-9 / 60 = 18.3 minutes. + * + * SYSTIM is converted to real time using a timecounter. As + * timecounter_cyc2time() allows old timestamps, the timecounter + * needs to be updated at least once per half of the SYSTIM interval. + * Scheduling of delayed work is not very accurate, so we aim for 8 + * minutes to be sure the actual interval is shorter than 9.16 minutes. */ -#define IGB_SYSTIM_OVERFLOW_PERIOD (HZ * 60 * 9) +#define IGB_SYSTIM_OVERFLOW_PERIOD (HZ * 60 * 8) #define IGB_PTP_TX_TIMEOUT (HZ * 15) #define INCPERIOD_82576BIT(E1000_TIMINCA_16NS_SHIFT) #define INCVALUE_82576_MASKGENMASK(E1000_TIMINCA_16NS_SHIFT - 1, 0) -- 2.17.2
Re: [Patch V5 net 00/11] Bugfix for the HNS3 driver
From: Huazhong Tan Date: Tue, 30 Oct 2018 21:50:42 +0800 > This patch series include bugfix for the HNS3 ethernet > controller driver. > > Change log: > V4->V5: > Fixes comments from Joe Perches & Sergei Shtylyov > V3->V4: > Fixes comments from Sergei Shtylyov > V2->V3: > Fixes comments from Sergei Shtylyov > V1->V2: > Fixes the compilation break reported by kbuild test robot > http://patchwork.ozlabs.org/patch/989818/ Series applied.
[net 2/8] fm10k: fix SM mailbox full condition
From: Ngai-Mint Kwan Current condition will always incorrectly report a full SM mailbox if an IES API application is not running. Due to this, the "fm10k_service_task" will be infinitely queued into the driver's workqueue. This, in turn, will cause a "kworker" thread to report 100% CPU utilization and might cause "soft lockup" events or system crashes. To fix this issue, a new condition is added to determine if the SM mailbox is in the correct state of FM10K_STATE_OPEN before proceeding. In other words, an instance of the IES API must be running. If there is, the remainder of the flow stays the same which is to determine if the SM mailbox capacity has been exceeded or not and take appropriate action. Signed-off-by: Ngai-Mint Kwan Signed-off-by: Jacob Keller Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/fm10k/fm10k_iov.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_iov.c b/drivers/net/ethernet/intel/fm10k/fm10k_iov.c index e707d717012f..74160c2095ee 100644 --- a/drivers/net/ethernet/intel/fm10k/fm10k_iov.c +++ b/drivers/net/ethernet/intel/fm10k/fm10k_iov.c @@ -244,7 +244,8 @@ s32 fm10k_iov_mbx(struct fm10k_intfc *interface) } /* guarantee we have free space in the SM mailbox */ - if (!hw->mbx.ops.tx_ready(&hw->mbx, FM10K_VFMBX_MSG_MTU)) { + if (hw->mbx.state == FM10K_STATE_OPEN && + !hw->mbx.ops.tx_ready(&hw->mbx, FM10K_VFMBX_MSG_MTU)) { /* keep track of how many times this occurs */ interface->hw_sm_mbx_full++; -- 2.17.2
[net 7/8] i40e: Update status codes
From: Mitch Williams Add a few new status codes which will be used by the ice driver, and rename a few to make them more consistent. Error codes are mapped to similar values as in i40e_status.h, so as to be compatible with older VF drivers not using this status enum. Signed-off-by: Mitch Williams Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 2 +- include/linux/avf/virtchnl.h | 12 +--- 2 files changed, 10 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c index 81b0e1f8d14b..ac5698ed0b11 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c +++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c @@ -3674,7 +3674,7 @@ int i40e_vc_process_vf_msg(struct i40e_pf *pf, s16 vf_id, u32 v_opcode, dev_err(&pf->pdev->dev, "Invalid message from VF %d, opcode %d, len %d\n", local_vf_id, v_opcode, msglen); switch (ret) { - case VIRTCHNL_ERR_PARAM: + case VIRTCHNL_STATUS_ERR_PARAM: return -EPERM; default: return -EINVAL; diff --git a/include/linux/avf/virtchnl.h b/include/linux/avf/virtchnl.h index 2c9756bd9c4c..b2488055fd1d 100644 --- a/include/linux/avf/virtchnl.h +++ b/include/linux/avf/virtchnl.h @@ -62,13 +62,19 @@ /* Error Codes */ enum virtchnl_status_code { VIRTCHNL_STATUS_SUCCESS = 0, - VIRTCHNL_ERR_PARAM = -5, + VIRTCHNL_STATUS_ERR_PARAM = -5, + VIRTCHNL_STATUS_ERR_NO_MEMORY = -18, VIRTCHNL_STATUS_ERR_OPCODE_MISMATCH = -38, VIRTCHNL_STATUS_ERR_CQP_COMPL_ERROR = -39, VIRTCHNL_STATUS_ERR_INVALID_VF_ID = -40, - VIRTCHNL_STATUS_NOT_SUPPORTED = -64, + VIRTCHNL_STATUS_ERR_ADMIN_QUEUE_ERROR = -53, + VIRTCHNL_STATUS_ERR_NOT_SUPPORTED = -64, }; +/* Backward compatibility */ +#define VIRTCHNL_ERR_PARAM VIRTCHNL_STATUS_ERR_PARAM +#define VIRTCHNL_STATUS_NOT_SUPPORTED VIRTCHNL_STATUS_ERR_NOT_SUPPORTED + #define VIRTCHNL_LINK_SPEED_100MB_SHIFT0x1 #define VIRTCHNL_LINK_SPEED_1000MB_SHIFT 0x2 #define 
VIRTCHNL_LINK_SPEED_10GB_SHIFT 0x3 @@ -831,7 +837,7 @@ virtchnl_vc_validate_vf_msg(struct virtchnl_version_info *ver, u32 v_opcode, case VIRTCHNL_OP_EVENT: case VIRTCHNL_OP_UNKNOWN: default: - return VIRTCHNL_ERR_PARAM; + return VIRTCHNL_STATUS_ERR_PARAM; } /* few more checks */ if (err_msg_format || valid_len != msglen) -- 2.17.2
Re: [PATCH net] net/mlx5e: fix csum adjustments caused by RXFCS
From: Eric Dumazet Date: Tue, 30 Oct 2018 00:57:25 -0700 > As shown by Dmitris, we need to use csum_block_add() instead of csum_add() > when adding the FCS contribution to skb csum. > > Before 4.18 (more exactly commit 88078d98d1bb "net: pskb_trim_rcsum() > and CHECKSUM_COMPLETE are friends"), the whole skb csum was thrown away, > so RXFCS changes were ignored. > > Then before commit d55bef5059dd ("net: fix pskb_trim_rcsum_slow() with > odd trim offset") both mlx5 and pskb_trim_rcsum_slow() bugs were canceling > each other. > > Now we fixed pskb_trim_rcsum_slow() we need to fix mlx5. > > Note that this patch also rewrites mlx5e_get_fcs() to : > > - Use skb_header_pointer() instead of reinventing it. > - Use __get_unaligned_cpu32() to avoid possible non aligned accesses > as Dmitris pointed out. > > Fixes: 902a545904c7 ("net/mlx5e: When RXFCS is set, add FCS data into > checksum calculation") > Reported-by: Paweł Staszewski > Signed-off-by: Eric Dumazet Applied and queued up for -stable.
Re: [PATCH 2/2] net: drop a space before tabs
From: Bo YU Date: Mon, 29 Oct 2018 23:42:10 -0400 > Fix a warning from checkpatch.pl:'please no space before tabs' > in include/net/af_unix.h > > Signed-off-by: Bo YU Applied.
Re: [PATCH 1/2] net: add an identifier name for 'struct sock *'
From: Bo YU Date: Mon, 29 Oct 2018 23:42:09 -0400 > Fix a warning from checkpatch: > function definition argument 'struct sock *' should also have an > identifier name in include/net/af_unix.h. > > Signed-off-by: Bo YU Applied.
Re: [Patch net] net: make pskb_trim_rcsum_slow() robust
From: Cong Wang Date: Mon, 29 Oct 2018 17:35:15 -0700 > Most callers of pskb_trim_rcsum() simply drops the skb when > it fails, however, ip_check_defrag() still continues to pass > the skb up to stack. In that case, we should restore its previous > csum if __pskb_trim() fails. > > Found this during code review. > > Fixes: 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are > friends") > Cc: Eric Dumazet > Signed-off-by: Cong Wang I kind of agree with Eric that we should make all callers, including ip_check_defrag(), fail just as with any memory allocation failure.
[PATCH iproute2 v2] Use libbsd for strlcpy if available
If libc does not provide strlcpy check for libbsd with pkg-config to avoid relying on inline version. Signed-off-by: Luca Boccassi --- Changed from -include /usr/include/bsd/string.h hack to HAVE_LIBBSD and proper includes in each file that uses strlcpy. The hack causes a compiler warning as ip/ipnetns.c defines _ATFILE_SOURCE without a value, but system headers use 1, so there's a mismatch. configure | 11 +-- genl/ctrl.c | 3 +++ ip/iplink.c | 3 +++ ip/ipnetns.c | 3 +++ ip/iproute_lwtunnel.c | 3 +++ ip/ipvrf.c| 3 +++ ip/ipxfrm.c | 3 +++ ip/tunnel.c | 3 +++ ip/xfrm_state.c | 3 +++ lib/bpf.c | 3 +++ lib/fs.c | 3 +++ lib/inet_proto.c | 3 +++ misc/ss.c | 3 +++ tc/em_ipset.c | 3 +++ tc/m_pedit.c | 3 +++ 15 files changed, 51 insertions(+), 2 deletions(-) diff --git a/configure b/configure index 744d6282..c5655978 100755 --- a/configure +++ b/configure @@ -330,8 +330,15 @@ EOF then echo "no" else - echo 'CFLAGS += -DNEED_STRLCPY' >>$CONFIG - echo "yes" + if ${PKG_CONFIG} libbsd --exists + then + echo 'CFLAGS += -DHAVE_LIBBSD' `${PKG_CONFIG} libbsd --cflags` >>$CONFIG + echo 'LDLIBS +=' `${PKG_CONFIG} libbsd --libs` >> $CONFIG + echo "no" + else + echo 'CFLAGS += -DNEED_STRLCPY' >>$CONFIG + echo "yes" + fi fi rm -f $TMPDIR/strtest.c $TMPDIR/strtest } diff --git a/genl/ctrl.c b/genl/ctrl.c index 616a..fef6aaa9 100644 --- a/genl/ctrl.c +++ b/genl/ctrl.c @@ -18,6 +18,9 @@ #include #include #include +#ifdef HAVE_LIBBSD +#include +#endif #include "utils.h" #include "genl_utils.h" diff --git a/ip/iplink.c b/ip/iplink.c index b5519201..067f5409 100644 --- a/ip/iplink.c +++ b/ip/iplink.c @@ -24,6 +24,9 @@ #include #include #include +#ifdef HAVE_LIBBSD +#include +#endif #include #include #include diff --git a/ip/ipnetns.c b/ip/ipnetns.c index 0eac18cf..da019d76 100644 --- a/ip/ipnetns.c +++ b/ip/ipnetns.c @@ -8,6 +8,9 @@ #include #include #include +#ifdef HAVE_LIBBSD +#include +#endif #include #include #include diff --git a/ip/iproute_lwtunnel.c b/ip/iproute_lwtunnel.c index 
8f497015..2285bc1d 100644 --- a/ip/iproute_lwtunnel.c +++ b/ip/iproute_lwtunnel.c @@ -16,6 +16,9 @@ #include #include #include +#ifdef HAVE_LIBBSD +#include +#endif #include #include #include diff --git a/ip/ipvrf.c b/ip/ipvrf.c index 8a6b7f97..8572b4f2 100644 --- a/ip/ipvrf.c +++ b/ip/ipvrf.c @@ -21,6 +21,9 @@ #include #include #include +#ifdef HAVE_LIBBSD +#include +#endif #include #include #include diff --git a/ip/ipxfrm.c b/ip/ipxfrm.c index 17ab4abe..b02f30a6 100644 --- a/ip/ipxfrm.c +++ b/ip/ipxfrm.c @@ -28,6 +28,9 @@ #include #include #include +#ifdef HAVE_LIBBSD +#include +#endif #include #include #include diff --git a/ip/tunnel.c b/ip/tunnel.c index d0d55f37..73abb2e2 100644 --- a/ip/tunnel.c +++ b/ip/tunnel.c @@ -24,6 +24,9 @@ #include #include +#ifdef HAVE_LIBBSD +#include +#endif #include #include #include diff --git a/ip/xfrm_state.c b/ip/xfrm_state.c index e8c01746..18e0c6fa 100644 --- a/ip/xfrm_state.c +++ b/ip/xfrm_state.c @@ -27,6 +27,9 @@ #include #include #include +#ifdef HAVE_LIBBSD +#include +#endif #include #include "utils.h" #include "xfrm.h" diff --git a/lib/bpf.c b/lib/bpf.c index 45f279fa..35d7c45a 100644 --- a/lib/bpf.c +++ b/lib/bpf.c @@ -15,6 +15,9 @@ #include #include #include +#ifdef HAVE_LIBBSD +#include +#endif #include #include #include diff --git a/lib/fs.c b/lib/fs.c index 86efd4ed..af36bea0 100644 --- a/lib/fs.c +++ b/lib/fs.c @@ -20,6 +20,9 @@ #include #include #include +#ifdef HAVE_LIBBSD +#include +#endif #include #include diff --git a/lib/inet_proto.c b/lib/inet_proto.c index 0836a4c9..b379d8f8 100644 --- a/lib/inet_proto.c +++ b/lib/inet_proto.c @@ -18,6 +18,9 @@ #include #include #include +#ifdef HAVE_LIBBSD +#include +#endif #include "rt_names.h" #include "utils.h" diff --git a/misc/ss.c b/misc/ss.c index 4d12fb5d..c472fbd9 100644 --- a/misc/ss.c +++ b/misc/ss.c @@ -19,6 +19,9 @@ #include #include #include +#ifdef HAVE_LIBBSD +#include +#endif #include #include #include diff --git a/tc/em_ipset.c b/tc/em_ipset.c index 
48b287f5..550b2101 100644 --- a/tc/em_ipset.c +++ b/tc/em_ipset.c @@ -20,6 +20,9 @@ #include #include #include +#ifdef HAVE_LIBBSD +#include +#endif #include #include diff --git a/tc/m_pedit.c b/tc/m_pedit.c index 2aeb56d9..baacc80d 100644 --- a/tc/m_pedit.c +++ b/tc/m_pedit.c @@ -23,6 +23,9 @@ #include #include #include +#ifdef HAVE_LIBBSD +#include +#endif #include #include "utils.h" #include "tc_util.h" -- 2.19.1
Re: [PATCH iproute2] Use libbsd for strlcpy if available
On Wed, 2018-10-31 at 08:09 -0700, Stephen Hemminger wrote: > On Mon, 29 Oct 2018 10:46:50 + > Luca Boccassi wrote: > > > If libc does not provide strlcpy check for libbsd with pkg-config > > to > > avoid relying on inline version. > > > > Signed-off-by: Luca Boccassi > > --- > > This allows distro maintainers to be able to choose to reduce > > duplication and let this code be maintained in one place, in the > > external library. > > > > I like the idea, but it causes warnings on Debian testing, and maybe > other distros. > > ipnetns.c:2: warning: "_ATFILE_SOURCE" redefined > #define _ATFILE_SOURCE > > In file included from /usr/include/x86_64-linux-gnu/bits/libc-header- > start.h:33, > from /usr/include/string.h:26, > from /usr/include/bsd/string.h:30, > from : > /usr/include/features.h:326: note: this is the location of the > previous definition > # define _ATFILE_SOURCE 1 > > > Please figure out how to handle this and resubmit. SUSE open build > service might > also work to test multiple distro's Ah missed that. That happens because features.h defines _ATFILE_SOURCE to 1, but ip/ipnetns.c defines it without a value. According to the spec either way doesn't change the result. This happens because of the quick hack of using -include /usr/include/bsd/string.h which was, well, a quick hack and didn't require to add the include manually everywhere strlcpy was used, even in the future. But it has side effects like this. So I'll send v2 with a less hacky fix, which means defining HAVE_LIBBSD in configure and doing #ifdef HAVE_LIBBSD #include in every file. It also means that this needs to be done for every future use of strlcpy, or the build with libbsd will break. If you or David prefer the hacky way, I can instead send a v3 that does the quick hack, and also changes _ATFILE_SOURCE to 1 so that there is no complaint from the compiler, as the values will be the same. -- Kind regards, Luca Boccassi
[PATCH net] openvswitch: Fix push/pop ethernet validation
When there are both pop and push ethernet header actions among the actions to be applied to a packet, an unexpected EINVAL (Invalid argument) error is obtained. This is due to mac_proto not being reset correctly when those actions are validated. Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2018-October/047554.html Fixes: 91820da6ae85 ("openvswitch: add Ethernet push and pop actions") Signed-off-by: Jaime Caamaño Ruiz --- net/openvswitch/flow_netlink.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c index a70097ecf33c..865ecef68196 100644 --- a/net/openvswitch/flow_netlink.c +++ b/net/openvswitch/flow_netlink.c @@ -3030,7 +3030,7 @@ static int __ovs_nla_copy_actions(struct net *net, const struct nlattr *attr, * is already present */ if (mac_proto != MAC_PROTO_NONE) return -EINVAL; - mac_proto = MAC_PROTO_NONE; + mac_proto = MAC_PROTO_ETHERNET; break; case OVS_ACTION_ATTR_POP_ETH: @@ -3038,7 +3038,7 @@ static int __ovs_nla_copy_actions(struct net *net, const struct nlattr *attr, return -EINVAL; if (vlan_tci & htons(VLAN_TAG_PRESENT)) return -EINVAL; - mac_proto = MAC_PROTO_ETHERNET; + mac_proto = MAC_PROTO_NONE; break; case OVS_ACTION_ATTR_PUSH_NSH: -- 2.16.4
Re: [PATCH net-next 0/8] net: sched: pie: align PIE implementation with RFC 8033
net-next is closed, please resubmit this when net-next opens back up. Thank you.
Re: libbpf build failure on debian:9 with clang
Arnaldo Carvalho de Melo [Wed, 2018-10-31 07:12 -0700]: > 1740.66 debian:9 : FAIL gcc (Debian > 6.3.0-18+deb9u1) 6.3.0 20170516 > > The failure was with clang tho: > > clang version 3.8.1-24 (tags/RELEASE_381/final) > > With: > > gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1) > > it built without any warnings/errors. > > CC /tmp/build/perf/libbpf.o > libbpf.c:2201:36: error: comparison of constant -22 with expression of type > 'const enum bpf_attach_type' is always false > [-Werror,-Wtautological-constant-out-of-range-compare] > if (section_names[i].attach_type == -EINVAL) > ^ ~~~ > 1 error generated. Hi Arnaldo, I have a clang version that I can reproduce this error with. Working on patch. Thank you for report! > CC /tmp/build/perf/help.o > mv: cannot stat '/tmp/build/perf/.libbpf.o.tmp': No such file or directory > /git/linux/tools/build/Makefile.build:96: recipe for target > '/tmp/build/perf/libbpf.o' failed > make[4]: *** [/tmp/build/perf/libbpf.o] Error 1 > > This is the cset: > > commit 956b620fcf0b64de403cd26a56bc41e6e4826ea6 > Author: Andrey Ignatov > Date: Wed Sep 26 15:24:53 2018 -0700 > > libbpf: Introduce libbpf_attach_type_by_name > > > > Tests are continuing, so far: > >143.53 alpine:3.4: Ok gcc (Alpine 5.3.0) 5.3.0 >258.62 alpine:3.5: Ok gcc (Alpine 6.2.1) 6.2.1 > 20160822 >351.62 alpine:3.6: Ok gcc (Alpine 6.3.0) 6.3.0 >451.68 alpine:3.7: Ok gcc (Alpine 6.4.0) 6.4.0 >549.38 alpine:3.8: Ok gcc (Alpine 6.4.0) 6.4.0 >679.07 alpine:edge : Ok gcc (Alpine 6.4.0) 6.4.0 >763.35 amazonlinux:1 : Ok gcc (GCC) 4.8.5 20150623 > (Red Hat 4.8.5-28) >859.65 amazonlinux:2 : Ok gcc (GCC) 7.3.1 20180303 > (Red Hat 7.3.1-5) >947.39 android-ndk:r12b-arm : Ok arm-linux-androideabi-gcc > (GCC) 4.9.x 20150123 (prerelease) > 1050.64 android-ndk:r15c-arm : Ok arm-linux-androideabi-gcc > (GCC) 4.9.x 20150123 (prerelease) > 1128.75 centos:5 : Ok gcc (GCC) 4.1.2 20080704 > (Red Hat 4.1.2-55) > 1233.26 centos:6 : Ok gcc (GCC) 4.4.7 20120313 > (Red Hat 4.4.7-23) > 1343.16 
centos:7 : Ok gcc (GCC) 4.8.5 20150623 > (Red Hat 4.8.5-28) > 1473.61 clearlinux:latest : FAIL gcc (Clear Linux OS for > Intel Architecture) 8.2.1 20180502 > 1545.56 debian:7 : Ok gcc (Debian 4.7.2-5) 4.7.2 > 1645.53 debian:8 : Ok gcc (Debian > 4.9.2-10+deb8u1) 4.9.2 > 1740.66 debian:9 : FAIL gcc (Debian > 6.3.0-18+deb9u1) 6.3.0 20170516 > 18 113.19 debian:experimental : Ok gcc (Debian 8.2.0-8) 8.2.0 > 1941.48 debian:experimental-x-arm64 : Ok aarch64-linux-gnu-gcc > (Debian 8.2.0-7) 8.2.0 > 2041.51 debian:experimental-x-mips: Ok mips-linux-gnu-gcc (Debian > 8.2.0-7) 8.2.0 > 2140.09 debian:experimental-x-mips64 : Ok mips64-linux-gnuabi64-gcc > (Debian 8.1.0-12) 8.1.0 > 2242.17 debian:experimental-x-mipsel : Ok mipsel-linux-gnu-gcc > (Debian 8.2.0-7) 8.2.0 > 2340.02 fedora:20 : Ok gcc (GCC) 4.8.3 20140911 > (Red Hat 4.8.3-7) > 2445.47 fedora:21 : Ok gcc (GCC) 4.9.2 20150212 > (Red Hat 4.9.2-6) > 2541.64 fedora:22 : Ok gcc (GCC) 5.3.1 20160406 > (Red Hat 5.3.1-6) > 2643.60 fedora:23 : Ok gcc (GCC) 5.3.1 20160406 > (Red Hat 5.3.1-6) > 2744.04 fedora:24 : Ok gcc (GCC) 6.3.1 20161221 > (Red Hat 6.3.1-1) > 2837.21 fedora:24-x-ARC-uClibc: Ok arc-linux-gcc (ARCompact > ISA Linux uClibc toolchain 2017.09-rc2) 7.1.1 20170710 > > The problem with clearlinux is unrelated: > > clang-7: error: unknown argument: '-fno-semantic-interposition' > clang-7: error: unsupported argument '4' to option 'flto=' > clang-7: error: optimization flag '-ffat-lto-objects' is not supported > [-Werror,-Wignored-optimization-argument] > -- Andrey Ignatov
Re: [PATCH net] rtnetlink: invoke 'cb->done' destructor before 'cb->args' reset
On 10/31/18 10:55 AM, David Ahern wrote: > I think the simplest fix for 4.20 is to break the loop if ret is non-0 - > restore the previous behavior. that is the only recourse. It has to bail if ret is non-0. Do you want to send a patch with that fix?
RE: [RFC PATCH 3/4] igb: add support for extended PHC gettime
> -Original Message- > From: Miroslav Lichvar [mailto:mlich...@redhat.com] > Sent: Wednesday, October 31, 2018 2:40 AM > To: Richard Cochran > Cc: netdev@vger.kernel.org; intel-wired-...@lists.osuosl.org; Keller, Jacob E > > Subject: Re: [RFC PATCH 3/4] igb: add support for extended PHC gettime > > On Tue, Oct 30, 2018 at 07:29:16PM -0700, Richard Cochran wrote: > > On Fri, Oct 26, 2018 at 06:27:41PM +0200, Miroslav Lichvar wrote: > > > +static int igb_ptp_gettimex(struct ptp_clock_info *ptp, > > > + struct ptp_system_timestamp *sts) > > > +{ > > > + struct igb_adapter *igb = container_of(ptp, struct igb_adapter, > > > +ptp_caps); > > > + struct e1000_hw *hw = &igb->hw; > > > + unsigned long flags; > > > + u32 lo, hi; > > > + u64 ns; > > > + > > > + spin_lock_irqsave(&igb->tmreg_lock, flags); > > > + > > > + /* 82576 doesn't have SYSTIMR */ > > > + if (igb->hw.mac.type == e1000_82576) { > > > > Instead of if/then/else, can't you follow the pattern of providing > > different function flavors ... > > I can. I was just trying to minimize the amount of triplicated code. > In the next version I'll add a patch to deprecate the old gettime > functions, as Jacob suggested, and replace them with the extended > versions, so the amount of code will not change that much. > Excellent. -Jake > Thanks, > > -- > Miroslav Lichvar
RE: [RFC PATCH 4/4] ixgbe: add support for extended PHC gettime
> -Original Message- > From: Richard Cochran [mailto:richardcoch...@gmail.com] > Sent: Wednesday, October 31, 2018 7:40 AM > To: Miroslav Lichvar > Cc: Keller, Jacob E ; netdev@vger.kernel.org; > intel-wired- > l...@lists.osuosl.org > Subject: Re: [RFC PATCH 4/4] ixgbe: add support for extended PHC gettime > > On Mon, Oct 29, 2018 at 02:31:09PM +0100, Miroslav Lichvar wrote: > > I think there could be a flag in ptp_system_timestamp, or a parameter > > of gettimex64(), which would enable/disable reading of the system > > clock. > > I'm not a fan of functions that change their behavior based on flags > in their input parameters. > > Thanks, > Richard Neither am I. I do however want to find a solution that avoids having drivers needlessly duplicate almost the same functionality. Thanks, Jake
Re: [PATCH net] rtnetlink: invoke 'cb->done' destructor before 'cb->args' reset
On 10/31/18 12:42 AM, Alexey Kodanev wrote: > cb->args[2] can store the pointer to the struct fib6_walker, > allocated in inet6_dump_fib(). On the next loop iteration in > rtnl_dump_all(), 'memset(&cb, 0, sizeof(cb->args))' can reset > that pointer, leaking the memory [1]. > > Fix it by calling cb->done, if it is set, before filling 'cb->args' > with zeros. > > Looks like the recent changes in rtnl_dump_all() contributed to > the appearance of this kmemleak [1], commit c63586dc9b3e ("net: > rtnl_dump_all needs to propagate error from dumpit function") > breaks the loop only on an error now. > ... It is more efficient to keep going. I think the simplest fix for 4.20 is to break the loop if ret is non-0 - restore the previous behavior. For net-next I think the done callback is not needed for ipv6; I think there is a simpler way to do it.
Re: [PATCH net-next 6/8] net: sched: pie: add mechanism to set PIE active/inactive
On Wed, 31 Oct 2018 21:49:30 +0530 Leslie Monis wrote: > diff --git a/net/sched/sch_pie.c b/net/sched/sch_pie.c > index c84e91e..b68b367 100644 > --- a/net/sched/sch_pie.c > +++ b/net/sched/sch_pie.c > @@ -57,6 +57,7 @@ struct pie_vars { > psched_time_t dq_tstamp;/* drain rate */ > u32 avg_dq_rate;/* bytes per pschedtime tick,scaled */ > u32 qlen_old; /* in bytes */ > + bool active;/* inactive/active */ > }; Current Linux best practice is to not use bool for true/false values in a structure. This is because the size of bool is not obvious and can cause padding. Recommend using u8 instead.
Re: [PATCH net-next 5/8] net: sched: pie: add more conditions to auto-tune alpha and beta
On Wed, 31 Oct 2018 21:49:29 +0530 Leslie Monis wrote: > From: "Mohit P. Tahiliani" > > The update in drop probability depends on the parameters > alpha and beta, which in turn reflect the current congestion > level. However, the previous if-else cases were recommended > when the supported bandwidth was up to 12 Mbps but, current > data links support a much higher bandwidth, and the > requirement for more bandwidth is in never-ending demand. > Hence, RFC 8033 suggests using more if-else cases for better > fine-tuning of parameters alpha and beta in order to control > the congestion as much as possible. > > Signed-off-by: Mohit P. Tahiliani > Signed-off-by: Dhaval Khandla > Signed-off-by: Hrishikesh Hiraskar > Signed-off-by: Manish Kumar B > Signed-off-by: Sachin D. Patil > Signed-off-by: Leslie Monis > --- > net/sched/sch_pie.c | 26 +++--- > 1 file changed, 23 insertions(+), 3 deletions(-) > > diff --git a/net/sched/sch_pie.c b/net/sched/sch_pie.c > index f4e189a..c84e91e 100644 > --- a/net/sched/sch_pie.c > +++ b/net/sched/sch_pie.c > @@ -343,10 +343,30 @@ static void calculate_probability(struct Qdisc *sch) >* appropriately 2) scaling down by 16 to come to 0-2 range. >* Please see paper for details. >* > - * We scale alpha and beta differently depending on whether we are in > - * light, medium or high dropping mode. > + * We scale alpha and beta differently depending on how heavy the > + * congestion is. 
>*/ > - if (q->vars.prob < MAX_PROB / 100) { > + if (q->vars.prob < MAX_PROB / 100) { > + alpha = > + (q->params.alpha * (MAX_PROB / PSCHED_TICKS_PER_SEC)) >> 15; > + beta = > + (q->params.beta * (MAX_PROB / PSCHED_TICKS_PER_SEC)) >> 15; > + } else if (q->vars.prob < MAX_PROB / 10) { > + alpha = > + (q->params.alpha * (MAX_PROB / PSCHED_TICKS_PER_SEC)) >> 13; > + beta = > + (q->params.beta * (MAX_PROB / PSCHED_TICKS_PER_SEC)) >> 13; > + } else if (q->vars.prob < MAX_PROB / 1) { > + alpha = > + (q->params.alpha * (MAX_PROB / PSCHED_TICKS_PER_SEC)) >> 11; > + beta = > + (q->params.beta * (MAX_PROB / PSCHED_TICKS_PER_SEC)) >> 11; > + } else if (q->vars.prob < MAX_PROB / 1000) { > + alpha = > + (q->params.alpha * (MAX_PROB / PSCHED_TICKS_PER_SEC)) >> 9; > + beta = > + (q->params.beta * (MAX_PROB / PSCHED_TICKS_PER_SEC)) >> 9; > + } else if (q->vars.prob < MAX_PROB / 100) { > alpha = > (q->params.alpha * (MAX_PROB / PSCHED_TICKS_PER_SEC)) >> 7; > beta = Seems like the if/else chain is getting long in the tail. Maybe a loop or table driven approach would be clearer.
Re: [PATCH net-next 7/8] net: sched: pie: add derandomization mechanism
On Wed, 31 Oct 2018 21:49:31 +0530 Leslie Monis wrote: > diff --git a/net/sched/sch_pie.c b/net/sched/sch_pie.c > index b68b367..88e605c 100644 > --- a/net/sched/sch_pie.c > +++ b/net/sched/sch_pie.c > @@ -58,6 +58,7 @@ struct pie_vars { > u32 avg_dq_rate;/* bytes per pschedtime tick,scaled */ > u32 qlen_old; /* in bytes */ > bool active;/* inactive/active */ > + u64 accu_prob; /* accumulated drop probability */ > }; > Although putting it at the end seems like a natural place, it creates holes in the C structure. My recommendation would be to put the new field after dq_tstamp.
Re: [PATCH net-next 0/8] net: sched: pie: align PIE implementation with RFC 8033
On Wed, 31 Oct 2018 21:49:24 +0530 Leslie Monis wrote: > The current implementation of PIE queueing discipline is according to an IETF > draft [http://tools.ietf.org/html/draft-pan-aqm-pie-00] and the paper > [PIE: A Lightweight Control Scheme to Address the Bufferbloat Problem]. > However, a lot of necessary modifications and enhancements have been proposed > in RFC 8033, which have not yet been incorporated in the source code of Linux > kernel. The following series of patches helps in achieving the same. > > This patch series includes: > > 1. Change the value of QUEUE_THRESHOLD > 2. Change the default value of pie_params->target > 3. Change the default value of pie_params->tupdate > 4. Change the initial value of pie_vars->burst_time > 5. Add more conditions to auto-tune alpha and beta > 6. Add mechanism to set PIE active/inactive > 7. Add a derandomization mechanism > 8. Update references > > Mohit P. Tahiliani (8): > net: sched: pie: change value of QUEUE_THRESHOLD > net: sched: pie: change default value of pie_params->target > net: sched: pie: change default value of pie_params->tupdate > net: sched: pie: change initial value of pie_vars->burst_time > net: sched: pie: add more conditions to auto-tune alpha and beta > net: sched: pie: add mechanism to set PIE active/inactive > net: sched: pie: add derandomization mechanism > net: sched: pie: update references > > net/sched/sch_pie.c | 77 > +++-- > 1 file changed, 63 insertions(+), 14 deletions(-) > Did you do performance tests? Often the RFC is out of date and the actual values are better than those in the standard.
[PATCH net-next 8/8] net: sched: pie: update references
From: "Mohit P. Tahiliani" RFC 8033 replaces the IETF draft for PIE. Signed-off-by: Mohit P. Tahiliani Signed-off-by: Dhaval Khandla Signed-off-by: Hrishikesh Hiraskar Signed-off-by: Manish Kumar B Signed-off-by: Sachin D. Patil Signed-off-by: Leslie Monis --- net/sched/sch_pie.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/sched/sch_pie.c b/net/sched/sch_pie.c index 88e605c..708e8ad 100644 --- a/net/sched/sch_pie.c +++ b/net/sched/sch_pie.c @@ -17,7 +17,7 @@ * University of Oslo, Norway. * * References: - * IETF draft submission: http://tools.ietf.org/html/draft-pan-aqm-pie-00 + * RFC 8033: https://tools.ietf.org/html/rfc8033 * IEEE Conference on High Performance Switching and Routing 2013 : * "PIE: A Lightweight Control Scheme to Address the Bufferbloat Problem" */ -- 2.7.4
[PATCH net-next 7/8] net: sched: pie: add derandomization mechanism
From: "Mohit P. Tahiliani" Random dropping of packets to achieve latency control may introduce outlier situations where packets are dropped too close to each other or too far from each other. This can cause the real drop percentage to temporarily deviate from the intended drop probability. In certain scenarios, such as a small number of simultaneous TCP flows, these deviations can significantly affect link utilization and queuing latency. RFC 8033 suggests using a derandomization mechanism to avoid these deviations. Signed-off-by: Mohit P. Tahiliani Signed-off-by: Dhaval Khandla Signed-off-by: Hrishikesh Hiraskar Signed-off-by: Manish Kumar B Signed-off-by: Sachin D. Patil Signed-off-by: Leslie Monis --- net/sched/sch_pie.c | 17 - 1 file changed, 16 insertions(+), 1 deletion(-) diff --git a/net/sched/sch_pie.c b/net/sched/sch_pie.c index b68b367..88e605c 100644 --- a/net/sched/sch_pie.c +++ b/net/sched/sch_pie.c @@ -58,6 +58,7 @@ struct pie_vars { u32 avg_dq_rate; /* bytes per pschedtime tick,scaled */ u32 qlen_old; /* in bytes */ bool active; /* inactive/active */ + u64 accu_prob; /* accumulated drop probability */ }; /* statistics gathering */ @@ -96,6 +97,7 @@ static void pie_vars_init(struct pie_vars *vars) /* default of 150 ms in pschedtime */ vars->burst_time = PSCHED_NS2TICKS(150 * NSEC_PER_MSEC); vars->active = true; + vars->accu_prob = 0; } static bool drop_early(struct Qdisc *sch, u32 packet_size) @@ -130,9 +132,21 @@ static bool drop_early(struct Qdisc *sch, u32 packet_size) else local_prob = q->vars.prob; + if (local_prob == 0) + q->vars.accu_prob = 0; + + q->vars.accu_prob += local_prob; + + if (q->vars.accu_prob < (MAX_PROB / 100) * 85) + return false; + if (q->vars.accu_prob >= ((u64)MAX_PROB * 17) / 2) + return true; + rnd = prandom_u32(); - if (rnd < local_prob) + if (rnd < local_prob) { + q->vars.accu_prob = 0; return true; + } return false; } @@ -181,6 +195,7 @@ static int pie_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch, out: 
q->stats.dropped++; + q->vars.accu_prob = 0; return qdisc_drop(skb, sch, to_free); } -- 2.7.4
[PATCH net-next 5/8] net: sched: pie: add more conditions to auto-tune alpha and beta
From: "Mohit P. Tahiliani" The update in drop probability depends on the parameters alpha and beta, which in turn reflect the current congestion level. However, the previous if-else cases were recommended when supported bandwidths were up to about 12 Mbps; current data links support much higher bandwidths, and the demand for bandwidth keeps growing. Hence, RFC 8033 suggests using more if-else cases for better fine-tuning of parameters alpha and beta in order to control the congestion as much as possible. Signed-off-by: Mohit P. Tahiliani Signed-off-by: Dhaval Khandla Signed-off-by: Hrishikesh Hiraskar Signed-off-by: Manish Kumar B Signed-off-by: Sachin D. Patil Signed-off-by: Leslie Monis --- net/sched/sch_pie.c | 26 +++--- 1 file changed, 23 insertions(+), 3 deletions(-) diff --git a/net/sched/sch_pie.c b/net/sched/sch_pie.c index f4e189a..c84e91e 100644 --- a/net/sched/sch_pie.c +++ b/net/sched/sch_pie.c @@ -343,10 +343,30 @@ static void calculate_probability(struct Qdisc *sch) * appropriately 2) scaling down by 16 to come to 0-2 range. * Please see paper for details. * -* We scale alpha and beta differently depending on whether we are in -* light, medium or high dropping mode. +* We scale alpha and beta differently depending on how heavy the +* congestion is. 
*/ - if (q->vars.prob < MAX_PROB / 100) { + if (q->vars.prob < MAX_PROB / 1000000) { + alpha = + (q->params.alpha * (MAX_PROB / PSCHED_TICKS_PER_SEC)) >> 15; + beta = + (q->params.beta * (MAX_PROB / PSCHED_TICKS_PER_SEC)) >> 15; + } else if (q->vars.prob < MAX_PROB / 100000) { + alpha = + (q->params.alpha * (MAX_PROB / PSCHED_TICKS_PER_SEC)) >> 13; + beta = + (q->params.beta * (MAX_PROB / PSCHED_TICKS_PER_SEC)) >> 13; + } else if (q->vars.prob < MAX_PROB / 10000) { + alpha = + (q->params.alpha * (MAX_PROB / PSCHED_TICKS_PER_SEC)) >> 11; + beta = + (q->params.beta * (MAX_PROB / PSCHED_TICKS_PER_SEC)) >> 11; + } else if (q->vars.prob < MAX_PROB / 1000) { + alpha = + (q->params.alpha * (MAX_PROB / PSCHED_TICKS_PER_SEC)) >> 9; + beta = + (q->params.beta * (MAX_PROB / PSCHED_TICKS_PER_SEC)) >> 9; + } else if (q->vars.prob < MAX_PROB / 100) { alpha = (q->params.alpha * (MAX_PROB / PSCHED_TICKS_PER_SEC)) >> 7; beta = -- 2.7.4
[PATCH net-next 6/8] net: sched: pie: add mechanism to set PIE active/inactive
From: "Mohit P. Tahiliani" To overcome unnecessary packet drops due to a spurious uptick in queuing latency caused by fluctuations in a network, PIE can choose to be active only when the queue occupancy is over a certain threshold. RFC 8033 suggests the value of this threshold be 1/3 of the tail drop threshold. PIE becomes inactive when the congestion ends i.e., when the drop probability reaches 0, and the current and previous latency samples are all below half of the target queue delay. Signed-off-by: Mohit P. Tahiliani Signed-off-by: Dhaval Khandla Signed-off-by: Hrishikesh Hiraskar Signed-off-by: Manish Kumar B Signed-off-by: Sachin D. Patil Signed-off-by: Leslie Monis --- net/sched/sch_pie.c | 22 ++ 1 file changed, 18 insertions(+), 4 deletions(-) diff --git a/net/sched/sch_pie.c b/net/sched/sch_pie.c index c84e91e..b68b367 100644 --- a/net/sched/sch_pie.c +++ b/net/sched/sch_pie.c @@ -57,6 +57,7 @@ struct pie_vars { psched_time_t dq_tstamp;/* drain rate */ u32 avg_dq_rate;/* bytes per pschedtime tick,scaled */ u32 qlen_old; /* in bytes */ + bool active;/* inactive/active */ }; /* statistics gathering */ @@ -94,6 +95,7 @@ static void pie_vars_init(struct pie_vars *vars) vars->avg_dq_rate = 0; /* default of 150 ms in pschedtime */ vars->burst_time = PSCHED_NS2TICKS(150 * NSEC_PER_MSEC); + vars->active = true; } static bool drop_early(struct Qdisc *sch, u32 packet_size) @@ -141,12 +143,23 @@ static int pie_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch, struct pie_sched_data *q = qdisc_priv(sch); bool enqueue = false; + if (!q->vars.active && qdisc_qlen(sch) >= sch->limit / 3) { + /* If the queue occupancy is over 1/3 of the tail drop +* threshold, turn on PIE. 
+*/ + pie_vars_init(&q->vars); + q->vars.prob = 0; + q->vars.qdelay_old = 0; + q->vars.dq_count = 0; + q->vars.dq_tstamp = psched_get_time(); + } + if (unlikely(qdisc_qlen(sch) >= sch->limit)) { q->stats.overlimit++; goto out; } - if (!drop_early(sch, skb->len)) { + if (!q->vars.active || !drop_early(sch, skb->len)) { enqueue = true; } else if (q->params.ecn && (q->vars.prob <= MAX_PROB / 10) && INET_ECN_set_ce(skb)) { @@ -431,7 +444,7 @@ static void calculate_probability(struct Qdisc *sch) q->vars.qdelay = qdelay; q->vars.qlen_old = qlen; - /* We restart the measurement cycle if the following conditions are met + /* We turn off PIE if the following conditions are met * 1. If the delay has been low for 2 consecutive Tupdate periods * 2. Calculated drop probability is zero * 3. We have atleast one estimate for the avg_dq_rate ie., @@ -441,7 +454,7 @@ static void calculate_probability(struct Qdisc *sch) (q->vars.qdelay_old < q->params.target / 2) && q->vars.prob == 0 && q->vars.avg_dq_rate > 0) - pie_vars_init(&q->vars); + q->vars.active = false; } static void pie_timer(struct timer_list *t) @@ -451,7 +464,8 @@ static void pie_timer(struct timer_list *t) spinlock_t *root_lock = qdisc_lock(qdisc_root_sleeping(sch)); spin_lock(root_lock); - calculate_probability(sch); + if (q->vars.active) + calculate_probability(sch); /* reset the timer to fire after 'tupdate'. tupdate is in jiffies. */ if (q->params.tupdate) -- 2.7.4
[PATCH net-next 4/8] net: sched: pie: change initial value of pie_vars->burst_time
From: "Mohit P. Tahiliani" RFC 8033 suggests an initial value of 150 milliseconds for the maximum time allowed for a burst of packets instead of 100 milliseconds. Signed-off-by: Mohit P. Tahiliani Signed-off-by: Dhaval Khandla Signed-off-by: Hrishikesh Hiraskar Signed-off-by: Manish Kumar B Signed-off-by: Sachin D. Patil Signed-off-by: Leslie Monis --- net/sched/sch_pie.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/sched/sch_pie.c b/net/sched/sch_pie.c index 9912d9cc..f4e189a 100644 --- a/net/sched/sch_pie.c +++ b/net/sched/sch_pie.c @@ -92,8 +92,8 @@ static void pie_vars_init(struct pie_vars *vars) { vars->dq_count = DQCOUNT_INVALID; vars->avg_dq_rate = 0; - /* default of 100 ms in pschedtime */ - vars->burst_time = PSCHED_NS2TICKS(100 * NSEC_PER_MSEC); + /* default of 150 ms in pschedtime */ + vars->burst_time = PSCHED_NS2TICKS(150 * NSEC_PER_MSEC); } static bool drop_early(struct Qdisc *sch, u32 packet_size) -- 2.7.4
[PATCH net-next 3/8] net: sched: pie: change default value of pie_params->tupdate
From: "Mohit P. Tahiliani" RFC 8033 suggests a default value of 15 milliseconds for the update interval instead of 30 milliseconds. Signed-off-by: Mohit P. Tahiliani Signed-off-by: Dhaval Khandla Signed-off-by: Hrishikesh Hiraskar Signed-off-by: Manish Kumar B Signed-off-by: Sachin D. Patil Signed-off-by: Leslie Monis --- net/sched/sch_pie.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/sched/sch_pie.c b/net/sched/sch_pie.c index c5d6d6b..9912d9cc 100644 --- a/net/sched/sch_pie.c +++ b/net/sched/sch_pie.c @@ -81,7 +81,7 @@ static void pie_params_init(struct pie_params *params) { params->alpha = 2; params->beta = 20; - params->tupdate = usecs_to_jiffies(30 * USEC_PER_MSEC); /* 30 ms */ + params->tupdate = usecs_to_jiffies(15 * USEC_PER_MSEC); /* 15 ms */ params->limit = 1000; /* default of 1000 packets */ params->target = PSCHED_NS2TICKS(15 * NSEC_PER_MSEC); /* 15 ms */ params->ecn = false; -- 2.7.4
[PATCH net-next 2/8] net: sched: pie: change default value of pie_params->target
From: "Mohit P. Tahiliani" RFC 8033 suggests a default value of 15 milliseconds for the target queue delay instead of 20 milliseconds. Signed-off-by: Mohit P. Tahiliani Signed-off-by: Dhaval Khandla Signed-off-by: Hrishikesh Hiraskar Signed-off-by: Manish Kumar B Signed-off-by: Sachin D. Patil Signed-off-by: Leslie Monis --- net/sched/sch_pie.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/sched/sch_pie.c b/net/sched/sch_pie.c index 56c9e4d..c5d6d6b 100644 --- a/net/sched/sch_pie.c +++ b/net/sched/sch_pie.c @@ -83,7 +83,7 @@ static void pie_params_init(struct pie_params *params) params->beta = 20; params->tupdate = usecs_to_jiffies(30 * USEC_PER_MSEC); /* 30 ms */ params->limit = 1000; /* default of 1000 packets */ - params->target = PSCHED_NS2TICKS(20 * NSEC_PER_MSEC); /* 20 ms */ + params->target = PSCHED_NS2TICKS(15 * NSEC_PER_MSEC); /* 15 ms */ params->ecn = false; params->bytemode = false; } -- 2.7.4
[PATCH net-next 1/8] net: sched: pie: change value of QUEUE_THRESHOLD
From: "Mohit P. Tahiliani" RFC 8033 recommends a value of 16384 bytes for the queue threshold instead of 10000 bytes. Signed-off-by: Mohit P. Tahiliani Signed-off-by: Dhaval Khandla Signed-off-by: Hrishikesh Hiraskar Signed-off-by: Manish Kumar B Signed-off-by: Sachin D. Patil Signed-off-by: Leslie Monis --- net/sched/sch_pie.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/sched/sch_pie.c b/net/sched/sch_pie.c index d142937..56c9e4d 100644 --- a/net/sched/sch_pie.c +++ b/net/sched/sch_pie.c @@ -31,7 +31,7 @@ #include #include -#define QUEUE_THRESHOLD 10000 +#define QUEUE_THRESHOLD 16384 /* 16KB i.e. 2^14 bytes */ #define DQCOUNT_INVALID -1 #define MAX_PROB 0xffffffff #define PIE_SCALE 8 -- 2.7.4
[PATCH net-next 0/8] net: sched: pie: align PIE implementation with RFC 8033
The current implementation of the PIE queueing discipline follows an IETF draft [http://tools.ietf.org/html/draft-pan-aqm-pie-00] and the paper [PIE: A Lightweight Control Scheme to Address the Bufferbloat Problem]. However, a lot of necessary modifications and enhancements have been proposed in RFC 8033, which have not yet been incorporated into the Linux kernel source. The following series of patches incorporates these changes. This patch series includes: 1. Change the value of QUEUE_THRESHOLD 2. Change the default value of pie_params->target 3. Change the default value of pie_params->tupdate 4. Change the initial value of pie_vars->burst_time 5. Add more conditions to auto-tune alpha and beta 6. Add mechanism to set PIE active/inactive 7. Add a derandomization mechanism 8. Update references Mohit P. Tahiliani (8): net: sched: pie: change value of QUEUE_THRESHOLD net: sched: pie: change default value of pie_params->target net: sched: pie: change default value of pie_params->tupdate net: sched: pie: change initial value of pie_vars->burst_time net: sched: pie: add more conditions to auto-tune alpha and beta net: sched: pie: add mechanism to set PIE active/inactive net: sched: pie: add derandomization mechanism net: sched: pie: update references net/sched/sch_pie.c | 77 +++-- 1 file changed, 63 insertions(+), 14 deletions(-) -- 2.7.4
[PATCH v2 net 2/3] net: do not abort bulk send on BQL status
Before calling dev_hard_start_xmit(), upper layers tried to cook an optimal skb list based on the BQL budget. The problem is that GSO packets can end up consuming more than the BQL budget. Breaking the loop is not useful, since requeued packets are ahead of any packets still in the qdisc. It is also more expensive, since the next TX completion will push these packets later, while the skbs are no longer in CPU caches. It is also a behavior difference with TSO packets, which can break the BQL limit by a large amount. Note that drivers should use __netdev_tx_sent_queue() in order to have optimal xmit_more support, and avoid useless atomic operations as shown in the following patch. Signed-off-by: Eric Dumazet --- net/core/dev.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/core/dev.c b/net/core/dev.c index 77d43ae2a7bbe1267f8430d5c35637d1984f463c..0ffcbdd55fa9ee545c807f2ed3fc178830e3075a 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -3272,7 +3272,7 @@ struct sk_buff *dev_hard_start_xmit(struct sk_buff *first, struct net_device *de } skb = next; - if (netif_xmit_stopped(txq) && skb) { + if (netif_tx_queue_stopped(txq) && skb) { rc = NETDEV_TX_BUSY; break; } -- 2.19.1.930.g4563a0d9d0-goog
[PATCH v2 net 3/3] net/mlx4_en: use __netdev_tx_sent_queue()
The doorbell now only depends on xmit_more and netif_tx_queue_stopped(). Using __netdev_tx_sent_queue() avoids messing with the BQL stop flag, and is more generic. This patch increases performance on GSO workloads by keeping doorbells to the minimum required. Signed-off-by: Eric Dumazet Cc: Tariq Toukan --- drivers/net/ethernet/mellanox/mlx4/en_tx.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c index 1857ee0f0871d48285a6d3711f7c3e9a1e08a05f..6f5153afcab4dfc331c099da854c54f1b9500887 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c @@ -1006,7 +1006,6 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev) ring->packets++; } ring->bytes += tx_info->nr_bytes; - netdev_tx_sent_queue(ring->tx_queue, tx_info->nr_bytes); AVG_PERF_COUNTER(priv->pstats.tx_pktsz_avg, skb->len); if (tx_info->inl) @@ -1044,7 +1043,10 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev) netif_tx_stop_queue(ring->tx_queue); ring->queue_stopped++; } - send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue); + + send_doorbell = __netdev_tx_sent_queue(ring->tx_queue, + tx_info->nr_bytes, + skb->xmit_more); real_size = (real_size / 16) & 0x3f; -- 2.19.1.930.g4563a0d9d0-goog
[PATCH v2 net 1/3] net: bql: add __netdev_tx_sent_queue()
When qdisc_run() tries to use the BQL budget to bulk-dequeue a batch of packets, GSO can later transform this list into another list of skbs, and each skb is sent to the device ndo_start_xmit(), one at a time, with skb->xmit_more set for all but the last skb. The problem is that, very often, the BQL limit is hit in the middle of the packet train, forcing dev_hard_start_xmit() to stop the bulk send and requeue the end of the list. BQL's role is to avoid head-of-line blocking, making sure a qdisc can deliver high priority packets before low priority ones. But there is no way requeued packets can be bypassed by fresh packets in the qdisc. Aborting the bulk send increases TX softirqs, and hot cache lines (after skb_segment()) are wasted. Note that for TSO packets, we never split a packet in the middle because of the BQL limit being hit. Drivers should be able to update BQL counters without flipping/caring about BQL status, if the current skb has xmit_more set. Upper layers are ultimately responsible for stopping the next packet train when the BQL limit is hit. A code template in a driver might look like the following: send_doorbell = __netdev_tx_sent_queue(tx_queue, nr_bytes, skb->xmit_more); Note that __netdev_tx_sent_queue() use is not mandatory, since the following patch will change dev_hard_start_xmit() to not care about BQL status. 
But it is highly recommended, so that the full benefits of xmit_more can be reached (fewer doorbells sent, and fewer atomic operations as well). Signed-off-by: Eric Dumazet --- include/linux/netdevice.h | 20 1 file changed, 20 insertions(+) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index dc1d9ed33b3192e9406b17c3107b3235b28ff1b9..857f8abf7b91bc79731873fc8f68e31f6bff4d03 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -3190,6 +3190,26 @@ static inline void netdev_tx_sent_queue(struct netdev_queue *dev_queue, #endif } +/* Variant of netdev_tx_sent_queue() for drivers that are aware + * that they should not test BQL status themselves. + * We do want to change __QUEUE_STATE_STACK_XOFF only for the last + * skb of a batch. + * Returns true if the doorbell must be used to kick the NIC. + */ +static inline bool __netdev_tx_sent_queue(struct netdev_queue *dev_queue, + unsigned int bytes, + bool xmit_more) +{ + if (xmit_more) { +#ifdef CONFIG_BQL + dql_queued(&dev_queue->dql, bytes); +#endif + return netif_tx_queue_stopped(dev_queue); + } + netdev_tx_sent_queue(dev_queue, bytes); + return true; +} + /** * netdev_sent_queue - report the number of bytes queued to hardware * @dev: network device -- 2.19.1.930.g4563a0d9d0-goog
[PATCH v2 net 0/3] net: bql: better deal with GSO
While BQL bulk dequeue works well for TSO packets, it is not very efficient as soon as GSO is involved. On a GSO-only workload (UDP or TCP), this patch series can save about 8% of CPU cycles on a 40Gbit mlx4 NIC, by keeping optimal batching, and avoiding expensive doorbells, qdisc requeues and reschedules. This patch series: - Add __netdev_tx_sent_queue() so that drivers can implement efficient BQL and xmit_more support. - Implement a workaround in dev_hard_start_xmit() for drivers not using __netdev_tx_sent_queue() - Change mlx4 to use __netdev_tx_sent_queue() v2: Tariq and Willem feedback addressed; added __netdev_tx_sent_queue() (Willem's suggestion) Eric Dumazet (3): net: bql: add __netdev_tx_sent_queue() net: do not abort bulk send on BQL status net/mlx4_en: use __netdev_tx_sent_queue() drivers/net/ethernet/mellanox/mlx4/en_tx.c | 6 -- include/linux/netdevice.h | 20 net/core/dev.c | 2 +- 3 files changed, 25 insertions(+), 3 deletions(-) -- 2.19.1.930.g4563a0d9d0-goog
Re: Ethernet on my CycloneV broke since 4.9.124
Hi Dinh, On Wed, 31 Oct 2018 at 15:42, Dinh Nguyen wrote: > > Hi Clement, > > On 10/31/2018 08:01 AM, Clément Péron wrote: > > Hi, > > > > The patch "net: stmmac: socfpga: add additional ocp reset line for > > Stratix10" introduced in 4.9.124 broke the Ethernet on my CycloneV > > board. > > > > When I boot I have this issue: > > > > socfpga-dwmac ff702000.ethernet: error getting reset control of ocp -2 > > socfpga-dwmac: probe of ff702000.ethernet failed with error -2 > > > > Reverting commit 6f37f7b62baa6a71d7f3f298acb64de51275e724 fixes the issue. > > > > Are you sure? I just booted v4.9.124 and did not see any errors. The > error should not appear because the commit is using > devm_reset_control_get_optional(). I'm actually booting 4.9.130. I agree with you that devm_reset_control_get_optional() should not fail, but checking other usages of this helper: https://elixir.bootlin.com/linux/v4.9.135/source/drivers/i2c/busses/i2c-mv64xxx.c#L824 https://elixir.bootlin.com/linux/v4.9.135/source/drivers/crypto/sunxi-ss/sun4i-ss-core.c#L259 shows that they don't check for errors except for PROBE_DEFER. Thanks, Clement > > Dinh
Re: [PATCH iproute2] bpf: check map symbol type properly with newer llvm compiler
On Mon, 29 Oct 2018 15:32:03 -0700 Yonghong Song wrote: > With llvm 7.0 or earlier, the map symbol type is STT_NOTYPE. > -bash-4.4$ cat t.c > __attribute__((section("maps"))) int g; > -bash-4.4$ clang -target bpf -O2 -c t.c > -bash-4.4$ readelf -s t.o > > Symbol table '.symtab' contains 2 entries: > Num: Value Size Type Bind Vis Ndx Name > 0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND > 1: 0000000000000000 0 NOTYPE GLOBAL DEFAULT 3 g > -bash-4.4$ > > The following llvm commit enables the BPF target to generate > proper symbol type and size. > commit bf6ec206615b9718869d48b4e5400d0c6e3638dd > Author: Yonghong Song > Date: Wed Sep 19 16:04:13 2018 +0000 > > [bpf] Symbol sizes and types in object file > > Clang-compiled object files currently don't include the symbol sizes and > types. Some tools however need that information. For example, > ctfconvert > uses that information to generate FreeBSD's CTF representation from ELF > files. > With this patch, symbol sizes and types are included in object files. > > Signed-off-by: Paul Chaignon > Reported-by: Yutaro Hayakawa > > Hence, for llvm 8.0.0 (currently trunk), the symbol type will not be NOTYPE, but > OBJECT. > -bash-4.4$ clang -target bpf -O2 -c t.c > -bash-4.4$ readelf -s t.o > > Symbol table '.symtab' contains 3 entries: > Num: Value Size Type Bind Vis Ndx Name > 0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND > 1: 0000000000000000 0 FILE LOCAL DEFAULT ABS t.c > 2: 0000000000000000 4 OBJECT GLOBAL DEFAULT 3 g > -bash-4.4$ > > This patch makes sure the bpf library accepts both NOTYPE and OBJECT types > of global map symbols. > > Signed-off-by: Yonghong Song Looks good, applied.
Re: [PATCH iproute] ss: Actually print left delimiter for columns
On Mon, 29 Oct 2018 23:04:25 +0100 Stefano Brivio wrote: > While rendering columns, we use a local variable to keep track of the > field currently being printed, without touching current_field, which is > used for buffering. > > Use the right pointer to access the left delimiter for the current column, > instead of always printing the left delimiter for the last buffered field, > which is usually an empty string. > > This fixes an issue especially visible on narrow terminals, where some > columns might be displayed without separation. > > Reported-by: YoyPa > Fixes: 691bd854bf4a ("ss: Buffer raw fields first, then render them as a > table") > Signed-off-by: Stefano Brivio > Tested-by: YoyPa Looks good, applied.
Re: [PATCH 3/4] net: macb: Add pm runtime support
Hi Andrew, On Wed, Oct 31, 2018 at 8:24 PM Andrew Lunn wrote: > > On Wed, Oct 31, 2018 at 09:10:22AM +0530, Harini Katakam wrote: > > From: Harini Katakam > > > > Add runtime pm functions and move clock handling there. > > If device is suspended and not a wake device, then return from > > mdio read/write functions without performing any action because > > the clocks are not active. > > > > Signed-off-by: Shubhrajyoti Datta > > Signed-off-by: Harini Katakam > > --- > > Changes from RFC: > > Updated pm get sync/put sync calls. > > Removed unnecessary clk up in mdio helpers. > > This last bit has me worried. > > The MDIO bus is a shared bus with a life of its own. You can have > multiple PHYs and switches on it. The PHYs can be for a different > Ethernet MAC. Switch drivers will expect to be able to address the > switch when the interface is down. > > The FEC driver did something similar for a while. I had to make MDIO > read/write runtime PM aware, otherwise i could not access the Ethernet > switch hanging off its MDIO bus. Yes, I understand. But I thought I'd handle it when we have a separate MDIO bus driver. As of now, with macb, I do not see a way, for example, for MAC1 to access MDIO0 when MAC0 is suspended. However, I will work a bit more on this solution. With the clk up, I've noticed that there is at least one PHY status poll that ends up bringing the clock up after the MAC has been suspended. Of course, the PHY is suspended immediately after that, but it just delays the full power-down process. Regards, Harini
Re: [PATCH iproute2] Use libbsd for strlcpy if available
On Mon, 29 Oct 2018 10:46:50 +0000 Luca Boccassi wrote: > If libc does not provide strlcpy, check for libbsd with pkg-config to > avoid relying on the inline version. > > Signed-off-by: Luca Boccassi > --- > This allows distro maintainers to be able to choose to reduce > duplication and let this code be maintained in one place, in the > external library. > I like the idea, but it causes warnings on Debian testing, and maybe other distros. ipnetns.c:2: warning: "_ATFILE_SOURCE" redefined #define _ATFILE_SOURCE In file included from /usr/include/x86_64-linux-gnu/bits/libc-header-start.h:33, from /usr/include/string.h:26, from /usr/include/bsd/string.h:30, from : /usr/include/features.h:326: note: this is the location of the previous definition # define _ATFILE_SOURCE 1 Please figure out how to handle this and resubmit. The SUSE open build service might also work to test multiple distros.
Re: [RFC PATCH 4/4] ixgbe: add support for extended PHC gettime
On Wed, Oct 31, 2018 at 07:40:03AM -0700, Richard Cochran wrote: > On Mon, Oct 29, 2018 at 02:31:09PM +0100, Miroslav Lichvar wrote: > > I think there could be a flag in ptp_system_timestamp, or a parameter > > of gettimex64(), which would enable/disable reading of the system > > clock. > > I'm not a fan of functions that change their behavior based on flags > in their input parameters. How about separating the PHC timestamp from the ptp_system_timestamp structure and use NULL to indicate we don't want to read the system clock? A gettimex64(ptp, ts, NULL) call would be equal to gettime64(ptp, ts). struct ptp_system_timestamp { struct timespec64 pre_ts; struct timespec64 post_ts; }; int (*gettimex64)(struct ptp_clock_info *ptp, struct timespec64 *ts, struct ptp_system_timestamp *sts); -- Miroslav Lichvar
Re: Ethernet on my CycloneV broke since 4.9.124
Hi Clement, On 10/31/2018 08:01 AM, Clément Péron wrote: > Hi, > > The patch "net: stmmac: socfpga: add additional ocp reset line for > Stratix10" introduce in 4.9.124 broke the ethernet on my CycloneV > board. > > When I boot i have this issue : > > socfpga-dwmac ff702000.ethernet: error getting reset control of ocp -2 > socfpga-dwmac: probe of ff702000.ethernet failed with error -2 > > Reverting the commit : 6f37f7b62baa6a71d7f3f298acb64de51275e724 fix the issue. > Are you sure? I just booted v4.9.124 and did not see any errors. The error should not appear because the commit is using devm_reset_control_get_optional(). Dinh
Re: [RFC PATCH 4/4] ixgbe: add support for extended PHC gettime
On Mon, Oct 29, 2018 at 02:31:09PM +0100, Miroslav Lichvar wrote: > I think there could be a flag in ptp_system_timestamp, or a parameter > of gettimex64(), which would enable/disable reading of the system > clock. I'm not a fan of functions that change their behavior based on flags in their input parameters. Thanks, Richard
libbpf build failure on debian:9 with clang
17 40.66 debian:9 : FAIL gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516 The failure was with clang though: clang version 3.8.1-24 (tags/RELEASE_381/final) With: gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1) it built without any warnings/errors. CC /tmp/build/perf/libbpf.o libbpf.c:2201:36: error: comparison of constant -22 with expression of type 'const enum bpf_attach_type' is always false [-Werror,-Wtautological-constant-out-of-range-compare] if (section_names[i].attach_type == -EINVAL) ^ ~~~ 1 error generated. CC /tmp/build/perf/help.o mv: cannot stat '/tmp/build/perf/.libbpf.o.tmp': No such file or directory /git/linux/tools/build/Makefile.build:96: recipe for target '/tmp/build/perf/libbpf.o' failed make[4]: *** [/tmp/build/perf/libbpf.o] Error 1 This is the cset: commit 956b620fcf0b64de403cd26a56bc41e6e4826ea6 Author: Andrey Ignatov Date: Wed Sep 26 15:24:53 2018 -0700 libbpf: Introduce libbpf_attach_type_by_name Tests are continuing, so far:
 1 43.53 alpine:3.4 : Ok gcc (Alpine 5.3.0) 5.3.0
 2 58.62 alpine:3.5 : Ok gcc (Alpine 6.2.1) 6.2.1 20160822
 3 51.62 alpine:3.6 : Ok gcc (Alpine 6.3.0) 6.3.0
 4 51.68 alpine:3.7 : Ok gcc (Alpine 6.4.0) 6.4.0
 5 49.38 alpine:3.8 : Ok gcc (Alpine 6.4.0) 6.4.0
 6 79.07 alpine:edge : Ok gcc (Alpine 6.4.0) 6.4.0
 7 63.35 amazonlinux:1 : Ok gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28)
 8 59.65 amazonlinux:2 : Ok gcc (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5)
 9 47.39 android-ndk:r12b-arm : Ok arm-linux-androideabi-gcc (GCC) 4.9.x 20150123 (prerelease)
 10 50.64 android-ndk:r15c-arm : Ok arm-linux-androideabi-gcc (GCC) 4.9.x 20150123 (prerelease)
 11 28.75 centos:5 : Ok gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-55)
 12 33.26 centos:6 : Ok gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-23)
 13 43.16 centos:7 : Ok gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28)
 14 73.61 clearlinux:latest : FAIL gcc (Clear Linux OS for Intel Architecture) 8.2.1 20180502
 15 45.56 debian:7 : Ok gcc (Debian 4.7.2-5) 4.7.2
 16 45.53 debian:8 : Ok gcc (Debian 4.9.2-10+deb8u1) 4.9.2
 17 40.66 debian:9 :
FAIL gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
 18 113.19 debian:experimental : Ok gcc (Debian 8.2.0-8) 8.2.0
 19 41.48 debian:experimental-x-arm64 : Ok aarch64-linux-gnu-gcc (Debian 8.2.0-7) 8.2.0
 20 41.51 debian:experimental-x-mips : Ok mips-linux-gnu-gcc (Debian 8.2.0-7) 8.2.0
 21 40.09 debian:experimental-x-mips64 : Ok mips64-linux-gnuabi64-gcc (Debian 8.1.0-12) 8.1.0
 22 42.17 debian:experimental-x-mipsel : Ok mipsel-linux-gnu-gcc (Debian 8.2.0-7) 8.2.0
 23 40.02 fedora:20 : Ok gcc (GCC) 4.8.3 20140911 (Red Hat 4.8.3-7)
 24 45.47 fedora:21 : Ok gcc (GCC) 4.9.2 20150212 (Red Hat 4.9.2-6)
 25 41.64 fedora:22 : Ok gcc (GCC) 5.3.1 20160406 (Red Hat 5.3.1-6)
 26 43.60 fedora:23 : Ok gcc (GCC) 5.3.1 20160406 (Red Hat 5.3.1-6)
 27 44.04 fedora:24 : Ok gcc (GCC) 6.3.1 20161221 (Red Hat 6.3.1-1)
 28 37.21 fedora:24-x-ARC-uClibc : Ok arc-linux-gcc (ARCompact ISA Linux uClibc toolchain 2017.09-rc2) 7.1.1 20170710
The problem with clearlinux is unrelated: clang-7: error: unknown argument: '-fno-semantic-interposition' clang-7: error: unsupported argument '4' to option 'flto=' clang-7: error: optimization flag '-ffat-lto-objects' is not supported [-Werror,-Wignored-optimization-argument]
Re: [PATCH net 3/3] net/mlx4_en: use netdev_tx_sent_queue_more()
On Wed, Oct 31, 2018 at 4:37 AM Tariq Toukan wrote:
>
> On 30/10/2018 1:25 AM, Eric Dumazet wrote:
> > This patch has two changes :
> >
> > 1) Use netdev_tx_sent_queue_more() for skbs with xmit_more
> >    This avoids mangling BQL status, since we only need to
> >    take care of it for the last skb of the batch.
> >
> > 2) doorbell only depends on xmit_more and netif_tx_queue_stopped()
> >
> >    While not strictly necessary after 1), it is more consistent
> >    this way.
> >
> > Signed-off-by: Eric Dumazet
> > Cc: Tariq Toukan
> > ---
> >  drivers/net/ethernet/mellanox/mlx4/en_tx.c | 10 ++++++++--
> >  1 file changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > index 1857ee0f0871d48285a6d3711f7c3e9a1e08a05f..3acce02ade6a115881ecd72e4710e332d3f380cb 100644
> > --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > @@ -1006,7 +1006,6 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
> >  		ring->packets++;
> >  	}
> >  	ring->bytes += tx_info->nr_bytes;
> > -	netdev_tx_sent_queue(ring->tx_queue, tx_info->nr_bytes);
> >  	AVG_PERF_COUNTER(priv->pstats.tx_pktsz_avg, skb->len);
> >
> >  	if (tx_info->inl)
> > @@ -1044,7 +1043,14 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
> >  		netif_tx_stop_queue(ring->tx_queue);
> >  		ring->queue_stopped++;
> >  	}
> > -	send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue);
> > +
> > +	if (skb->xmit_more) {
> > +		netdev_tx_sent_queue_more(ring->tx_queue, tx_info->nr_bytes);
> > +		send_doorbell = netif_tx_queue_stopped(ring->tx_queue);
> > +	} else {
> > +		netdev_tx_sent_queue(ring->tx_queue, tx_info->nr_bytes);
> > +		send_doorbell = true;
> > +	}
> >
> >  	real_size = (real_size / 16) & 0x3f;
> >
>
> The drivers' code template would be nicer if we unified the two functions netdev_tx_sent_queue()/netdev_tx_sent_queue_more() into a single one with a parameter.
>
> Currently, every driver that wants to benefit from this optimization has to repeat these if/else blocks.

I can add a helper, sure, but I cannot change drivers that I am not able to test, so I cannot change the existing helper.

This patch series shows the problem and fixes one driver; a common helper can be added when a second driver is updated. There is no hurry.
Re: [PATCH net 1/3] net: bql: add netdev_tx_sent_queue_more() helper
On Wed, Oct 31, 2018 at 4:30 AM Tariq Toukan wrote:
>
> On 30/10/2018 1:25 AM, Eric Dumazet wrote:
> > When qdisc_run() tries to use BQL budget to bulk-dequeue a batch
> > of packets, GSO can later transform this list in another list
> > of skbs, and each skb is sent to device ndo_start_xmit(),
> > one at a time, with skb->xmit_more being set to one but
> > for the last skb.
> >
> > Problem is that very often, BQL limit is hit in the middle of
> > the packet train, forcing dev_hard_start_xmit() to stop the
> > bulk send and requeue the end of the list.
> >
> > BQL's role is to avoid head-of-line blocking, making sure
> > a qdisc can deliver high-priority packets before low-priority ones.
> >
> > But there is no way requeued packets can be bypassed by fresh
> > packets in the qdisc.
> >
> > Aborting the bulk send increases TX softirqs, and hot cache
> > lines (after skb_segment()) are wasted.
> >
> > Note that for TSO packets, we never split a packet in the middle
> > because of the BQL limit being hit.
> >
> > Drivers should be able to update BQL counters without
> > flipping/caring about BQL status, if the current skb
> > has xmit_more set.
> >
> > Upper layers are ultimately responsible to stop sending another
> > packet train when the BQL limit is hit.
> >
> > Code template in a driver might look like the following :
> >
> >	if (skb->xmit_more) {
> >		netdev_tx_sent_queue_more(tx_queue, nr_bytes);
> >		send_doorbell = netif_tx_queue_stopped(tx_queue);
> >	} else {
> >		netdev_tx_sent_queue(tx_queue, nr_bytes);
> >		send_doorbell = true;
> >	}
>
> Hi Eric,
> Nice optimization.
>
> I thought of another way of implementing it, by just extending the
> existing netdev_tx_sent_queue() function with a new xmit_more parameter
> that the driver passes.
> Something like this:
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index d837dad24b4c..feb9cbcb5759 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -3129,12 +3129,12 @@ static inline void netdev_txq_bql_complete_prefetchw(struct netdev_queue *dev_qu
>  }
>
>  static inline void netdev_tx_sent_queue(struct netdev_queue *dev_queue,
> -					unsigned int bytes)
> +					unsigned int bytes, bool more)
>  {
>  #ifdef CONFIG_BQL
>  	dql_queued(&dev_queue->dql, bytes);
>
> -	if (likely(dql_avail(&dev_queue->dql) >= 0))
> +	if (more || likely(dql_avail(&dev_queue->dql) >= 0))
>  		return;
>
>  	set_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state);
>
> This unifies and simplifies both the stack and driver code, as the new
> suggested function netdev_tx_sent_queue_more() can become a private case
> of the existing one.

Yes, I thought of this, but it is too risky/invasive, and I prefer adding an opt-in mechanism so that patch authors can test their patches.

Then, when/if all drivers are updated, we can remove the legacy stuff or rename everything.

mlx4 was the only NIC I could reasonably test myself.

Another suggestion from Willem is to add the following helper, returning a boolean (doorbell needed):

+static inline bool __netdev_tx_sent_queue(struct netdev_queue *dev_queue,
+					  unsigned int bytes, bool xmit_more)
+{
+	if (xmit_more) {
+		netdev_tx_sent_queue_more(dev_queue, bytes);
+		return netif_tx_queue_stopped(dev_queue);
+	} else {
+		netdev_tx_sent_queue(dev_queue, bytes);
+		return true;
+	}
+}
+
Re: [PATCH net] net/mlx4_en: add a missing include
On Tue, 2018-10-30 at 00:18 -0700, Eric Dumazet wrote:
> Abdul Haleem reported a build error on ppc :
>
> drivers/net/ethernet/mellanox/mlx4/en_rx.c:582:18: warning: 'struct iphdr' declared inside parameter list [enabled by default]
>     struct iphdr *iph)
>                  ^
> drivers/net/ethernet/mellanox/mlx4/en_rx.c:582:18: warning: its scope is only this definition or declaration, which is probably not what you want [enabled by default]
> drivers/net/ethernet/mellanox/mlx4/en_rx.c: In function get_fixed_ipv4_csum:
> drivers/net/ethernet/mellanox/mlx4/en_rx.c:586:20: error: dereferencing pointer to incomplete type
>     __u8 ipproto = iph->protocol;
>                       ^
>
> Fixes: 55469bc6b577 ("drivers: net: remove inclusion when not needed")
> Signed-off-by: Eric Dumazet
> Reported-by: Abdul Haleem
> ---
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 5a6d0919533d6e0e619927abd753c5d07ed95dac..db00bf1c23f5ad31d64652ddc8bee32e2e7534c8 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -43,6 +43,7 @@
>  #include
>  #include
>
> +#include
>  #if IS_ENABLED(CONFIG_IPV6)
>  #include
>  #endif

Tested-by: Abdul Haleem

--
Regards,
Abdul Haleem
IBM Linux Technology Centre
Re: [PATCH v4.14-stable] sch_netem: restore skb->dev after dequeuing from the rbtree
On Tue, Oct 30, 2018 at 12:12:51PM -0700, Eduardo Valentin wrote:
> Greg,
>
> On Thu, Oct 18, 2018 at 03:43:48PM -0700, David Miller wrote:
> > From: Christoph Paasch
> > Date: Thu, 18 Oct 2018 13:38:40 -0700
> >
> > > Upstream commit bffa72cf7f9d ("net: sk_buff rbnode reorg") got
> > > backported as commit 6b921536f170 ("net: sk_buff rbnode reorg") into the
> > > v4.14.x tree.
> > >
> > > However, the backport does not include the changes in sch_netem.c.
> > > We need these, as otherwise the skb->dev pointer is not set when
> > > dequeueing from the netem rbtree, resulting in a panic:
> > ...
> > > Fixes: 6b921536f170 ("net: sk_buff rbnode reorg")
> > > Cc: Stephen Hemminger
> > > Cc: Eric Dumazet
> > > Cc: Soheil Hassas Yeganeh
> > > Cc: Wei Wang
> > > Cc: Willem de Bruijn
> > > Signed-off-by: Christoph Paasch
> > > ---
> > >
> > > Notes:
> > >     This patch should only make it into v4.14-stable, as that's the only branch where
> > >     the offending commit has been backported to.
> >
> > Greg, please queue up.
>
> Are you planning to queue this one? It looks to me like it was a miss in the backport: the backport touched different files and missed the change to net/sched/sch_netem.c.
>
> So, to me, even if this patch may not strictly follow the rules of stable, as it is not a patch in upstream, it seems to be a needed change, even if it is specific to stable linux-4.14.y.

I've queued this patch for 4.14.

--
Thanks,
Sasha
Ethernet on my CycloneV broke since 4.9.124
Hi,

The patch "net: stmmac: socfpga: add additional ocp reset line for Stratix10", introduced in 4.9.124, broke the ethernet on my CycloneV board.

When I boot, I get this issue:

socfpga-dwmac ff702000.ethernet: error getting reset control of ocp -2
socfpga-dwmac: probe of ff702000.ethernet failed with error -2

Reverting commit 6f37f7b62baa6a71d7f3f298acb64de51275e724 fixes the issue.

Thanks,
Clement
Re: [PATCH net] net: sched: Remove TCA_OPTIONS from policy
> On 26 October 2018 at 20:19, Cong Wang wrote:
>
> On Fri, Oct 26, 2018 at 4:35 AM Marco Berizzi wrote:
> >
> > Apologies for bothering you again.
> > I applied your patch to 4.19, but after issuing this
> > command:
> >
> > root@Calimero:~# tc qdisc add dev eth0 root handle 1:0 hfsc default 1
> > root@Calimero:~# ping 10.81.104.1
> > PING 10.81.104.1 (10.81.104.1) 56(84) bytes of data.
> > ^C
> > --- 10.81.104.1 ping statistics ---
> > 2 packets transmitted, 0 received, 100% packet loss, time 1001ms
> >
> > I'm losing ipv4 connectivity.
> > If I remove the qdisc, everything works again.
>
> Did this really work before?
>
> You specified a default class without adding it, so the packets are dropped.
>
> How would you expect this to work? :)

:-) yes indeed. Apologies for the noise.
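For reference, a minimal sketch of a working setup: with hfsc, "default N" only names the class that unclassified traffic falls into; that class must be created explicitly or packets are dropped. The interface name and rate below are placeholders, and these commands require root:

```shell
# hfsc root qdisc: unclassified traffic goes to class 1:1
tc qdisc add dev eth0 root handle 1: hfsc default 1

# the default class must actually exist, otherwise packets are dropped
tc class add dev eth0 parent 1: classid 1:1 hfsc sc rate 100mbit ul rate 100mbit
```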