Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On Wed, 7 Sep 2016 20:21:24 -0700, Tom Herbert wrote:
> On Wed, Sep 7, 2016 at 7:58 PM, John Fastabend wrote:
> > On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote:
> >>
> >> On Wed, 7 Sep 2016 19:57:19 +0300, Saeed Mahameed wrote:
> >>> On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet wrote:
> >>>> On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:
> >>>>> On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet wrote:
> >>>>>> On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:
> >>>>>> [...]
> >>>>
> >>>> Only if a qdisc is present and pressure is high enough.
> >>>>
> >>>> But in a forwarding setup, we likely receive at a lower rate than the
> >>>> NIC can transmit.
> >>
> >> Yes, I can confirm this happens in my experiments.
> >>
> >>> Jesper has a similar idea to make the qdisc think it is under
> >>> pressure when the device TX ring is idle most of the time; I think
> >>> his idea can come in handy here. I am not fully involved in the
> >>> details, maybe he can elaborate more.
> >>>
> >>> But if it works, it will be transparent to napi, and xmit_more will
> >>> happen by design.
> >>
> >> Yes. I have some ideas around getting more bulking going from the qdisc
> >> layer, by having the drivers provide some feedback to the qdisc layer
> >> indicating xmit_more should be possible. This will be a topic at the
> >> Network Performance Workshop[1] at NetDev 1.2, where I will hopefully
> >> challenge people to come up with a good solution ;-)
>
> > One thing I've noticed but haven't yet actually analyzed much is if
> > I shrink the nic descriptor ring size to only be slightly larger than
> > the qdisc layer bulking size I get more bulking and better perf numbers.
> > At least on microbenchmarks. The reason being the nic pushes back more
> > on the qdisc. So maybe a case for making the ring size in the NIC some
> > factor of the expected number of queues feeding the descriptor ring.
I've also played with shrinking the NIC descriptor ring size. It works, but it is an ugly hack to get the NIC to push back, and I foresee it will hurt normal use-cases. (There are other reasons for shrinking the ring size, like cache usage, but that is unrelated to this.)

> BQL is not helping with that?

Exactly. But the BQL _byte_ limit is not what is needed; what we need to know is the number of _packets_ currently "in-flight". Which Tom already has a patch for :-)

Once we have that, the algorithm is simple: the qdisc dequeue looks at the BQL pkts-in-flight, and if the driver has "enough" packets in-flight, the qdisc starts its bulk-dequeue building phase before calling the driver. The allowed max qdisc bulk size should likely be related to pkts-in-flight.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
Re: [RFC PATCH 1/2] macb: Add 1588 support in Cadence GEM.
Hi,

On Tue, Sep 6, 2016 at 9:18 PM, Richard Cochran wrote:
>
>> +#define GEM_TISUBN 0x01bc /* 1588 Timer Increment Sub-ns */
>
> This register does not exist. Looking at
>
>    Zynq-7000 AP SoC Technical Reference Manual
>    UG585 (v1.10) February 23, 2015
>
> starting on page 1273 we see:
>
>    udp_csum_errors  0x01B0  32  ro     0x  UDP checksum error
>    timer_strobe_s   0x01C8  32  rw     0x  1588 timer sync strobe seconds
>    timer_strobe_ns  0x01CC  32  mixed  0x  1588 timer sync strobe nanoseconds
>    timer_s          0x01D0  32  rw     0x  1588 timer seconds
>    timer_ns         0x01D4  32  mixed  0x  1588 timer nanoseconds
>    timer_adjust     0x01D8  32  mixed  0x  1588 timer adjust
>    timer_incr       0x01DC  32  mixed  0x  1588 timer increment
>
> There is no register at 0x1BC.
>
>> +#define GEM_TSH 0x01c0 /* 1588 Timer Seconds High */
>
> This one doesn't exist either. What is going on here?

I can't be sure of the version of Cadence GEM used in SAMA5D2, but these registers (the sub-ns increments at least) only exist in the IP version used in the Zynq UltraScale+ MPSoC.

>> + /* get GEM internal time */
>> + sech = gem_readl(bp, TSH);
>> + secl = gem_readl(bp, TSL);
>
> Does reading TSH latch the time? The TRM is silent about that, and
> most other designs latch on reading the LSB.

No, it does not latch the time. When doing a read + adjust + write, this means there is room for some error. When writing, though, the write to the MSB and LSB registers was made atomic. This bug fix came only in the most recent version of the IP, I'm afraid.

Regards,
Harini
[PATCH net-next] ipv4: accept u8 in IP_TOS ancillary data
From: Eric Dumazet

In commit f02db315b8d8 ("ipv4: IP_TOS and IP_TTL can be specified as ancillary data") Francesco added IP_TOS values specified as integer.

However, the kernel sends to userspace (at recvmsg() time) an IP_TOS value in a single byte, when IP_RECVTOS is set on the socket.

It can be very useful to reflect all ancillary options as given by the kernel in a subsequent sendmsg(), instead of aborting the sendmsg() with EINVAL after Francesco's patch.

So this patch extends the IP_TOS ancillary data to also accept a u8, so that a UDP server can simply reuse the same ancillary block without having to mangle it.

Jesper can then augment https://github.com/netoptimizer/network-testing/blob/master/src/udp_example02.c to add TOS reflection ;)

Fixes: f02db315b8d8 ("ipv4: IP_TOS and IP_TTL can be specified as ancillary data")
Signed-off-by: Eric Dumazet
Cc: Francesco Fusco
Cc: Jesper Dangaard Brouer
---
 net/ipv4/ip_sockglue.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 71a52f4d4cffba2db9353f43dc817689bf4fab10..af4919792b6a812041dcb18ff30aa8b27482c7a2 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -284,9 +284,12 @@ int ip_cmsg_send(struct sock *sk, struct msghdr *msg, struct ipcm_cookie *ipc,
 			ipc->ttl = val;
 			break;
 		case IP_TOS:
-			if (cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
+			if (cmsg->cmsg_len == CMSG_LEN(sizeof(int)))
+				val = *(int *)CMSG_DATA(cmsg);
+			else if (cmsg->cmsg_len == CMSG_LEN(sizeof(u8)))
+				val = *(u8 *)CMSG_DATA(cmsg);
+			else
 				return -EINVAL;
-			val = *(int *)CMSG_DATA(cmsg);
 			if (val < 0 || val > 255)
 				return -EINVAL;
 			ipc->tos = val;
[PATCH] Bluetooth: fix kzalloc-simple.cocci warnings
net/bluetooth/mgmt.c:905:6-13: WARNING: kzalloc should be used for rp, instead of kmalloc/memset

Use kzalloc rather than kmalloc followed by memset with 0.

This considers some simple cases that are common and easy to validate.
Note in particular that there are no ...s in the rule, so all of the
matched code has to be contiguous.

Generated by: scripts/coccinelle/api/alloc/kzalloc-simple.cocci

CC: Michał Narajowski
Signed-off-by: Fengguang Wu
---
 mgmt.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

--- a/net/bluetooth/mgmt.c
+++ b/net/bluetooth/mgmt.c
@@ -902,12 +902,10 @@ static int read_ext_controller_info(stru
 	eir_len = eir_append_data(buff, eir_len, EIR_NAME_SHORT,
 				  hdev->short_name, name_len);
 
-	rp = kmalloc(sizeof(*rp) + eir_len, GFP_KERNEL);
+	rp = kzalloc(sizeof(*rp) + eir_len, GFP_KERNEL);
 	if (!rp)
 		return -ENOMEM;
 
-	memset(rp, 0, sizeof(*rp) + eir_len);
-
 	rp->eir_len = cpu_to_le16(eir_len);
 	memcpy(rp->eir, buff, eir_len);
Re: [PATCH 2/2 v3] net: smsc911x: request and deassert optional RESET GPIO
From: Linus Walleij
Date: Wed, 7 Sep 2016 15:53:42 +0200

> On some systems (such as the Qualcomm APQ8060 Dragonboard) the
> RESET signal of the SMSC911x is not pulled up by a resistor (or
> the internal pull-up that will pull it up if the pin is not
> even connected) but instead connected to a GPIO line, so that
> the operating system must explicitly deassert RESET before use.
>
> Support this in the SMSC911x driver so this ethernet connector
> can be used on such targets.
>
> Notice that we request the line to go logical low (deassert)
> whilst the line on the actual component is active low. This
> is managed in the respective hardware description when
> specifying the GPIO line with e.g. device tree or ACPI. With
> device tree it looks like this in one case:
>
> reset-gpios = < 30 GPIO_ACTIVE_LOW>;
>
> Which means that logically requesting the RESET line to be
> deasserted will result in the line being driven high, taking
> the device out of reset.
>
> Cc: Jeremy Linton
> Signed-off-by: Linus Walleij

Applied.
Re: [PATCH 1/2 v3] net: smsc911x: augment device tree bindings
From: Linus Walleij
Date: Wed, 7 Sep 2016 15:53:31 +0200

> This adds device tree bindings for:
>
> - An optional GPIO line for releasing the RESET signal to the
>   SMSC911x devices
>
> - An optional PME (power management event) interrupt line that
>   can be utilized to wake up the system on network activity.
>   This signal exists on all the SMSC911x devices, it is just not
>   very often routed.
>
> Both these lines are routed to the SoC on the Qualcomm APQ8060
> Dragonboard and thus need to be bound in the device tree.
>
> Cc: devicet...@vger.kernel.org
> Cc: Jeremy Linton
> Signed-off-by: Linus Walleij

Applied.
Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On Wed, Sep 7, 2016 at 7:58 PM, John Fastabend wrote:
> On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote:
>>
>> On Wed, 7 Sep 2016 19:57:19 +0300, Saeed Mahameed wrote:
>>> On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet wrote:
>>>> On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:
>>>>> On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet wrote:
>>>>>> On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:
>>>>>> [...]
>>>>
>>>> Only if a qdisc is present and pressure is high enough.
>>>>
>>>> But in a forwarding setup, we likely receive at a lower rate than the
>>>> NIC can transmit.
>>
>> Yes, I can confirm this happens in my experiments.
>>
>>> Jesper has a similar idea to make the qdisc think it is under
>>> pressure when the device TX ring is idle most of the time; I think
>>> his idea can come in handy here. I am not fully involved in the
>>> details, maybe he can elaborate more.
>>>
>>> But if it works, it will be transparent to napi, and xmit_more will
>>> happen by design.
>>
>> Yes. I have some ideas around getting more bulking going from the qdisc
>> layer, by having the drivers provide some feedback to the qdisc layer
>> indicating xmit_more should be possible. This will be a topic at the
>> Network Performance Workshop[1] at NetDev 1.2, where I will hopefully
>> challenge people to come up with a good solution ;-)
>
> One thing I've noticed but haven't yet actually analyzed much is if
> I shrink the nic descriptor ring size to only be slightly larger than
> the qdisc layer bulking size I get more bulking and better perf numbers.
> At least on microbenchmarks. The reason being the nic pushes back more
> on the qdisc. So maybe a case for making the ring size in the NIC some
> factor of the expected number of queues feeding the descriptor ring.

BQL is not helping with that?

Tom

> .John
RE: [PATCH net-next 0/3] r8152: configuration setting
David Miller [mailto:da...@davemloft.net]
> Sent: Thursday, September 08, 2016 8:38 AM
[...]
> By forcing a certain mode via a Kconfig value, you are basically making it
> impossible for distributions to do something reasonable here.

The request always comes from some manufacturers, not end users. They always ask the driver to control everything, and I don't think the end user knows or cares about how the device is set. That is why I tried to satisfy them via Kconfig. Since you have rejected this approach, I will think about whether there is another method. Thanks.

Best Regards,
Hayes
Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote:
>
> On Wed, 7 Sep 2016 19:57:19 +0300, Saeed Mahameed wrote:
>> On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet wrote:
>>> On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:
>>>> On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet wrote:
>>>>> On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:
>>>>> [...]
>>>
>>> Only if a qdisc is present and pressure is high enough.
>>>
>>> But in a forwarding setup, we likely receive at a lower rate than the
>>> NIC can transmit.
>
> Yes, I can confirm this happens in my experiments.
>
>> Jesper has a similar idea to make the qdisc think it is under
>> pressure when the device TX ring is idle most of the time; I think
>> his idea can come in handy here. I am not fully involved in the
>> details, maybe he can elaborate more.
>>
>> But if it works, it will be transparent to napi, and xmit_more will
>> happen by design.
>
> Yes. I have some ideas around getting more bulking going from the qdisc
> layer, by having the drivers provide some feedback to the qdisc layer
> indicating xmit_more should be possible. This will be a topic at the
> Network Performance Workshop[1] at NetDev 1.2, where I will hopefully
> challenge people to come up with a good solution ;-)

One thing I've noticed but haven't yet actually analyzed much is that if
I shrink the NIC descriptor ring size to only be slightly larger than
the qdisc layer bulking size, I get more bulking and better perf numbers,
at least on microbenchmarks. The reason being the NIC pushes back more
on the qdisc. So maybe there is a case for making the ring size in the NIC
some factor of the expected number of queues feeding the descriptor ring.

.John
RE: [PATCH net-next 0/3] r8152: configuration setting
Bjørn Mork [mailto:bj...@mork.no]
> Sent: Wednesday, September 07, 2016 9:51 PM
[...]
> So this adds a lot of code to work around the issues you introduced by
> unnecessarily blacklisting the CDC ECM configuration earlier, and still
> makes the r8152 driver handle the device even in ECM mode.

I suggested using vendor mode only, but some people asked me to submit such patches. If these patches are rejected, I have a solid reason to tell them the approach is unacceptable, rather than simply refusing to do it.

> Just remove the completely unnecessary blacklist, and let the cdc_ether
> driver handle the device if the user selects the ECM configuration.
> That's how the USB system works. There is no need for any code in r8152
> to do that.

The pure cdc_ether driver cannot change the speed of the Ethernet link, because it does not know how to access the PHY of the device. Therefore, I added the relevant code in the r8152 driver.

Best Regards,
Hayes
[PATCHv2 iproute2] ip route: check ftell, fseek return value
ftell() may return -1 on error; this case is not handled, so a negative offset could be passed to fseek(). The return code of fseek() is also not checked.

Reported-by: Phil Sutter
Signed-off-by: Hangbin Liu
---
 ip/iproute.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/ip/iproute.c b/ip/iproute.c
index 3da23af..c06a474 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -1860,6 +1860,11 @@ static int iproute_restore(void)
 		exit(-1);
 
 	pos = ftell(stdin);
+	if (pos == -1) {
+		perror("Failed to restore: ftell");
+		exit(-1);
+	}
+
 	for (prio = 0; prio < 3; prio++) {
 		int err;
 
@@ -1867,7 +1872,10 @@ static int iproute_restore(void)
 		if (err)
 			exit(err);
 
-		fseek(stdin, pos, SEEK_SET);
+		if (fseek(stdin, pos, SEEK_SET) == -1) {
+			perror("Failed to restore: fseek");
+			exit(-1);
+		}
 	}
 
 	exit(0);
-- 
2.5.5
Re: vlan aware bridge doesn't propagate mac changes to vlans on top of it
On 2016/09/08 3:22, Michal Soltys wrote:
...
> 4.7.2
> git describe on that commit suggests it's been available since 4.6.x
>
> What I did in detail:
>
> ip li add name port1b type veth peer name port1e
> ip li add br0 type bridge
> ip li set dev br0 type bridge vlan_default_pvid 0
> ip li set dev br0 type bridge vlan_filtering 1
> bridge vlan add dev br0 vid 10 self
> bridge vlan add dev br0 vid 250 untagged pvid self
> ip li add link br0 name vlan10 type vlan id 10
> ip li set port1b master br0
>
> At this point br0.vlan10 had outdated mac after br0 took port1b's one as
> its own.

If the mac address of the lower device is changed while the vlan device is down, the address will be synchronized when the vlan device comes up. Please try "ip li set vlan10 up".

-- 
Toshiaki Makita
[PATCH] net_namespace: fixed net_device reference leak
During namespace cleanup, if the 'dst' subsystem is holding a reference to the loopback interface in the namespace, the reference does not get released. This is because, in the case where the net_device held by 'dst' is the same as the namespace's loopback net_device, the current code first does a 'dev_hold' on that device followed by a 'dev_put' on the same device, resulting in a no-op.

This change fixes the leak by assigning the initial namespace's loopback device to the 'dst' before releasing the reference to the network device being unregistered.

Additional reference: https://github.com/docker/docker/issues/5618

Signed-off-by: Jojy G Varghese
---
 net/core/dst.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/dst.c b/net/core/dst.c
index a1656e3..a18e8ea 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -433,7 +433,7 @@ static void dst_ifdown(struct dst_entry *dst, struct net_device *dev,
 		dst->input = dst_discard;
 		dst->output = dst_discard_out;
 	} else {
-		dst->dev = dev_net(dst->dev)->loopback_dev;
+		dst->dev = init_net.loopback_dev;
 		dev_hold(dst->dev);
 		dev_put(dev);
 	}
-- 
1.8.3.1
Re: Minimum MTU Mess
> This is definitely going to require a few passes... (Working my way > through every driver with an ndo_change_mtu wired up right now to > see just how crazy this might get). It might be something Coccinelle can help you with. Try describing the transformation you want to do, to their mailing list, and they might come up with a script for you. Andrew
Re: [PATCH net-next v22 3/3] openvswitch: 802.1AD Flow handling, actions, vlan parsing, netlink attributes
On Wed, Sep 7, 2016 at 9:56 AM, Eric Garver wrote:
> Add support for 802.1ad including the ability to push and pop double
> tagged vlans. Add support for 802.1ad to netlink parsing and flow
> conversion. Uses double nested encap attributes to represent double
> tagged vlan. Inner TPID encoded along with ctci in nested attributes.
>
> This is based on Thomas F Herbert's original v20 patch. I made some
> small clean ups and bug fixes.
>
> Signed-off-by: Thomas F Herbert
> Signed-off-by: Eric Garver
> ---

Thanks for finishing the work.

Acked-by: Pravin B Shelar
Re: [ovs-dev] [PATCH net-next v22 1/3] openvswitch: 802.1ad uapi changes.
On Wed, Sep 7, 2016 at 9:56 AM, Eric Garver wrote:
> From: Thomas F Herbert
>
> openvswitch: Add support for 8021.AD
>
> Change the description of the VLAN tpid field.
>
> Signed-off-by: Thomas F Herbert

Acked-by: Pravin B Shelar
Re: [ovs-dev] [PATCH net-next v22 2/3] vlan: Check for vlan ethernet types for 8021.q or 802.1ad
On Wed, Sep 7, 2016 at 9:56 AM, Eric Garver wrote:
> This is to simplify using double tagged vlans. This function allows all
> valid vlan ethertypes to be checked in a single function call.
> Also replace some instances that check for both ETH_P_8021Q and
> ETH_P_8021AD.
>
> Patch based on one originally by Thomas F Herbert.
>
> Signed-off-by: Thomas F Herbert
> Signed-off-by: Eric Garver

Acked-by: Pravin B Shelar
Re: [PATCH v2 net-next 0/2] qed*: Debug data collection
From: Tomer Tayar
Date: Wed, 7 Sep 2016 16:36:23 +0300

> This patch series adds the support of debug data collection in the qed driver,
> and the means to extract it in the qede driver via the get_regs operation.
...
> Please consider applying this to 'net-next'.

Series applied, thank you.
Re: [PATCH] kaweth: remove obsolete debugging statements
From: Oliver Neukum
Date: Wed, 7 Sep 2016 15:27:09 +0200

> Some statements in the driver only served to inform
> which functions were entered. Ftrace can do that just as well without
> needing memory. Remove the statements.
>
> Signed-off-by: Oliver Neukum

Applied to net-next, thanks Oliver.
Re: [PATCH] qed: mark symbols static where possible
From: Yuval Mintz
Date: Wed, 7 Sep 2016 11:55:34 +

>> We get a few warnings when building the kernel with W=1:
>> drivers/net/ethernet/qlogic/qed/qed_l2.c:112:5: warning: no previous
>> prototype for 'qed_sp_vport_start' [-Wmissing-prototypes]
>>
>> In fact, these functions are only used in the file in which they are
>> declared and don't need a declaration, but can be made static.
>> So this patch marks these functions with 'static'.
>>
>> Signed-off-by: Baoyou Xie
>
> Most of these changes are correct. I have 2 reservations, though:
>
> 1. qed_get_vport_stats() is in use in net-next by qed_main.c starting with
>    6c75424612a7 ("qed: Add support for NCSI statistics."), so we shouldn't
>    make it static.
>    [This raises the question of which repository this patch was built against.]
>
> 2. Regarding the changes in qed_cxt.c - some of the functions you're turning
>    into static are going to be used by other files in our pending RoCE driver
>    submissions [you could say those ARE really missing their prototypes].
>    I don't have a real objection to this change - I'm just stating that if
>    you change those now to static, we'll probably 'revert' the change in the
>    near future.

Baoyou, please respin this against net-next, thanks.
Re: [PATCH] qed: add missing header dependencies
From: Baoyou Xie
Date: Wed, 7 Sep 2016 19:07:00 +0800

> We get 4 warnings when building the kernel with W=1:
> drivers/net/ethernet/qlogic/qed/qed_selftest.c:6:5: warning: no previous
> prototype for 'qed_selftest_memory' [-Wmissing-prototypes]
> drivers/net/ethernet/qlogic/qed/qed_selftest.c:19:5: warning: no previous
> prototype for 'qed_selftest_interrupt' [-Wmissing-prototypes]
> drivers/net/ethernet/qlogic/qed/qed_selftest.c:32:5: warning: no previous
> prototype for 'qed_selftest_register' [-Wmissing-prototypes]
> drivers/net/ethernet/qlogic/qed/qed_selftest.c:55:5: warning: no previous
> prototype for 'qed_selftest_clock' [-Wmissing-prototypes]
>
> In fact, these functions are declared in qed_selftest.h, so this patch
> adds the missing header dependencies.
>
> Signed-off-by: Baoyou Xie

Applied.
Re: [PATCH net-next 0/3] r8152: configuration setting
From: Hayes Wang
Date: Wed, 7 Sep 2016 16:12:19 +0800

> Some people prefer to use ECM mode rather than vendor mode. Therefore, I add
> CONFIG_RTL8152_CONFIG_VALUE in Kconfig. Then, the users could choose the USB
> configuration value which they want. The default is to support vendor mode
> only.

By forcing a certain mode via a Kconfig value, you are basically making it impossible for distributions to do something reasonable here.

This needs to be somehow a runtime thing, and I'm pretty sure we've had many other situations of USB devices which want to be run by different "personality" drivers in the past.

Perhaps we can do something generically for USB devices for this kind of situation and make r8152 use it.

Thanks.
Re: [PATCH net-next V6 0/4] net/sched: ip tunnel metadata set/release/classify by using TC
From: Hadar Hen Zion
Date: Wed, 7 Sep 2016 11:08:02 +0300

> This patchset introduces ip tunnel manipulation support using the TC
> subsystem.

Please address the feedback given by Eric Dumazet for patch #4, thank you.
Re: [PATCH net-next v2] netlink: don't forget to release a rhashtable_iter structure
From: Andrei Vagin
Date: Tue, 6 Sep 2016 21:31:17 -0700

> This bug was detected by kmemleak:
> unreferenced object 0x8804269cc3c0 (size 64):
>   comm "criu", pid 1042, jiffies 4294907360 (age 13.713s)
>   hex dump (first 32 bytes):
>     a0 32 cc 2c 04 88 ff ff 00 00 00 00 00 00 00 00  .2.,
>     00 01 00 00 00 00 ad de 00 02 00 00 00 00 ad de
>   backtrace:
>     [] kmemleak_alloc+0x4a/0xa0
>     [] kmem_cache_alloc_trace+0x10f/0x280
>     [] __netlink_diag_dump+0x26c/0x290 [netlink_diag]
>
> v2: don't remove a reference on a rhashtable_iter structure to
> release it from netlink_diag_dump_done
>
> Cc: Herbert Xu
> Fixes: ad202074320c ("netlink: Use rhashtable walk interface in diag dump")
> Signed-off-by: Andrei Vagin

Applied, thanks.
Re: [PATCH net-next] MAINTAINERS: Update CPMAC email address
From: Florian Fainelli
Date: Tue, 6 Sep 2016 20:22:21 -0700

> Signed-off-by: Florian Fainelli

Since MAINTAINERS entry accuracy is very important for bug reporting and patch CC:'ing, I tend to apply these kinds of patches always to 'net'. And I have done so in this case too.

Thanks.
Re: [PATCH net-next 1/2] tcp: measure rwnd-limited time
From: "Francis Y. Yan"
Date: Tue, 6 Sep 2016 18:32:40 -0700

> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index 7be9b12..f5b588e 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -176,6 +176,7 @@ struct tcp_sock {
>  	 * were acked.
>  	 */
>  	struct u64_stats_sync syncp;	/* protects 64bit vars (cf tcp_get_info()) */
> +	seqcount_t seqcnt;	/* protects rwnd-limited-related vars, etc. */
>  
>  	u32	snd_una;	/* First byte we want an ack for */
>  	u32	snd_sml;	/* Last byte of the most recently transmitted small packet */
> @@ -204,6 +205,8 @@ struct tcp_sock {
>  
>  	u32	window_clamp;	/* Maximal window to advertise */
>  	u32	rcv_ssthresh;	/* Current window clamp */
> +	struct skb_mstamp rwnd_limited_ts;	/* Last timestamp limited by rwnd */
> +	u64	rwnd_limited;	/* Total time (us) limited by rwnd */
>  
>  	/* Information of the most recently (s)acked skb */
>  	struct tcp_rack {

I understand completely the usefulness of this change, but wow, that is a lot of new TCP socket space taken up just to export some time values in tcp_info for debugging and statistics.
Re: [PATCH net-next] net: xfrm: Change u32 sysctl entries to use proc_douintvec
From: Subash Abhinov Kasiviswanathan
Date: Tue, 6 Sep 2016 18:09:31 -0600

> proc_dointvec limits the values to INT_MAX in u32 sysctl entries.
> proc_douintvec allows writing up to UINT_MAX.
>
> Signed-off-by: Subash Abhinov Kasiviswanathan

I am assuming Steffen will pick this up.
Re: [PATCH net-next 0/1] rxrpc: Local abort tracepoint
From: David Howells
Date: Wed, 07 Sep 2016 16:25:50 +0100

> Here's a patch that adds a tracepoint that allows local aborts to be
> debugged. This needs to be applied on top of the just-posted call refcount
> overhaul patch.
>
> The patches can be found here also:
>
> http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-rewrite
>
> Tagged thusly:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
> rxrpc-rewrite-20160904-2

Also pulled, thanks a lot David.
Re: [PATCH net-next 0/8] rxrpc: Overhaul call refcounting
From: David Howells <dhowe...@redhat.com>
Date: Wed, 07 Sep 2016 16:22:26 +0100

> Here's a set of mostly small patches leading up to one big one.
>
> The big patch at the end of the series overhauls how rxrpc_call refcounting
> is handled, making it more sane so that calls bound to user IDs are _only_
> released from socket operations or kernel API functions. Further, the
> patch stops calls from holding refs on their parent socket - which can
> prevent the socket from being cleaned up.
>
> The second largest patch improves the call tracking tracepoint by providing
> extra information about the situation in which gets and puts occur. This
> allows distinctions to be drawn between refs held by the socket user ID
> tree, refs held by the work queue (to be implemented by a future patch) and
> other refs.
>
> The other patches include a couple of cleanups and some simple alterations
> to avoid NULL pointer dereferences in the big patch.
>
> The patches can be found here also (non-terminally on the branch):
>
> http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-rewrite
>
> Tagged thusly:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
> rxrpc-rewrite-20160907-1

Pulled.
Re: XPS configuration question (on tg3)
On Thu, 2016-09-08 at 01:45 +0200, Michal Soltys wrote:
> > Keep in mind that TCP stack can send packets, responding to incoming
> > ACK.
> >
> > So you might check that incoming ACK are handled by the 'right' cpu.
> >
> > Without RFS, there is no such guarantee.
>
> Did some more testing today, indeed RFS helped with TCP flows, but
> it got me wondering:
>
> In scenario such as: XPS off, RFS/RPS off, irqs pinned, process
> transfering data pinned - one tx queue was chosen (through hash) and
> consistently persisted throughout the whole transfer. No exceptions, at
> least none in the tests I did.

It depends if at least one packet for each flow sits in a qdisc/NIC queue.

TCP has this Out Of Order transmit logic preventing a TX queue change,
even if the process doing the sendmsg() is migrated.

git grep -n ooo_okay

> When XPS is getting enabled, the only thing that changes is that instead
> of using hash to select one of available queues, the cpu running process
> is specifically told which queue it can use (and eventually selects one
> through hash if more than one is available). Shouldn't the choice
> persist throughout the transfer as well then ?

Sure, if the process doing the sendmsg() sticks to one cpu, and this cpu
is the one handling incoming ACK packets as well.
Re: XPS configuration question (on tg3)
On 2016-09-07 02:19, Eric Dumazet wrote: > On Tue, 2016-09-06 at 23:00 +0200, Michal Soltys wrote: >> On 2016-09-06 22:21, Alexander Duyck wrote: >> > On Tue, Sep 6, 2016 at 11:46 AM, Michal Soltyswrote: >> >> Hi, >> >> >> >> I've been testing different configurations and I didn't manage to get XPS >> >> to "behave" correctly - so I'm probably misunderstanding or forgetting >> >> something. The nic in question (under tg3 driver - BCM5720 and BCM5719 >> >> models) was configured to 3 tx and 4 rx queues. 3 irqs were shared (tx >> >> and rx), 1 was unused (this got me scratching my head a bit) and the >> >> remaining one was for the last rx (though due to another bug recently >> >> fixed the 4th rx queue was inconfigurable on receive side). The names >> >> were: eth1b-0, eth1b-txrx-1, eth1b-txrx-2, eth1b-txrx-3, eth1b-rx-4. >> >> >> >> The XPS was configured as: >> >> >> >> echo f >/sys/class/net/eth1b/queues/tx-0/xps_cpus >> >> echo f0 >/sys/class/net/eth1b/queues/tx-1/xps_cpus >> >> echo ff00 >/sys/class/net/eth1b/queues/tx-2/xps_cpus >> >> >> >> So as far as I understand - cpus 0-3 should be allowed to use tx-0 queue >> >> only, 4-7 tx-1 and 8-15 tx-2. >> >> >> >> Just in case rx side could get in the way as far as flows go, relevant >> >> irqs were pinned to specific cpus - txrx-1 to 2, txrx-2 to 4, txrx-3 to >> >> 10 - falling into groups defined by the above masks. 
>> >> >> >> I tested both with mq and multiq scheduler, essentially either this: >> >> >> >> qdisc mq 2: root >> >> qdisc pfifo_fast 0: parent 2:1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 >> >> 1 1 1 >> >> qdisc pfifo_fast 0: parent 2:2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 >> >> 1 1 1 >> >> qdisc pfifo_fast 0: parent 2:3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 >> >> 1 1 1 >> >> >> >> or this (for the record, skbaction queue_mapping was behaving correctly >> >> with the one below): >> >> >> >> qdisc multiq 3: root refcnt 6 bands 3/5 >> >> qdisc pfifo_fast 31: parent 3:1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 >> >> 1 1 1 1 >> >> qdisc pfifo_fast 32: parent 3:2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 >> >> 1 1 1 1 >> >> qdisc pfifo_fast 33: parent 3:3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 >> >> 1 1 1 1 >> >> >> >> Now, do I understand correctly, that under the above setup - commands >> >> such as >> >> >> >> taskset 400 nc -p $prt host_ip 12345 > >> or >> >> yancat -i /dev/zero -o t:host_ip:12345 -u 10 -U 10 >> >> >> >> ITOW - pinning simple nc command on cpu #10 (or using a tool that >> >> supports affinity by itself) and sending data to some other host on the >> >> net - should *always* use tx-2 queue ? >> >> I also tested variation such as: taskset 400 nc -l -p host_ip 12345 >> >> > >> >> >> In my case, what queue it used was basically random (on top of that it >> >> sometimes changed the used queue mid-transfer) what could be easily >> >> confirmed through both /proc/interrupts and tc -s qdisc show. And I'm a >> >> bit at loss now, as I though xps configuration should be absolute. >> >> >> >> Well, I'd be greatful for some pointers / hints. >> > >> > So it sounds like you have everything configured correctly. The one >> > question I would have is if we are certain the CPU pinning is working >> > for the application. 
You might try using something like perf to >> > verify what is running on CPU 10, and what is running on the CPUs that >> > the queues are associated with. >> > >> >> I did verify with 'top' in this case. I'll double check tommorow just to >> be sure. Other than testing, there was nothing else running on the machine. >> >> > Also after you have configured things you may want to double check and >> > verify the xps_cpus value is still set. I know under some >> > circumstances the value can be reset by a device driver if the number >> > of queues changes, or if the interface toggles between being >> > administratively up/down. >> >> Hmm, none of this was happening during tests. >> >> Are there any other circumstances where xps settings could be ignored or >> changed during the test (that is during the actual transfer, not between >> separate attempts) ? >> >> One thing I'm a bit afraid is that kernel was not exactly the newest >> (3.16), maybe I'm missing some crucial fixes, though xps was added much >> earlier than that. Either way, I'll try to redo tests with current >> kernel tommorow. >> > > Keep in mind that TCP stack can send packets, responding to incoming > ACK. > > So you might check that incoming ACK are handled by the 'right' cpu. > > Without RFS, there is no such guarantee. > > echo 32768 >/proc/sys/net/core/rps_sock_flow_entries > echo 8192 >/sys/class/net/eth1/queues/rx-0/rps_flow_cnt > echo 8192 >/sys/class/net/eth1/queues/rx-1/rps_flow_cnt > echo 8192 >/sys/class/net/eth1/queues/rx-2/rps_flow_cnt > echo 8192 >/sys/class/net/eth1/queues/rx-3/rps_flow_cnt > Did some more testing today, indeed RFS helped with TCP flows, but it got me
Re: Minimum MTU Mess
On Wed, Sep 07, 2016 at 01:35:35PM -0700, Stephen Hemminger wrote: > On Wed, 7 Sep 2016 15:53:56 -0400 > Jarod Wilson wrote: > > > --- a/net/core/dev.c > > +++ b/net/core/dev.c > > @@ -6466,9 +6466,17 @@ int dev_set_mtu(struct net_device *dev, int new_mtu) > > if (new_mtu == dev->mtu) > > return 0; > > > > - /* MTU must be positive.*/ > > - if (new_mtu < 0) > > + if (new_mtu < dev->min_mtu) { > > + netdev_err(dev, "Invalid MTU %d requested, hw min %d\n", > > + new_mtu, dev->min_mtu); > > return -EINVAL; > > + } > > + > > + if (new_mtu > dev->max_mtu) { > > + netdev_err(dev, "Invalid MTU %d requested, hw max %d\n", > > + new_mtu, dev->max_mtu); > > + return -EINVAL; > > + } > > > > Maybe don't log something that can be triggered from a user program. > Or at least rate limit it. Yeah, I was a little bit on the fence on whether to log anything, make it netdev_err, netdev_dbg, or what. Quite a few drivers have a netdev_err for failed MTU changes, while others also have netdev_info spew for successful MTU changes. Maybe a rate-limited netdev_err is the way to go here. -- Jarod Wilson ja...@redhat.com
Re: Minimum MTU Mess
On Wed, Sep 07, 2016 at 10:31:12PM +0200, Andrew Lunn wrote: > Hi Jarod > > > - /* MTU must be positive.*/ > > - if (new_mtu < 0) > > + if (new_mtu < dev->min_mtu) { > > + netdev_err(dev, "Invalid MTU %d requested, hw min %d\n", > > + new_mtu, dev->min_mtu); > > return -EINVAL; > > + } > > + > > + if (new_mtu > dev->max_mtu) { > > + netdev_err(dev, "Invalid MTU %d requested, hw max %d\n", > > + new_mtu, dev->max_mtu); > > + return -EINVAL; > > + } > > I doubt you can make such a big change like this in one go. Can you > really guarantee all interfaces, of what ever type, will have some > value for dev->min_mtu and dev->max_mtu? What may fly is something > more like: > > > + if (dev->max_mtu && new_mtu > dev->max_mtu) { > > + netdev_err(dev, "Invalid MTU %d requested, hw max %d\n", > > + new_mtu, dev->max_mtu); > > + return -EINVAL; > > + } > > Maybe in a few cycles you can add a WARN_ON(!dev->max_mtu), and a few > cycles after that go with (new_mtu > dev->max_mtu). My local tree actually has if (dev->max_mtu > 0 && new_mtu > dev->max_mtu) since just after I'd sent my mail for exactly that reason, though looking at alloc_netdev_mqs(), it does seem we're at least guaranteed the value will be 0 if not otherwise initialized, so your version looks perfectly fine, and in the min_mtu case, without any additional handling, things behave exactly as they did before. This is definitely going to require a few passes... (Working my way through every driver with an ndo_change_mtu wired up right now to see just how crazy this might get). -- Jarod Wilson ja...@redhat.com
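Andrew's guarded upper-bound check is easy to model in isolation. Below is a minimal userspace sketch of the discussed range check; all names here (fake_dev, the literal -22 standing in for -EINVAL) are invented for illustration and are not the kernel's code:

```c
/*
 * Sketch of the centralized MTU range check being discussed, with the
 * suggested guard: max_mtu == 0 means "driver never filled it in", so
 * the upper bound is not enforced yet. Illustrative only.
 */
#include <assert.h>

#define EINVAL_SKETCH 22   /* stand-in for the kernel's EINVAL */

struct fake_dev {
	int mtu;
	int min_mtu;   /* 0 if the driver never set it */
	int max_mtu;   /* 0 if the driver never set it */
};

static int dev_set_mtu_sketch(struct fake_dev *dev, int new_mtu)
{
	if (new_mtu == dev->mtu)
		return 0;

	if (new_mtu < dev->min_mtu)
		return -EINVAL_SKETCH;      /* below hardware minimum */

	/* guarded: only enforce an upper bound once a driver provides one */
	if (dev->max_mtu && new_mtu > dev->max_mtu)
		return -EINVAL_SKETCH;      /* above hardware maximum */

	dev->mtu = new_mtu;
	return 0;
}
```

With the guard, a device whose driver has not been converted yet (max_mtu still 0) keeps today's behavior for large MTUs, while the min check behaves exactly as the old `new_mtu < 0` test did once min_mtu defaults sensibly.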
Re: [PATCH 2/2 v3] net: smsc911x: request and deassert optional RESET GPIO
Hi, On 09/07/2016 08:53 AM, Linus Walleij wrote: On some systems (such as the Qualcomm APQ8060 Dragonboard) the RESET signal of the SMSC911x is not pulled up by a resistor (or the internal pull-up that will pull it up if the pin is not even connected) but instead connected to a GPIO line, so that the operating system must explicitly deassert RESET before use. Support this in the SMSC911x driver so this ethernet connector can be used on such targets. Notice that we request the line to go logical low (deassert) whilst the line on the actual component is active low. This is managed in the respective hardware description when specifying the GPIO line with e.g. device tree or ACPI. With device tree it looks like this in one case: reset-gpios = < 30 GPIO_ACTIVE_LOW>; Which means that logically requesting the RESET line to be deasserted will result in the line being driven high, taking the device out of reset. Cc: Jeremy Linton Signed-off-by: Linus Walleij Thanks for clarifying the GPIO_ACTIVE_LOW state on this pin. I've reviewed this patch, and tested that it doesn't have any effect on a JunoR2 (ACPI or DT) system. I've got some personal reservations about making changes for a single board that might better be handled in firmware. But, I don't think those should be an excuse to keep that hardware from working. Particularly since this is a fairly trivial change. So FWIW, Reviewed-by: Jeremy Linton Thanks,
Re: [Patch v6] net: ethernet: xilinx: Enable emaclite for MIPS
From: Zubair Lutfullah Kakakhel Date: Mon, 5 Sep 2016 13:07:54 +0100 > The MIPS based xilfpga platform uses this driver. > Enable it for MIPS > > Signed-off-by: Zubair Lutfullah Kakakhel > > --- > V1 -> V6 are from a series that has gotten too big. > So I have split this patch and am sending it separately. What tree do you expect this patch to be applied to?
[PATCH net-next] bpf: fix range propagation on direct packet access
LLVM can generate code that tests for direct packet access via skb->data/data_end in a way that currently gets rejected by the verifier, example: [...] 7: (61) r3 = *(u32 *)(r6 +80) 8: (61) r9 = *(u32 *)(r6 +76) 9: (bf) r2 = r9 10: (07) r2 += 54 11: (3d) if r3 >= r2 goto pc+12 R1=inv R2=pkt(id=0,off=54,r=0) R3=pkt_end R4=inv R6=ctx R9=pkt(id=0,off=0,r=0) R10=fp 12: (18) r4 = 0xff7a 14: (05) goto pc+430 [...] from 11 to 24: R1=inv R2=pkt(id=0,off=54,r=0) R3=pkt_end R4=inv R6=ctx R9=pkt(id=0,off=0,r=0) R10=fp 24: (7b) *(u64 *)(r10 -40) = r1 25: (b7) r1 = 0 26: (63) *(u32 *)(r6 +56) = r1 27: (b7) r2 = 40 28: (71) r8 = *(u8 *)(r9 +20) invalid access to packet, off=20 size=1, R9(id=0,off=0,r=0) The reason why this gets rejected despite a proper test is that we currently call find_good_pkt_pointers() only in case where we detect tests like rX > pkt_end, where rX is of type pkt(id=Y,off=Z,r=0) and derived, for example, from a register of type pkt(id=Y,off=0,r=0) pointing to skb->data. find_good_pkt_pointers() then fills the range in the current branch to pkt(id=Y,off=0,r=Z) on success. For above case, we need to extend that to recognize pkt_end >= rX pattern and mark the other branch that is taken on success with the appropriate pkt(id=Y,off=0,r=Z) type via find_good_pkt_pointers(). Since eBPF operates on BPF_JGT (>) and BPF_JGE (>=), these are the only two practical options to test for from what LLVM could have generated, since there's no such thing as BPF_JLT (<) or BPF_JLE (<=) that we would need to take into account as well. After the fix: [...] 7: (61) r3 = *(u32 *)(r6 +80) 8: (61) r9 = *(u32 *)(r6 +76) 9: (bf) r2 = r9 10: (07) r2 += 54 11: (3d) if r3 >= r2 goto pc+12 R1=inv R2=pkt(id=0,off=54,r=0) R3=pkt_end R4=inv R6=ctx R9=pkt(id=0,off=0,r=0) R10=fp 12: (18) r4 = 0xff7a 14: (05) goto pc+430 [...] 
from 11 to 24: R1=inv R2=pkt(id=0,off=54,r=54) R3=pkt_end R4=inv R6=ctx R9=pkt(id=0,off=0,r=54) R10=fp 24: (7b) *(u64 *)(r10 -40) = r1 25: (b7) r1 = 0 26: (63) *(u32 *)(r6 +56) = r1 27: (b7) r2 = 40 28: (71) r8 = *(u8 *)(r9 +20) 29: (bf) r1 = r8 30: (25) if r8 > 0x3c goto pc+47 R1=inv56 R2=imm40 R3=pkt_end R4=inv R6=ctx R8=inv56 R9=pkt(id=0,off=0,r=54) R10=fp 31: (b7) r1 = 1 [...] Verifier test cases are also added in this work, one that demonstrates the mentioned example here and one that tries a bad packet access for the current/fall-through branch (the one with types pkt(id=X,off=Y,r=0), pkt(id=X,off=0,r=0)), then a case with good and bad accesses, and two with both test variants (>, >=). Fixes: 969bf05eb3ce ("bpf: direct packet access") Signed-off-by: Daniel Borkmann Acked-by: Alexei Starovoitov --- Targeting at net-next as this should be just fine. kernel/bpf/verifier.c | 55 +--- samples/bpf/test_verifier.c | 102 2 files changed, 142 insertions(+), 15 deletions(-) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 48c2705..90493a6 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -1637,21 +1637,42 @@ static int check_alu_op(struct verifier_env *env, struct bpf_insn *insn) return 0; } -static void find_good_pkt_pointers(struct verifier_env *env, - struct reg_state *dst_reg) +static void find_good_pkt_pointers(struct verifier_state *state, + const struct reg_state *dst_reg) { - struct verifier_state *state = &env->cur_state; struct reg_state *regs = state->regs, *reg; int i; - /* r2 = r3; -* r2 += 8 -* if (r2 > pkt_end) goto somewhere -* r2 == dst_reg, pkt_end == src_reg, -* r2=pkt(id=n,off=8,r=0) -* r3=pkt(id=n,off=0,r=0) -* find register r3 and mark its range as r3=pkt(id=n,off=0,r=8) -* so that range of bytes [r3, r3 + 8) is safe to access + + /* LLVM can generate two kind of checks: +* +* Type 1: +* +* r2 = r3; +* r2 += 8; +* if (r2 > pkt_end) goto +* +* +* Where: +* r2 == dst_reg, pkt_end == src_reg +* r2=pkt(id=n,off=8,r=0) +* 
r3=pkt(id=n,off=0,r=0) +* +* Type 2: +* +* r2 = r3; +* r2 += 8; +* if (pkt_end >= r2) goto +* +* +* Where: +* pkt_end == dst_reg, r2 == src_reg +* r2=pkt(id=n,off=8,r=0) +* r3=pkt(id=n,off=0,r=0) +* +* Find register r3 and mark its range as r3=pkt(id=n,off=0,r=8) +* so that range of bytes [r3, r3 + 8) is safe to access. */ + for (i = 0; i < MAX_BPF_REG; i++) if (regs[i].type == PTR_TO_PACKET && regs[i].id == dst_reg->id)
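The two test shapes the verifier now recognizes correspond, at the C level, to two equivalent ways of writing the same bounds check. The following is a hedged userspace sketch using plain pointers instead of BPF registers, not the verifier's own code:

```c
/*
 * Illustration of the two equivalent packet bounds checks LLVM can emit
 * for direct packet access: "data + len > data_end" with the drop on the
 * taken branch (type 1), versus "data_end >= data + len" with the safe
 * path on the taken branch (type 2). Both must make [data, data+54)
 * provably safe to read. Userspace model only.
 */
#include <assert.h>
#include <stdbool.h>

static bool check_type1(const unsigned char *data,
			const unsigned char *data_end)
{
	/* r2 = data + 54; if (r2 > pkt_end) goto drop; */
	if (data + 54 > data_end)
		return false;          /* out of bounds: drop */
	return true;                   /* [data, data+54) safe */
}

static bool check_type2(const unsigned char *data,
			const unsigned char *data_end)
{
	/* same test, operands swapped: if (pkt_end >= r2) goto ok; */
	if (data_end >= data + 54)
		return true;           /* [data, data+54) safe */
	return false;
}
```

The patch teaches find_good_pkt_pointers() to mark the range on the correct branch for the second form as well, since before it only matched the first.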
Re: [PATCH net-next V2 0/6] Mellanox 100G mlx5 DCBX CEE and firmware support
From: Or Gerlitz Date: Wed, 7 Sep 2016 23:51:56 +0300 > On Wed, Sep 7, 2016 at 11:27 PM, David Miller wrote: >> From: Saeed Mahameed >> Date: Wed, 7 Sep 2016 23:16:13 +0300 >> >>> Hi Dave, Sorry to bother, but i would like to drop this series for now >>> Huy is working to define and come up with a better mechanism to >>> enable/disable the new DCBX hybrid mode he is adding. >> >> I can't "just drop" any change that's in my tree already. > > Dave, I don't see it @ your kernel.org clone but it's not in > patchworks either, so I guess you picked that into your other clone. Aha, I marked it "changes requested" in patchwork and did not apply it. My bad.
Re: [PATCH net-next] tcp: use an RB tree for ooo receive queue
On Wed, 2016-09-07 at 15:26 -0700, Stephen Hemminger wrote: > How much does this grow the size of tcp socket structure? This actually shrinks it by 8 bytes, or more on debug kernels where sizeof(spinlock_t) > 4 Before : struct sk_buff_head out_of_order_queue; // At least 24 bytes on 64bit After : 2 pointers ( 16 bytes on 64bit )
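Eric's size arithmetic can be checked with simplified stand-ins for the kernel types. The structures below are models, not the real definitions, and assume a 64-bit build with a 4-byte non-debug spinlock_t:

```c
/*
 * Back-of-the-envelope check of the size claim: a list-based
 * sk_buff_head (two pointers, a length, a lock) is 24 bytes on 64-bit,
 * while rb_root plus the ooo_last_skb cache pointer is 16 bytes.
 * Simplified layouts, for illustration only.
 */
#include <assert.h>

struct skb_head_like {            /* models struct sk_buff_head */
	void *next;
	void *prev;
	unsigned int qlen;
	int lock;                 /* spinlock_t is 4 bytes when !debug */
};

struct rb_root_like {             /* models struct rb_root */
	void *rb_node;
};

struct ooo_after {                /* after the patch */
	struct rb_root_like out_of_order_queue;
	void *ooo_last_skb;       /* cache of rb_last() */
};
```

On a debug kernel where spinlock_t grows beyond 4 bytes, the saving is correspondingly larger, which is the point Eric makes above.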
Re: [PATCH net-next] tcp: use an RB tree for ooo receive queue
On Wed, 07 Sep 2016 14:49:28 -0700 Eric Dumazet wrote: > From: Yaogong Wang > > Over the years, TCP BDP has increased by several orders of magnitude, > and some people are considering reaching the 2 Gbytes limit. > > Even with current window scale limit of 14, ~1 Gbytes maps to ~740,000 > MSS. > > In presence of packet losses (or reorders), TCP stores incoming packets > into an out of order queue, and number of skbs sitting there waiting for > the missing packets to be received can be in the 10^5 range. > > Most packets are appended to the tail of this queue, and when > packets can finally be transferred to receive queue, we scan the queue > from its head. > > However, in presence of heavy losses, we might have to find an arbitrary > point in this queue, involving a linear scan for every incoming packet, > throwing away cpu caches. > > This patch converts it to a RB tree, to get bounded latencies. > > Yaogong wrote a preliminary patch about 2 years ago. > Eric did the rebase, added ofo_last_skb cache, polishing and tests. > > Tested with network dropping between 1 and 10 % packets, with good > success (about 30 % increase of throughput in stress tests) > > Next step would be to also use an RB tree for the write queue at sender > side ;) > > Signed-off-by: Yaogong Wang > Signed-off-by: Eric Dumazet > Cc: Yuchung Cheng > Cc: Neal Cardwell > Cc: Ilpo Järvinen How much does this grow the size of tcp socket structure?
Re: [iovisor-dev] [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support
On Wed, Sep 7, 2016 at 11:55 PM, Or Gerlitz via iovisor-dev wrote: > On Wed, Sep 7, 2016 at 3:42 PM, Saeed Mahameed wrote: >> From: Rana Shahout >> >> Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx5e driver. >> >> When XDP is on we make sure to change channels RQs type to >> MLX5_WQ_TYPE_LINKED_LIST rather than "striding RQ" type to >> ensure "page per packet". >> >> On XDP set, we fail if HW LRO is set and request from user to turn it >> off. Since on ConnectX4-LX HW LRO is always on by default, this will be >> annoying, but we prefer not to enforce LRO off from XDP set function. >> >> Full channels reset (close/open) is required only when setting XDP >> on/off. >> >> When XDP set is called just to exchange programs, we will update >> each RQ xdp program on the fly and for synchronization with current >> data path RX activity of that RQ, we temporarily disable that RQ and >> ensure RX path is not running, quickly update and re-enable that RQ, >> for that we do: >> - rq.state = disabled >> - napi_synchronize >> - xchg(rq->xdp_prg) >> - rq.state = enabled >> - napi_schedule // Just in case we've missed an IRQ >> >> Packet rate performance testing was done with pktgen 64B packets and on >> TX side and, TC drop action on RX side compared to XDP fast drop. >> >> CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz >> >> Comparison is done between: >> 1. Baseline, Before this patch with TC drop action >> 2. This patch with TC drop action >> 3. This patch with XDP RX fast drop >> >> Streams    Baseline(TC drop)    TC drop    XDP fast Drop >> ----------------------------------------------------- >> 1          5.51Mpps             5.14Mpps   13.5Mpps > > This (13.5 M PPS) is less than 50% of the result we presented @ the > XDP summit which was obtained by Rana. Please see if/how much does > this grows if you use more sender threads, but all of them to xmit the > same stream/flows, so we're on one ring. That (XDP with single RX ring > getting packets from N remote TX rings) would be your canonical > base-line for any further numbers. 
> I used N TX senders sending 48Mpps to a single RX core. The single RX core could handle only 13.5Mpps. The implementation here is different from the one we presented at the summit; before, it was with striding RQ, now it is regular linked list RQ (a striding RQ ring can handle 32K 64B packets while regular RQ rings handle only 1K). In striding RQ we register only 16 HW descriptors for every 32K packets, i.e. for every 32K packets we access the HW only 16 times. On the other hand, regular RQ will access the HW (register descriptors) once per packet, i.e. we write to HW 1K times for 1K packets. I think this explains the difference. The catch here is that we can't use striding RQ for XDP, bummer! As I said, we will have the full and final performance results on V1. This is just an RFC with barely quick and dirty testing. > ___ > iovisor-dev mailing list > iovisor-...@lists.iovisor.org > https://lists.iovisor.org/mailman/listinfo/iovisor-dev
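The descriptor-write arithmetic behind Saeed's explanation can be made concrete. A small sketch, using only the numbers quoted in the mail (16 descriptor posts per 32K-packet stride versus one post per packet); the function name is invented:

```c
/*
 * Rough model of HW descriptor-register writes per packet batch:
 * with striding RQ one posted descriptor covers a large stride of
 * packets, so expensive HW writes per packet drop by three orders
 * of magnitude versus a plain linked-list RQ. Illustrative only.
 */
#include <assert.h>

static unsigned int hw_writes(unsigned int packets,
			      unsigned int packets_per_descriptor)
{
	/* one descriptor post (HW write) per packets_per_descriptor pkts */
	return packets / packets_per_descriptor;
}
```

So for the striding case each descriptor covers 32768/16 = 2048 packets, versus one HW write per packet on the regular ring, which is consistent with the large PPS gap described above.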
[PATCH net-next] tcp: use an RB tree for ooo receive queue
From: Yaogong Wang Over the years, TCP BDP has increased by several orders of magnitude, and some people are considering reaching the 2 Gbytes limit. Even with current window scale limit of 14, ~1 Gbytes maps to ~740,000 MSS. In presence of packet losses (or reorders), TCP stores incoming packets into an out of order queue, and number of skbs sitting there waiting for the missing packets to be received can be in the 10^5 range. Most packets are appended to the tail of this queue, and when packets can finally be transferred to receive queue, we scan the queue from its head. However, in presence of heavy losses, we might have to find an arbitrary point in this queue, involving a linear scan for every incoming packet, throwing away cpu caches. This patch converts it to a RB tree, to get bounded latencies. Yaogong wrote a preliminary patch about 2 years ago. Eric did the rebase, added ofo_last_skb cache, polishing and tests. Tested with network dropping between 1 and 10 % packets, with good success (about 30 % increase of throughput in stress tests) Next step would be to also use an RB tree for the write queue at sender side ;) Signed-off-by: Yaogong Wang Signed-off-by: Eric Dumazet Cc: Yuchung Cheng Cc: Neal Cardwell Cc: Ilpo Järvinen --- include/linux/skbuff.h | 2 include/linux/tcp.h | 7 include/net/tcp.h | 2 net/core/skbuff.c | 19 ++ net/ipv4/tcp.c | 4 net/ipv4/tcp_input.c | 330 + net/ipv4/tcp_ipv4.c | 2 net/ipv4/tcp_minisocks.c | 1 8 files changed, 218 insertions(+), 149 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index cfb7219be665..4c5662f05bda 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2402,6 +2402,8 @@ static inline void __skb_queue_purge(struct sk_buff_head *list) kfree_skb(skb); } +void skb_rbtree_purge(struct rb_root *root); + void *netdev_alloc_frag(unsigned int fragsz); struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int length, diff --git a/include/linux/tcp.h b/include/linux/tcp.h index 
7be9b1242354..c723a465125d 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -281,10 +281,9 @@ struct tcp_sock { struct sk_buff* lost_skb_hint; struct sk_buff *retransmit_skb_hint; - /* OOO segments go in this list. Note that socket lock must be held, -* as we do not use sk_buff_head lock. -*/ - struct sk_buff_head out_of_order_queue; + /* OOO segments go in this rbtree. Socket lock must be held. */ + struct rb_root out_of_order_queue; + struct sk_buff *ooo_last_skb; /* cache rb_last(out_of_order_queue) */ /* SACKs data, these 2 need to be together (see tcp_options_write) */ struct tcp_sack_block duplicate_sack[1]; /* D-SACK block */ diff --git a/include/net/tcp.h b/include/net/tcp.h index d6ae36512429..fdfbedd61c67 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -640,7 +640,7 @@ static inline void tcp_fast_path_check(struct sock *sk) { struct tcp_sock *tp = tcp_sk(sk); - if (skb_queue_empty(&tp->out_of_order_queue) && + if (RB_EMPTY_ROOT(&tp->out_of_order_queue) && tp->rcv_wnd && atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf && !tp->urg_data) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 3864b4b68fa1..1e329d411242 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -2445,6 +2445,25 @@ void skb_queue_purge(struct sk_buff_head *list) EXPORT_SYMBOL(skb_queue_purge); /** + * skb_rbtree_purge - empty a skb rbtree + * @root: root of the rbtree to empty + * + * Delete all buffers on an &sk_buff rbtree. Each buffer is removed from + * the list and one reference dropped. This function does not take + * any lock. Synchronization should be handled by the caller (e.g., TCP + * out-of-order queue is protected by the socket lock). 
+ */ +void skb_rbtree_purge(struct rb_root *root) +{ + struct sk_buff *skb, *next; + + rbtree_postorder_for_each_entry_safe(skb, next, root, rbnode) + kfree_skb(skb); + + *root = RB_ROOT; +} + +/** * skb_queue_head - queue a buffer at the list head * @list: list to use * @newsk: buffer to queue diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 77311a92275c..a13fcb369f52 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -380,7 +380,7 @@ void tcp_init_sock(struct sock *sk) struct inet_connection_sock *icsk = inet_csk(sk); struct tcp_sock *tp = tcp_sk(sk); - __skb_queue_head_init(&tp->out_of_order_queue); + tp->out_of_order_queue = RB_ROOT; tcp_init_xmit_timers(sk); tcp_prequeue_init(tp); INIT_LIST_HEAD(&tp->tsq_node); @@ -2243,7 +2243,7 @@ int
Re: [PATCH v5 2/6] clk: gxbb: expose MPLL2 clock for use by DT
On 09/06, Martin Blumenstingl wrote: > This exposes the MPLL2 clock as this is one of the input clocks of the > ethernet controller's internal mux. > > Signed-off-by: Martin Blumenstingl> --- Acked-by: Stephen Boyd -- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
Re: ptp
On Wed, Sep 07, 2016 at 01:55:19PM -0700, Daniel Walker wrote: > So the code only allows second granularity negative updates, No. > or the seconds > component is the only part which needs to actually be negative ? Yes. See the function, clockadj_step, in linuxptp/clockadj.c for a working user space example. Thanks, Richard
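The normalization Richard points to can be sketched in a few lines. This models the linuxptp clockadj_step() logic (split a signed nanosecond offset so the fractional field stays non-negative, as the quoted comment requires); structure and names here are invented for illustration:

```c
/*
 * Split a signed nanosecond step into (sec, nsec) such that the value
 * equals sec * NS_PER_SEC + nsec and nsec is always non-negative, as
 * required for the timeval-style fields passed to the kernel.
 */
#include <assert.h>
#include <stdint.h>

#define NS_PER_SEC 1000000000LL

struct offset_parts {
	long long sec;
	long long nsec;   /* always in [0, NS_PER_SEC) */
};

static struct offset_parts split_offset(int64_t step_ns)
{
	struct offset_parts p;

	p.sec  = step_ns / NS_PER_SEC;
	p.nsec = step_ns % NS_PER_SEC;

	/* the value is the sum of the fields, but the fractional
	 * field must be non-negative: fold it into the seconds */
	if (p.nsec < 0) {
		p.sec  -= 1;
		p.nsec += NS_PER_SEC;
	}
	return p;
}
```

So a step of -0.5 s becomes sec = -1, nsec = +500000000: the seconds field carries the sign, and only sub-second negatives force the fold.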
[PATCH net-next] macsec: set network devtype
The netdevice type structure for macsec was being defined but never used. To set the network device type the macro SET_NETDEV_DEVTYPE must be called. Compile tested only, I don't use macsec. Signed-off-by: Stephen Hemminger --- a/drivers/net/macsec.c 2016-08-25 17:20:20.671031972 -0700 +++ b/drivers/net/macsec.c 2016-09-07 14:03:01.676238097 -0700 @@ -2973,6 +2973,7 @@ static void macsec_setup(struct net_device *dev) dev->priv_flags |= IFF_NO_QUEUE; dev->netdev_ops = &macsec_netdev_ops; dev->destructor = macsec_free_netdev; + SET_NETDEV_DEVTYPE(dev, &macsec_type); eth_zero_addr(dev->broadcast); }
Re: ptp
On 09/07/2016 01:48 PM, Richard Cochran wrote: On Wed, Sep 07, 2016 at 01:40:59PM -0700, Daniel Walker wrote: There is a test (below) , which prevents negative nanosecond updates. The code below would force a negative update to always return more than NSEC_PER_SEC. It should be using abs() instead which would return the value desired. No. This: /* * The value of a timeval is the sum of its fields, but the * field tv_usec must always be non-negative. */ HTH, Richard So the code only allows second granularity negative updates, or the seconds component is the only part which needs to actually be negative ? Daniel
[PATCH net-next] rtnetlink: remove unused ifla_stats_policy
This structure is defined but never used. Flagged with W=1 Signed-off-by: Stephen Hemminger --- a/net/core/rtnetlink.c 2016-09-07 13:55:17.030107744 -0700 +++ b/net/core/rtnetlink.c 2016-09-07 13:55:45.190236855 -0700 @@ -3669,10 +3669,6 @@ nla_put_failure: return -EMSGSIZE; } -static const struct nla_policy ifla_stats_policy[IFLA_STATS_MAX + 1] = { - [IFLA_STATS_LINK_64]= { .len = sizeof(struct rtnl_link_stats64) }, -}; - static size_t if_nlmsg_stats_size(const struct net_device *dev, u32 filter_mask) {
Re: [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support
On Wed, Sep 7, 2016 at 3:42 PM, Saeed Mahameed wrote: > From: Rana Shahout > > Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx5e driver. > > When XDP is on we make sure to change channels RQs type to > MLX5_WQ_TYPE_LINKED_LIST rather than "striding RQ" type to > ensure "page per packet". > > On XDP set, we fail if HW LRO is set and request from user to turn it > off. Since on ConnectX4-LX HW LRO is always on by default, this will be > annoying, but we prefer not to enforce LRO off from XDP set function. > > Full channels reset (close/open) is required only when setting XDP > on/off. > > When XDP set is called just to exchange programs, we will update > each RQ xdp program on the fly and for synchronization with current > data path RX activity of that RQ, we temporarily disable that RQ and > ensure RX path is not running, quickly update and re-enable that RQ, > for that we do: > - rq.state = disabled > - napi_synchronize > - xchg(rq->xdp_prg) > - rq.state = enabled > - napi_schedule // Just in case we've missed an IRQ > > Packet rate performance testing was done with pktgen 64B packets and on > TX side and, TC drop action on RX side compared to XDP fast drop. > > CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz > > Comparison is done between: > 1. Baseline, Before this patch with TC drop action > 2. This patch with TC drop action > 3. This patch with XDP RX fast drop > > Streams    Baseline(TC drop)    TC drop    XDP fast Drop > ----------------------------------------------------- > 1          5.51Mpps             5.14Mpps   13.5Mpps This (13.5 M PPS) is less than 50% of the result we presented @ the XDP summit which was obtained by Rana. Please see if/how much does this grows if you use more sender threads, but all of them to xmit the same stream/flows, so we're on one ring. That (XDP with single RX ring getting packets from N remote TX rings) would be your canonical base-line for any further numbers.
Re: [PATCH net-next V2 0/6] Mellanox 100G mlx5 DCBX CEE and firmware support
On Wed, Sep 7, 2016 at 11:27 PM, David Miller wrote: > From: Saeed Mahameed > Date: Wed, 7 Sep 2016 23:16:13 +0300 > >> Hi Dave, Sorry to bother, but i would like to drop this series for now >> Huy is working to define and come up with a better mechanism to >> enable/disable the new DCBX hybrid mode he is adding. > > I can't "just drop" any change that's in my tree already. Dave, I don't see it @ your kernel.org clone but it's not in patchworks either, so I guess you picked that into your other clone. If indeed you took it (and I guess this is the case), please note that was done in spite of a pending reviewer comment (by me) on the actual necessity of patch #3. If it's not possible to revert them on the non kernel.org clone before you push it to kernel.org, I guess we need to send you a revert [1] http://marc.info/?l=linux-netdev&m=147317917622462&w=2 > It is part of the permanent GIT record as I never rebase my trees for > any reason whatsoever. > So the best you can do is send me a revert patch.
Re: ptp
On Wed, Sep 07, 2016 at 01:40:59PM -0700, Daniel Walker wrote: > There is a test (below) , which prevents negative nanosecond updates. The > code below would force a negative update to always return more than > NSEC_PER_SEC. It should be using abs() instead which would return the value > desired. No. This: /* * The value of a timeval is the sum of its fields, but the * field tv_usec must always be non-negative. */ HTH, Richard
Re: Minimum MTU Mess
On Wed, 7 Sep 2016 15:53:56 -0400 Jarod Wilson wrote: > --- a/net/core/dev.c > +++ b/net/core/dev.c > @@ -6466,9 +6466,17 @@ int dev_set_mtu(struct net_device *dev, int new_mtu) > if (new_mtu == dev->mtu) > return 0; > > - /* MTU must be positive.*/ > - if (new_mtu < 0) > + if (new_mtu < dev->min_mtu) { > + netdev_err(dev, "Invalid MTU %d requested, hw min %d\n", > +new_mtu, dev->min_mtu); > return -EINVAL; > + } > + > + if (new_mtu > dev->max_mtu) { > + netdev_err(dev, "Invalid MTU %d requested, hw max %d\n", > +new_mtu, dev->max_mtu); > + return -EINVAL; > + } > Maybe don't log something that can be triggered from a user program. Or at least rate limit it.
Re: [PATCH net-next V2 0/6] Mellanox 100G mlx5 DCBX CEE and firmware support
From: Saeed Mahameed Date: Wed, 7 Sep 2016 23:16:13 +0300 > Hi Dave, Sorry to bother, but i would like to drop this series for now > Huy is working to define and come up with a better mechanism to > enable/disable the new DCBX hybrid mode he is adding. I can't "just drop" any change that's in my tree already. It is part of the permanent GIT record as I never rebase my trees for any reason whatsoever. So the best you can do is send me a revert patch.
Re: Minimum MTU Mess
Hi Jarod > - /* MTU must be positive.*/ > - if (new_mtu < 0) > + if (new_mtu < dev->min_mtu) { > + netdev_err(dev, "Invalid MTU %d requested, hw min %d\n", > +new_mtu, dev->min_mtu); > return -EINVAL; > + } > + > + if (new_mtu > dev->max_mtu) { > + netdev_err(dev, "Invalid MTU %d requested, hw max %d\n", > +new_mtu, dev->max_mtu); > + return -EINVAL; > + } I doubt you can make such a big change like this in one go. Can you really guarantee all interfaces, of what ever type, will have some value for dev->min_mtu and dev->max_mtu? What may fly is something more like: > + if (dev->max_mtu && new_mtu > dev->max_mtu) { > + netdev_err(dev, "Invalid MTU %d requested, hw max %d\n", > +new_mtu, dev->max_mtu); > + return -EINVAL; > + } Maybe in a few cycles you can add a WARN_ON(!dev->max_mtu), and a few cycles after that go with (new_mtu > dev->max_mtu). Andrew
Re: [PATCH net-next V2 0/6] Mellanox 100G mlx5 DCBX CEE and firmware support
On Tue, Sep 6, 2016 at 7:04 PM, Saeed Mahameed wrote: > Hi Dave, > Hi Dave, Sorry to bother, but I would like to drop this series for now. Huy is working to define and come up with a better mechanism to enable/disable the new DCBX hybrid mode he is adding. Thanks for understanding, Saeed. > This series from Huy provides mlx5 DCBX updates to support DCBX CEE > API and DCBX firmware/host modes support. > > 1st patch ensures the dcbnl_rtnl_ops is published only when the qos > capability bits is on. > > 2nd patch adds the support for CEE interfaces into mlx5 dcbnl_rtnl_ops. > > 3rd patch refactors ETS query to read ETS configuration directly from > firmware rather > than having a software shadow to it. The existing IEEE interfaces stays the > same. > > 4th patch adds the support for MLX5_REG_DCBX_PARAM and MLX5_REG_DCBX_APP > firmware > commands to manipulate mlx5 DCBX mode. > > 5th patch adds the driver support for the new DCBX firmware. > This ensures the backward compatibility versus the old and new firmware. > With the new DCBX firmware, qos settings can be controlled by either firmware > or software depending on the DCBX mode. > > 6th patch adds support for module events log. > > Changes since V1: > 1. Add qos capability check > 2. In port module events eqe structure, change rsvd_n to reserved_at_n to be > consistent with mlx5_ifc.h > 3. Improve commit messages > 4. Drop DCBX private flags patch > 5. Add patch to check for qos capability bit check before exposing dcb > interfaces > 6. Replace memset with static array initialization > > Thanks, > Saeed. 
> > Huy Nguyen (6): > net/mlx5e: Add qos capability check > net/mlx5e: Support DCBX CEE API > net/mlx5e: Read ETS settings directly from firmware > net/mlx5: Add DCBX firmware commands support > net/mlx5e: ConnectX-4 firmware support for DCBX > net/mlx5: Add handling for port module event > > drivers/net/ethernet/mellanox/mlx5/core/en.h | 36 +- > drivers/net/ethernet/mellanox/mlx5/core/en_dcbnl.c | 483 > - > drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 27 +- > drivers/net/ethernet/mellanox/mlx5/core/eq.c | 12 + > .../net/ethernet/mellanox/mlx5/core/mlx5_core.h| 1 + > drivers/net/ethernet/mellanox/mlx5/core/port.c | 148 +++ > include/linux/mlx5/device.h| 11 + > include/linux/mlx5/driver.h| 7 + > include/linux/mlx5/mlx5_ifc.h | 3 +- > include/linux/mlx5/port.h | 6 + > 10 files changed, 698 insertions(+), 36 deletions(-) > > -- > 2.7.4 >
Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On Wed, Sep 7, 2016 at 9:19 PM, Eric Dumazet wrote: > On Wed, 2016-09-07 at 19:57 +0300, Saeed Mahameed wrote: > >> Jesper has a similar Idea to make the qdisc think it is under >> pressure, when the device >> TX ring is idle most of the time, i think his idea can come in handy here. >> I am not fully involved in the details, maybe he can elaborate more. >> >> But if it works, it will be transparent to napi, and xmit more will >> happen by design. > > I do not think qdisc is relevant here. > > Right now, skb->xmit_more is set only by qdisc layer (and pktgen tool), > because only this layer can know if more packets are to come. > > > What I am saying is that regardless of skb->xmit_more being set or not, > (for example if no qdisc is even used) > a NAPI driver can arm a bit asking the doorbell being sent at the end of > NAPI. > > I am not saying this must be done, only that the idea could be extended > to non XDP world, if we care enough. > Yes, and I am just trying to suggest ideas that do not require communication between RX (NAPI) and TX. The problem here is the synchronization (TX doorbell from RX), which is not as simple as an atomic operation for some drivers. How about RX bulking? It also can help here, since for the forwarding case, the forwarding path will be able to process a bulk of RX SKBs and can bulk xmit the portion of SKBs that will be forwarded. As Jesper suggested, let's talk at Netdev 1.2 in Jesper's session (if you are joining, of course). Thanks Saeed.
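Eric's "arm a bit, ring the doorbell at end of NAPI" idea can be modelled with a toy ring structure. Everything below is invented for illustration under that assumption; it is not mlx5e code:

```c
/*
 * Toy model of a deferred TX doorbell: xmit only posts descriptors and
 * sets a pending flag; the single expensive MMIO doorbell write happens
 * once, at the end of the NAPI poll, covering the whole batch.
 */
#include <assert.h>
#include <stdbool.h>

struct toy_ring {
	unsigned int posted;        /* descriptors written to the ring */
	unsigned int doorbells;     /* MMIO doorbell writes (expensive) */
	bool doorbell_pending;
};

static void toy_xmit(struct toy_ring *ring)
{
	ring->posted++;
	ring->doorbell_pending = true;   /* defer the MMIO write */
}

static void toy_napi_poll_end(struct toy_ring *ring)
{
	if (ring->doorbell_pending) {
		ring->doorbells++;       /* one doorbell per batch */
		ring->doorbell_pending = false;
	}
}
```

The effect is the same batching xmit_more gives, but driven by the NAPI poll boundary instead of the qdisc, which is exactly why it would work even with no qdisc present.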
Re: [PATCH net] tcp: fastopen: avoid negative sk_forward_alloc
On 09/07/2016 10:34 AM, Eric Dumazet wrote: From: Eric Dumazet When DATA and/or FIN are carried in a SYN/ACK message or SYN message, we append an skb in socket receive queue, but we forget to call sk_forced_mem_schedule(). Effect is that the socket has a negative sk->sk_forward_alloc as long as the message is not read by the application. Josh Hunt fixed a similar issue in commit d22e15371811 ("tcp: fix tcp fin memory accounting") Fixes: 168a8f58059a ("tcp: TCP Fast Open Server - main code path") Signed-off-by: Eric Dumazet --- net/ipv4/tcp_fastopen.c |1 + 1 file changed, 1 insertion(+) diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c index 62a5751d4fe1..4e777a3243f9 100644 --- a/net/ipv4/tcp_fastopen.c +++ b/net/ipv4/tcp_fastopen.c @@ -150,6 +150,7 @@ void tcp_fastopen_add_skb(struct sock *sk, struct sk_buff *skb) tp->segs_in = 0; tcp_segs_in(tp, skb); __skb_pull(skb, tcp_hdrlen(skb)); + sk_forced_mem_schedule(sk, skb->truesize); skb_set_owner_r(skb, sk); TCP_SKB_CB(skb)->seq++; The change makes sense to me. Thanks Eric! Reviewed-by: Josh Hunt
Re: Minimum MTU Mess
On Tue, Sep 06, 2016 at 04:55:29PM -0700, David Miller wrote: > From: Jarod Wilson > Date: Fri, 2 Sep 2016 13:07:42 -0400 > > > In any case, the number of "mtu < 68" and "#define FOO_MIN_MTU 68", or > > variations thereof, under drivers/net/ is kind of crazy. > > Agreed, we can have a default and let the different cases provide > overrides. > > Mostly what to do here is a function of the hardware though. So I've been tinkering with this some, and it looks like having both centralized min and max checking could be useful here. I'm hacking away at drivers now, but the basis of all this would potentially look about like the patch below, and each device would have to set dev->m{in,ax}_mtu one way or another. Drivers using alloc_etherdev and/or ether_setup would get the "default" values, and then they can be overridden. Probably need something to make sure dev->max_mtu isn't set to 0 though... Possibly on the right track here, or might there be a better way to approach this? diff --git a/include/uapi/linux/if_ether.h b/include/uapi/linux/if_ether.h index 117d02e..864d6f2 100644 --- a/include/uapi/linux/if_ether.h +++ b/include/uapi/linux/if_ether.h @@ -35,6 +35,8 @@ #define ETH_FRAME_LEN 1514 /* Max. octets in frame sans FCS */ #define ETH_FCS_LEN 4 /* Octets in the FCS */ +#define ETH_MIN_MTU 68 /* Min IPv4 MTU per RFC791 */ + /* * These are the defined Ethernet Protocol ID's.
*/ diff --git a/net/core/dev.c b/net/core/dev.c index dd6ce59..d20fd5d 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -6466,9 +6466,17 @@ int dev_set_mtu(struct net_device *dev, int new_mtu) if (new_mtu == dev->mtu) return 0; - /* MTU must be positive. */ - if (new_mtu < 0) + if (new_mtu < dev->min_mtu) { + netdev_err(dev, "Invalid MTU %d requested, hw min %d\n", + new_mtu, dev->min_mtu); return -EINVAL; + } + + if (new_mtu > dev->max_mtu) { + netdev_err(dev, "Invalid MTU %d requested, hw max %d\n", + new_mtu, dev->max_mtu); + return -EINVAL; + } if (!netif_device_present(dev)) return -ENODEV; diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c index 66dff5e..2be0203 100644 --- a/net/ethernet/eth.c +++ b/net/ethernet/eth.c @@ -322,8 +322,6 @@ EXPORT_SYMBOL(eth_mac_addr); */ int eth_change_mtu(struct net_device *dev, int new_mtu) { - if (new_mtu < 68 || new_mtu > ETH_DATA_LEN) - return -EINVAL; dev->mtu = new_mtu; return 0; } @@ -357,6 +355,8 @@ void ether_setup(struct net_device *dev) dev->type = ARPHRD_ETHER; dev->hard_header_len = ETH_HLEN; dev->mtu = ETH_DATA_LEN; + dev->min_mtu = ETH_MIN_MTU; + dev->max_mtu = ETH_DATA_LEN; dev->addr_len = ETH_ALEN; dev->tx_queue_len = 1000; /* Ethernet wants good queues */ dev->flags = IFF_BROADCAST|IFF_MULTICAST; -- Jarod Wilson ja...@redhat.com
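For reference, the range check being proposed above can be modeled in plain C. This is only an illustrative userspace sketch: `net_device_model` and `dev_set_mtu_model` are made-up names standing in for `struct net_device` and `dev_set_mtu()`, and the `netdev_err()` logging is omitted.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the relevant fields of struct net_device. */
struct net_device_model {
    int mtu;
    int min_mtu;   /* e.g. ETH_MIN_MTU (68), set by ether_setup() */
    int max_mtu;   /* e.g. ETH_DATA_LEN (1500), drivers may override */
};

/* Centralized MTU validation, mirroring the dev_set_mtu() change above. */
static int dev_set_mtu_model(struct net_device_model *dev, int new_mtu)
{
    if (new_mtu == dev->mtu)
        return 0;                /* nothing to do */
    if (new_mtu < dev->min_mtu)
        return -EINVAL;          /* kernel would also netdev_err() here */
    if (new_mtu > dev->max_mtu)
        return -EINVAL;
    dev->mtu = new_mtu;
    return 0;
}
```

With ether_setup()-style defaults, a driver that supports jumbo frames would simply raise `max_mtu`, and the core check covers everyone without per-driver `if (mtu < 68)` copies.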
Re: [PATCH RFC 04/11] net/mlx5e: Build RX SKB on demand
On Wed, 7 Sep 2016 15:42:25 +0300 Saeed Mahameed wrote: > For non-striding RQ configuration before this patch we had a ring > with pre-allocated SKBs and mapped the SKB->data buffers for > device. > > For robustness and better RX data buffers management, we allocate a > page per packet and build_skb around it. > > This patch (which is a prerequisite for XDP) will actually reduce > performance for normal stack usage, because we are now hitting a bottleneck > in the page allocator. A later patch of page reuse mechanism will be > needed to restore or even improve performance in comparison to the old > RX scheme. Yes, it is true that there is a performance reduction (for normal stack, not XDP) caused by hitting a bottleneck in the page allocator. I actually have a PoC implementation of my page_pool that shows we regain the performance and then some, based on an earlier version of this patch, where I hook it into the mlx5 driver (50Gbit/s version). Your description might be a bit outdated, as this patch and the patch before do contain your own driver-local page-cache recycle facility. And you also show that you regain quite a lot of the lost performance. Your driver-local page_cache does have its limitations (see comments on the other patch), as it depends on timely refcnt decrease by the users of the page. If they hold onto pages (like TCP), then your page-cache will not be efficient. > Packet rate performance testing was done with pktgen 64B packets on > xmit side and TC drop action on RX side.
I assume this is TC _ingress_ dropping, like [1] [1] https://github.com/netoptimizer/network-testing/blob/master/bin/tc_ingress_drop.sh > CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz > > Comparison is done between: > 1. Baseline, before 'net/mlx5e: Build RX SKB on demand' > 2. Build SKB with RX page cache (This patch) >
> Streams  Baseline  Build SKB+page-cache  Improvement
> -----------------------------------------------------
> 1        4.33Mpps  5.51Mpps              27%
> 2        7.35Mpps  11.5Mpps              52%
> 4        14.0Mpps  16.3Mpps              16%
> 8        22.2Mpps  29.6Mpps              20%
> 16       24.8Mpps  34.0Mpps              17%
The improvements gained from using your page-cache are impressively high. Thanks for working on this, --Jesper > Signed-off-by: Saeed Mahameed > --- > drivers/net/ethernet/mellanox/mlx5/core/en.h | 10 +- > drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 31 +++- > drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 215 > +++--- > 3 files changed, 133 insertions(+), 123 deletions(-) > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h > b/drivers/net/ethernet/mellanox/mlx5/core/en.h > index afbdf70..a346112 100644 > --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h > @@ -65,6 +65,8 @@ > #define MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE_MPW 0x3 > #define MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE_MPW 0x6 > > +#define MLX5_RX_HEADROOM NET_SKB_PAD > + > #define MLX5_MPWRQ_LOG_STRIDE_SIZE 6 /* >= 6, HW restriction */ > #define MLX5_MPWRQ_LOG_STRIDE_SIZE_CQE_COMPRESS 8 /* >= 6, HW > restriction */ > #define MLX5_MPWRQ_LOG_WQE_SZ 18 > @@ -302,10 +304,14 @@ struct mlx5e_page_cache { > struct mlx5e_rq { > /* data path */ > struct mlx5_wq_ll wq; > - u32 wqe_sz; > - struct sk_buff **skb; > + > + struct mlx5e_dma_info *dma_info; > struct mlx5e_mpw_info *wqe_info; > void *mtt_no_align; > + struct { > + u8 page_order; > + u32 wqe_sz; /* wqe data buffer size */ > + } buff; > __be32 mkey_be; > > struct device *pdev; > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c > b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c > index
c84702c..c9f1dea 100644 > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c > @@ -411,6 +411,8 @@ static int mlx5e_create_rq(struct mlx5e_channel *c, > void *rqc = param->rqc; > void *rqc_wq = MLX5_ADDR_OF(rqc, rqc, wq); > u32 byte_count; > + u32 frag_sz; > + int npages; > int wq_sz; > int err; > int i; > @@ -445,29 +447,40 @@ static int mlx5e_create_rq(struct mlx5e_channel *c, > > rq->mpwqe_stride_sz = BIT(priv->params.mpwqe_log_stride_sz); > rq->mpwqe_num_strides = BIT(priv->params.mpwqe_log_num_strides); > - rq->wqe_sz = rq->mpwqe_stride_sz * rq->mpwqe_num_strides; > - byte_count = rq->wqe_sz; > + > + rq->buff.wqe_sz = rq->mpwqe_stride_sz * rq->mpwqe_num_strides; > + byte_count = rq->buff.wqe_sz; > rq->mkey_be =
Re: [PATCH RFC 01/11] net/mlx5e: Single flow order-0 pages for Striding RQ
On Wed, 7 Sep 2016 15:42:22 +0300 Saeed Mahameed wrote: > From: Tariq Toukan > > To improve the memory consumption scheme, we omit the flow that > demands and splits high-order pages in Striding RQ, and stay > with a single Striding RQ flow that uses order-0 pages. Thank you for doing this! MM-list people thank you! For others to understand what this means: This driver was doing split_page() on high-order pages (for Striding RQ). This was really bad because it fragments the page allocator and quickly depletes the available high-order pages. (I've left the rest of the patch intact below, in case some MM people are interested in looking at the changes). There is even a funny comment in split_page() relevant to this: /* [...] * Note: this is probably too low level an operation for use in drivers. * Please consult with lkml before using this in your driver. */ > Moving to fragmented memory allows the use of larger MPWQEs, > which reduces the number of UMR posts and filler CQEs. > > Moving to a single flow allows several optimizations that improve > performance, especially in production servers where we would > anyway fallback to order-0 allocations: > - inline functions that were called via function pointers. > - improve the UMR post process. > > This patch alone is expected to give a slight performance reduction. > However, the new memory scheme gives the possibility to use a page-cache > of a fair size, that doesn't inflate the memory footprint, which will > dramatically fix the reduction and even give a huge gain.
> > We ran pktgen single-stream benchmarks, with iptables-raw-drop: > > Single stride, 64 bytes: > * 4,739,057 - baseline > * 4,749,550 - this patch > no reduction > > Larger packets, no page cross, 1024 bytes: > * 3,982,361 - baseline > * 3,845,682 - this patch > 3.5% reduction > > Larger packets, every 3rd packet crosses a page, 1500 bytes: > * 3,731,189 - baseline > * 3,579,414 - this patch > 4% reduction > Well, the reduction does not really matter that much, because your baseline benchmarks are from a freshly booted system, where you have not fragmented and depleted the high-order pages yet... ;-) > Fixes: 461017cb006a ("net/mlx5e: Support RX multi-packet WQE (Striding RQ)") > Fixes: bc77b240b3c5 ("net/mlx5e: Add fragmented memory support for RX multi > packet WQE") > Signed-off-by: Tariq Toukan > Signed-off-by: Saeed Mahameed > --- > drivers/net/ethernet/mellanox/mlx5/core/en.h | 54 ++-- > drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 136 -- > drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 292 > - > drivers/net/ethernet/mellanox/mlx5/core/en_stats.h | 4 - > drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c | 2 +- > 5 files changed, 184 insertions(+), 304 deletions(-) > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h > b/drivers/net/ethernet/mellanox/mlx5/core/en.h > index bf722aa..075cdfc 100644 > --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h > @@ -62,12 +62,12 @@ > #define MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE 0xd > > #define MLX5E_PARAMS_MINIMUM_LOG_RQ_SIZE_MPW 0x1 > -#define MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE_MPW 0x4 > +#define MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE_MPW 0x3 > #define MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE_MPW 0x6 > > #define MLX5_MPWRQ_LOG_STRIDE_SIZE 6 /* >= 6, HW restriction */ > #define MLX5_MPWRQ_LOG_STRIDE_SIZE_CQE_COMPRESS 8 /* >= 6, HW > restriction */ > -#define MLX5_MPWRQ_LOG_WQE_SZ 17 > +#define MLX5_MPWRQ_LOG_WQE_SZ 18 > #define MLX5_MPWRQ_WQE_PAGE_ORDER (MLX5_MPWRQ_LOG_WQE_SZ
- PAGE_SHIFT > 0 ? > \ > MLX5_MPWRQ_LOG_WQE_SZ - PAGE_SHIFT : 0) > #define MLX5_MPWRQ_PAGES_PER_WQE BIT(MLX5_MPWRQ_WQE_PAGE_ORDER) > @@ -293,8 +293,8 @@ struct mlx5e_rq { > u32wqe_sz; > struct sk_buff **skb; > struct mlx5e_mpw_info *wqe_info; > + void *mtt_no_align; > __be32 mkey_be; > - __be32 umr_mkey_be; > > struct device *pdev; > struct net_device *netdev; > @@ -323,32 +323,15 @@ struct mlx5e_rq { > > struct mlx5e_umr_dma_info { > __be64*mtt; > - __be64*mtt_no_align; > dma_addr_t mtt_addr; > - struct mlx5e_dma_info *dma_info; > + struct mlx5e_dma_info dma_info[MLX5_MPWRQ_PAGES_PER_WQE]; > + struct mlx5e_umr_wqe wqe; > }; > > struct mlx5e_mpw_info { > - union { > - struct mlx5e_dma_info dma_info; > - struct mlx5e_umr_dma_info umr; > - }; > + struct mlx5e_umr_dma_info umr; > u16 consumed_strides; > u16
[PATCH v2] net: ethernet: renesas: sh_eth: add POST registers for rz
Due to a mistake in the hardware manual, the FWSLC and POST1-4 registers were not documented and left out of the driver for RZ/A, making the CAM feature non-operational. Additionally, when the offset values for POST1-4 are left blank, the driver attempts to set them using an offset of 0x which can cause a memory corruption or panic. This patch fixes the panic and properly enables CAM. Reported-by: Daniel Palmer Signed-off-by: Chris Brandt --- v2: * POST registers really do exist, so just add them --- drivers/net/ethernet/renesas/sh_eth.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/drivers/net/ethernet/renesas/sh_eth.c b/drivers/net/ethernet/renesas/sh_eth.c index 1f8240a..440ae27 100644 --- a/drivers/net/ethernet/renesas/sh_eth.c +++ b/drivers/net/ethernet/renesas/sh_eth.c @@ -201,9 +201,14 @@ static const u16 sh_eth_offset_fast_rz[SH_ETH_MAX_REGISTER_OFFSET] = { [ARSTR] = 0x, [TSU_CTRST] = 0x0004, + [TSU_FWSLC] = 0x0038, [TSU_VTAG0] = 0x0058, [TSU_ADSBSY] = 0x0060, [TSU_TEN] = 0x0064, + [TSU_POST1] = 0x0070, + [TSU_POST2] = 0x0074, + [TSU_POST3] = 0x0078, + [TSU_POST4] = 0x007c, [TSU_ADRH0] = 0x0100, [TXNLCR0] = 0x0080, @@ -2781,6 +2786,8 @@ static void sh_eth_tsu_init(struct sh_eth_private *mdp) { if (sh_eth_is_rz_fast_ether(mdp)) { sh_eth_tsu_write(mdp, 0, TSU_TEN); /* Disable all CAM entry */ + sh_eth_tsu_write(mdp, TSU_FWSLC_POSTENU | TSU_FWSLC_POSTENL, + TSU_FWSLC); /* Enable POST registers */ return; } -- 2.9.2
[PATCH net 3/3] nfp: don't pad frames on receive
There is no need to pad frames to ETH_ZLEN on RX. Signed-off-by: Jakub Kicinski Reviewed-by: Simon Horman Reviewed-by: Dinan Gunawardena --- drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 4 ---- 1 file changed, 4 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index e4fe0f080f34..252e4924de0f 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -1440,10 +1440,6 @@ static int nfp_net_rx(struct nfp_net_rx_ring *rx_ring, int budget) nfp_net_set_hash(nn->netdev, skb, rxd); - /* Pad small frames to minimum */ - if (skb_put_padto(skb, 60)) - break; - /* Stats update */ u64_stats_update_begin(&r_vec->rx_sync); r_vec->rx_pkts++; -- 1.9.1
[PATCH net 1/3] nfp: remove linux/version.h includes
Remove unnecessary version.h includes. Signed-off-by: Jakub Kicinski Reviewed-by: Simon Horman Reviewed-by: Dinan Gunawardena --- drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 1 - drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c | 1 - drivers/net/ethernet/netronome/nfp/nfp_netvf_main.c | 1 - 3 files changed, 3 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index 88678c172b19..e4fe0f080f34 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -41,7 +41,6 @@ * Chris Telfer */ -#include <linux/version.h> #include #include #include diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c b/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c index 7d7933d00b8f..4c9897220969 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c @@ -40,7 +40,6 @@ * Brad Petrus */ -#include <linux/version.h> #include #include #include diff --git a/drivers/net/ethernet/netronome/nfp/nfp_netvf_main.c b/drivers/net/ethernet/netronome/nfp/nfp_netvf_main.c index 37abef016a0a..6f22b0e12ac7 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_netvf_main.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_netvf_main.c @@ -38,7 +38,6 @@ * Rolf Neugebauer */ -#include <linux/version.h> #include #include #include -- 1.9.1
[PATCH net 2/3] nfp: drop support for old firmware ABIs
Be more strict about FW versions. Drop support for old transitional revisions which were never used in production. Dropping support for FW ABI version 0.0.0.0 is particularly useful because 0 could just be uninitialized memory. Signed-off-by: Jakub Kicinski Reviewed-by: Dinan Gunawardena --- drivers/net/ethernet/netronome/nfp/nfp_netvf_main.c | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_netvf_main.c b/drivers/net/ethernet/netronome/nfp/nfp_netvf_main.c index 6f22b0e12ac7..f7062cb648e1 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_netvf_main.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_netvf_main.c @@ -133,7 +133,7 @@ static int nfp_netvf_pci_probe(struct pci_dev *pdev, } nfp_net_get_fw_version(&fw_ver, ctrl_bar); - if (fw_ver.class != NFP_NET_CFG_VERSION_CLASS_GENERIC) { + if (fw_ver.resv || fw_ver.class != NFP_NET_CFG_VERSION_CLASS_GENERIC) { dev_err(&pdev->dev, "Unknown Firmware ABI %d.%d.%d.%d\n", fw_ver.resv, fw_ver.class, fw_ver.major, fw_ver.minor); err = -EINVAL; @@ -141,9 +141,7 @@ static int nfp_netvf_pci_probe(struct pci_dev *pdev, } /* Determine stride */ - if (nfp_net_fw_ver_eq(&fw_ver, 0, 0, 0, 0) || - nfp_net_fw_ver_eq(&fw_ver, 0, 0, 0, 1) || - nfp_net_fw_ver_eq(&fw_ver, 0, 0, 0x12, 0x48)) { + if (nfp_net_fw_ver_eq(&fw_ver, 0, 0, 0, 1)) { stride = 2; tx_bar_no = NFP_NET_Q0_BAR; rx_bar_no = NFP_NET_Q1_BAR; -- 1.9.1
[PATCH net 0/3] nfp: fixes and trivial cleanup
Hi! First patch drops unnecessary version.h includes. Second one drops support for pre-release versions of FW ABI. Removing FW ABI 0.0 from supported set is particularly good since 0 could just be uninitialized memory. Last but not least I drop unnecessary padding of frames on RX which makes us count bytes incorrectly for the VF2VF traffic. Jakub Kicinski (3): nfp: remove linux/version.h includes nfp: drop support for old firmware ABIs nfp: don't pad frames on receive drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 5 - drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c | 1 - drivers/net/ethernet/netronome/nfp/nfp_netvf_main.c | 7 ++- 3 files changed, 2 insertions(+), 11 deletions(-) -- 1.9.1
Re: [PATCH RFC 03/11] net/mlx5e: Implement RX mapped page cache for page recycle
On Wed, 7 Sep 2016 15:42:24 +0300 Saeed Mahameed wrote: > From: Tariq Toukan > > Instead of reallocating and mapping pages for RX data-path, > recycle already used pages in a per ring cache. > > We ran pktgen single-stream benchmarks, with iptables-raw-drop: > > Single stride, 64 bytes: > * 4,739,057 - baseline > * 4,749,550 - order0 no cache > * 4,786,899 - order0 with cache > 1% gain > > Larger packets, no page cross, 1024 bytes: > * 3,982,361 - baseline > * 3,845,682 - order0 no cache > * 4,127,852 - order0 with cache > 3.7% gain > > Larger packets, every 3rd packet crosses a page, 1500 bytes: > * 3,731,189 - baseline > * 3,579,414 - order0 no cache > * 3,931,708 - order0 with cache > 5.4% gain > > Signed-off-by: Tariq Toukan > Signed-off-by: Saeed Mahameed > --- > drivers/net/ethernet/mellanox/mlx5/core/en.h | 16 ++ > drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 15 ++ > drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 57 > -- > drivers/net/ethernet/mellanox/mlx5/core/en_stats.h | 16 ++ > 4 files changed, 99 insertions(+), 5 deletions(-) > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h > b/drivers/net/ethernet/mellanox/mlx5/core/en.h > index 075cdfc..afbdf70 100644 > --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h > @@ -287,6 +287,18 @@ struct mlx5e_rx_am { /* Adaptive Moderation */ > u8 tired; > }; > > +/* a single cache unit is capable to serve one napi call (for non-striding > rq) > + * or a MPWQE (for striding rq). > + */ > +#define MLX5E_CACHE_UNIT (MLX5_MPWRQ_PAGES_PER_WQE > NAPI_POLL_WEIGHT ? \ > + MLX5_MPWRQ_PAGES_PER_WQE : NAPI_POLL_WEIGHT) > +#define MLX5E_CACHE_SIZE (2 * roundup_pow_of_two(MLX5E_CACHE_UNIT)) > +struct mlx5e_page_cache { > + u32 head; > + u32 tail; > + struct mlx5e_dma_info page_cache[MLX5E_CACHE_SIZE]; > +}; > + [...]
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c > b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c > index c1cb510..8e02af3 100644 > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c > @@ -305,11 +305,55 @@ static inline void mlx5e_post_umr_wqe(struct mlx5e_rq > *rq, u16 ix) > mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0); > } > > +static inline bool mlx5e_rx_cache_put(struct mlx5e_rq *rq, > + struct mlx5e_dma_info *dma_info) > +{ > + struct mlx5e_page_cache *cache = &rq->page_cache; > + u32 tail_next = (cache->tail + 1) & (MLX5E_CACHE_SIZE - 1); > + > + if (tail_next == cache->head) { > + rq->stats.cache_full++; > + return false; > + } > + > + cache->page_cache[cache->tail] = *dma_info; > + cache->tail = tail_next; > + return true; > +} > + > +static inline bool mlx5e_rx_cache_get(struct mlx5e_rq *rq, > + struct mlx5e_dma_info *dma_info) > +{ > + struct mlx5e_page_cache *cache = &rq->page_cache; > + > + if (unlikely(cache->head == cache->tail)) { > + rq->stats.cache_empty++; > + return false; > + } > + > + if (page_ref_count(cache->page_cache[cache->head].page) != 1) { > + rq->stats.cache_busy++; > + return false; > + } Hmmm... doesn't this cause "blocking" of the page_cache recycle facility until the page at the head of the queue gets its (page) refcnt decremented? A real use-case could fairly easily cause this...
> + > + *dma_info = cache->page_cache[cache->head]; > + cache->head = (cache->head + 1) & (MLX5E_CACHE_SIZE - 1); > + rq->stats.cache_reuse++; > + > + dma_sync_single_for_device(rq->pdev, dma_info->addr, PAGE_SIZE, > + DMA_FROM_DEVICE); > + return true; > +} > + > static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq, > struct mlx5e_dma_info *dma_info) > { > - struct page *page = dev_alloc_page(); > + struct page *page; > + > + if (mlx5e_rx_cache_get(rq, dma_info)) > + return 0; > > + page = dev_alloc_page(); > if (unlikely(!page)) > return -ENOMEM; -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer
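To make the head/tail mechanics of the ring cache above (and Jesper's blocking concern) concrete, here is a userspace model. It is illustrative only: `head_busy` stands in for the `page_ref_count(...) != 1` test, `int` values stand in for `struct mlx5e_dma_info`, and the size is just some power of two as required by the `& (SIZE - 1)` masking.

```c
#include <assert.h>
#include <stdbool.h>

#define CACHE_SIZE 8  /* must be a power of two, like MLX5E_CACHE_SIZE */

struct cache_model {
    unsigned head, tail;        /* consume at head, produce at tail */
    int pages[CACHE_SIZE];
};

/* Recycle a page into the ring; one slot is always kept free so that
 * head == tail unambiguously means "empty". */
static bool cache_put(struct cache_model *c, int page)
{
    unsigned tail_next = (c->tail + 1) & (CACHE_SIZE - 1);
    if (tail_next == c->head)
        return false;           /* cache_full */
    c->pages[c->tail] = page;
    c->tail = tail_next;
    return true;
}

/* Take a page out for reuse. A still-referenced page at the head blocks
 * reuse of everything queued behind it, which is the concern raised in
 * the review above. */
static bool cache_get(struct cache_model *c, int *page, bool head_busy)
{
    if (c->head == c->tail)
        return false;           /* cache_empty */
    if (head_busy)
        return false;           /* cache_busy: whole cache stalls */
    *page = c->pages[c->head];
    c->head = (c->head + 1) & (CACHE_SIZE - 1);
    return true;
}
```

The strict FIFO order is what makes a long-held page (e.g. one sitting in a TCP receive queue) able to "block" the cache: pages behind it may already be free, but they are never examined.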
[PATCH v2] vti: use right inner_mode for inbound inter address family policy checks
In case of inter address family tunneling (IPv6 over vti4 or IPv4 over vti6), the inbound policy checks in vti_rcv_cb() and vti6_rcv_cb() are using the wrong address family. As a result, all inbound inter address family traffic is dropped. Use the xfrm_ip2inner_mode() helper, as done in xfrm_input() (i.e., also increment LINUX_MIB_XFRMINSTATEMODEERROR in case of error), to select the inner_mode that contains the right address family for the inbound policy checks. Signed-off-by: Thomas Zeitlhofer --- Notes: v2: implement review comments from Steffen (thanks for the reply): - return -EINVAL in case of error - increment LINUX_MIB_XFRMINSTATEMODEERROR in case of error Just to point that out, in case there are arguments against it: this is done in the namespace of skb->dev and not in the t(unnel)?->net namespace. and update the commit message accordingly. net/ipv4/ip_vti.c | 15 ++- net/ipv6/ip6_vti.c | 15 ++- 2 files changed, 28 insertions(+), 2 deletions(-) diff --git a/net/ipv4/ip_vti.c b/net/ipv4/ip_vti.c index a917903..c9957ed 100644 --- a/net/ipv4/ip_vti.c +++ b/net/ipv4/ip_vti.c @@ -88,6 +88,7 @@ static int vti_rcv_cb(struct sk_buff *skb, int err) struct net_device *dev; struct pcpu_sw_netstats *tstats; struct xfrm_state *x; + struct xfrm_mode *inner_mode; struct ip_tunnel *tunnel = XFRM_TUNNEL_SKB_CB(skb)->tunnel.ip4; u32 orig_mark = skb->mark; int ret; @@ -105,7 +106,19 @@ static int vti_rcv_cb(struct sk_buff *skb, int err) } x = xfrm_input_state(skb); - family = x->inner_mode->afinfo->family; + + inner_mode = x->inner_mode; + + if (x->sel.family == AF_UNSPEC) { + inner_mode = xfrm_ip2inner_mode(x, XFRM_MODE_SKB_CB(skb)->protocol); + if (inner_mode == NULL) { + XFRM_INC_STATS(dev_net(skb->dev), + LINUX_MIB_XFRMINSTATEMODEERROR); + return -EINVAL; + } + } + + family = inner_mode->afinfo->family; skb->mark = be32_to_cpu(tunnel->parms.i_key); ret = xfrm_policy_check(NULL, XFRM_POLICY_IN, skb, family); diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c index
d90a11f..52a2f73 100644 --- a/net/ipv6/ip6_vti.c +++ b/net/ipv6/ip6_vti.c @@ -340,6 +340,7 @@ static int vti6_rcv_cb(struct sk_buff *skb, int err) struct net_device *dev; struct pcpu_sw_netstats *tstats; struct xfrm_state *x; + struct xfrm_mode *inner_mode; struct ip6_tnl *t = XFRM_TUNNEL_SKB_CB(skb)->tunnel.ip6; u32 orig_mark = skb->mark; int ret; @@ -357,7 +358,19 @@ static int vti6_rcv_cb(struct sk_buff *skb, int err) } x = xfrm_input_state(skb); - family = x->inner_mode->afinfo->family; + + inner_mode = x->inner_mode; + + if (x->sel.family == AF_UNSPEC) { + inner_mode = xfrm_ip2inner_mode(x, XFRM_MODE_SKB_CB(skb)->protocol); + if (inner_mode == NULL) { + XFRM_INC_STATS(dev_net(skb->dev), + LINUX_MIB_XFRMINSTATEMODEERROR); + return -EINVAL; + } + } + + family = inner_mode->afinfo->family; skb->mark = be32_to_cpu(t->parms.i_key); ret = xfrm_policy_check(NULL, XFRM_POLICY_IN, skb, family); -- 2.1.4
Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed wrote: > On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet wrote: > > On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote: > >> On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet > >> wrote: > >> > On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote: [...] > > > > Only if a qdisc is present and pressure is high enough. > > > > But in a forwarding setup, we likely receive at a lower rate than the > > NIC can transmit. Yes, I can confirm this happens in my experiments. > > > > Jesper has a similar idea to make the qdisc think it is under > pressure, when the device TX ring is idle most of the time, I think > his idea can come in handy here. I am not fully involved in the > details, maybe he can elaborate more. > > But if it works, it will be transparent to napi, and xmit more will > happen by design. Yes. I have some ideas around getting more bulking going from the qdisc layer, by having the drivers provide some feedback to the qdisc layer indicating xmit_more should be possible. This will be a topic at the Network Performance Workshop[1] at NetDev 1.2, where I will hopefully challenge people to come up with a good solution ;-) -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer [1] http://netdevconf.org/1.2/session.html?jesper-performance-workshop
Re: vlan aware bridge doesn't propagate mac changes to vlans on top of it
On 2016-09-07 02:44, Toshiaki Makita wrote: > On 2016/09/07 6:59, Michal Soltys wrote: >> Consider following scenario: >> >> - create vlan aware bridge (say br0) >> - setup br0's vlans, e.g. >> >> bridge vlan add dev br0 vid 10 self >> >> This will add necessary fdb entries directing appropriate traffic to the >> bridge itself. >> >> - create appropriate vlan interfaces on top of it, for example: >> >> ip li add link br0 name br0.10 type vlan id 10 >> ip add add 10.0.0.1/8 dev br0.10 >> >> This will add vlan devices on top of br0 and *inherit br0's mac address*. >> >> - now after all of the above is done >> >> ip li set eth0 master br0 >> >> This will attach interface eth0 to the bridge. With this being the first >> interface attached, br0 will take it's mac address as its own. Any >> further changes to br0's ports may cause the same, with the lowest mac >> address of some port becoming br0's mac. >> >> This will update fdb entries as well, but all vlan interfaces on top of >> br0 (e.g. br0.10) will be using old mac address from the time when vlan >> was created. >> >> The side effect of it is that any traffic addressed to such interface >> will be flooded to all ports (and br0 itself). >> >> The only workaround I found is to either manually update mac addresses >> with 'ip' or recreate vlans (bridge fdb refused to update relevant entries). >> >> But if br0's mac changes due to some port changes - shouldn't it be >> somehow propagated automatically to vlans created on top of it ? > > This should have been addressed at least in kernel 4.7... > http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=308453aa9156a3b8ee382c0949befb507a32b0c1 > > Which kernel version do you use? 
> 4.7.2 git describe on that commit suggests it's been available since 4.6.x What I did in detail:

ip li add name port1b type veth peer name port1e
ip li add br0 type bridge
ip li set dev br0 type bridge vlan_default_pvid 0
ip li set dev br0 type bridge vlan_filtering 1
bridge vlan add dev br0 vid 10 self
bridge vlan add dev br0 vid 250 untagged pvid self
ip li add link br0 name vlan10 type vlan id 10
ip li set port1b master br0

At this point vlan10 (on top of br0) had an outdated mac after br0 took port1b's as its own.
Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On Wed, 2016-09-07 at 19:57 +0300, Saeed Mahameed wrote: > Jesper has a similar idea to make the qdisc think it is under > pressure, when the device > TX ring is idle most of the time, I think his idea can come in handy here. > I am not fully involved in the details, maybe he can elaborate more. > > But if it works, it will be transparent to napi, and xmit more will > happen by design. I do not think qdisc is relevant here. Right now, skb->xmit_more is set only by the qdisc layer (and the pktgen tool), because only this layer can know if more packets are to come. What I am saying is that regardless of skb->xmit_more being set or not, (for example if no qdisc is even used) a NAPI driver can arm a bit asking for the doorbell to be sent at the end of NAPI. I am not saying this must be done, only that the idea could be extended to the non-XDP world, if we care enough.
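Eric's suggestion can be sketched without any hardware: instead of one doorbell MMIO write per transmitted packet, set a pending bit while filling descriptors and ring once at the end of the NAPI poll. Everything below (struct and function names, fields) is invented for illustration; a real driver would also need a memory barrier before the doorbell write so the device observes the descriptors first.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-queue TX state. */
struct txq_model {
    int queued;             /* descriptors written, not yet flushed to HW */
    int doorbells;          /* MMIO doorbell writes issued */
    bool doorbell_pending;  /* the "armed bit" from the discussion above */
};

/* Queue one packet: write the descriptor, arm the doorbell, don't ring. */
static void xmit_one(struct txq_model *q)
{
    q->queued++;
    q->doorbell_pending = true;
}

/* End of the NAPI poll: a single doorbell write flushes the whole batch,
 * giving xmit_more-style batching even without a qdisc. */
static void napi_poll_end(struct txq_model *q)
{
    if (q->doorbell_pending) {
        q->doorbells++;     /* one MMIO write instead of q->queued writes */
        q->doorbell_pending = false;
    }
}
```

The trade-off the thread is circling around: the doorbell is delayed until the poll ends, so under very light load this adds a little latency, while under load it removes most of the expensive MMIO writes.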
Re: [PATCH v4 6/6] arm64: dts: rockchip: enable the gmac for rk3399 evb board
Am Freitag, 2. September 2016, 01:50:04 CEST schrieb Caesar Wang: > From: Roger Chen > > We add the required and optional properties for the evb board. > See [0] for the detailed information. > > [0]: > Documentation/devicetree/bindings/net/rockchip-dwmac.txt > > Signed-off-by: Roger Chen > Signed-off-by: Caesar Wang applied to my dts64 branch, after moving clkin_gmac and sorting the gmac properties according to their alphabetical position. Heiko
Re: [PATCH v4 5/6] arm64: dts: rockchip: add the gmac needed node for rk3399
Am Freitag, 2. September 2016, 01:50:03 CEST schrieb Caesar Wang: > From: Roger Chen > > The RK3399 GMAC Ethernet Controller provides a complete Ethernet interface > from processor to a Reduced Media Independent Interface (RMII) and Reduced > Gigabit Media Independent Interface (RGMII) compliant Ethernet PHY. > > This patch adds the related device information needed, > e.g. interrupts, grf, clocks, pinctrl and so on. > > The full details are in [0]. > > [0]: > Documentation/devicetree/bindings/net/rockchip-dwmac.txt > > Signed-off-by: Roger Chen > Signed-off-by: Caesar Wang > --- > > Changes in v4: > - The Roger had posted patch on https://patchwork.kernel.org/patch/9274561/. > - re-fixup to original author. > > Changes in v3: > - generate a patch from https://patchwork.kernel.org/patch/9306339/. > > Changes in v2: None > > arch/arm64/boot/dts/rockchip/rk3399.dtsi | 80 > 1 file changed, 80 insertions(+) > > diff --git a/arch/arm64/boot/dts/rockchip/rk3399.dtsi > b/arch/arm64/boot/dts/rockchip/rk3399.dtsi index 2ab233f..092bb45 100644 > --- a/arch/arm64/boot/dts/rockchip/rk3399.dtsi > +++ b/arch/arm64/boot/dts/rockchip/rk3399.dtsi > @@ -200,6 +200,26 @@ > }; > }; > > + gmac: ethernet@fe30 { > + compatible = "rockchip,rk3399-gmac"; > + reg = <0x0 0xfe30 0x0 0x1>; > + interrupts = ; applied to my dts64 branch, after making this a + interrupts = ; (due to the 4-cell interrupt change) Heiko
[PATCH v2] tcp: cwnd does not increase in TCP YeAH
Commit 76174004a0f19785a328f40388e87e982bbf69b9 (tcp: do not slow start when cwnd equals ssthresh) introduced a regression in TCP YeAH. On a virtual ethernet link with 100ms delay and 1% loss, kernel 4.2 shows ~500KB/s bandwidth for a single TCP connection, while kernel 4.3 and above (including 4.8-rc4) shows ~100KB/s. That is caused by a stalled cwnd when cwnd equals ssthresh. This patch fixes it by properly increasing cwnd in this case. Signed-off-by: Artem Germanov Acked-by: Dmitry Adamushko --- diff --git a/net/ipv4/tcp_yeah.c b/net/ipv4/tcp_yeah.c index 028eb04..9c5fc97 100644 --- a/net/ipv4/tcp_yeah.c +++ b/net/ipv4/tcp_yeah.c @@ -76,7 +76,7 @@ static void tcp_yeah_cong_avoid(struct sock *sk, u32 ack, u32 acked) if (!tcp_is_cwnd_limited(sk)) return; - if (tp->snd_cwnd <= tp->snd_ssthresh) + if (tcp_in_slow_start(tp)) tcp_slow_start(tp, acked); else if (!yeah->doing_reno_now) {
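The one-line fix can be illustrated with a small model. `slow_start_model` mimics the post-76174004 `tcp_slow_start()`, which no longer grows cwnd once it reaches ssthresh; the old `<=` test then left cwnd stuck exactly at ssthresh, while `tcp_in_slow_start()` (a strict `<`) hands that case to congestion avoidance. The congestion-avoidance step is reduced here to a simple increment; YeAH's real growth logic is more involved.

```c
#include <assert.h>
#include <stdbool.h>

struct tcp_model { unsigned cwnd, ssthresh; };

/* tcp_in_slow_start(): strict comparison */
static bool in_slow_start(const struct tcp_model *tp)
{
    return tp->cwnd < tp->ssthresh;
}

/* Models post-commit tcp_slow_start(): growth is capped at ssthresh */
static void slow_start_model(struct tcp_model *tp)
{
    if (tp->cwnd < tp->ssthresh)
        tp->cwnd++;
}

/* Simplified stand-in for YeAH's congestion-avoidance growth */
static void cong_avoid_step(struct tcp_model *tp)
{
    tp->cwnd++;
}

static void yeah_old(struct tcp_model *tp)   /* before this patch */
{
    if (tp->cwnd <= tp->ssthresh)
        slow_start_model(tp);   /* no-op when cwnd == ssthresh: stall */
    else
        cong_avoid_step(tp);
}

static void yeah_fixed(struct tcp_model *tp) /* with tcp_in_slow_start() */
{
    if (in_slow_start(tp))
        slow_start_model(tp);
    else
        cong_avoid_step(tp);    /* cwnd == ssthresh now reaches CA */
}
```

Once cwnd equals ssthresh, the old predicate keeps routing the flow into a slow-start path that no longer increases cwnd, so the window never grows again; the fixed predicate lets congestion avoidance take over at exactly that point.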
Re: [PATCH RFC 04/11] net/mlx5e: Build RX SKB on demand
On Wed, Sep 07, 2016 at 03:42:25PM +0300, Saeed Mahameed wrote: > For non-striding RQ configuration before this patch we had a ring > with pre-allocated SKBs and mapped the SKB->data buffers for > device. > > For robustness and better RX data buffers management, we allocate a > page per packet and build_skb around it. > > This patch (which is a prerequisite for XDP) will actually reduce > performance for normal stack usage, because we are now hitting a bottleneck > in the page allocator. A later patch of page reuse mechanism will be > needed to restore or even improve performance in comparison to the old > RX scheme. > > Packet rate performance testing was done with pktgen 64B packets on xmit > side and TC drop action on RX side. > > CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz > > Comparison is done between: > 1. Baseline, before 'net/mlx5e: Build RX SKB on demand' > 2. Build SKB with RX page cache (This patch) >
> Streams    Baseline    Build SKB + page-cache    Improvement
> ------------------------------------------------------------
> 1          4.33Mpps    5.51Mpps                  27%
> 2          7.35Mpps    11.5Mpps                  52%
> 4          14.0Mpps    16.3Mpps                  16%
> 8          22.2Mpps    29.6Mpps                  20%
> 16         24.8Mpps    34.0Mpps                  17%
Impressive gains for build_skb. I think it should help ip forwarding too and likely tcp_rr. tcp_stream shouldn't see any difference. If you can benchmark that along with pktgen+tc_drop it would help to better understand the impact of the changes.
Re: [PATCH RFC 01/11] net/mlx5e: Single flow order-0 pages for Striding RQ
On Wed, Sep 07, 2016 at 03:42:22PM +0300, Saeed Mahameed wrote: > From: Tariq Toukan > > To improve the memory consumption scheme, we omit the flow that > demands and splits high-order pages in Striding RQ, and stay > with a single Striding RQ flow that uses order-0 pages. > > Moving to fragmented memory allows the use of larger MPWQEs, > which reduces the number of UMR posts and filler CQEs. > > Moving to a single flow allows several optimizations that improve > performance, especially in production servers where we would > anyway fall back to order-0 allocations: > - inline functions that were called via function pointers. > - improve the UMR post process. > > This patch alone is expected to give a slight performance reduction. > However, the new memory scheme gives the possibility to use a page-cache > of a fair size, that doesn't inflate the memory footprint, which will > dramatically fix the reduction and even give a huge gain. > > We ran pktgen single-stream benchmarks, with iptables-raw-drop: > > Single stride, 64 bytes: > * 4,739,057 - baseline > * 4,749,550 - this patch > no reduction > > Larger packets, no page cross, 1024 bytes: > * 3,982,361 - baseline > * 3,845,682 - this patch > 3.5% reduction > > Larger packets, every 3rd packet crosses a page, 1500 bytes: > * 3,731,189 - baseline > * 3,579,414 - this patch > 4% reduction IMO it's not a realistic use case, but it would be good to mention that patch 3 brings performance back for this use case anyway.
Re: [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support
On Wed, Sep 7, 2016 at 7:54 PM, Tom Herbert wrote: > On Wed, Sep 7, 2016 at 7:48 AM, Saeed Mahameed > wrote: >> On Wed, Sep 7, 2016 at 4:32 PM, Or Gerlitz wrote: >>> On Wed, Sep 7, 2016 at 3:42 PM, Saeed Mahameed wrote: >>> Packet rate performance testing was done with pktgen 64B packets on the TX side, and TC drop action on the RX side compared to XDP fast drop. CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz Comparison is done between: 1. Baseline, before this patch with TC drop action 2. This patch with TC drop action 3. This patch with XDP RX fast drop
Streams    Baseline (TC drop)    TC drop     XDP fast drop
----------------------------------------------------------
1          5.51Mpps              5.14Mpps    13.5Mpps
2          11.5Mpps              10.0Mpps    25.1Mpps
4          16.3Mpps              17.2Mpps    35.4Mpps
8          29.6Mpps              28.2Mpps    45.8Mpps*
16         34.0Mpps              30.1Mpps    45.8Mpps*
>>> >>> Rana, Guys, congrats!! >>> >>> When you say X streams, does each stream get mapped by RSS to a different RX ring? >>> or are we on the same RX ring for all rows of the above table? >> >> Yes, I will make this more clear in the actual submission. >> Here we are talking about different RSS core rings. >> >>> >>> In the CX3 work, we had X sender "streams" that all mapped to the same RX >>> ring, >>> I don't think we went beyond one RX ring. >> >> Here we did, the first row is what you are describing; the other rows >> are the same test >> with an increasing number of RSS receiving cores. The xmit side is >> sending >> as many streams as possible to be as uniformly spread as possible >> across the >> different RSS cores on the receiver. >> > Hi Saeed, > > Please report CPU utilization also. The expectation is that > performance should scale linearly with increasing number of CPUs (i.e. > pps/CPU_utilization should be constant). > Hi Tom, That was my expectation too. We didn't do the full analysis yet; it could be that RSS was not spreading the workload on all the cores evenly.
Those numbers are from my humble machine with a quick and dirty testing, the idea of this submission is to let the folks look at the code while we continue testing and analyzing those patches. Anyway we will share more accurate results when we have them, with CPU utilization statistics as well. Thanks, Saeed. > Tom >
Re: [PATCH net-next V6 4/4] net/sched: Introduce act_tunnel_key
On Wed, 2016-09-07 at 09:27 -0700, Cong Wang wrote: > Fix checkpatch.pl (or whatever tool you prefer). I do not use checkpatch.pl. But I know for sure David cares about this ordering. https://lkml.org/lkml/2016/4/24/91 https://patchwork.ozlabs.org/patch/629958/ So feel free to fix checkpatch.pl
Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet wrote: > On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote: >> On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet wrote: >> > On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote: >> >> Previously we rang the XDP SQ doorbell on every forwarded XDP packet. >> >> >> >> Here we introduce an xmit_more-like mechanism that will queue up more >> >> than one packet into the SQ (up to the RX napi budget) w/o notifying the hardware. >> >> >> >> Once the RX napi budget is consumed and we exit the napi RX loop, we will >> >> flush (doorbell) all XDP looped packets, if there are any. >> > >> > Why does this idea depend on XDP ? >> > >> > It looks like we could apply it to any driver having one IRQ servicing >> > one RX and one TX, without XDP being involved. >> > >> >> Yes, but it is more complicated than the XDP case, where the RX ring posts >> the TX descriptors and once done >> the RX ring hits the doorbell once for all the TX descriptors it >> posted, and it is the only possible place to hit a doorbell >> for the XDP TX ring. >> >> For regular TX and RX rings sharing the same IRQ, there is no such >> simple connection between them, and hitting a doorbell >> from RX ring napi would race with the xmit ndo function of the TX ring. >> >> How do you synchronize in such a case ? >> isn't the existing xmit_more mechanism sufficient ? > > Only if a qdisc is present and pressure is high enough. > > But in a forwarding setup, we likely receive at a lower rate than the > NIC can transmit. > Jesper has a similar idea to make the qdisc think it is under pressure when the device TX ring is idle most of the time; I think his idea can come in handy here. I am not fully involved in the details, maybe he can elaborate more. But if it works, it will be transparent to napi, and xmit_more will happen by design. > A simple cmpxchg could be used to synchronize the thing, if we really > cared about doorbell cost.
(i.e., if the cost of this cmpxchg() is way > smaller than the doorbell one) >
[PATCH net-next v22 0/3] openvswitch: add 802.1ad support
This series adds 802.1ad support to openvswitch. It is a continuation of the work originally started by Thomas F Herbert - hence the large rev number. The extra VLAN is implemented by using an additional level of the OVS_KEY_ATTR_ENCAP netlink attribute. In OVS flow speak, this looks like eth_type(0x88a8),vlan(vid=100),encap(eth_type(0x8100), vlan(vid=200), encap(eth_type(0x0800), ...)) The userspace counterpart has also seen recent activity on the ovs-dev mailing lists. There are some new 802.1ad OVS tests being added - also on the ovs-dev list. This patch series has been tested using the most recent version of userspace (v3) and tests (v2). v22 changes: - merge patch 4 into patch 3 - fix checkpatch.pl errors - Still some 80 char warnings for long string literals - refresh pointer after pskb_may_pull() - refactor vlan nlattr parsing to remove some double checks - introduce ovs_nla_put_vlan() - move triple VLAN check to after ethertype serialization - WARN_ON_ONCE() on triple VLAN and unexpected encap values v21 changes: - Fix (and simplify) netlink attribute parsing - re-add handling of truncated VLAN tags - fix if/else dangling assignment in {push,pop}_vlan() - simplify parse_vlan() Eric Garver (2): vlan: Check for vlan ethernet types for 802.1q or 802.1ad openvswitch: 802.1AD Flow handling, actions, vlan parsing, netlink attributes Thomas F Herbert (1): openvswitch: 802.1ad uapi changes. include/linux/if_vlan.h | 33 +++-- include/uapi/linux/openvswitch.h | 17 ++- net/openvswitch/actions.c | 16 +- net/openvswitch/flow.c | 65 +--- net/openvswitch/flow.h | 8 +- net/openvswitch/flow_netlink.c | 310 ++- net/openvswitch/vport.c | 7 +- 7 files changed, 314 insertions(+), 142 deletions(-) -- 2.5.5
[PATCH net-next v22 3/3] openvswitch: 802.1AD Flow handling, actions, vlan parsing, netlink attributes
Add support for 802.1ad including the ability to push and pop double tagged vlans. Add support for 802.1ad to netlink parsing and flow conversion. Uses double nested encap attributes to represent a double tagged vlan. The inner TPID is encoded along with the inner TCI (C-TCI) in nested attributes. This is based on Thomas F Herbert's original v20 patch. I made some small cleanups and bug fixes. Signed-off-by: Thomas F Herbert Signed-off-by: Eric Garver --- net/openvswitch/actions.c | 16 ++- net/openvswitch/flow.c | 65 ++--- net/openvswitch/flow.h | 8 +- net/openvswitch/flow_netlink.c | 310 - net/openvswitch/vport.c | 7 +- 5 files changed, 282 insertions(+), 124 deletions(-) diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c index ca91fc33f8a9..4fe9032b1160 100644 --- a/net/openvswitch/actions.c +++ b/net/openvswitch/actions.c @@ -246,20 +246,24 @@ static int pop_vlan(struct sk_buff *skb, struct sw_flow_key *key) int err; err = skb_vlan_pop(skb); - if (skb_vlan_tag_present(skb)) + if (skb_vlan_tag_present(skb)) { invalidate_flow_key(key); - else - key->eth.tci = 0; + } else { + key->eth.vlan.tci = 0; + key->eth.vlan.tpid = 0; + } return err; } static int push_vlan(struct sk_buff *skb, struct sw_flow_key *key, const struct ovs_action_push_vlan *vlan) { - if (skb_vlan_tag_present(skb)) + if (skb_vlan_tag_present(skb)) { invalidate_flow_key(key); - else - key->eth.tci = vlan->vlan_tci; + } else { + key->eth.vlan.tci = vlan->vlan_tci; + key->eth.vlan.tpid = vlan->vlan_tpid; + } return skb_vlan_push(skb, vlan->vlan_tpid, ntohs(vlan->vlan_tci) & ~VLAN_TAG_PRESENT); } diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c index 0ea128eeeab2..1240ae3b88d2 100644 --- a/net/openvswitch/flow.c +++ b/net/openvswitch/flow.c @@ -302,24 +302,57 @@ static bool icmp6hdr_ok(struct sk_buff *skb) sizeof(struct icmp6hdr)); } -static int parse_vlan(struct sk_buff *skb, struct sw_flow_key *key) +/** + * Parse vlan tag from vlan header. + * Returns ERROR on memory error.
+ * Returns 0 if it encounters a non-vlan or incomplete packet. + * Returns 1 after successfully parsing vlan tag. + */ +static int parse_vlan_tag(struct sk_buff *skb, struct vlan_head *key_vh) { - struct qtag_prefix { - __be16 eth_type; /* ETH_P_8021Q */ - __be16 tci; - }; - struct qtag_prefix *qp; + struct vlan_head *vh = (struct vlan_head *)skb->data; + + if (likely(!eth_type_vlan(vh->tpid))) + return 0; - if (unlikely(skb->len < sizeof(struct qtag_prefix) + sizeof(__be16))) + if (unlikely(skb->len < sizeof(struct vlan_head) + sizeof(__be16))) + return 0; - if (unlikely(!pskb_may_pull(skb, sizeof(struct qtag_prefix) + - sizeof(__be16)))) + if (unlikely(!pskb_may_pull(skb, sizeof(struct vlan_head) + + sizeof(__be16)))) return -ENOMEM; - qp = (struct qtag_prefix *) skb->data; - key->eth.tci = qp->tci | htons(VLAN_TAG_PRESENT); - __skb_pull(skb, sizeof(struct qtag_prefix)); + vh = (struct vlan_head *)skb->data; + key_vh->tci = vh->tci | htons(VLAN_TAG_PRESENT); + key_vh->tpid = vh->tpid; + + __skb_pull(skb, sizeof(struct vlan_head)); + return 1; +} + +static int parse_vlan(struct sk_buff *skb, struct sw_flow_key *key) +{ + int res; + + key->eth.vlan.tci = 0; + key->eth.vlan.tpid = 0; + key->eth.cvlan.tci = 0; + key->eth.cvlan.tpid = 0; + + if (likely(skb_vlan_tag_present(skb))) { + key->eth.vlan.tci = htons(skb->vlan_tci); + key->eth.vlan.tpid = skb->vlan_proto; + } else { + /* Parse outer vlan tag in the non-accelerated case. */ + res = parse_vlan_tag(skb, &key->eth.vlan); + if (res <= 0) + return res; + } + + /* Parse inner vlan tag. */ + res = parse_vlan_tag(skb, &key->eth.cvlan); + if (res <= 0) + return res; return 0; } @@ -480,12 +513,8 @@ static int key_extract(struct sk_buff *skb, struct sw_flow_key *key) * update skb->csum here.
*/ - key->eth.tci = 0; - if (skb_vlan_tag_present(skb)) - key->eth.tci = htons(skb->vlan_tci); - else if (eth->h_proto == htons(ETH_P_8021Q)) - if (unlikely(parse_vlan(skb, key))) - return -ENOMEM; + if (unlikely(parse_vlan(skb, key))) + return -ENOMEM; key->eth.type =
[PATCH net-next v22 1/3] openvswitch: 802.1ad uapi changes.
From: Thomas F Herbert openvswitch: Add support for 802.1AD. Change the description of the VLAN tpid field. Signed-off-by: Thomas F Herbert --- include/uapi/linux/openvswitch.h | 17 + 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h index 54c3b4f4aceb..59ed3992c760 100644 --- a/include/uapi/linux/openvswitch.h +++ b/include/uapi/linux/openvswitch.h @@ -605,13 +605,13 @@ struct ovs_action_push_mpls { * @vlan_tci: Tag control identifier (TCI) to push. The CFI bit must be set * (but it will not be set in the 802.1Q header that is pushed). * - * The @vlan_tpid value is typically %ETH_P_8021Q. The only acceptable TPID - * values are those that the kernel module also parses as 802.1Q headers, to - * prevent %OVS_ACTION_ATTR_PUSH_VLAN followed by %OVS_ACTION_ATTR_POP_VLAN - * from having surprising results. + * The @vlan_tpid value is typically %ETH_P_8021Q or %ETH_P_8021AD. + * The only acceptable TPID values are those that the kernel module also parses + * as 802.1Q or 802.1AD headers, to prevent %OVS_ACTION_ATTR_PUSH_VLAN followed + * by %OVS_ACTION_ATTR_POP_VLAN from having surprising results. */ struct ovs_action_push_vlan { - __be16 vlan_tpid; /* 802.1Q TPID. */ + __be16 vlan_tpid; /* 802.1Q or 802.1ad TPID. */ __be16 vlan_tci;/* 802.1Q TCI (VLAN ID and priority). */ }; @@ -721,9 +721,10 @@ enum ovs_nat_attr { * is copied from the value to the packet header field, rest of the bits are * left unchanged. The non-masked value bits must be passed in as zeroes. * Masking is not supported for the %OVS_KEY_ATTR_TUNNEL attribute. - * @OVS_ACTION_ATTR_PUSH_VLAN: Push a new outermost 802.1Q header onto the - * packet. - * @OVS_ACTION_ATTR_POP_VLAN: Pop the outermost 802.1Q header off the packet. + * @OVS_ACTION_ATTR_PUSH_VLAN: Push a new outermost 802.1Q or 802.1ad header + * onto the packet. + * @OVS_ACTION_ATTR_POP_VLAN: Pop the outermost 802.1Q or 802.1ad header + * from the packet.
* @OVS_ACTION_ATTR_SAMPLE: Probabilitically executes actions, as specified in * the nested %OVS_SAMPLE_ATTR_* attributes. * @OVS_ACTION_ATTR_PUSH_MPLS: Push a new MPLS label stack entry onto the -- 2.5.5
[PATCH net-next v22 2/3] vlan: Check for vlan ethernet types for 802.1q or 802.1ad
This is to simplify using double tagged vlans. This function allows all valid vlan ethertypes to be checked in a single function call. Also replace some instances that check for both ETH_P_8021Q and ETH_P_8021AD. Patch based on one originally by Thomas F Herbert. Signed-off-by: Thomas F Herbert Signed-off-by: Eric Garver --- include/linux/if_vlan.h | 33 +++-- 1 file changed, 23 insertions(+), 10 deletions(-) diff --git a/include/linux/if_vlan.h b/include/linux/if_vlan.h index 49d4aef1f789..3319d97d789d 100644 --- a/include/linux/if_vlan.h +++ b/include/linux/if_vlan.h @@ -272,6 +272,23 @@ static inline int vlan_get_encap_level(struct net_device *dev) } #endif +/** + * eth_type_vlan - check for valid vlan ether type. + * @ethertype: ether type to check + * + * Returns true if the ether type is a vlan ether type. + */ +static inline bool eth_type_vlan(__be16 ethertype) +{ + switch (ethertype) { + case htons(ETH_P_8021Q): + case htons(ETH_P_8021AD): + return true; + default: + return false; + } +} + static inline bool vlan_hw_offload_capable(netdev_features_t features, __be16 proto) { @@ -425,8 +442,7 @@ static inline int __vlan_get_tag(const struct sk_buff *skb, u16 *vlan_tci) { struct vlan_ethhdr *veth = (struct vlan_ethhdr *)skb->data; - if (veth->h_vlan_proto != htons(ETH_P_8021Q) && - veth->h_vlan_proto != htons(ETH_P_8021AD)) + if (!eth_type_vlan(veth->h_vlan_proto)) return -EINVAL; *vlan_tci = ntohs(veth->h_vlan_TCI); @@ -488,7 +504,7 @@ static inline __be16 __vlan_get_protocol(struct sk_buff *skb, __be16 type, * present at mac_len - VLAN_HLEN (if mac_len > 0), or at * ETH_HLEN otherwise */ - if (type == htons(ETH_P_8021Q) || type == htons(ETH_P_8021AD)) { + if (eth_type_vlan(type)) { if (vlan_depth) { if (WARN_ON(vlan_depth < VLAN_HLEN)) return 0; @@ -506,8 +522,7 @@ static inline __be16 __vlan_get_protocol(struct sk_buff *skb, __be16 type, vh = (struct vlan_hdr *)(skb->data + vlan_depth); type = vh->h_vlan_encapsulated_proto; vlan_depth += VLAN_HLEN; - } while
(type == htons(ETH_P_8021Q) || -type == htons(ETH_P_8021AD)); + } while (eth_type_vlan(type)); } if (depth) @@ -572,8 +587,7 @@ static inline void vlan_set_encap_proto(struct sk_buff *skb, static inline bool skb_vlan_tagged(const struct sk_buff *skb) { if (!skb_vlan_tag_present(skb) && - likely(skb->protocol != htons(ETH_P_8021Q) && - skb->protocol != htons(ETH_P_8021AD))) + likely(!eth_type_vlan(skb->protocol))) return false; return true; @@ -593,15 +607,14 @@ static inline bool skb_vlan_tagged_multi(const struct sk_buff *skb) if (!skb_vlan_tag_present(skb)) { struct vlan_ethhdr *veh; - if (likely(protocol != htons(ETH_P_8021Q) && - protocol != htons(ETH_P_8021AD))) + if (likely(!eth_type_vlan(protocol))) return false; veh = (struct vlan_ethhdr *)skb->data; protocol = veh->h_vlan_encapsulated_proto; } - if (protocol != htons(ETH_P_8021Q) && protocol != htons(ETH_P_8021AD)) + if (!eth_type_vlan(protocol)) return false; return true; -- 2.5.5
Re: [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support
On Wed, Sep 7, 2016 at 7:48 AM, Saeed Mahameed wrote: > On Wed, Sep 7, 2016 at 4:32 PM, Or Gerlitz wrote: >> On Wed, Sep 7, 2016 at 3:42 PM, Saeed Mahameed wrote: >> >>> Packet rate performance testing was done with pktgen 64B packets on the >>> TX side, and TC drop action on the RX side compared to XDP fast drop. >>> >>> CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz >>> >>> Comparison is done between: >>> 1. Baseline, before this patch with TC drop action >>> 2. This patch with TC drop action >>> 3. This patch with XDP RX fast drop >>>
>>> Streams    Baseline (TC drop)    TC drop     XDP fast drop
>>> ----------------------------------------------------------
>>> 1          5.51Mpps              5.14Mpps    13.5Mpps
>>> 2          11.5Mpps              10.0Mpps    25.1Mpps
>>> 4          16.3Mpps              17.2Mpps    35.4Mpps
>>> 8          29.6Mpps              28.2Mpps    45.8Mpps*
>>> 16         34.0Mpps              30.1Mpps    45.8Mpps*
>> >> Rana, Guys, congrats!! >> >> When you say X streams, does each stream get mapped by RSS to a different RX ring? >> or are we on the same RX ring for all rows of the above table? > > Yes, I will make this more clear in the actual submission. > Here we are talking about different RSS core rings. > >> >> In the CX3 work, we had X sender "streams" that all mapped to the same RX >> ring, >> I don't think we went beyond one RX ring. > > Here we did, the first row is what you are describing; the other rows > are the same test > with an increasing number of RSS receiving cores. The xmit side is > sending > as many streams as possible to be as uniformly spread as possible > across the > different RSS cores on the receiver. > Hi Saeed, Please report CPU utilization also. The expectation is that performance should scale linearly with increasing number of CPUs (i.e. pps/CPU_utilization should be constant). Tom >> >> Here, I guess you want to 1st get an initial max for N pktgen TX >> threads all sending >> the same stream so you land on a single RX ring, and then move to M * N pktgen >> TX >> threads to max that further.
>> >> I don't see how the current Linux stack would be able to happily drive 34M >> PPS >> (== allocate SKB, etc, you know...) on a single CPU, Jesper? >> >> Or.
Re: [PATCH net-next V6 4/4] net/sched: Introduce act_tunnel_key
On Wed, Sep 7, 2016 at 7:14 AM, Eric Dumazet wrote: > On Wed, 2016-09-07 at 11:08 +0300, Hadar Hen Zion wrote: >> From: Amir Vadai >> > > >> +static int tunnel_key_init(struct net *net, struct nlattr *nla, >> +struct nlattr *est, struct tc_action **a, >> +int ovr, int bind) >> +{ >> + struct tc_action_net *tn = net_generic(net, tunnel_key_net_id); >> + struct nlattr *tb[TCA_TUNNEL_KEY_MAX + 1]; >> + struct metadata_dst *metadata = NULL; >> + struct tc_tunnel_key *parm; >> + struct tcf_tunnel_key *t; >> + struct tcf_tunnel_key_params *params_old; >> + struct tcf_tunnel_key_params *params_new; >> + __be64 key_id; >> + bool exists = false; >> + int ret = 0; >> + int err; > > We try for new code to avoid reverse Christmas tree ;) Fix checkpatch.pl (or whatever tool you prefer).
Re: [PATCH net-next V6 4/4] net/sched: Introduce act_tunnel_key
On Wed, Sep 7, 2016 at 1:08 AM, Hadar Hen Zion wrote: > +struct tcf_tunnel_key_params { > + struct rcu_head rcu; > + int tcft_action; > + int action; > + struct metadata_dst *tcft_enc_metadata; > +}; > + > +struct tcf_tunnel_key { > + struct tc_action common; > + struct tcf_tunnel_key_params __rcu *params; > +}; > + ... This is unnecessary if we make the tc action API aware of RCU. > + > +static void tunnel_key_release(struct tc_action *a, int bind) > +{ > + struct tcf_tunnel_key *t = to_tunnel_key(a); > + struct tcf_tunnel_key_params *params; > + > + rcu_read_lock(); > + params = rcu_dereference(t->params); > + > + if (params->tcft_action == TCA_TUNNEL_KEY_ACT_SET) > + dst_release(&params->tcft_enc_metadata->dst); > + > + rcu_read_unlock(); > + So you allocate memory for t->params in ->init() but not release it here? Also, ->cleanup() should be called with RTNL held, so there is no need to take the read lock here. BTW, again you do NOT need to make it RCU; the whole tc action API should be, as my patchset does. I will take care of this as part of my patchset. Eric is wasting your time on this, with no benefits, the code will be replaced soon.
Re: [RFC Patch net-next 0/6] net_sched: really switch to RCU for tc actions
On 16-09-01 10:57 PM, Cong Wang wrote: > Currently there are only two lockless tc actions: > gact and mirred. But they are questionable because > we don't have anything to prevent a parallel update > on an existing tc action in the hash table while reading > it on the fast path; this could be a problem when a tc > action becomes complex. Hmm, I'm trying to see where the questionable part is in the current code. What is it exactly? The calls to tcf_lastuse_update(&p->tcf_tm), for example, are possibly wrong, as that is a u64 that may be set from multiple cores. So fixing that seems like a good idea. Actions themselves don't have a path to be "updated" while live, do they? IIRC, and I think a quick scan of the code this morning confirms, actions have a refcnt and a "bind"/"release" operation that increments/decrements this counter. Both bind and release are protected via the rtnl lock in the control path. I need to follow all the code paths, but is there a way to remove an action that still has a refcnt > 0? In other words, does it need to be removed from all filters before it can be deleted? If yes, then by the time it is removed (after an rcu grace period) it should not be in use. If no, then I think there is a problem. I'm looking at this code path here, int __tcf_hash_release(struct tc_action *p, bool bind, bool strict) { int ret = 0; if (p) { if (bind) p->tcfa_bindcnt--; else if (strict && p->tcfa_bindcnt > 0) return -EPERM; p->tcfa_refcnt--; if (p->tcfa_bindcnt <= 0 && p->tcfa_refcnt <= 0) { if (p->ops->cleanup) p->ops->cleanup(p, bind); tcf_hash_destroy(p->hinfo, p); ret = ACT_P_DELETED; } } return ret; } It looks to me that every call site that jumps here where it's possible an action is being used by a filter is "strict". And further, filters only release actions after an rcu grace period when being destroyed and the filter is no longer using the action. Although the refcnt should be atomic now that it's being called from outside the rtnl lock in an rcu callback?
At least it looks racy to me at a glance this morning. If the refcnt'ing is atomic then do we care/need the hash rcu bits? I'm not seeing how it helps because in the fast path we don't even touch the hash table we have a pointer to a refcnt'd action object. What did I miss? > > This patchset introduces a few new tc action API's > based on RCU so that the fast path could now really > be protected by RCU and we can update existing tc > actions safely and race-freely. > > Obviously this is still _not_ complete yet, I only > modified mirred action to show the use case of > the new API's, all the rest actions could switch to > the new API's too. The new API's are a bit ugly too, > any suggestion to improve them is welcome. > > I tested mirred action with a few test cases, so far > so good, at least no obvious bugs. ;) Take a quick survey of the actions I didn't see any with global state. But I didn't look at them all. > > > Cong Wang (6): > net_sched: use RCU for action hash table > net_sched: introduce tcf_hash_replace() > net_sched: return NULL in tcf_hash_check() > net_sched: introduce tcf_hash_copy() > net_sched: use rcu in fast path > net_sched: switch to RCU API for act_mirred > > include/net/act_api.h | 3 +++ > net/sched/act_api.c| 59 > +++--- > net/sched/act_mirred.c | 41 --- > 3 files changed, 73 insertions(+), 30 deletions(-) >
[PATCH net 4/5] net/mlx5e: Fix global PFC counters replication
From: Gal Pressman Currently when reading global PFC statistics we left the counter iterator out of the equation and ended up reading the same counter over and over again. Instead of reading the counter at index 0 on every iteration we now read the counter at index (i). Fixes: e989d5a532ce ('net/mlx5e: Expose flow control counters to ethtool') Signed-off-by: Gal Pressman Signed-off-by: Saeed Mahameed --- drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c index 98e1a4a..7a346bb 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c @@ -331,7 +331,7 @@ static void mlx5e_get_ethtool_stats(struct net_device *dev, if (mlx5e_query_global_pause_combined(priv)) { for (i = 0; i < NUM_PPORT_PER_PRIO_PFC_COUNTERS; i++) { data[idx++] = MLX5E_READ_CTR64_BE(&priv->stats.pport.per_prio_counters[0], - pport_per_prio_pfc_stats_desc, 0); + pport_per_prio_pfc_stats_desc, i); } } -- 2.7.4
[PATCH net 0/5] Mellanox 100G mlx5 fixes 2016-09-07
Hi Dave, The following series contains bug fixes for the mlx5e driver. From Gal, - Static code checker cleanup (casting overflow) - Fix global PFC counter statistics reading - Fix HW LRO when vlan stripping is off From Bodong, - Deprecate old autoneg capability bit and use new one. From Tariq, - Fix xmit_more counter race condition For -stable: ('net/mlx5e: Fix parsing of vlan packets when updating lro header') No conflicts are introduced with the mlx5 ongoing net-next submission. Thanks, Saeed. Bodong Wang (1): net/mlx5e: Move an_disable_cap bit to a new position Gal Pressman (3): net/mlx5e: Prevent casting overflow net/mlx5e: Fix global PFC counters replication net/mlx5e: Fix parsing of vlan packets when updating lro header Tariq Toukan (1): net/mlx5e: Fix xmit_more counter race issue .../net/ethernet/mellanox/mlx5/core/en_ethtool.c | 8 +--- drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 22 +++--- drivers/net/ethernet/mellanox/mlx5/core/en_tx.c | 2 +- include/linux/mlx5/mlx5_ifc.h | 5 +++-- 4 files changed, 24 insertions(+), 13 deletions(-) -- 2.7.4
[PATCH net 5/5] net/mlx5e: Fix parsing of vlan packets when updating lro header
From: Gal Pressman Currently vlan tagged packets were not parsed correctly and were assumed to be regular IPv4/IPv6 packets. We should check for 802.1Q/802.1ad tags and update the lro header accordingly. This fixes the use case where LRO is on and rxvlan is off (vlan stripping is off). Fixes: e586b3b0baee ('net/mlx5: Ethernet Datapath files') Signed-off-by: Gal Pressman Signed-off-by: Saeed Mahameed --- drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 22 +++--- 1 file changed, 15 insertions(+), 7 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c index b6f8ebb..e7c969d 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c @@ -637,24 +637,32 @@ bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq) static void mlx5e_lro_update_hdr(struct sk_buff *skb, struct mlx5_cqe64 *cqe, u32 cqe_bcnt) { - struct ethhdr *eth = (struct ethhdr *)(skb->data); - struct iphdr *ipv4 = (struct iphdr *)(skb->data + ETH_HLEN); - struct ipv6hdr *ipv6 = (struct ipv6hdr *)(skb->data + ETH_HLEN); + struct ethhdr *eth = (struct ethhdr *)(skb->data); + struct iphdr *ipv4; + struct ipv6hdr *ipv6; struct tcphdr *tcp; + int network_depth = 0; + __be16 proto; + u16 tot_len; u8 l4_hdr_type = get_cqe_l4_hdr_type(cqe); int tcp_ack = ((CQE_L4_HDR_TYPE_TCP_ACK_NO_DATA == l4_hdr_type) || (CQE_L4_HDR_TYPE_TCP_ACK_AND_DATA == l4_hdr_type)); - u16 tot_len = cqe_bcnt - ETH_HLEN; + skb->mac_len = ETH_HLEN; + proto = __vlan_get_protocol(skb, eth->h_proto, &network_depth); - if (eth->h_proto == htons(ETH_P_IP)) { - tcp = (struct tcphdr *)(skb->data + ETH_HLEN + + ipv4 = (struct iphdr *)(skb->data + network_depth); + ipv6 = (struct ipv6hdr *)(skb->data + network_depth); + tot_len = cqe_bcnt - network_depth; + + if (proto == htons(ETH_P_IP)) { + tcp = (struct tcphdr *)(skb->data + network_depth + sizeof(struct iphdr)); ipv6 = NULL; skb_shinfo(skb)->gso_type = SKB_GSO_TCPV4; } else { - tcp = (struct tcphdr
*)(skb->data + ETH_HLEN + + tcp = (struct tcphdr *)(skb->data + network_depth + sizeof(struct ipv6hdr)); ipv4 = NULL; skb_shinfo(skb)->gso_type = SKB_GSO_TCPV6; -- 2.7.4
[PATCH net 2/5] net/mlx5e: Move an_disable_cap bit to a new position
From: Bodong Wang The previous an_disable_cap position, bit 31, is deprecated for use in the driver with newer firmware. New firmware will advertise the same capability in bit 29. The old capability didn't allow setting more than one protocol for a specific speed when autoneg is off, while newer firmware will allow this, and it is indicated in the new capability location. Signed-off-by: Bodong Wang Signed-off-by: Saeed Mahameed --- include/linux/mlx5/mlx5_ifc.h | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h index 21bc455..d1f9a58 100644 --- a/include/linux/mlx5/mlx5_ifc.h +++ b/include/linux/mlx5/mlx5_ifc.h @@ -6710,9 +6710,10 @@ struct mlx5_ifc_pude_reg_bits { }; struct mlx5_ifc_ptys_reg_bits { - u8 an_disable_cap[0x1]; + u8 reserved_at_0[0x1]; u8 an_disable_admin[0x1]; - u8 reserved_at_2[0x6]; + u8 an_disable_cap[0x1]; + u8 reserved_at_3[0x5]; u8 local_port[0x8]; u8 reserved_at_10[0xd]; u8 proto_mask[0x3]; -- 2.7.4
[PATCH net 3/5] net/mlx5e: Prevent casting overflow
From: Gal Pressman

On 64 bit architectures unsigned long is longer than u32; casting a u32 pointer to an unsigned long pointer and reading through it will access memory beyond the variable. We need to first declare an unsigned long variable and assign the wanted value to it.

Fixes: 665bc53969d7 ('net/mlx5e: Use new ethtool get/set link ksettings API')
Signed-off-by: Gal Pressman
Signed-off-by: Saeed Mahameed
---
 drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index d0cf8fa..98e1a4a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -659,9 +659,10 @@ out:
 static void ptys2ethtool_supported_link(unsigned long *supported_modes,
                                        u32 eth_proto_cap)
 {
+        unsigned long proto_cap = eth_proto_cap;
         int proto;

-        for_each_set_bit(proto, (unsigned long *)&eth_proto_cap, MLX5E_LINK_MODES_NUMBER)
+        for_each_set_bit(proto, &proto_cap, MLX5E_LINK_MODES_NUMBER)
                 bitmap_or(supported_modes, supported_modes,
                           ptys2ethtool_table[proto].supported,
                           __ETHTOOL_LINK_MODE_MASK_NBITS);
@@ -670,9 +671,10 @@ static void ptys2ethtool_supported_link(unsigned long *supported_modes,
 static void ptys2ethtool_adver_link(unsigned long *advertising_modes,
                                     u32 eth_proto_cap)
 {
+        unsigned long proto_cap = eth_proto_cap;
         int proto;

-        for_each_set_bit(proto, (unsigned long *)&eth_proto_cap, MLX5E_LINK_MODES_NUMBER)
+        for_each_set_bit(proto, &proto_cap, MLX5E_LINK_MODES_NUMBER)
                 bitmap_or(advertising_modes, advertising_modes,
                           ptys2ethtool_table[proto].advertised,
                           __ETHTOOL_LINK_MODE_MASK_NBITS);
-- 
2.7.4
[PATCH net 1/5] net/mlx5e: Fix xmit_more counter race issue
From: Tariq Toukan

Update the xmit_more counter before notifying the HW, to prevent a possible use-after-free of the skb.

Fixes: c8cf78fe100b ("net/mlx5e: Add ethtool counter for TX xmit_more")
Signed-off-by: Tariq Toukan
Signed-off-by: Saeed Mahameed
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index 988eca9..eb0e725 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -356,6 +356,7 @@ static netdev_tx_t mlx5e_sq_xmit(struct mlx5e_sq *sq, struct sk_buff *skb)
                 sq->stats.stopped++;
         }

+        sq->stats.xmit_more += skb->xmit_more;
         if (!skb->xmit_more || netif_xmit_stopped(sq->txq)) {
                 int bf_sz = 0;

@@ -375,7 +376,6 @@ static netdev_tx_t mlx5e_sq_xmit(struct mlx5e_sq *sq, struct sk_buff *skb)

         sq->stats.packets++;
         sq->stats.bytes += num_bytes;
-        sq->stats.xmit_more += skb->xmit_more;
         return NETDEV_TX_OK;

 dma_unmap_wqe_err:
-- 
2.7.4
[PATCH net-next v3] net: inet: diag: expose the socket mark to privileged processes.
This adds the capability for a process that has CAP_NET_ADMIN on a socket to see the socket mark in socket dumps.

Commit a52e95abf772 ("net: diag: allow socket bytecode filters to match socket marks") recently gave privileged processes the ability to filter socket dumps based on mark. This patch is complementary: it ensures that the mark is also passed to userspace in the socket's netlink attributes. It is useful for tools like ss which display information about sockets.

Tested: https://android-review.googlesource.com/270210
Signed-off-by: Lorenzo Colitti
---
 include/linux/inet_diag.h      |  4 ++--
 include/uapi/linux/inet_diag.h |  1 +
 net/ipv4/inet_diag.c           | 49 ++++++++++++++++++++++++-------------
 net/ipv4/udp_diag.c            | 10 ++++----
 net/sctp/sctp_diag.c           | 20 ++++++++-------
 5 files changed, 56 insertions(+), 28 deletions(-)

diff --git a/include/linux/inet_diag.h b/include/linux/inet_diag.h
index feb04ea..65da430 100644
--- a/include/linux/inet_diag.h
+++ b/include/linux/inet_diag.h
@@ -37,7 +37,7 @@ int inet_sk_diag_fill(struct sock *sk, struct inet_connection_sock *icsk,
                       struct sk_buff *skb, const struct inet_diag_req_v2 *req,
                       struct user_namespace *user_ns,
                       u32 pid, u32 seq, u16 nlmsg_flags,
-                      const struct nlmsghdr *unlh);
+                      const struct nlmsghdr *unlh, bool net_admin);
 void inet_diag_dump_icsk(struct inet_hashinfo *h, struct sk_buff *skb,
                          struct netlink_callback *cb,
                          const struct inet_diag_req_v2 *r,
@@ -56,7 +56,7 @@ void inet_diag_msg_common_fill(struct inet_diag_msg *r, struct sock *sk);

 int inet_diag_msg_attrs_fill(struct sock *sk, struct sk_buff *skb,
                              struct inet_diag_msg *r, int ext,
-                             struct user_namespace *user_ns);
+                             struct user_namespace *user_ns, bool net_admin);

 extern int inet_diag_register(const struct inet_diag_handler *handler);
 extern void inet_diag_unregister(const struct inet_diag_handler *handler);
diff --git a/include/uapi/linux/inet_diag.h b/include/uapi/linux/inet_diag.h
index 5581206..b5c366f 100644
--- a/include/uapi/linux/inet_diag.h
+++ b/include/uapi/linux/inet_diag.h
@@ -123,6 +123,7 @@ enum {
         INET_DIAG_LOCALS,
         INET_DIAG_PEERS,
         INET_DIAG_PAD,
+        INET_DIAG_MARK,
         __INET_DIAG_MAX,
 };

diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index abfbe49..e4d16fc 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -99,6 +99,7 @@ static size_t inet_sk_attr_size(void)
                 + nla_total_size(1) /* INET_DIAG_SHUTDOWN */
                 + nla_total_size(1) /* INET_DIAG_TOS */
                 + nla_total_size(1) /* INET_DIAG_TCLASS */
+                + nla_total_size(4) /* INET_DIAG_MARK */
                 + nla_total_size(sizeof(struct inet_diag_meminfo))
                 + nla_total_size(sizeof(struct inet_diag_msg))
                 + nla_total_size(SK_MEMINFO_VARS * sizeof(u32))
@@ -109,7 +110,8 @@ static size_t inet_sk_attr_size(void)

 int inet_diag_msg_attrs_fill(struct sock *sk, struct sk_buff *skb,
                              struct inet_diag_msg *r, int ext,
-                             struct user_namespace *user_ns)
+                             struct user_namespace *user_ns,
+                             bool net_admin)
 {
         const struct inet_sock *inet = inet_sk(sk);

@@ -136,6 +138,9 @@ int inet_diag_msg_attrs_fill(struct sock *sk, struct sk_buff *skb,
         }
 #endif

+        if (net_admin && nla_put_u32(skb, INET_DIAG_MARK, sk->sk_mark))
+                goto errout;
+
         r->idiag_uid = from_kuid_munged(user_ns, sock_i_uid(sk));
         r->idiag_inode = sock_i_ino(sk);

@@ -149,7 +154,8 @@ int inet_sk_diag_fill(struct sock *sk, struct inet_connection_sock *icsk,
                       struct sk_buff *skb, const struct inet_diag_req_v2 *req,
                       struct user_namespace *user_ns,
                       u32 portid, u32 seq, u16 nlmsg_flags,
-                      const struct nlmsghdr *unlh)
+                      const struct nlmsghdr *unlh,
+                      bool net_admin)
 {
         const struct tcp_congestion_ops *ca_ops;
         const struct inet_diag_handler *handler;
@@ -175,7 +181,7 @@ int inet_sk_diag_fill(struct sock *sk, struct inet_connection_sock *icsk,
         r->idiag_timer = 0;
         r->idiag_retrans = 0;

-        if (inet_diag_msg_attrs_fill(sk, skb, r, ext, user_ns))
+        if (inet_diag_msg_attrs_fill(sk, skb, r, ext, user_ns, net_admin))
                 goto errout;

         if (ext & (1 << (INET_DIAG_MEMINFO - 1))) {
@@ -274,10 +280,11 @@ static int inet_csk_diag_fill(struct sock *sk,
                               const struct inet_diag_req_v2 *req,
                               struct user_namespace *user_ns,
Re: [PATCH net-next] net: inet: diag: expose the socket mark to privileged processes.
On Wed, Sep 7, 2016 at 8:20 PM, kbuild test robot wrote:
> [auto build test ERROR on net-next/master]
>
> net/sctp/sctp_diag.c: In function 'inet_sctp_diag_fill':
>> net/sctp/sctp_diag.c:136:6: error: too few arguments to function 'inet_diag_msg_attrs_fill'
>      if (inet_diag_msg_attrs_fill(sk, skb, r, ext, user_ns))
>          ^~~~

Fixed in v3, and unit tests updated with basic SCTP tests as well.
[PATCH net-next 2/2] rxrpc: Add tracepoint for working out where aborts happen
Add a tracepoint for working out where local aborts happen. Each tracepoint call is labelled with a 3-letter code so that they can be distinguished - and the DATA sequence number is added too where available.

rxrpc_kernel_abort_call() also takes a 3-letter code so that AFS can indicate the circumstances when it aborts a call.

Signed-off-by: David Howells
---
 fs/afs/rxrpc.c               |  17 +++--
 include/net/af_rxrpc.h       |   3 +
 include/trace/events/rxrpc.h |  29 ++++++++
 net/rxrpc/ar-internal.h      |  14 +++-
 net/rxrpc/call_event.c       |   7 +-
 net/rxrpc/call_object.c      |   2 -
 net/rxrpc/conn_event.c       |   6 ++
 net/rxrpc/input.c            |   7 +-
 net/rxrpc/insecure.c         |  19 ++---
 net/rxrpc/rxkad.c            | 108 ++++++++++++++++---------------
 net/rxrpc/sendmsg.c          |  18 ++---
 11 files changed, 132 insertions(+), 98 deletions(-)

diff --git a/fs/afs/rxrpc.c b/fs/afs/rxrpc.c
index 37608be52abd..53750dece80e 100644
--- a/fs/afs/rxrpc.c
+++ b/fs/afs/rxrpc.c
@@ -377,7 +377,7 @@ int afs_make_call(struct in_addr *addr, struct afs_call *call, gfp_t gfp,
         return wait_mode->wait(call);

 error_do_abort:
-        rxrpc_kernel_abort_call(afs_socket, rxcall, RX_USER_ABORT);
+        rxrpc_kernel_abort_call(afs_socket, rxcall, RX_USER_ABORT, -ret, "KSD");
 error_kill_call:
         afs_end_call(call);
         _leave(" = %d", ret);
@@ -425,12 +425,12 @@ static void afs_deliver_to_call(struct afs_call *call)
                 case -ENOTCONN:
                         abort_code = RX_CALL_DEAD;
                         rxrpc_kernel_abort_call(afs_socket, call->rxcall,
-                                                abort_code);
+                                                abort_code, -ret, "KNC");
                         goto do_abort;
                 case -ENOTSUPP:
                         abort_code = RX_INVALID_OPERATION;
                         rxrpc_kernel_abort_call(afs_socket, call->rxcall,
-                                                abort_code);
+                                                abort_code, -ret, "KIV");
                         goto do_abort;
                 case -ENODATA:
                 case -EBADMSG:
@@ -440,7 +440,7 @@ static void afs_deliver_to_call(struct afs_call *call)
                         if (call->state != AFS_CALL_AWAIT_REPLY)
                                 abort_code = RXGEN_SS_UNMARSHAL;
                         rxrpc_kernel_abort_call(afs_socket, call->rxcall,
-                                                abort_code);
+                                                abort_code, EBADMSG, "KUM");
                         goto do_abort;
                 }
         }
@@ -463,6 +463,7 @@ do_abort:
  */
 static int afs_wait_for_call_to_complete(struct afs_call *call)
 {
+        const char *abort_why;
         int ret;
         DECLARE_WAITQUEUE(myself, current);

@@ -481,9 +482,11 @@ static int afs_wait_for_call_to_complete(struct afs_call *call)
                         continue;
                 }

+                abort_why = "KWC";
                 ret = call->error;
                 if (call->state == AFS_CALL_COMPLETE)
                         break;
+                abort_why = "KWI";
                 ret = -EINTR;
                 if (signal_pending(current))
                         break;
@@ -497,7 +500,7 @@ static int afs_wait_for_call_to_complete(struct afs_call *call)
         if (call->state < AFS_CALL_COMPLETE) {
                 _debug("call incomplete");
                 rxrpc_kernel_abort_call(afs_socket, call->rxcall,
-                                        RX_CALL_DEAD);
+                                        RX_CALL_DEAD, -ret, abort_why);
         }

         _debug("call complete");
@@ -695,7 +698,7 @@ void afs_send_empty_reply(struct afs_call *call)
         case -ENOMEM:
                 _debug("oom");
                 rxrpc_kernel_abort_call(afs_socket, call->rxcall,
-                                        RX_USER_ABORT);
+                                        RX_USER_ABORT, ENOMEM, "KOO");
         default:
                 afs_end_call(call);
                 _leave(" [error]");
@@ -734,7 +737,7 @@ void afs_send_simple_reply(struct afs_call *call, const void *buf, size_t len)
         if (n == -ENOMEM) {
                 _debug("oom");
                 rxrpc_kernel_abort_call(afs_socket, call->rxcall,
-                                        RX_USER_ABORT);
+                                        RX_USER_ABORT, ENOMEM, "KOO");
         }
         afs_end_call(call);
         _leave(" [error]");
diff --git a/include/net/af_rxrpc.h b/include/net/af_rxrpc.h
index b4b6a3664dda..08ed8729126c 100644
--- a/include/net/af_rxrpc.h
+++ b/include/net/af_rxrpc.h
@@ -35,7 +35,8 @@ int rxrpc_kernel_send_data(struct socket *, struct rxrpc_call *,
                            struct msghdr *, size_t);
 int rxrpc_kernel_recv_data(struct socket *, struct rxrpc_call *,
[PATCH net-next 0/2] rxrpc: Local abort tracepoint
Here are two patches. They need to be applied on top of the just-posted call refcount overhaul patch:

 (1) Fix the return value of some call completion helpers.

 (2) Add a tracepoint that allows local aborts to be debugged.

The patches can be found here also:

	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-rewrite

Tagged thusly:

	git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
	rxrpc-rewrite-20160904-2

David
---
David Howells (2):
      rxrpc: Fix returns of call completion helpers
      rxrpc: Add tracepoint for working out where aborts happen

 fs/afs/rxrpc.c               |  17 +++--
 include/net/af_rxrpc.h       |   3 +
 include/trace/events/rxrpc.h |  29 ++++++++
 net/rxrpc/ar-internal.h      |  27 +++++---
 net/rxrpc/call_event.c       |   7 +-
 net/rxrpc/call_object.c      |   2 -
 net/rxrpc/conn_event.c       |   6 ++
 net/rxrpc/input.c            |   7 +-
 net/rxrpc/insecure.c         |  19 ++---
 net/rxrpc/rxkad.c            | 108 ++++++++++++++++---------------
 net/rxrpc/sendmsg.c          |  18 ++---

 11 files changed, 140 insertions(+), 103 deletions(-)
[PATCH net-next 1/2] rxrpc: Fix returns of call completion helpers
rxrpc_set_call_completion() returns bool, not int, so the ret variable should match this.

rxrpc_call_completed() and __rxrpc_call_completed() should return the value of rxrpc_set_call_completion().

Signed-off-by: David Howells
---
 net/rxrpc/ar-internal.h | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h
index 3addda4bfa6b..0353399792b6 100644
--- a/net/rxrpc/ar-internal.h
+++ b/net/rxrpc/ar-internal.h
@@ -608,7 +608,7 @@ static inline bool rxrpc_set_call_completion(struct rxrpc_call *call,
                                              u32 abort_code,
                                              int error)
 {
-        int ret;
+        bool ret;

         write_lock_bh(&call->state_lock);
         ret = __rxrpc_set_call_completion(call, compl, abort_code, error);
@@ -619,16 +619,19 @@ static inline bool rxrpc_set_call_completion(struct rxrpc_call *call,
 /*
  * Record that a call successfully completed.
  */
-static inline void __rxrpc_call_completed(struct rxrpc_call *call)
+static inline bool __rxrpc_call_completed(struct rxrpc_call *call)
 {
-        __rxrpc_set_call_completion(call, RXRPC_CALL_SUCCEEDED, 0, 0);
+        return __rxrpc_set_call_completion(call, RXRPC_CALL_SUCCEEDED, 0, 0);
 }

-static inline void rxrpc_call_completed(struct rxrpc_call *call)
+static inline bool rxrpc_call_completed(struct rxrpc_call *call)
 {
+        bool ret;
+
         write_lock_bh(&call->state_lock);
-        __rxrpc_call_completed(call);
+        ret = __rxrpc_call_completed(call);
         write_unlock_bh(&call->state_lock);
+        return ret;
 }

 /*
[PATCH net] tcp: fastopen: avoid negative sk_forward_alloc
From: Eric Dumazet

When DATA and/or FIN are carried in a SYN/ACK message or SYN message, we append an skb in the socket receive queue, but we forget to call sk_forced_mem_schedule(). The effect is that the socket has a negative sk->sk_forward_alloc as long as the message is not read by the application.

Josh Hunt fixed a similar issue in commit d22e15371811 ("tcp: fix tcp fin memory accounting")

Fixes: 168a8f58059a ("tcp: TCP Fast Open Server - main code path")
Signed-off-by: Eric Dumazet
---
 net/ipv4/tcp_fastopen.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
index 62a5751d4fe1..4e777a3243f9 100644
--- a/net/ipv4/tcp_fastopen.c
+++ b/net/ipv4/tcp_fastopen.c
@@ -150,6 +150,7 @@ void tcp_fastopen_add_skb(struct sock *sk, struct sk_buff *skb)
         tp->segs_in = 0;
         tcp_segs_in(tp, skb);
         __skb_pull(skb, tcp_hdrlen(skb));
+        sk_forced_mem_schedule(sk, skb->truesize);
         skb_set_owner_r(skb, sk);

         TCP_SKB_CB(skb)->seq++;
Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:
> On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet wrote:
> > On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:
> >> Previously we rang XDP SQ doorbell on every forwarded XDP packet.
> >>
> >> Here we introduce a xmit more like mechanism that will queue up more
> >> than one packet into SQ (up to RX napi budget) w/o notifying the hardware.
> >>
> >> Once RX napi budget is consumed and we exit napi RX loop, we will
> >> flush (doorbell) all XDP looped packets in case there are such.
> >
> > Why does this idea depend on XDP?
> >
> > It looks like we could apply it to any driver having one IRQ servicing
> > one RX and one TX, without XDP being involved.
>
> Yes, but it is more complicated than the XDP case, where the RX ring posts
> the TX descriptors and, once done, the RX ring hits the doorbell once for
> all the TX descriptors it posted - and that is the only possible place to
> hit a doorbell for the XDP TX ring.
>
> For regular TX and RX rings sharing the same IRQ, there is no such simple
> connection between them, and hitting a doorbell from RX ring napi would
> race with the xmit ndo function of the TX ring.
>
> How do you synchronize in such a case?
> Isn't the existing xmit more mechanism sufficient?

Only if a qdisc is present and pressure is high enough.

But in a forwarding setup, we likely receive at a lower rate than the NIC can transmit.

A simple cmpxchg could be used to synchronize the thing, if we really cared about doorbell cost. (Ie if the cost of this cmpxchg() is way smaller than doorbell one)
[PATCH net-next 0/1] rxrpc: Local abort tracepoint
Here's a patch that adds a tracepoint that allows local aborts to be debugged. This needs to be applied on top of the just-posted call refcount overhaul patch.

The patches can be found here also:

	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-rewrite

Tagged thusly:

	git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
	rxrpc-rewrite-20160904-2

David
---
David Howells (1):
      rxrpc: Add tracepoint for working out where aborts happen

 fs/afs/rxrpc.c               |  17 +++--
 include/net/af_rxrpc.h       |   3 +
 include/trace/events/rxrpc.h |  29 ++++++++
 net/rxrpc/ar-internal.h      |  27 +++++---
 net/rxrpc/call_event.c       |   7 +-
 net/rxrpc/call_object.c      |   2 -
 net/rxrpc/conn_event.c       |   6 ++
 net/rxrpc/input.c            |   7 +-
 net/rxrpc/insecure.c         |  19 ++---
 net/rxrpc/rxkad.c            | 108 ++++++++++++++++---------------
 net/rxrpc/sendmsg.c          |  18 ++---

 11 files changed, 140 insertions(+), 103 deletions(-)
[PATCH net-next 4/8] rxrpc: Use call->peer rather than call->conn->params.peer
Use call->peer rather than call->conn->params.peer to avoid the possibility of call->conn being NULL and, whilst we're at it, check it for NULL before we access it.

Signed-off-by: David Howells
---
 net/rxrpc/call_object.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/net/rxrpc/call_object.c b/net/rxrpc/call_object.c
index 3166b5222435..060ddc32a85e 100644
--- a/net/rxrpc/call_object.c
+++ b/net/rxrpc/call_object.c
@@ -514,9 +514,11 @@ void rxrpc_release_call(struct rxrpc_call *call)
          */
         _debug("RELEASE CALL %p (%d CONN %p)", call, call->debug_id, conn);

-        spin_lock(&conn->params.peer->lock);
-        hlist_del_init(&call->error_link);
-        spin_unlock(&conn->params.peer->lock);
+        if (call->peer) {
+                spin_lock(&call->peer->lock);
+                hlist_del_init(&call->error_link);
+                spin_unlock(&call->peer->lock);
+        }

         write_lock_bh(&rx->call_lock);
         if (!list_empty(&call->accept_link)) {
[PATCH net-next 1/8] rxrpc: Whitespace cleanup
Remove some whitespace.

Signed-off-by: David Howells
---
 net/rxrpc/ar-internal.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h
index bb342f5fe7e4..ad702f9f8d1f 100644
--- a/net/rxrpc/ar-internal.h
+++ b/net/rxrpc/ar-internal.h
@@ -720,7 +720,6 @@ static inline void rxrpc_put_connection(struct rxrpc_connection *conn)
         }
 }

-
 static inline bool rxrpc_queue_conn(struct rxrpc_connection *conn)
 {
         if (!rxrpc_get_connection_maybe(conn))
@@ -879,7 +878,7 @@ int __init rxrpc_init_security(void);
 void rxrpc_exit_security(void);
 int rxrpc_init_client_conn_security(struct rxrpc_connection *);
 int rxrpc_init_server_conn_security(struct rxrpc_connection *);
- 
+
 /*
  * sendmsg.c
  */