Re: [PATCH 0/3] IB/ipoib: Use dev_port to disambiguate port numbers
On Wed, Aug 29, 2018 at 12:01:14AM +0300, Arseny Maslennikov wrote: > Pre-3.15 userspace had trouble distinguishing different ports > of a NIC on a single PCI bus/device/function. To solve this, > a sysfs field `dev_port' was introduced quite a while ago > (commit v3.14-rc3-739-g3f85944fe207), and some relevant device > drivers were fixed to use it, but not in case of IPoIB. > > The convention for some reason never got documented in the kernel, but > was immediately adopted by userspace (notably udev[1][2], biosdevname[3]) > > 3/3 documents the sysfs field — that's why I'm CC-ing netdev. > > This series was tested on and applies to 4.19-rc1. > > [1] https://lists.freedesktop.org/archives/systemd-devel/2014-June/020788.html > [2] https://lists.freedesktop.org/archives/systemd-devel/2014-July/020804.html > [3] > https://github.com/CloudAutomationNTools/biosdevname/blob/c795d51dd93a5309652f0d635f12a3ecfabfaa72/src/eths.c#L38 > > Arseny Maslennikov (3): > IB/ipoib: Use dev_port to expose network interface port numbers > IB/ipoib: Stop using dev_id to expose port numbers I completely agree with previous Yuval's comment, it makes no sense to start separate commits for every line. Please decide what is best and right behavior and do it, instead of pushing it up to be the maintainer's problem. Thanks > Documentation/ABI: document /sys/class/net/*/dev_port > > Documentation/ABI/testing/sysfs-class-net | 10 ++ > drivers/infiniband/ulp/ipoib/ipoib_main.c | 2 +- > 2 files changed, 11 insertions(+), 1 deletion(-) > > -- > 2.18.0 > signature.asc Description: PGP signature
Re: [PATCH] rtnetlink: expose value from SET_NETDEV_DEVTYPE via IFLA_DEVTYPE attribute
Wed, Aug 29, 2018 at 05:24:28PM CEST, step...@networkplumber.org wrote: >On Wed, 29 Aug 2018 09:18:55 +0200 >Jiri Pirko wrote: > >> Tue, Aug 28, 2018 at 10:58:11PM CEST, mar...@holtmann.org wrote: >> >The name value from SET_NETDEV_DEVTYPE only ended up in the uevent sysfs >> >file as DEVTYPE= information. To avoid any kind of race conditions >> >between netlink messages and reading from sysfs, it is useful to add the >> >same string as new IFLA_DEVTYPE attribute included in the RTM_NEWLINK >> >messages. >> > >> >For network managing daemons that have to classify ARPHRD_ETHER network >> >devices into different types (like Wireless LAN, Bluetooth etc.), this >> >avoids the extra round trip to sysfs and parsing of the uevent file. >> > >> >Signed-off-by: Marcel Holtmann >> >--- >> > include/uapi/linux/if_link.h | 2 ++ >> > net/core/rtnetlink.c | 12 >> > 2 files changed, 14 insertions(+) >> > >> >diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h >> >index 43391e2d1153..781294972bb4 100644 >> >--- a/include/uapi/linux/if_link.h >> >+++ b/include/uapi/linux/if_link.h >> >@@ -166,6 +166,8 @@ enum { >> >IFLA_NEW_IFINDEX, >> >IFLA_MIN_MTU, >> >IFLA_MAX_MTU, >> >+ IFLA_DEVTYPE, /* Name value from SET_NETDEV_DEVTYPE */ >> >> This is not something netdev-related. dev->dev.type is struct device_type. >> This is a generic "device" thing. Incorrect to expose over >> netdev-specific API. Please use "device" API for this. > >There is no device API in netlink. The whole point of this patch is to >do it in one message. It might be a performance optimization, but I can't >see how it could be a race condition. Devices set type before registering. I understand. My point is to avoid exposing generic "struct device" values over rtnetlink. It mixes apples and oranges.
Re: [PATCH] rtnetlink: expose value from SET_NETDEV_DEVTYPE via IFLA_DEVTYPE attribute
Wed, Aug 29, 2018 at 05:23:58PM CEST, mar...@holtmann.org wrote: >Hi Jiri, > >>> The name value from SET_NETDEV_DEVTYPE only ended up in the uevent sysfs >>> file as DEVTYPE= information. To avoid any kind of race conditions >>> between netlink messages and reading from sysfs, it is useful to add the >>> same string as new IFLA_DEVTYPE attribute included in the RTM_NEWLINK >>> messages. >>> >>> For network managing daemons that have to classify ARPHRD_ETHER network >>> devices into different types (like Wireless LAN, Bluetooth etc.), this >>> avoids the extra round trip to sysfs and parsing of the uevent file. >>> >>> Signed-off-by: Marcel Holtmann >>> --- >>> include/uapi/linux/if_link.h | 2 ++ >>> net/core/rtnetlink.c | 12 >>> 2 files changed, 14 insertions(+) >>> >>> diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h >>> index 43391e2d1153..781294972bb4 100644 >>> --- a/include/uapi/linux/if_link.h >>> +++ b/include/uapi/linux/if_link.h >>> @@ -166,6 +166,8 @@ enum { >>> IFLA_NEW_IFINDEX, >>> IFLA_MIN_MTU, >>> IFLA_MAX_MTU, >>> + IFLA_DEVTYPE, /* Name value from SET_NETDEV_DEVTYPE */ >> >> This is not something netdev-related. dev->dev.type is struct device_type. >> This is a generic "device" thing. Incorrect to expose over >> netdev-specific API. Please use "device" API for this. > >it is not just "device" related since this is a sub-classification of >ARPHRD_ETHER type as a wrote above. Don't get hang up that this information is >part of struct device. The dev->dev.type contains strings like "wlan", >"bluetooth", "wimax", "gadget" etc. so that you can tell what kind of Ethernet >card you have. I understand. But using dev->dev.type for that purpose seems like an abuse. If this is subtype of netdev, it should not be stored in parent device. I believe that you need to re-introduce this as a part of struct net_device - preferably an enum - and that you can expose over rtnl (and net-sysfs). > >We can revive the patches for adding this information as IFLA_INFO_KIND, but >last time they where reverted since NetworkManager did all the wrong decisions >based on that. That part is fixed now and we can put that back and declare the >sub-type classification of Ethernet device in two places. Or we just take the >information that we added a long time ago for exactly this sub-classification >and provide them to userspace via RTNL. IFLA_INFO_KIND serves for different purpose. It is for drivers that register rtnl_link_ops. I don't think it would be wise to mix it with this. Btw, could you wrap the sentences at 72 cols? Would be easier to read that way. > >Regards > >Marcel >
Re: [PATCH net v2 0/2] net_sched: reject unknown tcfa_action values
From: Paolo Abeni Date: Wed, 29 Aug 2018 10:22:32 +0200 > As agreed some time ago, this changeset reject unknown tcfa_action values, > instead of changing such values under the hood. > > A tdc test is included to verify the new behavior. > > v1 -> v2: > - helper is now static and renamed according to act_* convention > - updated extack message, according to the new behavior Series applied, thank you.
Re: [PATCH v2] net: mvpp2: initialize port of_node pointer
From: Baruch Siach Date: Wed, 29 Aug 2018 09:44:39 +0300 > Without a valid of_node in struct device we can't find the mvpp2 port > device by its DT node. Specifically, this breaks > of_find_net_device_by_node(). > > For example, the Armada 8040 based Clearfog GT-8K uses Marvell 88E6141 > switch connected to the _eth2 port: > > _mdio { > ... > > switch0: switch0@4 { > compatible = "marvell,mv88e6085"; > ... > > ports { > ... > > port@5 { > reg = <5>; > label = "cpu"; > ethernet = <_eth2>; > }; > }; > }; > }; > > Without this patch, dsa_register_switch() returns -EPROBE_DEFER because > of_find_net_device_by_node() can't find the device_node of the _eth2 > device. > > Reviewed-by: Andrew Lunn > Signed-off-by: Baruch Siach Applied.
Re: [PATCH][net-next] vxlan: reduce dirty cache line in vxlan_find_mac
From: Li RongQing Date: Wed, 29 Aug 2018 11:52:10 +0800 > vxlan_find_mac() unconditionally set f->used for every packet, > this causes a cache miss for every packet, since remote, hlist > and used of vxlan_fdb share the same cache line, which are > accessed when send every packets. > > so f->used is set only if not equal to jiffies, to reduce dirty > cache line times, this gives 3% speed-up with small packets. > > Signed-off-by: Zhang Yu > Signed-off-by: Li RongQing Applied.
Re: [PATCH 3/4] clk: x86: Stop marking clocks as CLK_IS_CRITICAL
Quoting Hans de Goede (2018-08-27 07:31:59) > Commit d31fd43c0f9a ("clk: x86: Do not gate clocks enabled by the > firmware"), which added the code to mark clocks as CLK_IS_CRITICAL, causes > all unclaimed PMC clocks on Cherry Trail devices to be on all the time, > resulting on the device not being able to reach S0i3 when suspended. > > The reason for this commit is that on some Bay Trail / Cherry Trail devices > the r8169 ethernet controller uses pmc_plt_clk_4. Now that the clk-pmc-atom > driver exports an "ether_clk" alias for pmc_plt_clk_4 and the r8169 driver > has been modified to get and enable this clock (if present) the marking of > the clocks as CLK_IS_CRITICAL is no longer necessary. > > This commit removes the CLK_IS_CRITICAL marking, fixing Cherry Trail > devices not being able to reach S0i3 greatly decreasing their battery > drain when suspended. > > Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=193891#c102 > Cc: Johannes Stezenbach > Cc: Carlo Caione > Signed-off-by: Hans de Goede > --- Acked-by: Stephen Boyd
Re: [PATCH net-next 0/4] liquidio: improve soft command/response handling
From: Felix Manlunas Date: Tue, 28 Aug 2018 18:50:58 -0700 > From: Weilin Chang > > Change soft command handling to fix the possible race condition when the > process handles a response of a soft command that was already freed by an > application which got timeout for this request. Series applied, thank you.
Re: [net-next 02/15] i40e: move ethtool stats boiler plate code to i40e_ethtool_stats.h
From: Jeff Kirsher Date: Wed, 29 Aug 2018 15:48:21 -0700 > diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool_stats.h > b/drivers/net/ethernet/intel/i40e/i40e_ethtool_stats.h > new file mode 100644 > index ..0290ade7494b > --- /dev/null > +++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool_stats.h > @@ -0,0 +1,221 @@ ... > +/** > + * __i40e_add_stat_strings - copy stat strings into ethtool buffer > + * @p: ethtool supplied buffer > + * @stats: stat definitions array > + * @size: size of the stats array > + * > + * Format and copy the strings described by stats into the buffer pointed at > + * by p. > + **/ > +static void __i40e_add_stat_strings(u8 **p, const struct i40e_stats stats[], > + const unsigned int size, ...) Need to be marked inline.
Re: [PATCH net-next,v5] net/tls: Calculate nsg for zerocopy path without skb_cow_data.
From: Doron Roberts-Kedes Date: Tue, 28 Aug 2018 16:33:57 -0700 > decrypt_skb fails if the number of sg elements required to map it > is greater than MAX_SKB_FRAGS. nsg must always be calculated, but > skb_cow_data adds unnecessary memcpy's for the zerocopy case. > > The new function skb_nsg calculates the number of scatterlist elements > required to map the skb without the extra overhead of skb_cow_data. > This patch reduces memcpy by 50% on my encrypted NBD benchmarks. > > Reported-by: Vakul Garg > Reviewed-by: Vakul Garg > Tested-by: Vakul Garg > Signed-off-by: Doron Roberts-Kedes Applied, thank you.
Re: [PATCH net-next 1/2] ip: fail fast on IP defrag errors
From: Peter Oskolkov Date: Tue, 28 Aug 2018 11:36:19 -0700 > The current behavior of IP defragmentation is inconsistent: > - some overlapping/wrong length fragments are dropped without > affecting the queue; > - most overlapping fragments cause the whole frag queue to be dropped. > > This patch brings consistency: if a bad fragment is detected, > the whole frag queue is dropped. Two major benefits: > - fail fast: corrupted frag queues are cleared immediately, instead of > by timeout; > - testing of overlapping fragments is now much easier: any kind of > random fragment length mutation now leads to the frag queue being > discarded (IP packet dropped); before this patch, some overlaps were > "corrected", with tests not seeing expected packet drops. > > Note that in one case (see "if (end&7)" conditional) the current > behavior is preserved as there are concerns that this could be > legitimate padding. > > Signed-off-by: Peter Oskolkov > Reviewed-by: Eric Dumazet > Reviewed-by: Willem de Bruijn Applied.
Re: [PATCH net-next 2/2] selftests/net: add ip_defrag selftest
From: Peter Oskolkov Date: Tue, 28 Aug 2018 11:36:20 -0700 > This test creates a raw IPv4 socket, fragments a largish UDP > datagram and sends the fragments out of order. > > Then repeats in a loop with different message and fragment lengths. > > Then does the same with overlapping fragments (with overlapping > fragments the expectation is that the recv times out). > > Tested: > > root@# time ./ip_defrag.sh > ipv4 defrag > PASS > ipv4 defrag with overlaps > PASS > > real1m7.679s > user0m0.628s > sys 0m2.242s > > A similar test for IPv6 is to follow. > > Signed-off-by: Peter Oskolkov > Reviewed-by: Willem de Bruijn Applied.
Re: [PATCH net-next] liquidio: fix race condition in instruction completion processing
From: Felix Manlunas Date: Tue, 28 Aug 2018 11:32:55 -0700 > From: Rick Farrington > > In lio_enable_irq, the pkt_in_done count register was being cleared to > zero. However, there could be some completed instructions which were not > yet processed due to budget and limit constraints. > So, only write this register with the number of actual completions > that were processed. > > Signed-off-by: Rick Farrington > Signed-off-by: Felix Manlunas Applied.
Re: [PATCH net-next] liquidio: remove unnecessary delay when processing IQ responses
From: Felix Manlunas Date: Tue, 28 Aug 2018 11:19:54 -0700 > From: Rick Farrington > > Signed-off-by: Rick Farrington > Signed-off-by: Felix Manlunas Applied.
Re: [PATCH net-next] net: thunderbolt: Convert to use SPDX identifier
From: Mika Westerberg Date: Tue, 28 Aug 2018 19:58:43 +0300 > This gets rid of the licence boilerblate in favor of SPDX identifier > which only takes a single line comment. > > Signed-off-by: Mika Westerberg Applied.
Re: [PATCH net 0/3] ipv6: fix error path of inet6_init()
From: Sabrina Dubroca Date: Tue, 28 Aug 2018 13:40:50 +0200 > The error path of inet6_init() can trigger multiple kernel panics, > mostly due to wrong ordering of cleanups. This series fixes those > issues. Series applied, thank you.
Re: [RFC PATCH v2 bpf-next 0/2] verifier liveness simplification
On Wed, Aug 22, 2018 at 08:00:46PM +0100, Edward Cree wrote: > The first patch is a simplification of register liveness tracking by using > a separate parentage chain for each register and stack slot, thus avoiding > the need for logic to handle callee-saved registers when applying read > marks. In the future this idea may be extended to form use-def chains. > The second patch adds information about misc/zero data on the stack to the > state dumps emitted to the log at various points; this information was > found essential in debugging the first patch, and may be useful elsewhere. I think this set is a great improvement in liveness tracking, so depsite seeing the issues I applied it to bpf-next. I think it's a better base to continue debugging. In particular: 1. we have instability issue in the verifier. from time to time the verifier goes to process extra 7 instructions on one of the cilium tests. This was happening before and after this set. 2. there is a nice improvement in number of processed insns with this set, but the difference I cannot explain, hence it has to debugged. In theory the liveness rewrite shouldn't cause the difference in processed insns. If not for the issue 1 I would argue that the issue 2 means that the set has to be debugged before going in, but since the verifier is unstable it's better to debug from this base with this patch set applied (because it greatly simplifies liveness and adds additional debug in patch2) and once we figure out the issue 1, I hope, the issue 2 will be resolved automatically. The numbers on cilium bpf programs: before1before2 after1 after2 bpf_lb-DLB_L3.o 2003 2003 20032003 bpf_lb-DLB_L4.o 3173 3173 31733173 bpf_lb-DUNKNOWN.o 1080 1080 10801080 bpf_lxc-DDROP_ALL.o 29587 29587 29587 29587 bpf_lxc-DUNKNOWN.o 37204 37211 36926 36933 bpf_netdev.o11283 11283 11188 11188 bpf_overlay.o 6679 6679 66796679 bpf_lcx_jit.o 39657 39657 39561 39561 notice how bpf_lxc-DUNKNOWN.o fluctuates with +7 before and after. That is issue 1. bpf_lxc-DUNKNOWN.o, bpf_netdev.o, and bpf_lcx_jit.o have small improvements. That is issue 2. To reproduce above numbers clone this repo: https://github.com/4ast/bpf_cilium_test and run .sh. The .o files in there are pretty old cilium bpf programs. I kept them frozen and didn't recompile for more than a year to keep stable base line and track the progress of the verifier in 'processed insns'. Thanks
[PATCH net-next] net/ncsi: remove duplicated include from ncsi-netlink.c
Remove duplicated include. Signed-off-by: YueHaibing --- net/ncsi/ncsi-netlink.c | 1 - 1 file changed, 1 deletion(-) diff --git a/net/ncsi/ncsi-netlink.c b/net/ncsi/ncsi-netlink.c index 45f33d6..32cb775 100644 --- a/net/ncsi/ncsi-netlink.c +++ b/net/ncsi/ncsi-netlink.c @@ -12,7 +12,6 @@ #include #include #include -#include #include #include #include
Re: [PATCH 1/2] dt-bindings: net: cpsw: Document cpsw-phy-sel usage but prefer phandle
> In the long run cpsw should be really treated as an > interconnect instance with it's control module providing > standard Linux framework services such as clock / > regulator / phy / pinctrl / iio whatever for the other > modules. Some of us have been applying pressure for a new driver. This sounds like another argument for such a re-write. Andrew
Re: [PATCH net] net/sched: act_pedit: fix dump of extended layered op
From: Davide Caratti Date: Mon, 27 Aug 2018 22:56:22 +0200 > in the (rare) case of failure in nla_nest_start(), missing NULL checks in > tcf_pedit_key_ex_dump() can make the following command > > # tc action add action pedit ex munge ip ttl set 64 > > dereference a NULL pointer: ... > Like it's done for other TC actions, give up dumping pedit rules and return > an error if nla_nest_start() returns NULL. > > Fixes: 71d0ed7079df ("net/act_pedit: Support using offset relative to the > conventional network headers") > Signed-off-by: Davide Caratti Applied and queued up for -stable, thanks.
Re: [PATCH] r8169: set RxConfig after tx/rx is enabled for RTL8169sb/8110sb devices
From: Azat Khuzhin Date: Sun, 26 Aug 2018 17:03:09 +0300 > I have two Ethernet adapters: > r8169 :03:01.0 eth0: RTL8169sb/8110sb, 00:14:d1:14:2d:49, XID 1000, > IRQ 18 > r8169 :01:00.0 eth0: RTL8168e/8111e, 64:66:b3:11:14:5d, XID 2c20, > IRQ 30 > And after upgrading from linux 4.15 [1] to linux 4.18+ [2] RTL8169sb failed to > receive any packets. tcpdump shows a lot of checksum mismatch. > > [1]: a0f79386a4968b4925da6db2d1daffd0605a4402 > [2]: 0519359784328bfa92bf0931bf0cff3b58c16932 (4.19 merge window opened) > > I started bisecting and the found that [3] breaks it. According to [4]: > "For 8110S, 8110SB, and 8110SC series, the initial value of RxConfig > needs to be set after the tx/rx is enabled." > So I moved rtl_init_rxcfg() after enabling tx/rs and now my adapter works > (RTL8168e works too). > > [3]: 3559d81e76bfe3803e89f2e04cf6ef7ab4f3aace > [4]: e542a2269f232d61270ceddd42b73a4348dee2bb ("r8169: adjust the RxConfig > settings.") > > Also drop "rx" from rtl_set_rx_tx_config_registers(), since it does nothing > with it already. > > Fixes: 3559d81e76bfe3803e89f2e04cf6ef7ab4f3aace ("r8169: simplify > rtl_hw_start_8169") > > Cc: Heiner Kallweit > Cc: David S. Miller > Cc: netdev@vger.kernel.org > Cc: Realtek linux nic maintainers > Signed-off-by: Azat Khuzhin Applied and queued up for -stable.
Re: [Patch net] tipc: fix a missing rhashtable_walk_exit()
From: Cong Wang Date: Thu, 23 Aug 2018 16:19:44 -0700 > rhashtable_walk_exit() must be paired with rhashtable_walk_enter(). > > Fixes: 40f9f4397060 ("tipc: Fix tipc_sk_reinit race conditions") > Cc: Herbert Xu > Cc: Ying Xue > Signed-off-by: Cong Wang Applied and queued up for -stable, thanks Cong.
RE: [PATCH] hv_netvsc: Fix a deadlock by getting rtnl_lock earlier in netvsc_probe()
> From: David Miller > Sent: Wednesday, August 29, 2018 17:49 > > From: Dexuan Cui > Date: Wed, 22 Aug 2018 21:20:03 + > > > --- > > drivers/net/hyperv/netvsc_drv.c | 11 ++- > > 1 file changed, 10 insertions(+), 1 deletion(-) > > > > > > FYI: these are the related 3 paths which show the deadlock: > > This incredibly useful information belongs in the commit log > message, and therefore before the --- and signoffs. Hi David, I was afraid the call-traces are too detailed. :-) Can you please move the info to before the --- line? Or, should I resend the patch with the commit log updated? Thanks, -- Dexuan
Re: [PATCH net] vti6: remove !skb->ignore_df check from vti6_xmit()
From: Alexey Kodanev Date: Thu, 23 Aug 2018 19:49:54 +0300 > Before the commit d6990976af7c ("vti6: fix PMTU caching and reporting > on xmit") '!skb->ignore_df' check was always true because the function > skb_scrub_packet() was called before it, resetting ignore_df to zero. > > In the commit, skb_scrub_packet() was moved below, and now this check > can be false for the packet, e.g. when sending it in the two fragments, > this prevents successful PMTU updates in such case. The next attempts > to send the packet lead to the same tx error. Moreover, vti6 initial > MTU value relies on PMTU adjustments. > > This issue can be reproduced with the following LTP test script: > udp_ipsec_vti.sh -6 -p ah -m tunnel -s 2000 > > Fixes: ccd740cbc6e0 ("vti6: Add pmtu handling to vti6_xmit.") > Signed-off-by: Alexey Kodanev Applied and queued up for -stable, thank you.
Re: [PATCH 1/2] dt-bindings: net: cpsw: Document cpsw-phy-sel usage but prefer phandle
* Grygorii Strashko [180830 00:12]: > Hi Tony, > > On 08/29/2018 10:00 AM, Tony Lindgren wrote: > > The current cpsw usage for cpsw-phy-sel is undocumented but is used for > > all the boards using cpsw. And cpsw-phy-sel is not really a child of > > the cpsw device, it lives in the system control module instead. > > > > Let's document the existing usage, and improve it a bit where we prefer > > to use a phandle instead of a child device for it. That way we can > > properly describe the hardware in dts files for things like genpd. > > I'm ok with this series, but I really don't like cpsw-phy-sel in general. Yeah this binding predates any standards. This series only fixes the nasty issue of cpsw claiming a module as a child that's outside it's IO range. > It was introduced long time back and now I'm thinking about possibility to > replace it with > one of current generic interfaces - for example mux-controller. > Each port will control up to 3 muxes (port mode, idmode and rmii_ext_clk) and > > transform phy-mode => mux states. > What do you think? Sure a mux-controller here makes sense. > Another option is to use phy, but it'd be complicated. For the port muxes, how about a phy driver just using a pinctrl driver? In general, it seems cpsw is just an interconnect instance (L4_FAST) with a control module (CPSW_WR) and a pile of independent other modules. That's described nicely in am437x TRM chapter "2.1.4 L4 Fast Peripheral Memory Map". So from that point of view the binding reg entries right now are all wrong :) In the long run cpsw should be really treated as an interconnect instance with it's control module providing standard Linux framework services such as clock / regulator / phy / pinctrl / iio whatever for the other modules. Just my 2c based on looking at the interconnect, I'm not too familiar with cpsw otherwise. Regards, Tony
Re: [PATCH net] ebpf: fix bpf_msg_pull_data
On 08/29/2018 05:07 PM, Tushar Dave wrote: While doing some preliminary testing it is found that bpf helper bpf_msg_pull_data does not calculate the data and data_end offset correctly. Fix it! Fixes: 015632bb30da ("bpf: sk_msg program helper bpf_sk_msg_pull_data") Signed-off-by: Tushar Dave Acked-by: Sowmini Varadhan --- net/core/filter.c | 38 +- 1 file changed, 25 insertions(+), 13 deletions(-) diff --git a/net/core/filter.c b/net/core/filter.c index c25eb36..3eeb3d6 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -2285,7 +2285,7 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg) BPF_CALL_4(bpf_msg_pull_data, struct sk_msg_buff *, msg, u32, start, u32, end, u64, flags) { - unsigned int len = 0, offset = 0, copy = 0; + unsigned int len = 0, offset = 0, copy = 0, off = 0; struct scatterlist *sg = msg->sg_data; int first_sg, last_sg, i, shift; unsigned char *p, *to, *from; @@ -2299,22 +2299,30 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg) i = msg->sg_start; do { len = sg[i].length; - offset += len; if (start < offset + len) break; + offset += len; i++; if (i == MAX_SKB_FRAGS) i = 0; - } while (i != msg->sg_end); + } while (i <= msg->sg_end); + /* return error if start is out of range */ if (unlikely(start >= offset + len)) return -EINVAL; - if (!msg->sg_copy[i] && bytes <= len) - goto out; + /* return error if i is last entry in sglist and end is out of range */ + if (msg->sg_copy[i] && end > offset + len) + return -EINVAL; first_sg = i; + /* if i is not last entry in sg list and end (i.e start + bytes) is +* within this sg[i] then goto out and calculate data and data_end +*/ + if (!msg->sg_copy[i] && end <= offset + len) + goto out; + /* At this point we need to linearize multiple scatterlist * elements or a single shared page. Either way we need to * copy into a linear buffer exclusively owned by BPF. Then @@ -2330,9 +2338,14 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg) i++; if (i == MAX_SKB_FRAGS) i = 0; - if (bytes < copy) + if (end < copy) break; - } while (i != msg->sg_end); + } while (i <= msg->sg_end); + + /* return error if i is last entry in sglist and end is out of range */ + if (i > msg->sg_end && end > offset + copy) + return -EINVAL; + last_sg = i; if (unlikely(copy < end - start)) @@ -2342,23 +2355,22 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg) if (unlikely(!page)) return -ENOMEM; p = page_address(page); - offset = 0; i = first_sg; do { from = sg_virt([i]); len = sg[i].length; - to = p + offset; + to = p + off; memcpy(to, from, len); - offset += len; + off += len; sg[i].length = 0; put_page(sg_page([i])); i++; if (i == MAX_SKB_FRAGS) i = 0; - } while (i != last_sg); + } while (i < last_sg); sg[first_sg].length = copy; sg_set_page([first_sg], page, copy, 0); @@ -2380,7 +2392,7 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg) else move_from = i + shift; - if (move_from == msg->sg_end) + if (move_from > msg->sg_end) break; sg[i] = sg[move_from]; @@ -2396,7 +2408,7 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg) if (msg->sg_end < 0) msg->sg_end += MAX_SKB_FRAGS; out: - msg->data = sg_virt([i]) + start - offset; + msg->data = sg_virt([first_sg]) + start - offset; msg->data_end = msg->data + bytes; return 0; Please discard this patch. I just noticed that Daniel Borkmann sent some similar fixes for bpf_msg_pull_data. -Tushar
[PATCH net-next] net: dsa: mv88e6xxx: Share main switch IRQ
On some boards the interrupt can be shared between multiple devices. For example on Turris Mox the interrupt is shared between all switches. Signed-off-by: Marek Behun --- drivers/net/dsa/mv88e6xxx/chip.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c index 8da3d39e3218..b57f5403982a 100644 --- a/drivers/net/dsa/mv88e6xxx/chip.c +++ b/drivers/net/dsa/mv88e6xxx/chip.c @@ -434,7 +434,7 @@ static int mv88e6xxx_g1_irq_setup(struct mv88e6xxx_chip *chip) err = request_threaded_irq(chip->irq, NULL, mv88e6xxx_g1_irq_thread_fn, - IRQF_ONESHOT, + IRQF_ONESHOT | IRQF_SHARED, dev_name(chip->dev), chip); if (err) mv88e6xxx_g1_irq_free_common(chip); -- 2.16.4
Re: [PATCH bpf-next 00/11] AF_XDP zero-copy support for i40e
> Thanks for working on this, LGTM! Are you also planning to get ixgbe > out after that? > I currently don't have i40e nic to test, so I'm also looking forward to the ixgbe patch! Thank you William
Re: [PATCH 1/2] dt-bindings: net: cpsw: Document cpsw-phy-sel usage but prefer phandle
Hi Tony, On 08/29/2018 10:00 AM, Tony Lindgren wrote: > The current cpsw usage for cpsw-phy-sel is undocumented but is used for > all the boards using cpsw. And cpsw-phy-sel is not really a child of > the cpsw device, it lives in the system control module instead. > > Let's document the existing usage, and improve it a bit where we prefer > to use a phandle instead of a child device for it. That way we can > properly describe the hardware in dts files for things like genpd. I'm ok with this series, but I really don't like cpsw-phy-sel in general. It was introduced long time back and now I'm thinking about possibility to replace it with one of current generic interfaces - for example mux-controller. Each port will control up to 3 muxes (port mode, idmode and rmii_ext_clk) and transform phy-mode => mux states. What do you think? Another option is to use phy, but it'd be complicated. > > Cc: devicet...@vger.kernel.org > Cc: Andrew Lunn > Cc: Grygorii Strashko > Cc: Ivan Khoronzhuk > Cc: Mark Rutland > Cc: Murali Karicheri > Cc: Rob Herring > Signed-off-by: Tony Lindgren > --- > Documentation/devicetree/bindings/net/cpsw.txt | 6 ++ > 1 file changed, 6 insertions(+) > > diff --git a/Documentation/devicetree/bindings/net/cpsw.txt > b/Documentation/devicetree/bindings/net/cpsw.txt > --- a/Documentation/devicetree/bindings/net/cpsw.txt > +++ b/Documentation/devicetree/bindings/net/cpsw.txt > @@ -19,6 +19,10 @@ Required properties: > - slaves: Specifies number for slaves > - active_slave : Specifies the slave to use for time stamping, > ethtool and SIOCGMIIPHY > +- cpsw-phy-sel : Specifies the phandle to the CPSW phy mode > selection > + device. See also cpsw-phy-sel.txt for it's binding. > + Note that in legacy cases cpsw-phy-sel may be > + a child device instead of a phandle. > > Optional properties: > - ti,hwmods : Must be "cpgmac0" > @@ -75,6 +79,7 @@ Examples: > cpts_clock_mult = <0x8000>; > cpts_clock_shift = <29>; > syscon = <>; > + cpsw-phy-sel = <_sel>; > cpsw_emac0: slave@0 { > phy_id = <_mdio>, <0>; > phy-mode = "rgmii-txid"; > @@ -103,6 +108,7 @@ Examples: > cpts_clock_mult = <0x8000>; > cpts_clock_shift = <29>; > syscon = <>; > + cpsw-phy-sel = <_sel>; > cpsw_emac0: slave@0 { > phy_id = <_mdio>, <0>; > phy-mode = "rgmii-txid"; > -- regards, -grygorii
[PATCH net] ebpf: fix bpf_msg_pull_data
While doing some preliminary testing it is found that bpf helper bpf_msg_pull_data does not calculate the data and data_end offset correctly. Fix it! Fixes: 015632bb30da ("bpf: sk_msg program helper bpf_sk_msg_pull_data") Signed-off-by: Tushar Dave Acked-by: Sowmini Varadhan --- net/core/filter.c | 38 +- 1 file changed, 25 insertions(+), 13 deletions(-) diff --git a/net/core/filter.c b/net/core/filter.c index c25eb36..3eeb3d6 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -2285,7 +2285,7 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg) BPF_CALL_4(bpf_msg_pull_data, struct sk_msg_buff *, msg, u32, start, u32, end, u64, flags) { - unsigned int len = 0, offset = 0, copy = 0; + unsigned int len = 0, offset = 0, copy = 0, off = 0; struct scatterlist *sg = msg->sg_data; int first_sg, last_sg, i, shift; unsigned char *p, *to, *from; @@ -2299,22 +2299,30 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg) i = msg->sg_start; do { len = sg[i].length; - offset += len; if (start < offset + len) break; + offset += len; i++; if (i == MAX_SKB_FRAGS) i = 0; - } while (i != msg->sg_end); + } while (i <= msg->sg_end); + /* return error if start is out of range */ if (unlikely(start >= offset + len)) return -EINVAL; - if (!msg->sg_copy[i] && bytes <= len) - goto out; + /* return error if i is last entry in sglist and end is out of range */ + if (msg->sg_copy[i] && end > offset + len) + return -EINVAL; first_sg = i; + /* if i is not last entry in sg list and end (i.e start + bytes) is +* within this sg[i] then goto out and calculate data and data_end +*/ + if (!msg->sg_copy[i] && end <= offset + len) + goto out; + /* At this point we need to linearize multiple scatterlist * elements or a single shared page. Either way we need to * copy into a linear buffer exclusively owned by BPF. Then @@ -2330,9 +2338,14 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg) i++; if (i == MAX_SKB_FRAGS) i = 0; - if (bytes < copy) + if (end < copy) break; - } while (i != msg->sg_end); + } while (i <= msg->sg_end); + + /* return error if i is last entry in sglist and end is out of range */ + if (i > msg->sg_end && end > offset + copy) + return -EINVAL; + last_sg = i; if (unlikely(copy < end - start)) @@ -2342,23 +2355,22 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg) if (unlikely(!page)) return -ENOMEM; p = page_address(page); - offset = 0; i = first_sg; do { from = sg_virt([i]); len = sg[i].length; - to = p + offset; + to = p + off; memcpy(to, from, len); - offset += len; + off += len; sg[i].length = 0; put_page(sg_page([i])); i++; if (i == MAX_SKB_FRAGS) i = 0; - } while (i != last_sg); + } while (i < last_sg); sg[first_sg].length = copy; sg_set_page([first_sg], page, copy, 0); @@ -2380,7 +2392,7 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg) else move_from = i + shift; - if (move_from == msg->sg_end) + if (move_from > msg->sg_end) break; sg[i] = sg[move_from]; @@ -2396,7 +2408,7 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg) if (msg->sg_end < 0) msg->sg_end += MAX_SKB_FRAGS; out: - msg->data = sg_virt([i]) + start - offset; + msg->data = sg_virt([first_sg]) + start - offset; msg->data_end = msg->data + bytes; return 0; -- 1.8.3.1
[PATCH net-next] net/ipv6: Do not reset nl_net in ip6_route_info_create
From: David Ahern nl_net is set on entry to ip6_route_info_create. Only devices within that namespace are considered so no need to reset it before returning. Signed-off-by: David Ahern --- net/ipv6/route.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/net/ipv6/route.c b/net/ipv6/route.c index c4ea13e8360b..b5f385d2b0e9 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -3138,8 +3138,6 @@ static struct fib6_info *ip6_route_info_create(struct fib6_config *cfg, rt->fib6_nh.nh_dev = dev; rt->fib6_table = table; - cfg->fc_nlinfo.nl_net = dev_net(dev); - if (idev) in6_dev_put(idev); -- 2.11.0
[PATCH net-next] net/ipv4: Add extack message that dev is required for ONLINK
From: David Ahern Make IPv4 consistent with IPv6 and return an extack message that the ONLINK flag requires a nexthop device. Signed-off-by: David Ahern --- net/ipv4/fib_semantics.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c index f3c89ccf14c5..bee8db979195 100644 --- a/net/ipv4/fib_semantics.c +++ b/net/ipv4/fib_semantics.c @@ -797,8 +797,10 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_nh *nh, return -EINVAL; } dev = __dev_get_by_index(net, nh->nh_oif); - if (!dev) + if (!dev) { + NL_SET_ERR_MSG(extack, "Nexthop device required for onlink"); return -ENODEV; + } if (!(dev->flags & IFF_UP)) { NL_SET_ERR_MSG(extack, "Nexthop device is not up"); -- 2.11.0
Re: pull-request: bpf 2018-08-29
From: Daniel Borkmann Date: Wed, 29 Aug 2018 21:07:24 +0200 > The following pull-request contains BPF updates for your *net* tree. > > The main changes are: > > 1) Fix a build error in sk_reuseport_convert_ctx_access() when >compiling with clang which cannot resolve hweight_long() at >build time inside the BUILD_BUG_ON() assertion, from Stefan. > > 2) Several fixes for BPF sockmap, four of them in getting the >bpf_msg_pull_data() helper to work, one use after free case >in bpf_tcp_close() and one refcount leak in bpf_tcp_recvmsg(), >from Daniel. > > 3) Another fix for BPF sockmap where we misaccount sk_mem_uncharge() >in the socket redirect error case from unwinding scatterlist >twice, from John. > > Please consider pulling these changes from: > > git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git Pulled, thanks Daniel.
[net-next 15/15] i40e: Prevent deleting MAC address from VF when set by PF
From: Patryk Małek To prevent VF from deleting MAC address that was assigned by the PF we need to check for that scenario when we try to delete a MAC address from a VF. Signed-off-by: Patryk Małek Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c index f56c645588f3..3e707c7e6782 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c +++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c @@ -2569,6 +2569,16 @@ static int i40e_vc_del_mac_addr_msg(struct i40e_vf *vf, u8 *msg, u16 msglen) ret = I40E_ERR_INVALID_MAC_ADDR; goto error_param; } + + if (vf->pf_set_mac && + ether_addr_equal(al->list[i].addr, +vf->default_lan_addr.addr)) { + dev_err(>pdev->dev, + "MAC addr %pM has been set by PF, cannot delete it for VF %d, reset VF to change MAC addr\n", + vf->default_lan_addr.addr, vf->vf_id); + ret = I40E_ERR_PARAM; + goto error_param; + } } vsi = pf->vsi[vf->lan_vsi_idx]; -- 2.17.1
[net-next 11/15] i40e: Check and correct speed values for link on open
From: Jan Sokolowski If our card has been put in an unstable state due to other drivers interacting with it, speed settings might be incorrect. If incorrect, forcefully reset them on open to known default values. Signed-off-by: Jan Sokolowski Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/i40e/i40e_main.c | 27 ++--- 1 file changed, 24 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c index 6cb4076e8fba..c30feb27e1c0 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_main.c +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c @@ -6570,6 +6570,24 @@ static i40e_status i40e_force_link_state(struct i40e_pf *pf, bool is_up) struct i40e_hw *hw = >hw; i40e_status err; u64 mask; + u8 speed; + + /* Card might've been put in an unstable state by other drivers +* and applications, which causes incorrect speed values being +* set on startup. In order to clear speed registers, we call +* get_phy_capabilities twice, once to get initial state of +* available speeds, and once to get current PHY config. +*/ + err = i40e_aq_get_phy_capabilities(hw, false, true, , + NULL); + if (err) { + dev_err(>pdev->dev, + "failed to get phy cap., ret = %s last_status = %s\n", + i40e_stat_str(hw, err), + i40e_aq_str(hw, hw->aq.asq_last_status)); + return err; + } + speed = abilities.link_speed; /* Get the current phy config */ err = i40e_aq_get_phy_capabilities(hw, false, false, , @@ -6583,9 +6601,9 @@ static i40e_status i40e_force_link_state(struct i40e_pf *pf, bool is_up) } /* If link needs to go up, but was not forced to go down, -* no need for a flap +* and its speed values are OK, no need for a flap */ - if (is_up && abilities.phy_type != 0) + if (is_up && abilities.phy_type != 0 && abilities.link_speed != 0) return I40E_SUCCESS; /* To force link we need to set bits for all supported PHY types, @@ -6597,7 +6615,10 @@ static i40e_status i40e_force_link_state(struct i40e_pf *pf, bool is_up) config.phy_type_ext = is_up ? (u8)((mask >> 32) & 0xff) : 0; /* Copy the old settings, except of phy_type */ config.abilities = abilities.abilities; - config.link_speed = abilities.link_speed; + if (abilities.link_speed != 0) + config.link_speed = abilities.link_speed; + else + config.link_speed = speed; config.eee_capability = abilities.eee_capability; config.eeer = abilities.eeer_val; config.low_power_ctrl = abilities.d3_lpan; -- 2.17.1
[net-next 08/15] virtchnl: use u8 type for a field in the virtchnl_filter struct
From: Harshitha Ramamurthy The virtchnl_filter struct has a field called field_flags. A previous commit mistakenly had the type to be a __u8. What we want is for the field to be an unsigned 8 bit value, so let's just use the existing kernel type u8 for that. Signed-off-by: Harshitha Ramamurthy Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- include/linux/avf/virtchnl.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/avf/virtchnl.h b/include/linux/avf/virtchnl.h index 212b3822d180..b41f7bc958ef 100644 --- a/include/linux/avf/virtchnl.h +++ b/include/linux/avf/virtchnl.h @@ -573,7 +573,7 @@ struct virtchnl_filter { enumvirtchnl_flow_type flow_type; enumvirtchnl_action action; u32 action_meta; - __u8field_flags; + u8 field_flags; }; VIRTCHNL_CHECK_STRUCT_LEN(272, virtchnl_filter); -- 2.17.1
[net-next 07/15] i40evf: set IFF_UNICAST_FLT flag for the VF
From: Lihong Yang Set IFF_UNICAST_FLT flag for the VF to prevent it from entering promiscuous mode when macvlan is added to the VF. Signed-off-by: Lihong Yang Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/i40evf/i40evf_main.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_main.c b/drivers/net/ethernet/intel/i40evf/i40evf_main.c index 5906c1c1d19d..07f187b5bcac 100644 --- a/drivers/net/ethernet/intel/i40evf/i40evf_main.c +++ b/drivers/net/ethernet/intel/i40evf/i40evf_main.c @@ -3358,6 +3358,8 @@ int i40evf_process_config(struct i40evf_adapter *adapter) if (vfres->vf_cap_flags & VIRTCHNL_VF_OFFLOAD_VLAN) netdev->features |= NETIF_F_HW_VLAN_CTAG_FILTER; + netdev->priv_flags |= IFF_UNICAST_FLT; + /* Do not turn on offloads when they are requested to be turned off. * TSO needs minimum 576 bytes to work correctly. */ -- 2.17.1
[net-next 01/15] i40e: convert queue stats to i40e_stats array
From: Jacob Keller Use an i40e_stats array to handle the queue stats, instead of coding similar functionality separately. Because of how the queue stats are accessed on some kernels, we can't easily use i40e_add_ethtool_stats. Instead, implement a separate helper, i40e_add_queue_stats, which we'll use instead. This helper will correctly implement the u64_stats_fetch_begin_irq logic and allow retries until successful. We share the most complex code by re-using i40e_add_one_ethtool_stat. This logic additionally easily supports skipping disabled rings by using a ternary operator before calling the u64_stats_fetch_begin_irq() function, so that we correctly zero-out the stats values without having to perform two separate sections of code. This significantly reduces the boiler plate code in i40e_get_ethtool_stats, and helps keep the complex logic contained to as few functions as possible. With this patch, we've finally converted all the statistics to use the helpers and the i40e_stats function. Signed-off-by: Jacob Keller Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- .../net/ethernet/intel/i40e/i40e_ethtool.c| 148 +++--- 1 file changed, 89 insertions(+), 59 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c index 5ff6caa83948..235012b3bd42 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c +++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c @@ -33,6 +33,8 @@ struct i40e_stats { I40E_STAT(struct i40e_veb, _name, _stat) #define I40E_PFC_STAT(_name, _stat) \ I40E_STAT(struct i40e_pfc_stats, _name, _stat) +#define I40E_QUEUE_STAT(_name, _stat) \ + I40E_STAT(struct i40e_ring, _name, _stat) static const struct i40e_stats i40e_gstrings_net_stats[] = { I40E_NETDEV_STAT(rx_packets), @@ -171,20 +173,16 @@ static const struct i40e_stats i40e_gstrings_pfc_stats[] = { I40E_PFC_STAT("port.rx_priority_%u_xon_2_xoff", priority_xon_2_xoff), }; -/* We use num_tx_queues here as a proxy for the maximum number of queues - * available because we always allocate queues symmetrically. - */ -#define I40E_MAX_NUM_QUEUES(n) ((n)->num_tx_queues) -#define I40E_QUEUE_STATS_LEN(n) \ - (I40E_MAX_NUM_QUEUES(n) \ - * 2 /* Tx and Rx together */ \ - * (sizeof(struct i40e_queue_stats) / sizeof(u64))) -#define I40E_GLOBAL_STATS_LEN ARRAY_SIZE(i40e_gstrings_stats) +static const struct i40e_stats i40e_gstrings_queue_stats[] = { + I40E_QUEUE_STAT("%s-%u.packets", stats.packets), + I40E_QUEUE_STAT("%s-%u.bytes", stats.bytes), +}; + #define I40E_NETDEV_STATS_LEN ARRAY_SIZE(i40e_gstrings_net_stats) + #define I40E_MISC_STATS_LENARRAY_SIZE(i40e_gstrings_misc_stats) -#define I40E_VSI_STATS_LEN(n) (I40E_NETDEV_STATS_LEN + \ -I40E_MISC_STATS_LEN + \ -I40E_QUEUE_STATS_LEN((n))) + +#define I40E_VSI_STATS_LEN (I40E_NETDEV_STATS_LEN + I40E_MISC_STATS_LEN) #define I40E_PFC_STATS_LEN (ARRAY_SIZE(i40e_gstrings_pfc_stats) * \ I40E_MAX_USER_PRIORITY) @@ -193,10 +191,15 @@ static const struct i40e_stats i40e_gstrings_pfc_stats[] = { (ARRAY_SIZE(i40e_gstrings_veb_tc_stats) * \ I40E_MAX_TRAFFIC_CLASS)) -#define I40E_PF_STATS_LEN(n) (I40E_GLOBAL_STATS_LEN + \ +#define I40E_GLOBAL_STATS_LEN ARRAY_SIZE(i40e_gstrings_stats) + +#define I40E_PF_STATS_LEN (I40E_GLOBAL_STATS_LEN + \ I40E_PFC_STATS_LEN + \ I40E_VEB_STATS_LEN + \ -I40E_VSI_STATS_LEN((n))) +I40E_VSI_STATS_LEN) + +/* Length of stats for a single queue */ +#define I40E_QUEUE_STATS_LEN ARRAY_SIZE(i40e_gstrings_queue_stats) enum i40e_ethtool_test_id { I40E_ETH_TEST_REG = 0, @@ -1701,11 +1704,30 @@ static int i40e_get_stats_count(struct net_device *netdev) struct i40e_netdev_priv *np = netdev_priv(netdev); struct i40e_vsi *vsi = np->vsi; struct i40e_pf *pf = vsi->back; + int stats_len; if (vsi == pf->vsi[pf->lan_vsi] && pf->hw.partition_id == 1) - return I40E_PF_STATS_LEN(netdev); + stats_len = I40E_PF_STATS_LEN; else - return I40E_VSI_STATS_LEN(netdev); + stats_len = I40E_VSI_STATS_LEN; + + /* The number of stats reported for a given net_device must remain +* constant throughout the life of that device. +* +* This is because the API for obtaining the size, strings, and stats +* is spread out over three separate ethtool ioctls. There is no safe +* way to lock the number of stats across these calls, so we must +
[net-next 05/15] i40evf: Validate the number of queues a PF sends
From: Paul M Stillwell Jr A PF can send any number of queues to the VF and the VF may not be able to support that many. Check to see that the number of queues is less than or equal to the max number of queues the VF can have. Signed-off-by: Paul M Stillwell Jr Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- .../ethernet/intel/i40evf/i40evf_virtchnl.c | 32 +++ 1 file changed, 32 insertions(+) diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c b/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c index 79b7be83b5c3..6579dabab78c 100644 --- a/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c +++ b/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c @@ -153,6 +153,32 @@ int i40evf_send_vf_config_msg(struct i40evf_adapter *adapter) NULL, 0); } +/** + * i40evf_validate_num_queues + * @adapter: adapter structure + * + * Validate that the number of queues the PF has sent in + * VIRTCHNL_OP_GET_VF_RESOURCES is not larger than the VF can handle. + **/ +static void i40evf_validate_num_queues(struct i40evf_adapter *adapter) +{ + if (adapter->vf_res->num_queue_pairs > I40EVF_MAX_REQ_QUEUES) { + struct virtchnl_vsi_resource *vsi_res; + int i; + + dev_info(>pdev->dev, "Received %d queues, but can only have a max of %d\n", +adapter->vf_res->num_queue_pairs, +I40EVF_MAX_REQ_QUEUES); + dev_info(>pdev->dev, "Fixing by reducing queues to %d\n", +I40EVF_MAX_REQ_QUEUES); + adapter->vf_res->num_queue_pairs = I40EVF_MAX_REQ_QUEUES; + for (i = 0; i < adapter->vf_res->num_vsis; i++) { + vsi_res = >vf_res->vsi_res[i]; + vsi_res->num_queue_pairs = I40EVF_MAX_REQ_QUEUES; + } + } +} + /** * i40evf_get_vf_config * @adapter: private adapter structure @@ -195,6 +221,11 @@ int i40evf_get_vf_config(struct i40evf_adapter *adapter) err = (i40e_status)le32_to_cpu(event.desc.cookie_low); memcpy(adapter->vf_res, event.msg_buf, min(event.msg_len, len)); + /* some PFs send more queues than we should have so validate that +* we aren't getting too many queues +*/ + if (!err) + i40evf_validate_num_queues(adapter); i40e_vf_parse_hw_config(hw, adapter->vf_res); out_alloc: kfree(event.msg_buf); @@ -1329,6 +1360,7 @@ void i40evf_virtchnl_completion(struct i40evf_adapter *adapter, I40E_MAX_VF_VSI * sizeof(struct virtchnl_vsi_resource); memcpy(adapter->vf_res, msg, min(msglen, len)); + i40evf_validate_num_queues(adapter); i40e_vf_parse_hw_config(>hw, adapter->vf_res); if (is_zero_ether_addr(adapter->hw.mac.addr)) { /* restore current mac address */ -- 2.17.1
[net-next 12/15] i40evf: Don't enable vlan stripping when rx offload is turned on
From: Patryk Małek With current implementation of i40evf_set_features when user sets any offload via ethtool we set I40EVF_FLAG_AQ_ENABLE_VLAN_STRIPPING as a required aq which triggers driver to call i40evf_enable_vlan_stripping. This shouldn't take place. This patches fixes it by setting the flag only when VLAN offload is turned on. Signed-off-by: Patryk Małek Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/i40evf/i40evf_main.c | 11 ++- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_main.c b/drivers/net/ethernet/intel/i40evf/i40evf_main.c index 07f187b5bcac..c7048cf484fb 100644 --- a/drivers/net/ethernet/intel/i40evf/i40evf_main.c +++ b/drivers/net/ethernet/intel/i40evf/i40evf_main.c @@ -3120,18 +3120,19 @@ static int i40evf_set_features(struct net_device *netdev, { struct i40evf_adapter *adapter = netdev_priv(netdev); - /* Don't allow changing VLAN_RX flag when VLAN is set for VF -* and return an error in this case + /* Don't allow changing VLAN_RX flag when adapter is not capable +* of VLAN offload */ - if (VLAN_ALLOWED(adapter)) { + if (!VLAN_ALLOWED(adapter)) { + if ((netdev->features ^ features) & NETIF_F_HW_VLAN_CTAG_RX) + return -EINVAL; + } else if ((netdev->features ^ features) & NETIF_F_HW_VLAN_CTAG_RX) { if (features & NETIF_F_HW_VLAN_CTAG_RX) adapter->aq_required |= I40EVF_FLAG_AQ_ENABLE_VLAN_STRIPPING; else adapter->aq_required |= I40EVF_FLAG_AQ_DISABLE_VLAN_STRIPPING; - } else if ((netdev->features ^ features) & NETIF_F_HW_VLAN_CTAG_RX) { - return -EINVAL; } return 0; -- 2.17.1
[net-next 02/15] i40e: move ethtool stats boiler plate code to i40e_ethtool_stats.h
From: Jacob Keller Move the boiler plate structures and helper functions we recently added into their own header file, so that the complete collection is located together. Signed-off-by: Jacob Keller Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- .../net/ethernet/intel/i40e/i40e_ethtool.c| 182 +-- .../ethernet/intel/i40e/i40e_ethtool_stats.h | 221 ++ 2 files changed, 222 insertions(+), 181 deletions(-) create mode 100644 drivers/net/ethernet/intel/i40e/i40e_ethtool_stats.h diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c index 235012b3bd42..d7d3974beca2 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c +++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c @@ -6,25 +6,8 @@ #include "i40e.h" #include "i40e_diag.h" -struct i40e_stats { - /* The stat_string is expected to be a format string formatted using -* vsnprintf by i40e_add_stat_strings. Every member of a stats array -* should use the same format specifiers as they will be formatted -* using the same variadic arguments. -*/ - char stat_string[ETH_GSTRING_LEN]; - int sizeof_stat; - int stat_offset; -}; - -#define I40E_STAT(_type, _name, _stat) { \ - .stat_string = _name, \ - .sizeof_stat = FIELD_SIZEOF(_type, _stat), \ - .stat_offset = offsetof(_type, _stat) \ -} +#include "i40e_ethtool_stats.h" -#define I40E_NETDEV_STAT(_net_stat) \ - I40E_STAT(struct rtnl_link_stats64, #_net_stat, _net_stat) #define I40E_PF_STAT(_name, _stat) \ I40E_STAT(struct i40e_pf, _name, _stat) #define I40E_VSI_STAT(_name, _stat) \ @@ -173,11 +156,6 @@ static const struct i40e_stats i40e_gstrings_pfc_stats[] = { I40E_PFC_STAT("port.rx_priority_%u_xon_2_xoff", priority_xon_2_xoff), }; -static const struct i40e_stats i40e_gstrings_queue_stats[] = { - I40E_QUEUE_STAT("%s-%u.packets", stats.packets), - I40E_QUEUE_STAT("%s-%u.bytes", stats.bytes), -}; - #define I40E_NETDEV_STATS_LEN ARRAY_SIZE(i40e_gstrings_net_stats) #define I40E_MISC_STATS_LENARRAY_SIZE(i40e_gstrings_misc_stats) @@ -1749,128 +1727,6 @@ static int i40e_get_sset_count(struct net_device *netdev, int sset) } } -/** - * i40e_add_one_ethtool_stat - copy the stat into the supplied buffer - * @data: location to store the stat value - * @pointer: basis for where to copy from - * @stat: the stat definition - * - * Copies the stat data defined by the pointer and stat structure pair into - * the memory supplied as data. Used to implement i40e_add_ethtool_stats and - * i40e_add_queue_stats. If the pointer is null, data will be zero'd. - */ -static inline void -i40e_add_one_ethtool_stat(u64 *data, void *pointer, - const struct i40e_stats *stat) -{ - char *p; - - if (!pointer) { - /* ensure that the ethtool data buffer is zero'd for any stats -* which don't have a valid pointer. -*/ - *data = 0; - return; - } - - p = (char *)pointer + stat->stat_offset; - switch (stat->sizeof_stat) { - case sizeof(u64): - *data = *((u64 *)p); - break; - case sizeof(u32): - *data = *((u32 *)p); - break; - case sizeof(u16): - *data = *((u16 *)p); - break; - case sizeof(u8): - *data = *((u8 *)p); - break; - default: - WARN_ONCE(1, "unexpected stat size for %s", - stat->stat_string); - *data = 0; - } -} - -/** - * __i40e_add_ethtool_stats - copy stats into the ethtool supplied buffer - * @data: ethtool stats buffer - * @pointer: location to copy stats from - * @stats: array of stats to copy - * @size: the size of the stats definition - * - * Copy the stats defined by the stats array using the pointer as a base into - * the data buffer supplied by ethtool. Updates the data pointer to point to - * the next empty location for successive calls to __i40e_add_ethtool_stats. - * If pointer is null, set the data values to zero and update the pointer to - * skip these stats. - **/ -static inline void -__i40e_add_ethtool_stats(u64 **data, void *pointer, -const struct i40e_stats stats[], -const unsigned int size) -{ - unsigned int i; - - for (i = 0; i < size; i++) - i40e_add_one_ethtool_stat((*data)++, pointer, [i]); -} - -/** - * i40e_add_ethtool_stats - copy stats into ethtool supplied buffer - * @data: ethtool stats buffer - * @pointer: location where stats are stored - * @stats: static const array of stat definitions - * - * Macro to ease the use of __i40e_add_ethtool_stats by taking a static - * constant stats array and passing the ARRAY_SIZE(). This avoids typos by - * ensuring that we pass the size
[net-next 06/15] i40e: use correct length for strncpy
From: Mitch Williams Caught by GCC 8. When we provide a length for strncpy, we should not include the terminating null. So we must tell it one less than the size of the destination buffer. Signed-off-by: Mitch Williams Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/i40e/i40e_ptp.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_ptp.c b/drivers/net/ethernet/intel/i40e/i40e_ptp.c index 35f2866b38c6..1199f0502d6d 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_ptp.c +++ b/drivers/net/ethernet/intel/i40e/i40e_ptp.c @@ -694,7 +694,8 @@ static long i40e_ptp_create_clock(struct i40e_pf *pf) if (!IS_ERR_OR_NULL(pf->ptp_clock)) return 0; - strncpy(pf->ptp_caps.name, i40e_driver_name, sizeof(pf->ptp_caps.name)); + strncpy(pf->ptp_caps.name, i40e_driver_name, + sizeof(pf->ptp_caps.name) - 1); pf->ptp_caps.owner = THIS_MODULE; pf->ptp_caps.max_adj = 9; pf->ptp_caps.n_ext_ts = 0; -- 2.17.1
[net-next 13/15] i40e: hold the rtnl lock on clearing interrupt scheme
From: Patryk Małek Hold the rtnl lock when we're clearing interrupt scheme in i40e_shutdown and in i40e_remove. Signed-off-by: Patryk Małek Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/i40e/i40e_main.c | 8 1 file changed, 8 insertions(+) diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c index c30feb27e1c0..112245f32d7d 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_main.c +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c @@ -14182,6 +14182,7 @@ static void i40e_remove(struct pci_dev *pdev) mutex_destroy(>aq.asq_mutex); /* Clear all dynamic memory lists of rings, q_vectors, and VSIs */ + rtnl_lock(); i40e_clear_interrupt_scheme(pf); for (i = 0; i < pf->num_alloc_vsi; i++) { if (pf->vsi[i]) { @@ -14190,6 +14191,7 @@ static void i40e_remove(struct pci_dev *pdev) pf->vsi[i] = NULL; } } + rtnl_unlock(); for (i = 0; i < I40E_MAX_VEB; i++) { kfree(pf->veb[i]); @@ -14401,7 +14403,13 @@ static void i40e_shutdown(struct pci_dev *pdev) wr32(hw, I40E_PFPM_WUFC, (pf->wol_en ? I40E_PFPM_WUFC_MAG_MASK : 0)); + /* Since we're going to destroy queues during the +* i40e_clear_interrupt_scheme() we should hold the RTNL lock for this +* whole section +*/ + rtnl_lock(); i40e_clear_interrupt_scheme(pf); + rtnl_unlock(); if (system_state == SYSTEM_POWER_OFF) { pci_wake_from_d3(pdev, pf->wol_en); -- 2.17.1
[net-next 09/15] i40e: static analysis report from community
From: Martyna Szapar Static analysis tools report a problem from original driver submission. Removing unnecessary check in condition. Signed-off-by: Martyna Szapar Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/i40e/i40e_main.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c index ac685ad4d877..62f2b5bce6bb 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_main.c +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c @@ -13033,7 +13033,7 @@ struct i40e_veb *i40e_veb_setup(struct i40e_pf *pf, u16 flags, for (vsi_idx = 0; vsi_idx < pf->num_alloc_vsi; vsi_idx++) if (pf->vsi[vsi_idx] && pf->vsi[vsi_idx]->seid == vsi_seid) break; - if (vsi_idx >= pf->num_alloc_vsi && vsi_seid != 0) { + if (vsi_idx == pf->num_alloc_vsi && vsi_seid != 0) { dev_info(>pdev->dev, "vsi seid %d not found\n", vsi_seid); return NULL; -- 2.17.1
[net-next 10/15] i40e: report correct statistics when XDP is enabled
From: Björn Töpel When XDP is enabled, the driver will report incorrect statistics. Received frames will reported as transmitted frames. This commits fixes the i40e implementation of ndo_get_stats64 (struct net_device_ops), so that iproute2 will report correct statistics (e.g. when running "ip -stats link show dev eth0") even when XDP is enabled. Reported-by: Jesper Dangaard Brouer Fixes: 74608d17fe29 ("i40e: add support for XDP_TX action") Signed-off-by: Björn Töpel Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/i40e/i40e_main.c | 24 +++-- 1 file changed, 13 insertions(+), 11 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c index 62f2b5bce6bb..6cb4076e8fba 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_main.c +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c @@ -420,9 +420,9 @@ static void i40e_get_netdev_stats_struct(struct net_device *netdev, struct rtnl_link_stats64 *stats) { struct i40e_netdev_priv *np = netdev_priv(netdev); - struct i40e_ring *tx_ring, *rx_ring; struct i40e_vsi *vsi = np->vsi; struct rtnl_link_stats64 *vsi_stats = i40e_get_vsi_stats_struct(vsi); + struct i40e_ring *ring; int i; if (test_bit(__I40E_VSI_DOWN, vsi->state)) @@ -436,24 +436,26 @@ static void i40e_get_netdev_stats_struct(struct net_device *netdev, u64 bytes, packets; unsigned int start; - tx_ring = READ_ONCE(vsi->tx_rings[i]); - if (!tx_ring) + ring = READ_ONCE(vsi->tx_rings[i]); + if (!ring) continue; - i40e_get_netdev_stats_struct_tx(tx_ring, stats); + i40e_get_netdev_stats_struct_tx(ring, stats); - rx_ring = _ring[1]; + if (i40e_enabled_xdp_vsi(vsi)) { + ring++; + i40e_get_netdev_stats_struct_tx(ring, stats); + } + ring++; do { - start = u64_stats_fetch_begin_irq(_ring->syncp); - packets = rx_ring->stats.packets; - bytes = rx_ring->stats.bytes; - } while (u64_stats_fetch_retry_irq(_ring->syncp, start)); + start = u64_stats_fetch_begin_irq(>syncp); + packets = ring->stats.packets; + bytes = ring->stats.bytes; + } while (u64_stats_fetch_retry_irq(>syncp, start)); stats->rx_packets += packets; stats->rx_bytes += bytes; - if (i40e_enabled_xdp_vsi(vsi)) - i40e_get_netdev_stats_struct_tx(_ring[1], stats); } rcu_read_unlock(); -- 2.17.1
[net-next 14/15] i40evf: cancel workqueue sync for adminq when a VF is removed
From: Lihong Yang If a VF is being removed, there is no need to continue with the workqueue sync for the adminq task, thus cancel it. Without this call, when VFs are created and removed right away, there might be a chance for the driver to crash with events stuck in the adminq. Signed-off-by: Lihong Yang Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/i40evf/i40evf_main.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_main.c b/drivers/net/ethernet/intel/i40evf/i40evf_main.c index c7048cf484fb..174d1da2857b 100644 --- a/drivers/net/ethernet/intel/i40evf/i40evf_main.c +++ b/drivers/net/ethernet/intel/i40evf/i40evf_main.c @@ -3910,6 +3910,8 @@ static void i40evf_remove(struct pci_dev *pdev) if (adapter->watchdog_timer.function) del_timer_sync(>watchdog_timer); + cancel_work_sync(>adminq_task); + i40evf_free_rss(adapter); if (hw->aq.asq.count) -- 2.17.1
[net-next 03/15] i40evf: update ethtool stats code and use helper functions
From: Jacob Keller Fix a bug in the way we handled VF queues, by always showing stats for the maximum number of queues, even if they aren't allocated. It is not safe to change the number of strings reported to ethtool, as grabbing statistics occurs over multiple ethtool ops for which the rtnl_lock() cannot be held the entire time. Avoid this by always reporting queue stats for the maximum number of queues in the netdevice. Share some of the helper functionality for adding stats with the PF code in i40e_ethtool_stats.h This should reduce the chance of potential future bugs, and make adding new statistics easier. Note for the queue stats, unlike the PF driver we do not keep an array of queue pointers, but an array of queues, so care must be taken to avoid accessing queue memory that hasn't yet been allocated. Signed-off-by: Jacob Keller Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- .../intel/i40evf/i40e_ethtool_stats.h | 221 ++ .../ethernet/intel/i40evf/i40evf_ethtool.c| 137 ++- 2 files changed, 298 insertions(+), 60 deletions(-) create mode 100644 drivers/net/ethernet/intel/i40evf/i40e_ethtool_stats.h diff --git a/drivers/net/ethernet/intel/i40evf/i40e_ethtool_stats.h b/drivers/net/ethernet/intel/i40evf/i40e_ethtool_stats.h new file mode 100644 index ..60b595dd8c39 --- /dev/null +++ b/drivers/net/ethernet/intel/i40evf/i40e_ethtool_stats.h @@ -0,0 +1,221 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* Copyright(c) 2013 - 2018 Intel Corporation. */ + +/* ethtool statistics helpers */ + +/** + * struct i40e_stats - definition for an ethtool statistic + * @stat_string: statistic name to display in ethtool -S output + * @sizeof_stat: the sizeof() the stat, must be no greater than sizeof(u64) + * @stat_offset: offsetof() the stat from a base pointer + * + * This structure defines a statistic to be added to the ethtool stats buffer. + * It defines a statistic as offset from a common base pointer. Stats should + * be defined in constant arrays using the I40E_STAT macro, with every element + * of the array using the same _type for calculating the sizeof_stat and + * stat_offset. + * + * The @sizeof_stat is expected to be sizeof(u8), sizeof(u16), sizeof(u32) or + * sizeof(u64). Other sizes are not expected and will produce a WARN_ONCE from + * the i40e_add_ethtool_stat() helper function. + * + * The @stat_string is interpreted as a format string, allowing formatted + * values to be inserted while looping over multiple structures for a given + * statistics array. Thus, every statistic string in an array should have the + * same type and number of format specifiers, to be formatted by variadic + * arguments to the i40e_add_stat_string() helper function. + **/ +struct i40e_stats { + char stat_string[ETH_GSTRING_LEN]; + int sizeof_stat; + int stat_offset; +}; + +/* Helper macro to define an i40e_stat structure with proper size and type. + * Use this when defining constant statistics arrays. Note that @_type expects + * only a type name and is used multiple times. + */ +#define I40E_STAT(_type, _name, _stat) { \ + .stat_string = _name, \ + .sizeof_stat = FIELD_SIZEOF(_type, _stat), \ + .stat_offset = offsetof(_type, _stat) \ +} + +/* Helper macro for defining some statistics directly copied from the netdev + * stats structure. + */ +#define I40E_NETDEV_STAT(_net_stat) \ + I40E_STAT(struct rtnl_link_stats64, #_net_stat, _net_stat) + +/* Helper macro for defining some statistics related to queues */ +#define I40E_QUEUE_STAT(_name, _stat) \ + I40E_STAT(struct i40e_ring, _name, _stat) + +/* Stats associated with a Tx or Rx ring */ +static const struct i40e_stats i40e_gstrings_queue_stats[] = { + I40E_QUEUE_STAT("%s-%u.packets", stats.packets), + I40E_QUEUE_STAT("%s-%u.bytes", stats.bytes), +}; + +/** + * i40evf_add_one_ethtool_stat - copy the stat into the supplied buffer + * @data: location to store the stat value + * @pointer: basis for where to copy from + * @stat: the stat definition + * + * Copies the stat data defined by the pointer and stat structure pair into + * the memory supplied as data. Used to implement i40e_add_ethtool_stats and + * i40evf_add_queue_stats. If the pointer is null, data will be zero'd. + */ +static inline void +i40evf_add_one_ethtool_stat(u64 *data, void *pointer, + const struct i40e_stats *stat) +{ + char *p; + + if (!pointer) { + /* ensure that the ethtool data buffer is zero'd for any stats +* which don't have a valid pointer. +*/ + *data = 0; + return; + } + + p = (char *)pointer + stat->stat_offset; + switch (stat->sizeof_stat) { + case sizeof(u64): + *data = *((u64 *)p); + break; + case sizeof(u32): + *data = *((u32 *)p); + break; + case sizeof(u16): +
[net-next 04/15] i40evf: Change a VF mac without reloading the VF driver
From: Paweł Jabłoński Add possibility to change a VF mac address from host side without reloading the VF driver on the guest side. Without this patch it is not possible to change the VF mac because executing i40evf_virtchnl_completion function with VIRTCHNL_OP_GET_VF_RESOURCES opcode resets the VF mac address to previous value. Signed-off-by: Paweł Jabłoński Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 8 +--- drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c | 11 +-- 2 files changed, 14 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c index c6d24eaede18..f56c645588f3 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c +++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c @@ -2458,7 +2458,7 @@ static inline int i40e_check_vf_permission(struct i40e_vf *vf, !is_multicast_ether_addr(addr) && vf->pf_set_mac && !ether_addr_equal(addr, vf->default_lan_addr.addr)) { dev_err(>pdev->dev, - "VF attempting to override administratively set MAC address, reload the VF driver to resume normal operation\n"); + "VF attempting to override administratively set MAC address, bring down and up the VF interface to resume normal operation\n"); return -EPERM; } } @@ -3873,9 +3873,11 @@ int i40e_ndo_set_vf_mac(struct net_device *netdev, int vf_id, u8 *mac) mac, vf_id); } - /* Force the VF driver stop so it has to reload with new MAC address */ + /* Force the VF interface down so it has to bring up with new MAC +* address +*/ i40e_vc_disable_vf(vf); - dev_info(>pdev->dev, "Reload the VF driver to make this change effective.\n"); + dev_info(>pdev->dev, "Bring down and up the VF interface to make this change effective.\n"); error_param: return ret; diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c b/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c index 565677de5ba3..79b7be83b5c3 100644 --- a/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c +++ b/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c @@ -1330,8 +1330,15 @@ void i40evf_virtchnl_completion(struct i40evf_adapter *adapter, sizeof(struct virtchnl_vsi_resource); memcpy(adapter->vf_res, msg, min(msglen, len)); i40e_vf_parse_hw_config(>hw, adapter->vf_res); - /* restore current mac address */ - ether_addr_copy(adapter->hw.mac.addr, netdev->dev_addr); + if (is_zero_ether_addr(adapter->hw.mac.addr)) { + /* restore current mac address */ + ether_addr_copy(adapter->hw.mac.addr, netdev->dev_addr); + } else { + /* refresh current mac address if changed */ + ether_addr_copy(netdev->dev_addr, adapter->hw.mac.addr); + ether_addr_copy(netdev->perm_addr, + adapter->hw.mac.addr); + } i40evf_process_config(adapter); } break; -- 2.17.1
[net-next 00/15][pull request] 40GbE Intel Wired LAN Driver Updates 2018-08-29
This series contains updates to i40e, i40evf and virtchnl. Jake implements helper functions to use an array to handle the queue stats which reduces the boiler plate code as well as keep the complexity localized to a few functions. Paweł adds the ability to change a VF's MAC address from the host side without having to reload the VF driver on the guest side. Paul adds a check to ensure that the number of queues that the PF sends to the VF is equal to or less than the maximum number of queues the VF can support. Mitch fixes an issue caught by GCC 8, where we need to not include the terminating null in the length of the string for strncpy(). Lihong fixes a VF issue to ensure that it does not enter into promiscuous mode when macvlan is added to the VF. Fixed a potential crash after a VF is removed, since the workqueue sync for the adminq task was not being cancelled. Harshitha fixes the type for field_flags in the virtchnl_filter struct. Martyna removes an unnecessary check in a conditional if statement. Björn fixes an issue reported by Jesper Dangaard Brouer, where the driver was reporting incorrect statistics when XDP was enabled. Jan fixes the potential reporting of incorrect speed settings. Patryk fixed an issue where the flag I40EVF_FLAG_AQ_ENABLE_VLAN_STRIPPING was getting set when any offload is set via ethtool. Resolved by only setting this flag when VLAN offload is enabled. Also ensure we hold the rtnl lock when we are clearing the interrupt scheme. Added a check when deleting the MAC address from the VF to ensure that the MAC address was not set by the PF and if it was, do not delete it. The following are changes since commit 817e60a7a2bb1f22052f18562990d675cb3a3762: Merge branch 'nfp-add-NFP5000-support' and are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 40GbE Björn Töpel (1): i40e: report correct statistics when XDP is enabled Harshitha Ramamurthy (1): virtchnl: use u8 type for a field in the virtchnl_filter struct Jacob Keller (3): i40e: convert queue stats to i40e_stats array i40e: move ethtool stats boiler plate code to i40e_ethtool_stats.h i40evf: update ethtool stats code and use helper functions Jan Sokolowski (1): i40e: Check and correct speed values for link on open Lihong Yang (2): i40evf: set IFF_UNICAST_FLT flag for the VF i40evf: cancel workqueue sync for adminq when a VF is removed Martyna Szapar (1): i40e: static analysis report from community Mitch Williams (1): i40e: use correct length for strncpy Patryk Małek (3): i40evf: Don't enable vlan stripping when rx offload is turned on i40e: hold the rtnl lock on clearing interrupt scheme i40e: Prevent deleting MAC address from VF when set by PF Paul M Stillwell Jr (1): i40evf: Validate the number of queues a PF sends Paweł Jabłoński (1): i40evf: Change a VF mac without reloading the VF driver .../net/ethernet/intel/i40e/i40e_ethtool.c| 238 -- .../ethernet/intel/i40e/i40e_ethtool_stats.h | 221 drivers/net/ethernet/intel/i40e/i40e_main.c | 61 +++-- drivers/net/ethernet/intel/i40e/i40e_ptp.c| 3 +- .../ethernet/intel/i40e/i40e_virtchnl_pf.c| 18 +- .../intel/i40evf/i40e_ethtool_stats.h | 221 .../ethernet/intel/i40evf/i40evf_ethtool.c| 137 +- .../net/ethernet/intel/i40evf/i40evf_main.c | 15 +- .../ethernet/intel/i40evf/i40evf_virtchnl.c | 43 +++- include/linux/avf/virtchnl.h | 2 +- 10 files changed, 678 insertions(+), 281 deletions(-) create mode 100644 drivers/net/ethernet/intel/i40e/i40e_ethtool_stats.h create mode 100644 drivers/net/ethernet/intel/i40evf/i40e_ethtool_stats.h -- 2.17.1
[PATCH net-next] tcp: change IPv6 flow-label upon receiving spurious retransmission
Currently a Linux IPv6 TCP sender will change the flow label upon timeouts to potentially steer away from a data path that has gone bad. However this does not help if the problem is on the ACK path and the data path is healthy. In this case the receiver is likely to receive repeated spurious retransmission because the sender couldn't get the ACKs in time and has recurring timeouts. This patch adds another feature to mitigate this problem. It leverages the DSACK states in the receiver to change the flow label of the ACKs to speculatively re-route the ACK packets. In order to allow triggering on the second consecutive spurious RTO, the receiver changes the flow label upon sending a second consecutive DSACK for a sequence number below RCV.NXT. Signed-off-by: Yuchung Cheng Signed-off-by: Neal Cardwell Signed-off-by: Eric Dumazet --- net/ipv4/tcp.c | 2 ++ net/ipv4/tcp_input.c | 13 + 2 files changed, 15 insertions(+) diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index b8af2fec5ad5..8c4235c098fd 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -2595,6 +2595,8 @@ int tcp_disconnect(struct sock *sk, int flags) tp->compressed_ack = 0; tp->bytes_sent = 0; tp->bytes_retrans = 0; + tp->duplicate_sack[0].start_seq = 0; + tp->duplicate_sack[0].end_seq = 0; tp->dsack_dups = 0; tp->reord_seen = 0; diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 4c2dd9f863f7..62508a2f9b21 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -4199,6 +4199,17 @@ static void tcp_dsack_extend(struct sock *sk, u32 seq, u32 end_seq) tcp_sack_extend(tp->duplicate_sack, seq, end_seq); } +static void tcp_rcv_spurious_retrans(struct sock *sk, const struct sk_buff *skb) +{ + /* When the ACK path fails or drops most ACKs, the sender would +* timeout and spuriously retransmit the same segment repeatedly. +* The receiver remembers and reflects via DSACKs. Leverage the +* DSACK state and change the txhash to re-route speculatively. +*/ + if (TCP_SKB_CB(skb)->seq == tcp_sk(sk)->duplicate_sack[0].start_seq) + sk_rethink_txhash(sk); +} + static void tcp_send_dupack(struct sock *sk, const struct sk_buff *skb) { struct tcp_sock *tp = tcp_sk(sk); @@ -4211,6 +4222,7 @@ static void tcp_send_dupack(struct sock *sk, const struct sk_buff *skb) if (tcp_is_sack(tp) && sock_net(sk)->ipv4.sysctl_tcp_dsack) { u32 end_seq = TCP_SKB_CB(skb)->end_seq; + tcp_rcv_spurious_retrans(sk, skb); if (after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt)) end_seq = tp->rcv_nxt; tcp_dsack_set(sk, TCP_SKB_CB(skb)->seq, end_seq); @@ -4755,6 +4767,7 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb) } if (!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt)) { + tcp_rcv_spurious_retrans(sk, skb); /* A retransmit, 2nd most common case. Force an immediate ack. */ NET_INC_STATS(sock_net(sk), LINUX_MIB_DELAYEDACKLOST); tcp_dsack_set(sk, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq); -- 2.19.0.rc0.228.g281dcd1b4d0-goog
[PATCH bpf-next 0/3] bpf: implement percpu map pretty print for bpffs and bpftool
Commit a26ca7c982cb ("bpf: btf: Add pretty print support to the basic arraymap") and Commit 699c86d6ec21 ("bpf: btf: add pretty print for hash/lru_hash maps") added bpffs pretty print for array, hash and lru hash maps. The pretty print gives users a structurally formatted dump for keys/values which much easy to understand than raw bytes. This patch set implemented bpffs pretty print support for percpu arraymap, percpu hashmap and percpu lru hashmap. For complex key/value types, the pretty print here is even more useful due to . large volumne of data making it even harder to correlate bytes to a particular field in a particular cpu. . kernel rounds the value size for each cpu to multiple of 8. User has to be aware of this otherwise wrong value may be derived from cpu 1/2/... For example, we may have a bpffs pretty print like below: 43602: { cpu0: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO} cpu1: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO} cpu2: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO} cpu3: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO} } for a percpu map. This patch also added percpu formatted print on bpftool. For example, bpftool may print like below: { "key": 0, "values": [{ "cpu": 0, "value": { "ui32": 0, "ui16": 0, } },{ "cpu": 1, "value": { "ui32": 1, "ui16": 0, } },{ "cpu": 2, "value": { "ui32": 2, "ui16": 0, } },{ "cpu": 3, "value": { "ui32": 3, "ui16": 0, } } ] } Patch #1 implemented bpffs pretty print for percpu arraymap/hash/lru_hash in kernel. Patch #2 added the test case in tools bpf selftest test_btf. Patch #3 added percpu map btf based dump. Yonghong Song (3): bpf: add bpffs pretty print for percpu arraymap/hash/lru_hash tools/bpf: add bpffs percpu map pretty print tests in test_btf tools/bpf: bpftool: add btf percpu map formated dump kernel/bpf/arraymap.c | 24 + kernel/bpf/hashtab.c | 31 ++ tools/bpf/bpftool/map.c| 33 +- tools/testing/selftests/bpf/test_btf.c | 179 ++--- 4 files changed, 230 insertions(+), 37 deletions(-) -- 2.14.3
[PATCH bpf-next 1/3] bpf: add bpffs pretty print for percpu arraymap/hash/lru_hash
Added bpffs pretty print for percpu arraymap, percpu hashmap and percpu lru hashmap. For each map pair, the format is: : { cpu0: cpu1: ... cpun: } For example, on my VM, there are 4 cpus, and for test_btf test in the next patch: cat /sys/fs/bpf/pprint_test_percpu_hash You may get: ... 43602: { cpu0: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO} cpu1: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO} cpu2: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO} cpu3: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO} } 72847: { cpu0: {72847,0,-72847,0x3,0x11c8f,0x3,{72847|[143,28,1,0,0,0,0,0]},ENUM_THREE} cpu1: {72847,0,-72847,0x3,0x11c8f,0x3,{72847|[143,28,1,0,0,0,0,0]},ENUM_THREE} cpu2: {72847,0,-72847,0x3,0x11c8f,0x3,{72847|[143,28,1,0,0,0,0,0]},ENUM_THREE} cpu3: {72847,0,-72847,0x3,0x11c8f,0x3,{72847|[143,28,1,0,0,0,0,0]},ENUM_THREE} } ... Signed-off-by: Yonghong Song --- kernel/bpf/arraymap.c | 24 kernel/bpf/hashtab.c | 31 +++ 2 files changed, 55 insertions(+) diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c index 0c17aab3ce5f..f9d24121be99 100644 --- a/kernel/bpf/arraymap.c +++ b/kernel/bpf/arraymap.c @@ -358,6 +358,29 @@ static void array_map_seq_show_elem(struct bpf_map *map, void *key, rcu_read_unlock(); } +static void percpu_array_map_seq_show_elem(struct bpf_map *map, void *key, + struct seq_file *m) +{ + struct bpf_array *array = container_of(map, struct bpf_array, map); + u32 index = *(u32 *)key; + void __percpu *pptr; + int cpu; + + rcu_read_lock(); + + seq_printf(m, "%u: {\n", *(u32 *)key); + pptr = array->pptrs[index & array->index_mask]; + for_each_possible_cpu(cpu) { + seq_printf(m, "\tcpu%d: ", cpu); + btf_type_seq_show(map->btf, map->btf_value_type_id, + per_cpu_ptr(pptr, cpu), m); + seq_puts(m, "\n"); + } + seq_puts(m, "}\n"); + + rcu_read_unlock(); +} + static int array_map_check_btf(const struct bpf_map *map, const struct btf_type *key_type, const struct btf_type *value_type) @@ -398,6 +421,7 @@ const struct bpf_map_ops percpu_array_map_ops = { .map_lookup_elem = percpu_array_map_lookup_elem, .map_update_elem = array_map_update_elem, .map_delete_elem = array_map_delete_elem, + .map_seq_show_elem = percpu_array_map_seq_show_elem, .map_check_btf = array_map_check_btf, }; diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c index 03cc59ee9c95..2c1790288138 100644 --- a/kernel/bpf/hashtab.c +++ b/kernel/bpf/hashtab.c @@ -1285,6 +1285,35 @@ int bpf_percpu_hash_update(struct bpf_map *map, void *key, void *value, return ret; } +static void htab_percpu_map_seq_show_elem(struct bpf_map *map, void *key, + struct seq_file *m) +{ + struct htab_elem *l; + void __percpu *pptr; + int cpu; + + rcu_read_lock(); + + l = __htab_map_lookup_elem(map, key); + if (!l) { + rcu_read_unlock(); + return; + } + + btf_type_seq_show(map->btf, map->btf_key_type_id, key, m); + seq_puts(m, ": {\n"); + pptr = htab_elem_get_ptr(l, map->key_size); + for_each_possible_cpu(cpu) { + seq_printf(m, "\tcpu%d: ", cpu); + btf_type_seq_show(map->btf, map->btf_value_type_id, + per_cpu_ptr(pptr, cpu), m); + seq_puts(m, "\n"); + } + seq_puts(m, "}\n"); + + rcu_read_unlock(); +} + const struct bpf_map_ops htab_percpu_map_ops = { .map_alloc_check = htab_map_alloc_check, .map_alloc = htab_map_alloc, @@ -1293,6 +1322,7 @@ const struct bpf_map_ops htab_percpu_map_ops = { .map_lookup_elem = htab_percpu_map_lookup_elem, .map_update_elem = htab_percpu_map_update_elem, .map_delete_elem = htab_map_delete_elem, + .map_seq_show_elem = htab_percpu_map_seq_show_elem, }; const struct bpf_map_ops htab_lru_percpu_map_ops = { @@ -1303,6 +1333,7 @@ const struct bpf_map_ops htab_lru_percpu_map_ops = { .map_lookup_elem = htab_lru_percpu_map_lookup_elem, .map_update_elem = htab_lru_percpu_map_update_elem, .map_delete_elem = htab_lru_map_delete_elem, + .map_seq_show_elem = htab_percpu_map_seq_show_elem, }; static int fd_htab_map_alloc_check(union bpf_attr *attr) -- 2.14.3
[PATCH bpf-next 2/3] tools/bpf: add bpffs percpu map pretty print tests in test_btf
The bpf selftest test_btf is extended to test bpffs percpu map pretty print for percpu array, percpu hash and percpu lru hash. Signed-off-by: Yonghong Song --- tools/testing/selftests/bpf/test_btf.c | 179 ++--- 1 file changed, 144 insertions(+), 35 deletions(-) diff --git a/tools/testing/selftests/bpf/test_btf.c b/tools/testing/selftests/bpf/test_btf.c index 6b5cfeb7a9cc..f42b3396d622 100644 --- a/tools/testing/selftests/bpf/test_btf.c +++ b/tools/testing/selftests/bpf/test_btf.c @@ -4,6 +4,7 @@ #include #include #include +#include #include #include #include @@ -45,7 +46,6 @@ static int count_result(int err) return err; } -#define min(a, b) ((a) < (b) ? (a) : (b)) #define __printf(a, b) __attribute__((format(printf, a, b))) __printf(1, 2) @@ -130,6 +130,7 @@ struct btf_raw_test { bool map_create_err; bool ordered_map; bool lossless_map; + bool percpu_map; int hdr_len_delta; int type_off_delta; int str_off_delta; @@ -2157,6 +2158,7 @@ static struct btf_pprint_test_meta { const char *map_name; bool ordered_map; bool lossless_map; + bool percpu_map; } pprint_tests_meta[] = { { .descr = "BTF pretty print array", @@ -2164,6 +2166,7 @@ static struct btf_pprint_test_meta { .map_name = "pprint_test_array", .ordered_map = true, .lossless_map = true, + .percpu_map = false, }, { @@ -2172,6 +2175,7 @@ static struct btf_pprint_test_meta { .map_name = "pprint_test_hash", .ordered_map = false, .lossless_map = true, + .percpu_map = false, }, { @@ -2180,30 +2184,83 @@ static struct btf_pprint_test_meta { .map_name = "pprint_test_lru_hash", .ordered_map = false, .lossless_map = false, + .percpu_map = false, +}, + +{ + .descr = "BTF pretty print percpu array", + .map_type = BPF_MAP_TYPE_PERCPU_ARRAY, + .map_name = "pprint_test_percpu_array", + .ordered_map = true, + .lossless_map = true, + .percpu_map = true, +}, + +{ + .descr = "BTF pretty print percpu hash", + .map_type = BPF_MAP_TYPE_PERCPU_HASH, + .map_name = "pprint_test_percpu_hash", + .ordered_map = false, + .lossless_map = true, + .percpu_map = true, +}, + +{ + .descr = "BTF pretty print lru percpu hash", + .map_type = BPF_MAP_TYPE_LRU_PERCPU_HASH, + .map_name = "pprint_test_lru_percpu_hash", + .ordered_map = false, + .lossless_map = false, + .percpu_map = true, }, }; -static void set_pprint_mapv(struct pprint_mapv *v, uint32_t i) +static void set_pprint_mapv(struct pprint_mapv *v, uint32_t i, + int num_cpus, int rounded_value_size) { - v->ui32 = i; - v->si32 = -i; - v->unused_bits2a = 3; - v->bits28 = i; - v->unused_bits2b = 3; - v->ui64 = i; - v->aenum = i & 0x03; + int cpu; + + for (cpu = 0; cpu < num_cpus; cpu++) { + v->ui32 = i + cpu; + v->si32 = -i; + v->unused_bits2a = 3; + v->bits28 = i; + v->unused_bits2b = 3; + v->ui64 = i; + v->aenum = i & 0x03; + v = (void *)v + rounded_value_size; + } } +static int check_line(const char *expected_line, int nexpected_line, + int expected_line_len, const char *line) +{ + if (CHECK(nexpected_line == expected_line_len, + "expected_line is too long")) + return -1; + + if (strcmp(expected_line, line)) { + fprintf(stderr, "unexpected pprint output\n"); + fprintf(stderr, "expected: %s", expected_line); + fprintf(stderr, "read: %s", line); + return -1; + } + + return 0; +} + + static int do_test_pprint(void) { const struct btf_raw_test *test = _test_template; struct bpf_create_map_attr create_attr = {}; + bool ordered_map, lossless_map, percpu_map; + int err, ret, num_cpus, rounded_value_size; + struct pprint_mapv *mapv = NULL; unsigned int key, nr_read_elems; - bool ordered_map, lossless_map; int map_fd = -1, btf_fd = -1; - struct pprint_mapv mapv = {}; unsigned int raw_btf_size; char expected_line[255]; FILE *pin_file = NULL; @@ -2212,7 +2269,6 @@ static int do_test_pprint(void) char *line = NULL; uint8_t *raw_btf; ssize_t nread; - int err, ret; fprintf(stderr, "%s..", test->descr); raw_btf = btf_raw_create(_tmpl, test->raw_types, @@ -2261,9 +2317,18 @@ static int do_test_pprint(void) if (CHECK(err, "bpf_obj_pin(%s): errno:%d.", pin_path, errno)) goto done; + percpu_map = test->percpu_map; + num_cpus = percpu_map ? bpf_num_possible_cpus() : 1; +
[PATCH bpf-next 3/3] tools/bpf: bpftool: add btf percpu map formated dump
The btf pretty print is added to percpu arraymap, percpu hashmap and percpu lru hashmap. For each pair, the following will be added to plain/json output: { "key": , "values": [{ "cpu": 0, "value": },{ "cpu": 1, "value": },{ },{ "cpu": n, "value": } ] } For example, the following could be part of plain or json formatted output: { "key": 0, "values": [{ "cpu": 0, "value": { "ui32": 0, "ui16": 0, } },{ "cpu": 1, "value": { "ui32": 1, "ui16": 0, } },{ "cpu": 2, "value": { "ui32": 2, "ui16": 0, } },{ "cpu": 3, "value": { "ui32": 3, "ui16": 0, } } ] } Signed-off-by: Yonghong Song --- tools/bpf/bpftool/map.c | 33 +++-- 1 file changed, 31 insertions(+), 2 deletions(-) diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c index b2ec20e562bd..df175bc33c5d 100644 --- a/tools/bpf/bpftool/map.c +++ b/tools/bpf/bpftool/map.c @@ -169,9 +169,28 @@ static int do_dump_btf(const struct btf_dumper *d, if (ret) goto err_end_obj; - jsonw_name(d->jw, "value"); + if (!map_is_per_cpu(map_info->type)) { + jsonw_name(d->jw, "value"); + ret = btf_dumper_type(d, map_info->btf_value_type_id, value); + } else { + unsigned int i, n, step; - ret = btf_dumper_type(d, map_info->btf_value_type_id, value); + jsonw_name(d->jw, "values"); + jsonw_start_array(d->jw); + n = get_possible_cpus(); + step = round_up(map_info->value_size, 8); + for (i = 0; i < n; i++) { + jsonw_start_object(d->jw); + jsonw_int_field(d->jw, "cpu", i); + jsonw_name(d->jw, "value"); + ret = btf_dumper_type(d, map_info->btf_value_type_id, + value + i * step); + jsonw_end_object(d->jw); + if (ret) + break; + } + jsonw_end_array(d->jw); + } err_end_obj: /* end of key-value pair */ @@ -298,6 +317,16 @@ static void print_entry_json(struct bpf_map_info *info, unsigned char *key, jsonw_end_object(json_wtr); } jsonw_end_array(json_wtr); + if (btf) { + struct btf_dumper d = { + .btf = btf, + .jw = json_wtr, + .is_plain_text = false, + }; + + jsonw_name(json_wtr, "formatted"); + do_dump_btf(, info, key, value); + } } jsonw_end_object(json_wtr); -- 2.14.3
Motorcycle Owners List
Hi, Greeting of the day! Would you are interested in acquiring an email list of "Motorcycle Owners"? from USA. We also having data of Harley Davidson Owners, Car Owners List, BMW Owners List, Luxury Car Owners List, RV Owners, Pick Up Truck Owners, Boat Owners, RV Owners List, HNI,Travelers List and many more... Each record we will provide you with: Contact (First and Last name), Mailing Address and Emails Address. All the contacts are opt-in verified, complete permission based and can be used for unlimited multi-channel marketing. Let me know if you'd be interested in hearing more about it. Waiting for your valuable and sincere reply. Best Regards, Marlene Royle Research Analyst
Re: GPL compliance issue with liquidio/lio_23xx_vsw.bin firmware
On Mon, 2018-08-27 at 17:04 -0700, Felix Manlunas wrote: > On Mon, Aug 27, 2018 at 05:01:10PM +0200, Florian Weimer wrote: > > liquidio/lio_23xx_vsw.bin contains a compiled MIPS Linux kernel: > > > > $ tail --bytes=+1313 liquidio/lio_23xx_vsw.bin > elf > > $ readelf -aW elf > > […] > > [ 6] __ksymtab PROGBITS80e495f8 64a5f8 00d130 > > 00 A 0 0 8 > > [ 7] __ksymtab_gpl PROGBITS80e56728 657728 008400 > > 00 A 0 0 8 > > [ 8] __ksymtab_strings PROGBITS80e5eb28 65fb28 018868 > > 00 A 0 0 1 > > […] > > Symbol table '.symtab' contains 1349 entries: > >Num:Value Size TypeBind Vis Ndx Name > > 0: 0 NOTYPE LOCAL DEFAULT UND > > 1: 0 FILELOCAL DEFAULT ABS > > arch/mips/kernel/head.o > > 2: 0 FILELOCAL DEFAULT ABS init/main.c > > 3: 0 FILELOCAL DEFAULT ABS > > include/linux/types.h > > […] > > > > Yet there is no corresponding source provided, and LICENCE.cavium lacks > > the required notices. > > > > Thanks, > > Florian > > Cavium apologizes for the oversight. Cavium has been advertising the > appropriate license terms including the existence of GPL in the firmware > in our outbox releases. We will update the license terms in LICENCE.cavium > in our upstream contribution in collaboration with our legal team. Everything added to linux-firmware.git needs to be safe for Linux distributions to redistribute. (There are some ancient firmware images with unclear licensing, but we don't want to add any more.) My understanding is that GPL v2 requires that commercial distributors either provide the complete and corresponding source alongside the binaries, or include a written offer to provide it later. They *cannot* simply point to your offer to do this (only non-commercial distributors can do that). So the source needs to be published too. Adding the complete Linux kernel source code to linux-firmware.git doesn't seem like a sensible step, so maybe this particular firmware needs to live elsewhere. Ben. -- Ben Hutchings For every complex problem there is a solution that is simple, neat, and wrong. signature.asc Description: This is a digitally signed message part
Wine Enthusiasts List
Hi, Greeting of the day! Would you be interested in reaching out to "Wine Enthusiasts list" from USA? Our Databases:- 1.Beer Enthusiasts List2.Alcohol Enthusiasts List 3.Beverage Consumers 4.Liquor Enthusiasts List 5.Chocolate Enthusiasts List 6.Students List 7.Luxury Brand Buyers List 8.Food Lovers List 9.Spa and Resort Visitors 10.Sports Enthusiasts List and many more. Each record in the list contains Contact Name (First and Last Name), Mailing Address, List type and Opt-in email address. All the contacts are opt-in verified, complete permission based and can be used for unlimited multi-channel marketing. Let me know if you'd be interested in hearing more about it. Waiting for your valuable and sincere reply. Best Regards, Julie Wider Research Analyst
Re: Fw: [Bug 200943] New: Repeating tcp_mark_head_lost in dmesg
On Wed, Aug 29, 2018 at 8:02 AM, Stephen Hemminger wrote: > > > > Begin forwarded message: > > Date: Sun, 26 Aug 2018 22:24:12 + > From: bugzilla-dae...@bugzilla.kernel.org > To: step...@networkplumber.org > Subject: [Bug 200943] New: Repeating tcp_mark_head_lost in dmesg > > > https://bugzilla.kernel.org/show_bug.cgi?id=200943 > > Bug ID: 200943 >Summary: Repeating tcp_mark_head_lost in dmesg >Product: Networking >Version: 2.5 > Kernel Version: 4.14.66 > Hardware: All > OS: Linux > Tree: Mainline > Status: NEW > Severity: normal > Priority: P1 > Component: IPV4 > Assignee: step...@networkplumber.org > Reporter: rm+...@romanrm.net > Regression: No > > Getting a bunch of these now every hour during continuous ~100 Mbit of network > traffic. > What's up with that? Seems harmless, as in the kernel doesn't crash and the > network connection is not interrupted. (Maybe the particular TCP session is?) > If there are no ill-effects from this condition, is such spammy WARN_ON really > necessary? This warning is likely triggered by buggy remote SACK behaviors, and is pretty harmless - in my opinion the warning tcp_verify_left_out() is still worthy to detect other inflight states inconsistencies. The good news the particular loss recovery code path is disabled by default on 4.18+ kernels by this patch commit b38a51fec1c1f693f03b1aa19d0622123634d4b7 Author: Yuchung Cheng Date: Wed May 16 16:40:11 2018 -0700 tcp: disable RFC6675 loss detection > > [Mon Aug 27 02:16:11 2018] [ cut here ] > [Mon Aug 27 02:16:11 2018] WARNING: CPU: 5 PID: 0 at net/ipv4/tcp_input.c:2263 > tcp_mark_head_lost+0x247/0x260 > [Mon Aug 27 02:16:11 2018] Modules linked in: dm_snapshot loop vhost_net vhost > tap tun ip6t_MASQUERADE nf_nat_masquerade_ipv6 ipt_MASQUERADE > nf_nat_masquerade_ipv4 xt_DSCP xt_mark ip6t_REJECT nf_reject_ipv6 ipt_REJECT > nf_reject_ipv4 xt_owner xt_tcpudp xt_set ip_set_hash_net ip_set nfnetlink > xt_limit xt_length xt_multiport xt_conntrack ip6t_rpfilter ipt_rpfilter > ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_raw > wireguard ip6_udp_tunnel udp_tunnel ip6table_mangle iptable_nat > nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_raw > iptable_mangle ip6table_filter ip6_tables matroxfb_base matroxfb_g450 > matroxfb_Ti3026 matroxfb_accel matroxfb_DAC1064 g450_pll matroxfb_misc > iptable_filter ip_tables x_tables cpufreq_powersave cpufreq_userspace > cpufreq_conservative 8021q garp mrp > [Mon Aug 27 02:16:11 2018] bridge stp llc bonding tcp_bbr sch_fq tcp_illinois > fuse radeon ttm drm_kms_helper drm i2c_algo_bit it87 hwmon_vid eeepc_wmi > asus_wmi sparse_keymap rfkill video wmi_bmof mxm_wmi edac_mce_amd kvm_amd kvm > snd_pcm snd_timer snd soundcore joydev evdev pcspkr k10temp fam15h_power > sp5100_tco sg shpchp wmi pcc_cpufreq acpi_cpufreq button ext4 crc16 mbcache > jbd2 fscrypto btrfs zstd_decompress zstd_compress xxhash algif_skcipher af_alg > dm_crypt dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio dm_mod > hid_generic usbhid hid raid10 raid456 async_raid6_recov async_memcpy async_pq > async_xor async_tx xor sd_mod raid6_pq libcrc32c crc32c_generic raid1 raid0 > multipath linear md_mod vfio_pci irqbypass vfio_virqfd vfio_iommu_type1 vfio > ohci_pci crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel > [Mon Aug 27 02:16:11 2018] pcbc aesni_intel aes_x86_64 crypto_simd > glue_helper > cryptd r8169 ahci xhci_pci libahci ohci_hcd ehci_pci mii xhci_hcd ehci_hcd > i2c_piix4 libata usbcore scsi_mod bnx2 > [Mon Aug 27 02:16:11 2018] CPU: 5 PID: 0 Comm: swapper/5 Tainted: GW >4.14.66-rm1+ #132 > [Mon Aug 27 02:16:11 2018] Hardware name: To be filled by O.E.M. To be filled > by O.E.M./SABERTOOTH 990FX R2.0, BIOS 2901 05/04/2016 > [Mon Aug 27 02:16:11 2018] task: 8ba79c679dc0 task.stack: b4d741928000 > [Mon Aug 27 02:16:11 2018] RIP: 0010:tcp_mark_head_lost+0x247/0x260 > [Mon Aug 27 02:16:11 2018] RSP: 0018:8ba7aed437d8 EFLAGS: 00010202 > [Mon Aug 27 02:16:11 2018] RAX: 0018 RBX: 8ba3901a0800 RCX: > > [Mon Aug 27 02:16:11 2018] RDX: 0017 RSI: 0001 RDI: > 8ba4d47e9000 > [Mon Aug 27 02:16:11 2018] RBP: 8ba4d47e9000 R08: 000d R09: > > [Mon Aug 27 02:16:11 2018] R10: 100c R11: R12: > 0001 > [Mon Aug 27 02:16:11 2018] R13: 8ba4d47e9158 R14: 0001 R15: > 9d0b6708 > [Mon Aug 27 02:16:11 2018] FS: () > GS:8ba7aed4() knlGS: > [Mon Aug 27 02:16:11 2018] CS: 0010 DS: ES: CR0: 80050033 > [Mon Aug 27 02:16:11 2018] CR2: 01fbdff0 CR3: 00040c8d6000 CR4: > 000406e0 > [Mon Aug 27 02:16:11 2018] Call Trace: > [Mon
[PATCH net] nfp: wait for posted reconfigs when disabling the device
To avoid leaking a running timer we need to wait for the posted reconfigs after netdev is unregistered. In common case the process of deinitializing the device will perform synchronous reconfigs which wait for posted requests, but especially with VXLAN ports being actively added and removed there can be a race condition leaving a timer running after adapter structure is freed leading to a crash. Add an explicit flush after deregistering and for a good measure a warning to check if timer is running just before structures are freed. Fixes: 3d780b926a12 ("nfp: add async reconfiguration mechanism") Signed-off-by: Jakub Kicinski Reviewed-by: Dirk van der Merwe --- .../ethernet/netronome/nfp/nfp_net_common.c | 48 +-- 1 file changed, 33 insertions(+), 15 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index a8b9fbab5f73..253bdaef1505 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -229,29 +229,16 @@ static void nfp_net_reconfig_post(struct nfp_net *nn, u32 update) spin_unlock_bh(>reconfig_lock); } -/** - * nfp_net_reconfig() - Reconfigure the firmware - * @nn: NFP Net device to reconfigure - * @update: The value for the update field in the BAR config - * - * Write the update word to the BAR and ping the reconfig queue. The - * poll until the firmware has acknowledged the update by zeroing the - * update word. - * - * Return: Negative errno on error, 0 on success - */ -int nfp_net_reconfig(struct nfp_net *nn, u32 update) +static void nfp_net_reconfig_sync_enter(struct nfp_net *nn) { bool cancelled_timer = false; u32 pre_posted_requests; - int ret; spin_lock_bh(>reconfig_lock); nn->reconfig_sync_present = true; if (nn->reconfig_timer_active) { - del_timer(>reconfig_timer); nn->reconfig_timer_active = false; cancelled_timer = true; } @@ -260,14 +247,43 @@ int nfp_net_reconfig(struct nfp_net *nn, u32 update) spin_unlock_bh(>reconfig_lock); - if (cancelled_timer) + if (cancelled_timer) { + del_timer_sync(>reconfig_timer); nfp_net_reconfig_wait(nn, nn->reconfig_timer.expires); + } /* Run the posted reconfigs which were issued before we started */ if (pre_posted_requests) { nfp_net_reconfig_start(nn, pre_posted_requests); nfp_net_reconfig_wait(nn, jiffies + HZ * NFP_NET_POLL_TIMEOUT); } +} + +static void nfp_net_reconfig_wait_posted(struct nfp_net *nn) +{ + nfp_net_reconfig_sync_enter(nn); + + spin_lock_bh(>reconfig_lock); + nn->reconfig_sync_present = false; + spin_unlock_bh(>reconfig_lock); +} + +/** + * nfp_net_reconfig() - Reconfigure the firmware + * @nn: NFP Net device to reconfigure + * @update: The value for the update field in the BAR config + * + * Write the update word to the BAR and ping the reconfig queue. The + * poll until the firmware has acknowledged the update by zeroing the + * update word. + * + * Return: Negative errno on error, 0 on success + */ +int nfp_net_reconfig(struct nfp_net *nn, u32 update) +{ + int ret; + + nfp_net_reconfig_sync_enter(nn); nfp_net_reconfig_start(nn, update); ret = nfp_net_reconfig_wait(nn, jiffies + HZ * NFP_NET_POLL_TIMEOUT); @@ -3633,6 +3649,7 @@ struct nfp_net *nfp_net_alloc(struct pci_dev *pdev, bool needs_netdev, */ void nfp_net_free(struct nfp_net *nn) { + WARN_ON(timer_pending(>reconfig_timer) || nn->reconfig_posted); if (nn->dp.netdev) free_netdev(nn->dp.netdev); else @@ -3920,4 +3937,5 @@ void nfp_net_clean(struct nfp_net *nn) return; unregister_netdev(nn->dp.netdev); + nfp_net_reconfig_wait_posted(nn); } -- 2.17.1
Re: [PATCH bpf-next 00/11] AF_XDP zero-copy support for i40e
On Tue, Aug 28, 2018 at 02:44:24PM +0200, Björn Töpel wrote: > From: Björn Töpel > > This patch set introduces zero-copy AF_XDP support for Intel's i40e > driver. In the first preparatory patch we also add support for > XDP_REDIRECT for zero-copy allocated frames so that XDP programs can > redirect them. This was a ToDo from the first AF_XDP zero-copy patch > set from early June. Special thanks to Alex Duyck and Jesper Dangaard > Brouer for reviewing earlier versions of this patch set. > > The i40e zero-copy code is located in its own file i40e_xsk.[ch]. Note > that in the interest of time, to get an AF_XDP zero-copy implementation > out there for people to try, some code paths have been copied from the > XDP path to the zero-copy path. It is out goal to merge the two paths > in later patch sets. > > In contrast to the implementation from beginning of June, this patch > set does not require any extra HW queues for AF_XDP zero-copy > TX. Instead, the XDP TX HW queue is used for both XDP_REDIRECT and > AF_XDP zero-copy TX. > > Jeff, given that most of changes are in i40e, it is up to you how you > would like to route these patches. The set is tagged bpf-next, but > if taking it via the Intel driver tree is easier, let us know. > > We have run some benchmarks on a dual socket system with two Broadwell > E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14 > cores which gives a total of 28, but only two cores are used in these > experiments. One for TR/RX and one for the user space application. The > memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is > 8192MB and with 8 of those DIMMs in the system we have 64 GB of total > memory. The compiler used is gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0. The > NIC is Intel I40E 40Gbit/s using the i40e driver. > > Below are the results in Mpps of the I40E NIC benchmark runs for 64 > and 1500 byte packets, generated by a commercial packet generator HW > outputing packets at full 40 Gbit/s line rate. The results are with > retpoline and all other spectre and meltdown fixes, so these results > are not comparable to the ones from the zero-copy patch set in June. > > AF_XDP performance 64 byte packets. > Benchmark XDP_SKBXDP_DRVXDP_DRV with zerocopy > rxdrop 2.68.2 15.0 > txpush 2.2- 21.9 > l2fwd1.72.3 11.3 > > AF_XDP performance 1500 byte packets: > Benchmark XDP_SKB XDP_DRV XDP_DRV with zerocopy > rxdrop 2.03.3 3.3 > l2fwd1.31.7 3.1 > > XDP performance on our system as a base line: > > 64 byte packets: > XDP stats CPU pps issue-pps > XDP-RX CPU 16 18.4M 0 > > 1500 byte packets: > XDP stats CPU pps issue-pps > XDP-RX CPU 16 3.3M0 > > The structure of the patch set is as follows: > > Patch 1: Add support for XDP_REDIRECT of zero-copy allocated frames > Patches 2-4: Preparatory patches to common xsk and net code > Patches 5-7: Preparatory patches to i40e driver code for RX > Patch 8: i40e zero-copy support for RX > Patch 9: Preparatory patch to i40e driver code for TX > Patch 10: i40e zero-copy support for TX > Patch 11: Add flags to sample application to force zero-copy/copy mode > > We based this patch set on bpf-next commit 050cdc6c9501 ("Merge > git://git.kernel.org/pub/scm/linux/kernel/git/davem/net") > > > Magnus & Björn > > Björn Töpel (8): > xdp: implement convert_to_xdp_frame for MEM_TYPE_ZERO_COPY > xdp: export xdp_rxq_info_unreg_mem_model > xsk: expose xdp_umem_get_{data,dma} to drivers > i40e: added queue pair disable/enable functions > i40e: refactor Rx path for re-use > i40e: move common Rx functions to i40e_txrx_common.h > i40e: add AF_XDP zero-copy Rx support > samples/bpf: add -c/--copy -z/--zero-copy flags to xdpsock > > Magnus Karlsson (3): > net: add napi_if_scheduled_mark_missed > i40e: move common Tx functions to i40e_txrx_common.h > i40e: add AF_XDP zero-copy Tx support Applied to bpf-next. Thanks a lot to the authors, code reviewers and testers. I hope all the issues brought up during code review can be quickly addressed in the follow up patches and zerocopy feature in i40e and other drivers will be mainline ready for the next merge window.
Re: Kernel warnings from tcp.c
On Wed, 29 Aug 2018 14:38:38 -0400 Adam Mitchell wrote: > Anyone with experience in tcp.c have an idea what's causing this on > our busy database server? Comes from this macro at line 2278: > WARN_ON(sock_owned_by_user(sk)); > > > [726780.788201] WARNING: CPU: 15 PID: 52245 at net/ipv4/tcp.c:2278 > tcp_close+0x40f/0x430 > > [726780.794947] Modules linked in: binfmt_misc nf_conntrack_netlink > nfnetlink_queue tcp_diag inet_diag isofs ip6table_mangle ip6table_raw > ip6table_nat nf_nat_ipv6 iptable_security xt_CT iptable_raw > iptable_nat nf_nat_ipv4 nf_nat iptable_mangle xt_pkttype xt_NFLOG > nfnetlink_log xt_u32 xt_multiport xt_set xt_conntrack > ip_set_hash_netport ip_set_hash_ipport ip_set_hash_net ip_set_hash_ip > ip_set nfnetlink nf_conntrack_proto_gre nf_conntrack_ipv6 > nf_defrag_ipv6 ip6table_filter ip6_tables xt_LOG nf_conntrack_tftp > nf_conntrack_ftp nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack > iptable_filter zfs(PO) zunicode(PO) zavl(PO) icp(PO) sb_edac > intel_powerclamp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel > pcbc aesni_intel crypto_simd glue_helper cryptd cirrus snd_seq > zcommon(PO) ttm intel_rapl_perf snd_seq_device > [726780.848696] drm_kms_helper znvpair(PO) snd_pcm snd_timer spl(O) > drm snd soundcore syscopyarea sysfillrect pcspkr sysimgblt input_leds > fb_sys_fops i2c_piix4 ip_tables xfs libcrc32c ata_generic pata_acpi > ata_piix xen_blkfront crc32c_intel libata ena(O) serio_raw floppy > sunrpc > [726780.863916] CPU: 15 PID: 52245 Comm: mysqld Tainted: PW O > 4.16.13-1.el7.elrepo.x86_64 #1 > Sorry, you have a tainted kernel with modules that are not upstream therefore don't expect any support from kernel community
Re: [PATCH bpf-next 08/11] i40e: add AF_XDP zero-copy Rx support
On Tue, Aug 28, 2018 at 02:44:32PM +0200, Björn Töpel wrote: > From: Björn Töpel > > This patch adds zero-copy Rx support for AF_XDP sockets. Instead of > allocating buffers of type MEM_TYPE_PAGE_SHARED, the Rx frames are > allocated as MEM_TYPE_ZERO_COPY when AF_XDP is enabled for a certain > queue. > > All AF_XDP specific functions are added to a new file, i40e_xsk.c. > > Note that when AF_XDP zero-copy is enabled, the XDP action XDP_PASS > will allocate a new buffer and copy the zero-copy frame prior passing > it to the kernel stack. > > Signed-off-by: Björn Töpel ... > +/** > + * i40e_construct_skb_zc - Create skbufff from zero-copy Rx buffer typo sk_buff overall the set looks great to me. The driver side changes look particularly clean and independent from skb path of the driver. I think the issues brought up by other reviews are not blockers for the set, but certainly should be addressed in the follow up patches.
[RFC] net: xsk: add a simple buffer reuse queue
XSK UMEM is strongly single producer single consumer so reuse of frames is challenging. Add a simple "stash" of FILL packets to reuse for drivers to optionally make use of. This is useful when driver has to free (ndo_stop) or resize a ring with an active AF_XDP ZC socket. Signed-off-by: Jakub Kicinski --- include/net/xdp_sock.h | 44 + net/xdp/xdp_umem.c | 2 ++ net/xdp/xsk_queue.c| 56 ++ net/xdp/xsk_queue.h| 3 +++ 4 files changed, 105 insertions(+) diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h index 6871e4755975..108c1c100de4 100644 --- a/include/net/xdp_sock.h +++ b/include/net/xdp_sock.h @@ -14,6 +14,7 @@ #include struct net_device; +struct xdp_umem_fq_reuse; struct xsk_queue; struct xdp_umem_props { @@ -41,6 +42,7 @@ struct xdp_umem { struct page **pgs; u32 npgs; struct net_device *dev; + struct xdp_umem_fq_reuse *fq_reuse; u16 queue_id; bool zc; spinlock_t xsk_list_lock; @@ -110,4 +112,46 @@ static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem *umem, u64 addr) return umem->pages[addr >> PAGE_SHIFT].dma + (addr & (PAGE_SIZE - 1)); } +struct xdp_umem_fq_reuse { + u32 nentries; + u32 length; + u64 handles[]; +}; + +/* Following functions are not thread-safe in any way */ +struct xdp_umem_fq_reuse *xsk_reuseq_prepare(u32 nentries); +struct xdp_umem_fq_reuse *xsk_reuseq_swap(struct xdp_umem *umem, + struct xdp_umem_fq_reuse *newq); +void xsk_reuseq_free(struct xdp_umem_fq_reuse *rq); + +/* Reuse-queue aware version of FILL queue helpers */ +static inline u64 *xsk_umem_peek_addr_rq(struct xdp_umem *umem, u64 *addr) +{ + struct xdp_umem_fq_reuse *rq = umem->fq_reuse; + + if (!rq->length) { + return xsk_umem_peek_addr(umem, addr); + } else { + *addr = rq->handles[rq->length - 1]; + return addr; + } +} + +static inline void xsk_umem_discard_addr_rq(struct xdp_umem *umem) +{ + struct xdp_umem_fq_reuse *rq = umem->fq_reuse; + + if (!rq->length) + xsk_umem_discard_addr(umem); + else + rq->length--; +} + +static inline void xsk_umem_fq_reuse(struct xdp_umem *umem, u64 addr) +{ + struct xdp_umem_fq_reuse *rq = umem->fq_reuse; + + rq->handles[rq->length++] = addr; +} + #endif /* _LINUX_XDP_SOCK_H */ diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c index e762310c9bee..40303e24c954 100644 --- a/net/xdp/xdp_umem.c +++ b/net/xdp/xdp_umem.c @@ -170,6 +170,8 @@ static void xdp_umem_release(struct xdp_umem *umem) umem->cq = NULL; } + xsk_reuseq_destroy(umem); + xdp_umem_unpin_pages(umem); task = get_pid_task(umem->pid, PIDTYPE_PID); diff --git a/net/xdp/xsk_queue.c b/net/xdp/xsk_queue.c index 6c32e92e98fc..f9ee40a13a9a 100644 --- a/net/xdp/xsk_queue.c +++ b/net/xdp/xsk_queue.c @@ -3,7 +3,9 @@ * Copyright(c) 2018 Intel Corporation. */ +#include #include +#include #include "xsk_queue.h" @@ -61,3 +63,57 @@ void xskq_destroy(struct xsk_queue *q) page_frag_free(q->ring); kfree(q); } + +struct xdp_umem_fq_reuse *xsk_reuseq_prepare(u32 nentries) +{ + struct xdp_umem_fq_reuse *newq; + + /* Check for overflow */ + if (nentries > (u32)roundup_pow_of_two(nentries)) + return NULL; + nentries = roundup_pow_of_two(nentries); + + newq = kvmalloc(struct_size(newq, handles, nentries), GFP_KERNEL); + if (!newq) + return NULL; + memset(newq, 0, offsetof(typeof(*newq), handles)); + + newq->nentries = nentries; + return newq; +} +EXPORT_SYMBOL_GPL(xsk_reuseq_prepare); + +struct xdp_umem_fq_reuse *xsk_reuseq_swap(struct xdp_umem *umem, + struct xdp_umem_fq_reuse *newq) +{ + struct xdp_umem_fq_reuse *oldq = umem->fq_reuse; + + if (!oldq) { + umem->fq_reuse = newq; + return NULL; + } + + if (newq->nentries < oldq->length) + return newq; + + + memcpy(newq->handles, oldq->handles, + array_size(oldq->length, sizeof(u64))); + newq->length = oldq->length; + + umem->fq_reuse = newq; + return oldq; +} +EXPORT_SYMBOL_GPL(xsk_reuseq_swap); + +void xsk_reuseq_free(struct xdp_umem_fq_reuse *rq) +{ + kvfree(rq); +} +EXPORT_SYMBOL_GPL(xsk_reuseq_free); + +void xsk_reuseq_destroy(struct xdp_umem *umem) +{ + xsk_reuseq_free(umem->fq_reuse); + umem->fq_reuse = NULL; +} diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h index 8a64b150be54..7a480e3eb35d 100644 --- a/net/xdp/xsk_queue.h +++ b/net/xdp/xsk_queue.h @@ -257,4 +257,7 @@ void xskq_set_umem(struct xsk_queue *q, struct xdp_umem_props *umem_props); struct xsk_queue *xskq_create(u32 nentries, bool umem_queue);
Re: [PATCH bpf-next 08/11] i40e: add AF_XDP zero-copy Rx support
On Tue, 28 Aug 2018 14:44:32 +0200, Björn Töpel wrote: > From: Björn Töpel > > This patch adds zero-copy Rx support for AF_XDP sockets. Instead of > allocating buffers of type MEM_TYPE_PAGE_SHARED, the Rx frames are > allocated as MEM_TYPE_ZERO_COPY when AF_XDP is enabled for a certain > queue. > > All AF_XDP specific functions are added to a new file, i40e_xsk.c. > > Note that when AF_XDP zero-copy is enabled, the XDP action XDP_PASS > will allocate a new buffer and copy the zero-copy frame prior passing > it to the kernel stack. > > Signed-off-by: Björn Töpel Mm.. I'm surprised you don't run into buffer reuse issues that I had when playing with AF_XDP. What happens in i40e if someone downs the interface? Will UMEMs get destroyed? Will the RX buffers get freed? I'll shortly send an RFC with my quick and dirty RX buffer reuse queue, FWIW.
pull-request: bpf 2018-08-29
Hi David, The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) Fix a build error in sk_reuseport_convert_ctx_access() when compiling with clang which cannot resolve hweight_long() at build time inside the BUILD_BUG_ON() assertion, from Stefan. 2) Several fixes for BPF sockmap, four of them in getting the bpf_msg_pull_data() helper to work, one use after free case in bpf_tcp_close() and one refcount leak in bpf_tcp_recvmsg(), from Daniel. 3) Another fix for BPF sockmap where we misaccount sk_mem_uncharge() in the socket redirect error case from unwinding scatterlist twice, from John. Please consider pulling these changes from: git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git Thanks a lot! The following changes since commit 53ae914d898e5dd5984d352d5fa0b23410f966a0: net/rds: Use rdma_read_gids to get connection SGID/DGID in IPv6 (2018-08-27 15:26:01 -0700) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git for you to fetch changes up to d65e6c80c6bb72ced46ce90dea4016d913a8ddd4: Merge branch 'bpf_msg_pull_data-fixes' (2018-08-29 10:47:18 -0700) Alexei Starovoitov (1): Merge branch 'bpf_msg_pull_data-fixes' Daniel Borkmann (6): bpf, sockmap: fix potential use after free in bpf_tcp_close bpf, sockmap: fix psock refcount leak in bpf_tcp_recvmsg bpf: fix several offset tests in bpf_msg_pull_data bpf: fix msg->data/data_end after sg shift repair in bpf_msg_pull_data bpf: fix shift upon scatterlist ring wrap-around in bpf_msg_pull_data bpf: fix sg shift repair start offset in bpf_msg_pull_data John Fastabend (1): bpf: sockmap, decrement copied count correctly in redirect error case Stefan Agner (1): bpf: fix build error with clang kernel/bpf/sockmap.c | 52 +--- net/core/filter.c| 52 +--- 2 files changed, 54 insertions(+), 50 deletions(-)
Re: bpfilter causes a leftover kernel process
On Tue, 28 Aug 2018 22:35:38 -0700 Alexei Starovoitov wrote: > > Yeah, I have a similar thing happening on shutdown, except that > > we're talking about a kernel thread here, so that process is > > ignored by the mentionned killing spree as a result, thus leaving > > that process running. > > it's not a kernel thread and sounds like there is a bug in your pid 1 > that is worth fixing. Well, it sure does look like one. By which I mean that looking at its /proc/$PID entry, one can see that it has an empty command line and kthreadd as parent (ppid 2), which usually are only true for kernel threads (unless I'm mistaken). The tool I'm using to send the signals on shutdown does indeed not use kill(-1,sig) but instead scans /proc and determines whether or not to signal a process (thus allowing to ignore a few specific processes to be killed later on). Kernel threads are obviously to be skipped, and to identify them it uses the classic method of checking whether or not it has an empty command line. As I said, this bpfilter helper does and that's why it is considered a kernel thread & thus not killed. This is also what other tools do, like ps or top, which is indeed why they also show it as a kernel thread ("[none]"). So is this way of doing this wrong? And if so, what would be a better way to identify kernel threads? Out of curiosity I also had a look at how systemd does things, and it looks to me like it does the exact same thing[1], also identifying kernel threads by their empty command line. So I would think that a similar issue could be observed under systemd as well. Thanks. [1] https://github.com/systemd/systemd/blob/master/src/core/killall.c#L44
[PATCH net] Revert "packet: switch kvzalloc to allocate memory"
This reverts commit 71e41286203c017d24f041a7cd71abea7ca7b1e0. mmap()/munmap() can not be backed by kmalloced pages : We fault in : VM_BUG_ON_PAGE(PageSlab(page), page); unmap_single_vma+0x8a/0x110 unmap_vmas+0x4b/0x90 unmap_region+0xc9/0x140 do_munmap+0x274/0x360 vm_munmap+0x81/0xc0 SyS_munmap+0x2b/0x40 do_syscall_64+0x13e/0x1c0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 Fixes: 71e41286203c ("packet: switch kvzalloc to allocate memory") Signed-off-by: Eric Dumazet Reported-by: John Sperbeck Bisected-by: John Sperbeck Cc: Zhang Yu Cc: Li RongQing --- net/packet/af_packet.c | 44 +- net/packet/internal.h | 1 + 2 files changed, 32 insertions(+), 13 deletions(-) diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index 5610061e7f2e00b935ce44dd9cf82d10eb77a7bf..75c92a87e7b2481141161c8945f5e7eef8e0abf8 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -4137,36 +4137,52 @@ static const struct vm_operations_struct packet_mmap_ops = { .close = packet_mm_close, }; -static void free_pg_vec(struct pgv *pg_vec, unsigned int len) +static void free_pg_vec(struct pgv *pg_vec, unsigned int order, + unsigned int len) { int i; for (i = 0; i < len; i++) { if (likely(pg_vec[i].buffer)) { - kvfree(pg_vec[i].buffer); + if (is_vmalloc_addr(pg_vec[i].buffer)) + vfree(pg_vec[i].buffer); + else + free_pages((unsigned long)pg_vec[i].buffer, + order); pg_vec[i].buffer = NULL; } } kfree(pg_vec); } -static char *alloc_one_pg_vec_page(unsigned long size) +static char *alloc_one_pg_vec_page(unsigned long order) { char *buffer; + gfp_t gfp_flags = GFP_KERNEL | __GFP_COMP | + __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY; - buffer = kvzalloc(size, GFP_KERNEL); + buffer = (char *) __get_free_pages(gfp_flags, order); if (buffer) return buffer; - buffer = kvzalloc(size, GFP_KERNEL | __GFP_RETRY_MAYFAIL); + /* __get_free_pages failed, fall back to vmalloc */ + buffer = vzalloc(array_size((1 << order), PAGE_SIZE)); + if (buffer) + return buffer; + + /* vmalloc failed, lets dig into swap here */ + gfp_flags &= ~__GFP_NORETRY; + buffer = (char *) __get_free_pages(gfp_flags, order); + if (buffer) + return buffer; - return buffer; + /* complete and utter failure */ + return NULL; } -static struct pgv *alloc_pg_vec(struct tpacket_req *req) +static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order) { unsigned int block_nr = req->tp_block_nr; - unsigned long size = req->tp_block_size; struct pgv *pg_vec; int i; @@ -4175,7 +4191,7 @@ static struct pgv *alloc_pg_vec(struct tpacket_req *req) goto out; for (i = 0; i < block_nr; i++) { - pg_vec[i].buffer = alloc_one_pg_vec_page(size); + pg_vec[i].buffer = alloc_one_pg_vec_page(order); if (unlikely(!pg_vec[i].buffer)) goto out_free_pgvec; } @@ -4184,7 +4200,7 @@ static struct pgv *alloc_pg_vec(struct tpacket_req *req) return pg_vec; out_free_pgvec: - free_pg_vec(pg_vec, block_nr); + free_pg_vec(pg_vec, order, block_nr); pg_vec = NULL; goto out; } @@ -4194,9 +4210,9 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u, { struct pgv *pg_vec = NULL; struct packet_sock *po = pkt_sk(sk); + int was_running, order = 0; struct packet_ring_buffer *rb; struct sk_buff_head *rb_queue; - int was_running; __be16 num; int err = -EINVAL; /* Added to avoid minimal code churn */ @@ -4258,7 +4274,8 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u, goto out; err = -ENOMEM; - pg_vec = alloc_pg_vec(req); + order = get_order(req->tp_block_size); + pg_vec = alloc_pg_vec(req, order); if (unlikely(!pg_vec)) goto out; switch (po->tp_version) { @@ -4312,6 +4329,7 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u, rb->frame_size = req->tp_frame_size; spin_unlock_bh(_queue->lock); + swap(rb->pg_vec_order, order); swap(rb->pg_vec_len, req->tp_block_nr); rb->pg_vec_pages = req->tp_block_size/PAGE_SIZE; @@ -4337,7 +4355,7 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u, } if
Kernel warnings from tcp.c
Anyone with experience in tcp.c have an idea what's causing this on our busy database server? Comes from this macro at line 2278: WARN_ON(sock_owned_by_user(sk)); [726780.788201] WARNING: CPU: 15 PID: 52245 at net/ipv4/tcp.c:2278 tcp_close+0x40f/0x430 [726780.794947] Modules linked in: binfmt_misc nf_conntrack_netlink nfnetlink_queue tcp_diag inet_diag isofs ip6table_mangle ip6table_raw ip6table_nat nf_nat_ipv6 iptable_security xt_CT iptable_raw iptable_nat nf_nat_ipv4 nf_nat iptable_mangle xt_pkttype xt_NFLOG nfnetlink_log xt_u32 xt_multiport xt_set xt_conntrack ip_set_hash_netport ip_set_hash_ipport ip_set_hash_net ip_set_hash_ip ip_set nfnetlink nf_conntrack_proto_gre nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables xt_LOG nf_conntrack_tftp nf_conntrack_ftp nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack iptable_filter zfs(PO) zunicode(PO) zavl(PO) icp(PO) sb_edac intel_powerclamp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel crypto_simd glue_helper cryptd cirrus snd_seq zcommon(PO) ttm intel_rapl_perf snd_seq_device [726780.848696] drm_kms_helper znvpair(PO) snd_pcm snd_timer spl(O) drm snd soundcore syscopyarea sysfillrect pcspkr sysimgblt input_leds fb_sys_fops i2c_piix4 ip_tables xfs libcrc32c ata_generic pata_acpi ata_piix xen_blkfront crc32c_intel libata ena(O) serio_raw floppy sunrpc [726780.863916] CPU: 15 PID: 52245 Comm: mysqld Tainted: PW O 4.16.13-1.el7.elrepo.x86_64 #1 [726780.869486] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006 [726780.873756] RIP: 0010:tcp_close+0x40f/0x430 [726780.877049] RSP: 0018:c90063a87dd0 EFLAGS: 00010202 [726780.880913] RAX: 0001 RBX: 8839cbb4b300 RCX: 0001 [726780.885708] RDX: 0041 RSI: 00023540 RDI: 002b [726780.890412] RBP: c90063a87df0 R08: R09: 0101 [726780.895192] R10: 03ff R11: 2057 R12: 8839cbb4b388 [726780.900029] R13: 0009 R14: 8839cbb4b3c8 R15: 883c5ed14900 [726780.904777] FS: 7f6acd467700() GS:883c8b1c() knlGS: [726780.909964] CS: 0010 DS: ES: CR0: 80050033 [726780.914078] CR2: 7f6cd5d430c8 CR3: 003c7e6b6005 CR4: 001606e0 [726780.918837] DR0: DR1: DR2: [726780.923433] DR3: DR6: fffe0ff0 DR7: 0400 [726780.927882] Call Trace: [726780.930410] inet_release+0x42/0x70 [726780.933423] inet6_release+0x30/0x40 [726780.936439] sock_release+0x25/0x80 [726780.939421] sock_close+0x12/0x20 [726780.942342] __fput+0xea/0x220 [726780.945170] fput+0xe/0x10 [726780.947963] task_work_run+0x8c/0xb0 [726780.951052] exit_to_usermode_loop+0x6b/0x95 [726780.954480] do_syscall_64+0x182/0x1b0 [726780.957586] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 [726780.961288] RIP: 0033:0x7fd82b27085d [726780.964349] RSP: 002b:7f6acd466b50 EFLAGS: 0293 ORIG_RAX: 0003 [726780.969280] RAX: RBX: 7f65b2418220 RCX: 7fd82b27085d [726780.973987] RDX: 0003 RSI: 7f6d5e9990c0 RDI: 02d8 [726780.978689] RBP: 7f6acd466c00 R08: 01622848 R09: 01f8 [726780.983373] R10: R11: 0293 R12: [726780.988049] R13: 7f6d5e9990c0 R14: 01d87cc0 R15: 02d8 [726780.992700] Code: ff 48 8b 43 28 31 f6 48 89 df 48 8b 40 10 e8 49 2d 4e 00 48 8b 43 30 48 8b 80 98 01 00 00 65 48 ff 80 90 01 00 00 e9 49 ff ff ff <0f> 0b e9 ab fc ff ff 48 8b 43 28 31 f6 48 89 df 48 8b 40 10 e8 [726781.004156] ---[ end trace 8525f27644ac4631 ]---
Re: [PATCH 0/2] net/sched: Add hardware specific counters to TC actions
On Wed, 29 Aug 2018 11:43:47 +0200, Eelco Chaudron wrote: > On 23 Aug 2018, at 20:14, Jakub Kicinski wrote: > > > On Mon, 20 Aug 2018 16:03:40 +0200, Eelco Chaudron wrote: > >> On 17 Aug 2018, at 13:27, Jakub Kicinski wrote: > >>> On Thu, 16 Aug 2018 14:02:44 +0200, Eelco Chaudron wrote: > On 11 Aug 2018, at 21:06, David Miller wrote: > > > From: Jakub Kicinski > > Date: Thu, 9 Aug 2018 20:26:08 -0700 > > > >> It is not immediately clear why this is needed. The memory and > >> updating two sets of counters won't come for free, so perhaps a > >> stronger justification than troubleshooting is due? :S > >> > >> Netdev has counters for fallback vs forwarded traffic, so you'd > >> know > >> that traffic hits the SW datapath, plus the rules which are in_hw > >> will > >> most likely not match as of today for flower (assuming > >> correctness). > > I strongly believe that these counters are a requirement for a > mixed > software/hardware (flow) based forwarding environment. The global > counters will not help much here as you might have chosen to have > certain traffic forwarded by software. > > These counters are probably the only option you have to figure out > why > forwarding is not as fast as expected, and you want to blame the TC > offload NIC. > >>> > >>> The suggested debugging flow would be: > >>> (1) check the global counter for fallback are incrementing; > >>> (2) find a flow with high stats but no in_hw flag set. > >>> > >>> The in_hw indication should be sufficient in most cases (unless > >>> there > >>> are shared blocks between netdevs of different ASICs...). > >> > >> I guess the aim is to find miss behaving hardware, i.e. having the > >> in_hw > >> flag set, but flows still coming to the kernel. > > > > For misbehaving hardware in_hw will not work indeed. Whether we need > > these extra always-on stats for such use case could be debated :) > > > >> I'm slightly concerned about potential performance impact, would > >> you > >> be able to share some numbers for non-trivial number of flows > >> (100k > >> active?)? > > > > Agreed, features used for diagnostics cannot have a harmful > > penalty > > for fast path performance. > > Fast path performance is not affected as these counters are not > incremented there. They are only incremented by the nic driver when > they > gather their statistics from hardware. > >>> > >>> Not by much, you are adding state to performance-critical > >>> structures, > >>> though, for what is effectively debugging purposes. > >>> > >>> I was mostly talking about the HW offload stat updates (sorry for > >>> not > >>> being clear). > >>> > >>> We can have some hundreds of thousands active offloaded flows, each > >>> of > >>> them can have multiple actions, and stats have to be updated > >>> multiple > >>> times per second and dumped probably around once a second, too. On > >>> a > >>> busy system the stats will get evicted from cache between each > >>> round. > >>> > >>> But I'm speculating let's see if I can get some numbers on it (if > >>> you > >>> could get some too, that would be great!). > >> > >> I’ll try to measure some of this later this week/early next week. > > > > I asked Louis to run some tests while I'm travelling, and he reports > > that my worry about reporting the extra stats was unfounded. Update > > function does not show up in traces at all. It seems under stress > > (generated with stress-ng) the thread dumping the stats in userspace > > (in OvS it would be the revalidator) actually consumes less CPU in > > __gnet_stats_copy_basic (0.4% less for ~2.0% total). > > > > Would this match with your results? I'm not sure why dumping would be > > faster with your change.. > > Tested with OVS and https://github.com/chaudron/ovs_perf using 300K TC > rules installed in HW. > > For __gnet_stats_copy_basic() being faster I have (had) a theory. Now > this function is called twice, and I assumed the first call would cache > memory and the second call would be faster. > > Sampling a lot of perf data, I get an average of 1115ns with the base > kernel and 954ns with the fix applied, so about ~14%. > > Thought I would perf tcf_action_copy_stats() as it is the place updating > the additional counter. But even in this case, I see a better > performance with the patch applied. > > In average 13581ns with the fix, vs base kernel at 1391ns, so about > 2.3%. > > I guess the changes to the tc_action structure got better cache > alignment. Interesting you could reproduce the speed up too! +1 for the guess. Seems like my caution about slowing down SW paths to support HW offload landed on a very unfortunate patch :)
Re: [bpf-next, 01/11] xdp: implement convert_to_xdp_frame for MEM_TYPE_ZERO_COPY
From: Maciej Fijalkowski > From: Björn Töpel > > This commit adds proper MEM_TYPE_ZERO_COPY support for > convert_to_xdp_frame. Converting a MEM_TYPE_ZERO_COPY xdp_buff to an > xdp_frame is done by transforming the MEM_TYPE_ZERO_COPY buffer into a > MEM_TYPE_PAGE_ORDER0 frame. This is costly, and in the future it might > make sense to implement a more sophisticated thread-safe alloc/free > scheme for MEM_TYPE_ZERO_COPY, so that no allocation and copy is > required in the fast-path. > > Signed-off-by: Björn Töpel > --- > include/net/xdp.h | 5 +++-- > net/core/xdp.c| 39 +++ > 2 files changed, 42 insertions(+), 2 deletions(-) > > diff --git a/include/net/xdp.h b/include/net/xdp.h > index 76b95256c266..0d5c6fb4b2e2 100644 > --- a/include/net/xdp.h > +++ b/include/net/xdp.h > @@ -91,6 +91,8 @@ static inline void xdp_scrub_frame(struct xdp_frame *frame) > frame->dev_rx = NULL; > } > > +struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp); > + > /* Convert xdp_buff to xdp_frame */ > static inline > struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp) > @@ -99,9 +101,8 @@ struct xdp_frame *convert_to_xdp_frame(struct xdp_buff > *xdp) > int metasize; > int headroom; > > - /* TODO: implement clone, copy, use "native" MEM_TYPE */ > if (xdp->rxq->mem.type == MEM_TYPE_ZERO_COPY) > - return NULL; > + return xdp_convert_zc_to_xdp_frame(xdp); > > /* Assure headroom is available for storing info */ > headroom = xdp->data - xdp->data_hard_start; > diff --git a/net/core/xdp.c b/net/core/xdp.c > index 89b6785cef2a..be6cb2f0e722 100644 > --- a/net/core/xdp.c > +++ b/net/core/xdp.c > @@ -398,3 +398,42 @@ void xdp_attachment_setup(struct xdp_attachment_info > *info, > info->flags = bpf->flags; > } > EXPORT_SYMBOL_GPL(xdp_attachment_setup); > + > +struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp) > +{ > + unsigned int metasize, headroom, totsize; > + void *addr, *data_to_copy; > + struct xdp_frame *xdpf; > + struct page *page; > + > + /* Clone into a MEM_TYPE_PAGE_ORDER0 xdp_frame. */ > + metasize = xdp_data_meta_unsupported(xdp) ? 0 : > +xdp->data - xdp->data_meta; > + headroom = xdp->data - xdp->data_hard_start; You actually don't use the headroom calculated here; xdpf->headroom is assigned to 0 - was this your intention? If so, the local variable can be removed. > + totsize = xdp->data_end - xdp->data + metasize; > + > + if (sizeof(*xdpf) + totsize > PAGE_SIZE) > + return NULL; > + > + page = dev_alloc_page(); > + if (!page) > + return NULL; > + > + addr = page_to_virt(page); > + xdpf = addr; > + memset(xdpf, 0, sizeof(*xdpf)); > + > + addr += sizeof(*xdpf); > + data_to_copy = metasize ? xdp->data_meta : xdp->data; > + memcpy(addr, data_to_copy, totsize); > + > + xdpf->data = addr + metasize; > + xdpf->len = totsize - metasize; > + xdpf->headroom = 0; > + xdpf->metasize = metasize; > + xdpf->mem.type = MEM_TYPE_PAGE_ORDER0; > + > + xdp_return_buff(xdp); > + return xdpf; > +} > +EXPORT_SYMBOL_GPL(xdp_convert_zc_to_xdp_frame);
Re: [PATCH 0/2] net/sched: Add hardware specific counters to TC actions
On Wed, 29 Aug 2018 12:23:15 +0200, Paolo Abeni wrote: > On Thu, 2018-08-23 at 20:14 +0200, Jakub Kicinski wrote: > > I asked Louis to run some tests while I'm travelling, and he reports > > that my worry about reporting the extra stats was unfounded. Update > > function does not show up in traces at all. It seems under stress > > (generated with stress-ng) the thread dumping the stats in userspace > > (in OvS it would be the revalidator) actually consumes less CPU in > > __gnet_stats_copy_basic (0.4% less for ~2.0% total). > > > > Would this match with your results? I'm not sure why dumping would be > > faster with your change.. > > Wild guess on my side: the relevant patch changes a bit the binary > layout of the 'tc_action' struct, possibly (I still need to check with > pahole) moving the tcf_lock and the stats field on different > cachelines, reducing false sharing that could affect badly such test. I think in our tests we tried with and without pinning relevant processing to one core, and both results shown improvement. I don't have the actual samples any more, just perf script dump without CPU IDs to confirm things were pinned correctly.. :(
Re: [PATCH bpf 0/3] Three fixes for bpf_msg_pull_data
On Wed, Aug 29, 2018 at 04:50:33PM +0200, Daniel Borkmann wrote: > This set contains three more fixes for the bpf_msg_pull_data() > mainly for correcting scatterlist ring wrap-arounds as well as > fixing up data pointers. For details please see individual patches. > Thanks! Applied to bpf tree, Thanks
[Patch net-nnext] Revert "net: sched: act: add extack for lookup callback"
This reverts commit 331a9295de23 ("net: sched: act: add extack for lookup callback"). This extack is never used after 6 months... In fact, it can be just set in the caller, right after ->lookup(). Cc: Alexander Aring Signed-off-by: Cong Wang --- include/net/act_api.h | 3 +-- net/sched/act_api.c| 6 -- net/sched/act_bpf.c| 3 +-- net/sched/act_connmark.c | 3 +-- net/sched/act_csum.c | 3 +-- net/sched/act_gact.c | 3 +-- net/sched/act_ife.c| 3 +-- net/sched/act_ipt.c| 6 ++ net/sched/act_mirred.c | 3 +-- net/sched/act_nat.c| 3 +-- net/sched/act_pedit.c | 3 +-- net/sched/act_police.c | 3 +-- net/sched/act_sample.c | 3 +-- net/sched/act_simple.c | 3 +-- net/sched/act_skbedit.c| 3 +-- net/sched/act_skbmod.c | 3 +-- net/sched/act_tunnel_key.c | 3 +-- net/sched/act_vlan.c | 3 +-- 18 files changed, 22 insertions(+), 38 deletions(-) diff --git a/include/net/act_api.h b/include/net/act_api.h index 970303448c90..c6f195b3c706 100644 --- a/include/net/act_api.h +++ b/include/net/act_api.h @@ -85,8 +85,7 @@ struct tc_action_ops { struct tcf_result *); /* called under RCU BH lock*/ int (*dump)(struct sk_buff *, struct tc_action *, int, int); void(*cleanup)(struct tc_action *); - int (*lookup)(struct net *net, struct tc_action **a, u32 index, - struct netlink_ext_ack *extack); + int (*lookup)(struct net *net, struct tc_action **a, u32 index); int (*init)(struct net *net, struct nlattr *nla, struct nlattr *est, struct tc_action **act, int ovr, int bind, bool rtnl_held, diff --git a/net/sched/act_api.c b/net/sched/act_api.c index db83dac1e7f4..398c752ff529 100644 --- a/net/sched/act_api.c +++ b/net/sched/act_api.c @@ -1067,12 +1067,14 @@ static struct tc_action *tcf_action_get_1(struct net *net, struct nlattr *nla, err = -EINVAL; ops = tc_lookup_action(tb[TCA_ACT_KIND]); if (!ops) { /* could happen in batch of actions */ - NL_SET_ERR_MSG(extack, "Specified TC action not found"); + NL_SET_ERR_MSG(extack, "Specified TC action kind not found"); goto err_out; } err = -ENOENT; - if (ops->lookup(net, , index, extack) == 0) + if (ops->lookup(net, , index) == 0) { + NL_SET_ERR_MSG(extack, "TC action with specified index not found"); goto err_mod; + } module_put(ops->owner); return a; diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c index 0c68bc9cf0b4..c7633843e223 100644 --- a/net/sched/act_bpf.c +++ b/net/sched/act_bpf.c @@ -387,8 +387,7 @@ static int tcf_bpf_walker(struct net *net, struct sk_buff *skb, return tcf_generic_walker(tn, skb, cb, type, ops, extack); } -static int tcf_bpf_search(struct net *net, struct tc_action **a, u32 index, - struct netlink_ext_ack *extack) +static int tcf_bpf_search(struct net *net, struct tc_action **a, u32 index) { struct tc_action_net *tn = net_generic(net, bpf_net_id); diff --git a/net/sched/act_connmark.c b/net/sched/act_connmark.c index 6f0f273f1139..e869c0ee63c8 100644 --- a/net/sched/act_connmark.c +++ b/net/sched/act_connmark.c @@ -190,8 +190,7 @@ static int tcf_connmark_walker(struct net *net, struct sk_buff *skb, return tcf_generic_walker(tn, skb, cb, type, ops, extack); } -static int tcf_connmark_search(struct net *net, struct tc_action **a, u32 index, - struct netlink_ext_ack *extack) +static int tcf_connmark_search(struct net *net, struct tc_action **a, u32 index) { struct tc_action_net *tn = net_generic(net, connmark_net_id); diff --git a/net/sched/act_csum.c b/net/sched/act_csum.c index b8a67ae3105a..3dc25b7806d7 100644 --- a/net/sched/act_csum.c +++ b/net/sched/act_csum.c @@ -646,8 +646,7 @@ static int tcf_csum_walker(struct net *net, struct sk_buff *skb, return tcf_generic_walker(tn, skb, cb, type, ops, extack); } -static int tcf_csum_search(struct net *net, struct tc_action **a, u32 index, - struct netlink_ext_ack *extack) +static int tcf_csum_search(struct net *net, struct tc_action **a, u32 index) { struct tc_action_net *tn = net_generic(net, csum_net_id); diff --git a/net/sched/act_gact.c b/net/sched/act_gact.c index cd1d9bd32ef9..aa44d14b43c7 100644 --- a/net/sched/act_gact.c +++ b/net/sched/act_gact.c @@ -222,8 +222,7 @@ static int tcf_gact_walker(struct net *net, struct sk_buff *skb, return tcf_generic_walker(tn, skb, cb, type, ops, extack); } -static int tcf_gact_search(struct net *net, struct tc_action **a, u32 index, - struct netlink_ext_ack *extack) +static int tcf_gact_search(struct net *net, struct tc_action **a, u32 index) { struct tc_action_net *tn =
[Patch net-nnext] net_sched: add missing tcf_lock for act_connmark
According to the new locking rule, we have to take tcf_lock for both ->init() and ->dump(), as RTNL will be removed. However, it is missing for act_connmark. Cc: Vlad Buslov Signed-off-by: Cong Wang --- net/sched/act_connmark.c | 11 --- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/net/sched/act_connmark.c b/net/sched/act_connmark.c index e869c0ee63c8..8475913f2070 100644 --- a/net/sched/act_connmark.c +++ b/net/sched/act_connmark.c @@ -143,8 +143,10 @@ static int tcf_connmark_init(struct net *net, struct nlattr *nla, return -EEXIST; } /* replacing action and zone */ + spin_lock_bh(>tcf_lock); ci->tcf_action = parm->action; ci->zone = parm->zone; + spin_unlock_bh(>tcf_lock); ret = 0; } @@ -156,16 +158,16 @@ static inline int tcf_connmark_dump(struct sk_buff *skb, struct tc_action *a, { unsigned char *b = skb_tail_pointer(skb); struct tcf_connmark_info *ci = to_connmark(a); - struct tc_connmark opt = { .index = ci->tcf_index, .refcnt = refcount_read(>tcf_refcnt) - ref, .bindcnt = atomic_read(>tcf_bindcnt) - bind, - .action = ci->tcf_action, - .zone = ci->zone, }; struct tcf_t t; + spin_lock_bh(>tcf_lock); + opt.action = ci->tcf_action; + opt.zone = ci->zone; if (nla_put(skb, TCA_CONNMARK_PARMS, sizeof(opt), )) goto nla_put_failure; @@ -173,9 +175,12 @@ static inline int tcf_connmark_dump(struct sk_buff *skb, struct tc_action *a, if (nla_put_64bit(skb, TCA_CONNMARK_TM, sizeof(t), , TCA_CONNMARK_PAD)) goto nla_put_failure; + spin_unlock_bh(>tcf_lock); return skb->len; + nla_put_failure: + spin_unlock_bh(>tcf_lock); nlmsg_trim(skb, b); return -1; } -- 2.14.4
Re: [PATCH 2/4] r8169: Get and enable optional ether_clk clock
Hi, On 27-08-18 21:14, Stephen Boyd wrote: Quoting Hans de Goede (2018-08-27 11:53:19) On 27-08-18 20:47, Stephen Boyd wrote: How would you know that a clk device driver hasn't probed yet and isn't the driver that's actually providing the clk to this device on x86 systems? With DT systems we can figure that out by looking at the DT and seeing if the device driver requesting the clk has the clocks property. On x86 systems it's all clkdev which doesn't really lend itself to solving this problem. Right on x86 the assumption is that the clk driver will be builtin and will probe before the consumer. In this case that is true as the pmc-atom-clk driver can only be builtin and its platform device is instantiated from the acpi_lpss code and acpi init happens before the PCI bus is scanned. If we can go with this assumption then we can make the optional clk API work even on clkdev based systems. Maybe if x86 had some way of indicating that all builtin clks are registered? Unfortunately there is no such thing I'm afraid. That might work but it's not very clean. Or if we could check to see if we're running on an ACPI based system in clkdev we could use that to assume that clk_get() will only be called after all providers have registered their lookups. Yes some check for x86 + ACPI (ARM also uses ACPI, but there we should no do this AFAICT) is probably best. That or not use the new optional clk API on x86, but that means that any cross platform driver cannot use it, which would be a pain. BTW does your Acked-by indicate you are ok with merging this series through the netdev tree as I suggested in the cover-letter? If so can I also add your Acked-by to the 3th patch ? Regards, Hans
[Patch iproute2 v2] ss: add UNIX_DIAG_VFS and UNIX_DIAG_ICONS for unix sockets
UNIX_DIAG_VFS and UNIX_DIAG_ICONS are never used by ss, make them available in ss -e output. Cc: Stephen Hemminger Signed-off-by: Cong Wang --- misc/ss.c | 18 ++ 1 file changed, 18 insertions(+) diff --git a/misc/ss.c b/misc/ss.c index 41e7762b..b2c634c8 100644 --- a/misc/ss.c +++ b/misc/ss.c @@ -16,6 +16,7 @@ #include #include #include +#include #include #include #include @@ -3604,6 +3605,21 @@ static int unix_show_sock(const struct sockaddr_nl *addr, struct nlmsghdr *nlh, out(" %c-%c", mask & 1 ? '-' : '<', mask & 2 ? '-' : '>'); } + if (tb[UNIX_DIAG_VFS]) { + struct unix_diag_vfs *uv = RTA_DATA(tb[UNIX_DIAG_VFS]); + + out(" ino:%u dev:%u/%u", uv->udiag_vfs_ino, major(uv->udiag_vfs_dev), +minor(uv->udiag_vfs_dev)); + } + if (tb[UNIX_DIAG_ICONS]) { + int len = RTA_PAYLOAD(tb[UNIX_DIAG_ICONS]); + __u32 *peers = RTA_DATA(tb[UNIX_DIAG_ICONS]); + int i; + + out(" peers:"); + for (i = 0; i < len / sizeof(__u32); i++) + out(" %u", peers[i]); + } } return 0; @@ -3641,6 +3657,8 @@ static int unix_show_netlink(struct filter *f) req.r.udiag_show = UDIAG_SHOW_NAME | UDIAG_SHOW_PEER | UDIAG_SHOW_RQLEN; if (show_mem) req.r.udiag_show |= UDIAG_SHOW_MEMINFO; + if (show_details) + req.r.udiag_show |= UDIAG_SHOW_VFS | UDIAG_SHOW_ICONS; return handle_netlink_request(f, , sizeof(req), unix_show_sock); } -- 2.14.4
Re: [PATCH 0/4] clk-pmc-atom + r8169: Add ether_clk handling to fix suspend issues
Hi, On 29-08-18 18:31, Andy Shevchenko wrote: On Mon, Aug 27, 2018 at 04:31:56PM +0200, Hans de Goede wrote: Hi All, This series has as goal to revert commit d31fd43c0f9a ("clk: x86: Do not gate clocks enabled by the firmware"), because that commit causes almost all Cherry Trail devices to not use the S0i3 powerstate when suspending, leading to them quickly draining their battery when suspended. This commit was added to fix issues with the r8169 NIC on some Bay Trail devices, where the pmc_plt_clk_4 was used as clk for the r8169 chip. This series fixes this properly, so that we can undo the trouble some commit. The first patch adds an "ether_clk" alias to the clk-pmc-atom driver so that the r8169 can pass "ether_clk" as generic id to clk_get instead of having to add x86 specific code to the r8169 driver. The second patch makes the r8169 driver do a clk_get for "ether_clk" (ignoring -ENOENT errors so this is optional) and if that succeeds then it enables the clock. The third patch effectively revert the troublesome commit. This series has been tested on a HP Stream x360 - 11-p000nd, which uses a Bay Trail SoC with a r8169 ethernet chip and I can confirm that the pmc_plt_clk_4 gets properly enabled, so this series should not cause any regressions wrt devices needing pmc_plt_clk_4 enabled (which is not the case on the Stream x360). To avoid regressions we need to have patches 1 and 2 merged before 3, so it is probably best if this entire series gets merged through a single tree. Given that clk-pmc-atom.c has not seen any changes for over a year I suggest that all 3 patches are merged through the netdev tree, with an ack from the clk maintainers. Assuming that is ok with the clk maintainers of course. Note there is a fourth patch in this series, this patch is necessary to reach S0i3 state on Bay Trail devices wich have a r8169 ethernet chip, since I do not have access to hardware where the r8169 actually needs the pmc_plt_clk_4 I have not been able to test this, hence it is marked as RFC for now. What a nice stuff, thanks! Reviewed-by: Andy Shevchenko Thank you (and also thank you for the other reviews) Btw, you probably better to refer to https://bugzilla.kernel.org/show_bug.cgi?id=196861 which is dedicated for S0ix issue. Ok, I've added a link to that. I've also added: Reported-by: Johannes Stezenbach To honor Johannes for figuring out that leaving the clocks enabled was a problem in the first place. This will all be included in v2 of the series when I send it out. Another thing, clk_prepare_enable() can fail, I dunno what you can do at ->resume() stage with it failed. Right, I know that it can fail, but in practice it never fails unless things are seriously foo-barred already and there is not much we can do when it fails, so I decided to just ignore the return code. Regards, Hans
Re: [pull request][net-next 00/10] Mellanox, mlx5 and devlink updates 2018-07-31
On Wed, Aug 29, 2018 at 8:43 AM Alex Vesker wrote: > > > > On Wed, Aug 1, 2018 at 4:13 PM, Saeed Mahameed > > wrote: > >> On Wed, Aug 1, 2018 at 3:34 PM, Alexander Duyck > >> wrote: > >>> On Wed, Aug 1, 2018 at 2:52 PM, Saeed Mahameed > >>> wrote: > Hi Dave, > > This series provides devlink parameters updates to both devlink API > and > mlx5 driver, it is a 2nd iteration of the dropped patches sent in a > previous > mlx5 submission "net/mlx5: Support PCIe buffer congestion handling via > Devlink" to address review comments [1]. > > Changes from the original series: > - According to the discussion outcome, we are keeping the > congestion control > setting as mlx5 device specific for the current HW generation. > - Changed the congestion_mode and congestion action param type to > string > - Added patches to fix devlink handling of param type string > - Added a patch which adds extack messages support for param set. > - At the end of this series, I've added yet another mlx5 devlink > related > feature, firmware snapshot support. > > For more information please see tag log below. > > Please pull and let me know if there's any problem. > > [1] https://patchwork.ozlabs.org/patch/945996/ > > Thanks, > Saeed. > > --- > > The following changes since commit > e6476c21447c4b17c47e476aade6facf050f31e8: > > net: remove bogus RCU annotations on socket.wq (2018-07-31 > 12:40:22 -0700) > > are available in the Git repository at: > > git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git > tags/mlx5-updates-2018-08-01 > > for you to fetch changes up to > 2ac6108c65ffcb1e5eab1fba1fd59272604d1c32: > > net/mlx5: Use devlink region_snapshot parameter (2018-08-01 > 14:49:09 -0700) > > > mlx5-updates-2018-08-01 > > This series provides devlink parameters updates to both devlink API > and > mlx5 driver, > > 1) Devlink changes: (Moshe Shemesh) > The first two patches fix devlink param infrastructure for string type > params. > The third patch adds a devlink helper function to safely copy > string from > driver to devlink. > The forth patch adds extack support for param set. > > 2) mlx5 specific congestion parameters: (Eran Ben Elisha) > Next three patches add new devlink driver specific params for > controlling > congestion action and mode, using string type params and extack > messages support. > > This congestion mode enables hw workaround in specific devices > which is > controlled by devlink driver-specific params. The workaround is device > specific for this NIC generation, the next NIC will not need it. > > Congestion parameters: > - Congestion action > HW W/A mechanism in the PCIe buffer which monitors the > amount of > consumed PCIe buffer per host. This mechanism supports > the > following actions in case of threshold overflow: > - Disabled - NOP (Default) > - Drop > - Mark - Mark CE bit in the CQE of received packet > - Congestion mode > - Aggressive - Aggressive static trigger threshold > (Default) > - Dynamic - Dynamically change the trigger threshold > > 3) mlx5 firmware snapshot support via devlink: (Alex Vesker) > Last three patches, add the support for capturing region snapshot > of the > firmware crspace during critical errors, using devlink region_snapshot > parameter. > > -Saeed. > > > Alex Vesker (3): > net/mlx5: Add Vendor Specific Capability access gateway > net/mlx5: Add Crdump FW snapshot support > net/mlx5: Use devlink region_snapshot parameter > > Eran Ben Elisha (3): > net/mlx5: Move all devlink related functions calls to devlink.c > net/mlx5: Add MPEGC register configuration functionality > net/mlx5: Enable PCIe buffer congestion handling workaround > via devlink > > Moshe Shemesh (4): > devlink: Fix param set handling for string type > devlink: Fix param cmode driverinit for string type > devlink: Add helper function for safely copy string param > devlink: Add extack messages support to param set > > drivers/net/ethernet/broadcom/bnxt/bnxt_devlink.c | 3 +- > drivers/net/ethernet/mellanox/mlx4/main.c | 6 +- > drivers/net/ethernet/mellanox/mlx5/core/Makefile | 3 +- >
Re: [PATCH 0/4] clk-pmc-atom + r8169: Add ether_clk handling to fix suspend issues
On Mon, Aug 27, 2018 at 04:31:56PM +0200, Hans de Goede wrote: > Hi All, > > This series has as goal to revert commit d31fd43c0f9a ("clk: x86: Do not gate > clocks enabled by the firmware"), because that commit causes almost all > Cherry Trail devices to not use the S0i3 powerstate when suspending, leading > to them quickly draining their battery when suspended. > > This commit was added to fix issues with the r8169 NIC on some Bay Trail > devices, where the pmc_plt_clk_4 was used as clk for the r8169 chip. > > This series fixes this properly, so that we can undo the trouble some > commit. > > The first patch adds an "ether_clk" alias to the clk-pmc-atom driver so > that the r8169 can pass "ether_clk" as generic id to clk_get instead of > having to add x86 specific code to the r8169 driver. > > The second patch makes the r8169 driver do a clk_get for "ether_clk" > (ignoring -ENOENT errors so this is optional) and if that succeeds then > it enables the clock. > > The third patch effectively revert the troublesome commit. > > This series has been tested on a HP Stream x360 - 11-p000nd, which uses > a Bay Trail SoC with a r8169 ethernet chip and I can confirm that the > pmc_plt_clk_4 gets properly enabled, so this series should not cause any > regressions wrt devices needing pmc_plt_clk_4 enabled (which is not the > case on the Stream x360). > > To avoid regressions we need to have patches 1 and 2 merged before 3, > so it is probably best if this entire series gets merged through a single > tree. Given that clk-pmc-atom.c has not seen any changes for over a > year I suggest that all 3 patches are merged through the netdev tree, > with an ack from the clk maintainers. Assuming that is ok with the clk > maintainers of course. > > Note there is a fourth patch in this series, this patch is necessary to > reach S0i3 state on Bay Trail devices wich have a r8169 ethernet chip, > since I do not have access to hardware where the r8169 actually needs > the pmc_plt_clk_4 I have not been able to test this, hence it is marked > as RFC for now. > What a nice stuff, thanks! Reviewed-by: Andy Shevchenko Btw, you probably better to refer to https://bugzilla.kernel.org/show_bug.cgi?id=196861 which is dedicated for S0ix issue. Another thing, clk_prepare_enable() can fail, I dunno what you can do at ->resume() stage with it failed. > Carlos can you test the 4th patch (when you have time) and let us know if > it works? > > Regards, > > Hans > -- With Best Regards, Andy Shevchenko
[PATCH net-next 2/2] hv_netvsc: pair VF based on serial number
Matching network device based on MAC address is problematic since a non-VF network device can be created with a duplicate MAC address causing confusion and problems. The VMBus API provides a serial number that is a better matching method. Signed-off-by: Stephen Hemminger --- drivers/net/hyperv/netvsc.c | 3 ++ drivers/net/hyperv/netvsc_drv.c | 58 +++-- 2 files changed, 36 insertions(+), 25 deletions(-) diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index 31c3d77b4733..fe01e141c8f8 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -1203,6 +1203,9 @@ static void netvsc_send_vf(struct net_device *ndev, net_device_ctx->vf_alloc = nvmsg->msg.v4_msg.vf_assoc.allocated; net_device_ctx->vf_serial = nvmsg->msg.v4_msg.vf_assoc.serial; + netdev_info(ndev, "VF slot %u %s\n", + net_device_ctx->vf_serial, + net_device_ctx->vf_alloc ? "added" : "removed"); } static void netvsc_receive_inband(struct net_device *ndev, diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c index 1121a1ec407c..9dedc1463e88 100644 --- a/drivers/net/hyperv/netvsc_drv.c +++ b/drivers/net/hyperv/netvsc_drv.c @@ -1894,20 +1894,6 @@ static void netvsc_link_change(struct work_struct *w) rtnl_unlock(); } -static struct net_device *get_netvsc_bymac(const u8 *mac) -{ - struct net_device_context *ndev_ctx; - - list_for_each_entry(ndev_ctx, _dev_list, list) { - struct net_device *dev = hv_get_drvdata(ndev_ctx->device_ctx); - - if (ether_addr_equal(mac, dev->perm_addr)) - return dev; - } - - return NULL; -} - static struct net_device *get_netvsc_byref(struct net_device *vf_netdev) { struct net_device_context *net_device_ctx; @@ -2036,26 +2022,48 @@ static void netvsc_vf_setup(struct work_struct *w) rtnl_unlock(); } +/* Find netvsc by VMBus serial number. + * The PCI hyperv controller records the serial number as the slot. + */ +static struct net_device *get_netvsc_byslot(const struct net_device *vf_netdev) +{ + struct device *parent = vf_netdev->dev.parent; + struct net_device_context *ndev_ctx; + struct pci_dev *pdev; + + if (!parent || !dev_is_pci(parent)) + return NULL; /* not a PCI device */ + + pdev = to_pci_dev(parent); + if (!pdev->slot) { + netdev_notice(vf_netdev, "no PCI slot information\n"); + return NULL; + } + + list_for_each_entry(ndev_ctx, _dev_list, list) { + if (!ndev_ctx->vf_alloc) + continue; + + if (ndev_ctx->vf_serial == pdev->slot->number) + return hv_get_drvdata(ndev_ctx->device_ctx); + } + + netdev_notice(vf_netdev, + "no netdev found for slot %u\n", pdev->slot->number); + return NULL; +} + static int netvsc_register_vf(struct net_device *vf_netdev) { - struct net_device *ndev; struct net_device_context *net_device_ctx; - struct device *pdev = vf_netdev->dev.parent; struct netvsc_device *netvsc_dev; + struct net_device *ndev; int ret; if (vf_netdev->addr_len != ETH_ALEN) return NOTIFY_DONE; - if (!pdev || !dev_is_pci(pdev) || dev_is_pf(pdev)) - return NOTIFY_DONE; - - /* -* We will use the MAC address to locate the synthetic interface to -* associate with the VF interface. If we don't find a matching -* synthetic interface, move on. -*/ - ndev = get_netvsc_bymac(vf_netdev->perm_addr); + ndev = get_netvsc_byslot(vf_netdev); if (!ndev) return NOTIFY_DONE; -- 2.18.0
[PATCH net-next 0/2] hv_netvsc: associate VF and PV device by serial number
The Hyper-V implementation of PCI controller has concept of 32 bit serial number (not to be confused with PCI-E serial number). This value is sent in the protocol from the host to indicate SR-IOV VF device is attached to a synthetic NIC. Using the serial number (instead of MAC address) to associate the two devices avoids lots of potential problems when there are duplicate MAC addresses from tunnels or layered devices. The patch set is broken into two parts, one is for the PCI controller and the other is for the netvsc device. Normally, these go through different trees but sending them together here for better review. The PCI changes were submitted previously, but the main review comment was "why do you need this?". This is why. Stephen Hemminger (2): PCI: hv: support reporting serial number as slot information hv_netvsc: pair VF based on serial number drivers/net/hyperv/netvsc.c | 3 ++ drivers/net/hyperv/netvsc_drv.c | 58 - drivers/pci/controller/pci-hyperv.c | 30 +++ 3 files changed, 66 insertions(+), 25 deletions(-) -- 2.18.0
[PATCH net-next 1/2] PCI: hv: support reporting serial number as slot information
The Hyper-V host API for PCI provides a unique "serial number" which can be used as basis for sysfs PCI slot table. This can be useful for cases where userspace wants to find the PCI device based on serial number. When an SR-IOV NIC is added, the host sends an attach message with serial number. The kernel doesn't use the serial number, but it is useful when doing the same thing in a userspace driver such as the DPDK. By having /sys/bus/pci/slots/N it provides a direct way to find the matching PCI device. There may be some cases where serial number is not unique such as when using GPU's. But the PCI slot infrastructure will handle that by adding suffix "2-1" etc. This has a side effect which may also be useful. The common udev network device naming policy uses the slot information (rather than PCI address). This causes udev to give shorter network device names for VF devices. It does not break applications or startup because the VF device must never be configured directly. Signed-off-by: Stephen Hemminger --- drivers/pci/controller/pci-hyperv.c | 30 + 1 file changed, 30 insertions(+) diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c index c00f82cc54aa..e6a6c1146a41 100644 --- a/drivers/pci/controller/pci-hyperv.c +++ b/drivers/pci/controller/pci-hyperv.c @@ -89,6 +89,8 @@ static enum pci_protocol_version_t pci_protocol_version; #define STATUS_REVISION_MISMATCH 0xC059 +#define SLOT_NAME_SIZE 21 + /* * Message Types */ @@ -494,6 +496,7 @@ struct hv_pci_dev { struct list_head list_entry; refcount_t refs; enum hv_pcichild_state state; + struct pci_slot *pci_slot; struct pci_function_description desc; bool reported_missing; struct hv_pcibus_device *hbus; @@ -1457,6 +1460,28 @@ static void prepopulate_bars(struct hv_pcibus_device *hbus) spin_unlock_irqrestore(>device_list_lock, flags); } +static void hv_pci_assign_slots(struct hv_pcibus_device *hbus) +{ + struct hv_pci_dev *hpdev; + char name[SLOT_NAME_SIZE]; + unsigned long flags; + int slot_nr; + + spin_lock_irqsave(>device_list_lock, flags); + list_for_each_entry(hpdev, >children, list_entry) { + if (hpdev->pci_slot) + continue; + + slot_nr = PCI_SLOT(wslot_to_devfn(hpdev->desc.win_slot.slot)); + snprintf(name, SLOT_NAME_SIZE, "%u", hpdev->desc.ser); + hpdev->pci_slot = pci_create_slot(hbus->pci_bus, slot_nr, + name, NULL); + if (!hpdev->pci_slot) + pr_warn("pci_create slot %s failed\n", name); + } + spin_unlock_irqrestore(>device_list_lock, flags); +} + /** * create_root_hv_pci_bus() - Expose a new root PCI bus * @hbus: Root PCI bus, as understood by this driver @@ -1480,6 +1505,7 @@ static int create_root_hv_pci_bus(struct hv_pcibus_device *hbus) pci_lock_rescan_remove(); pci_scan_child_bus(hbus->pci_bus); pci_bus_assign_resources(hbus->pci_bus); + hv_pci_assign_slots(hbus); pci_bus_add_devices(hbus->pci_bus); pci_unlock_rescan_remove(); hbus->state = hv_pcibus_installed; @@ -1742,6 +1768,7 @@ static void pci_devices_present_work(struct work_struct *work) */ pci_lock_rescan_remove(); pci_scan_child_bus(hbus->pci_bus); + hv_pci_assign_slots(hbus); pci_unlock_rescan_remove(); break; @@ -1858,6 +1885,9 @@ static void hv_eject_device_work(struct work_struct *work) list_del(>list_entry); spin_unlock_irqrestore(>hbus->device_list_lock, flags); + if (hpdev->pci_slot) + pci_destroy_slot(hpdev->pci_slot); + memset(, 0, sizeof(ctxt)); ejct_pkt = (struct pci_eject_response *) ejct_pkt->message_type.type = PCI_EJECTION_COMPLETE; -- 2.18.0
Re: [PATCH bpf-next 00/11] AF_XDP zero-copy support for i40e
On 08/28/2018 02:44 PM, Björn Töpel wrote: > From: Björn Töpel > > This patch set introduces zero-copy AF_XDP support for Intel's i40e > driver. In the first preparatory patch we also add support for > XDP_REDIRECT for zero-copy allocated frames so that XDP programs can > redirect them. This was a ToDo from the first AF_XDP zero-copy patch > set from early June. Special thanks to Alex Duyck and Jesper Dangaard > Brouer for reviewing earlier versions of this patch set. > > The i40e zero-copy code is located in its own file i40e_xsk.[ch]. Note > that in the interest of time, to get an AF_XDP zero-copy implementation > out there for people to try, some code paths have been copied from the > XDP path to the zero-copy path. It is out goal to merge the two paths > in later patch sets. > > In contrast to the implementation from beginning of June, this patch > set does not require any extra HW queues for AF_XDP zero-copy > TX. Instead, the XDP TX HW queue is used for both XDP_REDIRECT and > AF_XDP zero-copy TX. > > Jeff, given that most of changes are in i40e, it is up to you how you > would like to route these patches. The set is tagged bpf-next, but > if taking it via the Intel driver tree is easier, let us know. > > We have run some benchmarks on a dual socket system with two Broadwell > E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14 > cores which gives a total of 28, but only two cores are used in these > experiments. One for TR/RX and one for the user space application. The > memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is > 8192MB and with 8 of those DIMMs in the system we have 64 GB of total > memory. The compiler used is gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0. The > NIC is Intel I40E 40Gbit/s using the i40e driver. > > Below are the results in Mpps of the I40E NIC benchmark runs for 64 > and 1500 byte packets, generated by a commercial packet generator HW > outputing packets at full 40 Gbit/s line rate. The results are with > retpoline and all other spectre and meltdown fixes, so these results > are not comparable to the ones from the zero-copy patch set in June. > > AF_XDP performance 64 byte packets. > Benchmark XDP_SKBXDP_DRVXDP_DRV with zerocopy > rxdrop 2.68.2 15.0 > txpush 2.2- 21.9 > l2fwd1.72.3 11.3 > > AF_XDP performance 1500 byte packets: > Benchmark XDP_SKB XDP_DRV XDP_DRV with zerocopy > rxdrop 2.03.3 3.3 > l2fwd1.31.7 3.1 > > XDP performance on our system as a base line: > > 64 byte packets: > XDP stats CPU pps issue-pps > XDP-RX CPU 16 18.4M 0 > > 1500 byte packets: > XDP stats CPU pps issue-pps > XDP-RX CPU 16 3.3M0 > > The structure of the patch set is as follows: > > Patch 1: Add support for XDP_REDIRECT of zero-copy allocated frames > Patches 2-4: Preparatory patches to common xsk and net code > Patches 5-7: Preparatory patches to i40e driver code for RX > Patch 8: i40e zero-copy support for RX > Patch 9: Preparatory patch to i40e driver code for TX > Patch 10: i40e zero-copy support for TX > Patch 11: Add flags to sample application to force zero-copy/copy mode > > We based this patch set on bpf-next commit 050cdc6c9501 ("Merge > git://git.kernel.org/pub/scm/linux/kernel/git/davem/net") > > > Magnus & Björn > > Björn Töpel (8): > xdp: implement convert_to_xdp_frame for MEM_TYPE_ZERO_COPY > xdp: export xdp_rxq_info_unreg_mem_model > xsk: expose xdp_umem_get_{data,dma} to drivers > i40e: added queue pair disable/enable functions > i40e: refactor Rx path for re-use > i40e: move common Rx functions to i40e_txrx_common.h > i40e: add AF_XDP zero-copy Rx support > samples/bpf: add -c/--copy -z/--zero-copy flags to xdpsock > > Magnus Karlsson (3): > net: add napi_if_scheduled_mark_missed > i40e: move common Tx functions to i40e_txrx_common.h > i40e: add AF_XDP zero-copy Tx support Thanks for working on this, LGTM! Are you also planning to get ixgbe out after that? For the series: Acked-by: Daniel Borkmann Thanks, Daniel
Re: [pull request][net-next 00/10] Mellanox, mlx5 and devlink updates 2018-07-31
On Wed, Aug 1, 2018 at 4:13 PM, Saeed Mahameed wrote: On Wed, Aug 1, 2018 at 3:34 PM, Alexander Duyck wrote: On Wed, Aug 1, 2018 at 2:52 PM, Saeed Mahameed wrote: Hi Dave, This series provides devlink parameters updates to both devlink API and mlx5 driver, it is a 2nd iteration of the dropped patches sent in a previous mlx5 submission "net/mlx5: Support PCIe buffer congestion handling via Devlink" to address review comments [1]. Changes from the original series: - According to the discussion outcome, we are keeping the congestion control setting as mlx5 device specific for the current HW generation. - Changed the congestion_mode and congestion action param type to string - Added patches to fix devlink handling of param type string - Added a patch which adds extack messages support for param set. - At the end of this series, I've added yet another mlx5 devlink related feature, firmware snapshot support. For more information please see tag log below. Please pull and let me know if there's any problem. [1] https://patchwork.ozlabs.org/patch/945996/ Thanks, Saeed. --- The following changes since commit e6476c21447c4b17c47e476aade6facf050f31e8: net: remove bogus RCU annotations on socket.wq (2018-07-31 12:40:22 -0700) are available in the Git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git tags/mlx5-updates-2018-08-01 for you to fetch changes up to 2ac6108c65ffcb1e5eab1fba1fd59272604d1c32: net/mlx5: Use devlink region_snapshot parameter (2018-08-01 14:49:09 -0700) mlx5-updates-2018-08-01 This series provides devlink parameters updates to both devlink API and mlx5 driver, 1) Devlink changes: (Moshe Shemesh) The first two patches fix devlink param infrastructure for string type params. The third patch adds a devlink helper function to safely copy string from driver to devlink. The forth patch adds extack support for param set. 2) mlx5 specific congestion parameters: (Eran Ben Elisha) Next three patches add new devlink driver specific params for controlling congestion action and mode, using string type params and extack messages support. This congestion mode enables hw workaround in specific devices which is controlled by devlink driver-specific params. The workaround is device specific for this NIC generation, the next NIC will not need it. Congestion parameters: - Congestion action HW W/A mechanism in the PCIe buffer which monitors the amount of consumed PCIe buffer per host. This mechanism supports the following actions in case of threshold overflow: - Disabled - NOP (Default) - Drop - Mark - Mark CE bit in the CQE of received packet - Congestion mode - Aggressive - Aggressive static trigger threshold (Default) - Dynamic - Dynamically change the trigger threshold 3) mlx5 firmware snapshot support via devlink: (Alex Vesker) Last three patches, add the support for capturing region snapshot of the firmware crspace during critical errors, using devlink region_snapshot parameter. -Saeed. Alex Vesker (3): net/mlx5: Add Vendor Specific Capability access gateway net/mlx5: Add Crdump FW snapshot support net/mlx5: Use devlink region_snapshot parameter Eran Ben Elisha (3): net/mlx5: Move all devlink related functions calls to devlink.c net/mlx5: Add MPEGC register configuration functionality net/mlx5: Enable PCIe buffer congestion handling workaround via devlink Moshe Shemesh (4): devlink: Fix param set handling for string type devlink: Fix param cmode driverinit for string type devlink: Add helper function for safely copy string param devlink: Add extack messages support to param set drivers/net/ethernet/broadcom/bnxt/bnxt_devlink.c | 3 +- drivers/net/ethernet/mellanox/mlx4/main.c | 6 +- drivers/net/ethernet/mellanox/mlx5/core/Makefile | 3 +- drivers/net/ethernet/mellanox/mlx5/core/devlink.c | 388 + drivers/net/ethernet/mellanox/mlx5/core/devlink.h | 13 + .../net/ethernet/mellanox/mlx5/core/diag/crdump.c | 223 drivers/net/ethernet/mellanox/mlx5/core/health.c | 3 + drivers/net/ethernet/mellanox/mlx5/core/lib/mlx5.h | 4 + .../net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c | 320 + .../net/ethernet/mellanox/mlx5/core/lib/pci_vsc.h | 56 +++ drivers/net/ethernet/mellanox/mlx5/core/main.c | 10 +- include/linux/mlx5/driver.h | 5 + include/net/devlink.h | 15 +- net/core/devlink.c | 44 ++- 14 files changed, 1076 insertions(+), 17 deletions(-) create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/devlink.c create mode 100644
Re: [Patch iproute2] ss: add UNIX_DIAG_VFS and UNIX_DIAG_ICONS for unix sockets
On Tue, 28 Aug 2018 16:16:49 -0700 Cong Wang wrote: > On Mon, Aug 27, 2018 at 3:27 PM Stephen Hemminger > wrote: > > > > On Mon, 27 Aug 2018 14:46:52 -0700 > > Cong Wang wrote: > > > > > UNIX_DIAG_VFS and UNIX_DIAG_ICONS are never used by ss, > > > make them available in ss -e output. > > > > > > Cc: Stephen Hemminger > > > Signed-off-by: Cong Wang > > > --- > > > misc/ss.c | 25 + > > > 1 file changed, 25 insertions(+) > > > > > > diff --git a/misc/ss.c b/misc/ss.c > > > index 41e7762b..d28bc1ec 100644 > > > --- a/misc/ss.c > > > +++ b/misc/ss.c > > > @@ -16,6 +16,7 @@ > > > #include > > > #include > > > #include > > > +#include > > > > Why is this included, it isn't on my system. > > It is for major() and minor(). Ok on Debian, these are in the architecture include, so this will work fine.
Re: [bpf-next PATCH 0/2] bpf: test_sockmap updates
On 08/28/2018 06:10 PM, John Fastabend wrote: > Two small test sockmap updates for bpf-next. These help me run some > additional tests with test_sockmap. Applied to bpf-next, thanks!
Re: [PATCH bpf-next] bpf: remove duplicated include from syscall.c
On 08/28/2018 09:42 AM, YueHaibing wrote: > Remove duplicated include. > > Signed-off-by: YueHaibing Applied to bpf-next, thanks!
Re: [PATCH] rtnetlink: expose value from SET_NETDEV_DEVTYPE via IFLA_DEVTYPE attribute
Hi Stephen, >>> The name value from SET_NETDEV_DEVTYPE only ended up in the uevent sysfs >>> file as DEVTYPE= information. To avoid any kind of race conditions >>> between netlink messages and reading from sysfs, it is useful to add the >>> same string as new IFLA_DEVTYPE attribute included in the RTM_NEWLINK >>> messages. >>> >>> For network managing daemons that have to classify ARPHRD_ETHER network >>> devices into different types (like Wireless LAN, Bluetooth etc.), this >>> avoids the extra round trip to sysfs and parsing of the uevent file. >>> >>> Signed-off-by: Marcel Holtmann >>> --- >>> include/uapi/linux/if_link.h | 2 ++ >>> net/core/rtnetlink.c | 12 >>> 2 files changed, 14 insertions(+) >>> >>> diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h >>> index 43391e2d1153..781294972bb4 100644 >>> --- a/include/uapi/linux/if_link.h >>> +++ b/include/uapi/linux/if_link.h >>> @@ -166,6 +166,8 @@ enum { >>> IFLA_NEW_IFINDEX, >>> IFLA_MIN_MTU, >>> IFLA_MAX_MTU, >>> + IFLA_DEVTYPE, /* Name value from SET_NETDEV_DEVTYPE */ >> >> This is not something netdev-related. dev->dev.type is struct device_type. >> This is a generic "device" thing. Incorrect to expose over >> netdev-specific API. Please use "device" API for this. > > There is no device API in netlink. The whole point of this patch is to > do it in one message. It might be a performance optimization, but I can't > see how it could be a race condition. Devices set type before registering. the only way right now to pick up the DEVTYPE= value is from the /sys/class/net/*/uevent file. That is based on the ifname and not the index. When udev + systemd start renaming things behind your back, your daemon does not have a clean one-shot way of getting that information. As stated, the information in DEVTYPE= are a sub-classification of ARPHRD_ETHER and allows to differentiate a wired Ethernet card from a WiFi interface, from a Bluetooth interface, from WiMAX and so on. They just happen to be in dev.dev_type data structure at the moment and I didn't want to duplicate that information. I am actually fine doing this via IFLA_INFO_KIND since NetworkManager seems to be fixed now or do this via a different method or maybe just a different attribute name. I really just want to get the sub-classification of ARPHRD_ETHER that we need in userspace networking daemons from the kernel without having to go poke left and right over sysfs or interact with udev or systemd. Regards Marcel
Re: Kernel Panic on high bandwidth transfer over wifi
On 08/29/2018 04:42 AM, Nathaniel Munk wrote: > Hi all, > I'm running Arch Linux on kernel 4.18.5 (same issue on both arch-provided > kernel and mainline built-from-source). There is an issue whereby the kernel > crashes when transferring at high bandwidths (approx 6mB/s) over a specific > wifi connection. I can only reproduce the issue when using the Personal > Hotspot on my iPhone 6S+, but can reproduce it very consistently on that > connection. > > More often than not, any download reaching this speed will cause a panic, but > if the download is immediately terminated at the first error the system can > recover (and doing this I have obtained the attached logs). Unfortunately, I > have not had access to a second machine to obtain the netconsole printout of > the panic. > > As above, high-bandwidth transfers on other wifi networks do not cause the > issue (nor on ethernet connections). > > As you can see from the attached log, the issue appears at tcp_recvmsg+0x579 > and net_tx_action+0x1fe. At both these positions (net/ipv4/tcp.c:2000 and > net/core/dev.c:4279 in mainline 4.18.5), a member of the skb struct is called. > > Thank you for your time (and I apologize if this is spurious or badly worded, > this is my first bug report), and please don't hesitate to let me know if > there's anything else I can do to help work this out. > > Regards, > --- > Nathaniel Munk > nathan...@munk.com.au > Unfortunately there is no attached log ;)
Re: [PATCH] rtnetlink: expose value from SET_NETDEV_DEVTYPE via IFLA_DEVTYPE attribute
On Wed, 29 Aug 2018 09:18:55 +0200 Jiri Pirko wrote: > Tue, Aug 28, 2018 at 10:58:11PM CEST, mar...@holtmann.org wrote: > >The name value from SET_NETDEV_DEVTYPE only ended up in the uevent sysfs > >file as DEVTYPE= information. To avoid any kind of race conditions > >between netlink messages and reading from sysfs, it is useful to add the > >same string as new IFLA_DEVTYPE attribute included in the RTM_NEWLINK > >messages. > > > >For network managing daemons that have to classify ARPHRD_ETHER network > >devices into different types (like Wireless LAN, Bluetooth etc.), this > >avoids the extra round trip to sysfs and parsing of the uevent file. > > > >Signed-off-by: Marcel Holtmann > >--- > > include/uapi/linux/if_link.h | 2 ++ > > net/core/rtnetlink.c | 12 > > 2 files changed, 14 insertions(+) > > > >diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h > >index 43391e2d1153..781294972bb4 100644 > >--- a/include/uapi/linux/if_link.h > >+++ b/include/uapi/linux/if_link.h > >@@ -166,6 +166,8 @@ enum { > > IFLA_NEW_IFINDEX, > > IFLA_MIN_MTU, > > IFLA_MAX_MTU, > >+IFLA_DEVTYPE, /* Name value from SET_NETDEV_DEVTYPE */ > > This is not something netdev-related. dev->dev.type is struct device_type. > This is a generic "device" thing. Incorrect to expose over > netdev-specific API. Please use "device" API for this. There is no device API in netlink. The whole point of this patch is to do it in one message. It might be a performance optimization, but I can't see how it could be a race condition. Devices set type before registering.
Re: [PATCH] rtnetlink: expose value from SET_NETDEV_DEVTYPE via IFLA_DEVTYPE attribute
Hi Jiri, >> The name value from SET_NETDEV_DEVTYPE only ended up in the uevent sysfs >> file as DEVTYPE= information. To avoid any kind of race conditions >> between netlink messages and reading from sysfs, it is useful to add the >> same string as new IFLA_DEVTYPE attribute included in the RTM_NEWLINK >> messages. >> >> For network managing daemons that have to classify ARPHRD_ETHER network >> devices into different types (like Wireless LAN, Bluetooth etc.), this >> avoids the extra round trip to sysfs and parsing of the uevent file. >> >> Signed-off-by: Marcel Holtmann >> --- >> include/uapi/linux/if_link.h | 2 ++ >> net/core/rtnetlink.c | 12 >> 2 files changed, 14 insertions(+) >> >> diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h >> index 43391e2d1153..781294972bb4 100644 >> --- a/include/uapi/linux/if_link.h >> +++ b/include/uapi/linux/if_link.h >> @@ -166,6 +166,8 @@ enum { >> IFLA_NEW_IFINDEX, >> IFLA_MIN_MTU, >> IFLA_MAX_MTU, >> +IFLA_DEVTYPE, /* Name value from SET_NETDEV_DEVTYPE */ > > This is not something netdev-related. dev->dev.type is struct device_type. > This is a generic "device" thing. Incorrect to expose over > netdev-specific API. Please use "device" API for this. it is not just "device" related since this is a sub-classification of ARPHRD_ETHER type as a wrote above. Don't get hang up that this information is part of struct device. The dev->dev.type contains strings like "wlan", "bluetooth", "wimax", "gadget" etc. so that you can tell what kind of Ethernet card you have. We can revive the patches for adding this information as IFLA_INFO_KIND, but last time they where reverted since NetworkManager did all the wrong decisions based on that. That part is fixed now and we can put that back and declare the sub-type classification of Ethernet device in two places. Or we just take the information that we added a long time ago for exactly this sub-classification and provide them to userspace via RTNL. Regards Marcel
mlx5 driver loading failing on v4.19 / net-next / bpf-next
Hi Saeed, I'm having issues loading mlx5 driver on v4.19 kernels (tested both net-next and bpf-next), while kernel v4.18 seems to work. It happens with a Mellanox ConnectX-5 NIC (and also a CX4-Lx but I removed that from the system now). One pain point is very long boot-time, caused by some timeout code in the driver. The kernel console log (dmesg) says: [5.763330] mlx5_core :03:00.0: firmware version: 16.22.1002 [5.769367] mlx5_core :03:00.0: 126.016 Gb/s available PCIe bandwidth, limited by 8 GT/s x16 link at :00:02.0 (capable of 252.048 Gb/s with 16 GT/s x16 link) (...) other drivers loading [ 66.816635] mlx5_core :03:00.0: wait_func:964:(pid 112): ENABLE_HCA(0x104) timeout. Will cause a leak of a command resource [ 66.828123] mlx5_core :03:00.0: enable hca failed [ 66.845516] mlx5_core :03:00.0: mlx5_load_one failed with error code -110 [ 66.852802] mlx5_core: probe of :03:00.0 failed with error -110 [ 66.859347] mlx5_core :03:00.1: firmware version: 16.22.1002 [ 66.865388] mlx5_core :03:00.1: 126.016 Gb/s available PCIe bandwidth, limited by 8 GT/s x16 link at :00:02.0 (capable of 252.048 Gb/s with 16 GT/s x16 link) [ 125.787395] XFS (sda3): Mounting V5 Filesystem [ 125.848509] XFS (sda3): Ending clean mount [ 127.984784] mlx5_core :03:00.1: wait_func:964:(pid 5): ENABLE_HCA(0x104) timeout. Will cause a leak of a command resource [ 127.996090] mlx5_core :03:00.1: enable hca failed [ 128.013819] mlx5_core :03:00.1: mlx5_load_one failed with error code -110 [ 128.021076] mlx5_core: probe of :03:00.1 failed with error -110 Do you have any idea what could be causing this? -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH v2 iproute2-next 2/5] bridge: colorize output and use JSON print library
On Sat, 14 Jul 2018 18:41:03 -0700 Roopa Prabhu wrote: > On Tue, Feb 20, 2018 at 11:24 AM, Stephen Hemminger > wrote: > > From: Stephen Hemminger > > > > Use new functions from json_print to simplify code. > > Provide standard flag for colorizing output. > > > > The shortened -c flag is ambiguous it could mean color or > > compressvlan; it is now changed to mean color for consistency > > with other iproute2 commands. > > > > Signed-off-by: Stephen Hemminger > > --- > > bridge/br_common.h | 2 +- > > bridge/bridge.c| 10 +- > > bridge/fdb.c | 281 +++-- > > bridge/mdb.c | 362 > > ++--- > > bridge/vlan.c | 276 +++- > > 5 files changed, 363 insertions(+), 568 deletions(-) > > > > diff --git a/bridge/br_common.h b/bridge/br_common.h > > index b25f61e50e05..2f1cb8fd9f3d 100644 > > --- a/bridge/br_common.h > > +++ b/bridge/br_common.h > > @@ -6,7 +6,7 @@ > > #define MDB_RTR_RTA(r) \ > > ((struct rtattr *)(((char *)(r)) + > > RTA_ALIGN(sizeof(__u32 > > > > -extern void print_vlan_info(FILE *fp, struct rtattr *tb, int ifindex); > > +extern void print_vlan_info(FILE *fp, struct rtattr *tb); > > extern int print_linkinfo(const struct sockaddr_nl *who, > > struct nlmsghdr *n, > > void *arg); > > diff --git a/bridge/bridge.c b/bridge/bridge.c > > index 4b112e3b8da9..e5b4c3c2198f 100644 > > --- a/bridge/bridge.c > > +++ b/bridge/bridge.c > > @@ -16,12 +16,15 @@ > > #include "utils.h" > > #include "br_common.h" > > #include "namespace.h" > > +#include "color.h" > > > > struct rtnl_handle rth = { .fd = -1 }; > > int preferred_family = AF_UNSPEC; > > int oneline; > > int show_stats; > > int show_details; > > +int show_pretty; > > +int color; > > int compress_vlans; > > int json; > > int timestamp; > > @@ -39,7 +42,7 @@ static void usage(void) > > "where OBJECT := { link | fdb | mdb | vlan | monitor }\n" > > " OPTIONS := { -V[ersion] | -s[tatistics] | -d[etails] |\n" > > " -o[neline] | -t[imestamp] | -n[etns] name |\n" > > -" -c[ompressvlans] -p[retty] -j{son} }\n"); > > +" -c[ompressvlans] -color -p[retty] -j{son} }\n"); > > exit(-1); > > } > > > > @@ -170,6 +173,8 @@ main(int argc, char **argv) > > NEXT_ARG(); > > if (netns_switch(argv[1])) > > exit(-1); > > + } else if (matches(opt, "-color") == 0) { > > + enable_color(); > > } else if (matches(opt, "-compressvlans") == 0) { > > ++compress_vlans; > > } else if (matches(opt, "-force") == 0) { > > @@ -195,6 +200,9 @@ main(int argc, char **argv) > > > > _SL_ = oneline ? "\\" : "\n"; > > > > + if (json) > > + check_if_color_enabled(); > > + > > if (batch_file) > > return batch(batch_file); > > > > diff --git a/bridge/fdb.c b/bridge/fdb.c > > index 93b5b2e694e3..b4f6e8b3a01b 100644 > > --- a/bridge/fdb.c > > +++ b/bridge/fdb.c > > @@ -22,9 +22,9 @@ > > #include > > #include > > #include > > -#include > > #include > > > > +#include "json_print.h" > > #include "libnetlink.h" > > #include "br_common.h" > > #include "rt_names.h" > > @@ -32,8 +32,6 @@ > > > > static unsigned int filter_index, filter_vlan, filter_state; > > > > -json_writer_t *jw_global; > > - > > static void usage(void) > > { > > fprintf(stderr, > > @@ -83,13 +81,46 @@ static int state_a2n(unsigned int *s, const char *arg) > > return 0; > > } > > > > -static void start_json_fdb_flags_array(bool *fdb_flags) > > +static void fdb_print_flags(FILE *fp, unsigned int flags) > > +{ > > + open_json_array(PRINT_JSON, > > + is_json_context() ? "flags" : ""); > > + > > + if (flags & NTF_SELF) > > + print_string(PRINT_ANY, NULL, "%s ", "self"); > > + > > + if (flags & NTF_ROUTER) > > + print_string(PRINT_ANY, NULL, "%s ", "router"); > > + > > + if (flags & NTF_EXT_LEARNED) > > + print_string(PRINT_ANY, NULL, "%s ", "extern_learn"); > > + > > + if (flags & NTF_OFFLOADED) > > + print_string(PRINT_ANY, NULL, "%s ", "offload"); > > + > > + if (flags & NTF_MASTER) > > + print_string(PRINT_ANY, NULL, "%s ", "master"); > > + > > + close_json_array(PRINT_JSON, NULL); > > +} > > + > > +static void fdb_print_stats(FILE *fp, const struct nda_cacheinfo *ci) > > { > > - if (*fdb_flags) > > - return; > > - jsonw_name(jw_global, "flags"); > > - jsonw_start_array(jw_global); > > - *fdb_flags = true; > > + static int hz; > > + > > + if (!hz) > > + hz = get_user_hz(); > > + > > + if
Fw: [Bug 200943] New: Repeating tcp_mark_head_lost in dmesg
Begin forwarded message: Date: Sun, 26 Aug 2018 22:24:12 + From: bugzilla-dae...@bugzilla.kernel.org To: step...@networkplumber.org Subject: [Bug 200943] New: Repeating tcp_mark_head_lost in dmesg https://bugzilla.kernel.org/show_bug.cgi?id=200943 Bug ID: 200943 Summary: Repeating tcp_mark_head_lost in dmesg Product: Networking Version: 2.5 Kernel Version: 4.14.66 Hardware: All OS: Linux Tree: Mainline Status: NEW Severity: normal Priority: P1 Component: IPV4 Assignee: step...@networkplumber.org Reporter: rm+...@romanrm.net Regression: No Getting a bunch of these now every hour during continuous ~100 Mbit of network traffic. What's up with that? Seems harmless, as in the kernel doesn't crash and the network connection is not interrupted. (Maybe the particular TCP session is?) If there are no ill-effects from this condition, is such spammy WARN_ON really necessary? [Mon Aug 27 02:16:11 2018] [ cut here ] [Mon Aug 27 02:16:11 2018] WARNING: CPU: 5 PID: 0 at net/ipv4/tcp_input.c:2263 tcp_mark_head_lost+0x247/0x260 [Mon Aug 27 02:16:11 2018] Modules linked in: dm_snapshot loop vhost_net vhost tap tun ip6t_MASQUERADE nf_nat_masquerade_ipv6 ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_DSCP xt_mark ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_owner xt_tcpudp xt_set ip_set_hash_net ip_set nfnetlink xt_limit xt_length xt_multiport xt_conntrack ip6t_rpfilter ipt_rpfilter ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_raw wireguard ip6_udp_tunnel udp_tunnel ip6table_mangle iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_raw iptable_mangle ip6table_filter ip6_tables matroxfb_base matroxfb_g450 matroxfb_Ti3026 matroxfb_accel matroxfb_DAC1064 g450_pll matroxfb_misc iptable_filter ip_tables x_tables cpufreq_powersave cpufreq_userspace cpufreq_conservative 8021q garp mrp [Mon Aug 27 02:16:11 2018] bridge stp llc bonding tcp_bbr sch_fq tcp_illinois fuse radeon ttm drm_kms_helper drm i2c_algo_bit it87 hwmon_vid eeepc_wmi asus_wmi sparse_keymap rfkill video wmi_bmof mxm_wmi edac_mce_amd kvm_amd kvm snd_pcm snd_timer snd soundcore joydev evdev pcspkr k10temp fam15h_power sp5100_tco sg shpchp wmi pcc_cpufreq acpi_cpufreq button ext4 crc16 mbcache jbd2 fscrypto btrfs zstd_decompress zstd_compress xxhash algif_skcipher af_alg dm_crypt dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio dm_mod hid_generic usbhid hid raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor sd_mod raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod vfio_pci irqbypass vfio_virqfd vfio_iommu_type1 vfio ohci_pci crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel [Mon Aug 27 02:16:11 2018] pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd r8169 ahci xhci_pci libahci ohci_hcd ehci_pci mii xhci_hcd ehci_hcd i2c_piix4 libata usbcore scsi_mod bnx2 [Mon Aug 27 02:16:11 2018] CPU: 5 PID: 0 Comm: swapper/5 Tainted: GW 4.14.66-rm1+ #132 [Mon Aug 27 02:16:11 2018] Hardware name: To be filled by O.E.M. To be filled by O.E.M./SABERTOOTH 990FX R2.0, BIOS 2901 05/04/2016 [Mon Aug 27 02:16:11 2018] task: 8ba79c679dc0 task.stack: b4d741928000 [Mon Aug 27 02:16:11 2018] RIP: 0010:tcp_mark_head_lost+0x247/0x260 [Mon Aug 27 02:16:11 2018] RSP: 0018:8ba7aed437d8 EFLAGS: 00010202 [Mon Aug 27 02:16:11 2018] RAX: 0018 RBX: 8ba3901a0800 RCX: [Mon Aug 27 02:16:11 2018] RDX: 0017 RSI: 0001 RDI: 8ba4d47e9000 [Mon Aug 27 02:16:11 2018] RBP: 8ba4d47e9000 R08: 000d R09: [Mon Aug 27 02:16:11 2018] R10: 100c R11: R12: 0001 [Mon Aug 27 02:16:11 2018] R13: 8ba4d47e9158 R14: 0001 R15: 9d0b6708 [Mon Aug 27 02:16:11 2018] FS: () GS:8ba7aed4() knlGS: [Mon Aug 27 02:16:11 2018] CS: 0010 DS: ES: CR0: 80050033 [Mon Aug 27 02:16:11 2018] CR2: 01fbdff0 CR3: 00040c8d6000 CR4: 000406e0 [Mon Aug 27 02:16:11 2018] Call Trace: [Mon Aug 27 02:16:11 2018] [Mon Aug 27 02:16:11 2018] tcp_fastretrans_alert+0x5c3/0xa20 [Mon Aug 27 02:16:11 2018] tcp_ack+0x95a/0x1170 [Mon Aug 27 02:16:11 2018] ? __slab_free.isra.70+0x79/0x200 [Mon Aug 27 02:16:11 2018] tcp_rcv_established+0x16a/0x5a0 [Mon Aug 27 02:16:11 2018] ? tcp_v4_inbound_md5_hash+0x76/0x1e0 [Mon Aug 27 02:16:11 2018] tcp_v4_do_rcv+0x130/0x1f0 [Mon Aug 27 02:16:11 2018] tcp_v4_rcv+0x9ac/0xaa0 [Mon Aug 27 02:16:11 2018] ip_local_deliver_finish+0x9a/0x1c0 [Mon Aug 27 02:16:11 2018] ip_local_deliver+0x6b/0xe0 [Mon Aug 27 02:16:11 2018] ? ip_rcv_finish+0x440/0x440 [Mon Aug 27 02:16:11 2018] ip_rcv+0x2b0/0x3c0 [Mon Aug 27 02:16:11 2018] ?
Fw: [Bug 200967] New: No network with U.S. Robotics USR997902
Begin forwarded message: Date: Wed, 29 Aug 2018 04:36:20 + From: bugzilla-dae...@bugzilla.kernel.org To: step...@networkplumber.org Subject: [Bug 200967] New: No network with U.S. Robotics USR997902 https://bugzilla.kernel.org/show_bug.cgi?id=200967 Bug ID: 200967 Summary: No network with U.S. Robotics USR997902 Product: Networking Version: 2.5 Kernel Version: 4.18.5 Hardware: x86-64 OS: Linux Tree: Mainline Status: NEW Severity: normal Priority: P1 Component: IPV4 Assignee: step...@networkplumber.org Reporter: kyri...@alumni.princeton.edu Regression: No Created attachment 278193 --> https://bugzilla.kernel.org/attachment.cgi?id=278193=edit dmesg and hwinfo output (I am reporting this upstream as instructed by OpenSUSE, where I had originally reported this problem.) After upgrading to kernel 4.18 (originally noticed the problem with kernel 4.18.0, the problem persists with kernel 4.18.5), I could no longer connect to the network. A similar machine, which I upgraded at the same time, had no problem, so I thought that the problem might be due to my using a network card instead of the motherboard's built-in network controller. Sure enough, configuring the built-in controller and connecting that to the network worked fine. According to lspci, the card that doesn't work with the new kernel is: U.S. Robotics USR997902 10/100/1000 Mbps PCI Network Card (rev 10) Ifconfig shows that the network controller has been recognized as such, but that it has not obtained an IP address. The problem did not occur with kernel 4.17.14. I have attached the output of hwinfo and dmesg for kernels 4.17.14 and 4.18.5, with the network cable connected to the U.S. Robotics controller (enp5s0). -- You are receiving this mail because: You are the assignee for the bug.
[PATCH 2/2] net: ethernet: cpsw-phy-sel: prefer phandle for phy sel
The cpsw-phy-sel device is not a child of the cpsw interconnect target module. It lives in the system control module. Let's fix this issue by trying to use cpsw-phy-sel phandle first if it exists and if not fall back to current usage of trying to find the cpsw-phy-sel child. That way the phy sel driver can be a child of the system control module where it belongs in the device tree. Without this fix, we cannot have a proper interconnect target module hierarchy in device tree for things like genpd. Note that deferred probe is mostly not supported by cpsw and this patch does not attempt to fix that. In case deferred probe support is needed, this could be added to cpsw_slave_open() and phy_connect() so they start handling and returning errors. For documenting it, looks like the cpsw-phy-sel is used for all cpsw device tree nodes. It's missing the related binding documentation, so let's also update the binding documentation accordingly. Cc: devicet...@vger.kernel.org Cc: Andrew Lunn Cc: Grygorii Strashko Cc: Ivan Khoronzhuk Cc: Mark Rutland Cc: Murali Karicheri Cc: Rob Herring Signed-off-by: Tony Lindgren --- drivers/net/ethernet/ti/cpsw-phy-sel.c | 9 ++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/ti/cpsw-phy-sel.c b/drivers/net/ethernet/ti/cpsw-phy-sel.c --- a/drivers/net/ethernet/ti/cpsw-phy-sel.c +++ b/drivers/net/ethernet/ti/cpsw-phy-sel.c @@ -170,10 +170,13 @@ void cpsw_phy_sel(struct device *dev, phy_interface_t phy_mode, int slave) struct device_node *node; struct cpsw_phy_sel_priv *priv; - node = of_get_child_by_name(dev->of_node, "cpsw-phy-sel"); + node = of_parse_phandle(dev->of_node, "cpsw-phy-sel", 0); if (!node) { - dev_err(dev, "Phy mode driver DT not found\n"); - return; + node = of_get_child_by_name(dev->of_node, "cpsw-phy-sel"); + if (!node) { + dev_err(dev, "Phy mode driver DT not found\n"); + return; + } } dev = bus_find_device(_bus_type, NULL, node, match); -- 2.18.0
[PATCH 1/2] dt-bindings: net: cpsw: Document cpsw-phy-sel usage but prefer phandle
The current cpsw usage for cpsw-phy-sel is undocumented but is used for all the boards using cpsw. And cpsw-phy-sel is not really a child of the cpsw device, it lives in the system control module instead. Let's document the existing usage, and improve it a bit where we prefer to use a phandle instead of a child device for it. That way we can properly describe the hardware in dts files for things like genpd. Cc: devicet...@vger.kernel.org Cc: Andrew Lunn Cc: Grygorii Strashko Cc: Ivan Khoronzhuk Cc: Mark Rutland Cc: Murali Karicheri Cc: Rob Herring Signed-off-by: Tony Lindgren --- Documentation/devicetree/bindings/net/cpsw.txt | 6 ++ 1 file changed, 6 insertions(+) diff --git a/Documentation/devicetree/bindings/net/cpsw.txt b/Documentation/devicetree/bindings/net/cpsw.txt --- a/Documentation/devicetree/bindings/net/cpsw.txt +++ b/Documentation/devicetree/bindings/net/cpsw.txt @@ -19,6 +19,10 @@ Required properties: - slaves : Specifies number for slaves - active_slave : Specifies the slave to use for time stamping, ethtool and SIOCGMIIPHY +- cpsw-phy-sel : Specifies the phandle to the CPSW phy mode selection + device. See also cpsw-phy-sel.txt for it's binding. + Note that in legacy cases cpsw-phy-sel may be + a child device instead of a phandle. Optional properties: - ti,hwmods: Must be "cpgmac0" @@ -75,6 +79,7 @@ Examples: cpts_clock_mult = <0x8000>; cpts_clock_shift = <29>; syscon = <>; + cpsw-phy-sel = <_sel>; cpsw_emac0: slave@0 { phy_id = <_mdio>, <0>; phy-mode = "rgmii-txid"; @@ -103,6 +108,7 @@ Examples: cpts_clock_mult = <0x8000>; cpts_clock_shift = <29>; syscon = <>; + cpsw-phy-sel = <_sel>; cpsw_emac0: slave@0 { phy_id = <_mdio>, <0>; phy-mode = "rgmii-txid"; -- 2.18.0
[PATCH bpf 3/3] bpf: fix sg shift repair start offset in bpf_msg_pull_data
When we perform the sg shift repair for the scatterlist ring, we currently start out at i = first_sg + 1. However, this is not correct since the first_sg could point to the sge sitting at slot MAX_SKB_FRAGS - 1, and a subsequent i = MAX_SKB_FRAGS will access the scatterlist ring (sg) out of bounds. Add the sk_msg_iter_var() helper for iterating through the ring, and apply the same rule for advancing to the next ring element as we do elsewhere. Later work will use this helper also in other places. Fixes: 015632bb30da ("bpf: sk_msg program helper bpf_sk_msg_pull_data") Signed-off-by: Daniel Borkmann Acked-by: John Fastabend --- net/core/filter.c | 26 +- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/net/core/filter.c b/net/core/filter.c index 43ba5f8..2c7801f 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -2282,6 +2282,13 @@ static const struct bpf_func_proto bpf_msg_cork_bytes_proto = { .arg2_type = ARG_ANYTHING, }; +#define sk_msg_iter_var(var) \ + do {\ + var++; \ + if (var == MAX_SKB_FRAGS) \ + var = 0;\ + } while (0) + BPF_CALL_4(bpf_msg_pull_data, struct sk_msg_buff *, msg, u32, start, u32, end, u64, flags) { @@ -2302,9 +2309,7 @@ BPF_CALL_4(bpf_msg_pull_data, if (start < offset + len) break; offset += len; - i++; - if (i == MAX_SKB_FRAGS) - i = 0; + sk_msg_iter_var(i); } while (i != msg->sg_end); if (unlikely(start >= offset + len)) @@ -2330,9 +2335,7 @@ BPF_CALL_4(bpf_msg_pull_data, */ do { copy += sg[i].length; - i++; - if (i == MAX_SKB_FRAGS) - i = 0; + sk_msg_iter_var(i); if (bytes_sg_total <= copy) break; } while (i != msg->sg_end); @@ -2358,9 +2361,7 @@ BPF_CALL_4(bpf_msg_pull_data, sg[i].length = 0; put_page(sg_page([i])); - i++; - if (i == MAX_SKB_FRAGS) - i = 0; + sk_msg_iter_var(i); } while (i != last_sg); sg[first_sg].length = copy; @@ -2377,7 +2378,8 @@ BPF_CALL_4(bpf_msg_pull_data, if (!shift) goto out; - i = first_sg + 1; + i = first_sg; + sk_msg_iter_var(i); do { int move_from; @@ -2394,9 +2396,7 @@ BPF_CALL_4(bpf_msg_pull_data, sg[move_from].page_link = 0; sg[move_from].offset = 0; - i++; - if (i == MAX_SKB_FRAGS) - i = 0; + sk_msg_iter_var(i); } while (1); msg->sg_end -= shift; if (msg->sg_end < 0) -- 2.9.5
[PATCH bpf 0/3] Three fixes for bpf_msg_pull_data
This set contains three more fixes for the bpf_msg_pull_data() mainly for correcting scatterlist ring wrap-arounds as well as fixing up data pointers. For details please see individual patches. Thanks! Daniel Borkmann (3): bpf: fix msg->data/data_end after sg shift repair in bpf_msg_pull_data bpf: fix shift upon scatterlist ring wrap-around in bpf_msg_pull_data bpf: fix sg shift repair start offset in bpf_msg_pull_data net/core/filter.c | 36 +++- 1 file changed, 19 insertions(+), 17 deletions(-) -- 2.9.5
[PATCH bpf 2/3] bpf: fix shift upon scatterlist ring wrap-around in bpf_msg_pull_data
If first_sg and last_sg wraps around in the scatterlist ring, then we need to account for that in the shift as well. E.g. crafting such msgs where this is the case leads to a hang as shift becomes negative. E.g. consider the following scenario: first_sg := 14 |=>shift := -12 msg->sg_start := 10 last_sg := 3 | msg->sg_end := 5 round 1: i := 15, move_from := 3, sg[15] := sg[ 3] round 2: i := 0, move_from := -12, sg[ 0] := sg[-12] round 3: i := 1, move_from := -11, sg[ 1] := sg[-11] round 4: i := 2, move_from := -10, sg[ 2] := sg[-10] [...] round 13: i := 11, move_from := -1, sg[ 2] := sg[ -1] round 14: i := 12, move_from := 0, sg[ 2] := sg[ 0] round 15: i := 13, move_from := 1, sg[ 2] := sg[ 1] round 16: i := 14, move_from := 2, sg[ 2] := sg[ 2] round 17: i := 15, move_from := 3, sg[ 2] := sg[ 3] [...] This means we will loop forever and never hit the msg->sg_end condition to break out of the loop. When we see that the ring wraps around, then the shift should be MAX_SKB_FRAGS - first_sg + last_sg - 1. Meaning, the remainder slots from the tail of the ring and the head until last_sg combined. Fixes: 015632bb30da ("bpf: sk_msg program helper bpf_sk_msg_pull_data") Signed-off-by: Daniel Borkmann Acked-by: John Fastabend --- net/core/filter.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/net/core/filter.c b/net/core/filter.c index b9225c5..43ba5f8 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -2370,7 +2370,10 @@ BPF_CALL_4(bpf_msg_pull_data, * had a single entry though we can just replace it and * be done. Otherwise walk the ring and shift the entries. */ - shift = last_sg - first_sg - 1; + WARN_ON_ONCE(last_sg == first_sg); + shift = last_sg > first_sg ? + last_sg - first_sg - 1 : + MAX_SKB_FRAGS - first_sg + last_sg - 1; if (!shift) goto out; -- 2.9.5