Re: [PATCH] net: bridge: add max_fdb_count
On 17/11/17 08:14, Nikolay Aleksandrov wrote:
> On 17/11/17 07:26, Willy Tarreau wrote:
>> Hi Stephen,
>>
>> On Thu, Nov 16, 2017 at 04:27:18PM -0800, Stephen Hemminger wrote:
>>> On Thu, 16 Nov 2017 21:21:55 +0100 Vincent Bernat wrote:
>>>> ❦ 16 novembre 2017 20:23 +0100, Andrew Lunn:
>>>>> struct net_bridge_fdb_entry is 40 bytes.
>>>>>
>>>>> My WiFi access point, which is also a 5-port bridge, currently has
>>>>> 97MB free RAM. That is space for about 2.5M FDB entries. So even
>>>>> Roopa's 128K is not really a problem in terms of memory.
>>>>
>>>> I am also interested in Sarah's patch because we can now have bridges
>>>> with many ports through VXLAN. The FDB can be replicated to an
>>>> external daemon with BGP, and the cost of each additional MAC address
>>>> is therefore higher than just a few bytes. It seems simpler to
>>>> implement a limiting policy early (at the port or bridge level).
>>>>
>>>> Also, this is a pretty standard limit to have for a bridge
>>>> ("switchport port-security maximum" on Cisco, "set interface X
>>>> mac-limit" on Juniper). And it's not something easy to do with
>>>> ebtables.
>>>
>>> I want an optional limit per port; it makes a lot of sense.
>>> If for no other reason than that huge hash tables are a performance
>>> problem.
>>
>> Except it's not a limit in that it doesn't prevent new traffic from
>> going in, it only prevents new MACs from being learned, so suddenly you
>> start flooding all ports with new traffic once the limit is reached,
>> which is not trivial to detect nor diagnose.
>>
>>> There is a bigger question about which fdb to evict, but just dropping
>>> the new one seems to be easiest and as good as any other solution.
>>
>> Usually it's better to apply LRU or random here in my opinion, as the
>> new entry is much more likely to be needed than older ones by
>> definition.
>> In terms of CPU usage it may even be better to kill an entire series in
>> the hash table (e.g. all nodes in the same table entry), as the
>> operation can be almost as cheap and result in not being needed again
>> for a while.
>>
>> Willy
>
> Hi,
> I have been thinking about this and how to try and keep everyone happy
> while maintaining performance, so how about this:
>
> * Add a per-port fdb LRU list which is used only when there are >= 2000
>   global fdb entries _or_ a per-port limit is set. If the list is in
>   use, update the fdb list position once per second. I think these
>   properties will allow us to scale and have better LRU locking
>   granularity (per-port), and in smaller setups (not needing LRU) the
>   cost will be a single test in the fast path.

It seems I spoke too soon. I just tried a quick and dirty version of
this, and the per-port LRU becomes a nightmare with the completely
lockless fdb dst port changes. So I think going back to the original
proposition of dropping new fdbs is best for now; otherwise we need to
add some synchronization on fdb port moves. In general the per-port
locks would be tricky to handle with moving fdbs, so even if we add a
per-port LRU list we'll need some global lock to handle moves and
updates in a simple manner, and I'd like to avoid adding a new bridge
lock.

> * Use the above LRU list for per-port limit evictions
>
> * More importantly, use the LRU list for fdb expiry, removing the need
>   to walk over all fdbs every time we expire entries. This would be of
>   great help for larger setups with many fdbs (it will kick in on
>   >= 2000 fdb entries).

Will have to drop the above point for now, even though I was mostly
interested in that.

> * (optional) Make the global LRU kick-in limit an option; people might
>   want to minimize traffic blocking due to the expire process.
>
> Any comments and suggestions are welcome. When we agree on the details
> I'll do the RFC patches and run some tests before submitting. Defaults
> will be kept as they are now. I've chosen the 2000 limit arbitrarily
> and am happy to change it if people have something else in mind. This
> should play nice with the resizeable hashtable change.
>
> Thanks,
> Nik
Re: [PATCH] macvlan: verify MTU before lowerdev xmit
Hi Dave,

> This is an area where we really haven't set down some clear rules
> for behavior.
>
> If an interface has a particular MTU, it must be able to successfully
> send MTU-sized packets on that link, be it virtual or physical.
>
> Only a "next hop" can have a different MTU and thus drop packets.
> This requirement is absolutely necessary in order for proper
> signalling (path MTU messages) to make their way back to the sending
> host.
>
> In this VM-->macvlan case it's more like a point-to-point connection,
> and there is no "next hop" to serve as the provider of proper
> signalling.
>
> This whole situation seems to be handled quite poorly in virtualized
> setups. Allowing one end of the virtual networking "link" into the
> guest to have a different MTU from the other end is a HUGE mistake.

I agree, but users do it, so I'm just trying to figure out the best way
to handle it. Currently the bridge code and openvswitch will detect when
a large packet arrives and drop it* - the bridge code with
is_skb_forwardable() and openvswitch with its own approach in vport.c.
This patch tries to bring macvlan in line with those.

*except for GSO packets - they are assumed to be ok, which isn't always
true. I am working on some patches for this - but my approach may need
to be changed in light of what you're saying.

> There needs to be control path signalling between the guest and the
> provider of the virtual link so that they can synchronize their MTU
> settings.

We have these sorts of problems on probably every type of virtual
interface, some of which are easier to deal with than others. I'm
particularly worried about interfaces like ibmveth, where the 'other
end' is usually an AIX partition on a big powerpc system. AIX tends to
bring up these interfaces with MTUs of around 64k as well. (This is what
originally got me down the rabbit hole of caring about mismatched-MTU
handling!)

> Yes this is hard, but what is happening now doesn't fly in the long
> term.

I'm at a bit of a loss about how we would do a proper fix. The only
thing that comes to mind would be for the receive routines of the
virtual network interfaces to check the size of incoming packets, but -
if I understand correctly - we would need to know what the maximum size
for a received packet should be.

Regards,
Daniel
Re: [PATCH] macvlan: verify MTU before lowerdev xmit
From: Daniel Axtens
Date: Fri, 17 Nov 2017 19:34:27 +1100

> Hi Dave,
>
>> This is an area where we really haven't set down some clear rules
>> for behavior.
>>
>> If an interface has a particular MTU, it must be able to successfully
>> send MTU-sized packets on that link, be it virtual or physical.
>>
>> Only a "next hop" can have a different MTU and thus drop packets.
>> This requirement is absolutely necessary in order for proper
>> signalling (path MTU messages) to make their way back to the sending
>> host.
>>
>> In this VM-->macvlan case it's more like a point-to-point connection,
>> and there is no "next hop" to serve as the provider of proper
>> signalling.
>>
>> This whole situation seems to be handled quite poorly in virtualized
>> setups. Allowing one end of the virtual networking "link" into the
>> guest to have a different MTU from the other end is a HUGE mistake.
>
> I agree, but users do it, so I'm just trying to figure out the best way
> to handle it. Currently the bridge code and openvswitch will detect when
> a large packet arrives and drop it* - the bridge code with
> is_skb_forwardable() and openvswitch with its own approach in vport.c.
> This patch tries to bring macvlan in line with those.
>
> *except for GSO packets - they are assumed to be ok, which isn't always
> true. I am working on some patches for this - but my approach may need
> to be changed in light of what you're saying.

So how exactly do the oversized packets get to the macvlan device from
the VM in this scenario? That detail seems to be missing from the
diagrams you provided earlier. The VM and the macvlan boxes are just
connected with a line.

Thank you.
Re: [PATCH] sfp: Add support for DWDM SFP modules
On Fri, Nov 17, 2017 at 02:58:05PM +0900, David Miller wrote:
> From: Jan Kundrát
> Date: Wed, 15 Nov 2017 12:39:33 +0100
>
>> Without this patch, but with CONFIG_SFP enabled, my NIC won't detect
>> module unplugging, which is suboptimal.
>>
>> I'm using an OEM "Cisco compatible" DWDM fixed-frequency 100GHz-grid
>> SFP module. It reports itself as a 0x0b 0x24. According to SFF-8024,
>> byte 0 value "0Bh" refers to a "DWDM-SFP/SFP+ (not using SFF-8472)".
>> In practice, there are a lot of shared properties here.
>>
>> Everything is apparently defined in a document called "DWDM SFP MSA
>> (Multi-source Agreement), Revision 1.0, 19th September 2005". I don't
>> have access to that document (yet). Its likely source,
>> http://www.dwdmsfpmsa.org/, has been down for years.
>>
>> From the datasheets that I was able to find on random vendors' web
>> sites, the second byte can vary -- 0x27 is used, too.
>>
>> Tested on Clearfog Base with v4.14 and Russell King's SFP patches.
>>
>> Signed-off-by: Jan Kundrát
>
> Russell, Florian, Andrew, can I get a review?

Hi David,

I've been discussing the topic with Jan off-list.

There's nothing wrong per se with the patch in that it will allow a DWDM
module to be functional, but (e.g.) the data read by ethtool -m won't be
correct. The DWDM module EEPROM format is not compatible with any of the
ETH_MODULE_SFF_ formats. I also need to validate that the fields we do
read from the EEPROM contents in the sfp code are meaningful for DWDM.

As Jan says, the DWDM MSA isn't available - from what I can tell, having
looked on archive.org and done extensive google searching, the MSA was
never made public, and dwdmsfpmsa.org has been down for years. When the
site was operational, it looked like you had to be a member to have
access - archive.org has all the legal agreements for that, but no
specification. However, the module datasheets give some information
about the EEPROM format, which has been useful.

Also, this patch will conflict with Florian's SFF module patch (a
soldered-down version of SFP modules). I already have a stack of patches
for phy, phylink and sfp that I need to send, including documentation
patches which Florian has already found very useful and helpful. I had
assumed that net-next was already closed, being almost a week into the
merge window.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up
[PATCH v2] net: qcom/emac: extend DMA mask to 46bits
Since PTP is not supported yet, extend the DMA mask to 46 bits. When PTP
is supported, the DMA mask should be adjusted dynamically.

Signed-off-by: Wang Dongsheng
---
v2:
 - Change the PATCH subject.
 - Dynamically fix the TPD_BUFFER_ADDR_H_SET size.
 - Modify the DMA mask to 46 bits.
 - Add comments for the DMA mask.
---
 drivers/net/ethernet/qualcomm/emac/emac-mac.c | 9 ++++++---
 drivers/net/ethernet/qualcomm/emac/emac-mac.h | 5 ++++-
 drivers/net/ethernet/qualcomm/emac/emac.c     | 8 ++++++--
 3 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/qualcomm/emac/emac-mac.c b/drivers/net/ethernet/qualcomm/emac/emac-mac.c
index 9cbb2726..db73348 100644
--- a/drivers/net/ethernet/qualcomm/emac/emac-mac.c
+++ b/drivers/net/ethernet/qualcomm/emac/emac-mac.c
@@ -1362,7 +1362,8 @@ static void emac_tx_fill_tpd(struct emac_adapter *adpt,
 			goto error;
 
 		TPD_BUFFER_ADDR_L_SET(tpd, lower_32_bits(tpbuf->dma_addr));
-		TPD_BUFFER_ADDR_H_SET(tpd, upper_32_bits(tpbuf->dma_addr));
+		TPD_BUFFER_ADDR_H_SET(adpt, tpd,
+				      upper_32_bits(tpbuf->dma_addr));
 		TPD_BUF_LEN_SET(tpd, tpbuf->length);
 		emac_tx_tpd_create(adpt, tx_q, tpd);
 		count++;
@@ -1380,7 +1381,8 @@ static void emac_tx_fill_tpd(struct emac_adapter *adpt,
 			goto error;
 
 		TPD_BUFFER_ADDR_L_SET(tpd, lower_32_bits(tpbuf->dma_addr));
-		TPD_BUFFER_ADDR_H_SET(tpd, upper_32_bits(tpbuf->dma_addr));
+		TPD_BUFFER_ADDR_H_SET(adpt, tpd,
+				      upper_32_bits(tpbuf->dma_addr));
 		TPD_BUF_LEN_SET(tpd, tpbuf->length);
 		emac_tx_tpd_create(adpt, tx_q, tpd);
 		count++;
@@ -1402,7 +1404,8 @@ static void emac_tx_fill_tpd(struct emac_adapter *adpt,
 			goto error;
 
 		TPD_BUFFER_ADDR_L_SET(tpd, lower_32_bits(tpbuf->dma_addr));
-		TPD_BUFFER_ADDR_H_SET(tpd, upper_32_bits(tpbuf->dma_addr));
+		TPD_BUFFER_ADDR_H_SET(adpt, tpd,
+				      upper_32_bits(tpbuf->dma_addr));
 		TPD_BUF_LEN_SET(tpd, tpbuf->length);
 		emac_tx_tpd_create(adpt, tx_q, tpd);
 		count++;
diff --git a/drivers/net/ethernet/qualcomm/emac/emac-mac.h b/drivers/net/ethernet/qualcomm/emac/emac-mac.h
index 5028fb4..31c1f21 100644
--- a/drivers/net/ethernet/qualcomm/emac/emac-mac.h
+++ b/drivers/net/ethernet/qualcomm/emac/emac-mac.h
@@ -115,7 +115,10 @@ struct emac_tpd {
 /* High-14bit Buffer Address, So, the 64b-bit address is
  * {DESC_CTRL_11_TX_DATA_HIADDR[17:0], (register) BUFFER_ADDR_H, BUFFER_ADDR_L}
  */
-#define TPD_BUFFER_ADDR_H_SET(tpd, val)	BITS_SET((tpd)->word[3], 18, 30, val)
+#define TPD_BUFFER_ADDR_H_SET(adpt, tpd, val)	BITS_SET((tpd)->word[3], 18, \
+	TX_TS_ENABLE & \
+	readl((adpt)->csr + EMAC_EMAC_WRAPPER_CSR1) ? \
+	30 : 31, val)
 
 /* Format D. Word offset from the 1st byte of this packet to start to calculate
  * the custom checksum.
  */
diff --git a/drivers/net/ethernet/qualcomm/emac/emac.c b/drivers/net/ethernet/qualcomm/emac/emac.c
index 70c92b6..de5a4fb 100644
--- a/drivers/net/ethernet/qualcomm/emac/emac.c
+++ b/drivers/net/ethernet/qualcomm/emac/emac.c
@@ -615,8 +615,12 @@ static int emac_probe(struct platform_device *pdev)
 	u32 reg;
 	int ret;
 
-	/* The TPD buffer address is limited to 45 bits. */
-	ret = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(45));
+	/*
+	 * The TPD buffer address is limited to:
+	 * 1. PTP: 45 bits. (The driver doesn't support PTP yet.)
+	 * 2. Non-PTP: 46 bits.
+	 */
+	ret = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(46));
 	if (ret) {
 		dev_err(&pdev->dev, "could not set DMA mask\n");
 		return ret;
-- 
2.7.4
Re: Bisected 4.14 Regression: IPsec transport mode breakage
On Wed, Nov 15, 2017 at 09:46:19AM -0700, Kevin Locke wrote:
> Hi all,
>
> I am using an L2TP/IPsec (transport mode) VPN connection from a client
> behind a NAT running Debian with strongswan 5.6.0-2 and xl2tpd
> 1.3.10-1 to a Cisco Meraki MX60 with a public IP. The connection
> works with kernel 4.13 but not with kernel 4.14. With 4.14 the IPsec
> connection appears to be established correctly but xl2tpd is unable to
> establish the L2TP connection. The relevant error from syslog is:
>
>   charon: 09[KNL] creating acquire job for policy 192.168.21.10/32[udp/l2f] === X.X.X.X/32[udp/l2f] with reqid {1}
>   charon: 12[CFG] trap not found, unable to acquire reqid 1
>
> I have bisected the issue to commit c9f3f813d462. I have attached the
> client ipsec.conf as well as the syslog during the connection attempt
> for both c9f3f813d462 (bad) and cf3796675174 (good). Meraki IPs have
> been redacted to protect the innocent.
>
> I'd appreciate any assistance in fixing the issue. Let me know if
> there's anything else I can do to help troubleshoot or test.

The offending commit is already reverted in the 'net' tree and will be
available in mainline soon.

Thanks for the report!
[PATCH] [net] ibmvnic: fix dma_mapping_error call
This patch fixes the dma_mapping_error call to use the correct dma_addr,
which lives inside the ibmvnic_vpd struct. It also removes the now-unused
local dma_addr variable, which triggered an uninitialized-variable
warning.

Fixes: 4e6759be28e4 ("ibmvnic: Feature implementation of VPD for the ibmvnic driver")
Reported-by: Stephen Rothwell
Signed-off-by: Desnes A. Nunes do Rosario
---
 drivers/net/ethernet/ibm/ibmvnic.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index 04aaacb..1dc4aef 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -849,7 +849,6 @@ static int ibmvnic_get_vpd(struct ibmvnic_adapter *adapter)
 {
 	struct device *dev = &adapter->vdev->dev;
 	union ibmvnic_crq crq;
-	dma_addr_t dma_addr;
 	int len = 0;
 
 	if (adapter->vpd->buff)
@@ -879,7 +878,7 @@ static int ibmvnic_get_vpd(struct ibmvnic_adapter *adapter)
 	adapter->vpd->dma_addr =
 		dma_map_single(dev, adapter->vpd->buff, adapter->vpd->len,
 			       DMA_FROM_DEVICE);
-	if (dma_mapping_error(dev, dma_addr)) {
+	if (dma_mapping_error(dev, adapter->vpd->dma_addr)) {
 		dev_err(dev, "Could not map VPD buffer\n");
 		kfree(adapter->vpd->buff);
 		return -ENOMEM;
-- 
2.9.5
Re: [PATCH] macvlan: verify MTU before lowerdev xmit
Hi Dave,

> So how exactly do the oversized packets get to the macvlan device from
> the VM in this scenario? That detail seems to be missing from the
> diagrams you provided earlier. The VM and the macvlan boxes are just
> connected with a line.

Inside the VM I'm using netperf talking on an interface which the guest
believes to have an MTU of 1500.

I'm setting up the VM using libvirt - my understanding is that libvirt
creates a macvtap device in private mode, and qemu opens that tap device
and writes data from the emulated network card (I see the same behaviour
with an emulated rtl8139, e1000, and with virtio). I think I could
replicate this with any userspace program - qemu is just the easiest for
me at the moment.

Hopefully that's what you had in mind? Let me know if you wanted
different info.

Regards,
Daniel

[The gory details, in case it matters: The VM has a network adaptor with
the following XML: ]

> Thank you.
Re: [PATCH 26/31] nds32: Build infrastructure
2017-11-10 16:26 GMT+08:00 Greentime Hu :
> 2017-11-09 18:33 GMT+08:00 Arnd Bergmann :
>> On Thu, Nov 9, 2017 at 10:02 AM, Greentime Hu wrote:
>>> 2017-11-08 18:16 GMT+08:00 Arnd Bergmann :
>>>> On Wed, Nov 8, 2017 at 6:55 AM, Greentime Hu wrote:
>>>>> +config GENERIC_CALIBRATE_DELAY
>>>>> +	def_bool y
>>>>
>>>> It's better to avoid the delay loop completely and skip the
>>>> calibration, if your hardware allows.
>>>
>>> Thanks.
>>> Do you mean that this config should be def_bool n?
>>> Why? Almost all arches enable it.
>>
>> It depends on what your hardware can do. If you have a way to see how
>> much time has passed that is guaranteed to be reliable on all
>> machines, then use that instead.
>>
>> On a lot of architectures, it's not possible, so they have to fall
>> back to using the delay loop.
>
> I get it. I will discuss it with our HW colleagues.
> We may get this information from some registers.

Hi, Arnd:

I think I can't set it to default n, because it will be called in
start_kernel:

  start_kernel() -> calibrate_delay()

If I don't enable this config, the kernel will fail to link because
init/calibrate.c is not built.
Re: [PATCH 26/31] nds32: Build infrastructure
On Fri, Nov 17, 2017 at 1:39 PM, Greentime Hu wrote:
> 2017-11-10 16:26 GMT+08:00 Greentime Hu :
>> 2017-11-09 18:33 GMT+08:00 Arnd Bergmann :
>>> On Thu, Nov 9, 2017 at 10:02 AM, Greentime Hu wrote:
>>>> 2017-11-08 18:16 GMT+08:00 Arnd Bergmann :
>>>>> On Wed, Nov 8, 2017 at 6:55 AM, Greentime Hu wrote:
>>>>>> +config GENERIC_CALIBRATE_DELAY
>>>>>> +	def_bool y
>>>>>
>>>>> It's better to avoid the delay loop completely and skip the
>>>>> calibration, if your hardware allows.
>>>>
>>>> Thanks.
>>>> Do you mean that this config should be def_bool n?
>>>> Why? Almost all arches enable it.
>>>
>>> It depends on what your hardware can do. If you have a way to see how
>>> much time has passed that is guaranteed to be reliable on all
>>> machines, then use that instead.
>>>
>>> On a lot of architectures, it's not possible, so they have to fall
>>> back to using the delay loop.
>>
>> I get it. I will discuss it with our HW colleagues.
>> We may get this information from some registers.
>
> Hi, Arnd:
>
> I think I can't set it to default n, because it will be called in
> start_kernel:
>
>   start_kernel() -> calibrate_delay()
>
> If I don't enable this config, the kernel will fail to link because
> init/calibrate.c is not built.

You will have to provide an architecture-specific implementation of this
function; please have a look at what others are doing, e.g. tile,
openrisc or h8300.

	Arnd
Re: [PATCH] ARM64: dts: meson-axg: add ethernet mac controller
On 13/11/2017 09:01, Yixun Lan wrote:
> Add DT info for the stmmac ethernet MAC found in the Amlogic Meson-AXG
> SoC, and also describe the ethernet pinctrl & clock information here.
>
> This is tested on the S400 dev board, which uses an RTL8211F PHY with
> the pins connected to the 'eth_rgmii_y_pins' group.
>
> Signed-off-by: Yixun Lan
> ---
>  arch/arm64/boot/dts/amlogic/meson-axg-s400.dts |  7 ++++
>  arch/arm64/boot/dts/amlogic/meson-axg.dtsi     | 53 ++++++++++++++++++
>  2 files changed, 60 insertions(+)
>
> diff --git a/arch/arm64/boot/dts/amlogic/meson-axg-s400.dts b/arch/arm64/boot/dts/amlogic/meson-axg-s400.dts
> index 9eb6aaee155d..7b39a9fe2b0f 100644
> --- a/arch/arm64/boot/dts/amlogic/meson-axg-s400.dts
> +++ b/arch/arm64/boot/dts/amlogic/meson-axg-s400.dts
> @@ -21,3 +21,10 @@
>  	status = "okay";
>  };
>  
> +&ethmac {
> +	status = "okay";
> +	phy-mode = "rgmii";
> +
> +	pinctrl-0 = <&eth_rgmii_y_pins>;
> +	pinctrl-names = "default";
> +};
> diff --git a/arch/arm64/boot/dts/amlogic/meson-axg.dtsi b/arch/arm64/boot/dts/amlogic/meson-axg.dtsi
> index 65945c6c8b65..57faaa9d8013 100644
> --- a/arch/arm64/boot/dts/amlogic/meson-axg.dtsi
> +++ b/arch/arm64/boot/dts/amlogic/meson-axg.dtsi
> @@ -157,6 +157,19 @@
>  		#address-cells = <0>;
>  	};
>  
> +	ethmac: ethernet@ff3f {
> +		compatible = "amlogic,meson-axg-dwmac",
> +			     "amlogic,meson-gxbb-dwmac",
> +			     "snps,dwmac";
> +		reg = <0x0 0xff3f 0x0 0x1
> +		       0x0 0xff634540 0x0 0x8>;
> +		interrupts = ;
> +		interrupt-names = "macirq";
> +		status = "disabled";
> +		clocks = <&clkc CLKID_ETH>,
> +			 <&clkc CLKID_FCLK_DIV2>,
> +			 <&clkc CLKID_MPLL2>;
> +		clock-names = "stmmaceth", "clkin0", "clkin1";
> +	};
> +
>  	hiubus: hiubus@ff63c000 {
>  		compatible = "simple-bus";
>  		reg = <0x0 0xff63c000 0x0 0x1c00>;
> @@ -203,6 +216,46 @@
>  		#gpio-cells = <2>;
>  		gpio-ranges = <&pinctrl_periphs 0 0 86>;
>  	};
> +
> +	eth_rgmii_x_pins: eth-x-rgmii {
> +		mux {
> +			groups = "eth_mdio_x",
> +				 "eth_mdc_x",
> +				 "eth_rgmii_rx_clk_x",
> +				 "eth_rx_dv_x",
> +				 "eth_rxd0_x",
> +				 "eth_rxd1_x",
> +				 "eth_rxd2_rgmii",
> +				 "eth_rxd3_rgmii",
> +				 "eth_rgmii_tx_clk",
> +				 "eth_txen_x",
> +				 "eth_txd0_x",
> +				 "eth_txd1_x",
> +				 "eth_txd2_rgmii",
> +				 "eth_txd3_rgmii";
> +			function = "eth";
> +		};
> +	};
> +
> +	eth_rgmii_y_pins: eth-y-rgmii {
> +		mux {
> +			groups = "eth_mdio_y",
> +				 "eth_mdc_y",
> +				 "eth_rgmii_rx_clk_y",
> +				 "eth_rx_dv_y",
> +				 "eth_rxd0_y",
> +				 "eth_rxd1_y",
> +				 "eth_rxd2_rgmii",
> +				 "eth_rxd3_rgmii",
> +				 "eth_rgmii_tx_clk",
> +				 "eth_txen_y",
> +				 "eth_txd0_y",
> +				 "eth_txd1_y",
> +				 "eth_txd2_rgmii",
> +				 "eth_txd3_rgmii";
> +			function = "eth";
> +		};
> +	};
>  };

Reviewed-by: Neil Armstrong
[PATCH net-next v2 0/2] net: thunderx: add support for PTP clock
This series adds support for the IEEE 1588 Precision Time Protocol to
the Cavium ethernet driver.

The first patch adds support for the Precision Time Protocol Clocks and
Timestamping coprocessor (PTP) found on Cavium processors. It registers
a new PTP clock in the PTP core and provides functions to use the
counter in the BGX, TNS, GTI, and NIC blocks.

The second patch introduces support for the PTP protocol to the Cavium
ThunderX ethernet driver.

v2:
 - use readq()/writeq() in place of cavium_ptp_reg_read()/cavium_ptp_reg_write(),
   don't use readq_relaxed()/writeq_relaxed() (David Daney)

v1: https://lkml.kernel.org/r/20171107190704.15458-1-aleksey.maka...@cavium.com

Radoslaw Biernacki (1):
  net: add support for Cavium PTP coprocessor

Sunil Goutham (1):
  net: thunderx: add timestamping support

 drivers/net/ethernet/cavium/Kconfig                 |  14 +
 drivers/net/ethernet/cavium/Makefile                |   1 +
 drivers/net/ethernet/cavium/common/Makefile         |   1 +
 drivers/net/ethernet/cavium/common/cavium_ptp.c     | 334 +
 drivers/net/ethernet/cavium/common/cavium_ptp.h     |  78 +
 drivers/net/ethernet/cavium/thunder/nic.h           |  15 +
 drivers/net/ethernet/cavium/thunder/nic_main.c      |  58 +++-
 drivers/net/ethernet/cavium/thunder/nic_reg.h       |   1 +
 drivers/net/ethernet/cavium/thunder/nicvf_ethtool.c |  29 +-
 drivers/net/ethernet/cavium/thunder/nicvf_main.c    | 173 ++-
 drivers/net/ethernet/cavium/thunder/nicvf_queues.c  |  26 ++
 drivers/net/ethernet/cavium/thunder/thunder_bgx.c   |  29 ++
 drivers/net/ethernet/cavium/thunder/thunder_bgx.h   |   4 +
 13 files changed, 757 insertions(+), 6 deletions(-)
 create mode 100644 drivers/net/ethernet/cavium/common/Makefile
 create mode 100644 drivers/net/ethernet/cavium/common/cavium_ptp.c
 create mode 100644 drivers/net/ethernet/cavium/common/cavium_ptp.h
-- 
2.15.0
[PATCH net-next v2 1/2] net: add support for Cavium PTP coprocessor
From: Radoslaw Biernacki This patch adds support for the Precision Time Protocol Clocks and Timestamping hardware found on Cavium ThunderX processors. Signed-off-by: Radoslaw Biernacki Signed-off-by: Aleksey Makarov --- drivers/net/ethernet/cavium/Kconfig | 13 + drivers/net/ethernet/cavium/Makefile| 1 + drivers/net/ethernet/cavium/common/Makefile | 1 + drivers/net/ethernet/cavium/common/cavium_ptp.c | 334 drivers/net/ethernet/cavium/common/cavium_ptp.h | 78 ++ 5 files changed, 427 insertions(+) create mode 100644 drivers/net/ethernet/cavium/common/Makefile create mode 100644 drivers/net/ethernet/cavium/common/cavium_ptp.c create mode 100644 drivers/net/ethernet/cavium/common/cavium_ptp.h diff --git a/drivers/net/ethernet/cavium/Kconfig b/drivers/net/ethernet/cavium/Kconfig index 63be75eb34d2..fabe0ffcc2d4 100644 --- a/drivers/net/ethernet/cavium/Kconfig +++ b/drivers/net/ethernet/cavium/Kconfig @@ -50,6 +50,19 @@ config THUNDER_NIC_RGX This driver supports configuring XCV block of RGX interface present on CN81XX chip. +config CAVIUM_PTP + tristate "Cavium PTP coprocessor as PTP clock" + depends on 64BIT + depends on PTP_1588_CLOCK + select CAVIUM_RST + default y + ---help--- + This driver adds support for the Precision Time Protocol Clocks and + Timestamping coprocessor (PTP) found on Cavium processors. + PTP provides timestamping mechanism that is suitable for use in IEEE 1588 + Precision Time Protocol or other purposes. Timestamps can be used in + BGX, TNS, GTI, and NIC blocks. + config LIQUIDIO tristate "Cavium LiquidIO support" depends on 64BIT diff --git a/drivers/net/ethernet/cavium/Makefile b/drivers/net/ethernet/cavium/Makefile index 872da9f7c31a..946bba84e81d 100644 --- a/drivers/net/ethernet/cavium/Makefile +++ b/drivers/net/ethernet/cavium/Makefile @@ -1,6 +1,7 @@ # # Makefile for the Cavium ethernet device drivers. 
# +obj-$(CONFIG_NET_VENDOR_CAVIUM) += common/ obj-$(CONFIG_NET_VENDOR_CAVIUM) += thunder/ obj-$(CONFIG_NET_VENDOR_CAVIUM) += liquidio/ obj-$(CONFIG_NET_VENDOR_CAVIUM) += octeon/ diff --git a/drivers/net/ethernet/cavium/common/Makefile b/drivers/net/ethernet/cavium/common/Makefile new file mode 100644 index ..dd8561b8060b --- /dev/null +++ b/drivers/net/ethernet/cavium/common/Makefile @@ -0,0 +1 @@ +obj-$(CONFIG_CAVIUM_PTP) += cavium_ptp.o diff --git a/drivers/net/ethernet/cavium/common/cavium_ptp.c b/drivers/net/ethernet/cavium/common/cavium_ptp.c new file mode 100644 index ..f4c738db27fd --- /dev/null +++ b/drivers/net/ethernet/cavium/common/cavium_ptp.c @@ -0,0 +1,334 @@ +/* + * cavium_ptp.c - PTP 1588 clock on Cavium hardware + * + * Copyright (c) 2003-2015, 2017 Cavium, Inc. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License. + * + * This file may also be available under a different license from Cavium. + * Contact Cavium, Inc. 
for more information + */ + +#include +#include +#include +#include + +#include "cavium_ptp.h" + +#define DRV_NAME "Cavium PTP Driver" + +#define PCI_DEVICE_ID_CAVIUM_PTP 0xA00C +#define PCI_DEVICE_ID_CAVIUM_RST 0xA00E + +#define PCI_PTP_BAR_NO 0 +#define PCI_RST_BAR_NO 0 + +#define PTP_CLOCK_CFG 0xF00ULL +#define PTP_CLOCK_CFG_PTP_EN BIT(0) +#define PTP_CLOCK_LO 0xF08ULL +#define PTP_CLOCK_HI 0xF10ULL +#define PTP_CLOCK_COMP 0xF18ULL + +#define RST_BOOT 0x1600ULL +#define CLOCK_BASE_RATE5000ULL + +static u64 ptp_cavium_clock_get(void) +{ + struct pci_dev *pdev; + void __iomem *base; + u64 ret = CLOCK_BASE_RATE * 16; + + pdev = pci_get_device(PCI_VENDOR_ID_CAVIUM, + PCI_DEVICE_ID_CAVIUM_RST, NULL); + if (!pdev) + goto error; + + base = pci_ioremap_bar(pdev, PCI_RST_BAR_NO); + if (!base) + goto error_put_pdev; + + ret = CLOCK_BASE_RATE * ((readq(base + RST_BOOT) >> 33) & 0x3f); + + iounmap(base); + +error_put_pdev: + pci_dev_put(pdev); + +error: + return ret; +} + +struct cavium_ptp *cavium_ptp_get(void) +{ + struct cavium_ptp *ptp; + struct pci_dev *pdev; + + pdev = pci_get_device(PCI_VENDOR_ID_CAVIUM, + PCI_DEVICE_ID_CAVIUM_PTP, NULL); + if (!pdev) + return ERR_PTR(-ENODEV); + + ptp = pci_get_drvdata(pdev); + if (!ptp) { + pci_dev_put(pdev); + ptp = ERR_PTR(-EPROBE_DEFER); + } + + return ptp; +} +EXPORT_SYMBOL(cavium_ptp_get); + +void cavium_ptp_put(struct cavium_ptp *ptp) +{ + pci_dev_put(ptp->pdev); +} +EXPORT_SYMBOL(cavium_ptp_put); + +/** +
[PATCH net-next v2 2/2] net: thunderx: add timestamping support
From: Sunil Goutham This adds timestamping support for both receive and transmit paths. On the receive side no filters are supported i.e either all pkts will get a timestamp appended infront of the packet or none. On the transmit side HW doesn't support timestamp insertion but only generates a separate CQE with transmitted packet's timestamp. Also HW supports only one packet at a time for timestamping on the transmit side. Signed-off-by: Sunil Goutham Signed-off-by: Aleksey Makarov --- drivers/net/ethernet/cavium/Kconfig| 1 + drivers/net/ethernet/cavium/thunder/nic.h | 15 ++ drivers/net/ethernet/cavium/thunder/nic_main.c | 58 ++- drivers/net/ethernet/cavium/thunder/nic_reg.h | 1 + .../net/ethernet/cavium/thunder/nicvf_ethtool.c| 29 +++- drivers/net/ethernet/cavium/thunder/nicvf_main.c | 173 - drivers/net/ethernet/cavium/thunder/nicvf_queues.c | 26 drivers/net/ethernet/cavium/thunder/thunder_bgx.c | 29 drivers/net/ethernet/cavium/thunder/thunder_bgx.h | 4 + 9 files changed, 330 insertions(+), 6 deletions(-) diff --git a/drivers/net/ethernet/cavium/Kconfig b/drivers/net/ethernet/cavium/Kconfig index fabe0ffcc2d4..6d003cbe197c 100644 --- a/drivers/net/ethernet/cavium/Kconfig +++ b/drivers/net/ethernet/cavium/Kconfig @@ -27,6 +27,7 @@ config THUNDER_NIC_PF config THUNDER_NIC_VF tristate "Thunder Virtual function driver" + select CAVIUM_PTP depends on 64BIT ---help--- This driver supports Thunder's NIC virtual function diff --git a/drivers/net/ethernet/cavium/thunder/nic.h b/drivers/net/ethernet/cavium/thunder/nic.h index 4a02e618e318..204b234beb9d 100644 --- a/drivers/net/ethernet/cavium/thunder/nic.h +++ b/drivers/net/ethernet/cavium/thunder/nic.h @@ -263,6 +263,8 @@ struct nicvf_drv_stats { struct u64_stats_sync syncp; }; +struct cavium_ptp; + struct nicvf { struct nicvf*pnicvf; struct net_device *netdev; @@ -312,6 +314,12 @@ struct nicvf { struct tasklet_struct qs_err_task; struct work_struct reset_task; + /* PTP timestamp */ + struct cavium_ptp *ptp_clock; + 
boolhw_rx_tstamp; + struct sk_buff *ptp_skb; + atomic_ttx_ptp_skbs; + /* Interrupt coalescing settings */ u32 cq_coalesce_usecs; u32 msg_enable; @@ -371,6 +379,7 @@ struct nicvf { #defineNIC_MBOX_MSG_LOOPBACK 0x16/* Set interface in loopback */ #defineNIC_MBOX_MSG_RESET_STAT_COUNTER 0x17/* Reset statistics counters */ #defineNIC_MBOX_MSG_PFC0x18/* Pause frame control */ +#defineNIC_MBOX_MSG_PTP_CFG0x19/* HW packet timestamp */ #defineNIC_MBOX_MSG_CFG_DONE 0xF0/* VF configuration done */ #defineNIC_MBOX_MSG_SHUTDOWN 0xF1/* VF is being shutdown */ @@ -521,6 +530,11 @@ struct pfc { u8fc_tx; }; +struct set_ptp { + u8msg; + bool enable; +}; + /* 128 bit shared memory between PF and each VF */ union nic_mbx { struct { u8 msg; } msg; @@ -540,6 +554,7 @@ union nic_mbx { struct set_loopback lbk; struct reset_stat_cfg reset_stat; struct pfc pfc; + struct set_ptp ptp; }; #define NIC_NODE_ID_MASK 0x03 diff --git a/drivers/net/ethernet/cavium/thunder/nic_main.c b/drivers/net/ethernet/cavium/thunder/nic_main.c index 8f1dd55b3e08..4c1c5414a162 100644 --- a/drivers/net/ethernet/cavium/thunder/nic_main.c +++ b/drivers/net/ethernet/cavium/thunder/nic_main.c @@ -426,13 +426,22 @@ static void nic_init_hw(struct nicpf *nic) /* Enable backpressure */ nic_reg_write(nic, NIC_PF_BP_CFG, (1ULL << 6) | 0x03); - /* TNS and TNS bypass modes are present only on 88xx */ + /* TNS and TNS bypass modes are present only on 88xx +* Also offset of this CSR has changed in 81xx and 83xx. 
+*/ if (nic->pdev->subsystem_device == PCI_SUBSYS_DEVID_88XX_NIC_PF) { /* Disable TNS mode on both interfaces */ nic_reg_write(nic, NIC_PF_INTF_0_1_SEND_CFG, - (NIC_TNS_BYPASS_MODE << 7) | BGX0_BLOCK); + (NIC_TNS_BYPASS_MODE << 7) | + BGX0_BLOCK | (1ULL << 16)); nic_reg_write(nic, NIC_PF_INTF_0_1_SEND_CFG | (1 << 8), - (NIC_TNS_BYPASS_MODE << 7) | BGX1_BLOCK); + (NIC_TNS_BYPASS_MODE << 7) | + BGX1_BLOCK | (1ULL << 16)); + } else { + /* Configure timestamp generation timeout to 10us */ + for (i = 0; i < nic->hw->bgx_cnt; i++) + nic_reg_write(nic, NIC_PF_INTFX_SEND_CFG | (i <<
Re: [PATCH 26/31] nds32: Build infrastructure
2017-11-17 20:50 GMT+08:00 Arnd Bergmann : > On Fri, Nov 17, 2017 at 1:39 PM, Greentime Hu wrote: >> 2017-11-10 16:26 GMT+08:00 Greentime Hu : >>> 2017-11-09 18:33 GMT+08:00 Arnd Bergmann : On Thu, Nov 9, 2017 at 10:02 AM, Greentime Hu wrote: > 2017-11-08 18:16 GMT+08:00 Arnd Bergmann : >> On Wed, Nov 8, 2017 at 6:55 AM, Greentime Hu wrote: >>> +config GENERIC_CALIBRATE_DELAY >>> + def_bool y >> >> It's better to avoid the delay loop completely and skip the calibration, >> if your hardware allows. > > Thanks. > Do you mean that this config should be def_bool n? > Why? Almost all arches enable it. It depends on what your hardware can do. If you have a way to see how much time has passed that is guaranteed to be reliable on all machines, then use that instead. On a lot of architectures, it's not possible, so they have to fall back to using the delay loop. >>> >>> I get it. I will discuss it with our HW colleagues. >>> We may get this information from some registers. >> >> Hi, Arnd: >> >> I think I can't set it to default n because it will be called in >> start_kernel. >> >> start_kernel() -> calibrate_delay() >> >> If I don't enable this config, it will fail to link because init/calibrate.c >> was not built. > > You will have to provide an architecture-specific implementation of > this function; please have a look at what others are doing, e.g. tile, > openrisc or > h8300. Thanks. I got it.
Re: GRO disabled with IPv4 options
On Thu, Nov 16, 2017 at 5:47 PM, Tom Herbert wrote: > On Thu, Nov 16, 2017 at 1:40 PM, Herbert Xu > wrote: >> On Thu, Nov 16, 2017 at 04:12:43PM +0100, Cristian Klein wrote: >>> >>> Does somebody know the rationale for this? Is it because IPv4 >>> options are rarely used, hence implementing GRO in that case does >>> not pay off or are there some caveats? Specifically would it make >> >> Precisely. GRO is about optimising for the common case. At the >> time there was no impetus to support IP options. >> >>> sense to do GRO when the IPv4 options are byte-identical in >>> consecutive packets? >> >> Yes there is no reason why we can't do this. As long as it doesn't >> penalise the non-IP-option case too much. >> > Of course it would also be nice to have GRO support for various IPv6 > extension headers, at this point we're more likely to see those rather > than IP options in real deployment! ipv6_gro_receive already pulls common ones with ipv6_gso_pull_exthdrs. And to add a counterpoint: GRO has to be resilient to malformed packets, so there is value in keeping it simple and limited to the common case.
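The byte-identical condition suggested above can be sketched as a simple flush check: two IPv4 headers could only be merged when their IHL matches and the option bytes compare equal. The helper below is an illustrative standalone sketch (not the kernel's inet_gro_receive()); the function name and layout are assumptions for demonstration only.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical GRO flush check: headers with options merge only when the
 * option bytes are identical.  ihl is in 32-bit words, so options occupy
 * bytes 20 .. ihl*4 - 1 past the fixed 20-byte header. */
static int ipv4_options_match(const uint8_t *hdr_a, const uint8_t *hdr_b)
{
    uint8_t ihl_a = hdr_a[0] & 0x0f;   /* header length in 32-bit words */
    uint8_t ihl_b = hdr_b[0] & 0x0f;

    if (ihl_a != ihl_b)
        return 0;                      /* different option lengths: flush */
    if (ihl_a == 5)
        return 1;                      /* no options at all: mergeable */
    /* compare only the option area */
    return memcmp(hdr_a + 20, hdr_b + 20, (size_t)ihl_a * 4 - 20) == 0;
}
```

The cost to the common no-option case is a single IHL comparison, which speaks to Herbert's point about not penalising the fast path.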
Re: [PATCH 26/31] nds32: Build infrastructure
2017-11-16 18:25 GMT+08:00 Arnd Bergmann : > On Thu, Nov 16, 2017 at 11:03 AM, Greentime Hu wrote: >> 2017-11-13 18:45 GMT+08:00 Geert Uytterhoeven : >>> Given the checks for __NDS32_EB__, NDS32 can be either big or little endian, >>> so you should have (exactly one of) CPU_BIG_ENDIAN or CPU_LITTLE_ENDIAN set. >> >> Thanks. >> I will check if we need this config or not and update in the next version >> patch. > > I think we have one architecture in the kernel that determines endianness from > the way that the toolchain is built. What all the others do is to have a > Kconfig > option, at least CONFIG_CPU_BIG_ENDIAN, that is used to pass -mbig-endian > or -mlittle-endian to the compiler. You should do it that way so you can use > any toolchain with any kernel configuration. Thanks. I will add these two configs in the next version of the patch and pass -EL or -EB to the compiler based on the config.
Re: GRO disabled with IPv4 options
From: Willem de Bruijn Date: Fri, 17 Nov 2017 08:51:57 -0500 > ipv6_gro_receive already pulls common ones with ipv6_gso_pull_exthdrs. > And to add a counterpoint: GRO has to be resilient to malformed packets, > so there is value in keeping it simple and limited to the common case. Agreed. Also it is not exactly clear what should happen with all IP option types when reconstructing the original packet stream on the GSO side. It just sounds incredibly complicated to me for little to no gain.
Re: [PATCH] net: bridge: add max_fdb_count
> Usually it's better to apply LRU or random here in my opinion, as the > new entry is much more likely to be needed than older ones by definition. Hi Willy I think this depends on why you need to discard. If it is normal operation and the limits are simply too low, I would agree. If however it is a DoS, throwing away the new entries makes sense, leaving the old ones which are more likely to be useful. Most of the talk in this thread has been about limits for DoS prevention... Andrew
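For illustration, the "drop the new entry" policy favoured here for the DoS case is just a bounds check at learn time, with no list maintenance on lookup hits as LRU would need. The structure and names below are hypothetical, not the bridge's actual fdb code:

```c
#include <assert.h>

/* Hypothetical per-port learn limit: with the drop-new policy, hitting
 * the cap costs one comparison and no eviction bookkeeping. */
struct port_fdb {
    unsigned int count;    /* entries currently learned on this port */
    unsigned int max;      /* 0 means unlimited */
};

static int fdb_may_learn(struct port_fdb *p)
{
    if (p->max && p->count >= p->max)
        return 0;          /* at the cap: the new MAC is not learned */
    p->count++;
    return 1;
}
```

The trade-off discussed in the thread is visible here: the cheap check protects against a MAC-flood DoS, but once the cap is hit, legitimate new stations fall back to flooding until an old entry ages out.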
[PATCH] net: usb: hso.c: remove unneeded DRIVER_LICENSE #define
There is no need to #define the license of the driver, just put it in the MODULE_LICENSE() line directly as a text string. This allows tools that check that the module license matches the source code license to work properly, as there is no need to unwind the unneeded dereference. Cc: "David S. Miller" Cc: Andreas Kemnade Cc: Johan Hovold Reported-by: Philippe Ombredanne Signed-off-by: Greg Kroah-Hartman --- drivers/net/usb/hso.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/net/usb/hso.c b/drivers/net/usb/hso.c index d7a3379ea668..f40d1423cf46 100644 --- a/drivers/net/usb/hso.c +++ b/drivers/net/usb/hso.c @@ -76,7 +76,6 @@ #define MOD_AUTHOR "Option Wireless" #define MOD_DESCRIPTION"USB High Speed Option driver" -#define MOD_LICENSE"GPL" #define HSO_MAX_NET_DEVICES10 #define HSO__MAX_MTU 2048 @@ -3288,7 +3287,7 @@ module_exit(hso_exit); MODULE_AUTHOR(MOD_AUTHOR); MODULE_DESCRIPTION(MOD_DESCRIPTION); -MODULE_LICENSE(MOD_LICENSE); +MODULE_LICENSE("GPL"); /* change the debug level (eg: insmod hso.ko debug=0x04) */ MODULE_PARM_DESC(debug, "debug level mask [0x01 | 0x02 | 0x04 | 0x08 | 0x10]"); -- 2.15.0
Re: [PATCH v2] net: qcom/emac: extend DMA mask to 46bits
On 11/17/17 3:56 AM, Wang Dongsheng wrote: -#define TPD_BUFFER_ADDR_H_SET(tpd, val)BITS_SET((tpd)->word[3], 18, 30, val) +#define TPD_BUFFER_ADDR_H_SET(adpt, tpd, val) BITS_SET((tpd)->word[3], 18, \ + TX_TS_ENABLE & \ + readl((adpt)->csr + EMAC_EMAC_WRAPPER_CSR1) ? \ + 30 : 31, val) NAK. Sorry, but this is terrible. Every write to the TPD forces a secret read from another register? No thank you. Just do what you had in v1, but also set the DMA mask. Add a comment saying that we can support 46 bits because we never enable timestamping. -- Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
Re: regression: UFO removal breaks kvm live migration
On Fri, Nov 10, 2017 at 12:32 AM, Willem de Bruijn wrote: > On Wed, Nov 8, 2017 at 9:53 PM, Jason Wang wrote: >> >> >> On 2017年11月08日 20:32, David Miller wrote: >>> >>> From: Jason Wang >>> Date: Wed, 8 Nov 2017 17:25:48 +0900 >>> On 2017年11月08日 17:08, Willem de Bruijn wrote: > > That won't help in the short term. I'm still reading up to see if > there are > any other options besides reimplement or advertise-but-drop, such as > an implicit trigger that would make the guest renegotiate. It's > unlikely, but > worth a look.. Yes, this looks hard. And even if we can manage to do this, it looks like overkill since it will impact all guests after migration. >>> >>> Like Willem I would much prefer "advertise-but-drop" if it works. >> >> >> This makes migration work but all guest UFO traffic will stall. >> >>> >>> In the long term feature renegotiation triggers are a must. >>> >>> There is no way for us to remove features otherwise. >> >> >> We can remove if we don't break userspace(guest). >> >>> In my opinion >>> this will even make migrations more powerful. >> >> >> But this does not help for guests running an old version of the kernel which still >> think UFO works. > > Indeed, if we have to support live migration of arbitrary old guests > without any expectations on hypervisor version either, features can > simply never be reverted, even if a negotiation interface exists. > > At least for upcoming features and devices, guest code should not > have this expectation, but from the start allow renegotiation such as > CTRL_GUEST_OFFLOADS [1] based on a host trigger. But for > tuntap TUNSETOFFLOAD it seems that ship has sailed. > > Okay, I will send a patch to reinstate UFO for this use case (only). There > is some related work in tap_handle_frame and packet_direct_xmit to > segment directly in the device. I will be traveling the next few days, so > it won't be in time for 4.14 (but can go in stable later, of course). I'm finishing up and running some tests.
The majority of the patch is a straightforward partial revert of the patchset, so while fairly large for a patch to net (~150 lines, esp. in udp[46]_ufo_fragment), that is all thoroughly tested code. Notably absent are the protocol layer and hardware support (NETIF_F_UFO) portions. The only open issue is whether to rely on existing skb_gso_segment processing in the transmit path from validate_xmit_skb or to add new skb_gso_segment calls directly to tun_get_user, tap_get_user and pf_packet. Tun has to loop around four different ways of injecting packets into the device. Something like the below snippet. More conservative is to introduce no completely new code and rely on validate_xmit_skb, but that means having to protect the entire stack against skbs with SKB_GSO_UDP, so also bringing back some checksum and fragment handling snippets in gre_gso_segment, __skb_udp_tunnel_segment, act_csum and openvswitch. A third option is to send the conservative approach to net, then in net-next follow up with a patch to plug the SKB_GSO_UDP directly in the devices and revert the tunnel/act/openvswitch stanzas. I'm leaning towards that approach.
@@ -1380,7 +1380,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile, int noblock, bool more) { struct tun_pi pi = { 0, cpu_to_be16(ETH_P_IP) }; - struct sk_buff *skb; + struct sk_buff *skb, *segs = NULL; size_t total_len = iov_iter_count(from); size_t len = total_len, align = tun->align, linear; struct virtio_net_hdr gso = { 0 }; @@ -1552,12 +1552,33 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile, } rxhash = __skb_get_hash_symmetric(skb); + + if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP) { + skb_push(skb, ETH_HLEN); + segs = __skb_gso_segment(skb, netif_skb_features(skb), false); + + if (IS_ERR(segs)) { + kfree_skb(skb); + return PTR_ERR(segs); + } + + if (segs) { + consume_skb(skb); + skb = segs; + } +again: + skb_pull(skb, ETH_HLEN); + segs = skb->next; + skb->next = NULL; + } + #ifndef CONFIG_4KSTACKS -tun_rx_batched(tun, tfile, skb, more); + tun_rx_batched(tun, tfile, skb, more || segs); #else netif_rx_ni(skb); #endif + if (segs) { + skb = segs; + goto again; + }
Re: [PATCH] sfp: Add support for DWDM SFP modules
From: Russell King - ARM Linux Date: Fri, 17 Nov 2017 09:52:10 + > I already have a stack of patches for phy, phylink and sfp that I > need to send, including documentation patches which Florian has > already found very useful and helpful. I had assumed that net-next > was already closed, being almost a week into the merge window. Yes it is. Thanks for the info, I'll mark this 'deferred' in patchwork. Please have this respun and posted once net-next is opened back up and the various issues have been sorted out. Thank you.
Re: regression: UFO removal breaks kvm live migration
>> Okay, I will send a patch to reinstate UFO for this use case (only). There >> is some related work in tap_handle_frame and packet_direct_xmit to >> segment directly in the device. I will be traveling the next few days, so >> it won't be in time for 4.14 (but can go in stable later, of course). > > I'm finishing up and running some tests. The majority of the patch is a > straightforward partial revert of the patchset, so while fairly large for a > patch to net (~150 lines, esp. in udp[46]_ufo_fragment), that is all > thoroughly tested code. Notably absent are the protocol layer and > hardware support (NETIF_F_UFO) portions. > > The only open issue is whether to rely on existing skb_gso_segment > processing in the transmit path from validate_xmit_skb or to add new > skb_gso_segment calls directly to tun_get_user, tap_get_user and > pf_packet. Tun has to loop around four different ways of injecting > packets into the device. Something like the below snippet. > > More conservative is to introduce no completely new code and rely on > validate_xmit_skb, but that means having to protect the entire stack > against skbs with SKB_GSO_UDP, so also bringing back some > checksum and fragment handling snippets in gre_gso_segment, > __skb_udp_tunnel_segment, act_csum and openvswitch. Come to think of it, as this patch does not bring back NETIF_F_UFO support to NETIF_F_GSO_SOFTWARE, the tunnel cases can be excluded. Then this is probably the simpler and more obviously correct approach.
Re: [PATCH] net: usb: hso.c: remove unneeded DRIVER_LICENSE #define
On Fri, Nov 17, 2017 at 3:19 PM, Greg Kroah-Hartman wrote: > There is no need to #define the license of the driver, just put it in > the MODULE_LICENSE() line directly as a text string. > > This allows tools that check that the module license matches the source > code license to work properly, as there is no need to unwind the > unneeded dereference. > > Cc: "David S. Miller" > Cc: Andreas Kemnade > Cc: Johan Hovold > Reported-by: Philippe Ombredanne > Signed-off-by: Greg Kroah-Hartman Reviewed-by: Philippe Ombredanne -- Cordially Philippe Ombredanne
Re: [PATCH net] sctp: set frag_point in sctp_setsockopt_maxseg correctly
On Fri, Nov 17, 2017 at 02:11:11PM +0800, Xin Long wrote: > Now in sctp_setsockopt_maxseg user_frag or frag_point can be set with > val >= 8 and val <= SCTP_MAX_CHUNK_LEN. But both checks are incorrect. > > val >= 8 means frag_point can even be less than SCTP_DEFAULT_MINSEGMENT. > Then in sctp_datamsg_from_user(), when its value is greater than the cookie > echo len and trying to bundle with the cookie echo chunk, the first_len will > overflow. > > The worst case is when its value is equal to the cookie echo len: first_len > becomes 0 and it will go into a dead loop for fragmentation later on. In Hangbin's > syzkaller testing env, oom was even triggered due to consecutive memory > allocation in that loop. > > Besides, SCTP_MAX_CHUNK_LEN is the max size of the whole chunk, so it should > deduct the data header for the frag_point or user_frag check. > > This patch does a proper check with SCTP_DEFAULT_MINSEGMENT subtracting > the sctphdr and datahdr, SCTP_MAX_CHUNK_LEN subtracting datahdr when > setting frag_point via sockopt. It also improves the sctp_setsockopt_maxseg > code.
> > Suggested-by: Marcelo Ricardo Leitner > Reported-by: Hangbin Liu > Signed-off-by: Xin Long Acked-by: Marcelo Ricardo Leitner > --- > include/net/sctp/sctp.h | 3 ++- > net/sctp/socket.c | 29 +++-- > 2 files changed, 21 insertions(+), 11 deletions(-) > > diff --git a/include/net/sctp/sctp.h b/include/net/sctp/sctp.h > index d7d8cba..749a428 100644 > --- a/include/net/sctp/sctp.h > +++ b/include/net/sctp/sctp.h > @@ -444,7 +444,8 @@ static inline int sctp_frag_point(const struct > sctp_association *asoc, int pmtu) > if (asoc->user_frag) > frag = min_t(int, frag, asoc->user_frag); > > - frag = SCTP_TRUNC4(min_t(int, frag, SCTP_MAX_CHUNK_LEN)); > + frag = SCTP_TRUNC4(min_t(int, frag, SCTP_MAX_CHUNK_LEN - > + sizeof(struct sctp_data_chunk))); > > return frag; > } > diff --git a/net/sctp/socket.c b/net/sctp/socket.c > index 4c0a772..3204a9b 100644 > --- a/net/sctp/socket.c > +++ b/net/sctp/socket.c > @@ -3140,9 +3140,9 @@ static int sctp_setsockopt_mappedv4(struct sock *sk, > char __user *optval, unsign > */ > static int sctp_setsockopt_maxseg(struct sock *sk, char __user *optval, > unsigned int optlen) > { > + struct sctp_sock *sp = sctp_sk(sk); > struct sctp_assoc_value params; > struct sctp_association *asoc; > - struct sctp_sock *sp = sctp_sk(sk); > int val; > > if (optlen == sizeof(int)) { > @@ -3158,26 +3158,35 @@ static int sctp_setsockopt_maxseg(struct sock *sk, > char __user *optval, unsigned > if (copy_from_user(¶ms, optval, optlen)) > return -EFAULT; > val = params.assoc_value; > - } else > + } else { > return -EINVAL; > + } > > - if ((val != 0) && ((val < 8) || (val > SCTP_MAX_CHUNK_LEN))) > - return -EINVAL; > + if (val) { > + int min_len, max_len; > > - asoc = sctp_id2assoc(sk, params.assoc_id); > - if (!asoc && params.assoc_id && sctp_style(sk, UDP)) > - return -EINVAL; > + min_len = SCTP_DEFAULT_MINSEGMENT - sp->pf->af->net_header_len; > + min_len -= sizeof(struct sctphdr) + > +sizeof(struct sctp_data_chunk); > + > + max_len = SCTP_MAX_CHUNK_LEN - 
sizeof(struct sctp_data_chunk); > > + if (val < min_len || val > max_len) > + return -EINVAL; > + } > + > + asoc = sctp_id2assoc(sk, params.assoc_id); > if (asoc) { > if (val == 0) { > - val = asoc->pathmtu; > - val -= sp->pf->af->net_header_len; > + val = asoc->pathmtu - sp->pf->af->net_header_len; > val -= sizeof(struct sctphdr) + > - sizeof(struct sctp_data_chunk); > +sizeof(struct sctp_data_chunk); > } > asoc->user_frag = val; > asoc->frag_point = sctp_frag_point(asoc, asoc->pathmtu); > } else { > + if (params.assoc_id && sctp_style(sk, UDP)) > + return -EINVAL; > sp->user_frag = val; > } > > -- > 2.1.0 >
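The corrected bounds can be checked numerically. The sketch below mirrors the patch's min_len/max_len logic but passes the kernel constants in as parameters, since their exact values live in kernel headers; the numbers used in the assertions are illustrative IPv4 assumptions (512-byte SCTP_DEFAULT_MINSEGMENT, 20-byte IP header, 12-byte sctphdr, 16-byte DATA chunk header, and an assumed 65532-byte SCTP_MAX_CHUNK_LEN).

```c
#include <assert.h>

/* Hypothetical helper mirroring the patch's bounds computation.
 * Returns 0 when val is an acceptable user_frag, -1 otherwise
 * (val == 0 means "derive frag_point from the path MTU"). */
static int maxseg_ok(int val, int min_seg, int net_hdr_len,
                     int sctp_hdr_len, int data_hdr_len, int max_chunk_len)
{
    int min_len = min_seg - net_hdr_len - sctp_hdr_len - data_hdr_len;
    int max_len = max_chunk_len - data_hdr_len;

    if (val == 0)
        return 0;
    return (val < min_len || val > max_len) ? -1 : 0;
}
```

Note how the old lower bound of 8 now falls well below min_len and is rejected, which is exactly what closes the zero/negative first_len loop described in the commit message.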
Re: [PATCH net] sctp: report SCTP_ERROR_INV_STRM as cpu endian
On Fri, Nov 17, 2017 at 02:15:02PM +0800, Xin Long wrote: > rfc6458 demands the send_error in SCTP_SEND_FAILED_EVENT should > be in cpu endian, while SCTP_ERROR_INV_STRM is in big endian. > > This issue has been there since the very beginning. Eric noticed it by > running 'make C=2 M=net/sctp/'. > > This patch is to convert it before reporting it. Unfortunately we can't fix this, as it would break UAPI. It will break applications that are currently matching on the current value. > > Reported-by: Eric Dumazet > Signed-off-by: Xin Long > --- > net/sctp/stream.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/net/sctp/stream.c b/net/sctp/stream.c > index a11db21..f86ceee 100644 > --- a/net/sctp/stream.c > +++ b/net/sctp/stream.c > @@ -64,7 +64,7 @@ static void sctp_stream_outq_migrate(struct sctp_stream > *stream, >*/ > > /* Mark as failed send. */ > - sctp_chunk_fail(ch, SCTP_ERROR_INV_STRM); > + sctp_chunk_fail(ch, be16_to_cpu(SCTP_ERROR_INV_STRM)); > if (asoc->peer.prsctp_capable && > SCTP_PR_PRIO_ENABLED(ch->sinfo.sinfo_flags)) > asoc->sent_cnt_removable--; > -- > 2.1.0 >
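A byte-order illustration of the bug: the Invalid Stream Identifier cause code is 0x0001 on the wire (big-endian), so a little-endian host that reports the raw constant hands userspace 256 rather than 1, which is exactly why existing applications may already match on the wrong value. The helper below is a standalone stand-in for be16_to_cpu(), not the kernel macro:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Standalone stand-in for be16_to_cpu(): byte-swaps on little-endian
 * hosts, identity on big-endian ones.  Endianness is detected at run
 * time via a one-byte probe. */
static uint16_t be16_to_cpu_demo(uint16_t be)
{
    const union { uint16_t u16; uint8_t u8[2]; } probe = { .u16 = 1 };

    if (probe.u8[0] == 1)                       /* little-endian host */
        return (uint16_t)((be >> 8) | (be << 8));
    return be;                                  /* big-endian: identity */
}
```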
Re: [PATCH] qed: fix unnecessary call to memset cocci warnings
On Fri, Nov 17, 2017 at 12:04 AM, Vasyl Gomonovych wrote: > Use kzalloc rather than kmalloc followed by memset with 0 > > drivers/net/ethernet/qlogic/qed/qed_dcbx.c:1280:13-20: WARNING: > kzalloc should be used for dcbx_info, instead of kmalloc/memset > Generated by: scripts/coccinelle/api/alloc/kzalloc-simple.cocci While this looks okay per se now, it would be good if you put a version on the patch and added a changelog to it. I think there is no need to resend this one, just for your information. Reviewed-by: Andy Shevchenko > Signed-off-by: Vasyl Gomonovych > --- > drivers/net/ethernet/qlogic/qed/qed_dcbx.c | 3 +-- > 1 file changed, 1 insertion(+), 2 deletions(-) > > diff --git a/drivers/net/ethernet/qlogic/qed/qed_dcbx.c > b/drivers/net/ethernet/qlogic/qed/qed_dcbx.c > index 8f6ccc0c39e5..cc9e0dfcee48 100644 > --- a/drivers/net/ethernet/qlogic/qed/qed_dcbx.c > +++ b/drivers/net/ethernet/qlogic/qed/qed_dcbx.c > @@ -1277,11 +1277,10 @@ static struct qed_dcbx_get *qed_dcbnl_get_dcbx(struct > qed_hwfn *hwfn, > { > struct qed_dcbx_get *dcbx_info; > > - dcbx_info = kmalloc(sizeof(*dcbx_info), GFP_ATOMIC); > + dcbx_info = kzalloc(sizeof(*dcbx_info), GFP_ATOMIC); > if (!dcbx_info) > return NULL; > > - memset(dcbx_info, 0, sizeof(*dcbx_info)); > if (qed_dcbx_query_params(hwfn, dcbx_info, type)) { > kfree(dcbx_info); > return NULL; > } > -- > 1.9.1 > -- With Best Regards, Andy Shevchenko
[PATCH 2/2] ip6_tunnel: pass tun_dst arg from ip6_tnl_rcv() to __ip6_tnl_rcv()
Otherwise the tun_dst argument is unused there. Currently, ip6_tnl_rcv() is invoked with tun_dst set to NULL, so there are no actual functional changes introduced in this patch. Fixes: 0d3c703a9d17 ("ipv6: Cleanup IPv6 tunnel receive path") Signed-off-by: Alexey Kodanev --- net/ipv6/ip6_tunnel.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c index a1c2444..bc050e8 100644 --- a/net/ipv6/ip6_tunnel.c +++ b/net/ipv6/ip6_tunnel.c @@ -869,7 +869,7 @@ int ip6_tnl_rcv(struct ip6_tnl *t, struct sk_buff *skb, struct metadata_dst *tun_dst, bool log_ecn_err) { - return __ip6_tnl_rcv(t, skb, tpi, NULL, ip6ip6_dscp_ecn_decapsulate, + return __ip6_tnl_rcv(t, skb, tpi, tun_dst, ip6ip6_dscp_ecn_decapsulate, log_ecn_err); } EXPORT_SYMBOL(ip6_tnl_rcv); -- 1.8.3.1
[PATCH 1/2] gre6: use log_ecn_error module parameter in ip6_tnl_rcv()
After commit 308edfdf1563 ("gre6: Cleanup GREv6 receive path, call common GRE functions") it's not used anywhere in the module, but previously was used in ip6gre_rcv(). Fixes: 308edfdf1563 ("gre6: Cleanup GREv6 receive path, call common GRE functions") Signed-off-by: Alexey Kodanev --- net/ipv6/ip6_gre.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c index 59c121b..5d6bee0 100644 --- a/net/ipv6/ip6_gre.c +++ b/net/ipv6/ip6_gre.c @@ -461,7 +461,7 @@ static int ip6gre_rcv(struct sk_buff *skb, const struct tnl_ptk_info *tpi) &ipv6h->saddr, &ipv6h->daddr, tpi->key, tpi->proto); if (tunnel) { - ip6_tnl_rcv(tunnel, skb, tpi, NULL, false); + ip6_tnl_rcv(tunnel, skb, tpi, NULL, log_ecn_error); return PACKET_RCVD; } -- 1.8.3.1
Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit
On 15.11.2017 19:29, Eric W. Biederman wrote: > Kirill Tkhai writes: > >> On 15.11.2017 09:25, Eric W. Biederman wrote: >>> Kirill Tkhai writes: >>> Currently a mutex is used to protect the pernet operations list. It makes cleanup_net() execute the ->exit methods of the same operations set that was used at the time of ->init, even after the net namespace is unlinked from net_namespace_list. But the problem is that it needs to synchronize_rcu() after the net is removed from net_namespace_list(): Destroy net_ns: cleanup_net() mutex_lock(&net_mutex) list_del_rcu(&net->list) synchronize_rcu() <--- Sleep there for ages list_for_each_entry_reverse(ops, &pernet_list, list) ops_exit_list(ops, &net_exit_list) list_for_each_entry_reverse(ops, &pernet_list, list) ops_free_list(ops, &net_exit_list) mutex_unlock(&net_mutex) This primitive is not fast, especially on systems with many processors and/or when preemptible RCU is enabled in the config. So, all the time while cleanup_net() is waiting for the RCU grace period, creation of new net namespaces is not possible; the tasks that attempt it are sleeping on the same mutex: Create net_ns: copy_net_ns() mutex_lock_killable(&net_mutex)<--- Sleep there for ages The solution is to convert net_mutex into an rw_semaphore. Then pernet_operations::init/::exit methods, modifying the net-related data, will require down_read() locking only, while down_write() will be used for changing pernet_list. This gives a significant performance increase, as you may see below. 
Sequential net namespace creation was measured in a cycle, in a single thread, without other tasks (single user mode): 1)int main(int argc, char *argv[]) { unsigned nr; if (argc < 2) { fprintf(stderr, "Provide nr iterations arg\n"); return 1; } nr = atoi(argv[1]); while (nr-- > 0) { if (unshare(CLONE_NEWNET)) { perror("Can't unshare"); return 1; } } return 0; } Origin, 10 unshare(): 0.03user 23.14system 1:39.85elapsed 23%CPU Patched, 10 unshare(): 0.03user 67.49system 1:08.34elapsed 98%CPU 2)for i in {1..1}; do unshare -n bash -c exit; done Origin: real 1m24,190s user 0m6,225s sys 0m15,132s Patched: real 0m18,235s (4.6 times faster) user 0m4,544s sys 0m13,796s This patch requires commit 76f8507f7a64 "locking/rwsem: Add down_read_killable()" from Linus tree (not in net-next yet). >>> >>> Using a rwsem to protect the list of operations makes sense. >>> >>> That should allow removing the sing >>> >>> I am not wild about taking the rwsem down_write in >>> rtnl_link_unregister, and net_ns_barrier. I think that works but it >>> goes from being a mild hack to being a pretty bad hack and something >>> else that can kill the parallelism you are seeking to add. >>> >>> There are about 204 instances of struct pernet_operations. That is a >>> lot of code to have carefully audited to ensure it can run in parallel all >>> at once. The existence of the exit_batch method, net_ns_barrier, >>> for_each_net and taking of net_mutex in rtnl_link_unregister all testify >>> to the fact that there are data structures accessed by multiple network >>> namespaces. >>> >>> My preference would be to: >>> >>> - Add the net_sem in addition to net_mutex with down_write only held in >>> register and unregister, and maybe net_ns_barrier and >>> rtnl_link_unregister. >>> >>> - Factor out struct pernet_ops out of struct pernet_operations. With >>> struct pernet_ops not having the exit_batch method. With pernet_ops >>> being embedded as an anonymous member of the old struct pernet_operations. 
>>> >>> - Add [un]register_pernet_{sys,dev} functions that take a struct >>> pernet_ops, that don't take net_mutex. Have them order the >>> pernet_list as: >>> >>> pernet_sys >>> pernet_subsys >>> pernet_device >>> pernet_dev >>> >>> With the chunk in the middle taking the net_mutex. >> >> I think this approach will work. Thanks for the suggestion. Some more >> thoughts on the plan below. >> >> The only difficult thing there will be to choose the right order >> to move ops from pernet_subsys to pernet_sys and from pernet_device >> to pernet_dev one by one. >> >> This is rather easy in the case of tristate drivers, as modules may be loaded >> at any time, and the only important ordering is the dependencies between them. >> So, it's possible to start from a module that has no dependencies, >> and move it to pernet_sys, and then continue with
Re: [PATCH][v4] uprobes/x86: emulate push insns for uprobe on x86
On 11/15, Yonghong Song wrote: > > v3 -> v4: > . Revert most of v3 change as 32bit emulation is not really working > on x86_64 platform as among other issues, function emulate_push_stack() > needs to account for 32bit app on 64bit platform. > A separate effort is ongoing to address this issue. Reviewed-by: Oleg Nesterov Please test your patch with the fix below, in this particular case the TIF_IA32 check should be fine. Although this is not what we really want, we should probably use user_64bit_mode(regs) which checks ->cs. But this needs more changes and doesn't solve other problems (get_unmapped_area) so I still can't decide what should we do right now... Oleg. --- a/arch/x86/kernel/uprobes.c +++ b/arch/x86/kernel/uprobes.c @@ -516,7 +516,7 @@ struct uprobe_xol_ops { static inline int sizeof_long(void) { - return in_ia32_syscall() ? 4 : 8; + return test_thread_flag(TIF_IA32) ? 4 : 8; } static int default_pre_xol_op(struct arch_uprobe *auprobe, struct pt_regs *regs)
Re: Bisected 4.14 Regression: IPsec transport mode breakage
On Fri, 2017-11-17 at 11:03 +0100, Steffen Klassert wrote: > On Wed, Nov 15, 2017 at 09:46:19AM -0700, Kevin Locke wrote: >> I have bisected the issue to commit c9f3f813d462. I have attached the >> client ipsec.conf as well as the syslog during the connection attempt >> for both c9f3f813d462 (bad) and cf3796675174 (good). > > The offending commit is already reverted in the 'net' tree > and will be available in mainline soon. Great, thank you! I tested davem/net#94802151894d and can confirm that it works and fixes the issue for me. Thanks again. -- Cheers, | ke...@kevinlocke.name| XMPP: ke...@kevinlocke.name Kevin| https://kevinlocke.name | IRC: kevinoid on freenode
[PATCH 0/4][v6] Add the ability to do BPF directed error injection
I've reworked this to be opt-in only as per Ingo and Alexei. Still needs to go through Dave because of the bpf bits, but I need the tracing guys to weigh in and sign off on my approach please. v5->v6: - add BPF_ALLOW_ERROR_INJECTION() tagging for functions that will support this feature. This way only functions that opt in will be allowed to be overridden. - added a btrfs patch to allow error injection for open_ctree() so that the bpf sample actually works. v4->v5: - disallow kprobe_override programs from being put in the prog map array so we don't tail call into something we didn't check. This allows us to make the normal path still fast without a bunch of percpu operations. v3->v4: - fix a build error found by kbuild test bot (I didn't wait long enough apparently.) - Added a warning message as per Daniel's suggestion. v2->v3: - added a ->kprobe_override flag to bpf_prog. - added some sanity checks to disallow attaching bpf progs that have ->kprobe_override set that aren't for ftrace kprobes. - added the trace_kprobe_ftrace helper to check if the trace_event_call is a ftrace kprobe. - renamed bpf_kprobe_state to bpf_kprobe_override, fixed it so we only read this value in the kprobe path, and thus only write to it if we're overriding or clearing the override. v1->v2: - moved things around to make sure that bpf_override_return could really only be used for an ftrace kprobe. - killed the special return values from trace_call_bpf. - renamed pc_modified to bpf_kprobe_state so bpf_override_return could tell if it was being called from an ftrace kprobe context. - reworked the logic in kprobe_perf_func to take advantage of bpf_kprobe_state. - updated the test as per Alexei's review. - Original message - A lot of our error paths are not well tested because we have no good way of injecting errors generically. Some subsystems (block, memory) have ways to inject errors, but they are random so it's hard to get reproducible results. 
With BPF we can add determinism to our error injection. We can use kprobes and other things to verify we are injecting errors in the exact case we are trying to test. This patch gives us the tool to actually do the error injection part. It is very simple: we just set the return value of the pt_regs we're given to whatever we provide, and then override the PC with a dummy function that simply returns. Right now this only works on x86, but it would be simple enough to expand to other architectures. Thanks, Josef
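The mechanism described in the last paragraph amounts to two register writes in the kprobe handler. A hypothetical userspace model of it is sketched below; the field names mimic x86-64 pt_regs, but this is not the kernel structure, and the "just return" address is a stand-in for the dummy stub:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model of saved register state, not the kernel's pt_regs. */
struct fake_pt_regs {
    uint64_t ax;   /* x86-64 return value register */
    uint64_t ip;   /* where execution resumes after the probe */
};

/* Stand-in address for the dummy function that simply returns. */
static const uint64_t JUST_RETURN_ADDR = 0xdeadbeef;

/* Sketch of what bpf_override_return() does: forge the return value and
 * skip the probed function's body entirely. */
static void override_return_sketch(struct fake_pt_regs *regs, uint64_t rc)
{
    regs->ax = rc;                 /* caller sees this as the return value */
    regs->ip = JUST_RETURN_ADDR;   /* resume at the immediate-return stub */
}
```

Combined with the opt-in tagging from patch 1, this lets a test deterministically make, say, open_ctree() fail with -ENOMEM at exactly the call being exercised.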
[PATCH 2/4] btrfs: make open_ctree error injectable
From: Josef Bacik This allows us to do error injection with BPF for open_ctree. Signed-off-by: Josef Bacik --- fs/btrfs/disk-io.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index dfdab849037b..c6b4e1f07072 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -31,6 +31,7 @@ #include #include #include +#include #include "ctree.h" #include "disk-io.h" #include "hash.h" @@ -3283,6 +3284,7 @@ int open_ctree(struct super_block *sb, goto fail_block_groups; goto retry_root_backup; } +BPF_ALLOW_ERROR_INJECTION(open_ctree); static void btrfs_end_buffer_write_sync(struct buffer_head *bh, int uptodate) { -- 2.7.5
[PATCH 1/4] add infrastructure for tagging functions as error injectable
From: Josef Bacik

Using BPF we can override kprobed functions and return arbitrary values. Obviously this can be a bit unsafe, so make this feature opt-in for functions. Simply tag a function with KPROBE_ERROR_INJECT_SYMBOL in order to give BPF access to that function for error injection purposes.

Signed-off-by: Josef Bacik
---
 arch/x86/include/asm/asm.h        |   6 ++
 include/asm-generic/kprobes.h     |   9 +++
 include/asm-generic/vmlinux.lds.h |  10 +++
 include/linux/kprobes.h           |   1 +
 include/linux/module.h            |   5 ++
 kernel/kprobes.c                  | 163 ++
 kernel/module.c                   |   6 +-
 7 files changed, 199 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h
index b0dc91f4bedc..340f4cc43255 100644
--- a/arch/x86/include/asm/asm.h
+++ b/arch/x86/include/asm/asm.h
@@ -85,6 +85,12 @@
	_ASM_PTR (entry);				\
	.popsection

+# define _ASM_KPROBE_ERROR_INJECT(entry)		\
+	.pushsection "_kprobe_error_inject_list","aw" ;	\
+	_ASM_ALIGN ;					\
+	_ASM_PTR (entry);				\
+	.popsection
+
 .macro ALIGN_DESTINATION
	/* check for bad alignment of destination */
	movl %edi,%ecx
diff --git a/include/asm-generic/kprobes.h b/include/asm-generic/kprobes.h
index 57af9f21d148..f96c4de5d7b0 100644
--- a/include/asm-generic/kprobes.h
+++ b/include/asm-generic/kprobes.h
@@ -22,4 +22,13 @@ static unsigned long __used \
 #endif
 #endif /* defined(__KERNEL__) && !defined(__ASSEMBLY__) */

+#ifdef CONFIG_BPF_KPROBE_OVERRIDE
+#define BPF_ALLOW_ERROR_INJECTION(fname)			\
+static unsigned long __used					\
+	__attribute__((__section__("_kprobe_error_inject_list")))	\
+	_eil_addr_##fname = (unsigned long)fname;
+#else
+#define BPF_ALLOW_ERROR_INJECTION(fname)
+#endif
+
 #endif /* _ASM_GENERIC_KPROBES_H */
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 8acfc1e099e1..85822804861e 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -136,6 +136,15 @@
 #define KPROBE_BLACKLIST()
 #endif

+#ifdef CONFIG_BPF_KPROBE_OVERRIDE
+#define ERROR_INJECT_LIST()	. = ALIGN(8);				\
+		VMLINUX_SYMBOL(__start_kprobe_error_inject_list) = .;	\
+		KEEP(*(_kprobe_error_inject_list))			\
+		VMLINUX_SYMBOL(__stop_kprobe_error_inject_list) = .;
+#else
+#define ERROR_INJECT_LIST()
+#endif
+
 #ifdef CONFIG_EVENT_TRACING
 #define FTRACE_EVENTS()	. = ALIGN(8);					\
		VMLINUX_SYMBOL(__start_ftrace_events) = .;		\
@@ -560,6 +569,7 @@
	FTRACE_EVENTS()							\
	TRACE_SYSCALLS()						\
	KPROBE_BLACKLIST()						\
+	ERROR_INJECT_LIST()						\
	MEM_DISCARD(init.rodata)					\
	CLK_OF_TABLES()							\
	RESERVEDMEM_OF_TABLES()						\
diff --git a/include/linux/kprobes.h b/include/linux/kprobes.h
index bd2684700b74..4f501cb73aec 100644
--- a/include/linux/kprobes.h
+++ b/include/linux/kprobes.h
@@ -271,6 +271,7 @@
 extern bool arch_kprobe_on_func_entry(unsigned long offset);
 extern bool kprobe_on_func_entry(kprobe_opcode_t *addr, const char *sym, unsigned long offset);
 extern bool within_kprobe_blacklist(unsigned long addr);
+extern bool within_kprobe_error_injection_list(unsigned long addr);

 struct kprobe_insn_cache {
	struct mutex mutex;
diff --git a/include/linux/module.h b/include/linux/module.h
index fe5aa3736707..7bb1a9b9a322 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -475,6 +475,11 @@ struct module {
	ctor_fn_t *ctors;
	unsigned int num_ctors;
 #endif
+
+#ifdef CONFIG_BPF_KPROBE_OVERRIDE
+	unsigned int num_kprobe_ei_funcs;
+	unsigned long *kprobe_ei_funcs;
+#endif
 } ____cacheline_aligned __randomize_layout;
 #ifndef MODULE_ARCH_INIT
 #define MODULE_ARCH_INIT {}
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index a1606a4224e1..7afadf07b34e 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -83,6 +83,16 @@
	return &(kretprobe_table_locks[hash].lock);
 }

+/* List of symbols that can be overridden for error injection. */
+static LIST_HEAD(kprobe_error_injection_list);
+static DEFIN
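The linker-section trick used by BPF_ALLOW_ERROR_INJECTION and ERROR_INJECT_LIST above can be demonstrated in userspace C, assuming GCC or clang on an ELF target (which auto-generates `__start_`/`__stop_` symbols for sections whose names are valid C identifiers). The section name `inject_list` and all function names here are illustrative, not the kernel's.

```c
#include <assert.h>

/* Opt-in macro: drop the function's address into a named ELF section,
 * just like BPF_ALLOW_ERROR_INJECTION() does with
 * "_kprobe_error_inject_list" in the patch. */
#define ALLOW_ERROR_INJECTION(fname)					\
	static unsigned long						\
	__attribute__((used, section("inject_list")))			\
	_eil_addr_##fname = (unsigned long)fname;

int open_thing(void)   { return 0; }	/* tagged as injectable */
int not_tagged(void)   { return 1; }	/* not opted in */

ALLOW_ERROR_INJECTION(open_thing)

/* The linker bounds the section with these symbols automatically. */
extern unsigned long __start_inject_list[], __stop_inject_list[];

/* Userspace analogue of within_kprobe_error_injection_list():
 * a linear scan over the section contents. */
int within_inject_list(unsigned long addr)
{
	unsigned long *p;

	for (p = __start_inject_list; p < __stop_inject_list; p++)
		if (*p == addr)
			return 1;
	return 0;
}
```

The kernel version additionally sorts the list and handles per-module sections, but the lookup idea is the same: only addresses deliberately placed in the section pass the check.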
[PATCH 3/4] bpf: add a bpf_override_function helper
From: Josef Bacik

Error injection is sloppy and very ad-hoc. BPF could fill this niche perfectly with its kprobe functionality. We could make sure errors are only triggered in specific call chains that we care about, in very specific situations. Accomplish this with the bpf_override_function helper. This will modify the probed function's return value to the specified value and set the PC to an override function that simply returns, bypassing the originally probed function. This gives us a nice clean way to implement systematic error injection for all of our code paths.

Acked-by: Alexei Starovoitov
Signed-off-by: Josef Bacik
---
 arch/Kconfig                     |  3 +++
 arch/x86/Kconfig                 |  1 +
 arch/x86/include/asm/kprobes.h   |  4 +++
 arch/x86/include/asm/ptrace.h    |  5 
 arch/x86/kernel/kprobes/ftrace.c | 14 ++
 include/linux/filter.h           |  3 ++-
 include/linux/trace_events.h     |  1 +
 include/uapi/linux/bpf.h         |  7 -
 kernel/bpf/core.c                |  3 +++
 kernel/bpf/verifier.c            |  2 ++
 kernel/events/core.c             |  7 +
 kernel/trace/Kconfig             | 11 
 kernel/trace/bpf_trace.c         | 38 +++
 kernel/trace/trace_kprobe.c      | 55 +++-
 kernel/trace/trace_probe.h       | 12 +
 15 files changed, 157 insertions(+), 9 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index d789a89cb32c..4fb618082259 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -195,6 +195,9 @@ config HAVE_OPTPROBES
 config HAVE_KPROBES_ON_FTRACE
	bool

+config HAVE_KPROBE_OVERRIDE
+	bool
+
 config HAVE_NMI
	bool
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 971feac13506..5126d2750dd0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -152,6 +152,7 @@ config X86
	select HAVE_KERNEL_XZ
	select HAVE_KPROBES
	select HAVE_KPROBES_ON_FTRACE
+	select HAVE_KPROBE_OVERRIDE
	select HAVE_KRETPROBES
	select HAVE_KVM
	select HAVE_LIVEPATCH if X86_64
diff --git a/arch/x86/include/asm/kprobes.h b/arch/x86/include/asm/kprobes.h
index 6cf65437b5e5..c6c3b1f4306a 100644
--- a/arch/x86/include/asm/kprobes.h
+++ b/arch/x86/include/asm/kprobes.h
@@ -67,6 +67,10 @@ extern const int kretprobe_blacklist_size;
 void arch_remove_kprobe(struct kprobe *p);
 asmlinkage void kretprobe_trampoline(void);

+#ifdef CONFIG_KPROBES_ON_FTRACE
+extern void arch_ftrace_kprobe_override_function(struct pt_regs *regs);
+#endif
+
 /* Architecture specific copy of original instruction */
 struct arch_specific_insn {
	/* copy of the original instruction */
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 91c04c8e67fa..f04e71800c2f 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -108,6 +108,11 @@ static inline unsigned long regs_return_value(struct pt_regs *regs)
	return regs->ax;
 }

+static inline void regs_set_return_value(struct pt_regs *regs, unsigned long rc)
+{
+	regs->ax = rc;
+}
+
 /*
  * user_mode(regs) determines whether a register set came from user
  * mode. On x86_32, this is true if V8086 mode was enabled OR if the
diff --git a/arch/x86/kernel/kprobes/ftrace.c b/arch/x86/kernel/kprobes/ftrace.c
index 041f7b6dfa0f..3c455bf490cb 100644
--- a/arch/x86/kernel/kprobes/ftrace.c
+++ b/arch/x86/kernel/kprobes/ftrace.c
@@ -97,3 +97,17 @@ int arch_prepare_kprobe_ftrace(struct kprobe *p)
	p->ainsn.boostable = false;
	return 0;
 }
+
+asmlinkage void override_func(void);
+asm(
+	".type override_func, @function\n"
+	"override_func:\n"
+	"	ret\n"
+	".size override_func, .-override_func\n"
+);
+
+void arch_ftrace_kprobe_override_function(struct pt_regs *regs)
+{
+	regs->ip = (unsigned long)&override_func;
+}
+NOKPROBE_SYMBOL(arch_ftrace_kprobe_override_function);
diff --git a/include/linux/filter.h b/include/linux/filter.h
index cdd78a7beaae..dfa44fd74bae 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -458,7 +458,8 @@ struct bpf_prog {
		locked:1,	/* Program image locked? */
		gpl_compatible:1, /* Is filter GPL compatible? */
		cb_access:1,	/* Is control block accessed? */
-		dst_needed:1;	/* Do we need dst entry? */
+		dst_needed:1,	/* Do we need dst entry? */
+		kprobe_override:1; /* Do we override a kprobe? */
	kmemcheck_bitfield_end(meta);
	enum bpf_prog_type	type;	/* Type of BPF program */
	u32			len;	/* Number of filter blocks */
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index fc6aeca945db..be8bd5a8efaa 100644
--- a/include/linux/trace_events.h
+++ b/include/linu
[PATCH 4/4] samples/bpf: add a test for bpf_override_return
From: Josef Bacik

This adds a basic test for bpf_override_return to verify it works. We override the main function for mounting a btrfs fs so it'll return -ENOMEM and then make sure that trying to mount a btrfs fs will fail.

Acked-by: Alexei Starovoitov
Signed-off-by: Josef Bacik
---
 samples/bpf/Makefile                      |  4 
 samples/bpf/test_override_return.sh       | 15 +++
 samples/bpf/tracex7_kern.c                | 16 
 samples/bpf/tracex7_user.c                | 28 
 tools/include/uapi/linux/bpf.h            |  7 ++-
 tools/testing/selftests/bpf/bpf_helpers.h |  3 ++-
 6 files changed, 71 insertions(+), 2 deletions(-)
 create mode 100755 samples/bpf/test_override_return.sh
 create mode 100644 samples/bpf/tracex7_kern.c
 create mode 100644 samples/bpf/tracex7_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index ea2b9e6135f3..83d06bc1f710 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -14,6 +14,7 @@ hostprogs-y += tracex3
 hostprogs-y += tracex4
 hostprogs-y += tracex5
 hostprogs-y += tracex6
+hostprogs-y += tracex7
 hostprogs-y += test_probe_write_user
 hostprogs-y += trace_output
 hostprogs-y += lathist
@@ -58,6 +59,7 @@ tracex3-objs := bpf_load.o $(LIBBPF) tracex3_user.o
 tracex4-objs := bpf_load.o $(LIBBPF) tracex4_user.o
 tracex5-objs := bpf_load.o $(LIBBPF) tracex5_user.o
 tracex6-objs := bpf_load.o $(LIBBPF) tracex6_user.o
+tracex7-objs := bpf_load.o $(LIBBPF) tracex7_user.o
 load_sock_ops-objs := bpf_load.o $(LIBBPF) load_sock_ops.o
 test_probe_write_user-objs := bpf_load.o $(LIBBPF) test_probe_write_user_user.o
 trace_output-objs := bpf_load.o $(LIBBPF) trace_output_user.o
@@ -100,6 +102,7 @@ always += tracex3_kern.o
 always += tracex4_kern.o
 always += tracex5_kern.o
 always += tracex6_kern.o
+always += tracex7_kern.o
 always += sock_flags_kern.o
 always += test_probe_write_user_kern.o
 always += trace_output_kern.o
@@ -153,6 +156,7 @@ HOSTLOADLIBES_tracex3 += -lelf
 HOSTLOADLIBES_tracex4 += -lelf -lrt
 HOSTLOADLIBES_tracex5 += -lelf
 HOSTLOADLIBES_tracex6 += -lelf
+HOSTLOADLIBES_tracex7 += -lelf
 HOSTLOADLIBES_test_cgrp2_sock2 += -lelf
 HOSTLOADLIBES_load_sock_ops += -lelf
 HOSTLOADLIBES_test_probe_write_user += -lelf
diff --git a/samples/bpf/test_override_return.sh b/samples/bpf/test_override_return.sh
new file mode 100755
index ..e68b9ee6814b
--- /dev/null
+++ b/samples/bpf/test_override_return.sh
@@ -0,0 +1,15 @@
+#!/bin/bash
+
+rm -f testfile.img
+dd if=/dev/zero of=testfile.img bs=1M seek=1000 count=1
+DEVICE=$(losetup --show -f testfile.img)
+mkfs.btrfs -f $DEVICE
+mkdir tmpmnt
+./tracex7 $DEVICE
+if [ $? -eq 0 ]
+then
+	echo "SUCCESS!"
+else
+	echo "FAILED!"
+fi
+losetup -d $DEVICE
diff --git a/samples/bpf/tracex7_kern.c b/samples/bpf/tracex7_kern.c
new file mode 100644
index ..1ab308a43e0f
--- /dev/null
+++ b/samples/bpf/tracex7_kern.c
@@ -0,0 +1,16 @@
+#include
+#include
+#include
+#include "bpf_helpers.h"
+
+SEC("kprobe/open_ctree")
+int bpf_prog1(struct pt_regs *ctx)
+{
+	unsigned long rc = -12;
+
+	bpf_override_return(ctx, rc);
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/tracex7_user.c b/samples/bpf/tracex7_user.c
new file mode 100644
index ..8a52ac492e8b
--- /dev/null
+++ b/samples/bpf/tracex7_user.c
@@ -0,0 +1,28 @@
+#define _GNU_SOURCE
+
+#include
+#include
+#include
+#include "libbpf.h"
+#include "bpf_load.h"
+
+int main(int argc, char **argv)
+{
+	FILE *f;
+	char filename[256];
+	char command[256];
+	int ret;
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+	if (load_bpf_file(filename)) {
+		printf("%s", bpf_log_buf);
+		return 1;
+	}
+
+	snprintf(command, 256, "mount %s tmpmnt/", argv[1]);
+	f = popen(command, "r");
+	ret = pclose(f);
+
+	return ret ? 0 : 1;
+}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 4a4b6e78c977..3756dde69834 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -673,6 +673,10 @@ union bpf_attr {
  * @buf: buf to fill
  * @buf_size: size of the buf
  * Return : 0 on success or negative error code
+ *
+ * int bpf_override_return(pt_regs, rc)
+ * @pt_regs: pointer to struct pt_regs
+ * @rc: the return value to set
  */
 #define __BPF_FUNC_MAPPER(FN) \
	FN(unspec), \
@@ -732,7 +736,8 @@ union bpf_attr {
	FN(xdp_adjust_meta),		\
	FN(perf_event_read_value),	\
	FN(perf_prog_read_value),	\
-	FN(getsockopt),
+	FN(getsockopt),			\
+	FN(override_return),

 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/to
Re: [PATCH][v4] uprobes/x86: emulate push insns for uprobe on x86
On 11/17/17 9:25 AM, Oleg Nesterov wrote:
> On 11/15, Yonghong Song wrote:
>> v3 -> v4:
>>  . Revert most of the v3 change, as 32-bit emulation is not really working
>>    on the x86_64 platform: among other issues, emulate_push_stack() needs
>>    to account for a 32-bit app on a 64-bit platform. A separate effort is
>>    ongoing to address this issue.
>
> Reviewed-by: Oleg Nesterov
>
> Please test your patch with the fix below; in this particular case the
> TIF_IA32 check should be fine. Although this is not what we really want, we
> should probably use user_64bit_mode(regs), which checks ->cs. But this needs
> more changes and doesn't solve other problems (get_unmapped_area), so I
> still can't decide what we should do right now...

I tested the change below together with my patch. On x86_64, both 64-bit and
32-bit programs can be uprobe-emulated properly. On x86_32, however, there is
a compilation error like below:

  In function 'check_copy_size',
      inlined from 'copy_to_user' at /home/yhs/work/tip/include/linux/uaccess.h:154:6,
      inlined from 'emulate_push_stack.isra.9' at /home/yhs/work/tip/arch/x86/kernel/uprobes.c:535:6:
  /home/yhs/work/tip/include/linux/thread_info.h:139:4: error: call to
  '__bad_copy_from' declared with attribute error: copy source size is too small
     __bad_copy_from();

Basically, test_thread_flag(TIF_IA32) returns 0 on an x86_32 system.

> Oleg.
>
> --- a/arch/x86/kernel/uprobes.c
> +++ b/arch/x86/kernel/uprobes.c
> @@ -516,7 +516,7 @@ struct uprobe_xol_ops {
>
>  static inline int sizeof_long(void)
>  {
> -	return in_ia32_syscall() ? 4 : 8;
> +	return test_thread_flag(TIF_IA32) ? 4 : 8;
>  }
>
>  static int default_pre_xol_op(struct arch_uprobe *auprobe, struct pt_regs *regs)
[PATCH net-next 1/2] net-next: use five-tuple hash for sk_txhash
From: Shaohua Li

We are using sk_txhash to calculate the flowlabel, but sk_txhash isn't always available, for example in inet_timewait_sock. This is a problem for reset packets, which will then carry a different flowlabel, and that means our router doesn't correctly close the tcp connection. We are using the flowlabel to do load balancing: routers in the path maintain connection state, so if the flow label changes, the packet is routed through a different router, and the old router doesn't get the reset packet to close the tcp connection.

Per Tom's suggestion, we switch back to a five-tuple hash, so we can reconstruct the correct flowlabel for the reset packet. At most places we already have the flowi info, so we directly use it to build sk_txhash. For synack we do this after the route search; at that time we have the flowi info ready, so we don't need to create it again.

I don't change sk_rethink_txhash() though; it still uses a random hash, which is the whole point of selecting a different path after a negative routing advice.

Cc: Martin KaFai Lau
Cc: Eric Dumazet
Cc: Florent Fourcot
Cc: Cong Wang
Cc: Tom Herbert
Signed-off-by: Shaohua Li
---
 include/net/sock.h    | 18 --
 include/net/tcp.h     |  2 +-
 net/ipv4/datagram.c   |  2 +-
 net/ipv4/syncookies.c |  4 +++-
 net/ipv4/tcp_input.c  |  1 -
 net/ipv4/tcp_ipv4.c   | 17 -
 net/ipv4/tcp_output.c |  1 -
 net/ipv6/datagram.c   |  4 +++-
 net/ipv6/syncookies.c |  3 ++-
 net/ipv6/tcp_ipv6.c   | 18 +-
 10 files changed, 39 insertions(+), 31 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index f8715c5..85a6192 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1732,22 +1732,12 @@ static inline kuid_t sock_net_uid(const struct net *net, const struct sock *sk)
	return sk ? sk->sk_uid : make_kuid(net->user_ns, 0);
 }

-static inline u32 net_tx_rndhash(void)
-{
-	u32 v = prandom_u32();
-
-	return v ?: 1;
-}
-
-static inline void sk_set_txhash(struct sock *sk)
-{
-	sk->sk_txhash = net_tx_rndhash();
-}
-
 static inline void sk_rethink_txhash(struct sock *sk)
 {
-	if (sk->sk_txhash)
-		sk_set_txhash(sk);
+	if (sk->sk_txhash) {
+		u32 v = prandom_u32();
+		sk->sk_txhash = v ?: 1;
+	}
 }

 static inline struct dst_entry *
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 85ea578..8d68fde 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1840,7 +1840,7 @@ struct tcp_request_sock_ops {
					 __u16 *mss);
 #endif
	struct dst_entry *(*route_req)(const struct sock *sk, struct flowi *fl,
-				       const struct request_sock *req);
+				       struct request_sock *req);
	u32 (*init_seq)(const struct sk_buff *skb);
	u32 (*init_ts_off)(const struct net *net, const struct sk_buff *skb);
	int (*send_synack)(const struct sock *sk, struct dst_entry *dst,
diff --git a/net/ipv4/datagram.c b/net/ipv4/datagram.c
index f915abf..ed9ccb7 100644
--- a/net/ipv4/datagram.c
+++ b/net/ipv4/datagram.c
@@ -74,7 +74,7 @@ int __ip4_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len
	inet->inet_daddr = fl4->daddr;
	inet->inet_dport = usin->sin_port;
	sk->sk_state = TCP_ESTABLISHED;
-	sk_set_txhash(sk);
+	sk->sk_txhash = get_hash_from_flowi4(fl4);
	inet->inet_id = jiffies;

	sk_dst_set(sk, &rt->dst);
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index fda37f2..76f1cf6 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -335,7 +335,6 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
	treq->rcv_isn		= ntohl(th->seq) - 1;
	treq->snt_isn		= cookie;
	treq->ts_off		= 0;
-	treq->txhash		= net_tx_rndhash();
	req->mss		= mss;
	ireq->ir_num		= ntohs(th->dest);
	ireq->ir_rmt_port	= th->source;
@@ -376,6 +375,9 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
			   opt->srr ? opt->faddr : ireq->ir_rmt_addr,
			   ireq->ir_loc_addr, th->source, th->dest, sk->sk_uid);
	security_req_classify_flow(req, flowi4_to_flowi(&fl4));
+
+	treq->txhash = get_hash_from_flowi4(&fl4);
+
	rt = ip_route_output_key(sock_net(sk), &fl4);
	if (IS_ERR(rt)) {
		reqsk_free(req);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index dabbf1d..92b4a10 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6289,7 +6289,6 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
	}

	tcp_rsk(req)->snt_isn = isn;
-	tcp_rsk(req)->txhash = net_tx_rndhash();
	tcp_openreq_init_rwin(req, sk, dst);
	if (!want_cookie) {
		tcp_reqsk_record_syn(sk, req, skb);
diff --git a/net/ipv4
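The key property the patch relies on is that a hash over the connection 5-tuple is deterministic: any packet of the flow, including a reset built from an inet_timewait_sock, derives the same sk_txhash. A userspace sketch (using a Jenkins one-at-a-time hash as a stand-in for the kernel's get_hash_from_flowi4(), which it is not) illustrates this:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative 5-tuple; the kernel builds the equivalent from the flowi. */
struct five_tuple {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
	uint8_t  proto;
};

/* Jenkins one-at-a-time hash -- an illustrative stand-in, not the
 * kernel's flow hash. The point is only that it is deterministic. */
static uint32_t hash_bytes(const void *buf, size_t len)
{
	const uint8_t *p = buf;
	uint32_t h = 0;
	size_t i;

	for (i = 0; i < len; i++) {
		h += p[i];
		h += h << 10;
		h ^= h >> 6;
	}
	h += h << 3;
	h ^= h >> 11;
	h += h << 15;
	return h ? h : 1;	/* like sk_txhash, never 0 */
}

/* Deterministic txhash: same 5-tuple in, same hash out -- so a reset
 * packet can recompute the hash (and hence the flowlabel) on its own. */
uint32_t txhash(const struct five_tuple *ft)
{
	uint8_t key[13];

	memcpy(key, &ft->saddr, 4);
	memcpy(key + 4, &ft->daddr, 4);
	memcpy(key + 8, &ft->sport, 2);
	memcpy(key + 10, &ft->dport, 2);
	key[12] = ft->proto;
	return hash_bytes(key, sizeof(key));
}
```

This is exactly what the previous `net_tx_rndhash()` could not provide: a random per-socket value is lost once the full socket is gone.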
[PATCH net-next 0/2] net: fix flowlabel inconsistency in reset packet
From: Shaohua Li

Hi,

Please see the tcpdump output below:

21:00:48.109122 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 40) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags [S], cksum 0x0529 (incorrect -> 0xf56c), seq 3282214508, win 43690, options [mss 65476,sackOK,TS val 2500903437 ecr 0,nop,wscale 7], length 0

21:00:48.109381 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 40) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags [S.], cksum 0x0529 (incorrect -> 0x49ad), seq 1923801573, ack 3282214509, win 43690, options [mss 65476,sackOK,TS val 2500903437 ecr 2500903437,nop,wscale 7], length 0

21:00:48.109548 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags [.], cksum 0x0521 (incorrect -> 0x1bdf), seq 1, ack 1, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 0

21:00:48.109823 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 62) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags [P.], cksum 0x053f (incorrect -> 0xb8b1), seq 1:31, ack 1, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 30

21:00:48.109910 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags [.], cksum 0x0521 (incorrect -> 0x1bc1), seq 1, ack 31, win 342, options [nop,nop,TS val 2500903437 ecr 2500903437], length 0

21:00:48.110043 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 56) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags [P.], cksum 0x0539 (incorrect -> 0xb726), seq 1:25, ack 31, win 342, options [nop,nop,TS val 2500903438 ecr 2500903437], length 24

21:00:48.110173 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags [.], cksum 0x0521 (incorrect -> 0x1ba7), seq 31, ack 25, win 342, options [nop,nop,TS val 2500903438 ecr 2500903438], length 0

21:00:48.110211 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags [F.], cksum 0x0521 (incorrect -> 0x1ba7), seq 25, ack 31, win 342, options [nop,nop,TS val 2500903438 ecr 2500903437], length 0

21:00:48.151099 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags [.], cksum 0x0521 (incorrect -> 0x1ba6), seq 31, ack 26, win 342, options [nop,nop,TS val 2500903438 ecr 2500903438], length 0

21:00:49.110524 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload length: 56) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags [P.], cksum 0x0539 (incorrect -> 0xb324), seq 31:55, ack 26, win 342, options [nop,nop,TS val 2500904438 ecr 2500903438], length 24

21:00:49.110637 IP6 (flowlabel 0xb34d5, hlim 64, next-header TCP (6) payload length: 20) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags [R], cksum 0x0515 (incorrect -> 0x668c), seq 1923801599, win 0, length 0

The tcp reset packet has a different flowlabel, and because of that our router doesn't correctly close the tcp connection. We are using the flowlabel to do load balancing: routers in the path maintain connection state, so if the flow label changes, the packet is routed through a different router, and the old router doesn't get the reset packet to close the tcp connection.

The reason is that a normal packet gets its skb->hash from sk->sk_txhash, which is generated randomly, and ip6_make_flowlabel then uses that hash to create a flowlabel. The reset packet doesn't get assigned a hash, so its flowlabel is calculated from the flowi6 instead. The patches fix the issue.

Thanks,
Shaohua

Shaohua Li (2):
  net-next: use five-tuple hash for sk_txhash
  net-next: copy user configured flowlabel to reset packet

 include/net/sock.h    | 18 --
 include/net/tcp.h     |  2 +-
 net/ipv4/datagram.c   |  2 +-
 net/ipv4/syncookies.c |  4 +++-
 net/ipv4/tcp_input.c  |  1 -
 net/ipv4/tcp_ipv4.c   | 17 -
 net/ipv4/tcp_output.c |  1 -
 net/ipv6/datagram.c   |  4 +++-
 net/ipv6/syncookies.c |  3 ++-
 net/ipv6/tcp_ipv6.c   | 36 ++--
 10 files changed, 56 insertions(+), 32 deletions(-)

--
2.9.5
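The hash-to-flowlabel step mentioned above can be modeled in a few lines. This is a rough userspace sketch of the fold ip6_make_flowlabel() performs (the kernel additionally mixes in a secret and consults the auto-flowlabel sysctls, which are omitted here): the 32-bit hash is folded down into the 20-bit IPv6 flow label field.

```c
#include <assert.h>
#include <stdint.h>

#define FLOWLABEL_MASK 0xFFFFFu	/* the IPv6 flow label is 20 bits wide */

/* Rough model of ip6_make_flowlabel(): fold the 32-bit skb/socket hash
 * into the 20-bit label space. Deterministic: same hash, same label. */
uint32_t make_flowlabel(uint32_t hash)
{
	hash ^= hash >> 12;	/* fold high bits down before masking */
	return hash & FLOWLABEL_MASK;
}
```

Because the function is deterministic, any packet that derives the same hash (which the five-tuple patch guarantees) also derives the same flowlabel; a packet with no hash at all, like the reset above, falls back to a different computation and ends up with a different label.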
[PATCH net-next 2/2] net-next: copy user configured flowlabel to reset packet
From: Shaohua Li

The reset packet doesn't use the user-configured flowlabel; instead, it always uses 0. This causes a flowlabel inconsistency. The tw sock already records the flowlabel info, so we can use it directly.

Cc: Martin KaFai Lau
Cc: Eric Dumazet
Cc: Florent Fourcot
Cc: Cong Wang
Cc: Tom Herbert
Signed-off-by: Shaohua Li
---
 net/ipv6/tcp_ipv6.c | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index a1a5802..9b678cd 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -901,6 +901,8 @@ static void tcp_v6_send_reset(const struct sock *sk, struct sk_buff *skb)
	struct sock *sk1 = NULL;
 #endif
	int oif = 0;
+	u8 tclass = 0;
+	__be32 flowlabel = 0;

	if (th->rst)
		return;
@@ -954,7 +956,21 @@ static void tcp_v6_send_reset(const struct sock *sk, struct sk_buff *skb)
		trace_tcp_send_reset(sk, skb);
	}

-	tcp_v6_send_response(sk, skb, seq, ack_seq, 0, 0, 0, oif, key, 1, 0, 0);
+	if (sk) {
+		if (sk_fullsock(sk)) {
+			struct ipv6_pinfo *np = inet6_sk(sk);
+
+			tclass = np->tclass;
+			flowlabel = np->flow_label & IPV6_FLOWLABEL_MASK;
+		} else {
+			struct inet_timewait_sock *tw = inet_twsk(sk);
+
+			tclass = tw->tw_tclass;
+			flowlabel = cpu_to_be32(tw->tw_flowlabel);
+		}
+	}
+	tcp_v6_send_response(sk, skb, seq, ack_seq, 0, 0, 0, oif, key, 1,
+			     tclass, flowlabel);

 #ifdef CONFIG_TCP_MD5SIG
 out:
--
2.9.5
[PATCH RFC 01/25] net: Assign net to net_namespace_list in setup_net()
This patch merges two repeating pieces of code into one; they will live in setup_net() now. It counts as a cleanup even though the init_net_initialized assignment is now reordered with the linking of the net. This variable is needed for proc_net_init(), called from start_kernel()->proc_root_init()->proc_net_init(), which can't race with net_ns_init(), called from an initcall.

Signed-off-by: Kirill Tkhai
---
 net/core/net_namespace.c | 13 +++--
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index b797832565d3..7ecf71050ffa 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -296,6 +296,9 @@ static __net_init int setup_net(struct net *net, struct user_namespace *user_ns)
		if (error < 0)
			goto out_undo;
	}
+	rtnl_lock();
+	list_add_tail_rcu(&net->list, &net_namespace_list);
+	rtnl_unlock();
 out:
	return error;

@@ -417,11 +420,6 @@ struct net *copy_net_ns(unsigned long flags,
	net->ucounts = ucounts;
	rv = setup_net(net, user_ns);
-	if (rv == 0) {
-		rtnl_lock();
-		list_add_tail_rcu(&net->list, &net_namespace_list);
-		rtnl_unlock();
-	}
	mutex_unlock(&net_mutex);
	if (rv < 0) {
		dec_net_namespaces(ucounts);
@@ -847,11 +845,6 @@ static int __init net_ns_init(void)
		panic("Could not setup the initial network namespace");

	init_net_initialized = true;
-
-	rtnl_lock();
-	list_add_tail_rcu(&init_net.list, &net_namespace_list);
-	rtnl_unlock();
-
	mutex_unlock(&net_mutex);

	register_pernet_subsys(&net_ns_ops);
[PATCH RFC 00/25] Replacing net_mutex with rw_semaphore
Hi,

this is a continuation of the discussion from here: https://lkml.org/lkml/2017/11/14/298

The plan has changed a little bit, so I'd be happy to hear people's comments before I dive into all 400+ pernet subsystems and devices.

The patch set adds a pernet sys list ahead of subsys and device, and it's used for pernet_operations which may be executed in parallel with any other pernet_operations methods. Also, some high-priority ops are converted (up to those registered using postcore_initcall(), and some subsys_initcall()) in order of appearance.

The sequence in setup_net() is the following:

1) execute all the callbacks from the pernet_sys list
2) lock net_mutex
3) execute all the callbacks from the pernet_subsys list
4) execute all the callbacks from the pernet_device list
5) unlock net_mutex

There have been no pernet_operations requiring additional synchronization yet, but I've bumped into another problem. The problem is that some drivers may be compiled as modules or as a built-in part of the kernel image. They register pernet_operations from device_initcall(), for example. This initcall executes at a different time compared to in-kernel built-in-only drivers.

Imagine we have a tristate driverA and a bool driverB. driverA registers pernet_subsys from subsys_initcall(). driverB registers pernet_subsys from fs_initcall(). So here we have two cases:

  driverA is module         driverA is built-in
  -----------------         -------------------
  register driverB ops      register driverA ops
  register driverA ops      register driverB ops

So, the order is different. When converting drivers one by one, it's impossible to make the order hold for all .config states, because of the above. So bisect won't work. And it seems it's just the same as converting pernet_operations from all the files in alphabetical file order.

What do you think about this? (Note, the patches have no such problem at the moment, as they only cover in-kernel early core drivers.) Maybe there are other comments on the code.
---

Kirill Tkhai (25):
  net: Assign net to net_namespace_list in setup_net()
  net: Cleanup copy_net_ns()
  net: Introduce net_sem for protection of pernet_list
  net: Move mutex_unlock() in cleanup_net() up
  net: Add primitives to update heads of pernet_list sublists
  net: Add pernet sys and registration functions
  net: Make sys sublist pernet_operations executed out of net_mutex
  net: Move proc_net_ns_ops to pernet_sys list
  net: Move net_ns_ops to pernet_sys list
  net: Move sysctl_pernet_ops to pernet_sys list
  net: Move netfilter_net_ops to pernet_sys list
  net: Move nf_log_net_ops to pernet_sys list
  net: Move net_inuse_ops to pernet_sys list
  net: Move net_defaults_ops to pernet_sys list
  net: Move netlink_net_ops to pernet_sys list
  net: Move rtnetlink_net_ops to pernet_sys list
  net: Move audit_net_ops to pernet_sys list
  net: Move uevent_net_ops to pernet_sys list
  net: Move proto_net_ops to pernet_sys list
  net: Move pernet_subsys, registered via net_dev_init(), to pernet_sys list
  net: Move fib_* pernet_operations, registered via subsys_initcall(), to pernet_sys list
  net: Move subsys_initcall() registered pernet_operations from net/sched to pernet_sys list
  net: Move genl_pernet_ops to pernet_sys list
  net: Move wext_pernet_ops to pernet_sys list
  net: Move sysctl_core_ops to pernet_sys list

 fs/proc/proc_net.c          |   2
 include/linux/rtnetlink.h   |   1
 include/net/net_namespace.h |   2
 kernel/audit.c              |   2
 lib/kobject_uevent.c        |   2
 net/core/dev.c              |   2
 net/core/fib_notifier.c     |   2
 net/core/fib_rules.c        |   2
 net/core/net-procfs.c       |   4
 net/core/net_namespace.c    | 203 +--
 net/core/rtnetlink.c        |   6 +
 net/core/sock.c             |   4
 net/core/sysctl_net_core.c  |   2
 net/netfilter/core.c        |   2
 net/netfilter/nf_log.c      |   2
 net/netlink/af_netlink.c    |   2
 net/netlink/genetlink.c     |   2
 net/sched/act_api.c         |   2
 net/sched/sch_api.c         |   2
 net/sysctl_net.c            |   2
 net/wireless/wext-core.c    |   2
 21 files changed, 183 insertions(+), 67 deletions(-)

--
Signed-off-by: Kirill Tkhai
[PATCH RFC 02/25] net: Cleanup copy_net_ns()
Line up destructor actions in the reverse order of the constructors. Following patches will add more actions, and this will be convenient if such an ordering exists.

Signed-off-by: Kirill Tkhai
---
 net/core/net_namespace.c | 20 +---
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 7ecf71050ffa..2e512965bf42 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -404,27 +404,25 @@ struct net *copy_net_ns(unsigned long flags,

	net = net_alloc();
	if (!net) {
-		dec_net_namespaces(ucounts);
-		return ERR_PTR(-ENOMEM);
+		rv = -ENOMEM;
+		goto dec_ucounts;
	}
-
+	refcount_set(&net->passive, 1);
+	net->ucounts = ucounts;
	get_user_ns(user_ns);

	rv = mutex_lock_killable(&net_mutex);
-	if (rv < 0) {
-		net_free(net);
-		dec_net_namespaces(ucounts);
-		put_user_ns(user_ns);
-		return ERR_PTR(rv);
-	}
+	if (rv < 0)
+		goto put_userns;

-	net->ucounts = ucounts;
	rv = setup_net(net, user_ns);
	mutex_unlock(&net_mutex);
	if (rv < 0) {
-		dec_net_namespaces(ucounts);
+put_userns:
		put_user_ns(user_ns);
		net_drop_ns(net);
+dec_ucounts:
+		dec_net_namespaces(ucounts);
		return ERR_PTR(rv);
	}
	return net;
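The idiom the patch lines up is the kernel's usual goto-based unwinding: each error label undoes exactly the constructors that completed before the failure, in reverse order, so labels can be fallen through top to bottom. A minimal userspace sketch (all names illustrative, error codes arbitrary):

```c
#include <assert.h>
#include <stdlib.h>

/* Two "constructors" run in order; each failure path unwinds only what
 * was already built, mirroring the cleaned-up copy_net_ns() above.
 * fail_at forces constructor 1 or 2 to fail, for demonstration. */
int do_setup(int fail_at)
{
	int rv;
	char *a, *b;

	a = (fail_at == 1) ? NULL : malloc(16);	/* constructor 1 */
	if (!a) {
		rv = -1;
		goto out;		/* nothing to unwind yet */
	}

	b = (fail_at == 2) ? NULL : malloc(16);	/* constructor 2 */
	if (!b) {
		rv = -2;
		goto free_a;		/* undo constructor 1 only */
	}

	/* ... normal work with a and b would go here ... */
	rv = 0;

	free(b);			/* teardown mirrors setup, reversed */
free_a:
	free(a);
out:
	return rv;
}
```

With the labels in constructor-reverse order, adding a third resource later means adding one allocation, one `goto free_b`, and one label, without touching the existing paths, which is exactly why the patch wants this shape before more actions are added.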
[PATCH RFC 08/25] net: Move proc_net_ns_ops to pernet_sys list
This patch starts to convert pernet_subsys registered before initcalls. Since proc_net_ns_ops is a registered pernet_subsys made from start_kernel()->proc_root_init()->proc_net_init(), and there is no pernet_subsys registered earlier, we start with it.

proc_net_ns_ops::proc_net_ns_init()/proc_net_ns_exit() register the pernet net->proc_net and ->proc_net_stat, and the constructors and destructors of other pernet_operations are not interested in a foreign net's proc_net and proc_net_stat. Proc filesystem primitives are synchronized on proc_subdir_lock. So it's safe to move proc_net_ns_ops to the pernet_sys list and execute its methods in parallel with other pernet operations.

Signed-off-by: Kirill Tkhai
---
 fs/proc/proc_net.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
index a2bf369c923d..5eb52765eeab 100644
--- a/fs/proc/proc_net.c
+++ b/fs/proc/proc_net.c
@@ -243,5 +243,5 @@ int __init proc_net_init(void)
 {
	proc_symlink("net", NULL, "self/net");

-	return register_pernet_subsys(&proc_net_ns_ops);
+	return register_pernet_sys(&proc_net_ns_ops);
 }
[PATCH RFC 07/25] net: Make sys sublist pernet_operations executed out of net_mutex
Move net_mutex to setup_net() and cleanup_net(), and do not hold it, while sys sublist methods are executed. Signed-off-by: Kirill Tkhai --- net/core/net_namespace.c | 44 +++- 1 file changed, 35 insertions(+), 9 deletions(-) diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c index f4f4aaa5ce1f..7aec8c1afe50 100644 --- a/net/core/net_namespace.c +++ b/net/core/net_namespace.c @@ -84,11 +84,11 @@ static int net_assign_generic(struct net *net, unsigned int id, void *data) { struct net_generic *ng, *old_ng; - BUG_ON(!mutex_is_locked(&net_mutex)); + BUG_ON(!rwsem_is_locked(&net_sem)); BUG_ON(id < MIN_PERNET_OPS_ID); old_ng = rcu_dereference_protected(net->gen, - lockdep_is_held(&net_mutex)); + lockdep_is_held(&net_sem)); if (old_ng->s.len > id) { old_ng->ptr[id] = data; return 0; @@ -300,6 +300,7 @@ static __net_init int setup_net(struct net *net, struct user_namespace *user_ns) { /* Must be called with net_mutex held */ const struct pernet_operations *ops, *saved_ops; + bool locked = false; int error = 0; LIST_HEAD(net_exit_list); @@ -311,14 +312,34 @@ static __net_init int setup_net(struct net *net, struct user_namespace *user_ns) spin_lock_init(&net->nsid_lock); list_for_each_entry(ops, &pernet_list, list) { + if (&ops->list == first_subsys) { + BUG_ON(locked); + error = mutex_lock_killable(&net_mutex); + if (error) + goto out_undo; + locked = true; + } + error = ops_init(ops, net); if (error < 0) goto out_undo; } + + if (!locked) { + /* +* This may happen only on early boot, so we don't +* care about possibility to interrupt the locking. 
+*/ + mutex_lock(&net_mutex); + locked = true; + } + rtnl_lock(); list_add_tail_rcu(&net->list, &net_namespace_list); rtnl_unlock(); out: + if (locked) + mutex_unlock(&net_mutex); return error; out_undo: @@ -433,12 +454,7 @@ struct net *copy_net_ns(unsigned long flags, rv = down_read_killable(&net_sem); if (rv < 0) goto put_userns; - rv = mutex_lock_killable(&net_mutex); - if (rv < 0) - goto up_read; rv = setup_net(net, user_ns); - mutex_unlock(&net_mutex); -up_read: up_read(&net_sem); if (rv < 0) { put_userns: @@ -460,6 +476,7 @@ static void cleanup_net(struct work_struct *work) struct net *net, *tmp; struct list_head net_kill_list; LIST_HEAD(net_exit_list); + bool locked; /* Atomically snapshot the list of namespaces to cleanup */ spin_lock_irq(&cleanup_list_lock); @@ -468,6 +485,7 @@ static void cleanup_net(struct work_struct *work) down_read(&net_sem); mutex_lock(&net_mutex); + locked = true; /* Don't let anyone else find us. */ rtnl_lock(); @@ -500,10 +518,18 @@ static void cleanup_net(struct work_struct *work) synchronize_rcu(); /* Run all of the network namespace exit methods */ - list_for_each_entry_reverse(ops, &pernet_list, list) + list_for_each_entry_reverse(ops, &pernet_list, list) { ops_exit_list(ops, &net_exit_list); - mutex_unlock(&net_mutex); + if (&ops->list == first_subsys) { + BUG_ON(!locked); + mutex_unlock(&net_mutex); + locked = false; + } + } + + if (locked) + mutex_unlock(&net_mutex); /* Free the net generic variables */ list_for_each_entry_reverse(ops, &pernet_list, list)
[PATCH RFC 12/25] net: Move nf_log_net_ops to pernet_sys list
nf_log_net_ops is registered in the same initcall as netfilter_net_ops, so it has to be moved right after netfilter_net_ops. The ops would have a problem with parallel execution with others if it were possible for init_net to be released. But it's not, and the rest is safe in that regard. There is a memory allocation, which nobody else is interested in, and a sysctl registration. So, we move it to the pernet_sys list. Signed-off-by: Kirill Tkhai --- net/netfilter/nf_log.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/netfilter/nf_log.c b/net/netfilter/nf_log.c index 8bb152a7cca4..08868afad813 100644 --- a/net/netfilter/nf_log.c +++ b/net/netfilter/nf_log.c @@ -582,5 +582,5 @@ static struct pernet_operations nf_log_net_ops = { int __init netfilter_log_init(void) { - return register_pernet_subsys(&nf_log_net_ops); + return register_pernet_sys(&nf_log_net_ops); }
[PATCH RFC 09/25] net: Move net_ns_ops to pernet_sys list
This patch starts converting the pernet_subsys operations registered from pure initcalls. Since net_ns_init() is the only pure initcall in the net subsystem, and there are no early initcalls, the pernet subsys it registers is the first in the pernet_operations list. So, we start with it. The net_ns_ops methods, net_ns_net_init()/net_ns_net_exit(), use only ida_simple_* functions, which do not need synchronization. So it's safe to execute them in parallel with any other pernet_operations, and thus we convert net_ns_ops to the pernet_sys type. Signed-off-by: Kirill Tkhai --- net/core/net_namespace.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c index 7aec8c1afe50..2e8295aa7003 100644 --- a/net/core/net_namespace.c +++ b/net/core/net_namespace.c @@ -899,7 +899,7 @@ static int __init net_ns_init(void) init_net_initialized = true; up_write(&net_sem); - register_pernet_subsys(&net_ns_ops); + register_pernet_sys(&net_ns_ops); rtnl_register(PF_UNSPEC, RTM_NEWNSID, rtnl_net_newid, NULL, RTNL_FLAG_DOIT_UNLOCKED);
[PATCH RFC 15/25] net: Move netlink_net_ops to pernet_sys list
According to net/core/Makefile, net/core/af_netlink.o core initcalls execute right after net/core/net_namespace.o. The methods of netlink_net_ops create and destroy the "netlink" proc file, which is of no interest to foreign pernet_operations. So, netlink_net_ops may safely be moved to the pernet_sys list. Signed-off-by: Kirill Tkhai --- net/netlink/af_netlink.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c index b9e0ee4e22f5..a4f1f5222b79 100644 --- a/net/netlink/af_netlink.c +++ b/net/netlink/af_netlink.c @@ -2735,7 +2735,7 @@ static int __init netlink_proto_init(void) netlink_add_usersock_entry(); sock_register(&netlink_family_ops); - register_pernet_subsys(&netlink_net_ops); + register_pernet_sys(&netlink_net_ops); /* The netlink device handler may be needed early. */ rtnetlink_init(); out:
[PATCH RFC 13/25] net: Move net_inuse_ops to pernet_sys list
net/core/sock.o is the first linked file in net/core/Makefile, so its core initcall executes first. The net_inuse_ops methods expose statistics in /proc. Nothing in the rest of the pernet_subsys or pernet_device lists touches net::core::inuse. So, it's safe to move net_inuse_ops to the pernet_sys list. Signed-off-by: Kirill Tkhai --- net/core/sock.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/core/sock.c b/net/core/sock.c index 13719af7b4e3..be050b044699 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -3081,7 +3081,7 @@ static struct pernet_operations net_inuse_ops = { static __init int net_inuse_init(void) { - if (register_pernet_subsys(&net_inuse_ops)) + if (register_pernet_sys(&net_inuse_ops)) panic("Cannot initialize net inuse counters"); return 0;
[PATCH RFC 21/25] net: Move fib_* pernet_operations, registered via subsys_initcall(), to pernet_sys list
Both of them create and initialize lists which are not touched by any other, foreign pernet_operations. Signed-off-by: Kirill Tkhai --- net/core/fib_notifier.c |2 +- net/core/fib_rules.c|2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/net/core/fib_notifier.c b/net/core/fib_notifier.c index 0c048bdeb016..782a1475a32e 100644 --- a/net/core/fib_notifier.c +++ b/net/core/fib_notifier.c @@ -175,7 +175,7 @@ static struct pernet_operations fib_notifier_net_ops = { static int __init fib_notifier_init(void) { - return register_pernet_subsys(&fib_notifier_net_ops); + return register_pernet_sys(&fib_notifier_net_ops); } subsys_initcall(fib_notifier_init); diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c index 98e1066c3d55..b2706c18f0f3 100644 --- a/net/core/fib_rules.c +++ b/net/core/fib_rules.c @@ -1039,7 +1039,7 @@ static int __init fib_rules_init(void) rtnl_register(PF_UNSPEC, RTM_DELRULE, fib_nl_delrule, NULL, 0); rtnl_register(PF_UNSPEC, RTM_GETRULE, NULL, fib_nl_dumprule, 0); - err = register_pernet_subsys(&fib_rules_net_ops); + err = register_pernet_sys(&fib_rules_net_ops); if (err < 0) goto fail;
[PATCH RFC 19/25] net: Move proto_net_ops to pernet_sys list
This patch starts converting the pernet_subsys operations registered from subsys initcalls. According to net/Makefile and net/core/Makefile, this is the first executed subsys_initcall() registering a pernet_subsys. It seems safe to execute it in parallel with others, as it only creates/destroys a proc entry, which nobody else is interested in. Signed-off-by: Kirill Tkhai --- net/core/sock.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/core/sock.c b/net/core/sock.c index be050b044699..ed12e115458b 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -3349,7 +3349,7 @@ static __net_initdata struct pernet_operations proto_net_ops = { static int __init proto_init(void) { - return register_pernet_subsys(&proto_net_ops); + return register_pernet_sys(&proto_net_ops); } subsys_initcall(proto_init);
[PATCH RFC 22/25] net: Move subsys_initcall() registered pernet_operations from net/sched to pernet_sys list
psched_net_ops only creates and destroys a /proc entry, and it is safe to execute it in parallel with any foreign pernet_operations. tcf_action_net_ops initializes and destroys tcf_action_net::egdev_ht, which is not touched by foreign pernet_operations. So, move them to the pernet_sys list. Signed-off-by: Kirill Tkhai --- net/sched/act_api.c |2 +- net/sched/sch_api.c |2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/net/sched/act_api.c b/net/sched/act_api.c index 4d33a50a8a6d..f1de2146e6e0 100644 --- a/net/sched/act_api.c +++ b/net/sched/act_api.c @@ -1470,7 +1470,7 @@ static int __init tc_action_init(void) { int err; - err = register_pernet_subsys(&tcf_action_net_ops); + err = register_pernet_sys(&tcf_action_net_ops); if (err) return err; diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c index b6c4f536876b..68938ca4bbe1 100644 --- a/net/sched/sch_api.c +++ b/net/sched/sch_api.c @@ -2008,7 +2008,7 @@ static int __init pktsched_init(void) { int err; - err = register_pernet_subsys(&psched_net_ops); + err = register_pernet_sys(&psched_net_ops); if (err) { pr_err("pktsched_init: " "cannot initialize per netns operations\n");
[PATCH RFC 23/25] net: Move genl_pernet_ops to pernet_sys list
These pernet_operations create and destroy net::genl_sock. Foreign pernet_operations don't touch it. Signed-off-by: Kirill Tkhai --- net/netlink/genetlink.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c index d444daf1ac04..da7ab3dd5609 100644 --- a/net/netlink/genetlink.c +++ b/net/netlink/genetlink.c @@ -1045,7 +1045,7 @@ static int __init genl_init(void) if (err < 0) goto problem; - err = register_pernet_subsys(&genl_pernet_ops); + err = register_pernet_sys(&genl_pernet_ops); if (err) goto problem;
[PATCH RFC 20/25] net: Move pernet_subsys, registered via net_dev_init(), to pernet_sys list
net/core/dev.o is linked after net/core/sock.o. There are: 1) dev_proc_ops and dev_mc_net_ops, which create and destroy pernet proc files and are of no interest to other net namespaces; 2) netdev_net_ops, which creates a pernet hash that is not touched by other pernet_operations. So, move them to the pernet_sys list. Signed-off-by: Kirill Tkhai --- net/core/dev.c|2 +- net/core/net-procfs.c |4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/net/core/dev.c b/net/core/dev.c index 8ee29f4f5fa9..b90a503a9e1a 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -8787,7 +8787,7 @@ static int __init net_dev_init(void) INIT_LIST_HEAD(&offload_base); - if (register_pernet_subsys(&netdev_net_ops)) + if (register_pernet_sys(&netdev_net_ops)) goto out; /* diff --git a/net/core/net-procfs.c b/net/core/net-procfs.c index 615ccab55f38..46096219d574 100644 --- a/net/core/net-procfs.c +++ b/net/core/net-procfs.c @@ -413,8 +413,8 @@ static struct pernet_operations __net_initdata dev_mc_net_ops = { int __init dev_proc_init(void) { - int ret = register_pernet_subsys(&dev_proc_ops); + int ret = register_pernet_sys(&dev_proc_ops); if (!ret) - return register_pernet_subsys(&dev_mc_net_ops); + return register_pernet_sys(&dev_mc_net_ops); return ret; }
[PATCH RFC 25/25] net: Move sysctl_core_ops to pernet_sys list
These pernet_operations register and destroy a sysctl directory, which is of no interest to foreign pernet_operations. Signed-off-by: Kirill Tkhai --- net/core/sysctl_net_core.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c index cbc3dde4cfcc..0dab679b33fa 100644 --- a/net/core/sysctl_net_core.c +++ b/net/core/sysctl_net_core.c @@ -525,7 +525,7 @@ static __net_initdata struct pernet_operations sysctl_core_ops = { static __init int sysctl_core_init(void) { register_net_sysctl(&init_net, "net/core", net_core_table); - return register_pernet_subsys(&sysctl_core_ops); + return register_pernet_sys(&sysctl_core_ops); } fs_initcall(sysctl_core_init);
[PATCH RFC 24/25] net: Move wext_pernet_ops to pernet_sys list
These pernet_operations initialize and purge the net::wext_nlevents queue, which is not touched by foreign pernet_operations. Signed-off-by: Kirill Tkhai --- net/wireless/wext-core.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/wireless/wext-core.c b/net/wireless/wext-core.c index 6cdb054484d6..2103c2a003ed 100644 --- a/net/wireless/wext-core.c +++ b/net/wireless/wext-core.c @@ -394,7 +394,7 @@ static struct pernet_operations wext_pernet_ops = { static int __init wireless_nlevent_init(void) { - int err = register_pernet_subsys(&wext_pernet_ops); + int err = register_pernet_sys(&wext_pernet_ops); if (err) return err;
[PATCH RFC 14/25] net: Move net_defaults_ops to pernet_sys list
According to net/core/Makefile, net/core/net_namespace.o core initcalls execute right after net/core/sock.o's. net_defaults_ops introduces only the net_defaults_init_net method, and it acts on net::core::sysctl_somaxconn, which is of no interest to the rest of the pernet_subsys and pernet_device lists. So, move it to pernet_sys. Signed-off-by: Kirill Tkhai --- net/core/net_namespace.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c index 2e8295aa7003..7fc9d44c1817 100644 --- a/net/core/net_namespace.c +++ b/net/core/net_namespace.c @@ -371,7 +371,7 @@ static struct pernet_operations net_defaults_ops = { static __init int net_defaults_init(void) { - if (register_pernet_subsys(&net_defaults_ops)) + if (register_pernet_sys(&net_defaults_ops)) panic("Cannot initialize net default settings"); return 0;
[PATCH RFC 18/25] net: Move uevent_net_ops to pernet_sys list
These pernet_operations, created via postcore_initcall(), are registered from the ./lib directory, and they have to go right after audit_net_ops. uevent_net_init() and uevent_net_exit() create and destroy a netlink socket, and these actions are serialized in the netlink code. Parallel execution with other pernet_operations just makes the socket disappear from uevent_sock_list earlier on ->exit. As userspace can't be interested in broadcast messages of a dying net, and, as far as I can see, nobody in the kernel listens to them, we may safely move uevent_net_ops to the pernet_sys list. Signed-off-by: Kirill Tkhai --- lib/kobject_uevent.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c index c3e84edc47c9..84c9d85477cc 100644 --- a/lib/kobject_uevent.c +++ b/lib/kobject_uevent.c @@ -647,7 +647,7 @@ static struct pernet_operations uevent_net_ops = { static int __init kobject_uevent_init(void) { - return register_pernet_subsys(&uevent_net_ops); + return register_pernet_sys(&uevent_net_ops); }
[PATCH RFC 17/25] net: Move audit_net_ops to pernet_sys list
This patch starts converting the pernet_subsys operations registered from postcore initcalls. These pernet_operations are in the ./kernel directory, and there is only one more postcore initcall, in ./lib. So, audit_net_ops has to go first. audit_net_init() creates a netlink socket, while audit_net_exit() destroys it. The rest of the pernet_list is not interested in the socket, so we move audit_net_ops to the pernet_sys list. Signed-off-by: Kirill Tkhai --- kernel/audit.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/audit.c b/kernel/audit.c index 227db99b0f19..bb4626d7e712 100644 --- a/kernel/audit.c +++ b/kernel/audit.c @@ -1549,7 +1549,7 @@ static int __init audit_init(void) pr_info("initializing netlink subsys (%s)\n", audit_default ? "enabled" : "disabled"); - register_pernet_subsys(&audit_net_ops); + register_pernet_sys(&audit_net_ops); audit_initialized = AUDIT_INITIALIZED;
[PATCH RFC 16/25] net: Move rtnetlink_net_ops to pernet_sys list
rtnetlink_net_ops is added in the same core initcall as netlink_net_ops, so it has to be added right after netlink_net_ops. rtnetlink_net_init() and rtnetlink_net_exit() create and destroy a netlink socket. It looks like other pernet_operations are not interested in a foreign net::rtnl, so rtnetlink_net_ops may safely be moved to the pernet_sys list. Signed-off-by: Kirill Tkhai --- net/core/rtnetlink.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c index cb06d43c4230..d9cf13554e4d 100644 --- a/net/core/rtnetlink.c +++ b/net/core/rtnetlink.c @@ -4503,7 +4503,7 @@ void __init rtnetlink_init(void) for (i = 0; i < ARRAY_SIZE(rtnl_msg_handlers_ref); i++) refcount_set(&rtnl_msg_handlers_ref[i], 1); - if (register_pernet_subsys(&rtnetlink_net_ops)) + if (register_pernet_sys(&rtnetlink_net_ops)) panic("rtnetlink_init: cannot initialize rtnetlink\n"); register_netdevice_notifier(&rtnetlink_dev_notifier);
[PATCH RFC 10/25] net: Move sysctl_pernet_ops to pernet_sys list
This patch starts converting the pernet_subsys operations registered from core initcalls. Since net/socket.o is the first linked file in net/Makefile, its core initcalls execute first. sysctl_pernet_ops is the first pernet_subsys registered from sock_init(), so it goes ahead of the others registered via core_initcall(). The sysctl_net_init() and sysctl_net_exit() methods initialize a namespace's net::sysctls. The pernet_operations::init()/exit() methods from the rest of the list do not touch a foreign net's sysctls, so it's safe to execute sysctl_pernet_ops's methods in parallel with any other pernet_operations. Signed-off-by: Kirill Tkhai --- net/sysctl_net.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/sysctl_net.c b/net/sysctl_net.c index 9aed6fe1bf1a..1b91db88e54a 100644 --- a/net/sysctl_net.c +++ b/net/sysctl_net.c @@ -103,7 +103,7 @@ __init int net_sysctl_init(void) net_header = register_sysctl("net", empty); if (!net_header) goto out; - ret = register_pernet_subsys(&sysctl_pernet_ops); + ret = register_pernet_sys(&sysctl_pernet_ops); if (ret) goto out1; out:
[PATCH RFC 11/25] net: Move netfilter_net_ops to pernet_sys list
Since net/socket.o is the first linked file in net/Makefile, its core initcalls execute first. netfilter_net_ops is registered right after sysctl_pernet_ops. The netfilter_net_init() and netfilter_net_exit() methods initialize net::nf::hooks and change the net-related proc directory of the net. Other pernet_operations are not interested in a foreign net::nf::hooks or proc entries, so it's safe to move netfilter_net_ops to the pernet_sys list. Signed-off-by: Kirill Tkhai --- net/netfilter/core.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/netfilter/core.c b/net/netfilter/core.c index 52cd2901a097..2bed28281b67 100644 --- a/net/netfilter/core.c +++ b/net/netfilter/core.c @@ -606,7 +606,7 @@ int __init netfilter_init(void) { int ret; - ret = register_pernet_subsys(&netfilter_net_ops); + ret = register_pernet_sys(&netfilter_net_ops); if (ret < 0) goto err;
[PATCH RFC 06/25] net: Add pernet sys and registration functions
This is a new sublist of pernet_list, which will live ahead of the already existing ones, giving the order: sys, subsys, device. It's aimed at subsystems whose pernet_operations may execute in parallel with any other pernet_operations. Further on, step by step, we will move all subsys entries there, adding the necessary small synchronization locks where needed. After all subsys entries are moved to sys, we'll kill the subsys list; all current subsys will then not require net_mutex and will be able to init and exit in parallel with others. Then we'll add a dev sublist ahead of device, and repeat the cycle. Suggested-by: Eric W. Biederman Signed-off-by: Kirill Tkhai --- include/net/net_namespace.h |2 + net/core/net_namespace.c| 75 ++- 2 files changed, 75 insertions(+), 2 deletions(-) diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h index 10f99dafd5ac..2cde5f766ec6 100644 --- a/include/net/net_namespace.h +++ b/include/net/net_namespace.h @@ -324,6 +324,8 @@ struct pernet_operations { * device which caused kernel oops, and panics during network * namespace cleanup. So please don't get this wrong. */ +int register_pernet_sys(struct pernet_operations *); +void unregister_pernet_sys(struct pernet_operations *); int register_pernet_subsys(struct pernet_operations *); void unregister_pernet_subsys(struct pernet_operations *); int register_pernet_device(struct pernet_operations *); diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c index 1d9712973695..f4f4aaa5ce1f 100644 --- a/net/core/net_namespace.c +++ b/net/core/net_namespace.c @@ -24,10 +24,24 @@ #include /* - * Our network namespace constructor/destructor lists + * Our network namespace constructor/destructor lists + * one by one linked in pernet_list. They are (in order + * of linking): sys, subsys, device. + * + * The methods from sys for a network namespace may be + * called in parallel with any method from any list + * for another net namespace. 
+ * + * The methods from subsys and device can't be called + * in parallel with a method from subsys or device. + * + * When all subsys pernet_operations are moved to sys + * sublist, we'll kill subsys sublist, and create dev + * ahead of device sublist, and repeat the cycle. */ static LIST_HEAD(pernet_list); +static struct list_head *first_subsys = &pernet_list; static struct list_head *first_device = &pernet_list; DEFINE_MUTEX(net_mutex); @@ -987,6 +1001,57 @@ static void unregister_pernet_operations(struct pernet_operations *ops) ida_remove(&net_generic_ids, *ops->id); } +/** + * register_pernet_sys - register a network namespace system + * @ops: pernet operations structure for the system + * + * Register a subsystem which has init and exit functions + * that are called when network namespaces are created and + * destroyed respectively. + * + * When registered all network namespace init functions are + * called for every existing network namespace. Allowing kernel + * modules to have a race free view of the set of network namespaces. + * + * When a new network namespace is created all of the init + * methods are called in the order in which they were registered. + * + * When a network namespace is destroyed all of the exit methods + * are called in the reverse of the order with which they were + * registered. 
+ */ +int register_pernet_sys(struct pernet_operations *ops) +{ + int error; + down_write(&net_sem); + if (first_subsys != first_device) { + panic("Pernet %ps registered out of order.\n" + "There is already %ps.\n", ops, + list_entry(first_subsys, struct pernet_operations, list)); + } + error = register_pernet_operations(first_subsys, ops); + up_write(&net_sem); + return error; +} +EXPORT_SYMBOL_GPL(register_pernet_sys); + +/** + * unregister_pernet_sys - unregister a network namespace system + * @ops: pernet operations structure to manipulate + * + * Remove the pernet operations structure from the list to be + * used when network namespaces are created or destroyed. In + * addition run the exit method for all existing network + * namespaces. + */ +void unregister_pernet_sys(struct pernet_operations *ops) +{ + down_write(&net_sem); + unregister_pernet_operations(ops); + up_write(&net_sem); +} +EXPORT_SYMBOL_GPL(unregister_pernet_sys); + /** * register_pernet_subsys - register a network namespace subsystem * @ops: pernet operations structure for the subsystem @@ -1011,6 +1076,8 @@ int register_pernet_subsys(struct pernet_operations *ops) int error; down_write(&net_sem); error = register_pernet_operations(first_device, ops); + if (!error) + update_first_on_add(first_subsys, first_device, &ops->list); up_write(&net_sem); return error;
[PATCH RFC 03/25] net: Introduce net_sem for protection of pernet_list
Currently a mutex is used to protect the pernet operations list. It makes cleanup_net() execute the ->exit methods of the same operations set that was used at the time of ->init, even after the net namespace is unlinked from net_namespace_list. But the problem is that synchronize_rcu() is needed after the net is removed from net_namespace_list: Destroy net_ns: cleanup_net() mutex_lock(&net_mutex) list_del_rcu(&net->list) synchronize_rcu() <--- Sleep there for ages list_for_each_entry_reverse(ops, &pernet_list, list) ops_exit_list(ops, &net_exit_list) list_for_each_entry_reverse(ops, &pernet_list, list) ops_free_list(ops, &net_exit_list) mutex_unlock(&net_mutex) This primitive is not fast, especially on systems with many processors and/or when preemptible RCU is enabled in the config. So, all the time cleanup_net() is waiting for an RCU grace period, creation of new net namespaces is not possible; the tasks that do so are sleeping on the same mutex: Create net_ns: copy_net_ns() mutex_lock_killable(&net_mutex)<--- Sleep there for ages I observed 20-30 second hangs of "unshare -n" on an ordinary 8-cpu laptop with preemptible RCU enabled. The solution is to convert net_mutex to a rw_semaphore and add small locks to the really small number of pernet_operations that actually need them. Then, the pernet_operations::init/::exit methods modifying net-related data will require down_read() locking only, while down_write() will be used for changing pernet_list. This gives a significant performance increase, as you may see here: https://www.spinics.net/lists/netdev/msg467095.html It's a 4.6 times performance increase in a one-thread test. The multi-thread increase may be close to 4.6 multiplied by the number of threads. This patch starts replacing net_mutex with net_sem. It adds the rw_semaphore, describes the variables it protects, and uses it where appropriate. net_mutex is still present, and the next patches will kick it out step by step. 
Signed-off-by: Kirill Tkhai --- include/linux/rtnetlink.h |1 + net/core/net_namespace.c | 37 + net/core/rtnetlink.c |4 ++-- 3 files changed, 28 insertions(+), 14 deletions(-) diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h index 2032ce2eb20b..f640fc87fe1d 100644 --- a/include/linux/rtnetlink.h +++ b/include/linux/rtnetlink.h @@ -35,6 +35,7 @@ extern int rtnl_is_locked(void); extern wait_queue_head_t netdev_unregistering_wq; extern struct mutex net_mutex; +extern struct rw_semaphore net_sem; #ifdef CONFIG_PROVE_LOCKING extern bool lockdep_rtnl_is_held(void); diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c index 2e512965bf42..2254b1639209 100644 --- a/net/core/net_namespace.c +++ b/net/core/net_namespace.c @@ -41,6 +41,11 @@ struct net init_net = { EXPORT_SYMBOL(init_net); static bool init_net_initialized; +/* + * net_sem: protects: pernet_list, net_generic_ids, + * init_net_initialized and first_* pointers. + */ +DECLARE_RWSEM(net_sem); #define MIN_PERNET_OPS_ID \ ((sizeof(struct net_generic) + sizeof(void *) - 1) / sizeof(void *)) @@ -411,12 +416,16 @@ struct net *copy_net_ns(unsigned long flags, net->ucounts = ucounts; get_user_ns(user_ns); - rv = mutex_lock_killable(&net_mutex); + rv = down_read_killable(&net_sem); if (rv < 0) goto put_userns; - + rv = mutex_lock_killable(&net_mutex); + if (rv < 0) + goto up_read; rv = setup_net(net, user_ns); mutex_unlock(&net_mutex); +up_read: + up_read(&net_sem); if (rv < 0) { put_userns: put_user_ns(user_ns); @@ -443,6 +452,7 @@ static void cleanup_net(struct work_struct *work) list_replace_init(&cleanup_list, &net_kill_list); spin_unlock_irq(&cleanup_list_lock); + down_read(&net_sem); mutex_lock(&net_mutex); /* Don't let anyone else find us. */ @@ -484,6 +494,7 @@ static void cleanup_net(struct work_struct *work) ops_free_list(ops, &net_exit_list); mutex_unlock(&net_mutex); + up_read(&net_sem); /* Ensure there are no outstanding rcu callbacks using this * network namespace. 
@@ -510,8 +521,10 @@ static void cleanup_net(struct work_struct *work) */ void net_ns_barrier(void) { + down_write(&net_sem); mutex_lock(&net_mutex); mutex_unlock(&net_mutex); + up_write(&net_sem); } EXPORT_SYMBOL(net_ns_barrier); @@ -838,12 +851,12 @@ static int __init net_ns_init(void) rcu_assign_pointer(init_net.gen, ng); - mutex_lock(&net_mutex); + down_write(&net_sem); if (setup_net(&init_net, &init_user_ns)) panic("Could not setup the initial network namespace"); init_net_initialized = true; - mutex_unlock(&net_mutex); + up_write(&net_sem); register_pernet_
[PATCH RFC 04/25] net: Move mutex_unlock() in cleanup_net() up
net_sem protects against pernet_list changes, while ops_free_list() just does a simple kfree() and can't race with other pernet_operations callbacks. So we may release net_mutex earlier than before. Signed-off-by: Kirill Tkhai --- net/core/net_namespace.c |3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c index 2254b1639209..a8ea580885d9 100644 --- a/net/core/net_namespace.c +++ b/net/core/net_namespace.c @@ -489,11 +489,12 @@ static void cleanup_net(struct work_struct *work) list_for_each_entry_reverse(ops, &pernet_list, list) ops_exit_list(ops, &net_exit_list); + mutex_unlock(&net_mutex); + /* Free the net generic variables */ list_for_each_entry_reverse(ops, &pernet_list, list) ops_free_list(ops, &net_exit_list); - mutex_unlock(&net_mutex); up_read(&net_sem); /* Ensure there are no outstanding rcu callbacks using this
[PATCH RFC 05/25] net: Add primitives to update heads of pernet_list sublists
Currently we have first_device, and the device and subsys sublists. The next patches introduce one more sublist, so move the functionality that will be repeated into primitives. Signed-off-by: Kirill Tkhai --- net/core/net_namespace.c | 19 +++ 1 file changed, 15 insertions(+), 4 deletions(-) diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c index a8ea580885d9..1d9712973695 100644 --- a/net/core/net_namespace.c +++ b/net/core/net_namespace.c @@ -939,6 +939,18 @@ static void __unregister_pernet_operations(struct pernet_operations *ops) static DEFINE_IDA(net_generic_ids); +#define update_first_on_add(first, delim, added) \ + do {\ + if (first == delim) \ + first = added; \ + } while (0) + +#define update_first_on_del(first, to_delete) \ + do {\ + if (first == to_delete) \ + first = (to_delete)->next; \ + } while (0) + static int register_pernet_operations(struct list_head *list, struct pernet_operations *ops) { @@ -1045,8 +1057,8 @@ int register_pernet_device(struct pernet_operations *ops) { int error; down_write(&net_sem); error = register_pernet_operations(&pernet_list, ops); - if (!error && (first_device == &pernet_list)) - first_device = &ops->list; + if (!error) + update_first_on_add(first_device, &pernet_list, &ops->list); up_write(&net_sem); return error; } @@ -1064,8 +1076,7 @@ EXPORT_SYMBOL_GPL(register_pernet_device); void unregister_pernet_device(struct pernet_operations *ops) { down_write(&net_sem); - if (&ops->list == first_device) - first_device = first_device->next; + update_first_on_del(first_device, &ops->list); unregister_pernet_operations(ops); up_write(&net_sem); }
Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit
On 15.11.2017 19:31, Eric W. Biederman wrote: > Kirill Tkhai writes: > >> On 15.11.2017 12:51, Kirill Tkhai wrote: >>> On 15.11.2017 06:19, Eric W. Biederman wrote: Kirill Tkhai writes: > On 14.11.2017 21:39, Cong Wang wrote: >> On Tue, Nov 14, 2017 at 5:53 AM, Kirill Tkhai >> wrote: >>> @@ -406,7 +406,7 @@ struct net *copy_net_ns(unsigned long flags, >>> >>> get_user_ns(user_ns); >>> >>> - rv = mutex_lock_killable(&net_mutex); >>> + rv = down_read_killable(&net_sem); >>> if (rv < 0) { >>> net_free(net); >>> dec_net_namespaces(ucounts); >>> @@ -421,7 +421,7 @@ struct net *copy_net_ns(unsigned long flags, >>> list_add_tail_rcu(&net->list, &net_namespace_list); >>> rtnl_unlock(); >>> } >>> - mutex_unlock(&net_mutex); >>> + up_read(&net_sem); >>> if (rv < 0) { >>> dec_net_namespaces(ucounts); >>> put_user_ns(user_ns); >>> @@ -446,7 +446,7 @@ static void cleanup_net(struct work_struct *work) >>> list_replace_init(&cleanup_list, &net_kill_list); >>> spin_unlock_irq(&cleanup_list_lock); >>> >>> - mutex_lock(&net_mutex); >>> + down_read(&net_sem); >>> >>> /* Don't let anyone else find us. */ >>> rtnl_lock(); >>> @@ -486,7 +486,7 @@ static void cleanup_net(struct work_struct *work) >>> list_for_each_entry_reverse(ops, &pernet_list, list) >>> ops_free_list(ops, &net_exit_list); >>> >>> - mutex_unlock(&net_mutex); >>> + up_read(&net_sem); >> >> After your patch setup_net() could run concurrently with cleanup_net(), >> given that ops_exit_list() is called on error path of setup_net() too, >> it means ops->exit() now could run concurrently if it doesn't have its >> own lock. Not sure if this breaks any existing user. > > Yes, there will be possible concurrent ops->init() for a net namespace, > and ops->exit() for another one. I hadn't found pernet operations, which > have a problem with that. If they exist, they are hidden and not clear > seen. > The pernet operations in general do not touch someone else's memory. > If suddenly there is one, KASAN should show it after a while. 
Certainly the use of hash tables shared between multiple network namespaces would count. I don't remember how many of these we have but there used to be quite a few. >>> >>> Could you please provide an example of the hash tables you mean? >> >> Ah, I see, it's dccp_hashinfo etc. JFI, I've checked dccp_hashinfo, and it seems to be safe. > > The big one used to be the route cache. With resizable hash tables > things may be getting better in that regard. I've checked some fib-related things, and wasn't able to find that. Excuse me, could you please clarify whether this is an assumption, or whether there is a specific problem hash table you know of? Could you please point me to it more exactly, if so.
Re: [PATCH] net: bridge: add max_fdb_count
Hi Andrew, On Fri, Nov 17, 2017 at 03:06:23PM +0100, Andrew Lunn wrote: > > Usually it's better to apply LRU or random here in my opinion, as the > > new entry is much more likely to be needed than older ones by definition. > > Hi Willy > > I think this depends on why you need to discard. If it is normal > operation and the limits are simply too low, i would agree. > > If however it is a DoS, throwing away the new entries makes sense, > leaving the old ones which are more likely to be useful. > > Most of the talk in this thread has been about limits for DoS > prevention... Sure, but my point is that it can kick in on regular traffic, and in that case it can be catastrophic. That's the only thing that bothers me. If we have an unlimited default value with this algorithm, I'm fine, because nobody will get caught by accident with a bridge suddenly replicating high traffic on all ports because an unknown limit was reached. That's the principle of least surprise. I know that when fighting DoSes there are never any universally good solutions and one has to make tradeoffs. I'm perfectly fine with this. Cheers, Willy
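The tradeoff Willy and Andrew are debating (drop-new vs. LRU) is easy to state in code. Below is a minimal sketch of the drop-new policy under discussion, with hypothetical names — this is not the bridge's actual net_bridge_fdb_entry code:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a per-port FDB limit with the "drop new"
 * policy: once the limit is reached, new source MACs are simply not
 * learned, so frames destined to them keep being flooded. */
struct port {
    unsigned int fdb_count;     /* entries currently learned on this port */
    unsigned int max_fdb_count; /* 0 means unlimited */
};

/* Returns true if the new address was learned, false if dropped. */
static bool fdb_learn(struct port *p)
{
    if (p->max_fdb_count && p->fdb_count >= p->max_fdb_count)
        return false;   /* at the limit: the new entry is dropped */
    p->fdb_count++;
    return true;
}
```

An LRU policy would instead evict the oldest entry at the limit and always return true; that keeps new traffic unicast, at the cost of the per-port list bookkeeping Nikolay describes above.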
Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit
Kirill Tkhai writes: > On 15.11.2017 19:31, Eric W. Biederman wrote: >> Kirill Tkhai writes: >> >>> On 15.11.2017 12:51, Kirill Tkhai wrote: On 15.11.2017 06:19, Eric W. Biederman wrote: > Kirill Tkhai writes: > >> On 14.11.2017 21:39, Cong Wang wrote: >>> On Tue, Nov 14, 2017 at 5:53 AM, Kirill Tkhai >>> wrote: @@ -406,7 +406,7 @@ struct net *copy_net_ns(unsigned long flags, get_user_ns(user_ns); - rv = mutex_lock_killable(&net_mutex); + rv = down_read_killable(&net_sem); if (rv < 0) { net_free(net); dec_net_namespaces(ucounts); @@ -421,7 +421,7 @@ struct net *copy_net_ns(unsigned long flags, list_add_tail_rcu(&net->list, &net_namespace_list); rtnl_unlock(); } - mutex_unlock(&net_mutex); + up_read(&net_sem); if (rv < 0) { dec_net_namespaces(ucounts); put_user_ns(user_ns); @@ -446,7 +446,7 @@ static void cleanup_net(struct work_struct *work) list_replace_init(&cleanup_list, &net_kill_list); spin_unlock_irq(&cleanup_list_lock); - mutex_lock(&net_mutex); + down_read(&net_sem); /* Don't let anyone else find us. */ rtnl_lock(); @@ -486,7 +486,7 @@ static void cleanup_net(struct work_struct *work) list_for_each_entry_reverse(ops, &pernet_list, list) ops_free_list(ops, &net_exit_list); - mutex_unlock(&net_mutex); + up_read(&net_sem); >>> >>> After your patch setup_net() could run concurrently with cleanup_net(), >>> given that ops_exit_list() is called on error path of setup_net() too, >>> it means ops->exit() now could run concurrently if it doesn't have its >>> own lock. Not sure if this breaks any existing user. >> >> Yes, there will be possible concurrent ops->init() for a net namespace, >> and ops->exit() for another one. I hadn't found pernet operations, which >> have a problem with that. If they exist, they are hidden and not clear >> seen. >> The pernet operations in general do not touch someone else's memory. >> If suddenly there is one, KASAN should show it after a while. 
> > Certainly the use of hash tables shared between multiple network > namespaces would count. I don't remember how many of these we have but > there used to be quite a few. Could you please provide an example of the hash tables you mean? >>> >>> Ah, I see, it's dccp_hashinfo etc. > > JFI, I've checked dccp_hashinfo, and it seems to be safe. > >> >> The big one used to be the route cache. With resizable hash tables >> things may be getting better in that regard. > > I've checked some fib-related things, and wasn't able to find that. > Excuse me, could you please clarify whether this is an assumption, or > whether there is a specific problem hash table you know of? Could you please > point me to it more exactly, if so. Two things. 1) Hash tables are one case I know of where we access data from multiple network namespaces. As such it cannot be asserted that there is no possibility for problems. 2) The responsible way to handle this is one patch for each set of methods explaining why those methods are safe to run in parallel. That ensures there is opportunity for review and people are going slowly enough that they actually look at these issues. The reason I want to see this broken up is that at 200ish sets of methods it is too much to review all at once. I completely agree that the odds are that this can be made safe and that it is most likely already safe in practically every instance. My guess would be that if there are problems that need to be addressed they happen in one or two places and we need to find them. If possible I don't want to find them after the code has shipped in a stable release. Eric
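For readers following the thread, the net_mutex-to-rwsem change under review boils down to the following semantics, sketched here as a toy single-threaded model with asserts (hypothetical; the kernel uses a real rw_semaphore, not this):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the proposed net_sem: per-namespace ->init/->exit paths
 * take it for reading (and may overlap each other), while changes to
 * pernet_list take it for writing (excluding everyone). This is why
 * copy_net_ns() no longer has to wait out cleanup_net()'s
 * synchronize_rcu() the way it does under a plain mutex. */
struct rwsem {
    int  readers;
    bool writer;
};

static void down_read(struct rwsem *s)  { assert(!s->writer); s->readers++; }
static void up_read(struct rwsem *s)    { assert(s->readers > 0); s->readers--; }
static void down_write(struct rwsem *s) { assert(!s->writer && s->readers == 0); s->writer = true; }
static void up_write(struct rwsem *s)   { assert(s->writer); s->writer = false; }
```

Eric's review concern is precisely the readers: with down_read() in both setup_net() and cleanup_net(), every pernet ->init can now overlap some other namespace's ->exit, so each of the ~200 pernet_operations must be audited for that.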
Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit
Kirill Tkhai writes: > On 15.11.2017 19:29, Eric W. Biederman wrote: >> Kirill Tkhai writes: >> >>> On 15.11.2017 09:25, Eric W. Biederman wrote: Kirill Tkhai writes: > Currently a mutex is used to protect the pernet operations list. It makes > cleanup_net() execute the ->exit methods of the same operations set > which was used at the time of ->init, even after the net namespace is > unlinked from net_namespace_list. > > But the problem is that it needs to synchronize_rcu() after the net is removed > from net_namespace_list(): > > Destroy net_ns: > cleanup_net() > mutex_lock(&net_mutex) > list_del_rcu(&net->list) > synchronize_rcu() <--- Sleep there for > ages > list_for_each_entry_reverse(ops, &pernet_list, list) > ops_exit_list(ops, &net_exit_list) > list_for_each_entry_reverse(ops, &pernet_list, list) > ops_free_list(ops, &net_exit_list) > mutex_unlock(&net_mutex) > > This primitive is not fast, especially on systems with many processors > and/or when preemptible RCU is enabled in the config. So, all the while > cleanup_net() is waiting for the RCU grace period, creation of new net > namespaces > is not possible, and the tasks that create them are sleeping on the same mutex: > > Create net_ns: > copy_net_ns() > mutex_lock_killable(&net_mutex)<--- Sleep there for > ages > > The solution is to convert net_mutex to an rw_semaphore. Then, > pernet_operations::init/::exit methods, modifying the net-related data, > will require down_read() locking only, while down_write() will be used > for changing pernet_list. > > This gives a significant performance increase, as you may see below.
There > is measured sequential net namespace creation in a cycle, in single > thread, without other tasks (single user mode): > > 1)int main(int argc, char *argv[]) > { > unsigned nr; > if (argc < 2) { > fprintf(stderr, "Provide nr iterations arg\n"); > return 1; > } > nr = atoi(argv[1]); > while (nr-- > 0) { > if (unshare(CLONE_NEWNET)) { > perror("Can't unshare"); > return 1; > } > } > return 0; > } > > Origin, 10 unshare(): > 0.03user 23.14system 1:39.85elapsed 23%CPU > > Patched, 10 unshare(): > 0.03user 67.49system 1:08.34elapsed 98%CPU > > 2)for i in {1..1}; do unshare -n bash -c exit; done > > Origin: > real 1m24,190s > user 0m6,225s > sys 0m15,132s > > Patched: > real 0m18,235s (4.6 times faster) > user 0m4,544s > sys 0m13,796s > > This patch requires commit 76f8507f7a64 "locking/rwsem: Add > down_read_killable()" > from Linus tree (not in net-next yet). Using a rwsem to protect the list of operations makes sense. That should allow removing the sing I am not wild about taking a the rwsem down_write in rtnl_link_unregister, and net_ns_barrier. I think that works but it goes from being a mild hack to being a pretty bad hack and something else that can kill the parallelism you are seeking it add. There are about 204 instances of struct pernet_operations. That is a lot of code to have carefully audited to ensure it can in parallel all at once. The existence of the exit_batch method, net_ns_barrier, for_each_net and taking of net_mutex in rtnl_link_unregister all testify to the fact that there are data structures accessed by multiple network namespaces. My preference would be to: - Add the net_sem in addition to net_mutex with down_write only held in register and unregister, and maybe net_ns_barrier and rtnl_link_unregister. - Factor out struct pernet_ops out of struct pernet_operations. With struct pernet_ops not having the exit_batch method. With pernet_ops being embedded an anonymous member of the old struct pernet_operations. 
- Add [un]register_pernet_{sys,dev} functions that take a struct pernet_ops, that don't take net_mutex. Have them order the pernet_list as: pernet_sys pernet_subsys pernet_device pernet_dev With the chunk in the middle taking the net_mutex. >>> >>> I think this approach will work. Thanks for the suggestion. Some more >>> thoughts to the plan below. >>> >>> The only difficult thing there will be to choose the right order >>> to move ops from pernet_subsys to pernet_sys and from pernet_device >>> to pernet_dev one by one. >>> >>> This is rather easy in case of tristate drivers, as modules may be loaded >>> at any time, and the only important o
RE: [PATCH v1 net-next 0/7] net: dsa: microchip: Modify KSZ9477 DSA driver in preparation to add other KSZ switch drivers
> On Thu, Nov 16, 2017 at 06:41:24PM -0800, tristram...@microchip.com > wrote: > > From: Tristram Ha > > > > This series of patches is to modify the original KSZ9477 DSA driver so > > that other KSZ switch drivers can be added and use the common code. > > Hi Tristram > > http://vger.kernel.org/~davem/net-next.html > > It is better to send an RFC patchset while netdev is closed and not > send it to David. He will shout at you otherwise. Noted. I really need to monitor the DSA discussion to better contribute to its success. I just found out the DSA API set_addr was removed last month because not everybody was using it. The removal cited the Marvell switch as the only switch using that API, and that driver found a new way to program the MAC address. But looking at that driver I found it simply uses a randomized MAC address. For a big switch with many ports, where the main function is forwarding, that MAC address may not matter. For a small switch with 2 ports, which acts more like an Ethernet controller and is mainly used for daisy chaining in a ring network, the MAC address can be used in features like source address filtering.
Re: [PATCH v1 net-next 0/7] net: dsa: microchip: Modify KSZ9477 DSA driver in preparation to add other KSZ switch drivers
> I really need to monitor the DSA discussion to better contribute to its > success. > I just found out the DSA API set_addr was removed last month due to not > everybody is using it. It cited the Marvell switch was the only switch using > that > API and found a new way to program the MAC address. But looking at that > driver I found it simply uses a randomized MAC address. > > For big switch with many ports where the main function is forwarding that MAC > address may not matter. For small switch with 2 ports it acts more like an > Ethernet > controller where the switch is mainly used for daisy chaining in a ring > network the MAC > address can be used in feature like source address filtering. Hi Tristram The MAC address set by set_addr was only used for pause frames. Nothing else. So a random address is fine. The switch itself should not be sending any other frames. Andrew
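Since the address only backs pause frames, a random locally administered MAC is sufficient. A sketch in the spirit of the kernel's eth_random_addr() — a simplified userspace version, not the kernel helper itself — generates random bytes, clears the multicast bit, and sets the locally-administered bit:

```c
#include <assert.h>
#include <stdlib.h>

/* Generate a random unicast, locally administered MAC address. */
static void random_mac(unsigned char mac[6])
{
    for (int i = 0; i < 6; i++)
        mac[i] = rand() & 0xff;
    mac[0] &= 0xfe;  /* clear the multicast (group) bit */
    mac[0] |= 0x02;  /* set the locally administered bit */
}
```

Such an address never collides with a vendor-assigned OUI, which is exactly what a switch that only sources pause frames needs.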
Re: [patch net-next RFC v2 00/11] Add support for resource abstraction
On 11/14/17 9:18 AM, Jiri Pirko wrote: > From: Jiri Pirko > > Arkadi says: > > Many of the ASIC's internal resources are limited and are shared between > several hardware procedures. For example, unified hash-based memory can > be used for many lookup purposes, like FDB and LPM. In many cases the user > can provide a partitioning scheme for such a resource in order to perform > fine tuning for their application. In many cases, after setting the > partitioning of the resource, a driver reload is needed. This patchset adds > support for hot reset of the driver. > > Such an abstraction can be coupled with devlink's dpipe interface, which > models the ASIC's pipeline as a graph of match/action tables. By modeling > the hardware resource object, and by coupling it to several dpipe tables, > further visibility can be achieved in order to debug ASIC-wide issues. > > The proposed interface will provide the user the ability to understand the > limitations of the hardware, and receive notifications regarding its occupancy. > Furthermore, monitoring the resource occupancy can be done in real-time and > can be useful in many cases. > > Userspace part prototype can be found at https://github.com/arkadis/iproute2/ > at resource_dev branch. Now that my firmware problem is fixed, I installed a build with this patch set. Trying to run devlink to split a port hangs: $ devlink port split swp1 count 4 [ 615.373359] INFO: task devlink:804 blocked for more than 120 seconds. [ 615.379934] Tainted: GW 4.14.0+ #38 [ 615.385238] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 615.393111] devlink D0 804771 0x0080 [ 615.393115] Call Trace: [ 615.393126] __schedule+0x1de/0x690 [ 615.393130] schedule+0x36/0x80 [ 615.393139] schedule_preempt_disabled+0xe/0x10 [ 615.393146] __mutex_lock.isra.4+0x211/0x530 [ 615.393152] __mutex_lock_slowpath+0x13/0x20 [ 615.393155] ?
__mutex_lock_slowpath+0x13/0x20 [ 615.393158] mutex_lock+0x2f/0x40 [ 615.393164] devlink_port_unregister+0x29/0x60 [devlink] [ 615.393169] mlxsw_core_port_fini+0x25/0x50 [mlxsw_core] [ 615.393179] mlxsw_sp_port_remove+0xf0/0x100 [mlxsw_spectrum] [ 615.393186] mlxsw_sp_port_split+0xdc/0x260 [mlxsw_spectrum] [ 615.393193] ? _cond_resched+0x19/0x30 [ 615.393200] mlxsw_devlink_port_split+0x36/0x50 [mlxsw_core] [ 615.393206] devlink_nl_cmd_port_split_doit+0x42/0x50 [devlink] [ 615.393212] genl_family_rcv_msg+0x1c9/0x390 [ 615.393217] genl_rcv_msg+0x4c/0xa0 [ 615.393220] ? _cond_resched+0x19/0x30 [ 615.393228] ? genl_family_rcv_msg+0x390/0x390 [ 615.393232] netlink_rcv_skb+0xec/0x120 [ 615.393235] genl_rcv+0x28/0x40 [ 615.393239] netlink_unicast+0x170/0x230 [ 615.393244] netlink_sendmsg+0x28e/0x370 [ 615.393251] SYSC_sendto+0x10e/0x1b0 [ 615.393258] ? __audit_syscall_entry+0xc1/0x110 [ 615.393261] ? syscall_trace_enter+0x1c6/0x2d0 [ 615.393264] ? __do_page_fault+0x231/0x4b0 [ 615.393268] SyS_sendto+0xe/0x10 [ 615.393272] do_syscall_64+0x60/0x1f0 [ 615.393277] entry_SYSCALL64_slow_path+0x25/0x25 [ 615.393280] RIP: 0033:0x7f4ef43c16f3 [ 615.393284] RSP: 002b:7fffb907fbc8 EFLAGS: 0246 ORIG_RAX: 002c [ 615.393287] RAX: ffda RBX: 013660e0 RCX: 7f4ef43c16f3 [ 615.393290] RDX: 0040 RSI: 01366110 RDI: 0003 [ 615.393291] RBP: R08: 7f4ef4686d80 R09: 000c [ 615.393292] R10: R11: 0246 R12: [ 615.393296] R13: 0004 R14: R15:
Re: [PATCH v10 5/8] ARM: dts: sunxi: Restore EMAC changes (boards)
Hey, Sorry for the bringing this up again. Isn't there a: ethernet0 = &emac; for some boards missing? Best, Philipp (Sorry for sending this to some persons more than once! My Thunderbird sent mails in html and didn't reach the mailing lists. I hope it works now :) ) On 31.10.2017 09:19, Corentin Labbe wrote: The original dwmac-sun8i DT bindings have some issue on how to handle integrated PHY and was reverted in last RC of 4.13. But now we have a solution so we need to get back that was reverted. This patch restore all boards DT about dwmac-sun8i This reverts partially commit fe45174b72ae ("arm: dts: sunxi: Revert EMAC changes") Signed-off-by: Corentin Labbe Acked-by: Florian Fainelli --- arch/arm/boot/dts/sun8i-h2-plus-orangepi-zero.dts | 9 + arch/arm/boot/dts/sun8i-h3-bananapi-m2-plus.dts | 19 +++ arch/arm/boot/dts/sun8i-h3-nanopi-m1-plus.dts | 19 +++ arch/arm/boot/dts/sun8i-h3-nanopi-neo.dts | 7 +++ arch/arm/boot/dts/sun8i-h3-orangepi-2.dts | 8 arch/arm/boot/dts/sun8i-h3-orangepi-one.dts | 8 arch/arm/boot/dts/sun8i-h3-orangepi-pc-plus.dts | 5 + arch/arm/boot/dts/sun8i-h3-orangepi-pc.dts| 8 arch/arm/boot/dts/sun8i-h3-orangepi-plus.dts | 22 ++ arch/arm/boot/dts/sun8i-h3-orangepi-plus2e.dts| 16 10 files changed, 121 insertions(+) diff --git a/arch/arm/boot/dts/sun8i-h2-plus-orangepi-zero.dts b/arch/arm/boot/dts/sun8i-h2-plus-orangepi-zero.dts index b1502df7b509..6713d0f2b3f4 100644 --- a/arch/arm/boot/dts/sun8i-h2-plus-orangepi-zero.dts +++ b/arch/arm/boot/dts/sun8i-h2-plus-orangepi-zero.dts @@ -56,6 +56,8 @@ aliases { serial0 = &uart0; + /* ethernet0 is the H3 emac, defined in sun8i-h3.dtsi */ + ethernet0 = &emac; ethernet1 = &xr819; }; @@ -102,6 +104,13 @@ status = "okay"; }; +&emac { + phy-handle = <&int_mii_phy>; + phy-mode = "mii"; + allwinner,leds-active-low; + status = "okay"; +}; + &mmc0 { pinctrl-names = "default"; pinctrl-0 = <&mmc0_pins_a>; diff --git a/arch/arm/boot/dts/sun8i-h3-bananapi-m2-plus.dts b/arch/arm/boot/dts/sun8i-h3-bananapi-m2-plus.dts index 
e1dba9ffa94b..f2292deaa590 100644 --- a/arch/arm/boot/dts/sun8i-h3-bananapi-m2-plus.dts +++ b/arch/arm/boot/dts/sun8i-h3-bananapi-m2-plus.dts @@ -52,6 +52,7 @@ compatible = "sinovoip,bpi-m2-plus", "allwinner,sun8i-h3"; aliases { + ethernet0 = &emac; serial0 = &uart0; serial1 = &uart1; }; @@ -111,6 +112,24 @@ status = "okay"; }; +&emac { + pinctrl-names = "default"; + pinctrl-0 = <&emac_rgmii_pins>; + phy-supply = <®_gmac_3v3>; + phy-handle = <&ext_rgmii_phy>; + phy-mode = "rgmii"; + + allwinner,leds-active-low; + status = "okay"; +}; + +&external_mdio { + ext_rgmii_phy: ethernet-phy@1 { + compatible = "ethernet-phy-ieee802.3-c22"; + reg = <0>; + }; +}; + &ir { pinctrl-names = "default"; pinctrl-0 = <&ir_pins_a>; diff --git a/arch/arm/boot/dts/sun8i-h3-nanopi-m1-plus.dts b/arch/arm/boot/dts/sun8i-h3-nanopi-m1-plus.dts index 73766d38ee6c..cfb96da3cfef 100644 --- a/arch/arm/boot/dts/sun8i-h3-nanopi-m1-plus.dts +++ b/arch/arm/boot/dts/sun8i-h3-nanopi-m1-plus.dts @@ -66,6 +66,25 @@ status = "okay"; }; +&emac { + pinctrl-names = "default"; + pinctrl-0 = <&emac_rgmii_pins>; + phy-supply = <®_gmac_3v3>; + phy-handle = <&ext_rgmii_phy>; + phy-mode = "rgmii"; + + allwinner,leds-active-low; + + status = "okay"; +}; + +&external_mdio { + ext_rgmii_phy: ethernet-phy@1 { + compatible = "ethernet-phy-ieee802.3-c22"; + reg = <7>; + }; +}; + &ir { pinctrl-names = "default"; pinctrl-0 = <&ir_pins_a>; diff --git a/arch/arm/boot/dts/sun8i-h3-nanopi-neo.dts b/arch/arm/boot/dts/sun8i-h3-nanopi-neo.dts index 8d2cc6e9a03f..78f6c24952dd 100644 --- a/arch/arm/boot/dts/sun8i-h3-nanopi-neo.dts +++ b/arch/arm/boot/dts/sun8i-h3-nanopi-neo.dts @@ -46,3 +46,10 @@ model = "FriendlyARM NanoPi NEO"; compatible = "friendlyarm,nanopi-neo", "allwinner,sun8i-h3"; }; + +&emac { + phy-handle = <&int_mii_phy>; + phy-mode = "mii"; + allwinner,leds-active-low; + status = "okay"; +}; diff --git a/arch/arm/boot/dts/sun8i-h3-orangepi-2.dts b/arch/arm/boot/dts/sun8i-h3-orangepi-2.dts index 
1bf51802f5aa..b20be95b49d5 100644 --- a/arch/arm/boot/dts/sun8i-h3-orangepi-2.dts +++ b/arch/arm/boot/dts/sun8i-h3-orangepi-2.dts @@ -54,6 +54,7 @@ aliases { serial0 = &uart0; /* ethernet0 is the H3 emac, defined in sun8i-h3.dtsi */ + ethernet0 = &emac; ethernet1 = &rtl81
Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit
On 17.11.2017 21:54, Eric W. Biederman wrote: > Kirill Tkhai writes: > >> On 15.11.2017 19:29, Eric W. Biederman wrote: >>> Kirill Tkhai writes: >>> On 15.11.2017 09:25, Eric W. Biederman wrote: > Kirill Tkhai writes: > >> Curently mutex is used to protect pernet operations list. It makes >> cleanup_net() to execute ->exit methods of the same operations set, >> which was used on the time of ->init, even after net namespace is >> unlinked from net_namespace_list. >> >> But the problem is it's need to synchronize_rcu() after net is removed >> from net_namespace_list(): >> >> Destroy net_ns: >> cleanup_net() >> mutex_lock(&net_mutex) >> list_del_rcu(&net->list) >> synchronize_rcu() <--- Sleep there >> for ages >> list_for_each_entry_reverse(ops, &pernet_list, list) >> ops_exit_list(ops, &net_exit_list) >> list_for_each_entry_reverse(ops, &pernet_list, list) >> ops_free_list(ops, &net_exit_list) >> mutex_unlock(&net_mutex) >> >> This primitive is not fast, especially on the systems with many >> processors >> and/or when preemptible RCU is enabled in config. So, all the time, while >> cleanup_net() is waiting for RCU grace period, creation of new net >> namespaces >> is not possible, the tasks, who makes it, are sleeping on the same mutex: >> >> Create net_ns: >> copy_net_ns() >> mutex_lock_killable(&net_mutex)<--- Sleep there >> for ages >> >> The solution is to convert net_mutex to the rw_semaphore. Then, >> pernet_operations::init/::exit methods, modifying the net-related data, >> will require down_read() locking only, while down_write() will be used >> for changing pernet_list. >> >> This gives signify performance increase, like you may see below. 
There >> is measured sequential net namespace creation in a cycle, in single >> thread, without other tasks (single user mode): >> >> 1)int main(int argc, char *argv[]) >> { >> unsigned nr; >> if (argc < 2) { >> fprintf(stderr, "Provide nr iterations arg\n"); >> return 1; >> } >> nr = atoi(argv[1]); >> while (nr-- > 0) { >> if (unshare(CLONE_NEWNET)) { >> perror("Can't unshare"); >> return 1; >> } >> } >> return 0; >> } >> >> Origin, 10 unshare(): >> 0.03user 23.14system 1:39.85elapsed 23%CPU >> >> Patched, 10 unshare(): >> 0.03user 67.49system 1:08.34elapsed 98%CPU >> >> 2)for i in {1..1}; do unshare -n bash -c exit; done >> >> Origin: >> real 1m24,190s >> user 0m6,225s >> sys 0m15,132s >> >> Patched: >> real 0m18,235s (4.6 times faster) >> user 0m4,544s >> sys 0m13,796s >> >> This patch requires commit 76f8507f7a64 "locking/rwsem: Add >> down_read_killable()" >> from Linus tree (not in net-next yet). > > Using a rwsem to protect the list of operations makes sense. > > That should allow removing the sing > > I am not wild about taking a the rwsem down_write in > rtnl_link_unregister, and net_ns_barrier. I think that works but it > goes from being a mild hack to being a pretty bad hack and something > else that can kill the parallelism you are seeking it add. > > There are about 204 instances of struct pernet_operations. That is a > lot of code to have carefully audited to ensure it can in parallel all > at once. The existence of the exit_batch method, net_ns_barrier, > for_each_net and taking of net_mutex in rtnl_link_unregister all testify > to the fact that there are data structures accessed by multiple network > namespaces. > > My preference would be to: > > - Add the net_sem in addition to net_mutex with down_write only held in > register and unregister, and maybe net_ns_barrier and > rtnl_link_unregister. > > - Factor out struct pernet_ops out of struct pernet_operations. With > struct pernet_ops not having the exit_batch method. 
With pernet_ops > being embedded an anonymous member of the old struct pernet_operations. > > - Add [un]register_pernet_{sys,dev} functions that take a struct > pernet_ops, that don't take net_mutex. Have them order the > pernet_list as: > > pernet_sys > pernet_subsys > pernet_device > pernet_dev > > With the chunk in the middle taking the net_mutex. I think this approach will work. Thanks for the suggestion. Some more thoughts to the plan below. The only difficult thing there will be to choose the right order to move ops from pernet_subsys to per
[PATCH] usbnet: ipheth: fix potential null pointer dereference in ipheth_carrier_set
_dev_ is being dereferenced before it is null checked, hence there is a potential null pointer dereference. Fix this by moving the pointer dereference after _dev_ has been null checked.

Addresses-Coverity-ID: 1462020
Fixes: bb1b40c7cb86 ("usbnet: ipheth: prevent TX queue timeouts when device not ready")
Signed-off-by: Gustavo A. R. Silva
---
 drivers/net/usb/ipheth.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/usb/ipheth.c b/drivers/net/usb/ipheth.c
index ca71f6c..7275761 100644
--- a/drivers/net/usb/ipheth.c
+++ b/drivers/net/usb/ipheth.c
@@ -291,12 +291,15 @@ static void ipheth_sndbulk_callback(struct urb *urb)

 static int ipheth_carrier_set(struct ipheth_device *dev)
 {
-	struct usb_device *udev = dev->udev;
+	struct usb_device *udev;
 	int retval;

+	if (!dev)
+		return 0;
 	if (!dev->confirmed_pairing)
 		return 0;
+
+	udev = dev->udev;
 	retval = usb_control_msg(udev, usb_rcvctrlpipe(udev, IPHETH_CTRL_ENDP),
 				 IPHETH_CMD_CARRIER_CHECK, /* request */
--
2.7.4
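The bug class addressed here can be shown in isolation: an initializer that dereferences the pointer runs before any later null check. A minimal sketch with hypothetical types (not the real driver structs):

```c
#include <assert.h>
#include <stddef.h>

struct fake_dev { int carrier_ok; };

/* Mirrors the fixed pattern: declare first, null-check, and only then
 * dereference. Initializing from dev->... at the declaration would
 * execute the dereference before the check ever runs. */
static int carrier_set(struct fake_dev *dev)
{
    int *flag;

    if (!dev)
        return 0;              /* safe: no dereference has happened yet */

    flag = &dev->carrier_ok;   /* dereference only after the check */
    *flag = 1;
    return 1;
}
```

This ordering is exactly what static analyzers like Coverity flag: the original initializer made the null check dead code on the path that mattered.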
Re: [pull request][net V2 0/5] Mellanox, mlx5 fixes 2017-11-08
On Sat, Nov 11, 2017 at 2:42 AM, David Miller wrote: > From: Saeed Mahameed > Date: Fri, 10 Nov 2017 15:50:15 +0900 > >> The following series includes some fixes for the mlx5 core and ethernet >> driver. >> >> Sorry for the late submission, but as you can see I have some very >> critical fixes below that I would like merged into this RC. >> >> Please pull and let me know if there is any problem. > > Pulled. > >> For -stable: >> ('net/mlx5e: Set page to null in case dma mapping fails') kernels >= 4.13 >> ('net/mlx5: FPGA, return -EINVAL if size is zero') kernels >= 4.13 >> ('net/mlx5: Cancel health poll before sending panic teardown command') >> kernels >= 4.13 > > That FPGA change doesn't appear in this pull request. > Sorry about that, I had to drop it as you see in the "V1->V2" log, but forgot to remove it from the -stable list.
[PATCH iproute2/net-next v3]tc: B.W limits can now be specified in %.
This patch adapts the tc command line interface to allow bandwidth limits to be specified as a percentage of the interface's capacity. Adding this functionality requires passing the specified device string to each class/qdisc, which changes the prototype for a couple of functions: the .parse_qopt and .parse_copt interfaces. The device string is a required parameter for tc-qdisc and tc-class, and when not specified, the kernel returns ENODEV. In this patch, if the user tries to specify a bandwidth percentage without naming the device, we return an error from userspace.

v2:
* Modified and moved int read_prop() from ip/iptuntap.c to lib/utils.c, to make it accessible to tc.

v3:
* Modified and moved int parse_percent() from tc/q_netem.c to lib/utils.c for use in tc.
* Changed a couple of variable names in int parse_percent_rate().
* Handled showing an error message when the device speed is unknown.
* Updated the man page to warn users that when specifying rates in %, tc only uses the current device speed and does not recalculate if it changes afterwards.

In cases where properties (like device speed) are unknown, read_prop() assumes that if the property file can be opened but not read, the property is unknown.
Signed-off by: Nishanth Devarajan --- include/utils.h | 2 ++ ip/iptuntap.c | 32 --- lib/utils.c | 68 + man/man8/tc.8 | 5 - tc/q_atm.c | 2 +- tc/q_cbq.c | 25 - tc/q_choke.c| 9 ++-- tc/q_clsact.c | 2 +- tc/q_codel.c| 2 +- tc/q_drr.c | 4 ++-- tc/q_dsmark.c | 4 ++-- tc/q_fifo.c | 2 +- tc/q_fq.c | 16 +++--- tc/q_fq_codel.c | 2 +- tc/q_gred.c | 9 ++-- tc/q_hfsc.c | 45 +- tc/q_hhf.c | 2 +- tc/q_htb.c | 18 +++ tc/q_ingress.c | 2 +- tc/q_mqprio.c | 2 +- tc/q_multiq.c | 2 +- tc/q_netem.c| 23 ++- tc/q_pie.c | 2 +- tc/q_prio.c | 2 +- tc/q_qfq.c | 4 ++-- tc/q_red.c | 9 ++-- tc/q_rr.c | 2 +- tc/q_sfb.c | 2 +- tc/q_sfq.c | 2 +- tc/q_tbf.c | 16 +++--- tc/tc.c | 2 +- tc/tc_class.c | 2 +- tc/tc_qdisc.c | 2 +- tc/tc_util.c| 63 tc/tc_util.h| 7 -- 35 files changed, 283 insertions(+), 110 deletions(-) diff --git a/include/utils.h b/include/utils.h index 3d91c50..9377266 100644 --- a/include/utils.h +++ b/include/utils.h @@ -87,6 +87,8 @@ int get_prefix(inet_prefix *dst, char *arg, int family); int mask2bits(__u32 netmask); int get_addr_ila(__u64 *val, const char *arg); +int read_prop(const char *dev, char *prop, long *value); +int parse_percent(double *val, const char *str); int get_hex(char c); int get_integer(int *val, const char *arg, int base); int get_unsigned(unsigned *val, const char *arg, int base); diff --git a/ip/iptuntap.c b/ip/iptuntap.c index b46e452..09f2be2 100644 --- a/ip/iptuntap.c +++ b/ip/iptuntap.c @@ -223,38 +223,6 @@ static int do_del(int argc, char **argv) return tap_del_ioctl(&ifr); } -static int read_prop(char *dev, char *prop, long *value) -{ - char fname[IFNAMSIZ+25], buf[80], *endp; - ssize_t len; - int fd; - long result; - - sprintf(fname, "/sys/class/net/%s/%s", dev, prop); - fd = open(fname, O_RDONLY); - if (fd < 0) { - if (strcmp(prop, "tun_flags")) - fprintf(stderr, "open %s: %s\n", fname, - strerror(errno)); - return -1; - } - len = read(fd, buf, sizeof(buf)-1); - close(fd); - if (len < 0) { - fprintf(stderr, "read %s: %s", fname, strerror(errno)); - 
return -1; - } - - buf[len] = 0; - result = strtol(buf, &endp, 0); - if (*endp != '\n') { - fprintf(stderr, "Failed to parse %s\n", fname); - return -1; - } - *value = result; - return 0; -} - static void print_flags(long flags) { if (flags & IFF_TUN) diff --git a/lib/utils.c b/lib/utils.c index 4f2fa28..9d5ba2a 100644 --- a/lib/utils.c +++ b/lib/utils.c @@ -39,6 +39,74 @@ int resolve_hosts; int timestamp_short; +int read_prop(const char *dev, char *prop, long *value) +{ + char fname[128], buf[80], *endp, *nl; + FILE *fp; + long result; + int ret; + + ret = snprintf(fname, sizeof(fname), "/sys/class/net/%s/%s", + dev, prop); + + if (ret <= 0 || ret >= sizeof(fname)) { + fprintf(stderr, "could not build pathname for property\n"); + return -1; + } + + fp = fopen(fname, "r"); + if (fp == NULL) { + fprintf(stderr, "fopen %s: %s\n", fname, strerror(errno)); + return -1; + } + + if (!fgets(buf, size
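As a standalone illustration of the percent parsing this patch moves into lib/utils.c, here is a simplified sketch; the real iproute2 parse_percent() may differ in details such as the suffixes it accepts:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Parse a string like "50%" into a fraction (0.5 here).
 * Returns 0 on success, -1 on malformed input. */
static int parse_percent(double *val, const char *str)
{
    char *p;

    errno = 0;
    *val = strtod(str, &p) / 100.0;
    if (errno || p == str || strcmp(p, "%"))
        return -1;
    return 0;
}
```

tc would then multiply this fraction by the device speed obtained via read_prop() to compute an absolute rate.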
Re: [PATCH iproute2/net-next v3]tc: B.W limits can now be specified in %.
On Sat, 18 Nov 2017 02:13:38 +0530 Nishanth Devarajan wrote: > + result = strtoul(buf, &endp, 0); > + > + if (*endp || buf == endp) { > + fprintf(stderr, "value \"%s\" in file %s is not a number\n", > + buf, fname); > + goto out; > + } > + > + if (result == ULONG_MAX && errno == ERANGE) { > + fprintf(stderr, "strtoul %s: %s", fname, strerror(errno)); > + goto out; > + } Since an unknown speed is represented as "-1", I think you need to change this API to take a signed value (i.e. use strtol).
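Stephen's point is easy to demonstrate: strtoul() accepts "-1" by negating it in the unsigned type, so it "successfully" returns ULONG_MAX without setting ERANGE, and the ERANGE check in the patch never fires for an unknown speed. Parsing with strtol() keeps the sign. A sketch (parse_speed is a hypothetical helper, not the patch's code):

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Parse a sysfs-style speed value with strtol(), so the "-1" that
 * /sys/class/net/<dev>/speed reports for an unknown link speed comes
 * back as a signed -1 instead of wrapping to ULONG_MAX the way
 * strtoul() makes it. */
static long parse_speed(const char *buf)
{
    return strtol(buf, NULL, 0);
}
```

With a signed result, the "speed unknown" case is a simple `speed < 0` test rather than an unreliable ULONG_MAX/ERANGE combination.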
Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit
On 17.11.2017 21:52, Eric W. Biederman wrote: > Kirill Tkhai writes: > >> On 15.11.2017 19:31, Eric W. Biederman wrote: >>> Kirill Tkhai writes: >>> On 15.11.2017 12:51, Kirill Tkhai wrote: > On 15.11.2017 06:19, Eric W. Biederman wrote: >> Kirill Tkhai writes: >> >>> On 14.11.2017 21:39, Cong Wang wrote: On Tue, Nov 14, 2017 at 5:53 AM, Kirill Tkhai wrote: > @@ -406,7 +406,7 @@ struct net *copy_net_ns(unsigned long flags, > > get_user_ns(user_ns); > > - rv = mutex_lock_killable(&net_mutex); > + rv = down_read_killable(&net_sem); > if (rv < 0) { > net_free(net); > dec_net_namespaces(ucounts); > @@ -421,7 +421,7 @@ struct net *copy_net_ns(unsigned long flags, > list_add_tail_rcu(&net->list, &net_namespace_list); > rtnl_unlock(); > } > - mutex_unlock(&net_mutex); > + up_read(&net_sem); > if (rv < 0) { > dec_net_namespaces(ucounts); > put_user_ns(user_ns); > @@ -446,7 +446,7 @@ static void cleanup_net(struct work_struct *work) > list_replace_init(&cleanup_list, &net_kill_list); > spin_unlock_irq(&cleanup_list_lock); > > - mutex_lock(&net_mutex); > + down_read(&net_sem); > > /* Don't let anyone else find us. */ > rtnl_lock(); > @@ -486,7 +486,7 @@ static void cleanup_net(struct work_struct *work) > list_for_each_entry_reverse(ops, &pernet_list, list) > ops_free_list(ops, &net_exit_list); > > - mutex_unlock(&net_mutex); > + up_read(&net_sem); After your patch setup_net() could run concurrently with cleanup_net(), given that ops_exit_list() is called on error path of setup_net() too, it means ops->exit() now could run concurrently if it doesn't have its own lock. Not sure if this breaks any existing user. >>> >>> Yes, there will be possible concurrent ops->init() for a net namespace, >>> and ops->exit() for another one. I hadn't found pernet operations, which >>> have a problem with that. If they exist, they are hidden and not clear >>> seen. >>> The pernet operations in general do not touch someone else's memory. 
>>> If suddenly there is one, KASAN should show it after a while. >> >> Certainly the use of hash tables shared between multiple network >> namespaces would count. I don't remember how many of these we have but >> there used to be quite a few. > > Could you please provide an example of the hash tables you mean? Ah, I see, it's dccp_hashinfo etc. >> >> JFI, I've checked dccp_hashinfo, and it seems to be safe. >> >>> >>> The big one used to be the route cache. With resizable hash tables >>> things may be getting better in that regard. >> >> I've checked some fib-related things, and wasn't able to find that. >> Excuse me, could you please clarify whether it's an assumption, or >> whether there is a specific problem hash table you know of? If so, could you >> please point me to it more exactly. > > Two things. > 1) Hash tables are one case I know where we access data from multiple >network namespaces. As such it can not be asserted that there is no >possibility for problems. > > 2) The responsible way to handle this is one patch for each set of >methods explaining why those methods are safe to run in parallel. > >That ensures there is opportunity for review and people are going >slowly enough that they actually look at these issues. > > The reason I want to see this broken up is that at 200ish sets of > methods it is too much to review all at once. Ok, it's possible to split the changes into 400 patches, but there is a problem with three-state (no compile, module, built-in) drivers. Git bisect won't work anyway. Please see the description of the problem in the cover message "[PATCH RFC 00/25] Replacing net_mutex with rw_semaphore" I sent today. > I completely agree that odds are that this can be made safe and that it > is most likely already safe in practically every instance. My guess > would be that if there are problems that need to be addressed they > happen in one or two places and we need to find them. If possible I > don't want to find them after the code has shipped in a stable release.
Kirill
[PATCH iproute2] tc: cleanup qdisc arg parsing
The qdisc arg parsing has a magic limit of 16 for class which is not required by the kernel. Also the limit of 16 for the device name is really IFNAMSIZ. Signed-off-by: Stephen Hemminger --- tc/tc_qdisc.c | 21 + 1 file changed, 9 insertions(+), 12 deletions(-) diff --git a/tc/tc_qdisc.c b/tc/tc_qdisc.c index fcb75f29128e..1066ae05a4b5 100644 --- a/tc/tc_qdisc.c +++ b/tc/tc_qdisc.c @@ -17,6 +17,7 @@ #include #include #include +#include #include #include #include @@ -49,8 +50,7 @@ static int tc_qdisc_modify(int cmd, unsigned int flags, int argc, char **argv) struct tc_sizespec szopts; __u16 *data; } stab = {}; - char d[16] = {}; - char k[16] = {}; + char d[IFNAMSIZ] = {}; struct { struct nlmsghdr n; struct tcmsg t; @@ -89,8 +89,8 @@ static int tc_qdisc_modify(int cmd, unsigned int flags, int argc, char **argv) return -1; } req.t.tcm_parent = TC_H_CLSACT; - strncpy(k, "clsact", sizeof(k) - 1); - q = get_qdisc_kind(k); + + q = get_qdisc_kind("clsact"); req.t.tcm_handle = TC_H_MAKE(TC_H_CLSACT, 0); NEXT_ARG_FWD(); break; @@ -100,8 +100,8 @@ static int tc_qdisc_modify(int cmd, unsigned int flags, int argc, char **argv) return -1; } req.t.tcm_parent = TC_H_INGRESS; - strncpy(k, "ingress", sizeof(k) - 1); - q = get_qdisc_kind(k); + + q = get_qdisc_kind("ingress"); req.t.tcm_handle = TC_H_MAKE(TC_H_INGRESS, 0); NEXT_ARG_FWD(); break; @@ -124,26 +124,23 @@ } else if (matches(*argv, "help") == 0) { usage(); } else { - strncpy(k, *argv, sizeof(k)-1); - - q = get_qdisc_kind(k); + q = get_qdisc_kind(*argv); argc--; argv++; break; } argc--; argv++; } - if (k[0]) - addattr_l(&req.n, sizeof(req), TCA_KIND, k, strlen(k)+1); if (est.ewma_log) addattr_l(&req.n, sizeof(req), TCA_RATE, &est, sizeof(est)); if (q) { + addattr_l(&req.n, sizeof(req), TCA_KIND, q->id, strlen(q->id) + 1); if (q->parse_qopt) { if (q->parse_qopt(q, argc, argv, &req.n)) return 1; } else if (argc) { - fprintf(stderr, "qdisc '%s' does not support 
option parsing\n", k); + fprintf(stderr, "qdisc '%s' does not support option parsing\n", q->id); return -1; } } else { -- 2.11.0
Re: [PATCH iproute2/net-next v3] tc: B.W limits can now be specified in %.
On Sat, 18 Nov 2017 02:13:38 +0530 Nishanth Devarajan wrote: > diff --git a/tc/tc_util.h b/tc/tc_util.h > index 583a21a..7b7420a 100644 > --- a/tc/tc_util.h > +++ b/tc/tc_util.h > @@ -24,14 +24,14 @@ struct qdisc_util { > struct qdisc_util *next; > const char *id; > int (*parse_qopt)(struct qdisc_util *qu, int argc, > - char **argv, struct nlmsghdr *n); > + char **argv, struct nlmsghdr *n, char *dev); One more nit... Since parsing queue options should not modify the device name, that should be const char *.
Re: [LTP] [RFC] [PATCH] netns: Fix race in virtual interface bringup
Alexey, Li, thank you for your suggestions. On Fri, Nov 17, 2017 at 03:08:20PM +0300, Alexey Kodanev wrote: > On 11/17/2017 09:09 AM, Li Wang wrote: > > Hi Dan, > > > > On Fri, Nov 10, 2017 at 4:38 AM, Dan Rue wrote: > >> Symptoms (+ command, error): > >> netns_comm_ip_ipv6_ioctl: > >> + ip netns exec tst_net_ns1 ping6 -q -c2 -I veth1 fd00::2 > >> connect: Cannot assign requested address > >> > >> netns_comm_ip_ipv6_netlink: > >> + ip netns exec tst_net_ns0 ping6 -q -c2 -I veth0 fd00::3 > >> connect: Cannot assign requested address > >> > >> netns_comm_ns_exec_ipv6_ioctl: > >> + ns_exec 6689 net ping6 -q -c2 -I veth0 fd00::3 > >> connect: Cannot assign requested address > >> > >> netns_comm_ns_exec_ipv6_netlin: > >> + ns_exec 6891 net ping6 -q -c2 -I veth0 fd00::3 > >> connect: Cannot assign requested address > >> > >> The error is coming from ping6, which is trying to get an IP address for > >> veth0 (due to -I veth0), but cannot. Waiting for two seconds fixes the > >> test in my testcases. 1 second is not long enough. > >> > >> dmesg shows the following during the test: > >> > >> [Nov 7 15:39] LTP: starting netns_comm_ip_ipv6_ioctl (netns_comm.sh ip > >> ipv6 ioctl) > >> [ +0.302401] IPv6: ADDRCONF(NETDEV_UP): veth0: link is not ready > >> [ +0.048059] IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready > > It's quite strange that veth interface needs 2 seconds to become > operational and it is up in less than 0.3s according to dmesg, but > you said that it's not enough even 1 sec... Are you sure that IPv6 > address not in tentative state and dad process actually disabled? > I'm asking because you don't have it disabled in the script: > https://gist.github.com/danrue/7b76bbcbc23a6296030b7295650b69f3 Investigating further, the dmesg output is reporting on the status of the link between veth0 and veth1, not the veth0 interface itself. 
That is, the first dmesg message comes from "ip netns exec tst_net_ns0 ifconfig veth0 up" and the second comes from "ip netns exec tst_net_ns1 ifconfig veth1 up". This explains why we see .3s in dmesg but a 2 second sleep being required. There is not actually anything in dmesg that is helpful here. Regarding dad (duplicate address detection), we have seen similar issues on low power ARM64 boards and IPv4. Anyway, I tried disabling dad on the interface and it did not make a difference. > > >> > >> Signed-off-by: Dan Rue > >> --- > >> > >> We've periodically hit this problem across many arm64 kernels and boards, > >> and > >> it seems to be caused by "ping6" running before the virtual interface is > >> actually ready. "sleep 2" works around the issue and proves that it is a > >> race > >> condition, but I would prefer something faster and deterministic. Please > >> suggest a better implementation. > > Just FYI: > > > > I'm not good at network things, but one method I copied from ltp/numa > > test is to split the '2s' into many smaller pieces of time. > > > > which something like: > > > > --- a/testcases/kernel/containers/netns/netns_helper.sh > > +++ b/testcases/kernel/containers/netns/netns_helper.sh > > @@ -240,6 +240,22 @@ netns_ip_setup() > > tst_brkm TBROK "unable to add device veth1 to the > > separate network namespace" > > } > > > > +wait_for_set_ip() > > +{ > > + local dev=$1 > > + local retries=200 > > + > > + while [ $retries -gt 0 ]; do > > + dmesg -c | grep -q "IPv6: ADDRCONF(NETDEV_CHANGE): > > $dev: link becomes ready" > > > What about "grep -q up /sys/class/net/$dev/operstate && break"? Since dmesg will not help, I explored /sys as proposed. operstate shows "up", and ping6 still fails. carrier shows "1" (up), and ping6 still fails. dormant shows "0" (interface is not dormant), and ping6 still fails. flags shows "0x1003" before and after a 2s sleep (they don't change) So it seems there is nothing in dmesg, or /sys that can help here. 
Dan > > Thanks, > Alexey > > > > + if [ $? -eq 0 ]; then > > + break > > + fi > > + > > + retries=$((retries-1)) > > + tst_sleep 10ms > > + done > > +} > > + > > ## > > # Enables virtual ethernet devices and assigns IP addresses for both > > # of them (IPv4/IPv6 variant is decided by netns_setup() function). > > @@ -285,6 +301,9 @@ netns_set_ip() > > tst_brkm TBROK "enabling veth1 device failed" > > ;; > > esac > > + > > + wait_for_set_ip veth0 > > + wait_for_set_ip veth1 > > } > > > > netns_ns_exec_cleanup() > > > >> Also, is it correct that "ifconfig veth0 up" returns before the interface > >> is > >> actually ready? > >> > >> See also this isolated test script: > >> https://gist.github.com/danrue/7b76bbcbc23a6296030b7295650b69f3 > >> > >> testcases/kernel/containers/netns/netns_helper.sh | 1 + > >> 1 file changed, 1
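Since neither dmesg nor the /sys attributes expose the missing readiness condition, one remaining option in the spirit of Li's retry loop is to poll on the operation that actually fails — attempt the ping itself in a short retry loop instead of a fixed 2 s sleep. A sketch only: namespace, device and address names follow the quoted test script, `tst_sleep` is LTP's helper, and this requires root plus the netns setup from netns_helper.sh:

```shell
# Retry the failing operation itself rather than sleeping a fixed time.
wait_net_ready()
{
    ns=$1 dev=$2 peer=$3 retries=200
    while [ "$retries" -gt 0 ]; do
        if ip netns exec "$ns" ping6 -q -c1 -w1 -I "$dev" "$peer" \
               >/dev/null 2>&1; then
            return 0
        fi
        retries=$((retries - 1))
        tst_sleep 10ms
    done
    return 1
}

wait_net_ready tst_net_ns0 veth0 fd00::2
```

This stays deterministic (it exits as soon as the path works, or fails loudly after ~2 s) without depending on dmesg messages that describe the peer link rather than the local interface.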
[PATCH net] net: accept UFO datagrams from tuntap and packet
From: Willem de Bruijn Tuntap and similar devices can inject GSO packets. Accept type VIRTIO_NET_HDR_GSO_UDP, even though not generating UFO natively. Processes are expected to use feature negotiation such as TUNSETOFFLOAD to detect supported offload types and refrain from injecting other packets. This process breaks down with live migration: guest kernels do not renegotiate flags, so destination hosts need to expose all features that the source host does. Partially revert the UFO removal from 182e0b6b5846~1..d9d30adf5677. This patch introduces nearly(*) no new code to simplify verification. It brings back verbatim tuntap UFO negotiation, VIRTIO_NET_HDR_GSO_UDP insertion and software UFO segmentation. It does not reinstate protocol stack support, hardware offload (NETIF_F_UFO), SKB_GSO_UDP tunneling in SKB_GSO_SOFTWARE or reception of VIRTIO_NET_HDR_GSO_UDP packets in tuntap. To support SKB_GSO_UDP reappearing in the stack, also reinstate logic in act_csum and openvswitch. Achieve equivalence with v4.13 HEAD by squashing in commit 939912216fa8 ("net: skb_needs_check() removes CHECKSUM_UNNECESSARY check for tx.") and reverting commit 8d63bee643f1 ("net: avoid skb_warn_bad_offload false positives on UFO"). (*) To avoid having to bring back skb_shinfo(skb)->ip6_frag_id, ipv6_proxy_select_ident is changed to return a __be32, which is assigned directly to the frag_hdr. Also, SKB_GSO_UDP is inserted at the end of the enum to minimize code churn. 
Link: http://lkml.kernel.org/r/ Fixes: fb652fdfe837 ("macvlan/macvtap: Remove NETIF_F_UFO advertisement.") Reported-by: Michal Kubecek Signed-off-by: Willem de Bruijn --- drivers/net/tap.c | 2 +- drivers/net/tun.c | 4 ++ include/linux/netdev_features.h | 4 +- include/linux/netdevice.h | 1 + include/linux/skbuff.h | 2 + include/linux/virtio_net.h | 5 ++- include/net/ipv6.h | 1 + net/core/dev.c | 3 +- net/ipv4/af_inet.c | 12 +- net/ipv4/udp_offload.c | 49 ++-- net/ipv6/output_core.c | 31 +++ net/ipv6/udp_offload.c | 85 +++-- net/openvswitch/datapath.c | 14 +++ net/openvswitch/flow.c | 6 ++- net/sched/act_csum.c| 6 +++ 15 files changed, 211 insertions(+), 14 deletions(-) diff --git a/drivers/net/tap.c b/drivers/net/tap.c index b13890953ebb..e9489b88407c 100644 --- a/drivers/net/tap.c +++ b/drivers/net/tap.c @@ -1077,7 +1077,7 @@ static long tap_ioctl(struct file *file, unsigned int cmd, case TUNSETOFFLOAD: /* let the user check for future flags */ if (arg & ~(TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 | - TUN_F_TSO_ECN)) + TUN_F_TSO_ECN | TUN_F_UFO)) return -EINVAL; rtnl_lock(); diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 6bb1e604aadd..a33385d8ac65 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -2369,6 +2369,10 @@ static int set_offload(struct tun_struct *tun, unsigned long arg) features |= NETIF_F_TSO6; arg &= ~(TUN_F_TSO4|TUN_F_TSO6); } + + if (arg & TUN_F_UFO) { + arg &= ~TUN_F_UFO; + } } /* This gives the user a way to test for new features in future by diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h index dc8b4896b77b..b1b0ca7ccb2b 100644 --- a/include/linux/netdev_features.h +++ b/include/linux/netdev_features.h @@ -54,8 +54,9 @@ enum { NETIF_F_GSO_TUNNEL_REMCSUM_BIT, /* ... TUNNEL with TSO & REMCSUM */ NETIF_F_GSO_SCTP_BIT, /* ... SCTP fragmentation */ NETIF_F_GSO_ESP_BIT,/* ... ESP with TSO */ + NETIF_F_GSO_UDP_BIT,/* ... 
UFO, deprecated except tuntap */ /**/NETIF_F_GSO_LAST = /* last bit, see GSO_MASK */ - NETIF_F_GSO_ESP_BIT, + NETIF_F_GSO_UDP_BIT, NETIF_F_FCOE_CRC_BIT, /* FCoE CRC32 */ NETIF_F_SCTP_CRC_BIT, /* SCTP checksum offload */ @@ -132,6 +133,7 @@ enum { #define NETIF_F_GSO_TUNNEL_REMCSUM __NETIF_F(GSO_TUNNEL_REMCSUM) #define NETIF_F_GSO_SCTP __NETIF_F(GSO_SCTP) #define NETIF_F_GSO_ESP__NETIF_F(GSO_ESP) +#define NETIF_F_GSO_UDP__NETIF_F(GSO_UDP) #define NETIF_F_HW_VLAN_STAG_FILTER __NETIF_F(HW_VLAN_STAG_FILTER) #define NETIF_F_HW_VLAN_STAG_RX__NETIF_F(HW_VLAN_STAG_RX) #define NETIF_F_HW_VLAN_STAG_TX__NETIF_F(HW_VLAN_STAG_TX) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 6b274bfe489f..ef789e1d679e 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -4140,6 +4140,7 @@ static inline bool net_gso_ok(netdev_features_t features, int gso_ty
Re: regression: UFO removal breaks kvm live migration
On Fri, Nov 17, 2017 at 9:48 AM, Willem de Bruijn wrote: >>> Okay, I will send a patch to reinstate UFO for this use case (only). There >>> is some related work in tap_handle_frame and packet_direct_xmit to >>> segment directly in the device. I will be traveling the next few days, so >>> it won't be in time for 4.14 (but can go in stable later, of course). >> >> I'm finishing up and running some tests. The majority of the patch is a >> straightforward partial revert of the patchset, so while fairly large for a >> patch to net (~150 lines, esp. in udp[46]_ufo_fragment), that is all >> thoroughly tested code. Notably absent are the protocol layer and >> hardware support (NETIF_F_UFO) portions. >> >> The only open issue is whether to rely on existing skb_gso_segment >> processing in the transmit path from validate_xmit_skb or to add new >> skb_gso_segment calls directly to tun_get_user, tap_get_user and >> pf_packet. Tun has to loop around four different ways of injecting >> packets into the device. Something like the below snippet. >> >> More conservative is to introduce no completely new code and rely on >> validate_xmit_skb, but that means having to protect the entire stack >> against skbs with SKB_GSO_UDP, so also bringing back some >> checksum and fragment handling snippets in gre_gso_segment, >> __skb_udp_tunnel_segment, act_csum and openvswitch. > > Come to think of it, as this patch does not bring back NETIF_F_UFO > support to NETIF_F_GSO_SOFTWARE, the tunnel cases can be > excluded. > > Then this is probably the simpler and more obviously correct approach. Sent: http://patchwork.ozlabs.org/patch/839168/
[iproute2 PATCH] man: tc-mqprio: add documentation for new offload options
This patch adds documentation for additional offload modes and associated parameters in tc-mqprio. Signed-off-by: Amritha Nambiar --- man/man8/tc-mqprio.8 | 60 +- 1 file changed, 49 insertions(+), 11 deletions(-) diff --git a/man/man8/tc-mqprio.8 b/man/man8/tc-mqprio.8 index 0e1d305..a1bedd3 100644 --- a/man/man8/tc-mqprio.8 +++ b/man/man8/tc-mqprio.8 @@ -16,7 +16,17 @@ P0 P1 P2... count1@offset1 count2@offset2 ... .B ] [ hw 1|0 -.B ] +.B ] [ mode +dcb|channel] +.B ] [ shaper +dcb| +.B [ bw_rlimit +.B min_rate +min_rate1 min_rate2 ... +.B max_rate +max_rate1 max_rate2 ... +.B ]] + .SH DESCRIPTION The MQPRIO qdisc is a simple queuing discipline that allows mapping @@ -36,14 +46,16 @@ and By default these parameters are configured by the hardware driver to match the hardware QOS structures. -Enabled hardware can provide hardware QOS with the ability to steer -traffic flows to designated traffic classes provided by this qdisc. -Configuring the hardware based QOS mechanism is outside the scope of -this qdisc. Tools such as -.B lldpad -and -.B ethtool -exist to provide this functionality. Also further qdiscs may be added +.B Channel +mode supports full offload of the mqprio options, the traffic classes, the queue +configurations and QOS attributes to the hardware. Enabled hardware can provide +hardware QOS with the ability to steer traffic flows to designated traffic +classes provided by this qdisc. Hardware based QOS is configured using the +.B shaper +parameter. +.B bw_rlimit +with minimum and maximum bandwidth rates can be used for setting +transmission rates on each traffic class. Also further qdiscs may be added to the classes of MQPRIO to create more complex configurations. .SH ALGORITHM @@ -104,9 +116,35 @@ contiguous range of queues. hw Set to .B 1 -to use hardware QOS defaults. Set to +to support hardware offload. Set to .B 0 -to override hardware defaults with user specified values. +to configure user specified values in software only. 
+ +.TP +mode +Set to +.B channel +for full use of the mqprio options. Use +.B dcb +to offload only TC values and use hardware QOS defaults. Supported with 'hw' +set to 1 only. + +.TP +shaper +Use +.B bw_rlimit +to set bandwidth rate limits for a traffic class. Use +.B dcb +for hardware QOS defaults. Supported with 'hw' set to 1 only. + +.TP +min_rate +Minimum value of bandwidth rate limit for a traffic class. + +.TP +max_rate +Maximum value of bandwidth rate limit for a traffic class. + .SH AUTHORS John Fastabend,
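A worked invocation of the options this man-page patch documents may be useful; this is a sketch only — it assumes root, a multiqueue NIC whose driver implements mqprio channel-mode offload, and illustrative device and rate values ("eth0", the queue layout and the rates are not from the patch):

```shell
# Two traffic classes, 4 queues each, fully offloaded (hw 1 mode channel),
# with per-class min/max bandwidth limits via the bw_rlimit shaper.
tc qdisc add dev eth0 root mqprio \
    num_tc 2 map 0 0 0 0 1 1 1 1 \
    queues 4@0 4@4 \
    hw 1 mode channel \
    shaper bw_rlimit min_rate 1Gbit 2Gbit max_rate 4Gbit 5Gbit
```

With `hw 1 mode dcb` instead, only the TC mapping would be offloaded and the hardware QOS defaults (e.g. from lldpad) would apply, per the DESCRIPTION section above.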
[iproute2 PATCH] man: tc-flower: add explanation for hw_tc option
Add details explaining the hw_tc option. Signed-off-by: Amritha Nambiar --- man/man8/tc-flower.8 |9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/man/man8/tc-flower.8 b/man/man8/tc-flower.8 index be46f02..fd9098e 100644 --- a/man/man8/tc-flower.8 +++ b/man/man8/tc-flower.8 @@ -10,7 +10,10 @@ flower \- flow based traffic control filter .B action .IR ACTION_SPEC " ] [ " .B classid -.IR CLASSID " ]" +.IR CLASSID " ] [ " +.B hw_tc +.IR TCID " ]" + .ti -8 .IR MATCH_LIST " := [ " MATCH_LIST " ] " MATCH @@ -77,6 +80,10 @@ is in the form .BR X : Y ", while " X " and " Y are interpreted as numbers in hexadecimal format. .TP +.BI hw_tc " TCID" +Specify a hardware traffic class to pass matching packets on to. TCID is in the +range 0 through 15. +.TP .BI indev " ifname" Match on incoming interface name. Obviously this makes sense only for forwarded flows.
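As a usage sketch for the documented option (device, addresses and ports are illustrative; this needs a driver that can offload flower matches to a hardware traffic class, such as the channel mode set up with tc-mqprio):

```shell
# Steer UDP traffic to 192.168.3.5 port 25 into hardware traffic class 1.
tc filter add dev eth0 protocol ip parent ffff: prio 1 \
    flower dst_ip 192.168.3.5/32 ip_proto udp dst_port 25 \
    skip_sw hw_tc 1
```

The TCID given to hw_tc must be one of the classes (0-15) configured on the device's root qdisc.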
[PATCH 1/8] mm: kmemleak: remove unused hardirq.h
Preempt counter APIs have been split out, currently, hardirq.h just includes irq_enter/exit APIs which are not used by kmemleak at all. So, remove the unused hardirq.h. Signed-off-by: Yang Shi Cc: Michal Hocko Cc: Andrew Morton Cc: Matthew Wilcox --- mm/kmemleak.c | 1 - 1 file changed, 1 deletion(-) diff --git a/mm/kmemleak.c b/mm/kmemleak.c index 7780cd8..25b977f 100644 --- a/mm/kmemleak.c +++ b/mm/kmemleak.c @@ -91,7 +91,6 @@ #include #include #include -#include #include #include #include -- 1.8.3.1
[PATCH 6/8] net: caif: remove unused hardirq.h
Preempt counter APIs have been split out, currently, hardirq.h just includes irq_enter/exit APIs which are not used by caif at all. So, remove the unused hardirq.h. Signed-off-by: Yang Shi Cc: Dmitry Tarnyagin Cc: "David S. Miller" --- net/caif/cfpkt_skbuff.c | 1 - net/caif/chnl_net.c | 1 - 2 files changed, 2 deletions(-) diff --git a/net/caif/cfpkt_skbuff.c b/net/caif/cfpkt_skbuff.c index 71b6ab2..38c2b7a 100644 --- a/net/caif/cfpkt_skbuff.c +++ b/net/caif/cfpkt_skbuff.c @@ -8,7 +8,6 @@ #include #include -#include #include #include diff --git a/net/caif/chnl_net.c b/net/caif/chnl_net.c index 922ac1d..53ecda1 100644 --- a/net/caif/chnl_net.c +++ b/net/caif/chnl_net.c @@ -8,7 +8,6 @@ #define pr_fmt(fmt) KBUILD_MODNAME ":%s(): " fmt, __func__ #include -#include #include #include #include -- 1.8.3.1
[PATCH 8/8] net: tipc: remove unused hardirq.h
Preempt counter APIs have been split out, currently, hardirq.h just includes irq_enter/exit APIs which are not used by TIPC at all. So, remove the unused hardirq.h. Signed-off-by: Yang Shi Cc: Jon Maloy Cc: Ying Xue Cc: "David S. Miller" --- net/tipc/core.h | 1 - 1 file changed, 1 deletion(-) diff --git a/net/tipc/core.h b/net/tipc/core.h index 5cc5398..099e072 100644 --- a/net/tipc/core.h +++ b/net/tipc/core.h @@ -49,7 +49,6 @@ #include #include #include -#include #include #include #include -- 1.8.3.1
[PATCH 4/8] vfs: remove unused hardirq.h
Preempt counter APIs have been split out, currently, hardirq.h just includes irq_enter/exit APIs which are not used by vfs at all. So, remove the unused hardirq.h. Signed-off-by: Yang Shi Cc: Alexander Viro --- fs/dcache.c | 1 - fs/file_table.c | 1 - 2 files changed, 2 deletions(-) diff --git a/fs/dcache.c b/fs/dcache.c index f901413..9340e8c 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -32,7 +32,6 @@ #include #include #include -#include #include #include #include diff --git a/fs/file_table.c b/fs/file_table.c index 61517f5..dab099e 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -23,7 +23,6 @@ #include #include #include -#include #include #include #include -- 1.8.3.1
[PATCH 3/8] fs: btrfs: remove unused hardirq.h
Preempt counter APIs have been split out, currently, hardirq.h just includes irq_enter/exit APIs which are not used by btrfs at all. So, remove the unused hardirq.h. Signed-off-by: Yang Shi Cc: Chris Mason Cc: Josef Bacik Cc: David Sterba Cc: linux-bt...@vger.kernel.org --- fs/btrfs/extent_map.c | 1 - 1 file changed, 1 deletion(-) diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c index 2e348fb..cced7f1 100644 --- a/fs/btrfs/extent_map.c +++ b/fs/btrfs/extent_map.c @@ -2,7 +2,6 @@ #include #include #include -#include #include "ctree.h" #include "extent_map.h" #include "compression.h" -- 1.8.3.1
[PATCH 7/8] net: ovs: remove unused hardirq.h
Preempt counter APIs have been split out, currently, hardirq.h just includes irq_enter/exit APIs which are not used by openvswitch at all. So, remove the unused hardirq.h. Signed-off-by: Yang Shi Cc: Pravin Shelar Cc: "David S. Miller" Cc: d...@openvswitch.org --- net/openvswitch/vport-internal_dev.c | 1 - 1 file changed, 1 deletion(-) diff --git a/net/openvswitch/vport-internal_dev.c b/net/openvswitch/vport-internal_dev.c index 04a3128..2f47c65 100644 --- a/net/openvswitch/vport-internal_dev.c +++ b/net/openvswitch/vport-internal_dev.c @@ -16,7 +16,6 @@ * 02110-1301, USA */ -#include #include #include #include -- 1.8.3.1
[PATCH 2/8] fs: pstore: remove unused hardirq.h
Preempt counter APIs have been split out, currently, hardirq.h just includes irq_enter/exit APIs which are not used by pstore at all. So, remove the unused hardirq.h. Signed-off-by: Yang Shi Cc: Kees Cook Cc: Anton Vorontsov Cc: Colin Cross Cc: Tony Luck --- fs/pstore/platform.c | 1 - 1 file changed, 1 deletion(-) diff --git a/fs/pstore/platform.c b/fs/pstore/platform.c index 2b21d18..25dcef4 100644 --- a/fs/pstore/platform.c +++ b/fs/pstore/platform.c @@ -41,7 +41,6 @@ #include #include #include -#include #include #include -- 1.8.3.1
[PATCH 5/8] crypto: remove unused hardirq.h
Preempt counter APIs have been split out, currently, hardirq.h just includes irq_enter/exit APIs which are not used by crypto at all. So, remove the unused hardirq.h. Signed-off-by: Yang Shi Cc: Herbert Xu Cc: "David S. Miller" Cc: linux-cry...@vger.kernel.org --- crypto/ablk_helper.c | 1 - crypto/blkcipher.c | 1 - crypto/mcryptd.c | 1 - 3 files changed, 3 deletions(-) diff --git a/crypto/ablk_helper.c b/crypto/ablk_helper.c index 1441f07..ee52660 100644 --- a/crypto/ablk_helper.c +++ b/crypto/ablk_helper.c @@ -28,7 +28,6 @@ #include #include #include -#include #include #include #include diff --git a/crypto/blkcipher.c b/crypto/blkcipher.c index 6c43a0a..01c0d4a 100644 --- a/crypto/blkcipher.c +++ b/crypto/blkcipher.c @@ -18,7 +18,6 @@ #include #include #include -#include #include #include #include diff --git a/crypto/mcryptd.c b/crypto/mcryptd.c index 4e64726..9fa362c 100644 --- a/crypto/mcryptd.c +++ b/crypto/mcryptd.c @@ -26,7 +26,6 @@ #include #include #include -#include #define MCRYPTD_MAX_CPU_QLEN 100 #define MCRYPTD_BATCH 9 -- 1.8.3.1
Re: [ftrace-bpf 1/5] add BPF_PROG_TYPE_FTRACE to bpf
On Mon, Nov 13, 2017 at 12:06:17AM -0800, peng yu wrote: > > 1. anything bpf related has to go via net-next tree. > I found there is a net-next git repo: > https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git > I will use this repo for the further bpf-ftrace patch set. > > > 2. > > this obviously breaks ABI. New types can only be added to the end. > Sure, I will add the new type at the end. > > > 3. > > this won't even compile, since ftrace_regs is only added in the patch 4. > It could compile, as the ftrace_regs related code is inside the > "#ifdef FTRACE_BPF_FILTER" macro; if this macro is not defined, no > ftrace_regs related code would be compiled. > > > > Since bpf program will see ftrace_regs as an input it becomes > > abi, so has to be defined in uapi/linux/bpf_ftrace.h or similar. > > We need to think through how to make it generic across archs > > instead of defining ftrace_regs for each arch. > I'm not sure whether I fully understand your meaning. Like kprobe, > ftrace-bpf needs to get a function's parameters and check them. So > it won't be abi stable, and it should depend on the architecture > implementation. I can create a header file like uapi/linux/bpf_ftrace.h, > but I noticed that kprobe doesn't have such a header file; if I'm > wrong, please let me know. And about making it generic across archs: I > know kprobe uses pt_regs as a parameter, and pt_regs is defined on each > arch, so I can't see how bpf-ftrace can get a generic interface across > archs if it needs to check a function's parameters. If I misunderstand > anything, please let me know. all of ftrace is called at function entry and the calling convention is fixed per architecture, so we can make a generic and stable struct bpf_ftrace_args { __u64 arg1, arg2, .. arg5; }; save_mcount_regs doesn't care what order the regs are stored so the same stack space can be used to keep bpf_ftrace_args and used in restore_mcount_regs. 
I'd also make it depend on DYNAMIC_FTRACE_WITH_REGS to avoid dealing with obscure corner cases. > > > 4. > > the patch 2/3 takes an approach of passing FD integer value in text form > > to the kernel. That approach was discussed years ago and rejected. > > It has to use binary interface like perf_event + ioctl. > > See RFC patches where we're extending perf_event_open syscall to > > support binary access to kprobe/uprobe. > > imo binary interface to ftrace is pre-requisite to ftrace+bpf work. > > We've had too many issues with text based kprobe api to repeat > > the same mistake here. > I notice the kprobe-bpf prog is set through the PERF_EVENT_IOC_SET_BPF > ioctl, I may try to see whether I can reuse this interface, or if it > is not suitable, I will try to define a new binary interface. > > > 5. > > patch 4 hacks save_mcount_regs asm to pass ctx pointer in %rcx > > whereas it's only used in ftrace_graph_caller which doesn't seem right. > > It points out to another issue that such ftrace+bpf integration > > is only done for ftrace_graph_caller without extensibility in mind. > > If we do ftrace+bpf I'd rather see generic framework that applies > > to all of ftrace instead of single feature of it. > It is a hard problem. The ftrace framework has lots of tracers, > function tracer and function graph tracer use the 'gcc -pg' directly, > other tracers use tracepoint, I should spend more time to find a > suitable solution. since all of ftrace goes through the same function entry point it should be possible to have one generic bpf filter interface suitable for all tracers that ftrace supports.