Re: [PATCH] net: bridge: add max_fdb_count

2017-11-17 Thread Nikolay Aleksandrov
On 17/11/17 08:14, Nikolay Aleksandrov wrote:
> On 17/11/17 07:26, Willy Tarreau wrote:
>> Hi Stephen,
>>
>> On Thu, Nov 16, 2017 at 04:27:18PM -0800, Stephen Hemminger wrote:
>>> On Thu, 16 Nov 2017 21:21:55 +0100
>>> Vincent Bernat  wrote:
>>>
  ? 16 novembre 2017 20:23 +0100, Andrew Lunn  :

> struct net_bridge_fdb_entry is 40 bytes.
>
> My WiFi access point which is also a 5 port bridge, currently has 97MB
> free RAM. That is space for about 2.5M FDB entries. So even Roopa's
> 128K is not really a problem, in terms of memory.  

 I am also interested in Sarah's patch because we can now have bridge
 with many ports through VXLAN. The FDB can be replicated to an external
 daemon with BGP and the cost of each additional MAC address is therefore
 higher than just a few bytes. It seems simpler to implement a limiting
 policy early (at the port or bridge level).

 Also, this is a pretty standard limit to have for a bridge (switchport
 port-security maximum on Cisco, set interface X mac-limit on
 Juniper). And it's not something easy to do with ebtables.
>>>
>>> I want an optional limit per port, it makes a lot of sense.
>>> If for no other reason that huge hash tables are a performance problems.
>>
>> Except its not a limit in that it doesn't prevent new traffic from going
>> in, it only prevents new MACs from being learned, so suddenly you start
>> flooding all ports with new traffic once the limit is reached, which is
>> not trivial to detect nor diagnose.
>>
>>> There is a bigger question about which fdb to evict but just dropping the
>>> new one seems to be easiest and as good as any other solution.
>>
>> Usually it's better to apply LRU or random here in my opinion, as the
>> new entry is much more likely to be needed than older ones by definition.
>> In terms of CPU usage it may even be better to kill an entire series in
>> the hash table (eg: all nodes in the same table entry for example), as
>> the operation can be almost as cheap and result in not being needed for
>> a while again.
>>
>> Willy
>>
> 
> Hi,
> I have been thinking about this and how to try and keep everyone happy
> while maintaining performance, so how about this:
> 
>  * Add a per-port fdb LRU list which is used only when there are >= 2000
>global fdb entries _or_ a per-port limit is set. If the list is in use,
>update the fdb list position once per second. I think these properties will
>allow us to scale and have a better LRU locking granularity (per-port), 
> and in
>smaller setups (not needing LRU) the cost will be a single test in fast 
> path.
> 

It seems I spoke too soon, I just tried a quick and dirty version of this and 
the
per-port LRU becomes a nightmare with the completely lockless fdb dst port 
changes.
So I think going back to the original proposition with new fdb dropping is best
for now, otherwise we need to add some synchronization on fdb port move. In 
general
the per-port locks would be tricky to handle with moving fdbs, so even if we add
a per-port LRU list we'll need some global lock to handle moves and updates in a
simple manner, and I'd like to avoid adding a new bridge lock.

>  * Use the above LRU list for per-port limit evictions
> >  * More importantly use the LRU list for fdb expire, removing the need to 
> > walk
>over all fdbs every time we expire entries. This would be of great help for
>larger setups with many fdbs (it will kick in on >= 2000 fdb entries).
> 

Will have to drop the above point for now, even though I was mostly interested
in that.

>  * (optional) Make the global LRU kick in limit an option, people might want 
> to
>minimize traffic blocking due to expire process.
> 
> Any comments and suggestions are welcome. When we agree on the details I'll do
> the RFC patches and run some tests before submitting. Defaults will be kept as
> they are now. I've chosen the 2000 limit arbitrarily and am happy to change it
> if people have something else in mind. This should play nice with the 
> resizeable
> hashtable change.
> 
> Thanks,
>  Nik
> 
> 
> 
> 
> 



Re: [PATCH] macvlan: verify MTU before lowerdev xmit

2017-11-17 Thread Daniel Axtens
Hi Dave,

> This is an area where we really haven't set down some clear rules
> for behavior.
>
> If an interface has a particular MTU, it must be able to successfully
> send MTU sized packets on that link be it virtual or physical.
>
> Only a "next hop" can have a different MTU and thus drop packets.
> This requirement is absolutely necessary in order for proper
> signalling (path MTU messages) to make their way back to the sending
> host.
>
> In this VM-->macvlan case it's more like a point to point connection
> and there lacks a "next hop" to serve and the provider of proper
> signalling.
>
> This whole situation seems to be handled quite poorly in virtualized
> setups.  Allowing one end of the virtual networking "link" into the
> guest have a different MTU from the other end is a HUGE mistake.

I agree, but users do it, so I'm just trying to figure out the best way
to handle it. Currently the bridge code and openvswitch will detect when
a large packet arrives and drop the packet* - the bridge code with
is_skb_forwardable() and openvswitch with it's own approach in
vport.c. This patch tries to bring macvlan in line with those.

*except for GSO packets - they are assumed to be ok, which isn't always
 true. I am working on some patches for this - but my approach may need
 to be changed in light of what you're saying.

> There needs to be control path signalling between the guest and the
> provider of the virtual link so that they can synchronize their MTU
> settings.

We have these sorts of problems on probably every type of virtual
interface, some of which are easier to deal with than others. I'm
particularly worried about interfaces like ibmveth where the 'other end'
is usually an AIX partition on a big powerpc system. AIX tends to bring
up these interfaces with MTUs of around 64k as well. (This is what
originially got me down the rabbit hole of caring about mismatched-MTU
handling!)

> Yes this is hard, but what is happening now doesn't fly in the long
> term.

I'm at a bit of a loss about how we would do a proper fix. The only
thing that comes to mind would be for the receive routines of the
virtual network interfaces to check the size of incoming packets, but -
if I understand correctly - we would need to know what the maximum size
for a recieved packet should be.

Regards,
Daniel


Re: [PATCH] macvlan: verify MTU before lowerdev xmit

2017-11-17 Thread David Miller
From: Daniel Axtens 
Date: Fri, 17 Nov 2017 19:34:27 +1100

> Hi Dave,
> 
>> This is an area where we really haven't set down some clear rules
>> for behavior.
>>
>> If an interface has a particular MTU, it must be able to successfully
>> send MTU sized packets on that link be it virtual or physical.
>>
>> Only a "next hop" can have a different MTU and thus drop packets.
>> This requirement is absolutely necessary in order for proper
>> signalling (path MTU messages) to make their way back to the sending
>> host.
>>
>> In this VM-->macvlan case it's more like a point to point connection
>> and there lacks a "next hop" to serve and the provider of proper
>> signalling.
>>
>> This whole situation seems to be handled quite poorly in virtualized
>> setups.  Allowing one end of the virtual networking "link" into the
>> guest have a different MTU from the other end is a HUGE mistake.
> 
> I agree, but users do it, so I'm just trying to figure out the best way
> to handle it. Currently the bridge code and openvswitch will detect when
> a large packet arrives and drop the packet* - the bridge code with
> is_skb_forwardable() and openvswitch with it's own approach in
> vport.c. This patch tries to bring macvlan in line with those.
> 
> *except for GSO packets - they are assumed to be ok, which isn't always
>  true. I am working on some patches for this - but my approach may need
>  to be changed in light of what you're saying.

So how exactly do the oversized packets get to the macvlan device from
the VM in this scenerio?  That detail seems to be missing from the
diagrams you provided earlier.  The VM and the macvlan boxes are just
connected with a line.

Thank you.


Re: [PATCH] sfp: Add support for DWDM SFP modules

2017-11-17 Thread Russell King - ARM Linux
On Fri, Nov 17, 2017 at 02:58:05PM +0900, David Miller wrote:
> From: Jan Kundrát 
> Date: Wed, 15 Nov 2017 12:39:33 +0100
> 
> > Without this patch, but with CONFIG_SFP enabled, my NIC won't detect
> > module unplugging, which is suboptimal.
> > 
> > I'm using an OEM "Cisco compatible" DWDM fixed-frequency 100Ghz-grid SFP
> > module. It reports itself as a 0x0b 0x24. According to SFF-8024, byte 0
> > value "0Bh" refers to a "DWDM-SFP/SFP+ (not using SFF-8472)". In
> > practice, there's a lot of shared properties here.
> > 
> > Everything is apparently defined in a document called "DWDM SFP MSA
> > (Multi-source Agreement), Revision 1.0, 19th September 2005". I don't
> > have access to that ocument (yet). Its likely source, the
> > http://www.dwdmsfpmsa.org/ has been down for years.
> > 
> > From the datasheets that I was able to find on random vendors' web, the
> > second byte can vary -- 0x27 is used, too.
> > 
> > Tested on Clearfog Base with v4.14 and Russell King's SFP patches.
> > 
> > Signed-off-by: Jan Kundrát 
> 
> Russell, Florian, Amdrew, can I get a review?

Hi David,

I've been discussing the topic with Jan off-list.

There's nothing wrong per-se with the patch in that it will allow a
DWDM module to be functional, but (eg) the data read by ethtool -m
won't be correct.  The DWDM module EEPROM format is not compatible
with any of the ETH_MODULE_SFF_ formats.  I also need to validate
that the fields we do read from the EEPROM contents in the sfp code
are meaningful for DWDM.

As Jan says, the DWDM MSA isn't available - from what I can tell,
having looked on archive.org and extensive google searching, the MSA
was never made public and dwdmsfpmsa.org has been down for years.
When the site was operational, it looked like you had to be a member
to have access - archive.org has all the legal agreements for that,
but no specification.

However, the module datasheets give some information about the EEPROM
format, which has been useful.

Also, this patch will conflict with Florian's SFF module patch (a
soldered down version of SFP modules).

I already have a stack of patches for phy, phylink and sfp that I
need to send, including documentation patches which Florian has
already found very useful and helpful.  I had assumed that net-next
was already closed, being almost a week into the merge window.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up


[PATCH v2] net: qcom/emac: extend DMA mask to 46bits

2017-11-17 Thread Wang Dongsheng
Since PTP doesn't support yet, so extend the DMA address to 46bits.
When PTP is supported, the dma_mask should be fixed dynamically.

Signed-off-by: Wang Dongsheng 
---
v2:
- Changes PATCH subject.
- Dynamic fix TPD_BUFFER_ADDR_H_SET size.
- Modify DMA MASK to 46bits.
- Add Comments for DMA MASK.
---
 drivers/net/ethernet/qualcomm/emac/emac-mac.c | 9 ++---
 drivers/net/ethernet/qualcomm/emac/emac-mac.h | 5 -
 drivers/net/ethernet/qualcomm/emac/emac.c | 8 ++--
 3 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/qualcomm/emac/emac-mac.c 
b/drivers/net/ethernet/qualcomm/emac/emac-mac.c
index 9cbb2726..db73348 100644
--- a/drivers/net/ethernet/qualcomm/emac/emac-mac.c
+++ b/drivers/net/ethernet/qualcomm/emac/emac-mac.c
@@ -1362,7 +1362,8 @@ static void emac_tx_fill_tpd(struct emac_adapter *adpt,
goto error;
 
TPD_BUFFER_ADDR_L_SET(tpd, lower_32_bits(tpbuf->dma_addr));
-   TPD_BUFFER_ADDR_H_SET(tpd, upper_32_bits(tpbuf->dma_addr));
+   TPD_BUFFER_ADDR_H_SET(adpt, tpd,
+ upper_32_bits(tpbuf->dma_addr));
TPD_BUF_LEN_SET(tpd, tpbuf->length);
emac_tx_tpd_create(adpt, tx_q, tpd);
count++;
@@ -1380,7 +1381,8 @@ static void emac_tx_fill_tpd(struct emac_adapter *adpt,
goto error;
 
TPD_BUFFER_ADDR_L_SET(tpd, lower_32_bits(tpbuf->dma_addr));
-   TPD_BUFFER_ADDR_H_SET(tpd, upper_32_bits(tpbuf->dma_addr));
+   TPD_BUFFER_ADDR_H_SET(adpt, tpd,
+ upper_32_bits(tpbuf->dma_addr));
TPD_BUF_LEN_SET(tpd, tpbuf->length);
emac_tx_tpd_create(adpt, tx_q, tpd);
count++;
@@ -1402,7 +1404,8 @@ static void emac_tx_fill_tpd(struct emac_adapter *adpt,
goto error;
 
TPD_BUFFER_ADDR_L_SET(tpd, lower_32_bits(tpbuf->dma_addr));
-   TPD_BUFFER_ADDR_H_SET(tpd, upper_32_bits(tpbuf->dma_addr));
+   TPD_BUFFER_ADDR_H_SET(adpt, tpd,
+ upper_32_bits(tpbuf->dma_addr));
TPD_BUF_LEN_SET(tpd, tpbuf->length);
emac_tx_tpd_create(adpt, tx_q, tpd);
count++;
diff --git a/drivers/net/ethernet/qualcomm/emac/emac-mac.h 
b/drivers/net/ethernet/qualcomm/emac/emac-mac.h
index 5028fb4..31c1f21 100644
--- a/drivers/net/ethernet/qualcomm/emac/emac-mac.h
+++ b/drivers/net/ethernet/qualcomm/emac/emac-mac.h
@@ -115,7 +115,10 @@ struct emac_tpd {
 /* High-14bit Buffer Address, So, the 64b-bit address is
  * {DESC_CTRL_11_TX_DATA_HIADDR[17:0],(register) BUFFER_ADDR_H, BUFFER_ADDR_L}
  */
-#define TPD_BUFFER_ADDR_H_SET(tpd, val)BITS_SET((tpd)->word[3], 18, 
30, val)
+#define TPD_BUFFER_ADDR_H_SET(adpt, tpd, val)  BITS_SET((tpd)->word[3], 18, \
+   TX_TS_ENABLE & \
+   readl((adpt)->csr + EMAC_EMAC_WRAPPER_CSR1) ? \
+   30 : 31, val)
 /* Format D. Word offset from the 1st byte of this packet to start to calculate
  * the custom checksum.
  */
diff --git a/drivers/net/ethernet/qualcomm/emac/emac.c 
b/drivers/net/ethernet/qualcomm/emac/emac.c
index 70c92b6..de5a4fb 100644
--- a/drivers/net/ethernet/qualcomm/emac/emac.c
+++ b/drivers/net/ethernet/qualcomm/emac/emac.c
@@ -615,8 +615,12 @@ static int emac_probe(struct platform_device *pdev)
u32 reg;
int ret;
 
-   /* The TPD buffer address is limited to 45 bits. */
-   ret = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(45));
+   /*
+* The TPD buffer address is limited to:
+* 1. PTP:  45bits. (Driver doesn't support yet.)
+* 2. NON-PTP:  46bits.
+*/
+   ret = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(46));
if (ret) {
dev_err(&pdev->dev, "could not set DMA mask\n");
return ret;
-- 
2.7.4



Re: Bisected 4.14 Regression: IPsec transport mode breakage

2017-11-17 Thread Steffen Klassert
On Wed, Nov 15, 2017 at 09:46:19AM -0700, Kevin Locke wrote:
> Hi all,
> 
> I am using an L2TP/IPsec (transport mode) VPN connection from a client
> behind a NAT running Debian with strongswan 5.6.0-2 and xl2tpd
> 1.3.10-1 to a Cisco Meraki MX60 with a public IP.  The connection
> works with kernel 4.13 but not with kernel 4.14.  With 4.14 the IPsec
> connection appears to be established correctly but xl2tpd is unable to
> establish the L2TP connection.  The relevant error from syslog is:
> 
> charon: 09[KNL] creating acquire job for policy 192.168.21.10/32[udp/l2f] === 
> X.X.X.X/32[udp/l2f] with reqid {1}
> charon: 12[CFG] trap not found, unable to acquire reqid 1
> 
> I have bisected the issue to commit c9f3f813d462.  I have attached the
> client ipsec.conf as well as the syslog during the connection attempt
> for both c9f3f813d462 (bad) and cf3796675174 (good).  Meraki IPs have
> been redacted to protect the innocent.
> 
> I'd appreciate any assistance in fixing the issue.  Let me know if
> there's anything else I can do to help troubleshoot or test.

The offending commit is already reverted in the 'net' tree
and will be available in mainline soon.

Thanks for the report!


[PATCH] [net] ibmvnic: fix dma_mapping_error call

2017-11-17 Thread Desnes Augusto Nunes do Rosario
This patch fixes the dma_mapping_error call to use the correct dma_addr
which is inside the ibmvnic_vpd struct. Moreover, it fixes an uninitialized
warning regarding a local dma_addr variable which is not used anymore.

Fixes: 4e6759be28e4 ("ibmvnic: Feature implementation of VPD for the ibmvnic 
driver")
Reported-by: Stephen Rothwell 
Signed-off-by: Desnes A. Nunes do Rosario 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index 04aaacb..1dc4aef 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -849,7 +849,6 @@ static int ibmvnic_get_vpd(struct ibmvnic_adapter *adapter)
 {
struct device *dev = &adapter->vdev->dev;
union ibmvnic_crq crq;
-   dma_addr_t dma_addr;
int len = 0;
 
if (adapter->vpd->buff)
@@ -879,7 +878,7 @@ static int ibmvnic_get_vpd(struct ibmvnic_adapter *adapter)
adapter->vpd->dma_addr =
dma_map_single(dev, adapter->vpd->buff, adapter->vpd->len,
   DMA_FROM_DEVICE);
-   if (dma_mapping_error(dev, dma_addr)) {
+   if (dma_mapping_error(dev, adapter->vpd->dma_addr)) {
dev_err(dev, "Could not map VPD buffer\n");
kfree(adapter->vpd->buff);
return -ENOMEM;
-- 
2.9.5



Re: [PATCH] macvlan: verify MTU before lowerdev xmit

2017-11-17 Thread Daniel Axtens
Hi Dave,

> So how exactly do the oversized packets get to the macvlan device from
> the VM in this scenerio?  That detail seems to be missing from the
> diagrams you provided earlier.  The VM and the macvlan boxes are just
> connected with a line.

Inside the VM I'm using netperf talking on an interface which the guest
believes to have a MTU of 1500.

I'm setting up the VM using libvirt - my understanding is that libvirt
creates a macvtap device in private mode, and qemu opens that tap device
and writes data from the emulated network card (I see the same behaviour
with a emulated rtl8139, e1000, and with virtio).

I think I could replicate this with any userspace program - qemu is just
the easiest for me at the moment.

Hopefully that's what you had in mind? Let me know if you wanted
different info.

Regards,
Daniel

[The gory details, in case it matters: The VM has a network adaptor with
the following XML:


  
  
  
  

]

>
> Thank you.


Re: [PATCH 26/31] nds32: Build infrastructure

2017-11-17 Thread Greentime Hu
2017-11-10 16:26 GMT+08:00 Greentime Hu :
> 2017-11-09 18:33 GMT+08:00 Arnd Bergmann :
>> On Thu, Nov 9, 2017 at 10:02 AM, Greentime Hu  wrote:
>>> 2017-11-08 18:16 GMT+08:00 Arnd Bergmann :
 On Wed, Nov 8, 2017 at 6:55 AM, Greentime Hu  wrote:
>>
> +config GENERIC_CALIBRATE_DELAY
> +   def_bool y

 It's better to avoid the delay loop completely and skip the calibration,
 if your hardware allows.
>>>
>>> Thanks.
>>> Do you mean that this config should be def_bool n?
>>> why? Almost all arch enable it.
>>
>> It depends on what your hardware can do. If you have a way to see how much
>> time has passed that is guaranteed to be reliable on all machines, then
>> use that instead.
>>
>> On a lot of architectures, it's not possible, so they have to fall back to 
>> using
>> the delay loop.
>
> I get it. I will discuss it with our HW colleagues.
> We may get these informations in some registers.

Hi, Arnd:

I think I can't set it to default n because it will be called in start_kernel.

start_kernel() -> calibrate_delay()

If I don't enable this config, it will link error because it didn't
build init/calibrate.c


Re: [PATCH 26/31] nds32: Build infrastructure

2017-11-17 Thread Arnd Bergmann
On Fri, Nov 17, 2017 at 1:39 PM, Greentime Hu  wrote:
> 2017-11-10 16:26 GMT+08:00 Greentime Hu :
>> 2017-11-09 18:33 GMT+08:00 Arnd Bergmann :
>>> On Thu, Nov 9, 2017 at 10:02 AM, Greentime Hu  wrote:
 2017-11-08 18:16 GMT+08:00 Arnd Bergmann :
> On Wed, Nov 8, 2017 at 6:55 AM, Greentime Hu  wrote:
>>>
>> +config GENERIC_CALIBRATE_DELAY
>> +   def_bool y
>
> It's better to avoid the delay loop completely and skip the calibration,
> if your hardware allows.

 Thanks.
 Do you mean that this config should be def_bool n?
 why? Almost all arch enable it.
>>>
>>> It depends on what your hardware can do. If you have a way to see how much
>>> time has passed that is guaranteed to be reliable on all machines, then
>>> use that instead.
>>>
>>> On a lot of architectures, it's not possible, so they have to fall back to 
>>> using
>>> the delay loop.
>>
>> I get it. I will discuss it with our HW colleagues.
>> We may get these informations in some registers.
>
> Hi, Arnd:
>
> I think I can't set it to default n because it will be called in start_kernel.
>
> start_kernel() -> calibrate_delay()
>
> If I don't enable this config, it will link error because it didn't
> build init/calibrate.c

You will have to provide an architecture-specific implementation of
this function, please ahve a look at what others are doing, e.g. tile,
openrisc or
h8300.

  Arnd


Re: [PATCH] ARM64: dts: meson-axg: add ethernet mac controller

2017-11-17 Thread Neil Armstrong
On 13/11/2017 09:01, Yixun Lan wrote:
> Add DT info for the stmmac ethernet MAC which found in
> the Amlogic's Meson-AXG SoC, also describe the ethernet
> pinctrl & clock information here.
> 
> This is tested in the S400 dev board which use a RTL8211F PHY,
> and the pins connect to the 'eth_rgmii_y_pins' group.
> 
> Signed-off-by: Yixun Lan 
> ---
>  arch/arm64/boot/dts/amlogic/meson-axg-s400.dts |  7 
>  arch/arm64/boot/dts/amlogic/meson-axg.dtsi | 53 
> ++
>  2 files changed, 60 insertions(+)
> 
> diff --git a/arch/arm64/boot/dts/amlogic/meson-axg-s400.dts 
> b/arch/arm64/boot/dts/amlogic/meson-axg-s400.dts
> index 9eb6aaee155d..7b39a9fe2b0f 100644
> --- a/arch/arm64/boot/dts/amlogic/meson-axg-s400.dts
> +++ b/arch/arm64/boot/dts/amlogic/meson-axg-s400.dts
> @@ -21,3 +21,10 @@
>   status = "okay";
>  };
>  
> +ðmac {
> + status = "okay";
> + phy-mode = "rgmii";
> +
> + pinctrl-0 = <ð_rgmii_y_pins>;
> + pinctrl-names = "default";
> +};
> diff --git a/arch/arm64/boot/dts/amlogic/meson-axg.dtsi 
> b/arch/arm64/boot/dts/amlogic/meson-axg.dtsi
> index 65945c6c8b65..57faaa9d8013 100644
> --- a/arch/arm64/boot/dts/amlogic/meson-axg.dtsi
> +++ b/arch/arm64/boot/dts/amlogic/meson-axg.dtsi
> @@ -157,6 +157,19 @@
>   #address-cells = <0>;
>   };
>  
> + ethmac: ethernet@ff3f {
> + compatible = "amlogic,meson-axg-dwmac", 
> "amlogic,meson-gxbb-dwmac", "snps,dwmac";
> + reg = <0x0 0xff3f 0x0 0x1
> + 0x0 0xff634540 0x0 0x8>;
> + interrupts = ;
> + interrupt-names = "macirq";
> + status = "disabled";
> + clocks = <&clkc CLKID_ETH>,
> +  <&clkc CLKID_FCLK_DIV2>,
> +  <&clkc CLKID_MPLL2>;
> + clock-names = "stmmaceth", "clkin0", "clkin1";
> + };
> +
>   hiubus: hiubus@ff63c000 {
>   compatible = "simple-bus";
>   reg = <0x0 0xff63c000 0x0 0x1c00>;
> @@ -203,6 +216,46 @@
>   #gpio-cells = <2>;
>   gpio-ranges = <&pinctrl_periphs 0 0 86>;
>   };
> +
> + eth_rgmii_x_pins: eth-x-rgmii {
> + mux {
> + groups = "eth_mdio_x",
> +"eth_mdc_x",
> +"eth_rgmii_rx_clk_x",
> +"eth_rx_dv_x",
> +"eth_rxd0_x",
> +"eth_rxd1_x",
> +"eth_rxd2_rgmii",
> +"eth_rxd3_rgmii",
> +"eth_rgmii_tx_clk",
> +"eth_txen_x",
> +"eth_txd0_x",
> +"eth_txd1_x",
> +"eth_txd2_rgmii",
> +"eth_txd3_rgmii";
> + function = "eth";
> + };
> + };
> +
> + eth_rgmii_y_pins: eth-y-rgmii {
> + mux {
> + groups = "eth_mdio_y",
> +"eth_mdc_y",
> +"eth_rgmii_rx_clk_y",
> +"eth_rx_dv_y",
> +"eth_rxd0_y",
> +"eth_rxd1_y",
> +"eth_rxd2_rgmii",
> +"eth_rxd3_rgmii",
> +"eth_rgmii_tx_clk",
> +"eth_txen_y",
> +"eth_txd0_y",
> +"eth_txd1_y",
> +"eth_txd2_rgmii",
> +"eth_txd3_rgmii";
> + function = "eth";
> + };
> + };
>   };
>   };
>  
> 

Reviewed-by: Neil Armstrong 


[PATCH net-next v2 0/2] net: thunderx: add support for PTP clock

2017-11-17 Thread Aleksey Makarov
This series adds support for IEEE 1588 Precision Time Protocol
to Cavium ethernet driver.

The first patch adds support for the Precision Time Protocol Clocks and
Timestamping coprocessor (PTP) found on Cavium processors.
It registers a new PTP clock in the PTP core and provides functions
to use the counter in BGX, TNS, GTI, and NIC blocks.

The second patch introduces support for the PTP protocol to the
Cavium ThunderX ethernet driver.

v2:
- use readq()/writeq() in place of cavium_ptp_reg_read()/cavium_ptp_reg_write(),
  don't use readq_relaxed()/writeq_relaxed() (David Daney)

v1:
  https://lkml.kernel.org/r/20171107190704.15458-1-aleksey.maka...@cavium.com

Radoslaw Biernacki (1):
  net: add support for Cavium PTP coprocessor

Sunil Goutham (1):
  net: thunderx: add timestamping support

 drivers/net/ethernet/cavium/Kconfig|  14 +
 drivers/net/ethernet/cavium/Makefile   |   1 +
 drivers/net/ethernet/cavium/common/Makefile|   1 +
 drivers/net/ethernet/cavium/common/cavium_ptp.c| 334 +
 drivers/net/ethernet/cavium/common/cavium_ptp.h|  78 +
 drivers/net/ethernet/cavium/thunder/nic.h  |  15 +
 drivers/net/ethernet/cavium/thunder/nic_main.c |  58 +++-
 drivers/net/ethernet/cavium/thunder/nic_reg.h  |   1 +
 .../net/ethernet/cavium/thunder/nicvf_ethtool.c|  29 +-
 drivers/net/ethernet/cavium/thunder/nicvf_main.c   | 173 ++-
 drivers/net/ethernet/cavium/thunder/nicvf_queues.c |  26 ++
 drivers/net/ethernet/cavium/thunder/thunder_bgx.c  |  29 ++
 drivers/net/ethernet/cavium/thunder/thunder_bgx.h  |   4 +
 13 files changed, 757 insertions(+), 6 deletions(-)
 create mode 100644 drivers/net/ethernet/cavium/common/Makefile
 create mode 100644 drivers/net/ethernet/cavium/common/cavium_ptp.c
 create mode 100644 drivers/net/ethernet/cavium/common/cavium_ptp.h

-- 
2.15.0



[PATCH net-next v2 1/2] net: add support for Cavium PTP coprocessor

2017-11-17 Thread Aleksey Makarov
From: Radoslaw Biernacki 

This patch adds support for the Precision Time Protocol
Clocks and Timestamping hardware found on Cavium ThunderX
processors.

Signed-off-by: Radoslaw Biernacki 
Signed-off-by: Aleksey Makarov 
---
 drivers/net/ethernet/cavium/Kconfig |  13 +
 drivers/net/ethernet/cavium/Makefile|   1 +
 drivers/net/ethernet/cavium/common/Makefile |   1 +
 drivers/net/ethernet/cavium/common/cavium_ptp.c | 334 
 drivers/net/ethernet/cavium/common/cavium_ptp.h |  78 ++
 5 files changed, 427 insertions(+)
 create mode 100644 drivers/net/ethernet/cavium/common/Makefile
 create mode 100644 drivers/net/ethernet/cavium/common/cavium_ptp.c
 create mode 100644 drivers/net/ethernet/cavium/common/cavium_ptp.h

diff --git a/drivers/net/ethernet/cavium/Kconfig 
b/drivers/net/ethernet/cavium/Kconfig
index 63be75eb34d2..fabe0ffcc2d4 100644
--- a/drivers/net/ethernet/cavium/Kconfig
+++ b/drivers/net/ethernet/cavium/Kconfig
@@ -50,6 +50,19 @@ config   THUNDER_NIC_RGX
  This driver supports configuring XCV block of RGX interface
  present on CN81XX chip.
 
+config CAVIUM_PTP
+   tristate "Cavium PTP coprocessor as PTP clock"
+   depends on 64BIT
+   depends on PTP_1588_CLOCK
+   select CAVIUM_RST
+   default y
+   ---help---
+ This driver adds support for the Precision Time Protocol Clocks and
+ Timestamping coprocessor (PTP) found on Cavium processors.
+ PTP provides timestamping mechanism that is suitable for use in IEEE 
1588
+ Precision Time Protocol or other purposes.  Timestamps can be used in
+ BGX, TNS, GTI, and NIC blocks.
+
 config LIQUIDIO
tristate "Cavium LiquidIO support"
depends on 64BIT
diff --git a/drivers/net/ethernet/cavium/Makefile 
b/drivers/net/ethernet/cavium/Makefile
index 872da9f7c31a..946bba84e81d 100644
--- a/drivers/net/ethernet/cavium/Makefile
+++ b/drivers/net/ethernet/cavium/Makefile
@@ -1,6 +1,7 @@
 #
 # Makefile for the Cavium ethernet device drivers.
 #
+obj-$(CONFIG_NET_VENDOR_CAVIUM) += common/
 obj-$(CONFIG_NET_VENDOR_CAVIUM) += thunder/
 obj-$(CONFIG_NET_VENDOR_CAVIUM) += liquidio/
 obj-$(CONFIG_NET_VENDOR_CAVIUM) += octeon/
diff --git a/drivers/net/ethernet/cavium/common/Makefile 
b/drivers/net/ethernet/cavium/common/Makefile
new file mode 100644
index ..dd8561b8060b
--- /dev/null
+++ b/drivers/net/ethernet/cavium/common/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_CAVIUM_PTP) += cavium_ptp.o
diff --git a/drivers/net/ethernet/cavium/common/cavium_ptp.c 
b/drivers/net/ethernet/cavium/common/cavium_ptp.c
new file mode 100644
index ..f4c738db27fd
--- /dev/null
+++ b/drivers/net/ethernet/cavium/common/cavium_ptp.c
@@ -0,0 +1,334 @@
+/*
+ * cavium_ptp.c - PTP 1588 clock on Cavium hardware
+ *
+ * Copyright (c) 2003-2015, 2017 Cavium, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License.
+ *
+ * This file may also be available under a different license from Cavium.
+ * Contact Cavium, Inc. for more information
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+#include "cavium_ptp.h"
+
+#define DRV_NAME   "Cavium PTP Driver"
+
+#define PCI_DEVICE_ID_CAVIUM_PTP   0xA00C
+#define PCI_DEVICE_ID_CAVIUM_RST   0xA00E
+
+#define PCI_PTP_BAR_NO 0
+#define PCI_RST_BAR_NO 0
+
+#define PTP_CLOCK_CFG  0xF00ULL
+#define  PTP_CLOCK_CFG_PTP_EN  BIT(0)
+#define PTP_CLOCK_LO   0xF08ULL
+#define PTP_CLOCK_HI   0xF10ULL
+#define PTP_CLOCK_COMP 0xF18ULL
+
+#define RST_BOOT   0x1600ULL
+#define CLOCK_BASE_RATE5000ULL
+
+static u64 ptp_cavium_clock_get(void)
+{
+   struct pci_dev *pdev;
+   void __iomem *base;
+   u64 ret = CLOCK_BASE_RATE * 16;
+
+   pdev = pci_get_device(PCI_VENDOR_ID_CAVIUM,
+ PCI_DEVICE_ID_CAVIUM_RST, NULL);
+   if (!pdev)
+   goto error;
+
+   base = pci_ioremap_bar(pdev, PCI_RST_BAR_NO);
+   if (!base)
+   goto error_put_pdev;
+
+   ret = CLOCK_BASE_RATE * ((readq(base + RST_BOOT) >> 33) & 0x3f);
+
+   iounmap(base);
+
+error_put_pdev:
+   pci_dev_put(pdev);
+
+error:
+   return ret;
+}
+
+struct cavium_ptp *cavium_ptp_get(void)
+{
+   struct cavium_ptp *ptp;
+   struct pci_dev *pdev;
+
+   pdev = pci_get_device(PCI_VENDOR_ID_CAVIUM,
+ PCI_DEVICE_ID_CAVIUM_PTP, NULL);
+   if (!pdev)
+   return ERR_PTR(-ENODEV);
+
+   ptp = pci_get_drvdata(pdev);
+   if (!ptp) {
+   pci_dev_put(pdev);
+   ptp = ERR_PTR(-EPROBE_DEFER);
+   }
+
+   return ptp;
+}
+EXPORT_SYMBOL(cavium_ptp_get);
+
+void cavium_ptp_put(struct cavium_ptp *ptp)
+{
+   pci_dev_put(ptp->pdev);
+}
+EXPORT_SYMBOL(cavium_ptp_put);
+
+/**
+

[PATCH net-next v2 2/2] net: thunderx: add timestamping support

2017-11-17 Thread Aleksey Makarov
From: Sunil Goutham 

This adds timestamping support for both receive and transmit
paths. On the receive side no filters are supported i.e either
all pkts will get a timestamp appended infront of the packet or none.
On the transmit side HW doesn't support timestamp insertion but
only generates a separate CQE with transmitted packet's timestamp.
Also HW supports only one packet at a time for timestamping on the
transmit side.

Signed-off-by: Sunil Goutham 
Signed-off-by: Aleksey Makarov 
---
 drivers/net/ethernet/cavium/Kconfig|   1 +
 drivers/net/ethernet/cavium/thunder/nic.h  |  15 ++
 drivers/net/ethernet/cavium/thunder/nic_main.c |  58 ++-
 drivers/net/ethernet/cavium/thunder/nic_reg.h  |   1 +
 .../net/ethernet/cavium/thunder/nicvf_ethtool.c|  29 +++-
 drivers/net/ethernet/cavium/thunder/nicvf_main.c   | 173 -
 drivers/net/ethernet/cavium/thunder/nicvf_queues.c |  26 
 drivers/net/ethernet/cavium/thunder/thunder_bgx.c  |  29 
 drivers/net/ethernet/cavium/thunder/thunder_bgx.h  |   4 +
 9 files changed, 330 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/cavium/Kconfig 
b/drivers/net/ethernet/cavium/Kconfig
index fabe0ffcc2d4..6d003cbe197c 100644
--- a/drivers/net/ethernet/cavium/Kconfig
+++ b/drivers/net/ethernet/cavium/Kconfig
@@ -27,6 +27,7 @@ config THUNDER_NIC_PF
 
 config THUNDER_NIC_VF
tristate "Thunder Virtual function driver"
+   select CAVIUM_PTP
depends on 64BIT
---help---
  This driver supports Thunder's NIC virtual function
diff --git a/drivers/net/ethernet/cavium/thunder/nic.h 
b/drivers/net/ethernet/cavium/thunder/nic.h
index 4a02e618e318..204b234beb9d 100644
--- a/drivers/net/ethernet/cavium/thunder/nic.h
+++ b/drivers/net/ethernet/cavium/thunder/nic.h
@@ -263,6 +263,8 @@ struct nicvf_drv_stats {
struct u64_stats_sync   syncp;
 };
 
+struct cavium_ptp;
+
 struct nicvf {
struct nicvf*pnicvf;
struct net_device   *netdev;
@@ -312,6 +314,12 @@ struct nicvf {
struct tasklet_struct   qs_err_task;
struct work_struct  reset_task;
 
+   /* PTP timestamp */
+   struct cavium_ptp   *ptp_clock;
+   boolhw_rx_tstamp;
+   struct sk_buff  *ptp_skb;
+   atomic_ttx_ptp_skbs;
+
/* Interrupt coalescing settings */
u32 cq_coalesce_usecs;
u32 msg_enable;
@@ -371,6 +379,7 @@ struct nicvf {
 #defineNIC_MBOX_MSG_LOOPBACK   0x16/* Set interface in 
loopback */
 #defineNIC_MBOX_MSG_RESET_STAT_COUNTER 0x17/* Reset statistics 
counters */
 #defineNIC_MBOX_MSG_PFC0x18/* Pause frame control 
*/
+#defineNIC_MBOX_MSG_PTP_CFG0x19/* HW packet timestamp 
*/
 #defineNIC_MBOX_MSG_CFG_DONE   0xF0/* VF configuration 
done */
 #defineNIC_MBOX_MSG_SHUTDOWN   0xF1/* VF is being shutdown 
*/
 
@@ -521,6 +530,11 @@ struct pfc {
u8fc_tx;
 };
 
+struct set_ptp {
+   u8msg;
+   bool  enable;
+};
+
 /* 128 bit shared memory between PF and each VF */
 union nic_mbx {
struct { u8 msg; }  msg;
@@ -540,6 +554,7 @@ union nic_mbx {
struct set_loopback lbk;
struct reset_stat_cfg   reset_stat;
struct pfc  pfc;
+   struct set_ptp  ptp;
 };
 
 #define NIC_NODE_ID_MASK   0x03
diff --git a/drivers/net/ethernet/cavium/thunder/nic_main.c 
b/drivers/net/ethernet/cavium/thunder/nic_main.c
index 8f1dd55b3e08..4c1c5414a162 100644
--- a/drivers/net/ethernet/cavium/thunder/nic_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nic_main.c
@@ -426,13 +426,22 @@ static void nic_init_hw(struct nicpf *nic)
/* Enable backpressure */
nic_reg_write(nic, NIC_PF_BP_CFG, (1ULL << 6) | 0x03);
 
-   /* TNS and TNS bypass modes are present only on 88xx */
+   /* TNS and TNS bypass modes are present only on 88xx
+* Also offset of this CSR has changed in 81xx and 83xx.
+*/
if (nic->pdev->subsystem_device == PCI_SUBSYS_DEVID_88XX_NIC_PF) {
/* Disable TNS mode on both interfaces */
nic_reg_write(nic, NIC_PF_INTF_0_1_SEND_CFG,
- (NIC_TNS_BYPASS_MODE << 7) | BGX0_BLOCK);
+ (NIC_TNS_BYPASS_MODE << 7) |
+ BGX0_BLOCK | (1ULL << 16));
nic_reg_write(nic, NIC_PF_INTF_0_1_SEND_CFG | (1 << 8),
- (NIC_TNS_BYPASS_MODE << 7) | BGX1_BLOCK);
+ (NIC_TNS_BYPASS_MODE << 7) |
+ BGX1_BLOCK | (1ULL << 16));
+   } else {
+   /* Configure timestamp generation timeout to 10us */
+   for (i = 0; i < nic->hw->bgx_cnt; i++)
+   nic_reg_write(nic, NIC_PF_INTFX_SEND_CFG | (i <<

Re: [PATCH 26/31] nds32: Build infrastructure

2017-11-17 Thread Greentime Hu
2017-11-17 20:50 GMT+08:00 Arnd Bergmann :
> On Fri, Nov 17, 2017 at 1:39 PM, Greentime Hu  wrote:
>> 2017-11-10 16:26 GMT+08:00 Greentime Hu :
>>> 2017-11-09 18:33 GMT+08:00 Arnd Bergmann :
 On Thu, Nov 9, 2017 at 10:02 AM, Greentime Hu  wrote:
> 2017-11-08 18:16 GMT+08:00 Arnd Bergmann :
>> On Wed, Nov 8, 2017 at 6:55 AM, Greentime Hu  wrote:

>>> +config GENERIC_CALIBRATE_DELAY
>>> +   def_bool y
>>
>> It's better to avoid the delay loop completely and skip the calibration,
>> if your hardware allows.
>
> Thanks.
> Do you mean that this config should be def_bool n?
> why? Almost all arch enable it.

 It depends on what your hardware can do. If you have a way to see how much
 time has passed that is guaranteed to be reliable on all machines, then
 use that instead.

 On a lot of architectures, it's not possible, so they have to fall back to 
 using
 the delay loop.
>>>
>>> I get it. I will discuss it with our HW colleagues.
>>> We may get these informations in some registers.
>>
>> Hi, Arnd:
>>
>> I think I can't set it to default n because it will be called in 
>> start_kernel.
>>
>> start_kernel() -> calibrate_delay()
>>
>> If I don't enable this config, it will link error because it didn't
>> build init/calibrate.c
>
> You will have to provide an architecture-specific implementation of
> this function, please ahve a look at what others are doing, e.g. tile,
> openrisc or
> h8300.

Thanks. I got you.


Re: GRO disabled with IPv4 options

2017-11-17 Thread Willem de Bruijn
On Thu, Nov 16, 2017 at 5:47 PM, Tom Herbert  wrote:
> On Thu, Nov 16, 2017 at 1:40 PM, Herbert Xu  
> wrote:
>> On Thu, Nov 16, 2017 at 04:12:43PM +0100, Cristian Klein wrote:
>>>
>>> Does somebody know the rationale for this? Is it because IPv4
>>> options are rarely used, hence implementing GRO in that case does
>>> not pay off or are there some caveats? Specifically would it make
>>
>> Precisely.  GRO is about optimising for the common case.  At the
>> time there was no impetus to support IP options.
>>
>>> sense to do GRO when the IPv4 options are byte-identical in
>>> consecutive packets?
>>
>> Yes there is no reason why we can't do this.  As long as it doesn't
>> penalise the non-IP-option case too much.
>>
> Of course it would also be nice to have GRO support for various IPv6
> extension headers, at this point we're more likely to see those rather
> than IP options in real deployment!

ipv6_gro_receive already pulls common ones with ipv6_gso_pull_exthdrs.
And to add a counterpoint: GRO has to be resilient to malformed packets,
so there is value in keeping it simple and limited to the common case.


Re: [PATCH 26/31] nds32: Build infrastructure

2017-11-17 Thread Greentime Hu
2017-11-16 18:25 GMT+08:00 Arnd Bergmann :
> On Thu, Nov 16, 2017 at 11:03 AM, Greentime Hu  wrote:
>> 2017-11-13 18:45 GMT+08:00 Geert Uytterhoeven :
>>> Given the checks for __NDS32_EB__, NDS32 can be either big or little endian,
>>> so you should have (excatly one of) CPU_BIG_ENDIAN or CPU_LITTLE_ENDIAN set.
>>
>> Thanks.
>> I will check if we need this config or not and update in the next version 
>> patch.
>
> I think we have one architecture in the kernel that determines endianess from
> the way that the toolchain is built. What all the others do it to have a 
> Kconfig
> option, at least CONFIG_CPU_BIG_ENDIAN that is used to pass -mbig-endian
> or -mlittle-endian to the compiler. You should do it that way so you can use
> any toolchain with any kernel configuration.

Thanks
I will add these 2 configs in the next version patch.
Pass -EL or -EB to compiler based on different configs.


Re: GRO disabled with IPv4 options

2017-11-17 Thread David Miller
From: Willem de Bruijn 
Date: Fri, 17 Nov 2017 08:51:57 -0500

> ipv6_gro_receive already pulls common ones with ipv6_gso_pull_exthdrs.
> And to add a counterpoint: GRO has to be resilient to malformed packets,
> so there is value in keeping it simple and limited to the common case.

Agreed.

Also it is not exactly clear what should happen with all IP option
types when reconstructing the original packet stream on the GSO side.

It just sounds incredibly complicated to me for little to no gain.


Re: [PATCH] net: bridge: add max_fdb_count

2017-11-17 Thread Andrew Lunn
> Usually it's better to apply LRU or random here in my opinion, as the
> new entry is much more likely to be needed than older ones by definition.

Hi Willy

I think this depends on why you need to discard. If it is normal
operation and the limits are simply too low, i would agree.

If however it is a DoS, throwing away the new entries makes sense,
leaving the old ones which are more likely to be useful.

Most of the talk in this thread has been about limits for DoS
prevention...

Andrew



[PATCH] net: usb: hso.c: remove unneeded DRIVER_LICENSE #define

2017-11-17 Thread Greg Kroah-Hartman
There is no need to #define the license of the driver, just put it in
the MODULE_LICENSE() line directly as a text string.

This allows tools that check that the module license matches the source
code license to work properly, as there is no need to unwind the
unneeded dereference.

Cc: "David S. Miller" 
Cc: Andreas Kemnade 
Cc: Johan Hovold 
Reported-by: Philippe Ombredanne 
Signed-off-by: Greg Kroah-Hartman 
---
 drivers/net/usb/hso.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/usb/hso.c b/drivers/net/usb/hso.c
index d7a3379ea668..f40d1423cf46 100644
--- a/drivers/net/usb/hso.c
+++ b/drivers/net/usb/hso.c
@@ -76,7 +76,6 @@
 
 #define MOD_AUTHOR "Option Wireless"
 #define MOD_DESCRIPTION"USB High Speed Option driver"
-#define MOD_LICENSE"GPL"
 
 #define HSO_MAX_NET_DEVICES10
 #define HSO__MAX_MTU   2048
@@ -3288,7 +3287,7 @@ module_exit(hso_exit);
 
 MODULE_AUTHOR(MOD_AUTHOR);
 MODULE_DESCRIPTION(MOD_DESCRIPTION);
-MODULE_LICENSE(MOD_LICENSE);
+MODULE_LICENSE("GPL");
 
 /* change the debug level (eg: insmod hso.ko debug=0x04) */
 MODULE_PARM_DESC(debug, "debug level mask [0x01 | 0x02 | 0x04 | 0x08 | 0x10]");
-- 
2.15.0



Re: [PATCH v2] net: qcom/emac: extend DMA mask to 46bits

2017-11-17 Thread Timur Tabi

On 11/17/17 3:56 AM, Wang Dongsheng wrote:

-#define TPD_BUFFER_ADDR_H_SET(tpd, val)BITS_SET((tpd)->word[3], 18, 
30, val)
+#define TPD_BUFFER_ADDR_H_SET(adpt, tpd, val)  BITS_SET((tpd)->word[3], 18, \
+   TX_TS_ENABLE & \
+   readl((adpt)->csr + EMAC_EMAC_WRAPPER_CSR1) ? \
+   30 : 31, val)


NAK.

Sorry, but this is terrible.  Every write to the TPD forces a secret 
read from another register?  No thank you.


Just do what you had in v1, but also set the DMA mask.  Add a comment 
saying that we can support 46 bits because we never enable timestamping.


--
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc.  Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.


Re: regression: UFO removal breaks kvm live migration

2017-11-17 Thread Willem de Bruijn
On Fri, Nov 10, 2017 at 12:32 AM, Willem de Bruijn
 wrote:
> On Wed, Nov 8, 2017 at 9:53 PM, Jason Wang  wrote:
>>
>>
>> On 2017年11月08日 20:32, David Miller wrote:
>>>
>>> From: Jason Wang 
>>> Date: Wed, 8 Nov 2017 17:25:48 +0900
>>>
 On 2017年11月08日 17:08, Willem de Bruijn wrote:
>
> That won't help in the short term. I'm still reading up to see if
> there are
> any other options besides reimplement or advertise-but-drop, such as
> an implicit trigger that would make the guest renegotiate. It's
> unlikely, but
> worth a look..

 Yes, this looks hard. And even if we can manage to do this, it looks
 an overkill since it will impact all guest after migration.
>>>
>>> Like Willem I would much prefer "advertise-but-drop" if it works.
>>
>>
>> This makes migration work but all guest UFO traffic will stall.
>>
>>>
>>> In the long term feature renegotiation triggers are a must.
>>>
>>> There is no way for us to remove features otherwise.
>>
>>
>> We can remove if we don't break userspace(guest).
>>
>>> In my opinion
>>> this will even make migrations more powerful.
>>
>>
>> But this does not help for guest running old version of kernel which still
>> think UFO work.
>
> Indeed, if we have to support live migration of arbitrary old guests
> without any expectations on hypervisor version either, features can
> simply never be reverted, even if a negotiation interface exists.
>
> At least for upcoming features and devices, guest code should not
> have this expectation, but from the start allow renegation such as
> CTRL_GUEST_OFFLOADS [1] based on a host trigger. But for
> tuntap TUNSETOFFLOAD it seems that ship has sailed.
>
> Okay, I will send a patch to reinstate UFO for this use case (only). There
> is some related work in tap_handle_frame and packet_direct_xmit to
> segment directly in the device. I will be traveling the next few days, so
> it won't be in time for 4.14 (but can go in stable later, of course).

I'm finishing up and running some tests. The majority of the patch is a
straightforward partial revert of the patchset, so while fairly large for a
patch to net (~150 lines, esp. in udp[46]_ufo_fragment), that is all
thoroughly tested code. Notably absent are the protocol layer and
hardware support (NETIF_F_UFO) portions.

The only open issue is whether to rely on existing skb_gso_segment
processing in the transmit path from validate_xmit_skb or to add new
skb_gso_segment calls directly to tun_get_user, tap_get_user and
pf_packet. Tun has to loop around four different ways of injecting
packets into the device. Something like the below snippet.

More conservative is to introduce no completely new code and rely on
validate_xmit_skb, but that means having to protect the entire stack
against skbs with SKB_GSO_UDP, so also bringing back some
checksum and fragment handling snippets in gre_gso_segment,
__skb_udp_tunnel_segment, act_csum and openvswitch.

A third option is to send the conservative approach to net, then
in net-next follow up with a patch to plug the SKB_GSO_UDP
directly in the devices and revert the tunnel/act/openvswitch stanzas
I'm leaning towards that approach.

@@ -1380,7 +1380,7 @@ static ssize_t tun_get_user(struct tun_struct
*tun, struct tun_file *tfile,
int noblock, bool more)
 {
struct tun_pi pi = { 0, cpu_to_be16(ETH_P_IP) };
-   struct sk_buff *skb;
+   struct sk_buff *skb, *segs = NULL;
size_t total_len = iov_iter_count(from);
size_t len = total_len, align = tun->align, linear;
struct virtio_net_hdr gso = { 0 };
@@ -1552,12 +1552,33 @@ static ssize_t tun_get_user(struct tun_struct
*tun, struct tun_file *tfile,
}

rxhash = __skb_get_hash_symmetric(skb);
+
+   if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP) {
+   skb_push(skb, ETH_HLEN);
+   segs = __skb_gso_segment(skb, netif_skb_features(skb), false);
+
+   if (IS_ERR(segs)) {
+   kfree_skb(skb);
+   return PTR_ERR(segs);
+   }
+
+   if (segs) {
+   consume_skb(skb);
+   skb = segs;
+   }
+again:
+   skb_pull(skb, ETH_HLEN);
+   segs = skb->next;
+   skb->next = NULL;
+   }
+
 #ifndef CONFIG_4KSTACKS
-tun_rx_batched(tun, tfile, skb, more);
+   tun_rx_batched(tun, tfile, skb, more || segs);
 #else
netif_rx_ni(skb);
 #endif

+   if (segs) {
+   skb = segs;
+   goto again;
+   }


Re: [PATCH] sfp: Add support for DWDM SFP modules

2017-11-17 Thread David Miller
From: Russell King - ARM Linux 
Date: Fri, 17 Nov 2017 09:52:10 +

> I already have a stack of patches for phy, phylink and sfp that I
> need to send, including documentation patches which Florian has
> already found very useful and helpful.  I had assumed that net-next
> was already closed, being almost a week into the merge window.

Yes it is.

Thanks for the info, I'll mark this 'deferred' in patchwork.  Please
have this respun and posted once net-next is openned back up and
the various issues have been sorted out.

Thank you.


Re: regression: UFO removal breaks kvm live migration

2017-11-17 Thread Willem de Bruijn
>> Okay, I will send a patch to reinstate UFO for this use case (only). There
>> is some related work in tap_handle_frame and packet_direct_xmit to
>> segment directly in the device. I will be traveling the next few days, so
>> it won't be in time for 4.14 (but can go in stable later, of course).
>
> I'm finishing up and running some tests. The majority of the patch is a
> straightforward partial revert of the patchset, so while fairly large for a
> patch to net (~150 lines, esp. in udp[46]_ufo_fragment), that is all
> thoroughly tested code. Notably absent are the protocol layer and
> hardware support (NETIF_F_UFO) portions.
>
> The only open issue is whether to rely on existing skb_gso_segment
> processing in the transmit path from validate_xmit_skb or to add new
> skb_gso_segment calls directly to tun_get_user, tap_get_user and
> pf_packet. Tun has to loop around four different ways of injecting
> packets into the device. Something like the below snippet.
>
> More conservative is to introduce no completely new code and rely on
> validate_xmit_skb, but that means having to protect the entire stack
> against skbs with SKB_GSO_UDP, so also bringing back some
> checksum and fragment handling snippets in gre_gso_segment,
> __skb_udp_tunnel_segment, act_csum and openvswitch.

Come to think of it, as this patch does not bring back NETIF_F_UFO
support to NETIF_F_GSO_SOFTWARE, the tunnel cases can be
excluded.

Then this is probably the simpler and more obviously correct approach.


Re: [PATCH] net: usb: hso.c: remove unneeded DRIVER_LICENSE #define

2017-11-17 Thread Philippe Ombredanne
On Fri, Nov 17, 2017 at 3:19 PM, Greg Kroah-Hartman
 wrote:
> There is no need to #define the license of the driver, just put it in
> the MODULE_LICENSE() line directly as a text string.
>
> This allows tools that check that the module license matches the source
> code license to work properly, as there is no need to unwind the
> unneeded dereference.
>
> Cc: "David S. Miller" 
> Cc: Andreas Kemnade 
> Cc: Johan Hovold 
> Reported-by: Philippe Ombredanne 
> Signed-off-by: Greg Kroah-Hartman 


Reviewed-by: Philippe Ombredanne 
-- 
Cordially
Philippe Ombredanne


Re: [PATCH net] sctp: set frag_point in sctp_setsockopt_maxseg correctly

2017-11-17 Thread Marcelo Ricardo Leitner
On Fri, Nov 17, 2017 at 02:11:11PM +0800, Xin Long wrote:
> Now in sctp_setsockopt_maxseg user_frag or frag_point can be set with
> val >= 8 and val <= SCTP_MAX_CHUNK_LEN. But both checks are incorrect.
> 
> val >= 8 means frag_point can even be less than SCTP_DEFAULT_MINSEGMENT.
> Then in sctp_datamsg_from_user(), when it's value is greater than cookie
> echo len and trying to bundle with cookie echo chunk, the first_len will
> overflow.
> 
> The worse case is when it's value is equal as cookie echo len, first_len
> becomes 0, it will go into a dead loop for fragment later on. In Hangbin
> syzkaller testing env, oom was even triggered due to consecutive memory
> allocation in that loop.
> 
> Besides, SCTP_MAX_CHUNK_LEN is the max size of the whole chunk, it should
> deduct the data header for frag_point or user_frag check.
> 
> This patch does a proper check with SCTP_DEFAULT_MINSEGMENT subtracting
> the sctphdr and datahdr, SCTP_MAX_CHUNK_LEN subtracting datahdr when
> setting frag_point via sockopt. It also improves sctp_setsockopt_maxseg
> codes.
> 
> Suggested-by: Marcelo Ricardo Leitner 
> Reported-by: Hangbin Liu 
> Signed-off-by: Xin Long 

Acked-by: Marcelo Ricardo Leitner 

> ---
>  include/net/sctp/sctp.h |  3 ++-
>  net/sctp/socket.c   | 29 +++--
>  2 files changed, 21 insertions(+), 11 deletions(-)
> 
> diff --git a/include/net/sctp/sctp.h b/include/net/sctp/sctp.h
> index d7d8cba..749a428 100644
> --- a/include/net/sctp/sctp.h
> +++ b/include/net/sctp/sctp.h
> @@ -444,7 +444,8 @@ static inline int sctp_frag_point(const struct 
> sctp_association *asoc, int pmtu)
>   if (asoc->user_frag)
>   frag = min_t(int, frag, asoc->user_frag);
>  
> - frag = SCTP_TRUNC4(min_t(int, frag, SCTP_MAX_CHUNK_LEN));
> + frag = SCTP_TRUNC4(min_t(int, frag, SCTP_MAX_CHUNK_LEN -
> + sizeof(struct sctp_data_chunk)));
>  
>   return frag;
>  }
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index 4c0a772..3204a9b 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -3140,9 +3140,9 @@ static int sctp_setsockopt_mappedv4(struct sock *sk, 
> char __user *optval, unsign
>   */
>  static int sctp_setsockopt_maxseg(struct sock *sk, char __user *optval, 
> unsigned int optlen)
>  {
> + struct sctp_sock *sp = sctp_sk(sk);
>   struct sctp_assoc_value params;
>   struct sctp_association *asoc;
> - struct sctp_sock *sp = sctp_sk(sk);
>   int val;
>  
>   if (optlen == sizeof(int)) {
> @@ -3158,26 +3158,35 @@ static int sctp_setsockopt_maxseg(struct sock *sk, 
> char __user *optval, unsigned
>   if (copy_from_user(¶ms, optval, optlen))
>   return -EFAULT;
>   val = params.assoc_value;
> - } else
> + } else {
>   return -EINVAL;
> + }
>  
> - if ((val != 0) && ((val < 8) || (val > SCTP_MAX_CHUNK_LEN)))
> - return -EINVAL;
> + if (val) {
> + int min_len, max_len;
>  
> - asoc = sctp_id2assoc(sk, params.assoc_id);
> - if (!asoc && params.assoc_id && sctp_style(sk, UDP))
> - return -EINVAL;
> + min_len = SCTP_DEFAULT_MINSEGMENT - sp->pf->af->net_header_len;
> + min_len -= sizeof(struct sctphdr) +
> +sizeof(struct sctp_data_chunk);
> +
> + max_len = SCTP_MAX_CHUNK_LEN - sizeof(struct sctp_data_chunk);
>  
> + if (val < min_len || val > max_len)
> + return -EINVAL;
> + }
> +
> + asoc = sctp_id2assoc(sk, params.assoc_id);
>   if (asoc) {
>   if (val == 0) {
> - val = asoc->pathmtu;
> - val -= sp->pf->af->net_header_len;
> + val = asoc->pathmtu - sp->pf->af->net_header_len;
>   val -= sizeof(struct sctphdr) +
> - sizeof(struct sctp_data_chunk);
> +sizeof(struct sctp_data_chunk);
>   }
>   asoc->user_frag = val;
>   asoc->frag_point = sctp_frag_point(asoc, asoc->pathmtu);
>   } else {
> + if (params.assoc_id && sctp_style(sk, UDP))
> + return -EINVAL;
>   sp->user_frag = val;
>   }
>  
> -- 
> 2.1.0
> 


Product Enquiry

2017-11-17 Thread Carol Merck
Hello,

We recently visited your website and we are interested in your models, We will 
like to make an order from your list of products. However, we would like to see 
your company's latest catalogs with the; minimum order quantity, delivery 
time/FOB, payment terms etc. Official order placement will follow as soon as 
possible. 

Awaiting your prompt reply.


Thanks and best regards,

Carol Merck
Purchasing Manager


Re: [PATCH net] sctp: report SCTP_ERROR_INV_STRM as cpu endian

2017-11-17 Thread Marcelo Ricardo Leitner
On Fri, Nov 17, 2017 at 02:15:02PM +0800, Xin Long wrote:
> rfc6458 demands the send_error in SCTP_SEND_FAILED_EVENT should
> be in cpu endian, while SCTP_ERROR_INV_STRM is in big endian.
> 
> This issue is there since very beginning, Eric noticed it by
> running 'make C=2 M=net/sctp/'.
> 
> This patch is to convert it before reporting it.

Unfortunatelly we can't fix this as this will break UAPI. It will
break applications that are currently matching on the current value.

> 
> Reported-by: Eric Dumazet 
> Signed-off-by: Xin Long 
> ---
>  net/sctp/stream.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/sctp/stream.c b/net/sctp/stream.c
> index a11db21..f86ceee 100644
> --- a/net/sctp/stream.c
> +++ b/net/sctp/stream.c
> @@ -64,7 +64,7 @@ static void sctp_stream_outq_migrate(struct sctp_stream 
> *stream,
>*/
>  
>   /* Mark as failed send. */
> - sctp_chunk_fail(ch, SCTP_ERROR_INV_STRM);
> + sctp_chunk_fail(ch, be16_to_cpu(SCTP_ERROR_INV_STRM));
>   if (asoc->peer.prsctp_capable &&
>   SCTP_PR_PRIO_ENABLED(ch->sinfo.sinfo_flags))
>   asoc->sent_cnt_removable--;
> -- 
> 2.1.0
> 


Re: [PATCH] qed: fix unnecessary call to memset cocci warnings

2017-11-17 Thread Andy Shevchenko
On Fri, Nov 17, 2017 at 12:04 AM, Vasyl Gomonovych  wrote:
> Use kzalloc rather than kmalloc followed by memset with 0
>
> drivers/net/ethernet/qlogic/qed/qed_dcbx.c:1280:13-20: WARNING:
> kzalloc should be used for dcbx_info, instead of kmalloc/memset
> Generated by: scripts/coccinelle/api/alloc/kzalloc-simple.cocci

While this looks okay per se now, it would be good if you put version
of the patch and add a changelog to it.

I think no need to resend this one, just for your information.

Reviewed-by: Andy Shevchenko 

> Signed-off-by: Vasyl Gomonovych 
> ---
>  drivers/net/ethernet/qlogic/qed/qed_dcbx.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/qlogic/qed/qed_dcbx.c 
> b/drivers/net/ethernet/qlogic/qed/qed_dcbx.c
> index 8f6ccc0c39e5..cc9e0dfcee48 100644
> --- a/drivers/net/ethernet/qlogic/qed/qed_dcbx.c
> +++ b/drivers/net/ethernet/qlogic/qed/qed_dcbx.c
> @@ -1277,11 +1277,10 @@ static struct qed_dcbx_get *qed_dcbnl_get_dcbx(struct 
> qed_hwfn *hwfn,
>  {
> struct qed_dcbx_get *dcbx_info;
>
> -   dcbx_info = kmalloc(sizeof(*dcbx_info), GFP_ATOMIC);
> +   dcbx_info = kzalloc(sizeof(*dcbx_info), GFP_ATOMIC);
> if (!dcbx_info)
> return NULL;
>
> -   memset(dcbx_info, 0, sizeof(*dcbx_info));
> if (qed_dcbx_query_params(hwfn, dcbx_info, type)) {
> kfree(dcbx_info);
> return NULL;
> --
> 1.9.1
>



-- 
With Best Regards,
Andy Shevchenko


[PATCH 2/2] ip6_tunnel: pass tun_dst arg from ip6_tnl_rcv() to __ip6_tnl_rcv()

2017-11-17 Thread Alexey Kodanev
Otherwise tun_dst argument is unused there. Currently, ip6_tnl_rcv()
invoked with tun_dst set to NULL, so there is no actual functional
changes introduced in this patch.

Fixes: 0d3c703a9d17 ("ipv6: Cleanup IPv6 tunnel receive path")
Signed-off-by: Alexey Kodanev 
---
 net/ipv6/ip6_tunnel.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index a1c2444..bc050e8 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -869,7 +869,7 @@ int ip6_tnl_rcv(struct ip6_tnl *t, struct sk_buff *skb,
struct metadata_dst *tun_dst,
bool log_ecn_err)
 {
-   return __ip6_tnl_rcv(t, skb, tpi, NULL, ip6ip6_dscp_ecn_decapsulate,
+   return __ip6_tnl_rcv(t, skb, tpi, tun_dst, ip6ip6_dscp_ecn_decapsulate,
 log_ecn_err);
 }
 EXPORT_SYMBOL(ip6_tnl_rcv);
-- 
1.8.3.1



[PATCH 1/2] gre6: use log_ecn_error module parameter in ip6_tnl_rcv()

2017-11-17 Thread Alexey Kodanev
After commit 308edfdf1563 ("gre6: Cleanup GREv6 receive path, call
common GRE functions") it's not used anywhere in the module, but
previously was used in ip6gre_rcv().

Fixes: 308edfdf1563 ("gre6: Cleanup GREv6 receive path, call common GRE 
functions")
Signed-off-by: Alexey Kodanev 
---
 net/ipv6/ip6_gre.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 59c121b..5d6bee0 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -461,7 +461,7 @@ static int ip6gre_rcv(struct sk_buff *skb, const struct 
tnl_ptk_info *tpi)
  &ipv6h->saddr, &ipv6h->daddr, tpi->key,
  tpi->proto);
if (tunnel) {
-   ip6_tnl_rcv(tunnel, skb, tpi, NULL, false);
+   ip6_tnl_rcv(tunnel, skb, tpi, NULL, log_ecn_error);
 
return PACKET_RCVD;
}
-- 
1.8.3.1



Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit

2017-11-17 Thread Kirill Tkhai
On 15.11.2017 19:29, Eric W. Biederman wrote:
> Kirill Tkhai  writes:
> 
>> On 15.11.2017 09:25, Eric W. Biederman wrote:
>>> Kirill Tkhai  writes:
>>>
 Curently mutex is used to protect pernet operations list. It makes
 cleanup_net() to execute ->exit methods of the same operations set,
 which was used on the time of ->init, even after net namespace is
 unlinked from net_namespace_list.

 But the problem is it's need to synchronize_rcu() after net is removed
 from net_namespace_list():

 Destroy net_ns:
 cleanup_net()
   mutex_lock(&net_mutex)
   list_del_rcu(&net->list)
   synchronize_rcu()  <--- Sleep there for 
 ages
   list_for_each_entry_reverse(ops, &pernet_list, list)
 ops_exit_list(ops, &net_exit_list)
   list_for_each_entry_reverse(ops, &pernet_list, list)
 ops_free_list(ops, &net_exit_list)
   mutex_unlock(&net_mutex)

 This primitive is not fast, especially on the systems with many processors
 and/or when preemptible RCU is enabled in config. So, all the time, while
 cleanup_net() is waiting for RCU grace period, creation of new net 
 namespaces
 is not possible, the tasks, who makes it, are sleeping on the same mutex:

 Create net_ns:
 copy_net_ns()
   mutex_lock_killable(&net_mutex)<--- Sleep there for 
 ages

 The solution is to convert net_mutex to the rw_semaphore. Then,
 pernet_operations::init/::exit methods, modifying the net-related data,
 will require down_read() locking only, while down_write() will be used
 for changing pernet_list.

 This gives signify performance increase, like you may see below. There
 is measured sequential net namespace creation in a cycle, in single
 thread, without other tasks (single user mode):

 1)int main(int argc, char *argv[])
 {
 unsigned nr;
 if (argc < 2) {
 fprintf(stderr, "Provide nr iterations arg\n");
 return 1;
 }
 nr = atoi(argv[1]);
 while (nr-- > 0) {
 if (unshare(CLONE_NEWNET)) {
 perror("Can't unshare");
 return 1;
 }
 }
 return 0;
 }

 Origin, 10 unshare():
 0.03user 23.14system 1:39.85elapsed 23%CPU

 Patched, 10 unshare():
 0.03user 67.49system 1:08.34elapsed 98%CPU

 2)for i in {1..1}; do unshare -n bash -c exit; done

 Origin:
 real 1m24,190s
 user 0m6,225s
 sys 0m15,132s

 Patched:
 real 0m18,235s   (4.6 times faster)
 user 0m4,544s
 sys 0m13,796s

 This patch requires commit 76f8507f7a64 "locking/rwsem: Add 
 down_read_killable()"
 from Linus tree (not in net-next yet).
>>>
>>> Using a rwsem to protect the list of operations makes sense.
>>>
>>> That should allow removing the sing
>>>
>>> I am not wild about taking a the rwsem down_write in
>>> rtnl_link_unregister, and net_ns_barrier.  I think that works but it
>>> goes from being a mild hack to being a pretty bad hack and something
>>> else that can kill the parallelism you are seeking it add.
>>>
>>> There are about 204 instances of struct pernet_operations.  That is a
>>> lot of code to have carefully audited to ensure it can in parallel all
>>> at once.  The existence of the exit_batch method, net_ns_barrier,
>>> for_each_net and taking of net_mutex in rtnl_link_unregister all testify
>>> to the fact that there are data structures accessed by multiple network
>>> namespaces.
>>>
>>> My preference would be to:
>>>
>>> - Add the net_sem in addition to net_mutex with down_write only held in
>>>   register and unregister, and maybe net_ns_barrier and
>>>   rtnl_link_unregister.
>>>
>>> - Factor out struct pernet_ops out of struct pernet_operations.  With
>>>   struct pernet_ops not having the exit_batch method.  With pernet_ops
>>>   being embedded an anonymous member of the old struct pernet_operations.
>>>
>>> - Add [un]register_pernet_{sys,dev} functions that take a struct
>>>   pernet_ops, that don't take net_mutex.  Have them order the
>>>   pernet_list as:
>>>
>>>   pernet_sys
>>>   pernet_subsys
>>>   pernet_device
>>>   pernet_dev
>>>
>>>   With the chunk in the middle taking the net_mutex.
>>
>> I think this approach will work. Thanks for the suggestion. Some more
>> thoughts to the plan below.
>>
>> The only difficult thing there will be to choose the right order
>> to move ops from pernet_subsys to pernet_sys and from pernet_device
>> to pernet_dev one by one.
>>
>> This is rather easy in case of tristate drivers, as modules may be loaded
>> at any time, and the only important order is dependences between them.
>> So, it's possible to start from a module, who has no dependences,
>> and move it to pernet_sys, and then continue with

Re: [PATCH][v4] uprobes/x86: emulate push insns for uprobe on x86

2017-11-17 Thread Oleg Nesterov
On 11/15, Yonghong Song wrote:
>
> v3 -> v4:
>   . Revert most of v3 change as 32bit emulation is not really working
> on x86_64 platform as among other issues, function emulate_push_stack()
> needs to account for 32bit app on 64bit platform.
> A separate effort is ongoing to address this issue.

Reviewed-by: Oleg Nesterov 



Please test your patch with the fix below, in this particular case the
TIF_IA32 check should be fine. Although this is not what we really want,
we should probably use user_64bit_mode(regs) which checks ->cs. But this
needs more changes and doesn't solve other problems (get_unmapped_area)
so I still can't decide what should we do right now...

Oleg.

--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -516,7 +516,7 @@ struct uprobe_xol_ops {
 
 static inline int sizeof_long(void)
 {
-   return in_ia32_syscall() ? 4 : 8;
+   return test_thread_flag(TIF_IA32) ? 4 : 8;
 }
 
 static int default_pre_xol_op(struct arch_uprobe *auprobe, struct pt_regs 
*regs)



Re: Bisected 4.14 Regression: IPsec transport mode breakage

2017-11-17 Thread Kevin Locke
On Fri, 2017-11-17 at 11:03 +0100, Steffen Klassert wrote:
> On Wed, Nov 15, 2017 at 09:46:19AM -0700, Kevin Locke wrote:
>> I have bisected the issue to commit c9f3f813d462.  I have attached the
>> client ipsec.conf as well as the syslog during the connection attempt
>> for both c9f3f813d462 (bad) and cf3796675174 (good).
> 
> The offending commit is already reverted in the 'net' tree
> and will be available in mainline soon.

Great, thank you!  I tested davem/net#94802151894d and can confirm
that it works and fixes the issue for me.  Thanks again.

-- 
Cheers,  |  ke...@kevinlocke.name| XMPP: ke...@kevinlocke.name
Kevin|  https://kevinlocke.name  | IRC:   kevinoid on freenode


[PATCH 0/4][v6] Add the ability to do BPF directed error injection

2017-11-17 Thread Josef Bacik
I've reworked this to be opt-in only as per Igno and Alexei.  Still needs to go
through Dave because of the bpf bits, but I need tracing guys to weigh in and
sign off on my approach please.

v5->v6:
- add BPF_ALLOW_ERROR_INJECTION() tagging for functions that will support this
  feature.  This way only functions that opt-in will be allowed to be
  overridden.
- added a btrfs patch to allow error injection for open_ctree() so that the bpf
  sample actually works.

v4->v5:
- disallow kprobe_override programs from being put in the prog map array so we
  don't tail call into something we didn't check.  This allows us to make the
  normal path still fast without a bunch of percpu operations.

v3->v4:
- fix a build error found by kbuild test bot (I didn't wait long enough
  apparently.)
- Added a warning message as per Daniels suggestion.

v2->v3:
- added a ->kprobe_override flag to bpf_prog.
- added some sanity checks to disallow attaching bpf progs that have
  ->kprobe_override set that aren't for ftrace kprobes.
- added the trace_kprobe_ftrace helper to check if the trace_event_call is a
  ftrace kprobe.
- renamed bpf_kprobe_state to bpf_kprobe_override, fixed it so we only read this
  value in the kprobe path, and thus only write to it if we're overriding or
  clearing the override.

v1->v2:
- moved things around to make sure that bpf_override_return could really only be
  used for an ftrace kprobe.
- killed the special return values from trace_call_bpf.
- renamed pc_modified to bpf_kprobe_state so bpf_override_return could tell if
  it was being called from an ftrace kprobe context.
- reworked the logic in kprobe_perf_func to take advantage of bpf_kprobe_state.
- updated the test as per Alexei's review.

- Original message -

A lot of our error paths are not well tested because we have no good way of
injecting errors generically.  Some subystems (block, memory) have ways to
inject errors, but they are random so it's hard to get reproduceable results.

With BPF we can add determinism to our error injection.  We can use kprobes and
other things to verify we are injecting errors at the exact case we are trying
to test.  This patch gives us the tool to actual do the error injection part.
It is very simple, we just set the return value of the pt_regs we're given to
whatever we provide, and then override the PC with a dummy function that simply
returns.

Right now this only works on x86, but it would be simple enough to expand to
other architectures.  Thanks,

Josef


[PATCH 2/4] btrfs: make open_ctree error injectable

2017-11-17 Thread Josef Bacik
From: Josef Bacik 

This allows us to do error injection with BPF for open_ctree.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/disk-io.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index dfdab849037b..c6b4e1f07072 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -31,6 +31,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "ctree.h"
 #include "disk-io.h"
 #include "hash.h"
@@ -3283,6 +3284,7 @@ int open_ctree(struct super_block *sb,
goto fail_block_groups;
goto retry_root_backup;
 }
+BPF_ALLOW_ERROR_INJECTION(open_ctree);
 
 static void btrfs_end_buffer_write_sync(struct buffer_head *bh, int uptodate)
 {
-- 
2.7.5



[PATCH 1/4] add infrastructure for tagging functions as error injectable

2017-11-17 Thread Josef Bacik
From: Josef Bacik 

Using BPF we can override kprob'ed functions and return arbitrary
values.  Obviously this can be a bit unsafe, so make this feature opt-in
for functions.  Simply tag a function with KPROBE_ERROR_INJECT_SYMBOL in
order to give BPF access to that function for error injection purposes.

Signed-off-by: Josef Bacik 
---
 arch/x86/include/asm/asm.h|   6 ++
 include/asm-generic/kprobes.h |   9 +++
 include/asm-generic/vmlinux.lds.h |  10 +++
 include/linux/kprobes.h   |   1 +
 include/linux/module.h|   5 ++
 kernel/kprobes.c  | 163 ++
 kernel/module.c   |   6 +-
 7 files changed, 199 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h
index b0dc91f4bedc..340f4cc43255 100644
--- a/arch/x86/include/asm/asm.h
+++ b/arch/x86/include/asm/asm.h
@@ -85,6 +85,12 @@
_ASM_PTR (entry);   \
.popsection
 
+# define _ASM_KPROBE_ERROR_INJECT(entry)   \
+   .pushsection "_kprobe_error_inject_list","aw" ; \
+   _ASM_ALIGN ;\
+   _ASM_PTR (entry);   \
+   .popseciton
+
 .macro ALIGN_DESTINATION
/* check for bad alignment of destination */
movl %edi,%ecx
diff --git a/include/asm-generic/kprobes.h b/include/asm-generic/kprobes.h
index 57af9f21d148..f96c4de5d7b0 100644
--- a/include/asm-generic/kprobes.h
+++ b/include/asm-generic/kprobes.h
@@ -22,4 +22,13 @@ static unsigned long __used  
\
 #endif
 #endif /* defined(__KERNEL__) && !defined(__ASSEMBLY__) */
 
+#ifdef CONFIG_BPF_KPROBE_OVERRIDE
+#define BPF_ALLOW_ERROR_INJECTION(fname)   \
+static unsigned long __used\
+   __attribute__((__section__("_kprobe_error_inject_list")))   \
+   _eil_addr_##fname = (unsigned long)fname;
+#else
+#define BPF_ALLOW_ERROR_INJECTION(fname)
+#endif
+
 #endif /* _ASM_GENERIC_KPROBES_H */
diff --git a/include/asm-generic/vmlinux.lds.h 
b/include/asm-generic/vmlinux.lds.h
index 8acfc1e099e1..85822804861e 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -136,6 +136,15 @@
 #define KPROBE_BLACKLIST()
 #endif
 
+#ifdef CONFIG_BPF_KPROBE_OVERRIDE
+#define ERROR_INJECT_LIST(). = ALIGN(8);   
\
+   
VMLINUX_SYMBOL(__start_kprobe_error_inject_list) = .;   \
+   KEEP(*(_kprobe_error_inject_list))  
\
+   VMLINUX_SYMBOL(__stop_kprobe_error_inject_list) 
= .;
+#else
+#define ERROR_INJECT_LIST()
+#endif
+
 #ifdef CONFIG_EVENT_TRACING
 #define FTRACE_EVENTS(). = ALIGN(8);   
\
VMLINUX_SYMBOL(__start_ftrace_events) = .;  \
@@ -560,6 +569,7 @@
FTRACE_EVENTS() \
TRACE_SYSCALLS()\
KPROBE_BLACKLIST()  \
+   ERROR_INJECT_LIST() \
MEM_DISCARD(init.rodata)\
CLK_OF_TABLES() \
RESERVEDMEM_OF_TABLES() \
diff --git a/include/linux/kprobes.h b/include/linux/kprobes.h
index bd2684700b74..4f501cb73aec 100644
--- a/include/linux/kprobes.h
+++ b/include/linux/kprobes.h
@@ -271,6 +271,7 @@ extern bool arch_kprobe_on_func_entry(unsigned long offset);
 extern bool kprobe_on_func_entry(kprobe_opcode_t *addr, const char *sym, 
unsigned long offset);
 
 extern bool within_kprobe_blacklist(unsigned long addr);
+extern bool within_kprobe_error_injection_list(unsigned long addr);
 
 struct kprobe_insn_cache {
struct mutex mutex;
diff --git a/include/linux/module.h b/include/linux/module.h
index fe5aa3736707..7bb1a9b9a322 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -475,6 +475,11 @@ struct module {
ctor_fn_t *ctors;
unsigned int num_ctors;
 #endif
+
+#ifdef CONFIG_BPF_KPROBE_OVERRIDE
+   unsigned int num_kprobe_ei_funcs;
+   unsigned long *kprobe_ei_funcs;
+#endif
 } cacheline_aligned __randomize_layout;
 #ifndef MODULE_ARCH_INIT
 #define MODULE_ARCH_INIT {}
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index a1606a4224e1..7afadf07b34e 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -83,6 +83,16 @@ static raw_spinlock_t *kretprobe_table_lock_ptr(unsigned 
long hash)
return &(kretprobe_table_locks[hash].lock);
 }
 
+/* List of symbols that can be overriden for error injection. */
+static LIST_HEAD(kprobe_error_injection_list);
+static DEFIN

[PATCH 3/4] bpf: add a bpf_override_function helper

2017-11-17 Thread Josef Bacik
From: Josef Bacik 

Error injection is sloppy and very ad-hoc.  BPF could fill this niche
perfectly with it's kprobe functionality.  We could make sure errors are
only triggered in specific call chains that we care about with very
specific situations.  Accomplish this with the bpf_override_funciton
helper.  This will modify the probe'd callers return value to the
specified value and set the PC to an override function that simply
returns, bypassing the originally probed function.  This gives us a nice
clean way to implement systematic error injection for all of our code
paths.

Acked-by: Alexei Starovoitov 
Signed-off-by: Josef Bacik 
---
 arch/Kconfig |  3 +++
 arch/x86/Kconfig |  1 +
 arch/x86/include/asm/kprobes.h   |  4 +++
 arch/x86/include/asm/ptrace.h|  5 
 arch/x86/kernel/kprobes/ftrace.c | 14 ++
 include/linux/filter.h   |  3 ++-
 include/linux/trace_events.h |  1 +
 include/uapi/linux/bpf.h |  7 -
 kernel/bpf/core.c|  3 +++
 kernel/bpf/verifier.c|  2 ++
 kernel/events/core.c |  7 +
 kernel/trace/Kconfig | 11 
 kernel/trace/bpf_trace.c | 38 +++
 kernel/trace/trace_kprobe.c  | 55 +++-
 kernel/trace/trace_probe.h   | 12 +
 15 files changed, 157 insertions(+), 9 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index d789a89cb32c..4fb618082259 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -195,6 +195,9 @@ config HAVE_OPTPROBES
 config HAVE_KPROBES_ON_FTRACE
bool
 
+config HAVE_KPROBE_OVERRIDE
+   bool
+
 config HAVE_NMI
bool
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 971feac13506..5126d2750dd0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -152,6 +152,7 @@ config X86
select HAVE_KERNEL_XZ
select HAVE_KPROBES
select HAVE_KPROBES_ON_FTRACE
+   select HAVE_KPROBE_OVERRIDE
select HAVE_KRETPROBES
select HAVE_KVM
select HAVE_LIVEPATCH   if X86_64
diff --git a/arch/x86/include/asm/kprobes.h b/arch/x86/include/asm/kprobes.h
index 6cf65437b5e5..c6c3b1f4306a 100644
--- a/arch/x86/include/asm/kprobes.h
+++ b/arch/x86/include/asm/kprobes.h
@@ -67,6 +67,10 @@ extern const int kretprobe_blacklist_size;
 void arch_remove_kprobe(struct kprobe *p);
 asmlinkage void kretprobe_trampoline(void);
 
+#ifdef CONFIG_KPROBES_ON_FTRACE
+extern void arch_ftrace_kprobe_override_function(struct pt_regs *regs);
+#endif
+
 /* Architecture specific copy of original instruction*/
 struct arch_specific_insn {
/* copy of the original instruction */
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 91c04c8e67fa..f04e71800c2f 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -108,6 +108,11 @@ static inline unsigned long regs_return_value(struct 
pt_regs *regs)
return regs->ax;
 }
 
+static inline void regs_set_return_value(struct pt_regs *regs, unsigned long 
rc)
+{
+   regs->ax = rc;
+}
+
 /*
  * user_mode(regs) determines whether a register set came from user
  * mode.  On x86_32, this is true if V8086 mode was enabled OR if the
diff --git a/arch/x86/kernel/kprobes/ftrace.c b/arch/x86/kernel/kprobes/ftrace.c
index 041f7b6dfa0f..3c455bf490cb 100644
--- a/arch/x86/kernel/kprobes/ftrace.c
+++ b/arch/x86/kernel/kprobes/ftrace.c
@@ -97,3 +97,17 @@ int arch_prepare_kprobe_ftrace(struct kprobe *p)
p->ainsn.boostable = false;
return 0;
 }
+
+asmlinkage void override_func(void);
+asm(
+   ".type override_func, @function\n"
+   "override_func:\n"
+   "   ret\n"
+   ".size override_func, .-override_func\n"
+);
+
+void arch_ftrace_kprobe_override_function(struct pt_regs *regs)
+{
+   regs->ip = (unsigned long)&override_func;
+}
+NOKPROBE_SYMBOL(arch_ftrace_kprobe_override_function);
diff --git a/include/linux/filter.h b/include/linux/filter.h
index cdd78a7beaae..dfa44fd74bae 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -458,7 +458,8 @@ struct bpf_prog {
locked:1,   /* Program image locked? */
gpl_compatible:1, /* Is filter GPL compatible? 
*/
cb_access:1,/* Is control block accessed? */
-   dst_needed:1;   /* Do we need dst entry? */
+   dst_needed:1,   /* Do we need dst entry? */
+   kprobe_override:1; /* Do we override a kprobe? 
*/
kmemcheck_bitfield_end(meta);
enum bpf_prog_type  type;   /* Type of BPF program */
u32 len;/* Number of filter blocks */
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index fc6aeca945db..be8bd5a8efaa 100644
--- a/include/linux/trace_events.h
+++ b/include/linu

[PATCH 4/4] samples/bpf: add a test for bpf_override_return

2017-11-17 Thread Josef Bacik
From: Josef Bacik 

This adds a basic test for bpf_override_return to verify it works.  We
override the main function for mounting a btrfs fs so it'll return
-ENOMEM and then make sure that trying to mount a btrfs fs will fail.

Acked-by: Alexei Starovoitov 
Signed-off-by: Josef Bacik 
---
 samples/bpf/Makefile  |  4 
 samples/bpf/test_override_return.sh   | 15 +++
 samples/bpf/tracex7_kern.c| 16 
 samples/bpf/tracex7_user.c| 28 
 tools/include/uapi/linux/bpf.h|  7 ++-
 tools/testing/selftests/bpf/bpf_helpers.h |  3 ++-
 6 files changed, 71 insertions(+), 2 deletions(-)
 create mode 100755 samples/bpf/test_override_return.sh
 create mode 100644 samples/bpf/tracex7_kern.c
 create mode 100644 samples/bpf/tracex7_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index ea2b9e6135f3..83d06bc1f710 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -14,6 +14,7 @@ hostprogs-y += tracex3
 hostprogs-y += tracex4
 hostprogs-y += tracex5
 hostprogs-y += tracex6
+hostprogs-y += tracex7
 hostprogs-y += test_probe_write_user
 hostprogs-y += trace_output
 hostprogs-y += lathist
@@ -58,6 +59,7 @@ tracex3-objs := bpf_load.o $(LIBBPF) tracex3_user.o
 tracex4-objs := bpf_load.o $(LIBBPF) tracex4_user.o
 tracex5-objs := bpf_load.o $(LIBBPF) tracex5_user.o
 tracex6-objs := bpf_load.o $(LIBBPF) tracex6_user.o
+tracex7-objs := bpf_load.o $(LIBBPF) tracex7_user.o
 load_sock_ops-objs := bpf_load.o $(LIBBPF) load_sock_ops.o
 test_probe_write_user-objs := bpf_load.o $(LIBBPF) test_probe_write_user_user.o
 trace_output-objs := bpf_load.o $(LIBBPF) trace_output_user.o
@@ -100,6 +102,7 @@ always += tracex3_kern.o
 always += tracex4_kern.o
 always += tracex5_kern.o
 always += tracex6_kern.o
+always += tracex7_kern.o
 always += sock_flags_kern.o
 always += test_probe_write_user_kern.o
 always += trace_output_kern.o
@@ -153,6 +156,7 @@ HOSTLOADLIBES_tracex3 += -lelf
 HOSTLOADLIBES_tracex4 += -lelf -lrt
 HOSTLOADLIBES_tracex5 += -lelf
 HOSTLOADLIBES_tracex6 += -lelf
+HOSTLOADLIBES_tracex7 += -lelf
 HOSTLOADLIBES_test_cgrp2_sock2 += -lelf
 HOSTLOADLIBES_load_sock_ops += -lelf
 HOSTLOADLIBES_test_probe_write_user += -lelf
diff --git a/samples/bpf/test_override_return.sh 
b/samples/bpf/test_override_return.sh
new file mode 100755
index ..e68b9ee6814b
--- /dev/null
+++ b/samples/bpf/test_override_return.sh
@@ -0,0 +1,15 @@
+#!/bin/bash
+
+rm -f testfile.img
+dd if=/dev/zero of=testfile.img bs=1M seek=1000 count=1
+DEVICE=$(losetup --show -f testfile.img)
+mkfs.btrfs -f $DEVICE
+mkdir tmpmnt
+./tracex7 $DEVICE
+if [ $? -eq 0 ]
+then
+   echo "SUCCESS!"
+else
+   echo "FAILED!"
+fi
+losetup -d $DEVICE
diff --git a/samples/bpf/tracex7_kern.c b/samples/bpf/tracex7_kern.c
new file mode 100644
index ..1ab308a43e0f
--- /dev/null
+++ b/samples/bpf/tracex7_kern.c
@@ -0,0 +1,16 @@
+#include 
+#include 
+#include 
+#include "bpf_helpers.h"
+
+SEC("kprobe/open_ctree")
+int bpf_prog1(struct pt_regs *ctx)
+{
+   unsigned long rc = -12;
+
+   bpf_override_return(ctx, rc);
+   return 0;
+}
+
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/tracex7_user.c b/samples/bpf/tracex7_user.c
new file mode 100644
index ..8a52ac492e8b
--- /dev/null
+++ b/samples/bpf/tracex7_user.c
@@ -0,0 +1,28 @@
+#define _GNU_SOURCE
+
+#include 
+#include 
+#include 
+#include "libbpf.h"
+#include "bpf_load.h"
+
+int main(int argc, char **argv)
+{
+   FILE *f;
+   char filename[256];
+   char command[256];
+   int ret;
+
+   snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+   if (load_bpf_file(filename)) {
+   printf("%s", bpf_log_buf);
+   return 1;
+   }
+
+   snprintf(command, 256, "mount %s tmpmnt/", argv[1]);
+   f = popen(command, "r");
+   ret = pclose(f);
+
+   return ret ? 0 : 1;
+}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 4a4b6e78c977..3756dde69834 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -673,6 +673,10 @@ union bpf_attr {
  * @buf: buf to fill
  * @buf_size: size of the buf
  * Return : 0 on success or negative error code
+ *
+ * int bpf_override_return(pt_regs, rc)
+ * @pt_regs: pointer to struct pt_regs
+ * @rc: the return value to set
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -732,7 +736,8 @@ union bpf_attr {
FN(xdp_adjust_meta),\
FN(perf_event_read_value),  \
FN(perf_prog_read_value),   \
-   FN(getsockopt),
+   FN(getsockopt), \
+   FN(override_return),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/to

Re: [PATCH][v4] uprobes/x86: emulate push insns for uprobe on x86

2017-11-17 Thread Yonghong Song



On 11/17/17 9:25 AM, Oleg Nesterov wrote:

On 11/15, Yonghong Song wrote:


v3 -> v4:
   . Revert most of v3 change as 32bit emulation is not really working
 on x86_64 platform as among other issues, function emulate_push_stack()
 needs to account for 32bit app on 64bit platform.
 A separate effort is ongoing to address this issue.


Reviewed-by: Oleg Nesterov 



Please test your patch with the fix below, in this particular case the
TIF_IA32 check should be fine. Although this is not what we really want,
we should probably use user_64bit_mode(regs) which checks ->cs. But this
needs more changes and doesn't solve other problems (get_unmapped_area)
so I still can't decide what should we do right now...


I tested the below change with my patch. On x86_64, both 64bit and 32bit 
program can be uprobe emulated properly. On x86_32, however, there is a 
compilation error like below:


In function ‘check_copy_size’,
inlined from ‘copy_to_user’ at 
/home/yhs/work/tip/include/linux/uaccess.h:154:6,
inlined from ‘emulate_push_stack.isra.9’ at 
/home/yhs/work/tip/arch/x86/kernel/uprobes.c:535:6:
/home/yhs/work/tip/include/linux/thread_info.h:139:4: error: call to 
‘__bad_copy_from’ declared with attribute error: copy source size is too 
small

__bad_copy_from();

Basically, test_thread_flag(TIF_IA32) returns 0 on x86_32 system.



Oleg.

--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -516,7 +516,7 @@ struct uprobe_xol_ops {
  
  static inline int sizeof_long(void)

  {
-   return in_ia32_syscall() ? 4 : 8;
+   return test_thread_flag(TIF_IA32) ? 4 : 8;
  }
  
  static int default_pre_xol_op(struct arch_uprobe *auprobe, struct pt_regs *regs)




[PATCH net-next 1/2] net-next: use five-tuple hash for sk_txhash

2017-11-17 Thread Shaohua Li
From: Shaohua Li 

We are using sk_txhash to calculate flowlabel, but sk_txhash isn't
always available, for example, in inet_timewait_sock. This causes
problem for reset packet, which will have a different flowlabel. This
causes our router doesn't correctly close tcp connection. We are using
flowlabel to do load balance. Routers in the path maintain connection
state. So if flow label changes, the packet is routed through a
different router. In this case, the old router doesn't get the reset
packet to close the tcp connection.

Per Tom's suggestion, we switch back to five-tuple hash, so we can
reconstruct correct flowlabel for reset packet.

At most places, we already have the flowi info, so we directly use it
build sk_txhash. For synack, we do this after route search. At that
time, we have the flowi info ready, so don't need to create the flowi
info again.

I don't change sk_rethink_txhash() though, it still uses random hash,
which is the whole point to select a different path after a negative
routing advise.

Cc: Martin KaFai Lau 
Cc: Eric Dumazet 
Cc: Florent Fourcot 
Cc: Cong Wang 
Cc: Tom Herbert 
Signed-off-by: Shaohua Li 
---
 include/net/sock.h| 18 --
 include/net/tcp.h |  2 +-
 net/ipv4/datagram.c   |  2 +-
 net/ipv4/syncookies.c |  4 +++-
 net/ipv4/tcp_input.c  |  1 -
 net/ipv4/tcp_ipv4.c   | 17 -
 net/ipv4/tcp_output.c |  1 -
 net/ipv6/datagram.c   |  4 +++-
 net/ipv6/syncookies.c |  3 ++-
 net/ipv6/tcp_ipv6.c   | 18 +-
 10 files changed, 39 insertions(+), 31 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index f8715c5..85a6192 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1732,22 +1732,12 @@ static inline kuid_t sock_net_uid(const struct net 
*net, const struct sock *sk)
return sk ? sk->sk_uid : make_kuid(net->user_ns, 0);
 }
 
-static inline u32 net_tx_rndhash(void)
-{
-   u32 v = prandom_u32();
-
-   return v ?: 1;
-}
-
-static inline void sk_set_txhash(struct sock *sk)
-{
-   sk->sk_txhash = net_tx_rndhash();
-}
-
 static inline void sk_rethink_txhash(struct sock *sk)
 {
-   if (sk->sk_txhash)
-   sk_set_txhash(sk);
+   if (sk->sk_txhash) {
+   u32 v = prandom_u32();
+   sk->sk_txhash = v ?: 1;
+   }
 }
 
 static inline struct dst_entry *
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 85ea578..8d68fde 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1840,7 +1840,7 @@ struct tcp_request_sock_ops {
 __u16 *mss);
 #endif
struct dst_entry *(*route_req)(const struct sock *sk, struct flowi *fl,
-  const struct request_sock *req);
+  struct request_sock *req);
u32 (*init_seq)(const struct sk_buff *skb);
u32 (*init_ts_off)(const struct net *net, const struct sk_buff *skb);
int (*send_synack)(const struct sock *sk, struct dst_entry *dst,
diff --git a/net/ipv4/datagram.c b/net/ipv4/datagram.c
index f915abf..ed9ccb7 100644
--- a/net/ipv4/datagram.c
+++ b/net/ipv4/datagram.c
@@ -74,7 +74,7 @@ int __ip4_datagram_connect(struct sock *sk, struct sockaddr 
*uaddr, int addr_len
inet->inet_daddr = fl4->daddr;
inet->inet_dport = usin->sin_port;
sk->sk_state = TCP_ESTABLISHED;
-   sk_set_txhash(sk);
+   sk->sk_txhash = get_hash_from_flowi4(fl4);
inet->inet_id = jiffies;
 
sk_dst_set(sk, &rt->dst);
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index fda37f2..76f1cf6 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -335,7 +335,6 @@ struct sock *cookie_v4_check(struct sock *sk, struct 
sk_buff *skb)
treq->rcv_isn   = ntohl(th->seq) - 1;
treq->snt_isn   = cookie;
treq->ts_off= 0;
-   treq->txhash= net_tx_rndhash();
req->mss= mss;
ireq->ir_num= ntohs(th->dest);
ireq->ir_rmt_port   = th->source;
@@ -376,6 +375,9 @@ struct sock *cookie_v4_check(struct sock *sk, struct 
sk_buff *skb)
   opt->srr ? opt->faddr : ireq->ir_rmt_addr,
   ireq->ir_loc_addr, th->source, th->dest, sk->sk_uid);
security_req_classify_flow(req, flowi4_to_flowi(&fl4));
+
+   treq->txhash = get_hash_from_flowi4(&fl4);
+
rt = ip_route_output_key(sock_net(sk), &fl4);
if (IS_ERR(rt)) {
reqsk_free(req);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index dabbf1d..92b4a10 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6289,7 +6289,6 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
}
 
tcp_rsk(req)->snt_isn = isn;
-   tcp_rsk(req)->txhash = net_tx_rndhash();
tcp_openreq_init_rwin(req, sk, dst);
if (!want_cookie) {
tcp_reqsk_record_syn(sk, req, skb);
diff --git a/net/ipv4

[PATCH net-next 0/2] net: fix flowlabel inconsistency in reset packet

2017-11-17 Thread Shaohua Li
From: Shaohua Li 

Hi,

Please see below tcpdump output:
21:00:48.109122 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 40) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[S], cksum 0x0529 (incorrect -> 0xf56c), seq 3282214508, win 43690, options 
[mss 65476,sackOK,TS val 2500903437 ecr 0,nop,wscale 7], length 0
21:00:48.109381 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload 
length: 40) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags 
[S.], cksum 0x0529 (incorrect -> 0x49ad), seq 1923801573, ack 3282214509, win 
43690, options [mss 65476,sackOK,TS val 2500903437 ecr 2500903437,nop,wscale 
7], length 0
21:00:48.109548 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[.], cksum 0x0521 (incorrect -> 0x1bdf), seq 1, ack 1, win 342, options 
[nop,nop,TS val 2500903437 ecr 2500903437], length 0
21:00:48.109823 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 62) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[P.], cksum 0x053f (incorrect -> 0xb8b1), seq 1:31, ack 1, win 342, options 
[nop,nop,TS val 2500903437 ecr 2500903437], length 30
21:00:48.109910 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload 
length: 32) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags 
[.], cksum 0x0521 (incorrect -> 0x1bc1), seq 1, ack 31, win 342, options 
[nop,nop,TS val 2500903437 ecr 2500903437], length 0
21:00:48.110043 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload 
length: 56) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags 
[P.], cksum 0x0539 (incorrect -> 0xb726), seq 1:25, ack 31, win 342, options 
[nop,nop,TS val 2500903438 ecr 2500903437], length 24
21:00:48.110173 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[.], cksum 0x0521 (incorrect -> 0x1ba7), seq 31, ack 25, win 342, options 
[nop,nop,TS val 2500903438 ecr 2500903438], length 0
21:00:48.110211 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload 
length: 32) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags 
[F.], cksum 0x0521 (incorrect -> 0x1ba7), seq 25, ack 31, win 342, options 
[nop,nop,TS val 2500903438 ecr 2500903437], length 0
21:00:48.151099 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[.], cksum 0x0521 (incorrect -> 0x1ba6), seq 31, ack 26, win 342, options 
[nop,nop,TS val 2500903438 ecr 2500903438], length 0
21:00:49.110524 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 56) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[P.], cksum 0x0539 (incorrect -> 0xb324), seq 31:55, ack 26, win 342, options 
[nop,nop,TS val 2500904438 ecr 2500903438], length 24
21:00:49.110637 IP6 (flowlabel 0xb34d5, hlim 64, next-header TCP (6) payload 
length: 20) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags 
[R], cksum 0x0515 (incorrect -> 0x668c), seq 1923801599, win 0, length 0

The tcp reset packet has a different flowlabel, which causes our router
doesn't correctly close tcp connection. We are using flowlabel to do
load balance. Routers in the path maintain connection state. So if flow
label changes, the packet is routed through a different router. In this
case, the old router doesn't get the reset packet to close the tcp
connection.

The reason is the normal packet gets the skb->hash from sk->sk_txhash,
which is generated randomly. ip6_make_flowlabel then uses the hash to
create a flowlabel. The reset packet doesn't get assigned a hash, so the
flowlabel is calculated with flowi6.

The patches fix the issue.

Thanks,
Shaohua

Shaohua Li (2):
  net-next: use five-tuple hash for sk_txhash
  net-next: copy user configured flowlabel to reset packet

 include/net/sock.h| 18 --
 include/net/tcp.h |  2 +-
 net/ipv4/datagram.c   |  2 +-
 net/ipv4/syncookies.c |  4 +++-
 net/ipv4/tcp_input.c  |  1 -
 net/ipv4/tcp_ipv4.c   | 17 -
 net/ipv4/tcp_output.c |  1 -
 net/ipv6/datagram.c   |  4 +++-
 net/ipv6/syncookies.c |  3 ++-
 net/ipv6/tcp_ipv6.c   | 36 ++--
 10 files changed, 56 insertions(+), 32 deletions(-)

-- 
2.9.5



[PATCH net-next 2/2] net-next: copy user configured flowlabel to reset packet

2017-11-17 Thread Shaohua Li
From: Shaohua Li 

Reset packet doesn't use user configured flowlabel, instead, it always
uses 0. This will cause inconsistency for flowlabel. tw sock already
records flowlabel info, so we can directly use it.

Cc: Martin KaFai Lau 
Cc: Eric Dumazet 
Cc: Florent Fourcot 
Cc: Cong Wang 
Cc: Tom Herbert 
Signed-off-by: Shaohua Li 
---
 net/ipv6/tcp_ipv6.c | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index a1a5802..9b678cd 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -901,6 +901,8 @@ static void tcp_v6_send_reset(const struct sock *sk, struct 
sk_buff *skb)
struct sock *sk1 = NULL;
 #endif
int oif = 0;
+   u8 tclass = 0;
+   __be32 flowlabel = 0;
 
if (th->rst)
return;
@@ -954,7 +956,21 @@ static void tcp_v6_send_reset(const struct sock *sk, 
struct sk_buff *skb)
trace_tcp_send_reset(sk, skb);
}
 
-   tcp_v6_send_response(sk, skb, seq, ack_seq, 0, 0, 0, oif, key, 1, 0, 0);
+   if (sk) {
+   if (sk_fullsock(sk)) {
+   struct ipv6_pinfo *np = inet6_sk(sk);
+
+   tclass = np->tclass;
+   flowlabel = np->flow_label & IPV6_FLOWLABEL_MASK;
+   } else {
+   struct inet_timewait_sock *tw = inet_twsk(sk);
+
+   tclass = tw->tw_tclass;
+   flowlabel = cpu_to_be32(tw->tw_flowlabel);
+   }
+   }
+   tcp_v6_send_response(sk, skb, seq, ack_seq, 0, 0, 0, oif, key, 1,
+   tclass, flowlabel);
 
 #ifdef CONFIG_TCP_MD5SIG
 out:
-- 
2.9.5



[PATCH RFC 01/25] net: Assign net to net_namespace_list in setup_net()

2017-11-17 Thread Kirill Tkhai
This patch merges two repeating pieces of code in one,
and they will live in setup_net() now.

It acts as cleanup even despite init_net_initialized
assignment is reordered with the linking of net now.
This variable is need for proc_net_init() called from:

start_kernel()->proc_root_init()->proc_net_init(),

which can't race with net_ns_init(), called from
initcall.

Signed-off-by: Kirill Tkhai 
---
 net/core/net_namespace.c |   13 +++--
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index b797832565d3..7ecf71050ffa 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -296,6 +296,9 @@ static __net_init int setup_net(struct net *net, struct 
user_namespace *user_ns)
if (error < 0)
goto out_undo;
}
+   rtnl_lock();
+   list_add_tail_rcu(&net->list, &net_namespace_list);
+   rtnl_unlock();
 out:
return error;
 
@@ -417,11 +420,6 @@ struct net *copy_net_ns(unsigned long flags,
 
net->ucounts = ucounts;
rv = setup_net(net, user_ns);
-   if (rv == 0) {
-   rtnl_lock();
-   list_add_tail_rcu(&net->list, &net_namespace_list);
-   rtnl_unlock();
-   }
mutex_unlock(&net_mutex);
if (rv < 0) {
dec_net_namespaces(ucounts);
@@ -847,11 +845,6 @@ static int __init net_ns_init(void)
panic("Could not setup the initial network namespace");
 
init_net_initialized = true;
-
-   rtnl_lock();
-   list_add_tail_rcu(&init_net.list, &net_namespace_list);
-   rtnl_unlock();
-
mutex_unlock(&net_mutex);
 
register_pernet_subsys(&net_ns_ops);



[PATCH RFC 00/25] Replacing net_mutex with rw_semaphore

2017-11-17 Thread Kirill Tkhai
Hi,

this is continuation of discussion from here:

https://lkml.org/lkml/2017/11/14/298

The plan has changed a little bit, so I'd be happy to hear
people's comments, before I dived into all 400+ pernet subsys
and devices.

The patch set adds pernet sys list ahead of subsys and device,
and it's used for pernet_operations, which may be executed
in parallel with any other pernet_operations methods. Also,
some high-priority ops converted (up to registered using
postcore_initcall(), and some subsys_initcall()) in order
of appearance. The sequence in setup_net() is following:

1)execute all the callbacks from pernet_sys list
2)lock net_mutex
3)execute all the callbacks from pernet_subsys list
4)execute all the callbacks from pernet_device list
5)unlock net_mutex

There was not pernet_operations, requiring additional
synchronization, yet, but I've bumped in another problem.
The problem is that some drivers may be compiled as modules
and as kernel-image part. They register pernet_operations
from device_initcall() for example. This initcall executes
in different time comparing to in-kernel built-in only
drivers.

Imagine, we have three state driverA, and boolean driverB.
driverA registers pernet_subsys from subsys_initcall().
driverB registers pernet_subsys from fs_initcall().
So, here we have two cases:

driverA is module  driverA is built-in
   ---
register driverB ops   register driverA ops
register driverA ops   register driverB ops

So, the order is different. When converting driver one-by-one,
it's impossible to make the order true for all .config
states, because of the above. So, the bisect won't work.

And it seems, it's just the same as to convert pernet_operations
from all the files in file alphabetical order. What do you
think about this? (Note, the patches has no such a problem
at the moment, as there are all in-kernel early core drivers).

Maybe there are another comments on the code.
---

Kirill Tkhai (25):
  net: Assign net to net_namespace_list in setup_net()
  net: Cleanup copy_net_ns()
  net: Introduce net_sem for protection of pernet_list
  net: Move mutex_unlock() in cleanup_net() up
  net: Add primitives to update heads of pernet_list sublists
  net: Add pernet sys and registration functions
  net: Make sys sublist pernet_operations executed out of net_mutex
  net: Move proc_net_ns_ops to pernet_sys list
  net: Move net_ns_ops to pernet_sys list
  net: Move sysctl_pernet_ops to pernet_sys list
  net: Move netfilter_net_ops to pernet_sys list
  net: Move nf_log_net_ops to pernet_sys list
  net: Move net_inuse_ops to pernet_sys list
  net: Move net_defaults_ops to pernet_sys list
  net: Move netlink_net_ops to pernet_sys list
  net: Move rtnetlink_net_ops to pernet_sys list
  net: Move audit_net_ops to pernet_sys list
  net: Move uevent_net_ops to pernet_sys list
  net: Move proto_net_ops to pernet_sys list
  net: Move pernet_subsys, registered via net_dev_init(), to pernet_sys list
  net: Move fib_* pernet_operations, registered via subsys_initcall(), to 
pernet_sys list
  net: Move subsys_initcall() registered pernet_operations from net/sched 
to pernet_sys list
  net: Move genl_pernet_ops to pernet_sys list
  net: Move wext_pernet_ops to pernet_sys list
  net: Move sysctl_core_ops to pernet_sys list


 fs/proc/proc_net.c  |2 
 include/linux/rtnetlink.h   |1 
 include/net/net_namespace.h |2 
 kernel/audit.c  |2 
 lib/kobject_uevent.c|2 
 net/core/dev.c  |2 
 net/core/fib_notifier.c |2 
 net/core/fib_rules.c|2 
 net/core/net-procfs.c   |4 -
 net/core/net_namespace.c|  203 +--
 net/core/rtnetlink.c|6 +
 net/core/sock.c |4 -
 net/core/sysctl_net_core.c  |2 
 net/netfilter/core.c|2 
 net/netfilter/nf_log.c  |2 
 net/netlink/af_netlink.c|2 
 net/netlink/genetlink.c |2 
 net/sched/act_api.c |2 
 net/sched/sch_api.c |2 
 net/sysctl_net.c|2 
 net/wireless/wext-core.c|2 
 21 files changed, 183 insertions(+), 67 deletions(-)

--
Signed-off-by: Kirill Tkhai 


[PATCH RFC 02/25] net: Cleanup copy_net_ns()

2017-11-17 Thread Kirill Tkhai
Line up destructors actions in the revers order
to constructors. Next patches will add more actions,
and this will be comfortable, if there is the such
order.

Signed-off-by: Kirill Tkhai 
---
 net/core/net_namespace.c |   20 +---
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 7ecf71050ffa..2e512965bf42 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -404,27 +404,25 @@ struct net *copy_net_ns(unsigned long flags,
 
net = net_alloc();
if (!net) {
-   dec_net_namespaces(ucounts);
-   return ERR_PTR(-ENOMEM);
+   rv = -ENOMEM;
+   goto dec_ucounts;
}
-
+   refcount_set(&net->passive, 1);
+   net->ucounts = ucounts;
get_user_ns(user_ns);
 
rv = mutex_lock_killable(&net_mutex);
-   if (rv < 0) {
-   net_free(net);
-   dec_net_namespaces(ucounts);
-   put_user_ns(user_ns);
-   return ERR_PTR(rv);
-   }
+   if (rv < 0)
+   goto put_userns;
 
-   net->ucounts = ucounts;
rv = setup_net(net, user_ns);
mutex_unlock(&net_mutex);
if (rv < 0) {
-   dec_net_namespaces(ucounts);
+put_userns:
put_user_ns(user_ns);
net_drop_ns(net);
+dec_ucounts:
+   dec_net_namespaces(ucounts);
return ERR_PTR(rv);
}
return net;



[PATCH RFC 08/25] net: Move proc_net_ns_ops to pernet_sys list

2017-11-17 Thread Kirill Tkhai
This patch starts to convert pernet_subsys, registered
from before initcalls.

Since proc_net_ns_ops is registered pernet_subsys,
made from:

start_kernel()->proc_root_init()->proc_net_init(),

and there is no a pernet_subsys, which is registered
earlier, we start from it.

proc_net_ns_ops::proc_net_ns_init()/proc_net_ns_exit()
register pernet net->proc_net and ->proc_net_stat, and
constructors and destructors of another pernet_operations
are not interested in foreign net's proc_net and proc_net_stat.
Proc filesystem privitives are synchronized on proc_subdir_lock.

So, it's safe to move proc_net_ns_ops to pernet_sys list
and execute its methods in parallel with another pernet
operations.

Signed-off-by: Kirill Tkhai 
---
 fs/proc/proc_net.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
index a2bf369c923d..5eb52765eeab 100644
--- a/fs/proc/proc_net.c
+++ b/fs/proc/proc_net.c
@@ -243,5 +243,5 @@ int __init proc_net_init(void)
 {
proc_symlink("net", NULL, "self/net");
 
-   return register_pernet_subsys(&proc_net_ns_ops);
+   return register_pernet_sys(&proc_net_ns_ops);
 }



[PATCH RFC 07/25] net: Make sys sublist pernet_operations executed out of net_mutex

2017-11-17 Thread Kirill Tkhai
Move net_mutex to setup_net() and cleanup_net(), and
do not hold it, while sys sublist methods are executed.

Signed-off-by: Kirill Tkhai 
---
 net/core/net_namespace.c |   44 +++-
 1 file changed, 35 insertions(+), 9 deletions(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index f4f4aaa5ce1f..7aec8c1afe50 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -84,11 +84,11 @@ static int net_assign_generic(struct net *net, unsigned int 
id, void *data)
 {
struct net_generic *ng, *old_ng;
 
-   BUG_ON(!mutex_is_locked(&net_mutex));
+   BUG_ON(!rwsem_is_locked(&net_sem));
BUG_ON(id < MIN_PERNET_OPS_ID);
 
old_ng = rcu_dereference_protected(net->gen,
-  lockdep_is_held(&net_mutex));
+  lockdep_is_held(&net_sem));
if (old_ng->s.len > id) {
old_ng->ptr[id] = data;
return 0;
@@ -300,6 +300,7 @@ static __net_init int setup_net(struct net *net, struct 
user_namespace *user_ns)
 {
/* Must be called with net_mutex held */
const struct pernet_operations *ops, *saved_ops;
+   bool locked = false;
int error = 0;
LIST_HEAD(net_exit_list);
 
@@ -311,14 +312,34 @@ static __net_init int setup_net(struct net *net, struct 
user_namespace *user_ns)
spin_lock_init(&net->nsid_lock);
 
list_for_each_entry(ops, &pernet_list, list) {
+   if (&ops->list == first_subsys) {
+   BUG_ON(locked);
+   error = mutex_lock_killable(&net_mutex);
+   if (error)
+   goto out_undo;
+   locked = true;
+   }
+
error = ops_init(ops, net);
if (error < 0)
goto out_undo;
}
+
+   if (!locked) {
+   /*
+* This may happen only on early boot, so we don't
+* care about possibility to interrupt the locking.
+*/
+   mutex_lock(&net_mutex);
+   locked = true;
+   }
+
rtnl_lock();
list_add_tail_rcu(&net->list, &net_namespace_list);
rtnl_unlock();
 out:
+   if (locked)
+   mutex_unlock(&net_mutex);
return error;
 
 out_undo:
@@ -433,12 +454,7 @@ struct net *copy_net_ns(unsigned long flags,
rv = down_read_killable(&net_sem);
if (rv < 0)
goto put_userns;
-   rv = mutex_lock_killable(&net_mutex);
-   if (rv < 0)
-   goto up_read;
rv = setup_net(net, user_ns);
-   mutex_unlock(&net_mutex);
-up_read:
up_read(&net_sem);
if (rv < 0) {
 put_userns:
@@ -460,6 +476,7 @@ static void cleanup_net(struct work_struct *work)
struct net *net, *tmp;
struct list_head net_kill_list;
LIST_HEAD(net_exit_list);
+   bool locked;
 
/* Atomically snapshot the list of namespaces to cleanup */
spin_lock_irq(&cleanup_list_lock);
@@ -468,6 +485,7 @@ static void cleanup_net(struct work_struct *work)
 
down_read(&net_sem);
mutex_lock(&net_mutex);
+   locked = true;
 
/* Don't let anyone else find us. */
rtnl_lock();
@@ -500,10 +518,18 @@ static void cleanup_net(struct work_struct *work)
synchronize_rcu();
 
/* Run all of the network namespace exit methods */
-   list_for_each_entry_reverse(ops, &pernet_list, list)
+   list_for_each_entry_reverse(ops, &pernet_list, list) {
ops_exit_list(ops, &net_exit_list);
 
-   mutex_unlock(&net_mutex);
+   if (&ops->list == first_subsys) {
+   BUG_ON(!locked);
+   mutex_unlock(&net_mutex);
+   locked = false;
+   }
+   }
+
+   if (locked)
+   mutex_unlock(&net_mutex);
 
/* Free the net generic variables */
list_for_each_entry_reverse(ops, &pernet_list, list)



[PATCH RFC 12/25] net: Move nf_log_net_ops to pernet_sys list

2017-11-17 Thread Kirill Tkhai
nf_log_net_ops are registered the same initcall
as netfilter_net_ops, so they has to be moved right
after netfilter_net_ops.

The ops would have had a problem in parallel execution
with others, if init_net had been possible to released.
But it's not, and the rest is safe for that. There is
memory allocation, which nobody else interested in,
and sysctl registration. So, we move it to pernet_sys
list.

Signed-off-by: Kirill Tkhai 
---
 net/netfilter/nf_log.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/nf_log.c b/net/netfilter/nf_log.c
index 8bb152a7cca4..08868afad813 100644
--- a/net/netfilter/nf_log.c
+++ b/net/netfilter/nf_log.c
@@ -582,5 +582,5 @@ static struct pernet_operations nf_log_net_ops = {
 
 int __init netfilter_log_init(void)
 {
-   return register_pernet_subsys(&nf_log_net_ops);
+   return register_pernet_sys(&nf_log_net_ops);
 }



[PATCH RFC 09/25] net: Move net_ns_ops to pernet_sys list

2017-11-17 Thread Kirill Tkhai
This patch starts to convert pernet_subsys, registered
from pure initcalls.

Since net_ns_init() is the only pure initcall in net subsystem,
and there is no early initcalls; the pernet subsys, it registers,
is the first in pernet_operations list. So, we start with it.

net_ns_ops::net_ns_net_init/net_ns_net_init, methods use only
ida_simple_* functions, which are not need a synchronization.

So it's safe to execute them in parallel with any other
pernet_operations, and thus we convert net_ns_ops to pernet_sys type.

Signed-off-by: Kirill Tkhai 
---
 net/core/net_namespace.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 7aec8c1afe50..2e8295aa7003 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -899,7 +899,7 @@ static int __init net_ns_init(void)
init_net_initialized = true;
up_write(&net_sem);
 
-   register_pernet_subsys(&net_ns_ops);
+   register_pernet_sys(&net_ns_ops);
 
rtnl_register(PF_UNSPEC, RTM_NEWNSID, rtnl_net_newid, NULL,
  RTNL_FLAG_DOIT_UNLOCKED);



[PATCH RFC 15/25] net: Move netlink_net_ops to pernet_sys list

2017-11-17 Thread Kirill Tkhai
According to net/core/Makefile, net/core/af_netlink.o
core initcalls execute right after net/core/net_namespace.o.

The methods of netlink_net_ops create and destroy "netlink"
file, which are not interested for foreigh pernet_operations.
So, netlink_net_ops may safely be moved to pernet_sys list.

Signed-off-by: Kirill Tkhai 
---
 net/netlink/af_netlink.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index b9e0ee4e22f5..a4f1f5222b79 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2735,7 +2735,7 @@ static int __init netlink_proto_init(void)
netlink_add_usersock_entry();
 
sock_register(&netlink_family_ops);
-   register_pernet_subsys(&netlink_net_ops);
+   register_pernet_sys(&netlink_net_ops);
/* The netlink device handler may be needed early. */
rtnetlink_init();
 out:



[PATCH RFC 13/25] net: Move net_inuse_ops to pernet_sys list

2017-11-17 Thread Kirill Tkhai
net/core/sock.o is the first linked file in net/core/Makefile,
so its core initcall executes the first in the directory.

net_inuse_ops methods expose statistics in /proc.
No one from the rest of pernet_subsys or pernet_device lists
does not touch net::core::inuse. So, it's safe to move
net_inuse_ops to pernet_sys list.

Signed-off-by: Kirill Tkhai 
---
 net/core/sock.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 13719af7b4e3..be050b044699 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3081,7 +3081,7 @@ static struct pernet_operations net_inuse_ops = {
 
 static __init int net_inuse_init(void)
 {
-   if (register_pernet_subsys(&net_inuse_ops))
+   if (register_pernet_sys(&net_inuse_ops))
panic("Cannot initialize net inuse counters");
 
return 0;



[PATCH RFC 21/25] net: Move fib_* pernet_operations, registered via subsys_initcall(), to pernet_sys list

2017-11-17 Thread Kirill Tkhai
Both of them create and initialize lists, which are not touched
by another foreing pernet_operations.

Signed-off-by: Kirill Tkhai 
---
 net/core/fib_notifier.c |2 +-
 net/core/fib_rules.c|2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/fib_notifier.c b/net/core/fib_notifier.c
index 0c048bdeb016..782a1475a32e 100644
--- a/net/core/fib_notifier.c
+++ b/net/core/fib_notifier.c
@@ -175,7 +175,7 @@ static struct pernet_operations fib_notifier_net_ops = {
 
 static int __init fib_notifier_init(void)
 {
-   return register_pernet_subsys(&fib_notifier_net_ops);
+   return register_pernet_sys(&fib_notifier_net_ops);
 }
 
 subsys_initcall(fib_notifier_init);
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 98e1066c3d55..b2706c18f0f3 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -1039,7 +1039,7 @@ static int __init fib_rules_init(void)
rtnl_register(PF_UNSPEC, RTM_DELRULE, fib_nl_delrule, NULL, 0);
rtnl_register(PF_UNSPEC, RTM_GETRULE, NULL, fib_nl_dumprule, 0);
 
-   err = register_pernet_subsys(&fib_rules_net_ops);
+   err = register_pernet_sys(&fib_rules_net_ops);
if (err < 0)
goto fail;
 



[PATCH RFC 19/25] net: Move proto_net_ops to pernet_sys list

2017-11-17 Thread Kirill Tkhai
This patch starts to convert pernet_subsys, registered
from subsys initcalls.

According to net/Makefile and net/core/Makefile, this
is the first exected subsys_initcall(), registering
pernet_subsys.

It seems to be executed in parallel with others,
as it's only creates/destoyes proc entry, which
nobody else is not interested in.

Signed-off-by: Kirill Tkhai 
---
 net/core/sock.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index be050b044699..ed12e115458b 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3349,7 +3349,7 @@ static __net_initdata struct pernet_operations 
proto_net_ops = {
 
 static int __init proto_init(void)
 {
-   return register_pernet_subsys(&proto_net_ops);
+   return register_pernet_sys(&proto_net_ops);
 }
 
 subsys_initcall(proto_init);



[PATCH RFC 22/25] net: Move subsys_initcall() registered pernet_operations from net/sched to pernet_sys list

2017-11-17 Thread Kirill Tkhai
psched_net_ops only creates and destroyes /proc entry,
and safe to be executed in parallel with any foreigh
pernet_operations.

tcf_action_net_ops initializes and destructs tcf_action_net::egdev_ht,
which is not touched by foreign pernet_operations.

So, move them to pernet_sys list.

Signed-off-by: Kirill Tkhai 
---
 net/sched/act_api.c |2 +-
 net/sched/sch_api.c |2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 4d33a50a8a6d..f1de2146e6e0 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -1470,7 +1470,7 @@ static int __init tc_action_init(void)
 {
int err;
 
-   err = register_pernet_subsys(&tcf_action_net_ops);
+   err = register_pernet_sys(&tcf_action_net_ops);
if (err)
return err;
 
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index b6c4f536876b..68938ca4bbe1 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -2008,7 +2008,7 @@ static int __init pktsched_init(void)
 {
int err;
 
-   err = register_pernet_subsys(&psched_net_ops);
+   err = register_pernet_sys(&psched_net_ops);
if (err) {
pr_err("pktsched_init: "
   "cannot initialize per netns operations\n");



[PATCH RFC 23/25] net: Move genl_pernet_ops to pernet_sys list

2017-11-17 Thread Kirill Tkhai
This pernet_operations create and destroy net::genl_sock.
Foreign pernet_operations don't touch it.

Signed-off-by: Kirill Tkhai 
---
 net/netlink/genetlink.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c
index d444daf1ac04..da7ab3dd5609 100644
--- a/net/netlink/genetlink.c
+++ b/net/netlink/genetlink.c
@@ -1045,7 +1045,7 @@ static int __init genl_init(void)
if (err < 0)
goto problem;
 
-   err = register_pernet_subsys(&genl_pernet_ops);
+   err = register_pernet_sys(&genl_pernet_ops);
if (err)
goto problem;
 



[PATCH RFC 20/25] net: Move pernet_subsys, registered via net_dev_init(), to pernet_sys list

2017-11-17 Thread Kirill Tkhai
net/core/dev.o is lined after net/core/sock.o.

There are:
1)dev_proc_ops and dev_mc_net_ops, which create and destroy
pernet proc file and not interested to another net namespaces;
2)netdev_net_ops, which creates pernet hash, which is not
touched by another pernet_operations.

So, move it to pernet_sys list.

Signed-off-by: Kirill Tkhai 
---
 net/core/dev.c|2 +-
 net/core/net-procfs.c |4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 8ee29f4f5fa9..b90a503a9e1a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -8787,7 +8787,7 @@ static int __init net_dev_init(void)
 
INIT_LIST_HEAD(&offload_base);
 
-   if (register_pernet_subsys(&netdev_net_ops))
+   if (register_pernet_sys(&netdev_net_ops))
goto out;
 
/*
diff --git a/net/core/net-procfs.c b/net/core/net-procfs.c
index 615ccab55f38..46096219d574 100644
--- a/net/core/net-procfs.c
+++ b/net/core/net-procfs.c
@@ -413,8 +413,8 @@ static struct pernet_operations __net_initdata 
dev_mc_net_ops = {
 
 int __init dev_proc_init(void)
 {
-   int ret = register_pernet_subsys(&dev_proc_ops);
+   int ret = register_pernet_sys(&dev_proc_ops);
if (!ret)
-   return register_pernet_subsys(&dev_mc_net_ops);
+   return register_pernet_sys(&dev_mc_net_ops);
return ret;
 }



[PATCH RFC 25/25] net: Move sysctl_core_ops to pernet_sys list

2017-11-17 Thread Kirill Tkhai
These pernet_operations register and destroy sysctl
directory, and it's not interested for foreign
pernet_operations.

Signed-off-by: Kirill Tkhai 
---
 net/core/sysctl_net_core.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index cbc3dde4cfcc..0dab679b33fa 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -525,7 +525,7 @@ static __net_initdata struct pernet_operations 
sysctl_core_ops = {
 static __init int sysctl_core_init(void)
 {
register_net_sysctl(&init_net, "net/core", net_core_table);
-   return register_pernet_subsys(&sysctl_core_ops);
+   return register_pernet_sys(&sysctl_core_ops);
 }
 
 fs_initcall(sysctl_core_init);



[PATCH RFC 24/25] net: Move wext_pernet_ops to pernet_sys list

2017-11-17 Thread Kirill Tkhai
These pernet_operations initialize and purge net::wext_nlevents
queue, and are not touched by foreign pernet_operations.

Signed-off-by: Kirill Tkhai 
---
 net/wireless/wext-core.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/wireless/wext-core.c b/net/wireless/wext-core.c
index 6cdb054484d6..2103c2a003ed 100644
--- a/net/wireless/wext-core.c
+++ b/net/wireless/wext-core.c
@@ -394,7 +394,7 @@ static struct pernet_operations wext_pernet_ops = {
 
 static int __init wireless_nlevent_init(void)
 {
-   int err = register_pernet_subsys(&wext_pernet_ops);
+   int err = register_pernet_sys(&wext_pernet_ops);
 
if (err)
return err;



[PATCH RFC 14/25] net: Move net_defaults_ops to pernet_sys list

2017-11-17 Thread Kirill Tkhai
According to net/core/Makefile, net/core/net_namespace.o
core initcalls execute right after net/core/sock.o.

net_defaults_ops introduces only net_defaults_init_net method,
and it acts on net::core::sysctl_somaxconn, which
is not interested the rest of pernet_subsys and pernet_device
lists. Then, move it to pernet_sys.

Signed-off-by: Kirill Tkhai 
---
 net/core/net_namespace.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 2e8295aa7003..7fc9d44c1817 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -371,7 +371,7 @@ static struct pernet_operations net_defaults_ops = {
 
 static __init int net_defaults_init(void)
 {
-   if (register_pernet_subsys(&net_defaults_ops))
+   if (register_pernet_sys(&net_defaults_ops))
panic("Cannot initialize net default settings");
 
return 0;



[PATCH RFC 18/25] net: Move uevent_net_ops to pernet_sys list

2017-11-17 Thread Kirill Tkhai
This postcore_initcall() created pernet_operations
are registered from ./lib directory, and they have
to go right after audit_net_ops.

uevent_net_init() and uevent_net_exit() create and
destroy netlink socket, and these actions serialized
in netlink code.

Parallel execution with other pernet_operations
makes the socket disappear earlier from uevent_sock_list
on ->exit. As userspace can't be interested in broadcast
messages of dying net, and, as I see, no one in kernel
listen them, we may safely move uevent_net_ops to pernet_sys
list.

Signed-off-by: Kirill Tkhai 
---
 lib/kobject_uevent.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index c3e84edc47c9..84c9d85477cc 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -647,7 +647,7 @@ static struct pernet_operations uevent_net_ops = {
 
 static int __init kobject_uevent_init(void)
 {
-   return register_pernet_subsys(&uevent_net_ops);
+   return register_pernet_sys(&uevent_net_ops);
 }
 
 



[PATCH RFC 17/25] net: Move audit_net_ops to pernet_sys list

2017-11-17 Thread Kirill Tkhai
This patch starts to convert pernet_subsys, registered
from postcore initcalls.

These pernet_operations are in ./kernel directory, and
there are only one more postcore in ./lib. So, audit_net_ops
have to go the first.

audit_net_init() creates netlink socket, while audit_net_exit()
destroys it. The rest of the pernet_list are not interested
in the socket, so we move audit_net_ops to pernet_sys list.

Signed-off-by: Kirill Tkhai 
---
 kernel/audit.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/audit.c b/kernel/audit.c
index 227db99b0f19..bb4626d7e712 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -1549,7 +1549,7 @@ static int __init audit_init(void)
 
pr_info("initializing netlink subsys (%s)\n",
audit_default ? "enabled" : "disabled");
-   register_pernet_subsys(&audit_net_ops);
+   register_pernet_sys(&audit_net_ops);
 
audit_initialized = AUDIT_INITIALIZED;
 



[PATCH RFC 16/25] net: Move rtnetlink_net_ops to pernet_sys list

2017-11-17 Thread Kirill Tkhai
rtnetlink_net_ops are added the same core initcall
as netlink_net_ops, so they has to be added right
after netlink_net_ops.

rtnetlink_net_init() and rtnetlink_net_exit()
create and destroy netlink socket. It looks like,
another pernet_operations are not interested in
foreiner net::rtnl, so rtnetlink_net_ops may be
safely moved to pernet_sys list.

Signed-off-by: Kirill Tkhai 
---
 net/core/rtnetlink.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index cb06d43c4230..d9cf13554e4d 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -4503,7 +4503,7 @@ void __init rtnetlink_init(void)
for (i = 0; i < ARRAY_SIZE(rtnl_msg_handlers_ref); i++)
refcount_set(&rtnl_msg_handlers_ref[i], 1);
 
-   if (register_pernet_subsys(&rtnetlink_net_ops))
+   if (register_pernet_sys(&rtnetlink_net_ops))
panic("rtnetlink_init: cannot initialize rtnetlink\n");
 
register_netdevice_notifier(&rtnetlink_dev_notifier);



[PATCH RFC 10/25] net: Move sysctl_pernet_ops to pernet_sys list

2017-11-17 Thread Kirill Tkhai
This patch starts to convert pernet_subsys, registered
from core initcalls.

Since net/socket.o is the first linked file in net/Makefile,
its core initcalls execute the first. sysctl_pernet_ops is
the first pernet_subsys, registered from sock_init(), so
it goes ahead of others, registered via core_initcall().

Methods sysctl_net_init() and sysctl_net_exit() initialize
net::sysctls of a namespace.

pernet_operations::init()/exit() methods from the rest
of the list do not touch net::sysctls of strangers,
so it's safe to execute sysctl_pernet_ops's methods
in parallel with any other pernet_operations.

Signed-off-by: Kirill Tkhai 
---
 net/sysctl_net.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sysctl_net.c b/net/sysctl_net.c
index 9aed6fe1bf1a..1b91db88e54a 100644
--- a/net/sysctl_net.c
+++ b/net/sysctl_net.c
@@ -103,7 +103,7 @@ __init int net_sysctl_init(void)
net_header = register_sysctl("net", empty);
if (!net_header)
goto out;
-   ret = register_pernet_subsys(&sysctl_pernet_ops);
+   ret = register_pernet_sys(&sysctl_pernet_ops);
if (ret)
goto out1;
 out:



[PATCH RFC 11/25] net: Move netfilter_net_ops to pernet_sys list

2017-11-17 Thread Kirill Tkhai
Since net/socket.o is the first linked file in net/Makefile,
its core initcalls execute the first. netfilter_net_ops
is executed right after sysctl_pernet_ops.

Methods netfilter_net_init() and netfilter_net_exit()
initialize net::nf::hooks and change net-related proc
directory of net. Another pernet_operations do not
interested in forein net::nf::hooks or proc entries,
so it's safe to move netfilter_net_ops to pernet list.

Signed-off-by: Kirill Tkhai 
---
 net/netfilter/core.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index 52cd2901a097..2bed28281b67 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -606,7 +606,7 @@ int __init netfilter_init(void)
 {
int ret;
 
-   ret = register_pernet_subsys(&netfilter_net_ops);
+   ret = register_pernet_sys(&netfilter_net_ops);
if (ret < 0)
goto err;
 



[PATCH RFC 06/25] net: Add pernet sys and registration functions

2017-11-17 Thread Kirill Tkhai
This is a new sublist of pernet_list, which will live ahead
of already existing:

sys, subsys, device.

It's aimed for subsystems, which pernet_operations may execute
in parallel with any other's pernet_operations. In further,
step-by-step we will move all subsys there, adding necessary
small synchronization locks, where it's need. After all subsys
are moved to sys, we'll kill subsys list and we'll have
all current subsys not requiring net_mutex and to be able
to init and exit in parallel with others.

Then we'll add dev sublist ahead of device, and will repeat
the cycle.

Suggested-by: Eric W. Biederman 
Signed-off-by: Kirill Tkhai 
---
 include/net/net_namespace.h |2 +
 net/core/net_namespace.c|   75 ++-
 2 files changed, 75 insertions(+), 2 deletions(-)

diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 10f99dafd5ac..2cde5f766ec6 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -324,6 +324,8 @@ struct pernet_operations {
  * device which caused kernel oops, and panics during network
  * namespace cleanup.   So please don't get this wrong.
  */
+int register_pernet_sys(struct pernet_operations *);
+void unregister_pernet_sys(struct pernet_operations *);
 int register_pernet_subsys(struct pernet_operations *);
 void unregister_pernet_subsys(struct pernet_operations *);
 int register_pernet_device(struct pernet_operations *);
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 1d9712973695..f4f4aaa5ce1f 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -24,10 +24,24 @@
 #include 
 
 /*
- * Our network namespace constructor/destructor lists
+ * Our network namespace constructor/destructor lists
+ * one by one linked in pernet_list. They are (in order
+ * of linking): sys, subsys, device.
+ *
+ * The methods from sys for a network namespace may be
+ * called in parallel with any method from any list
+ * for another net namespace.
+ *
+ * The methods from subsys and device can't be called
+ * in parallel with a method from subsys or device.
+ *
+ * When all subsys pernet_operations are moved to sys
+ * sublist, we'll kill subsys sublist, and create dev
+ * ahead of device sublist, and repeat the cycle.
  */
 
 static LIST_HEAD(pernet_list);
+static struct list_head *first_subsys = &pernet_list;
 static struct list_head *first_device = &pernet_list;
 DEFINE_MUTEX(net_mutex);
 
@@ -987,6 +1001,57 @@ static void unregister_pernet_operations(struct 
pernet_operations *ops)
ida_remove(&net_generic_ids, *ops->id);
 }
 
+/**
+ *  register_pernet_sys - register a network namespace system
+ * @ops:  pernet operations structure for the system
+ *
+ * Register a subsystem which has init and exit functions
+ * that are called when network namespaces are created and
+ * destroyed respectively.
+ *
+ * When registered all network namespace init functions are
+ * called for every existing network namespace.  Allowing kernel
+ * modules to have a race free view of the set of network namespaces.
+ *
+ * When a new network namespace is created all of the init
+ * methods are called in the order in which they were registered.
+ *
+ * When a network namespace is destroyed all of the exit methods
+ * are called in the reverse of the order with which they were
+ * registered.
+ */
+int register_pernet_sys(struct pernet_operations *ops)
+{
+   int error;
+   down_write(&net_sem);
+   if (first_subsys != first_device) {
+   panic("Pernet %ps registered out of order.\n"
+ "There is already %ps.\n", ops,
+ list_entry(first_subsys, struct pernet_operations, list));
+   }
+   error =  register_pernet_operations(first_subsys, ops);
+   up_write(&net_sem);
+   return error;
+}
+EXPORT_SYMBOL_GPL(register_pernet_sys);
+
+/**
+ *  unregister_pernet_sys - unregister a network namespace system
+ * @ops: pernet operations structure to manipulate
+ *
+ * Remove the pernet operations structure from the list to be
+ * used when network namespaces are created or destroyed.  In
+ * addition run the exit method for all existing network
+ * namespaces.
+ */
+void unregister_pernet_sys(struct pernet_operations *ops)
+{
+   down_write(&net_sem);
+   unregister_pernet_operations(ops);
+   up_write(&net_sem);
+}
+EXPORT_SYMBOL_GPL(unregister_pernet_sys);
+
 /**
  *  register_pernet_subsys - register a network namespace subsystem
  * @ops:  pernet operations structure for the subsystem
@@ -1011,6 +1076,8 @@ int register_pernet_subsys(struct pernet_operations *ops)
int error;
down_write(&net_sem);
error =  register_pernet_operations(first_device, ops);
+   if (!error)
+   update_first_on_add(first_subsys, first_device, &ops->list);
up_write(&net_sem);
return error;
 

[PATCH RFC 03/25] net: Introduce net_sem for protection of pernet_list

2017-11-17 Thread Kirill Tkhai
Curently mutex is used to protect pernet operations list. It makes
cleanup_net() to execute ->exit methods of the same operations set,
which was used on the time of ->init, even after net namespace is
unlinked from net_namespace_list.

But the problem is it's need to synchronize_rcu() after net is removed
from net_namespace_list():

Destroy net_ns:
cleanup_net()
  mutex_lock(&net_mutex)
  list_del_rcu(&net->list)
  synchronize_rcu()  <--- Sleep there for ages
  list_for_each_entry_reverse(ops, &pernet_list, list)
ops_exit_list(ops, &net_exit_list)
  list_for_each_entry_reverse(ops, &pernet_list, list)
ops_free_list(ops, &net_exit_list)
  mutex_unlock(&net_mutex)

This primitive is not fast, especially on the systems with many processors
and/or when preemptible RCU is enabled in config. So, all the time, while
cleanup_net() is waiting for RCU grace period, creation of new net namespaces
is not possible, the tasks, who makes it, are sleeping on the same mutex:

Create net_ns:
copy_net_ns()
  mutex_lock_killable(&net_mutex)<--- Sleep there for ages

I observed 20-30 seconds hangs of "unshare -n" on ordinary 8-cpu laptop
with preemptible RCU enabled.

The solution is to convert net_mutex to the rw_semaphore and add small locks
to really small number of pernet_operations, what really need them. Then,
pernet_operations::init/::exit methods, modifying the net-related data,
will require down_read() locking only, while down_write() will be used
for changing pernet_list.

This gives signify performance increase, like you may see here:
https://www.spinics.net/lists/netdev/msg467095.html

It's 4.6 times performance increase on one-thread test.
Multi-thread tests increase may be close to 4.6 multiplied
to number of threads.

This patch starts replacing net_mutex to net_sem. It adds rw_semaphore,
describes the variables it protects, and makes to use where appropriate.
net_mutex is still present, and next patches will kick it out step-by-step.

Signed-off-by: Kirill Tkhai 
---
 include/linux/rtnetlink.h |1 +
 net/core/net_namespace.c  |   37 +
 net/core/rtnetlink.c  |4 ++--
 3 files changed, 28 insertions(+), 14 deletions(-)

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 2032ce2eb20b..f640fc87fe1d 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -35,6 +35,7 @@ extern int rtnl_is_locked(void);
 
 extern wait_queue_head_t netdev_unregistering_wq;
 extern struct mutex net_mutex;
+extern struct rw_semaphore net_sem;
 
 #ifdef CONFIG_PROVE_LOCKING
 extern bool lockdep_rtnl_is_held(void);
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 2e512965bf42..2254b1639209 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -41,6 +41,11 @@ struct net init_net = {
 EXPORT_SYMBOL(init_net);
 
 static bool init_net_initialized;
+/*
+ * net_sem: protects: pernet_list, net_generic_ids,
+ * init_net_initialized and first_* pointers.
+ */
+DECLARE_RWSEM(net_sem);
 
 #define MIN_PERNET_OPS_ID  \
((sizeof(struct net_generic) + sizeof(void *) - 1) / sizeof(void *))
@@ -411,12 +416,16 @@ struct net *copy_net_ns(unsigned long flags,
net->ucounts = ucounts;
get_user_ns(user_ns);
 
-   rv = mutex_lock_killable(&net_mutex);
+   rv = down_read_killable(&net_sem);
if (rv < 0)
goto put_userns;
-
+   rv = mutex_lock_killable(&net_mutex);
+   if (rv < 0)
+   goto up_read;
rv = setup_net(net, user_ns);
mutex_unlock(&net_mutex);
+up_read:
+   up_read(&net_sem);
if (rv < 0) {
 put_userns:
put_user_ns(user_ns);
@@ -443,6 +452,7 @@ static void cleanup_net(struct work_struct *work)
list_replace_init(&cleanup_list, &net_kill_list);
spin_unlock_irq(&cleanup_list_lock);
 
+   down_read(&net_sem);
mutex_lock(&net_mutex);
 
/* Don't let anyone else find us. */
@@ -484,6 +494,7 @@ static void cleanup_net(struct work_struct *work)
ops_free_list(ops, &net_exit_list);
 
mutex_unlock(&net_mutex);
+   up_read(&net_sem);
 
/* Ensure there are no outstanding rcu callbacks using this
 * network namespace.
@@ -510,8 +521,10 @@ static void cleanup_net(struct work_struct *work)
  */
 void net_ns_barrier(void)
 {
+   down_write(&net_sem);
mutex_lock(&net_mutex);
mutex_unlock(&net_mutex);
+   up_write(&net_sem);
 }
 EXPORT_SYMBOL(net_ns_barrier);
 
@@ -838,12 +851,12 @@ static int __init net_ns_init(void)
 
rcu_assign_pointer(init_net.gen, ng);
 
-   mutex_lock(&net_mutex);
+   down_write(&net_sem);
if (setup_net(&init_net, &init_user_ns))
panic("Could not setup the initial network namespace");
 
init_net_initialized = true;
-   mutex_unlock(&net_mutex);
+   up_write(&net_sem);
 
register_pernet_

[PATCH RFC 04/25] net: Move mutex_unlock() in cleanup_net() up

2017-11-17 Thread Kirill Tkhai
net_sem protects from pernet_list changing, while
ops_free_list() makes simple kfree(), and it can't
race with other pernet_operations callbacks.

So we may release net_mutex earlier then it was.

Signed-off-by: Kirill Tkhai 
---
 net/core/net_namespace.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 2254b1639209..a8ea580885d9 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -489,11 +489,12 @@ static void cleanup_net(struct work_struct *work)
list_for_each_entry_reverse(ops, &pernet_list, list)
ops_exit_list(ops, &net_exit_list);
 
+   mutex_unlock(&net_mutex);
+
/* Free the net generic variables */
list_for_each_entry_reverse(ops, &pernet_list, list)
ops_free_list(ops, &net_exit_list);
 
-   mutex_unlock(&net_mutex);
up_read(&net_sem);
 
/* Ensure there are no outstanding rcu callbacks using this



[PATCH RFC 05/25] net: Add primitives to update heads of pernet_list sublists

2017-11-17 Thread Kirill Tkhai
Currently we have first_device, and device and subsys
sublists. Next patches introduce one more sublist.
So, move the functionality, which will be repeating,
to the primitives.

Signed-off-by: Kirill Tkhai 
---
 net/core/net_namespace.c |   19 +++
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index a8ea580885d9..1d9712973695 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -939,6 +939,18 @@ static void __unregister_pernet_operations(struct 
pernet_operations *ops)
 
 static DEFINE_IDA(net_generic_ids);
 
+#define update_first_on_add(first, delim, added)   \
+   do {\
+   if (first == delim) \
+   first = added;  \
+   } while (0)
+
+#define update_first_on_del(first, to_delete)  \
+   do {\
+   if (first == to_delete) \
+   first = (to_delete)->next;  \
+   } while (0)
+
 static int register_pernet_operations(struct list_head *list,
  struct pernet_operations *ops)
 {
@@ -1045,8 +1057,8 @@ int register_pernet_device(struct pernet_operations *ops)
int error;
down_write(&net_sem);
error = register_pernet_operations(&pernet_list, ops);
-   if (!error && (first_device == &pernet_list))
-   first_device = &ops->list;
+   if (!error)
+   update_first_on_add(first_device, &pernet_list, &ops->list);
up_write(&net_sem);
return error;
 }
@@ -1064,8 +1076,7 @@ EXPORT_SYMBOL_GPL(register_pernet_device);
 void unregister_pernet_device(struct pernet_operations *ops)
 {
down_write(&net_sem);
-   if (&ops->list == first_device)
-   first_device = first_device->next;
+   update_first_on_del(first_device, &ops->list);
unregister_pernet_operations(ops);
up_write(&net_sem);
 }



Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit

2017-11-17 Thread Kirill Tkhai
On 15.11.2017 19:31, Eric W. Biederman wrote:
> Kirill Tkhai  writes:
> 
>> On 15.11.2017 12:51, Kirill Tkhai wrote:
>>> On 15.11.2017 06:19, Eric W. Biederman wrote:
 Kirill Tkhai  writes:

> On 14.11.2017 21:39, Cong Wang wrote:
>> On Tue, Nov 14, 2017 at 5:53 AM, Kirill Tkhai  
>> wrote:
>>> @@ -406,7 +406,7 @@ struct net *copy_net_ns(unsigned long flags,
>>>
>>> get_user_ns(user_ns);
>>>
>>> -   rv = mutex_lock_killable(&net_mutex);
>>> +   rv = down_read_killable(&net_sem);
>>> if (rv < 0) {
>>> net_free(net);
>>> dec_net_namespaces(ucounts);
>>> @@ -421,7 +421,7 @@ struct net *copy_net_ns(unsigned long flags,
>>> list_add_tail_rcu(&net->list, &net_namespace_list);
>>> rtnl_unlock();
>>> }
>>> -   mutex_unlock(&net_mutex);
>>> +   up_read(&net_sem);
>>> if (rv < 0) {
>>> dec_net_namespaces(ucounts);
>>> put_user_ns(user_ns);
>>> @@ -446,7 +446,7 @@ static void cleanup_net(struct work_struct *work)
>>> list_replace_init(&cleanup_list, &net_kill_list);
>>> spin_unlock_irq(&cleanup_list_lock);
>>>
>>> -   mutex_lock(&net_mutex);
>>> +   down_read(&net_sem);
>>>
>>> /* Don't let anyone else find us. */
>>> rtnl_lock();
>>> @@ -486,7 +486,7 @@ static void cleanup_net(struct work_struct *work)
>>> list_for_each_entry_reverse(ops, &pernet_list, list)
>>> ops_free_list(ops, &net_exit_list);
>>>
>>> -   mutex_unlock(&net_mutex);
>>> +   up_read(&net_sem);
>>
>> After your patch setup_net() could run concurrently with cleanup_net(),
>> given that ops_exit_list() is called on error path of setup_net() too,
>> it means ops->exit() now could run concurrently if it doesn't have its
>> own lock. Not sure if this breaks any existing user.
>
> Yes, there will be possible concurrent ops->init() for a net namespace,
> and ops->exit() for another one. I hadn't found pernet operations, which
> have a problem with that. If they exist, they are hidden and not clear 
> seen.
> The pernet operations in general do not touch someone else's memory.
> If suddenly there is one, KASAN should show it after a while.

 Certainly the use of hash tables shared between multiple network
 namespaces would count.  I don't rembmer how many of these we have but
 there used to be quite a few.
>>>
>>> Could you please provide an example of hash tables, you mean?
>>
>> Ah, I see, it's dccp_hashinfo etc.

JFI, I've checked dccp_hashinfo, and it seems to be safe.

> 
> The big one used to be the route cache.  With resizable hash tables
> things may be getting better in that regard.

I've checked some fib-related things, and wasn't able to find that.
Excuse me, could you please clarify, if it's an assumption, or
there is exactly a problem hash table, you know? Could you please
point it me more exactly, if it's so.


Re: [PATCH] net: bridge: add max_fdb_count

2017-11-17 Thread Willy Tarreau
Hi Andrew,

On Fri, Nov 17, 2017 at 03:06:23PM +0100, Andrew Lunn wrote:
> > Usually it's better to apply LRU or random here in my opinion, as the
> > new entry is much more likely to be needed than older ones by definition.
> 
> Hi Willy
> 
> I think this depends on why you need to discard. If it is normal
> operation and the limits are simply too low, i would agree.
> 
> If however it is a DoS, throwing away the new entries makes sense,
> leaving the old ones which are more likely to be useful.
> 
> Most of the talk in this thread has been about limits for DoS
> prevention...

Sure but my point is that it can kick in on regular traffic and in
this case it can be catastrophic. That's only what bothers me. If
we have an unlimited default value with this algorithm I'm fine
because nobody will get caught by accident with a bridge suddenly
replicating high traffic on all ports because an unknown limit was
reached. That's the principle of least surprise.

I know that when fighting DoSes there's never any universally good
solutions and one has to make tradeoffs. I'm perfectly fine with this.

Cheers,
Willy


Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit

2017-11-17 Thread Eric W. Biederman
Kirill Tkhai  writes:

> On 15.11.2017 19:31, Eric W. Biederman wrote:
>> Kirill Tkhai  writes:
>> 
>>> On 15.11.2017 12:51, Kirill Tkhai wrote:
 On 15.11.2017 06:19, Eric W. Biederman wrote:
> Kirill Tkhai  writes:
>
>> On 14.11.2017 21:39, Cong Wang wrote:
>>> On Tue, Nov 14, 2017 at 5:53 AM, Kirill Tkhai  
>>> wrote:
 @@ -406,7 +406,7 @@ struct net *copy_net_ns(unsigned long flags,

 get_user_ns(user_ns);

 -   rv = mutex_lock_killable(&net_mutex);
 +   rv = down_read_killable(&net_sem);
 if (rv < 0) {
 net_free(net);
 dec_net_namespaces(ucounts);
 @@ -421,7 +421,7 @@ struct net *copy_net_ns(unsigned long flags,
 list_add_tail_rcu(&net->list, &net_namespace_list);
 rtnl_unlock();
 }
 -   mutex_unlock(&net_mutex);
 +   up_read(&net_sem);
 if (rv < 0) {
 dec_net_namespaces(ucounts);
 put_user_ns(user_ns);
 @@ -446,7 +446,7 @@ static void cleanup_net(struct work_struct *work)
 list_replace_init(&cleanup_list, &net_kill_list);
 spin_unlock_irq(&cleanup_list_lock);

 -   mutex_lock(&net_mutex);
 +   down_read(&net_sem);

 /* Don't let anyone else find us. */
 rtnl_lock();
 @@ -486,7 +486,7 @@ static void cleanup_net(struct work_struct *work)
 list_for_each_entry_reverse(ops, &pernet_list, list)
 ops_free_list(ops, &net_exit_list);

 -   mutex_unlock(&net_mutex);
 +   up_read(&net_sem);
>>>
>>> After your patch setup_net() could run concurrently with cleanup_net(),
>>> given that ops_exit_list() is called on error path of setup_net() too,
>>> it means ops->exit() now could run concurrently if it doesn't have its
>>> own lock. Not sure if this breaks any existing user.
>>
>> Yes, there will be possible concurrent ops->init() for a net namespace,
>> and ops->exit() for another one. I hadn't found pernet operations, which
>> have a problem with that. If they exist, they are hidden and not clear 
>> seen.
>> The pernet operations in general do not touch someone else's memory.
>> If suddenly there is one, KASAN should show it after a while.
>
> Certainly the use of hash tables shared between multiple network
> namespaces would count.  I don't rembmer how many of these we have but
> there used to be quite a few.

 Could you please provide an example of hash tables, you mean?
>>>
>>> Ah, I see, it's dccp_hashinfo etc.
>
> JFI, I've checked dccp_hashinfo, and it seems to be safe.
>
>> 
>> The big one used to be the route cache.  With resizable hash tables
>> things may be getting better in that regard.
>
> I've checked some fib-related things, and wasn't able to find that.
> Excuse me, could you please clarify, if it's an assumption, or
> there is exactly a problem hash table, you know? Could you please
> point it me more exactly, if it's so.

Two things.
1) Hash tables are one case I know where we access data from multiple
   network namespaces.  As such it can not be asserted that is no
   possibility for problems.

2) The responsible way to handle this is one patch for each set of
   methods explaining why those methods are safe to run in parallel.

   That ensures there is opportunity for review and people are going
   slowly enough that they actually look at these issues.

The reason I want to see this broken up is that at 200ish sets of
methods it is too much to review all at once.

I completely agree that odds are that this can be made safe and that it
is mostly likely already safe in practically every instance.My guess
would be that if there are problems that need to be addressed they
happen in one or two places and we need to find them.  If possible I
don't want to find them after the code has shipped in a stable release.

Eric



   


Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit

2017-11-17 Thread Eric W. Biederman
Kirill Tkhai  writes:

> On 15.11.2017 19:29, Eric W. Biederman wrote:
>> Kirill Tkhai  writes:
>> 
>>> On 15.11.2017 09:25, Eric W. Biederman wrote:
 Kirill Tkhai  writes:

> Curently mutex is used to protect pernet operations list. It makes
> cleanup_net() to execute ->exit methods of the same operations set,
> which was used on the time of ->init, even after net namespace is
> unlinked from net_namespace_list.
>
> But the problem is it's need to synchronize_rcu() after net is removed
> from net_namespace_list():
>
> Destroy net_ns:
> cleanup_net()
>   mutex_lock(&net_mutex)
>   list_del_rcu(&net->list)
>   synchronize_rcu()  <--- Sleep there for 
> ages
>   list_for_each_entry_reverse(ops, &pernet_list, list)
> ops_exit_list(ops, &net_exit_list)
>   list_for_each_entry_reverse(ops, &pernet_list, list)
> ops_free_list(ops, &net_exit_list)
>   mutex_unlock(&net_mutex)
>
> This primitive is not fast, especially on the systems with many processors
> and/or when preemptible RCU is enabled in config. So, all the time, while
> cleanup_net() is waiting for RCU grace period, creation of new net 
> namespaces
> is not possible, the tasks, who makes it, are sleeping on the same mutex:
>
> Create net_ns:
> copy_net_ns()
>   mutex_lock_killable(&net_mutex)<--- Sleep there for 
> ages
>
> The solution is to convert net_mutex to the rw_semaphore. Then,
> pernet_operations::init/::exit methods, modifying the net-related data,
> will require down_read() locking only, while down_write() will be used
> for changing pernet_list.
>
> This gives signify performance increase, like you may see below. There
> is measured sequential net namespace creation in a cycle, in single
> thread, without other tasks (single user mode):
>
> 1)int main(int argc, char *argv[])
> {
> unsigned nr;
> if (argc < 2) {
> fprintf(stderr, "Provide nr iterations arg\n");
> return 1;
> }
> nr = atoi(argv[1]);
> while (nr-- > 0) {
> if (unshare(CLONE_NEWNET)) {
> perror("Can't unshare");
> return 1;
> }
> }
> return 0;
> }
>
> Origin, 10 unshare():
> 0.03user 23.14system 1:39.85elapsed 23%CPU
>
> Patched, 10 unshare():
> 0.03user 67.49system 1:08.34elapsed 98%CPU
>
> 2)for i in {1..1}; do unshare -n bash -c exit; done
>
> Origin:
> real 1m24,190s
> user 0m6,225s
> sys 0m15,132s
>
> Patched:
> real 0m18,235s   (4.6 times faster)
> user 0m4,544s
> sys 0m13,796s
>
> This patch requires commit 76f8507f7a64 "locking/rwsem: Add 
> down_read_killable()"
> from Linus tree (not in net-next yet).

 Using a rwsem to protect the list of operations makes sense.

 That should allow removing the sing

 I am not wild about taking a the rwsem down_write in
 rtnl_link_unregister, and net_ns_barrier.  I think that works but it
 goes from being a mild hack to being a pretty bad hack and something
 else that can kill the parallelism you are seeking it add.

 There are about 204 instances of struct pernet_operations.  That is a
 lot of code to have carefully audited to ensure it can in parallel all
 at once.  The existence of the exit_batch method, net_ns_barrier,
 for_each_net and taking of net_mutex in rtnl_link_unregister all testify
 to the fact that there are data structures accessed by multiple network
 namespaces.

 My preference would be to:

 - Add the net_sem in addition to net_mutex with down_write only held in
   register and unregister, and maybe net_ns_barrier and
   rtnl_link_unregister.

 - Factor out struct pernet_ops out of struct pernet_operations.  With
   struct pernet_ops not having the exit_batch method.  With pernet_ops
   being embedded an anonymous member of the old struct pernet_operations.

 - Add [un]register_pernet_{sys,dev} functions that take a struct
   pernet_ops, that don't take net_mutex.  Have them order the
   pernet_list as:

   pernet_sys
   pernet_subsys
   pernet_device
   pernet_dev

   With the chunk in the middle taking the net_mutex.
>>>
>>> I think this approach will work. Thanks for the suggestion. Some more
>>> thoughts to the plan below.
>>>
>>> The only difficult thing there will be to choose the right order
>>> to move ops from pernet_subsys to pernet_sys and from pernet_device
>>> to pernet_dev one by one.
>>>
>>> This is rather easy in case of tristate drivers, as modules may be loaded
>>> at any time, and the only important o

RE: [PATCH v1 net-next 0/7] net: dsa: microchip: Modify KSZ9477 DSA driver in preparation to add other KSZ switch drivers

2017-11-17 Thread Tristram.Ha
> On Thu, Nov 16, 2017 at 06:41:24PM -0800, tristram...@microchip.com
> wrote:
> > From: Tristram Ha 
> >
> > This series of patches is to modify the original KSZ9477 DSA driver so
> > that other KSZ switch drivers can be added and use the common code.
> 
> Hi Tristram
> 
> http://vger.kernel.org/~davem/net-next.html
> 
> It is better to send an RFC patchset while netdev is closed and not
> send it to David. He will shout at you otherwise.

Noted.

I really need to monitor the DSA discussion to better contribute to its success.
I just found out the DSA API set_addr was removed last month due to not
everybody is using it.  It cited the Marvell switch was the only switch using 
that
API and found a new way to program the MAC address.  But looking at that
driver I found it simply uses a randomized MAC address.

For big switch with many ports where the main function is forwarding that MAC
address may not matter.  For small switch with 2 ports it acts more like an 
Ethernet
controller where the switch is mainly used for daisy chaining in a ring network 
the MAC
address can be used in feature like source address filtering.



Re: [PATCH v1 net-next 0/7] net: dsa: microchip: Modify KSZ9477 DSA driver in preparation to add other KSZ switch drivers

2017-11-17 Thread Andrew Lunn
> I really need to monitor the DSA discussion to better contribute to its 
> success.
> I just found out the DSA API set_addr was removed last month due to not
> everybody is using it.  It cited the Marvell switch was the only switch using 
> that
> API and found a new way to program the MAC address.  But looking at that
> driver I found it simply uses a randomized MAC address.
> 
> For big switch with many ports where the main function is forwarding that MAC
> address may not matter.  For small switch with 2 ports it acts more like an 
> Ethernet
> controller where the switch is mainly used for daisy chaining in a ring 
> network the MAC
> address can be used in feature like source address filtering.

Hi Tristram

The MAC address set by set_addr was only used for pause
frames. Nothing else. So a random address is fine.

The switch itself should not be sending any other frames.

Andrew


Re: [patch net-next RFC v2 00/11] Add support for resource abstraction

2017-11-17 Thread David Ahern
On 11/14/17 9:18 AM, Jiri Pirko wrote:
> From: Jiri Pirko 
> 
> Arkadi says:
> 
> Many of the ASIC's internal resources are limited and are shared between
> several hardware procedures. For example, unified hash-based memory can
> be used for many lookup purposes, like FDB and LPM. In many cases the user
> can provide a partitioning scheme for such a resource in order to perform
> fine tuning for his application. In many cases after setting the
> partitioning of the resource driver reload is needed. This patchset add
> support for hot reset of the driver.
> 
> Such an abstraction can be coupled with devlink's dpipe interface, which
> models the ASIC's pipeline as an graph of match/action tables. By modeling
> the hardware resource object, and by coupling it to several dpipe tables,
> further visibility can be achieved in order to debug ASIC-wide issues.
> 
> The proposed interface will provide the user the ability to understand the
> limitations of the hardware, and receive notification regarding its occupancy.
> Furthermore, monitoring the resource occupancy can be done in real-time and
> can be useful in many cases.
> 
> Userspace part prototype can be found at https://github.com/arkadis/iproute2/
> at resource_dev branch.
> 

now that my firmware problem is fixed, I installed a build with this
patch set. Trying to run devlink to split a port hangs:

$ devlink port split swp1 count 4


[  615.373359] INFO: task devlink:804 blocked for more than 120 seconds.
[  615.379934]   Tainted: GW   4.14.0+ #38
[  615.385238] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  615.393111] devlink D0   804771 0x0080
[  615.393115] Call Trace:
[  615.393126]  __schedule+0x1de/0x690
[  615.393130]  schedule+0x36/0x80
[  615.393139]  schedule_preempt_disabled+0xe/0x10
[  615.393146]  __mutex_lock.isra.4+0x211/0x530
[  615.393152]  __mutex_lock_slowpath+0x13/0x20
[  615.393155]  ? __mutex_lock_slowpath+0x13/0x20
[  615.393158]  mutex_lock+0x2f/0x40
[  615.393164]  devlink_port_unregister+0x29/0x60 [devlink]
[  615.393169]  mlxsw_core_port_fini+0x25/0x50 [mlxsw_core]
[  615.393179]  mlxsw_sp_port_remove+0xf0/0x100 [mlxsw_spectrum]
[  615.393186]  mlxsw_sp_port_split+0xdc/0x260 [mlxsw_spectrum]
[  615.393193]  ? _cond_resched+0x19/0x30
[  615.393200]  mlxsw_devlink_port_split+0x36/0x50 [mlxsw_core]
[  615.393206]  devlink_nl_cmd_port_split_doit+0x42/0x50 [devlink]
[  615.393212]  genl_family_rcv_msg+0x1c9/0x390
[  615.393217]  genl_rcv_msg+0x4c/0xa0
[  615.393220]  ? _cond_resched+0x19/0x30
[  615.393228]  ? genl_family_rcv_msg+0x390/0x390
[  615.393232]  netlink_rcv_skb+0xec/0x120
[  615.393235]  genl_rcv+0x28/0x40
[  615.393239]  netlink_unicast+0x170/0x230
[  615.393244]  netlink_sendmsg+0x28e/0x370
[  615.393251]  SYSC_sendto+0x10e/0x1b0
[  615.393258]  ? __audit_syscall_entry+0xc1/0x110
[  615.393261]  ? syscall_trace_enter+0x1c6/0x2d0
[  615.393264]  ? __do_page_fault+0x231/0x4b0
[  615.393268]  SyS_sendto+0xe/0x10
[  615.393272]  do_syscall_64+0x60/0x1f0
[  615.393277]  entry_SYSCALL64_slow_path+0x25/0x25
[  615.393280] RIP: 0033:0x7f4ef43c16f3
[  615.393284] RSP: 002b:7fffb907fbc8 EFLAGS: 0246 ORIG_RAX:
002c
[  615.393287] RAX: ffda RBX: 013660e0 RCX:
7f4ef43c16f3
[  615.393290] RDX: 0040 RSI: 01366110 RDI:
0003
[  615.393291] RBP:  R08: 7f4ef4686d80 R09:
000c
[  615.393292] R10:  R11: 0246 R12:

[  615.393296] R13: 0004 R14:  R15:



Re: [PATCH v10 5/8] ARM: dts: sunxi: Restore EMAC changes (boards)

2017-11-17 Thread Philipp Rossak

Hey,
Sorry for the bringing this up again.
Isn't there a: ethernet0 = &emac; for some boards missing?

Best,
Philipp

(Sorry for sending this to some persons more than once! My Thunderbird 
sent mails in html and didn't reach the mailing lists. I hope it works 
now :) )


On 31.10.2017 09:19, Corentin Labbe wrote:

The original dwmac-sun8i DT bindings have some issue on how to handle
integrated PHY and was reverted in last RC of 4.13.
But now we have a solution so we need to get back that was reverted.

This patch restore all boards DT about dwmac-sun8i
This reverts partially commit fe45174b72ae ("arm: dts: sunxi: Revert EMAC 
changes")

Signed-off-by: Corentin Labbe 
Acked-by: Florian Fainelli 
---
  arch/arm/boot/dts/sun8i-h2-plus-orangepi-zero.dts |  9 +
  arch/arm/boot/dts/sun8i-h3-bananapi-m2-plus.dts   | 19 +++
  arch/arm/boot/dts/sun8i-h3-nanopi-m1-plus.dts | 19 +++
  arch/arm/boot/dts/sun8i-h3-nanopi-neo.dts |  7 +++
  arch/arm/boot/dts/sun8i-h3-orangepi-2.dts |  8 
  arch/arm/boot/dts/sun8i-h3-orangepi-one.dts   |  8 
  arch/arm/boot/dts/sun8i-h3-orangepi-pc-plus.dts   |  5 +
  arch/arm/boot/dts/sun8i-h3-orangepi-pc.dts|  8 
  arch/arm/boot/dts/sun8i-h3-orangepi-plus.dts  | 22 ++
  arch/arm/boot/dts/sun8i-h3-orangepi-plus2e.dts| 16 
  10 files changed, 121 insertions(+)

diff --git a/arch/arm/boot/dts/sun8i-h2-plus-orangepi-zero.dts 
b/arch/arm/boot/dts/sun8i-h2-plus-orangepi-zero.dts
index b1502df7b509..6713d0f2b3f4 100644
--- a/arch/arm/boot/dts/sun8i-h2-plus-orangepi-zero.dts
+++ b/arch/arm/boot/dts/sun8i-h2-plus-orangepi-zero.dts
@@ -56,6 +56,8 @@
  
  	aliases {

serial0 = &uart0;
+   /* ethernet0 is the H3 emac, defined in sun8i-h3.dtsi */
+   ethernet0 = &emac;
ethernet1 = &xr819;
};
  
@@ -102,6 +104,13 @@

status = "okay";
  };
  
+&emac {

+   phy-handle = <&int_mii_phy>;
+   phy-mode = "mii";
+   allwinner,leds-active-low;
+   status = "okay";
+};
+
  &mmc0 {
pinctrl-names = "default";
pinctrl-0 = <&mmc0_pins_a>;
diff --git a/arch/arm/boot/dts/sun8i-h3-bananapi-m2-plus.dts 
b/arch/arm/boot/dts/sun8i-h3-bananapi-m2-plus.dts
index e1dba9ffa94b..f2292deaa590 100644
--- a/arch/arm/boot/dts/sun8i-h3-bananapi-m2-plus.dts
+++ b/arch/arm/boot/dts/sun8i-h3-bananapi-m2-plus.dts
@@ -52,6 +52,7 @@
compatible = "sinovoip,bpi-m2-plus", "allwinner,sun8i-h3";
  
  	aliases {

+   ethernet0 = &emac;
serial0 = &uart0;
serial1 = &uart1;
};
@@ -111,6 +112,24 @@
status = "okay";
  };
  
+&emac {

+   pinctrl-names = "default";
+   pinctrl-0 = <&emac_rgmii_pins>;
+   phy-supply = <®_gmac_3v3>;
+   phy-handle = <&ext_rgmii_phy>;
+   phy-mode = "rgmii";
+
+   allwinner,leds-active-low;
+   status = "okay";
+};
+
+&external_mdio {
+   ext_rgmii_phy: ethernet-phy@1 {
+   compatible = "ethernet-phy-ieee802.3-c22";
+   reg = <0>;
+   };
+};
+
  &ir {
pinctrl-names = "default";
pinctrl-0 = <&ir_pins_a>;
diff --git a/arch/arm/boot/dts/sun8i-h3-nanopi-m1-plus.dts 
b/arch/arm/boot/dts/sun8i-h3-nanopi-m1-plus.dts
index 73766d38ee6c..cfb96da3cfef 100644
--- a/arch/arm/boot/dts/sun8i-h3-nanopi-m1-plus.dts
+++ b/arch/arm/boot/dts/sun8i-h3-nanopi-m1-plus.dts
@@ -66,6 +66,25 @@
status = "okay";
  };
  
+&emac {

+   pinctrl-names = "default";
+   pinctrl-0 = <&emac_rgmii_pins>;
+   phy-supply = <®_gmac_3v3>;
+   phy-handle = <&ext_rgmii_phy>;
+   phy-mode = "rgmii";
+
+   allwinner,leds-active-low;
+
+   status = "okay";
+};
+
+&external_mdio {
+   ext_rgmii_phy: ethernet-phy@1 {
+   compatible = "ethernet-phy-ieee802.3-c22";
+   reg = <7>;
+   };
+};
+
  &ir {
pinctrl-names = "default";
pinctrl-0 = <&ir_pins_a>;
diff --git a/arch/arm/boot/dts/sun8i-h3-nanopi-neo.dts 
b/arch/arm/boot/dts/sun8i-h3-nanopi-neo.dts
index 8d2cc6e9a03f..78f6c24952dd 100644
--- a/arch/arm/boot/dts/sun8i-h3-nanopi-neo.dts
+++ b/arch/arm/boot/dts/sun8i-h3-nanopi-neo.dts
@@ -46,3 +46,10 @@
model = "FriendlyARM NanoPi NEO";
compatible = "friendlyarm,nanopi-neo", "allwinner,sun8i-h3";
  };
+
+&emac {
+   phy-handle = <&int_mii_phy>;
+   phy-mode = "mii";
+   allwinner,leds-active-low;
+   status = "okay";
+};
diff --git a/arch/arm/boot/dts/sun8i-h3-orangepi-2.dts 
b/arch/arm/boot/dts/sun8i-h3-orangepi-2.dts
index 1bf51802f5aa..b20be95b49d5 100644
--- a/arch/arm/boot/dts/sun8i-h3-orangepi-2.dts
+++ b/arch/arm/boot/dts/sun8i-h3-orangepi-2.dts
@@ -54,6 +54,7 @@
aliases {
serial0 = &uart0;
/* ethernet0 is the H3 emac, defined in sun8i-h3.dtsi */
+   ethernet0 = &emac;
ethernet1 = &rtl81

Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit

2017-11-17 Thread Kirill Tkhai
On 17.11.2017 21:54, Eric W. Biederman wrote:
> Kirill Tkhai  writes:
> 
>> On 15.11.2017 19:29, Eric W. Biederman wrote:
>>> Kirill Tkhai  writes:
>>>
 On 15.11.2017 09:25, Eric W. Biederman wrote:
> Kirill Tkhai  writes:
>
>> Curently mutex is used to protect pernet operations list. It makes
>> cleanup_net() to execute ->exit methods of the same operations set,
>> which was used on the time of ->init, even after net namespace is
>> unlinked from net_namespace_list.
>>
>> But the problem is it's need to synchronize_rcu() after net is removed
>> from net_namespace_list():
>>
>> Destroy net_ns:
>> cleanup_net()
>>   mutex_lock(&net_mutex)
>>   list_del_rcu(&net->list)
>>   synchronize_rcu()  <--- Sleep there 
>> for ages
>>   list_for_each_entry_reverse(ops, &pernet_list, list)
>> ops_exit_list(ops, &net_exit_list)
>>   list_for_each_entry_reverse(ops, &pernet_list, list)
>> ops_free_list(ops, &net_exit_list)
>>   mutex_unlock(&net_mutex)
>>
>> This primitive is not fast, especially on the systems with many 
>> processors
>> and/or when preemptible RCU is enabled in config. So, all the time, while
>> cleanup_net() is waiting for RCU grace period, creation of new net 
>> namespaces
>> is not possible, the tasks, who makes it, are sleeping on the same mutex:
>>
>> Create net_ns:
>> copy_net_ns()
>>   mutex_lock_killable(&net_mutex)<--- Sleep there 
>> for ages
>>
>> The solution is to convert net_mutex to the rw_semaphore. Then,
>> pernet_operations::init/::exit methods, modifying the net-related data,
>> will require down_read() locking only, while down_write() will be used
>> for changing pernet_list.
>>
>> This gives signify performance increase, like you may see below. There
>> is measured sequential net namespace creation in a cycle, in single
>> thread, without other tasks (single user mode):
>>
>> 1)int main(int argc, char *argv[])
>> {
>> unsigned nr;
>> if (argc < 2) {
>> fprintf(stderr, "Provide nr iterations arg\n");
>> return 1;
>> }
>> nr = atoi(argv[1]);
>> while (nr-- > 0) {
>> if (unshare(CLONE_NEWNET)) {
>> perror("Can't unshare");
>> return 1;
>> }
>> }
>> return 0;
>> }
>>
>> Origin, 10 unshare():
>> 0.03user 23.14system 1:39.85elapsed 23%CPU
>>
>> Patched, 10 unshare():
>> 0.03user 67.49system 1:08.34elapsed 98%CPU
>>
>> 2)for i in {1..1}; do unshare -n bash -c exit; done
>>
>> Origin:
>> real 1m24,190s
>> user 0m6,225s
>> sys 0m15,132s
>>
>> Patched:
>> real 0m18,235s   (4.6 times faster)
>> user 0m4,544s
>> sys 0m13,796s
>>
>> This patch requires commit 76f8507f7a64 "locking/rwsem: Add 
>> down_read_killable()"
>> from Linus tree (not in net-next yet).
>
> Using a rwsem to protect the list of operations makes sense.
>
> That should allow removing the sing
>
> I am not wild about taking a the rwsem down_write in
> rtnl_link_unregister, and net_ns_barrier.  I think that works but it
> goes from being a mild hack to being a pretty bad hack and something
> else that can kill the parallelism you are seeking it add.
>
> There are about 204 instances of struct pernet_operations.  That is a
> lot of code to have carefully audited to ensure it can in parallel all
> at once.  The existence of the exit_batch method, net_ns_barrier,
> for_each_net and taking of net_mutex in rtnl_link_unregister all testify
> to the fact that there are data structures accessed by multiple network
> namespaces.
>
> My preference would be to:
>
> - Add the net_sem in addition to net_mutex with down_write only held in
>   register and unregister, and maybe net_ns_barrier and
>   rtnl_link_unregister.
>
> - Factor out struct pernet_ops out of struct pernet_operations.  With
>   struct pernet_ops not having the exit_batch method.  With pernet_ops
>   being embedded an anonymous member of the old struct pernet_operations.
>
> - Add [un]register_pernet_{sys,dev} functions that take a struct
>   pernet_ops, that don't take net_mutex.  Have them order the
>   pernet_list as:
>
>   pernet_sys
>   pernet_subsys
>   pernet_device
>   pernet_dev
>
>   With the chunk in the middle taking the net_mutex.

 I think this approach will work. Thanks for the suggestion. Some more
 thoughts to the plan below.

 The only difficult thing there will be to choose the right order
 to move ops from pernet_subsys to per

[PATCH] usbnet: ipheth: fix potential null pointer dereference in ipheth_carrier_set

2017-11-17 Thread Gustavo A. R. Silva
_dev_ is being dereferenced before it is null checked, hence there
is a potential null pointer dereference.

Fix this by moving the pointer dereference after _dev_ has been null
checked.

Addresses-Coverity-ID: 1462020
Fixes: bb1b40c7cb86 ("usbnet: ipheth: prevent TX queue timeouts when device not 
ready")
Signed-off-by: Gustavo A. R. Silva 
---
 drivers/net/usb/ipheth.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/usb/ipheth.c b/drivers/net/usb/ipheth.c
index ca71f6c..7275761 100644
--- a/drivers/net/usb/ipheth.c
+++ b/drivers/net/usb/ipheth.c
@@ -291,12 +291,15 @@ static void ipheth_sndbulk_callback(struct urb *urb)
 
 static int ipheth_carrier_set(struct ipheth_device *dev)
 {
-   struct usb_device *udev = dev->udev;
+   struct usb_device *udev;
int retval;
+
if (!dev)
return 0;
if (!dev->confirmed_pairing)
return 0;
+
+   udev = dev->udev;
retval = usb_control_msg(udev,
usb_rcvctrlpipe(udev, IPHETH_CTRL_ENDP),
IPHETH_CMD_CARRIER_CHECK, /* request */
-- 
2.7.4



Re: [pull request][net V2 0/5] Mellanox, mlx5 fixes 2017-11-08

2017-11-17 Thread Saeed Mahameed
On Sat, Nov 11, 2017 at 2:42 AM, David Miller  wrote:
> From: Saeed Mahameed 
> Date: Fri, 10 Nov 2017 15:50:15 +0900
>
>> The follwoing series includes some fixes for mlx5 core and etherent
>> driver.
>>
>> Sorry for the late submission but as you can see i have some very
>> critical fixes below that i would like them merged into this RC.
>>
>> Please pull and let me know if there is any problem.
>
> Pulled.
>
>> For -stable:
>> ('net/mlx5e: Set page to null in case dma mapping fails') kernels >= 4.13
>> ('net/mlx5: FPGA, return -EINVAL if size is zero') kernels >= 4.13
>> ('net/mlx5: Cancel health poll before sending panic teardown command') 
>> kernels >= 4.13
>
> That FPGA change doesn't appear in this pull request.
>

Sorry about that, I had to drop it as you see in "V1->V2" log, but
forgot to remove it from the -stable list.


[PATCH iproute2/net-next v3]tc: B.W limits can now be specified in %.

2017-11-17 Thread Nishanth Devarajan
This patch adapts the tc command line interface to allow bandwidth limits
to be specified as a percentage of the interface's capacity.

Adding this functionality requires passing the specified device string to
each class/qdisc which changes the prototype for a couple of functions: the
.parse_qopt and .parse_copt interfaces. The device string is a required
parameter for tc-qdisc and tc-class, and when not specified, the kernel
returns ENODEV. In this patch, if the user tries to specify a bandwidth
percentage without naming the device, we return an error from userspace.

v2:
* Modified and moved int read_prop() from ip/iptuntap.c to lib/utils.c,
to make it accessible to tc. 

v3:
* Modified and moved int parse_percent() from tc/q_netem.c to ib/util.c for
use in tc.

* Changed couple variable names in int parse_percent_rate().

* Handled showing error message when device speed is unknown.

* Updated man page to warn users that when specifying rates in %, tc only
uses the current device speed and does not recalculate if it changes after.

During cases when properties (like device speed) are unknown, read_prop()
assumes that if the property file can be opened but not read, it means
that the property is unknown.

Signed-off by: Nishanth Devarajan

---
 include/utils.h |  2 ++
 ip/iptuntap.c   | 32 ---
 lib/utils.c | 68 +
 man/man8/tc.8   |  5 -
 tc/q_atm.c  |  2 +-
 tc/q_cbq.c  | 25 -
 tc/q_choke.c|  9 ++--
 tc/q_clsact.c   |  2 +-
 tc/q_codel.c|  2 +-
 tc/q_drr.c  |  4 ++--
 tc/q_dsmark.c   |  4 ++--
 tc/q_fifo.c |  2 +-
 tc/q_fq.c   | 16 +++---
 tc/q_fq_codel.c |  2 +-
 tc/q_gred.c |  9 ++--
 tc/q_hfsc.c | 45 +-
 tc/q_hhf.c  |  2 +-
 tc/q_htb.c  | 18 +++
 tc/q_ingress.c  |  2 +-
 tc/q_mqprio.c   |  2 +-
 tc/q_multiq.c   |  2 +-
 tc/q_netem.c| 23 ++-
 tc/q_pie.c  |  2 +-
 tc/q_prio.c |  2 +-
 tc/q_qfq.c  |  4 ++--
 tc/q_red.c  |  9 ++--
 tc/q_rr.c   |  2 +-
 tc/q_sfb.c  |  2 +-
 tc/q_sfq.c  |  2 +-
 tc/q_tbf.c  | 16 +++---
 tc/tc.c |  2 +-
 tc/tc_class.c   |  2 +-
 tc/tc_qdisc.c   |  2 +-
 tc/tc_util.c| 63 
 tc/tc_util.h|  7 --
 35 files changed, 283 insertions(+), 110 deletions(-)

diff --git a/include/utils.h b/include/utils.h
index 3d91c50..9377266 100644
--- a/include/utils.h
+++ b/include/utils.h
@@ -87,6 +87,8 @@ int get_prefix(inet_prefix *dst, char *arg, int family);
 int mask2bits(__u32 netmask);
 int get_addr_ila(__u64 *val, const char *arg);
 
+int read_prop(const char *dev, char *prop, long *value);
+int parse_percent(double *val, const char *str);
 int get_hex(char c);
 int get_integer(int *val, const char *arg, int base);
 int get_unsigned(unsigned *val, const char *arg, int base);
diff --git a/ip/iptuntap.c b/ip/iptuntap.c
index b46e452..09f2be2 100644
--- a/ip/iptuntap.c
+++ b/ip/iptuntap.c
@@ -223,38 +223,6 @@ static int do_del(int argc, char **argv)
return tap_del_ioctl(&ifr);
 }
 
-static int read_prop(char *dev, char *prop, long *value)
-{
-   char fname[IFNAMSIZ+25], buf[80], *endp;
-   ssize_t len;
-   int fd;
-   long result;
-
-   sprintf(fname, "/sys/class/net/%s/%s", dev, prop);
-   fd = open(fname, O_RDONLY);
-   if (fd < 0) {
-   if (strcmp(prop, "tun_flags"))
-   fprintf(stderr, "open %s: %s\n", fname,
-   strerror(errno));
-   return -1;
-   }
-   len = read(fd, buf, sizeof(buf)-1);
-   close(fd);
-   if (len < 0) {
-   fprintf(stderr, "read %s: %s", fname, strerror(errno));
-   return -1;
-   }
-
-   buf[len] = 0;
-   result = strtol(buf, &endp, 0);
-   if (*endp != '\n') {
-   fprintf(stderr, "Failed to parse %s\n", fname);
-   return -1;
-   }
-   *value = result;
-   return 0;
-}
-
 static void print_flags(long flags)
 {
if (flags & IFF_TUN)
diff --git a/lib/utils.c b/lib/utils.c
index 4f2fa28..9d5ba2a 100644
--- a/lib/utils.c
+++ b/lib/utils.c
@@ -39,6 +39,74 @@
 int resolve_hosts;
 int timestamp_short;
 
+int read_prop(const char *dev, char *prop, long *value)
+{
+   char fname[128], buf[80], *endp, *nl;
+   FILE *fp;
+   long result;
+   int ret;
+
+   ret = snprintf(fname, sizeof(fname), "/sys/class/net/%s/%s",
+   dev, prop);
+
+   if (ret <= 0 || ret >= sizeof(fname)) {
+   fprintf(stderr, "could not build pathname for property\n");
+   return -1;
+   }
+
+   fp = fopen(fname, "r");
+   if (fp == NULL) {
+   fprintf(stderr, "fopen %s: %s\n", fname, strerror(errno));
+   return -1;
+   }
+
+   if (!fgets(buf, size

Re: [PATCH iproute2/net-next v3]tc: B.W limits can now be specified in %.

2017-11-17 Thread Stephen Hemminger
On Sat, 18 Nov 2017 02:13:38 +0530
Nishanth Devarajan  wrote:

> + result = strtoul(buf, &endp, 0);
> +
> + if (*endp || buf == endp) {
> + fprintf(stderr, "value \"%s\" in file %s is not a number\n",
> + buf, fname);
> + goto out;
> + }
> +
> + if (result == ULONG_MAX && errno == ERANGE) {
> + fprintf(stderr, "strtoul %s: %s", fname, strerror(errno));
> + goto out;
> + }

Since speed value of unknown is represented as "-1" I think you need to
change this API to take signed value (ie use strtol)


Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit

2017-11-17 Thread Kirill Tkhai
On 17.11.2017 21:52, Eric W. Biederman wrote:
> Kirill Tkhai  writes:
> 
>> On 15.11.2017 19:31, Eric W. Biederman wrote:
>>> Kirill Tkhai  writes:
>>>
 On 15.11.2017 12:51, Kirill Tkhai wrote:
> On 15.11.2017 06:19, Eric W. Biederman wrote:
>> Kirill Tkhai  writes:
>>
>>> On 14.11.2017 21:39, Cong Wang wrote:
 On Tue, Nov 14, 2017 at 5:53 AM, Kirill Tkhai  
 wrote:
> @@ -406,7 +406,7 @@ struct net *copy_net_ns(unsigned long flags,
>
> get_user_ns(user_ns);
>
> -   rv = mutex_lock_killable(&net_mutex);
> +   rv = down_read_killable(&net_sem);
> if (rv < 0) {
> net_free(net);
> dec_net_namespaces(ucounts);
> @@ -421,7 +421,7 @@ struct net *copy_net_ns(unsigned long flags,
> list_add_tail_rcu(&net->list, &net_namespace_list);
> rtnl_unlock();
> }
> -   mutex_unlock(&net_mutex);
> +   up_read(&net_sem);
> if (rv < 0) {
> dec_net_namespaces(ucounts);
> put_user_ns(user_ns);
> @@ -446,7 +446,7 @@ static void cleanup_net(struct work_struct *work)
> list_replace_init(&cleanup_list, &net_kill_list);
> spin_unlock_irq(&cleanup_list_lock);
>
> -   mutex_lock(&net_mutex);
> +   down_read(&net_sem);
>
> /* Don't let anyone else find us. */
> rtnl_lock();
> @@ -486,7 +486,7 @@ static void cleanup_net(struct work_struct *work)
> list_for_each_entry_reverse(ops, &pernet_list, list)
> ops_free_list(ops, &net_exit_list);
>
> -   mutex_unlock(&net_mutex);
> +   up_read(&net_sem);

 After your patch setup_net() could run concurrently with cleanup_net(),
 given that ops_exit_list() is called on error path of setup_net() too,
 it means ops->exit() now could run concurrently if it doesn't have its
 own lock. Not sure if this breaks any existing user.
>>>
>>> Yes, there will be possible concurrent ops->init() for a net namespace,
>>> and ops->exit() for another one. I hadn't found pernet operations, which
>>> have a problem with that. If they exist, they are hidden and not clear 
>>> seen.
>>> The pernet operations in general do not touch someone else's memory.
>>> If suddenly there is one, KASAN should show it after a while.
>>
>> Certainly the use of hash tables shared between multiple network
>> namespaces would count.  I don't rembmer how many of these we have but
>> there used to be quite a few.
>
> Could you please provide an example of hash tables, you mean?

 Ah, I see, it's dccp_hashinfo etc.
>>
>> JFI, I've checked dccp_hashinfo, and it seems to be safe.
>>
>>>
>>> The big one used to be the route cache.  With resizable hash tables
>>> things may be getting better in that regard.
>>
>> I've checked some fib-related things, and wasn't able to find that.
>> Excuse me, could you please clarify, if it's an assumption, or
>> there is exactly a problem hash table, you know? Could you please
>> point it me more exactly, if it's so.
> 
> Two things.
> 1) Hash tables are one case I know where we access data from multiple
>network namespaces.  As such it can not be asserted that is no
>possibility for problems.
> 
> 2) The responsible way to handle this is one patch for each set of
>methods explaining why those methods are safe to run in parallel.
> 
>That ensures there is opportunity for review and people are going
>slowly enough that they actually look at these issues.
> 
> The reason I want to see this broken up is that at 200ish sets of
> methods it is too much to review all at once.

Ok, it's possible to split the changes in 400 patches, but there is
a problem with three-state (no compile, module, built-in) drivers.
Git bisect won't work anyway. Please see the description of the problem
in cover message "[PATCH RFC 00/25] Replacing net_mutex with rw_semaphore"
I sent today.
 
> I completely agree that odds are that this can be made safe and that it
> is mostly likely already safe in practically every instance.My guess
> would be that if there are problems that need to be addressed they
> happen in one or two places and we need to find them.  If possible I
> don't want to find them after the code has shipped in a stable release.

Kirill


[PATCH iproute2] tc: cleanup qdisc arg parsing

2017-11-17 Thread Stephen Hemminger
The qdisc arg parsing has magic limit of 16 for class which is not required
by kernel. Also the limit of 16 for device name is really IFNAMSIZ.

Signed-off-by: Stephen Hemminger 
---
 tc/tc_qdisc.c | 21 +
 1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/tc/tc_qdisc.c b/tc/tc_qdisc.c
index fcb75f29128e..1066ae05a4b5 100644
--- a/tc/tc_qdisc.c
+++ b/tc/tc_qdisc.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -49,8 +50,7 @@ static int tc_qdisc_modify(int cmd, unsigned int flags, int 
argc, char **argv)
struct tc_sizespec  szopts;
__u16   *data;
} stab = {};
-   char  d[16] = {};
-   char  k[16] = {};
+   char  d[IFNAMSIZ] = {};
struct {
struct nlmsghdr n;
struct tcmsgt;
@@ -89,8 +89,8 @@ static int tc_qdisc_modify(int cmd, unsigned int flags, int 
argc, char **argv)
return -1;
}
req.t.tcm_parent = TC_H_CLSACT;
-   strncpy(k, "clsact", sizeof(k) - 1);
-   q = get_qdisc_kind(k);
+
+   q = get_qdisc_kind("clsact");
req.t.tcm_handle = TC_H_MAKE(TC_H_CLSACT, 0);
NEXT_ARG_FWD();
break;
@@ -100,8 +100,8 @@ static int tc_qdisc_modify(int cmd, unsigned int flags, int 
argc, char **argv)
return -1;
}
req.t.tcm_parent = TC_H_INGRESS;
-   strncpy(k, "ingress", sizeof(k) - 1);
-   q = get_qdisc_kind(k);
+
+   q = get_qdisc_kind("ingress");
req.t.tcm_handle = TC_H_MAKE(TC_H_INGRESS, 0);
NEXT_ARG_FWD();
break;
@@ -124,26 +124,23 @@ static int tc_qdisc_modify(int cmd, unsigned int flags, 
int argc, char **argv)
} else if (matches(*argv, "help") == 0) {
usage();
} else {
-   strncpy(k, *argv, sizeof(k)-1);
-
-   q = get_qdisc_kind(k);
+   q = get_qdisc_kind(*argv);
argc--; argv++;
break;
}
argc--; argv++;
}
 
-   if (k[0])
-   addattr_l(&req.n, sizeof(req), TCA_KIND, k, strlen(k)+1);
if (est.ewma_log)
addattr_l(&req.n, sizeof(req), TCA_RATE, &est, sizeof(est));
 
if (q) {
+   addattr_l(&req.n, sizeof(req), TCA_KIND, q->id, strlen(q->id) + 
1);
if (q->parse_qopt) {
if (q->parse_qopt(q, argc, argv, &req.n))
return 1;
} else if (argc) {
-   fprintf(stderr, "qdisc '%s' does not support option 
parsing\n", k);
+   fprintf(stderr, "qdisc '%s' does not support option 
parsing\n", q->id);
return -1;
}
} else {
-- 
2.11.0



Re: [PATCH iproute2/net-next v3]tc: B.W limits can now be specified in %.

2017-11-17 Thread Stephen Hemminger
On Sat, 18 Nov 2017 02:13:38 +0530
Nishanth Devarajan  wrote:

> diff --git a/tc/tc_util.h b/tc/tc_util.h
> index 583a21a..7b7420a 100644
> --- a/tc/tc_util.h
> +++ b/tc/tc_util.h
> @@ -24,14 +24,14 @@ struct qdisc_util {
>   struct  qdisc_util *next;
>   const char *id;
>   int (*parse_qopt)(struct qdisc_util *qu, int argc,
> -   char **argv, struct nlmsghdr *n);
> +   char **argv, struct nlmsghdr *n, char *dev);

One more nit...
Since parsing queue options should not modify the device name, that should
be const char *.


Greetings From Mrs. Sarah Smith

2017-11-17 Thread Mrs. Sarah Smith
Greetings From Mrs. Sarah Smith,

With Due Respect and Humanity, I was compelled to write to you under a
humanitarian ground. My names are Mrs.Sarah Smith , am 52 years old From
Switzerland; I am married to Late Mr. Hazard  Smith; but  we Living Benin
Republic, We were married for 25 years without a child. He died after a
Cardiac Arteries Operation. And Recently, My Doctor told me that I would
not last for the  next 3 months due to my cancer problem (Breast cancer).
Before my husband died last year  there is this sum $2,800,000.00 United
State Dollars that he deposited with a bank in  Benin and presently the
fund is still with the Bank.  Having known my condition I decided to
donate this fund to individual that will utilize  this fund the way I am
going to instruct herein. I want somebody that will use this fund
according to the desire of my late husband to help less privileged people,
orphanages,  widows. I took this decision because I don't have any child
that will inherit this fund,  and I don't want in a way were this fund
will be used in wrong way. If you wish to help me actualize this vision,
please indicate your readiness immediately you received this proposal.
Remain blessed you as you listing to the voice of reasoning.

Your beloved sister Mrs. Sarah Smith,


Re: [LTP] [RFC] [PATCH] netns: Fix race in virtual interface bringup

2017-11-17 Thread Dan Rue
Alexey, Li, thank you for your suggestions.

On Fri, Nov 17, 2017 at 03:08:20PM +0300, Alexey Kodanev wrote:
> On 11/17/2017 09:09 AM, Li Wang wrote:
> > Hi Dan,
> >
> > On Fri, Nov 10, 2017 at 4:38 AM, Dan Rue  wrote:
> >> Symptoms (+ command, error):
> >> netns_comm_ip_ipv6_ioctl:
> >> + ip netns exec tst_net_ns1 ping6 -q -c2 -I veth1 fd00::2
> >> connect: Cannot assign requested address
> >>
> >> netns_comm_ip_ipv6_netlink:
> >> + ip netns exec tst_net_ns0 ping6 -q -c2 -I veth0 fd00::3
> >> connect: Cannot assign requested address
> >>
> >> netns_comm_ns_exec_ipv6_ioctl:
> >> + ns_exec 6689 net ping6 -q -c2 -I veth0 fd00::3
> >> connect: Cannot assign requested address
> >>
> >> netns_comm_ns_exec_ipv6_netlin:
> >> + ns_exec 6891 net ping6 -q -c2 -I veth0 fd00::3
> >> connect: Cannot assign requested address
> >>
> >> The error is coming from ping6, which is trying to get an IP address for
> >> veth0 (due to -I veth0), but cannot. Waiting for two seconds fixes the
> >> test in my testcases. 1 second is not long enough.
> >>
> >> dmesg shows the following during the test:
> >>
> >> [Nov 7 15:39] LTP: starting netns_comm_ip_ipv6_ioctl (netns_comm.sh ip 
> >> ipv6 ioctl)
> >> [  +0.302401] IPv6: ADDRCONF(NETDEV_UP): veth0: link is not ready
> >> [  +0.048059] IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
> 
> It's quite strange that veth interface needs 2 seconds to become
> operational and it is up in less than 0.3s according to dmesg, but
> you said that it's not enough even 1 sec... Are you sure that IPv6
> address not in tentative state and dad process actually disabled?
> I'm asking because you don't have it disabled in the script:
> https://gist.github.com/danrue/7b76bbcbc23a6296030b7295650b69f3

Investigating further, the dmesg output is reporting on the status of
the link between veth0 and veth1, not the veth0 interface itself. That
is, the first dmesg message comes from "ip netns exec tst_net_ns0
ifconfig veth0 up" and the second comes from "ip netns exec tst_net_ns1
ifconfig veth1 up". This explains why we see .3s in dmesg but a 2 second
sleep being required. There is not actually anything in dmesg that is
helpful here.

Regarding dad (duplicate address detection), we have seen similar issues
on low power ARM64 boards and IPv4. Anyway, I tried disabling dad on the
interface and it did not make a difference.

> 
> >>
> >> Signed-off-by: Dan Rue 
> >> ---
> >>
> >> We've periodically hit this problem across many arm64 kernels and boards, 
> >> and
> >> it seems to be caused by "ping6" running before the virtual interface is
> >> actually ready. "sleep 2" works around the issue and proves that it is a 
> >> race
> >> condition, but I would prefer something faster and deterministic. Please
> >> suggest a better implementation.
> > Just FYI:
> >
> > I'm not good at network things, but one method I copied from ltp/numa
> > test is to split the '2s' into many smaller pieces of time.
> >
> > which something like:
> >
> > --- a/testcases/kernel/containers/netns/netns_helper.sh
> > +++ b/testcases/kernel/containers/netns/netns_helper.sh
> > @@ -240,6 +240,22 @@ netns_ip_setup()
> > tst_brkm TBROK "unable to add device veth1 to the
> > separate network namespace"
> >  }
> >
> > +wait_for_set_ip()
> > +{
> > +   local dev=$1
> > +   local retries=200
> > +
> > +   while [ $retries -gt 0 ]; do
> > +   dmesg -c | grep -q "IPv6: ADDRCONF(NETDEV_CHANGE):
> > $dev: link becomes ready"
> 
> 
> What about "grep -q up /sys/class/net/$dev/operstate && break"?

Since dmesg will not help, I explored /sys as proposed.

operstate shows "up", and ping6 still fails.
carrier shows "1" (up), and ping6 still fails.
dormant shows "0" (interface is not dormant), and ping6 still fails.
flags shows "0x1003" before and after a 2s sleep (they don't change)

So it seems there is nothing in dmesg, or /sys that can help here.

Dan

> 
> Thanks,
> Alexey
> 
> 
> > +   if [ $? -eq 0 ]; then
> > +   break
> > +   fi
> > +
> > +   retries=$((retries-1))
> > +   tst_sleep 10ms
> > +   done
> > +}
> > +
> >  ##
> >  # Enables virtual ethernet devices and assigns IP addresses for both
> >  # of them (IPv4/IPv6 variant is decided by netns_setup() function).
> > @@ -285,6 +301,9 @@ netns_set_ip()
> > tst_brkm TBROK "enabling veth1 device failed"
> > ;;
> > esac
> > +
> > +   wait_for_set_ip veth0
> > +   wait_for_set_ip veth1
> >  }
> >
> >  netns_ns_exec_cleanup()
> >
> >> Also, is it correct that "ifconfig veth0 up" returns before the interface 
> >> is
> >> actually ready?
> >>
> >> See also this isolated test script:
> >> https://gist.github.com/danrue/7b76bbcbc23a6296030b7295650b69f3
> >>
> >>  testcases/kernel/containers/netns/netns_helper.sh | 1 +
> >>  1 file changed, 1

[PATCH net] net: accept UFO datagrams from tuntap and packet

2017-11-17 Thread Willem de Bruijn
From: Willem de Bruijn 

Tuntap and similar devices can inject GSO packets. Accept type
VIRTIO_NET_HDR_GSO_UDP, even though not generating UFO natively.

Processes are expected to use feature negotiation such as TUNSETOFFLOAD
to detect supported offload types and refrain from injecting other
packets. This process breaks down with live migration: guest kernels
do not renegotiate flags, so destination hosts need to expose all
features that the source host does.

Partially revert the UFO removal from 182e0b6b5846~1..d9d30adf5677.
This patch introduces nearly(*) no new code to simplify verification.
It brings back verbatim tuntap UFO negotiation, VIRTIO_NET_HDR_GSO_UDP
insertion and software UFO segmentation.

It does not reinstate protocol stack support, hardware offload
(NETIF_F_UFO), SKB_GSO_UDP tunneling in SKB_GSO_SOFTWARE or reception
of VIRTIO_NET_HDR_GSO_UDP packets in tuntap.

To support SKB_GSO_UDP reappearing in the stack, also reinstate
logic in act_csum and openvswitch. Achieve equivalence with v4.13 HEAD
by squashing in commit 939912216fa8 ("net: skb_needs_check() removes
CHECKSUM_UNNECESSARY check for tx.") and reverting commit 8d63bee643f1
("net: avoid skb_warn_bad_offload false positives on UFO").

(*) To avoid having to bring back skb_shinfo(skb)->ip6_frag_id,
ipv6_proxy_select_ident is changed to return a __be32, which is
assigned directly to the frag_hdr. Also, SKB_GSO_UDP is inserted
at the end of the enum to minimize code churn.

Link: 
http://lkml.kernel.org/r/
Fixes: fb652fdfe837 ("macvlan/macvtap: Remove NETIF_F_UFO advertisement.")
Reported-by: Michal Kubecek 
Signed-off-by: Willem de Bruijn 
---
 drivers/net/tap.c   |  2 +-
 drivers/net/tun.c   |  4 ++
 include/linux/netdev_features.h |  4 +-
 include/linux/netdevice.h   |  1 +
 include/linux/skbuff.h  |  2 +
 include/linux/virtio_net.h  |  5 ++-
 include/net/ipv6.h  |  1 +
 net/core/dev.c  |  3 +-
 net/ipv4/af_inet.c  | 12 +-
 net/ipv4/udp_offload.c  | 49 ++--
 net/ipv6/output_core.c  | 31 +++
 net/ipv6/udp_offload.c  | 85 +++--
 net/openvswitch/datapath.c  | 14 +++
 net/openvswitch/flow.c  |  6 ++-
 net/sched/act_csum.c|  6 +++
 15 files changed, 211 insertions(+), 14 deletions(-)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index b13890953ebb..e9489b88407c 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -1077,7 +1077,7 @@ static long tap_ioctl(struct file *file, unsigned int cmd,
case TUNSETOFFLOAD:
/* let the user check for future flags */
if (arg & ~(TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 |
-   TUN_F_TSO_ECN))
+   TUN_F_TSO_ECN | TUN_F_UFO))
return -EINVAL;
 
rtnl_lock();
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 6bb1e604aadd..a33385d8ac65 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2369,6 +2369,10 @@ static int set_offload(struct tun_struct *tun, unsigned 
long arg)
features |= NETIF_F_TSO6;
arg &= ~(TUN_F_TSO4|TUN_F_TSO6);
}
+
+   if (arg & TUN_F_UFO) {
+   arg &= ~TUN_F_UFO;
+   }
}
 
/* This gives the user a way to test for new features in future by
diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index dc8b4896b77b..b1b0ca7ccb2b 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -54,8 +54,9 @@ enum {
NETIF_F_GSO_TUNNEL_REMCSUM_BIT, /* ... TUNNEL with TSO & REMCSUM */
NETIF_F_GSO_SCTP_BIT,   /* ... SCTP fragmentation */
NETIF_F_GSO_ESP_BIT,/* ... ESP with TSO */
+   NETIF_F_GSO_UDP_BIT,/* ... UFO, deprecated except tuntap */
/**/NETIF_F_GSO_LAST =  /* last bit, see GSO_MASK */
-   NETIF_F_GSO_ESP_BIT,
+   NETIF_F_GSO_UDP_BIT,
 
NETIF_F_FCOE_CRC_BIT,   /* FCoE CRC32 */
NETIF_F_SCTP_CRC_BIT,   /* SCTP checksum offload */
@@ -132,6 +133,7 @@ enum {
 #define NETIF_F_GSO_TUNNEL_REMCSUM __NETIF_F(GSO_TUNNEL_REMCSUM)
 #define NETIF_F_GSO_SCTP   __NETIF_F(GSO_SCTP)
 #define NETIF_F_GSO_ESP__NETIF_F(GSO_ESP)
+#define NETIF_F_GSO_UDP__NETIF_F(GSO_UDP)
 #define NETIF_F_HW_VLAN_STAG_FILTER __NETIF_F(HW_VLAN_STAG_FILTER)
 #define NETIF_F_HW_VLAN_STAG_RX__NETIF_F(HW_VLAN_STAG_RX)
 #define NETIF_F_HW_VLAN_STAG_TX__NETIF_F(HW_VLAN_STAG_TX)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 6b274bfe489f..ef789e1d679e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -4140,6 +4140,7 @@ static inline bool net_gso_ok(netdev_features_t features, 
int gso_ty

Re: regression: UFO removal breaks kvm live migration

2017-11-17 Thread Willem de Bruijn
On Fri, Nov 17, 2017 at 9:48 AM, Willem de Bruijn
 wrote:
>>> Okay, I will send a patch to reinstate UFO for this use case (only). There
>>> is some related work in tap_handle_frame and packet_direct_xmit to
>>> segment directly in the device. I will be traveling the next few days, so
>>> it won't be in time for 4.14 (but can go in stable later, of course).
>>
>> I'm finishing up and running some tests. The majority of the patch is a
>> straightforward partial revert of the patchset, so while fairly large for a
>> patch to net (~150 lines, esp. in udp[46]_ufo_fragment), that is all
>> thoroughly tested code. Notably absent are the protocol layer and
>> hardware support (NETIF_F_UFO) portions.
>>
>> The only open issue is whether to rely on existing skb_gso_segment
>> processing in the transmit path from validate_xmit_skb or to add new
>> skb_gso_segment calls directly to tun_get_user, tap_get_user and
>> pf_packet. Tun has to loop around four different ways of injecting
>> packets into the device. Something like the below snippet.
>>
>> More conservative is to introduce no completely new code and rely on
>> validate_xmit_skb, but that means having to protect the entire stack
>> against skbs with SKB_GSO_UDP, so also bringing back some
>> checksum and fragment handling snippets in gre_gso_segment,
>> __skb_udp_tunnel_segment, act_csum and openvswitch.
>
> Come to think of it, as this patch does not bring back NETIF_F_UFO
> support to NETIF_F_GSO_SOFTWARE, the tunnel cases can be
> excluded.
>
> Then this is probably the simpler and more obviously correct approach.

Sent: http://patchwork.ozlabs.org/patch/839168/


[iproute2 PATCH] man: tc-mqprio: add documentation for new offload options

2017-11-17 Thread Amritha Nambiar
This patch adds documentation for additional offload modes and
associated parameters in tc-mqprio.

Signed-off-by: Amritha Nambiar 
---
 man/man8/tc-mqprio.8 |   60 +-
 1 file changed, 49 insertions(+), 11 deletions(-)

diff --git a/man/man8/tc-mqprio.8 b/man/man8/tc-mqprio.8
index 0e1d305..a1bedd3 100644
--- a/man/man8/tc-mqprio.8
+++ b/man/man8/tc-mqprio.8
@@ -16,7 +16,17 @@ P0 P1 P2...
 count1@offset1 count2@offset2 ...
 .B ] [ hw
 1|0
-.B ]
+.B ] [ mode
+dcb|channel]
+.B ] [ shaper
+dcb|
+.B [ bw_rlimit
+.B min_rate
+min_rate1 min_rate2 ...
+.B max_rate
+max_rate1 max_rate2 ...
+.B ]]
+
 
 .SH DESCRIPTION
 The MQPRIO qdisc is a simple queuing discipline that allows mapping
@@ -36,14 +46,16 @@ and
 By default these parameters are configured by the hardware
 driver to match the hardware QOS structures.
 
-Enabled hardware can provide hardware QOS with the ability to steer
-traffic flows to designated traffic classes provided by this qdisc.
-Configuring the hardware based QOS mechanism is outside the scope of
-this qdisc. Tools such as
-.B lldpad
-and
-.B ethtool
-exist to provide this functionality. Also further qdiscs may be added
+.B Channel
+mode supports full offload of the mqprio options, the traffic classes, the 
queue
+configurations and QOS attributes to the hardware. Enabled hardware can provide
+hardware QOS with the ability to steer traffic flows to designated traffic
+classes provided by this qdisc. Hardware based QOS is configured using the
+.B shaper
+parameter.
+.B bw_rlimit
+with minimum and maximum bandwidth rates can be used for setting
+transmission rates on each traffic class. Also further qdiscs may be added
 to the classes of MQPRIO to create more complex configurations.
 
 .SH ALGORITHM
@@ -104,9 +116,35 @@ contiguous range of queues.
 hw
 Set to
 .B 1
-to use hardware QOS defaults. Set to
+to support hardware offload. Set to
 .B 0
-to override hardware defaults with user specified values.
+to configure user specified values in software only.
+
+.TP
+mode
+Set to
+.B channel
+for full use of the mqprio options. Use
+.B dcb
+to offload only TC values and use hardware QOS defaults. Supported with 'hw'
+set to 1 only.
+
+.TP
+shaper
+Use
+.B bw_rlimit
+to set bandwidth rate limits for a traffic class. Use
+.B dcb
+for hardware QOS defaults. Supported with 'hw' set to 1 only.
+
+.TP
+min_rate
+Minimum value of bandwidth rate limit for a traffic class.
+
+.TP
+max_rate
+Maximum value of bandwidth rate limit for a traffic class.
+
 
 .SH AUTHORS
 John Fastabend, 



[iproute2 PATCH] man: tc-flower: add explanation for hw_tc option

2017-11-17 Thread Amritha Nambiar
Add details explaining the hw_tc option.

Signed-off-by: Amritha Nambiar 
---
 man/man8/tc-flower.8 |9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/man/man8/tc-flower.8 b/man/man8/tc-flower.8
index be46f02..fd9098e 100644
--- a/man/man8/tc-flower.8
+++ b/man/man8/tc-flower.8
@@ -10,7 +10,10 @@ flower \- flow based traffic control filter
 .B action
 .IR ACTION_SPEC " ] [ "
 .B classid
-.IR CLASSID " ]"
+.IR CLASSID " ] [ "
+.B hw_tc
+.IR TCID " ]"
+
 
 .ti -8
 .IR MATCH_LIST " := [ " MATCH_LIST " ] " MATCH
@@ -77,6 +80,10 @@ is in the form
 .BR X : Y ", while " X " and " Y
 are interpreted as numbers in hexadecimal format.
 .TP
+.BI hw_tc " TCID"
+Specify a hardware traffic class to pass matching packets on to. TCID is in the
+range 0 through 15.
+.TP
 .BI indev " ifname"
 Match on incoming interface name. Obviously this makes sense only for forwarded
 flows.



[PATCH 1/8] mm: kmemleak: remove unused hardirq.h

2017-11-17 Thread Yang Shi
Preempt counter APIs have been split out, currently, hardirq.h just
includes irq_enter/exit APIs which are not used by kmemleak at all.

So, remove the unused hardirq.h.

Signed-off-by: Yang Shi 
Cc: Michal Hocko 
Cc: Andrew Morton 
Cc: Matthew Wilcox 
---
 mm/kmemleak.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 7780cd8..25b977f 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -91,7 +91,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
-- 
1.8.3.1



[PATCH 6/8] net: caif: remove unused hardirq.h

2017-11-17 Thread Yang Shi
Preempt counter APIs have been split out, currently, hardirq.h just
includes irq_enter/exit APIs which are not used by caif at all.

So, remove the unused hardirq.h.

Signed-off-by: Yang Shi 
Cc: Dmitry Tarnyagin 
Cc: "David S. Miller" 
---
 net/caif/cfpkt_skbuff.c | 1 -
 net/caif/chnl_net.c | 1 -
 2 files changed, 2 deletions(-)

diff --git a/net/caif/cfpkt_skbuff.c b/net/caif/cfpkt_skbuff.c
index 71b6ab2..38c2b7a 100644
--- a/net/caif/cfpkt_skbuff.c
+++ b/net/caif/cfpkt_skbuff.c
@@ -8,7 +8,6 @@
 
 #include 
 #include 
-#include 
 #include 
 #include 
 
diff --git a/net/caif/chnl_net.c b/net/caif/chnl_net.c
index 922ac1d..53ecda1 100644
--- a/net/caif/chnl_net.c
+++ b/net/caif/chnl_net.c
@@ -8,7 +8,6 @@
 #define pr_fmt(fmt) KBUILD_MODNAME ":%s(): " fmt, __func__
 
 #include 
-#include 
 #include 
 #include 
 #include 
-- 
1.8.3.1



[PATCH 8/8] net: tipc: remove unused hardirq.h

2017-11-17 Thread Yang Shi
Preempt counter APIs have been split out, currently, hardirq.h just
includes irq_enter/exit APIs which are not used by TIPC at all.

So, remove the unused hardirq.h.

Signed-off-by: Yang Shi 
Cc: Jon Maloy 
Cc: Ying Xue 
Cc: "David S. Miller" 
---
 net/tipc/core.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/net/tipc/core.h b/net/tipc/core.h
index 5cc5398..099e072 100644
--- a/net/tipc/core.h
+++ b/net/tipc/core.h
@@ -49,7 +49,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
-- 
1.8.3.1



[PATCH 4/8] vfs: remove unused hardirq.h

2017-11-17 Thread Yang Shi
Preempt counter APIs have been split out, currently, hardirq.h just
includes irq_enter/exit APIs which are not used by vfs at all.

So, remove the unused hardirq.h.

Signed-off-by: Yang Shi 
Cc: Alexander Viro 
---
 fs/dcache.c | 1 -
 fs/file_table.c | 1 -
 2 files changed, 2 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index f901413..9340e8c 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -32,7 +32,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
diff --git a/fs/file_table.c b/fs/file_table.c
index 61517f5..dab099e 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -23,7 +23,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
-- 
1.8.3.1



[PATCH 3/8] fs: btrfs: remove unused hardirq.h

2017-11-17 Thread Yang Shi
Preempt counter APIs have been split out, currently, hardirq.h just
includes irq_enter/exit APIs which are not used by btrfs at all.

So, remove the unused hardirq.h.

Signed-off-by: Yang Shi 
Cc: Chris Mason 
Cc: Josef Bacik 
Cc: David Sterba 
Cc: linux-bt...@vger.kernel.org
---
 fs/btrfs/extent_map.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 2e348fb..cced7f1 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -2,7 +2,6 @@
 #include 
 #include 
 #include 
-#include 
 #include "ctree.h"
 #include "extent_map.h"
 #include "compression.h"
-- 
1.8.3.1



[PATCH 7/8] net: ovs: remove unused hardirq.h

2017-11-17 Thread Yang Shi
Preempt counter APIs have been split out, currently, hardirq.h just
includes irq_enter/exit APIs which are not used by openvswitch at all.

So, remove the unused hardirq.h.

Signed-off-by: Yang Shi 
Cc: Pravin Shelar 
Cc: "David S. Miller" 
Cc: d...@openvswitch.org
---
 net/openvswitch/vport-internal_dev.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/net/openvswitch/vport-internal_dev.c 
b/net/openvswitch/vport-internal_dev.c
index 04a3128..2f47c65 100644
--- a/net/openvswitch/vport-internal_dev.c
+++ b/net/openvswitch/vport-internal_dev.c
@@ -16,7 +16,6 @@
  * 02110-1301, USA
  */
 
-#include 
 #include 
 #include 
 #include 
-- 
1.8.3.1



[PATCH 2/8] fs: pstore: remove unused hardirq.h

2017-11-17 Thread Yang Shi
Preempt counter APIs have been split out, currently, hardirq.h just
includes irq_enter/exit APIs which are not used by pstore at all.

So, remove the unused hardirq.h.

Signed-off-by: Yang Shi 
Cc: Kees Cook 
Cc: Anton Vorontsov 
Cc: Colin Cross 
Cc: Tony Luck 
---
 fs/pstore/platform.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/pstore/platform.c b/fs/pstore/platform.c
index 2b21d18..25dcef4 100644
--- a/fs/pstore/platform.c
+++ b/fs/pstore/platform.c
@@ -41,7 +41,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 
-- 
1.8.3.1



[PATCH 5/8] crypto: remove unused hardirq.h

2017-11-17 Thread Yang Shi
Preempt counter APIs have been split out, currently, hardirq.h just
includes irq_enter/exit APIs which are not used by crypto at all.

So, remove the unused hardirq.h.

Signed-off-by: Yang Shi 
Cc: Herbert Xu 
Cc: "David S. Miller" 
Cc: linux-cry...@vger.kernel.org
---
 crypto/ablk_helper.c | 1 -
 crypto/blkcipher.c   | 1 -
 crypto/mcryptd.c | 1 -
 3 files changed, 3 deletions(-)

diff --git a/crypto/ablk_helper.c b/crypto/ablk_helper.c
index 1441f07..ee52660 100644
--- a/crypto/ablk_helper.c
+++ b/crypto/ablk_helper.c
@@ -28,7 +28,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
diff --git a/crypto/blkcipher.c b/crypto/blkcipher.c
index 6c43a0a..01c0d4a 100644
--- a/crypto/blkcipher.c
+++ b/crypto/blkcipher.c
@@ -18,7 +18,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
diff --git a/crypto/mcryptd.c b/crypto/mcryptd.c
index 4e64726..9fa362c 100644
--- a/crypto/mcryptd.c
+++ b/crypto/mcryptd.c
@@ -26,7 +26,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #define MCRYPTD_MAX_CPU_QLEN 100
 #define MCRYPTD_BATCH 9
-- 
1.8.3.1



Re: [ftrace-bpf 1/5] add BPF_PROG_TYPE_FTRACE to bpf

2017-11-17 Thread Alexei Starovoitov
On Mon, Nov 13, 2017 at 12:06:17AM -0800, peng yu wrote:
> > 1. anything bpf related has to go via net-next tree.
> I found there is a net-next git repo:
> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git
> I will use this repo for the further bpf-ftrace patch set.
> 
> > 2.
> > this obviously breaks ABI. New types can only be added to the end.
> Sure, I will add the new type at the end.
> 
> > 3.
> > this won't even compile, since ftrace_regs is only added in the patch 4.
> It could compile, as the ftrace_regs related code is inside the
> "#ifdef FTRACE_BPF_FILTER" macro, if this macro is not defined, no
> ftrace_regs related code would be compiled.
> 
> 
> > Since bpf program will see ftrace_regs as an input it becomes
> > abi, so has to be defined in uapi/linux/bpf_ftrace.h or similar.
> > We need to think through how to make it generic across archs
> > instead of defining ftrace_regs for each arch.
> I'm not sure whether I'm fully understand your meaning. Like kprobe,
> the ftrace-bpf need to get a function's parameters and check them. So
> it won't be abi stable, and it should depend on architecture
> implement. I can create a header file like uapi/linux/bpf_ftrace.h,
> but I noticed that kprobe doesn't have such a header file, if I'm
> wrong, please let me know. And about make it generic across archs, I
> know kprobe use pt_regs as parameter, the pt_regs is defined on each
> arch, so I can't see how bpf-ftrace can get a generic interface across
> archs if it need to check function's parameters. If I misunderstand
> anything, please let me know.

all of ftrace are called at function entry and calling convention
is fixed per architecture, so we can make a generic and stable
struct bpf_ftrace_args {
  __u64 arg1, arg2, .. arg5;
};
save_mcount_regs doesn't care what order the regs are stored
so the same stack space can be used to keep bpf_ftrace_args
and used in restore_mcount_regs.
I'd also make it depend on DYNAMIC_FTRACE_WITH_REGS to avoid
dealing with obscure corner cases.

> 
> > 4.
> > the patch 2/3 takes an approach of passing FD integer value in text form
> > to the kernel. That approach was discussed years ago and rejected.
> > It has to use binary interface like perf_event + ioctl.
> > See RFC patches where we're extending perf_event_open syscall to
> > support binary access to kprobe/uprobe.
> > imo binary interface to ftrace is pre-requisite to ftrace+bpf work.
> > We've had too many issues with text based kprobe api to repeat
> > the same mistake here.
> I notice the kprobe-bpf prog is set through the PERF_EVENT_IOC_SET_BPF
> ioctl, I may try to see whether I can reuse this interface, or if it
> is not suitable, I will try to define a new binary interface.
> 
> > 5.
> > patch 4 hacks save_mcount_regs asm to pass ctx pointer in %rcx
> > whereas it's only used in ftrace_graph_caller which doesn't seem right.
> > It points out to another issue that such ftrace+bpf integration
> > is only done for ftrace_graph_caller without extensibility in mind.
> > If we do ftrace+bpf I'd rather see generic framework that applies
> > to all of ftrace instead of single feature of it.
> It is a hard problem. The ftrace framework has lots of tracers,
> function tracer and function graph tracer use the 'gcc -pg' directly,
> other tracers use tracepoint, I should spend more time to find a
> suitable solution.

since all of ftrace goes through the same function entry point
it should be possible to have one generic bpf filter interface
suitable for all tracers that ftrace supports.



  1   2   >