Re: [PATCH] net: dsa: ksz: Increase the tag alignment

2018-12-08 Thread Florian Fainelli
Le 12/7/18 à 8:25 PM, Marek Vasut a écrit :
> On 12/08/2018 01:52 AM, tristram...@microchip.com wrote:
>>> -   padlen = (skb->len >= ETH_ZLEN) ? 0 : ETH_ZLEN - skb->len;
>>> +   padlen = (skb->len >= VLAN_ETH_ZLEN) ? 0 : VLAN_ETH_ZLEN - skb-
 len;
>>
>> The requirement is the tail tag should be at the end of frame before FCS.
>> When the length is less than 60 the MAC controller will pad the frame to
>> legal size.  That is why this function makes sure the padding is done
>> manually.  Increasing the size just increases the length of the frame and the
>> chance to allocate new socket buffer.
>>
>> Example of using ping size of 18 will have the sizes of request and response
>> differ by 4 bytes.  Not that it matters much.
>>
>> Are you concerned the MAC controller will remove the VLAN tag and so the 
>> frame
>> will not be sent? Or the switch removes the VLAN tag and is not able to send?
> 
> With TI CPSW in dual-ethernet configuration, which adds internal VLAN
> tag at the end of the frame, the KSZ switch fails. The CPU will send out
> packets and the switch will reject them as corrupted. It needs this
> extra VLAN tag padding.

Oh so they add the internal VLAN at the end of the frame, not the
beginning? That is quite surprising but that would not be the one single
oddity found with CPSW I am afraid.. The way I would approach this is
with layering where the padding needs to occur:

- within the tag driver you need to make sure there is enough room to
insert the KSZ tag

- within the ethernet MAC driver (which comes last in the transmit
path), you need to make sure there is enough room to insert that trailer
VLAN tag to permit internal transmission
-- 
Florian


Re: [PATCH v2 net-next 0/2] platform data controls for mdio-gpio

2018-12-08 Thread Florian Fainelli
Le 12/8/18 à 7:12 AM, Andrew Lunn a écrit :
> Soon to be mainlined is an x86 platform with a Marvell switch, and a
> bit-banging MDIO bus. In order to make this work, the phy_mask of the
> MDIO bus needs to be set to prevent scanning for PHYs, and the
> phy_ignore_ta_mask needs to be set because the switch has broken
> turnaround.
> 
> Add a platform_data structure with these parameters.

Looks good thanks Andrew. I do wonder if we could define a common
mii_bus_platform_data structure eventually which is comprised of these
two members (if nothing else) and maybe update the common
mdiobus_register() code path to look for these members. If a subsequent
platform data/device MDIO bus shows up we could do that at that time.

Thanks!

> 
> Andrew Lunn (2):
>   net: phy: mdio-gpio: Add platform_data support for phy_mask
>   net: phy: mdio-gpio: Add phy_ignore_ta_mask to platform data
> 
>  MAINTAINERS |  1 +
>  drivers/net/phy/mdio-gpio.c |  7 +++
>  include/linux/platform_data/mdio-gpio.h | 14 ++
>  3 files changed, 22 insertions(+)
>  create mode 100644 include/linux/platform_data/mdio-gpio.h
> 


-- 
Florian


Re: [PATCH v2 net-next 2/2] net: phy: mdio-gpio: Add phy_ignore_ta_mask to platform data

2018-12-08 Thread Florian Fainelli
Le 12/8/18 à 7:12 AM, Andrew Lunn a écrit :
> The Marvell 6390 Ethernet switch family does not perform MDIO
> turnaround correctly. Many hardware MDIO bus masters don't care about
> this, but the bitbangging implementation in Linux does by default. Add
> phy_ignore_ta_mask to the platform data so that the bitbangging code
> can be told which devices are known to get TA wrong.
> 
> v2
> --
> int -> u32 in platform data structure
> 
> Signed-off-by: Andrew Lunn 

Reviewed-by: Florian Fainelli 
-- 
Florian


Re: [PATCH v2 net-next 1/2] net: phy: mdio-gpio: Add platform_data support for phy_mask

2018-12-08 Thread Florian Fainelli
Le 12/8/18 à 7:12 AM, Andrew Lunn a écrit :
> It is sometimes necessary to instantiate a bit-banging MDIO bus as a
> platform device, without the aid of device tree.
> 
> When device tree is being used, the bus is not scanned for devices,
> only those devices which are in device tree are probed. Without device
> tree, by default, all addresses on the bus are scanned. This may then
> find a device which is not a PHY, e.g. a switch. And the switch may
> have registers containing values which look like a PHY. So during the
> scan, a PHY device is wrongly created.
> 
> After the bus has been registered, a search is made for
> mdio_board_info structures which indicates devices on the bus, and the
> driver which should be used for them. This is typically used to
> instantiate Ethernet switches from platform drivers.  However, if the
> scanning of the bus has created a PHY device at the same location as
> indicated into the board info for a switch, the switch device is not
> created, since the address is already busy.
> 
> This can be avoided by setting the phy_mask of the mdio bus. This mask
> prevents addresses on the bus being scanned.
> 
> v2
> --
> int -> u32 in platform data structure
> 
> Signed-off-by: Andrew Lunn 

Reviewed-by: Florian Fainelli 
-- 
Florian


Re: [PATCH] gianfar: Add gfar_change_carrier()

2018-12-07 Thread Florian Fainelli
On 12/7/18 9:26 AM, Andrew Lunn wrote:
>> Would you be happier if .ndo_change_carrier() only acted on Fixed PHYs?
> 
> I think it makes sense to allow a fixed phy carrier to be changed from
> user space. However, i don't think you can easily plumb that to
> .ndo_change_carrier(), since that is a MAC feature. You need to change
> the fixed_phy_status to indicate the PHY has lost link, and then let
> the usual mechanisms tell the MAC it is down and change the carrier
> status.

Joakim, I still don't understand what did not work with:

- adding a ndo_change_carrier() interface which keeps a boolean flag
whether the link was up or not

- register a fixed_link_update callback for your fixed PHY, which just
propagates that flag back to the fixed PHY

and that should take care of having the carrier go down, as driven by
the PHY state machine, for that fixed device.
-- 
Florian


Re: [PATCH] net: dsa: ksz: Fix port membership

2018-12-07 Thread Florian Fainelli
On 12/7/18 11:37 AM, tristram...@microchip.com wrote:
>> If two ports are in the same bridge and in forwarding state, the packets
>> must be able to pass between them in both directions. The current code
>> only configures this bridge membership for a port newly added to the
>> bridge, but does not update all the other ports. Thus, ingress packets
>> on the new port will be forwarded, but ingress packets on other ports
>> destined for the new port (eg. a reply) will not be forwarded back to
>> the new port, because they are not configured to do so. This patch fixes
>> that by updating the membership registers of all ports.
>>
>> Signed-off-by: Marek Vasut 
>> Cc: Vivien Didelot 
>> Cc: Woojung Huh 
>> Cc: David S. Miller 
>> Cc: Tristram Ha 
>> ---
>>  drivers/net/dsa/microchip/ksz9477.c | 6 +++---
>>  1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/net/dsa/microchip/ksz9477.c
>> b/drivers/net/dsa/microchip/ksz9477.c
>> index 0684657fbf9a9..e24dd14ccde77 100644
>> --- a/drivers/net/dsa/microchip/ksz9477.c
>> +++ b/drivers/net/dsa/microchip/ksz9477.c
>> @@ -396,7 +396,7 @@ static void ksz9477_port_stp_state_set(struct
>> dsa_switch *ds, int port,
>>  struct ksz_device *dev = ds->priv;
>>  struct ksz_port *p = >ports[port];
>>  u8 data;
>> -int member = -1;
>> +int i, member = -1;
>>
>>  ksz_pread8(dev, port, P_STP_CTRL, );
>>  data &= ~(PORT_TX_ENABLE | PORT_RX_ENABLE |
>> PORT_LEARN_DISABLE);
>> @@ -454,8 +454,8 @@ static void ksz9477_port_stp_state_set(struct
>> dsa_switch *ds, int port,
>>  dev->tx_ports &= ~(1 << port);
>>
>>  /* Port membership may share register with STP state. */
>> -if (member >= 0 && member != p->member)
>> -ksz9477_cfg_port_member(dev, port, (u8)member);
>> +for (i = 0; i < SWITCH_PORT_NUM; i++)
>> +ksz9477_cfg_port_member(dev, i, (u8)member);
>>
>>  /* Check if forwarding needs to be updated. */
>>  if (state != BR_STATE_FORWARDING) {
> 
> The original DSA model did not have a way to tell the bridge device not to
> forward the frame, so the switch driver always setup the membership to
> disable forwarding between ports.
> 
> When lan devices are setup they act like individual devices.  A bridge device
> adding them under it will forward the frames.
> 
> The new switchdev model adds the offload_fwd_mark bit to tell the bridge not 
> to
> forward frame.
> 
> The ksz_update_port_member function in ksz_common.c is doing this membership
> setup for all forwarding ports.  It was finally enabled in one of the RFC 
> patches I
> submitted recently (Add switch forward offloading support).
> 
> I think if you do this without setting offload_fwd_mark you will receive 
> duplicate
> frame.
> 

Either I am misreading Marek's patch, or I don't quite understand your
response, but what is happening when you enslave a switch port into a
bridge is that you need to make sure that:

- the switch port being enslaved will be part of the same forwarding
group as any other switch port already in the bridge
- any existing switch port already enslaved in the bridge must now also
be allowed to forward to the port that is being enslaved

That is to me, exactly what Marek's patch is fixing, your response is
about something slightly orthogonal here.
-- 
Florian


Re: [PATCH net-next 2/2] net: dsa: Set the master device's MTU to account for DSA overheads

2018-12-06 Thread Florian Fainelli
On 12/6/18 2:36 AM, Andrew Lunn wrote:
> DSA tagging of frames sent over the master interface to the switch
> increases the size of the frame. Such frames can then be bigger than
> the normal MTU of the master interface, and it may drop them. Use the
> overhead information from the tagger to set the MTU of the master
> device to include this overhead.
> 
> Signed-off-by: Andrew Lunn 
> ---
>  net/dsa/master.c | 16 
>  1 file changed, 16 insertions(+)
> 
> diff --git a/net/dsa/master.c b/net/dsa/master.c
> index c90ee3227dea..42f525bc68e2 100644
> --- a/net/dsa/master.c
> +++ b/net/dsa/master.c
> @@ -158,8 +158,24 @@ static void dsa_master_ethtool_teardown(struct 
> net_device *dev)
>   cpu_dp->orig_ethtool_ops = NULL;
>  }
>  
> +void dsa_master_set_mtu(struct net_device *dev, struct dsa_port *cpu_dp)
> +{
> + unsigned int mtu = ETH_DATA_LEN + cpu_dp->tag_ops->overhead;
> + int err;
> +
> + rtnl_lock();
> + if (mtu <= dev->max_mtu) {
> + err = dev_set_mtu(dev, mtu);
> + if (err)
> + netdev_dbg(dev, "Unable to set MTU to include for DSA 
> overheads\n");
> + }

Would it make sense to warn the user that there might be
transmit/receive issues with the DSA tagging protocol if either
dev_set_mtu() fails or mtu > dev->max_mtu?

> + rtnl_unlock();
> +}
> +
>  int dsa_master_setup(struct net_device *dev, struct dsa_port *cpu_dp)
>  {
> + dsa_master_set_mtu(dev,  cpu_dp);

Not that I think it matters too much to people because unbinding the
switch driver and expecting the CPU port to continue operating is
wishful thinking, but we should probably unwind that operation in
dsa_master_teardown(), right?

> +
>   /* If we use a tagging format that doesn't have an ethertype
>* field, make sure that all packets from this point on get
>* sent to the tag format's receive function.
> 


-- 
Florian


Re: [PATCH net-next 0/2] platform data controls for mdio-gpio

2018-12-06 Thread Florian Fainelli
Hi Andrew,

On 12/6/18 5:58 AM, Andrew Lunn wrote:
> Soon to be mainlined is an x86 platform with a Marvell switch, and a
> bit-banging MDIO bus. In order to make this work, the phy_mask of the
> MDIO bus needs to be set to prevent scanning for PHYs, and the
> phy_ignore_ta_mask needs to be set because the switch has broken
> turnaround.

This looks good, I would just make one/two changes which is to match the
internal phy_mask and phy_ignore_ta_mask types from the struct mii_bus
and use u32 instead of int.

> 
> Add a platform_data structure with these parameters.
> 
> Andrew Lunn (2):
>   net: phy: mdio-gpio: Add platform_data support for phy_mask
>   net: phy: mdio-gpio: Add phy_ignore_ta_mask to platform data
> 
>  MAINTAINERS |  1 +
>  drivers/net/phy/mdio-gpio.c |  7 +++
>  include/linux/platform_data/mdio-gpio.h | 14 ++
>  3 files changed, 22 insertions(+)
>  create mode 100644 include/linux/platform_data/mdio-gpio.h
> 

-- 
Florian


Re: DSA support for Marvell 88e6065 switch

2018-12-05 Thread Florian Fainelli
On 12/5/18 12:46 PM, Pavel Machek wrote:
> Hi!
> 
> But I'm running into problems with tagging code, and I guess I'd like
> some help understanding.
>
> tag_trailer: allocates new skb, then copies data around.
>
> tag_qca: does dev->stats.tx_packets++, and reuses existing skb.
>
> tag_brcm: reuses existing skb.
>>>
>>> Any idea why tag trailer allocates new skb,
>>
>> I wrote this code over 10 years ago, so I don't remember all that
>> well, but I think that it is because you have to do manual checksumming
>> of the packet, as there's no way to pass down the stack that you don't
>> want to checksum all the way down to the end of the data area (and you
>> don't want the tag to be included in the checksum), and so you want to
>> do that before you add the trailer tag, and you'll probably have to
>> reallocate the data area to be able to add the tag, and you probably
>> won't get an exclusive skb here anyway, so you might as well allocate
>> a new one.
> 
> Ok, thanks. Whether tag is checksummed or not is indeed an interesting 
> question.
> 
>>> and what is going on with dev->stats.tx_packets++?
>>
>> trailer_xmit would be the hard_start_xmit function for the virtual
>> (slave) network interface, so this would be the right thing to do?
> 
> Some tag_*.c do this and some do not, so I'm still confused.

This is likely just historical, or we don't have an easy way to
determine whether the SKB transmitted succeeded or not because it was
reallocated, and therefore the tagging code must make sure statistics
are maintained. I suspect it is just historical and that tag module just
has not been updated to leverage the statistics accounting done a layer
above it within DSA.

> 
>>> 6065 indeed has some kind of "egress tagging mode" (with four
>>> options), but I have trouble understanding what it really does.
>>
>> What are the options?
> 
> "transmit unmodified", "transmit untagged", "transmit tagged" and
> "transmit all frames double tagged".

It seems to me that this applies to VLAN headers, not DSA/trailer
headers though not having the data sheet I should be flat out wrong.
-- 
Florian


Re: [PATCH RFC 5/6] net: dsa: microchip: Update tag_ksz.c to access switch driver

2018-12-05 Thread Florian Fainelli
On 12/5/18 10:18 AM, Andrew Lunn wrote:
> On Wed, Dec 05, 2018 at 07:00:38PM +0100, Andrew Lunn wrote:
>> On Mon, Dec 03, 2018 at 03:34:56PM -0800, tristram...@microchip.com wrote:
>>> From: Tristram Ha 
>>>
>>> Update tag_ksz.c to access switch driver's tail tagging operations.
>>
>> Hi Tristram
>>
>> Humm, i'm not sure we want this, the tagging spit into two places.  I
>> need to take a closer look at the previous patch, to see why it cannot
>> be done here.
> 
> O.K, i think i get what is going on.
> 
> I would however implement it differently.
> 
> One net/dsa/tag_X.c file can export two dsa_device_ops structures,
> allowing you to share common code for the two taggers. You could call
> these DSA_TAG_PROTO_KSZ_1_BYTE, and DSA_TAG_PROTO_KSZ_2_BYTE, and the
> .get_tag_protocol call would then return the correct one for the
> switch.

Agreed, that is what is done by net/dsa/tag_brcm.c because there are two
formats for the Broadcom tag:

- TAG_BRCM: the 4-bytes Broadcom tag is between MAC SA and Ethertype
- TAG_BRCM_PREPEND: the 4-bytes Broadcom tag is before the MAC DA

And the code to process them is basically using relative offsets from
the start of the frame to access correct data.

This is done largely for performance reasons because we have 1/2
Gigabit/secs capable CPU ports and so we want to avoid as little cache
trashing as possible and immediately get the right rcv() function to
process the packets.

> 
> It might also be possible to merge in tag_trailer, or at least share
> some code.
> 
> What i don't yet understand is how you are passing PTP information
> around. The commit messages need to explain that, since it is not
> obvious, and it is the first time we have needed PTP info in a tag
> driver.
> 
>   Andrew
> 


-- 
Florian


Re: [PATCH net-next v4 0/2] Add mii_bus to ixgbe driver for dsa devs

2018-12-03 Thread Florian Fainelli
On 12/3/18 3:42 PM, Steve Douthit wrote:
>> Not directly related to this patch series, are you using the legacy or
>> "new" way of passing platform data in order to register the DSA switch?
>> Since you mentioned 6390, I would assume this must be the "new" way of
>> registering DSA devices with mdio_boardinfo etc. In that case, have you
>> found yourself limited by the current dsa_chip_platform_data?
> 
> With the new method I had an off-by-one error where the 'enabled_ports'
> bitmask and 'port_names' array didn't match up in my first attempt.
> That's definitely my fault, but the loop that searches for the "cpu"
> string didn't like it and segfaulted.
> 
> I've got something of a chicken-and-the-egg problem between where I'm
> deferring while waiting for netdevs to show up, and registering the
> mdio_board_info that needs those same pointers.  For testing I exported
> mdiobus_setup_mdiodev_from_board_info() and mdiobus_create_device() so I
> could call the setup function again when the data was actually ready.

Yes the current solution whereby we need to get a hold on the network
device's struct device reference is not quite ideal, AFAIR, Andrew has
had the same problem.

> 
> It'd be nice to have more than one "cpu" port on a switch and have some
> way to associate downstream and "cpu" ports.  Not sure exactly what that
> would look like, and it's not something I'm going to try and tackle at
> the moment, but it's one for the wish list.

Yes, we have been discussing that topic with Andrew and have a few ideas
on how that could be achieved, but not code to use that at the moment.
One of the idea was to finally allow enslaving the DSA master network
device, that way you could assign specific ports to a specific CPU/DSA
master network port within the switch, though that requires setting up a
bridge. Would that work for your use case?
--
Florian


Re: [PATCH net-next v4 0/2] Add mii_bus to ixgbe driver for dsa devs

2018-12-03 Thread Florian Fainelli
On 12/3/18 12:14 PM, Steve Douthit wrote:
> Changes from v3 -> v4
> * Remove unecessary pointer casts
> * Fix copy/paste issues in comments
> * Simplify setting of swfw semaphore flags
> * Collect Reviewed-by: tags
> 
> Changes from v2 -> v3
> * Added warnings about interactions between this code and PHY polling
>   unit to the commit messages.
> 
> Changes from v1 -> v2
> [PATCH 1/2] ixgbe: register a mdiobus
> * Add intel-wired-...@lists.osuosl.org to CC list, see
> * select MII in Kconfig (thanks to the kbuild bot)
> * Only call mdiobus_regsiter for single x500em_a device
> * Use readx_poll_timeout() in ixgbe_msca_cmd()
> * Register different bus->read/write callbacks in ixgbe_mii_bus_init()
>   so there's device id check on every access
> * Use device pci_name() in bus->id instead of parent bridge's name
> 
> [PATCH 2/2] ixgbe: use mii_bus to handle MII related ioctls
> * Only use mdiobus_read/write for adapters that registered a mdiobus

Not directly related to this patch series, are you using the legacy or
"new" way of passing platform data in order to register the DSA switch?
Since you mentioned 6390, I would assume this must be the "new" way of
registering DSA devices with mdio_boardinfo etc. In that case, have you
found yourself limited by the current dsa_chip_platform_data?

> 
> Stephen Douthit (2):
>   ixgbe: register a mdiobus
>   ixgbe: use mii_bus to handle MII related ioctls
> 
>  drivers/net/ethernet/intel/Kconfig|   1 +
>  drivers/net/ethernet/intel/ixgbe/ixgbe.h  |   2 +
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  23 ++
>  drivers/net/ethernet/intel/ixgbe/ixgbe_phy.c  | 307 ++
>  drivers/net/ethernet/intel/ixgbe/ixgbe_phy.h  |   2 +
>  5 files changed, 335 insertions(+)
> 


-- 
Florian


Re: [PATCH net-next v3 1/2] ixgbe: register a mdiobus

2018-12-03 Thread Florian Fainelli
On 12/3/18 11:44 AM, Steve Douthit wrote:
> On 12/3/18 2:07 PM, Florian Fainelli wrote:
>> On 12/3/18 10:55 AM, Steve Douthit wrote:
>>> Most dsa devices expect a 'struct mii_bus' pointer to talk to switches
>>> via the MII interface.
>>>
>>> While this works for dsa devices, it will not work safely with Linux
>>> PHYs in all configurations since the firmware of the ixgbe device may
>>> be polling some PHY addresses in the background.
>>>
>>> Signed-off-by: Stephen Douthit 
>>> ---
>>
>> [snip]
>>
>>> +/**
>>> + *  ixgbe_mii_bus_write - Write a clause 22/45 register
>>> + *  @hw: pointer to hardware structure
>>> + *  @addr: address
>>> + *  @regnum: register number
>>> + *  @regnum: valueto write
>>
>> This should be @val to match the function parameters
> 
> OK
> 
>>> + **/
>>> +static s32 ixgbe_mii_bus_write(struct mii_bus *bus, int addr, int regnum,
>>> +  u16 val)
>>> +{
>>> +   struct ixgbe_adapter *adapter = (struct ixgbe_adapter *)bus->priv;
>>
>> Nitpick: cast is not necessary since this is a void * pointer (for that
>> reason).
> 
> OK, I'll clean up this and other unnecessary casts.
> 
> I forgot Andrew's suggestion to squash the swfw semaphore masks from:
> 
> + u32 gssr = hw->phy.phy_semaphore_mask | IXGBE_GSSR_TOKEN_SM;
> +
> + if (hw->bus.lan_id)
> + gssr |= IXGBE_GSSR_PHY1_SM;
> + else
> + gssr |= IXGBE_GSSR_PHY0_SM;
> 
> to
> 
> + u32 gssr = hw->phy.phy_semaphore_mask;
> + gssr |= IXGBE_GSSR_TOKEN_SM | IXGBE_GSSR_PHY0_SM;
> 
> Is it ok to collect both of your 'Reviewed-by:'s with that additional
> change for v4?

I'd think so.
-- 
Florian


Re: [PATCH net-next v3 2/2] ixgbe: use mii_bus to handle MII related ioctls

2018-12-03 Thread Florian Fainelli
On 12/3/18 10:55 AM, Steve Douthit wrote:
> Use the mii_bus callbacks to address the entire clause 22/45 address
> space.  Enables userspace to poke switch registers instead of a single
> PHY address.
> 
> The ixgbe firmware may be polling PHYs in a way that is not protected by
> the mii_bus lock.  This isn't new behavior, but as Andrew Lunn pointed
> out there are more addresses available for conflicts.
> 
> Signed-off-by: Stephen Douthit 

Reviewed-by: Florian Fainelli 
-- 
Florian


Re: [PATCH net-next v3 1/2] ixgbe: register a mdiobus

2018-12-03 Thread Florian Fainelli
On 12/3/18 10:55 AM, Steve Douthit wrote:
> Most dsa devices expect a 'struct mii_bus' pointer to talk to switches
> via the MII interface.
> 
> While this works for dsa devices, it will not work safely with Linux
> PHYs in all configurations since the firmware of the ixgbe device may
> be polling some PHY addresses in the background.
> 
> Signed-off-by: Stephen Douthit 
> ---

[snip]

> +/**
> + *  ixgbe_mii_bus_write - Write a clause 22/45 register
> + *  @hw: pointer to hardware structure
> + *  @addr: address
> + *  @regnum: register number
> + *  @regnum: valueto write

This should be @val to match the function parameters

> + **/
> +static s32 ixgbe_mii_bus_write(struct mii_bus *bus, int addr, int regnum,
> +u16 val)
> +{
> + struct ixgbe_adapter *adapter = (struct ixgbe_adapter *)bus->priv;

Nitpick: cast is not necessary since this is a void * pointer (for that
reason).

> + struct ixgbe_hw *hw = >hw;
> + u32 gssr = hw->phy.phy_semaphore_mask;
> +
> + return ixgbe_mii_bus_write_generic(hw, addr, regnum, val, gssr);
> +}
> +
> +/**
> + *  ixgbe_x550em_a_mii_bus_read - Read a clause 22/45 register on x550em_a
> + *  @hw: pointer to hardware structure
> + *  @addr: address
> + *  @regnum: register number
> + **/
> +static s32 ixgbe_x550em_a_mii_bus_read(struct mii_bus *bus, int addr,
> +int regnum)
> +{
> + struct ixgbe_adapter *adapter = (struct ixgbe_adapter *)bus->priv;

Likewise, the cast is not necessary.

Everything else looked fine, so with that fixed:

Reviewed-by: Florian Fainelli 
-- 
Florian


Re: [PATCH iproute2] iproute2: Installation errors without libnml

2018-12-02 Thread Florian Fainelli
Hi Émeric,

On December 2, 2018 11:57:00 AM PST, "Émeric Dupont"  
wrote:
>When performing make install in iproute2 (current git master),
> if $(HAVE_MNL) is not selected, some Makefiles try to call
>  install with an empty target, which causes a non-critical make error.

You need to add a Signed-off-by tag here as per the iproute2/Linux certificate 
of origin.

[Snip]

>
>This email and any files transmitted with it are confidential &
>proprietary to Zodiac Inflight Innovations. This information is
>intended solely for the use of the individual or entity to which it is
>addressed. Access or transmittal of the information contained in this
>e-mail, in full or in part, to any other organization or persons is not
>authorized.

This is incompatible with you sending patches for open source software, can you 
work with your IT to either remove that footer by BCC/CC'ing a specific email 
address or use a personal email just to get the patch out?
-- 
Florian


Re: [PATCH net-next v2] net: phy: Also request modules for C45 IDs

2018-12-02 Thread Florian Fainelli



On December 2, 2018 7:33:14 AM PST, Jose Abreu  wrote:
>Logic of phy_device_create() requests PHY modules according to PHY ID
>but for C45 PHYs we use different field for the IDs.
>
>Let's also request the modules for these IDs.
>
>Changes from v1:
>- Only request C22 modules if C45 are not present (Andrew)

Usually you can put the patch changelog under the '---' such that it gets 
stripped out when applying, not a big deal at all.

I don't have an easy way of testing this at the moment but this looks good:

Reviewed-by: Florian Fainelli 

Thanks!
-- 
Florian


Re: [PATCH net-next 1/2] ixgbe: register a mdiobus

2018-11-30 Thread Florian Fainelli



On 11/30/2018 9:34 AM, Steve Douthit wrote:
> On 11/30/18 11:34 AM, Andrew Lunn wrote:
>>> Yep, registering multiple interfaces is wrong.  The first board I tested
>>> against only had a single MAC enabled (they can be disabled/hidden via
>>> straps) so it just happened to work.
>>
>> Hi Steve
>>
>> Can you hide any/all via straps, or is 00.0 always guaranteed to
>> exist?
> 
> You can hide all the devices, but if function 1 is enabled then function
> 0 must also be enabled, so not all combinations are valid.
> 
>>> The Intel C3xxx family of SoCs have up to four ixgbe MACs.  These are
>>> structured as two devices of two functions each on fixed internal root
>>> ports.
>>>
>>> from lspci:
>>> 
>>>  +-16.0-[05]--+-00.0
>>>  |\-00.1
>>>  +-17.0-[06]--+-00.0
>>>  |\-00.1
>>> 
>>
>> Is there any other hardware resource which is shared between the MAC
>> interfaces? I'm just wondering if the driver has already solved this
>> once. Is there an EEPROM per interface for the MAC address, or one
>> shared EEPROM?
> 
> NVM is shared, it's actually the same SPI flash that holds the BIOS for
> this SoC.  Access is serialized by the swfw_sync semaphore.  I think the
> device firmware is automagically handling offset translation.
> 
> I don't think that helps for this case.
> 
> There might be a better match for shared resources, but nothing springs
> to mind.
> 
>> Ah, how about using the 'cards_found' found variable. It is not
>> perfect, in that it is not decremented in ixgb_remove(), and i wonder
>> about race conditions since there does not appear to be any lock when
>> it is incremented. But if cards_found == 0, register the MDIO bus.
> 
> 'cards_found' doesn't exist for the ixgbe driver.  I could add it and
> fix the race/decrement issues you mention, but it'd have to be a device
> type specific count.  It's still possible there are other non-x550em_a
> ixgbe devices in external PCIe slots that have different resource
> sharing setups.
> 
> It's still a lighter weight solution than poking around the parent bus
> so I'll add a 'x550em_a_devs_found' counter to v2.
> 

If the MDIO bus block is hardwired to function 0, would it be acceptable
to walk the PCI bus hierarchy from any driver's probe routine where PCI
function != 0 and see if it is claimed/enabled already, and if so, skip
registering the MDIO bus entirely?

There is also possibly a problem if you have a shared MDIO device, and
you turn off/power off the PCI function that does provide it. How can we
make sure it remains alive for other functions to use it?
-- 
Florian


Re: [PATCH 3/3] dt-bindings: net: dsa: add new bindings MT7530

2018-11-30 Thread Florian Fainelli
Hi Greg,

On 11/29/2018 11:57 PM, g...@kernel.org wrote:
> From: Greg Ungerer 
> 
> Add descriptive entries for the new bindings introduced to support the
> MT7530 implementation in the MediaTek MT7621 SoC.
> 
> New bindings added for:
> 
>   mediatek,no-clock-regulator
>   mediatek,mfc-has-cpuport

I don't think any of these properties are necessary, if you can either
use a compatible string, and/or infer the actual model at runtime in the
driver's probe function, then you can assess based on that chip model as
well as the properties being provided in Device Tree whether these
resources must be grabbed and used. See mv88e6xxx and b53 for how these
drivers deal with supporting several distinct models within the same
code base.

As far as the MFC programming goes, this is definitively something that
must be done once you know the chip model you are dealing with.
-- 
Florian


Re: [RFC PATCH] net: mvpp2: fix detection of 10G SFP modules

2018-11-29 Thread Florian Fainelli



On 11/29/2018 4:49 AM, Baruch Siach wrote:
> The mvpp2_phylink_validate() relies on the interface field of
> phylink_link_state to determine valid link modes. However, when called
> from phylink_sfp_module_insert() this field in not initialized. The
> default switch case then excludes 10G link modes. This allows 10G SFP
> modules that are detected correctly to be configured at max rate of
> 2.5G.
> 
> Catch the uninitialized PHY mode case, and allow 10G rates.
> 
> Cc: Maxime Chevallier 
> Cc: Antoine Tenart 
> Signed-off-by: Baruch Siach 
> ---
> Is that the right fix?

It would be a bit surprising that this is the right fix, you would
expect validate to be called once everything has been parsed
successfully from the SFP, is not that the case here? If not, can you
find out what happens?
-- 
Florian


Re: [PATCH net-next,v3 09/12] flow_dissector: add basic ethtool_rx_flow_spec to flow_rule structure translator

2018-11-21 Thread Florian Fainelli



On 11/20/2018 6:51 PM, Pablo Neira Ayuso wrote:
> This patch adds a function to translate the ethtool_rx_flow_spec
> structure to the flow_rule representation.
> 
> This allows us to reuse code from the driver side given that both flower
> and ethtool_rx_flow interfaces use the same representation.
> 
> Signed-off-by: Pablo Neira Ayuso 
> ---
> v3: Suggested by Jiri Pirko:
> - Add struct ethtool_rx_flow_rule, keep placeholder to private
>   dissector information.
> Reported by Manish Chopra:
>   - Fix incorrect dissector user_keys flags.
> 
>  include/linux/ethtool.h |  10 +++
>  net/core/ethtool.c  | 189 
> 
>  2 files changed, 199 insertions(+)
> 
> diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
> index afd9596ce636..99849e0858b2 100644
> --- a/include/linux/ethtool.h
> +++ b/include/linux/ethtool.h
> @@ -400,4 +400,14 @@ struct ethtool_ops {
>   void(*get_ethtool_phy_stats)(struct net_device *,
>struct ethtool_stats *, u64 *);
>  };
> +
> +struct ethtool_rx_flow_rule {
> + struct flow_rule*rule;
> + unsigned long   priv[0];

This forces you to cast to/from that member, any reason this is just not
a void *priv here?
-- 
Florian


Re: [PATCH net-next,v3 09/12] flow_dissector: add basic ethtool_rx_flow_spec to flow_rule structure translator

2018-11-21 Thread Florian Fainelli



On 11/20/2018 6:51 PM, Pablo Neira Ayuso wrote:
> This patch adds a function to translate the ethtool_rx_flow_spec
> structure to the flow_rule representation.
> 
> This allows us to reuse code from the driver side given that both flower
> and ethtool_rx_flow interfaces use the same representation.
> 
> Signed-off-by: Pablo Neira Ayuso 
> ---
> v3: Suggested by Jiri Pirko:
> - Add struct ethtool_rx_flow_rule, keep placeholder to private
>   dissector information.
> Reported by Manish Chopra:
>   - Fix incorrect dissector user_keys flags.
> 
>  include/linux/ethtool.h |  10 +++
>  net/core/ethtool.c  | 189 
> 
>  2 files changed, 199 insertions(+)
> 
> diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
> index afd9596ce636..99849e0858b2 100644
> --- a/include/linux/ethtool.h
> +++ b/include/linux/ethtool.h
> @@ -400,4 +400,14 @@ struct ethtool_ops {
>   void(*get_ethtool_phy_stats)(struct net_device *,
>struct ethtool_stats *, u64 *);
>  };
> +
> +struct ethtool_rx_flow_rule {
> + struct flow_rule*rule;
> + unsigned long   priv[0];
> +};
> +
> +struct ethtool_rx_flow_rule *
> +ethtool_rx_flow_rule_alloc(const struct ethtool_rx_flow_spec *fs);
> +void ethtool_rx_flow_rule_free(struct ethtool_rx_flow_rule *rule);
> +
>  #endif /* _LINUX_ETHTOOL_H */
> diff --git a/net/core/ethtool.c b/net/core/ethtool.c
> index d05402868575..e679d6478371 100644
> --- a/net/core/ethtool.c
> +++ b/net/core/ethtool.c
> @@ -28,6 +28,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  /*
>   * Some useful ethtool_ops methods that're device independent.
> @@ -2808,3 +2809,191 @@ int dev_ethtool(struct net *net, struct ifreq *ifr)
>  
>   return rc;
>  }
> +
> +struct ethtool_rx_flow_key {
> + struct flow_dissector_key_basic basic;
> + union {
> + struct flow_dissector_key_ipv4_addrsipv4;
> + struct flow_dissector_key_ipv6_addrsipv6;
> + };
> + struct flow_dissector_key_ports tp;
> + struct flow_dissector_key_ipip;
> +} __aligned(BITS_PER_LONG / 8); /* Ensure that we can do comparisons as 
> longs. */
> +
> +struct ethtool_rx_flow_match {
> + struct flow_dissector   dissector;
> + struct ethtool_rx_flow_key  key;
> + struct ethtool_rx_flow_key  mask;
> +};
> +
> +struct ethtool_rx_flow_rule *
> +ethtool_rx_flow_rule_alloc(const struct ethtool_rx_flow_spec *fs)

This is more than alloc, it's allocate and map, no reason to split the
two operations AFAICT, but the name could be improved, how about
alloc_from()?

> +{
> + static struct in6_addr zero_addr = {};
> + struct ethtool_rx_flow_match *match;
> + struct ethtool_rx_flow_rule *flow;
> + struct flow_action_entry *act;
> +
> + flow = kzalloc(sizeof(struct ethtool_rx_flow_rule) +
> +sizeof(struct ethtool_rx_flow_match), GFP_KERNEL);
> + if (!flow)
> + return NULL;
> +
> + /* ethtool_rx supports only one single action per rule. */
> + flow->rule = flow_rule_alloc(1);
> + if (!flow->rule) {
> + kfree(flow);
> + return NULL;
> + }
> +
> + match = (struct ethtool_rx_flow_match *)flow->priv;
> + flow->rule->match.dissector = >dissector;
> + flow->rule->match.mask  = >mask;
> + flow->rule->match.key   = >key;
> +
> + match->mask.basic.n_proto = 0x;
> +
> + switch (fs->flow_type & ~FLOW_EXT) {
> + case TCP_V4_FLOW:
> + case UDP_V4_FLOW: {
> + const struct ethtool_tcpip4_spec *v4_spec, *v4_m_spec;
> +
> + match->key.basic.n_proto = htons(ETH_P_IP);
> +
> + v4_spec = >h_u.tcp_ip4_spec;
> + v4_m_spec = >m_u.tcp_ip4_spec;
> +
> + if (v4_m_spec->ip4src) {
> + match->key.ipv4.src = v4_spec->ip4src;
> + match->mask.ipv4.src = v4_m_spec->ip4src;
> + }
> + if (v4_m_spec->ip4dst) {
> + match->key.ipv4.dst = v4_spec->ip4dst;
> + match->mask.ipv4.dst = v4_m_spec->ip4dst;
> + }

I got confused a while ago between the ethtool ntuple and nfc semantics,
and I can't remember if the following is true:

- bits set to 1 indicate a match and bit set to 0 indicate a don't care
for nfc
- bits set to 0 indicate a match and bit set to 1 indicate a don't care
for ntuple

Depending on the answer that could mean that this check on a zero
address may have to change.

> + if (v4_m_spec->ip4src ||
> + v4_m_spec->ip4dst) {
> + match->dissector.used_keys |=
> + (1 << FLOW_DISSECTOR_KEY_IPV4_ADDRS);

Can you use BIT() here (and likewise for every one below).

[snip]
> +
> + return flow;

What about the extended fields 

Re: [PATCH net-next,v3 10/12] dsa: bcm_sf2: use flow_rule infrastructure

2018-11-21 Thread Florian Fainelli



On 11/20/2018 6:51 PM, Pablo Neira Ayuso wrote:
> Update this driver to use the flow_rule infrastructure, hence we can use
> the same code to populate hardware IR from ethtool_rx_flow and the
> cls_flower interfaces.
> 
> Signed-off-by: Pablo Neira Ayuso 
> ---

[snip]

> @@ -398,9 +411,10 @@ static int bcm_sf2_cfp_ipv4_rule_set(struct bcm_sf2_priv 
> *priv, int port,
>* Reserved [1]
>* UDF_Valid[8] [0]
>*/
> - core_writel(priv, v4_spec->tos << IPTOS_SHIFT |
> - ip_proto << IPPROTO_SHIFT | ip_frag << IP_FRAG_SHIFT |
> - udf_upper_bits(num_udf),
> + core_writel(priv, ip.key->tos << IPTOS_SHIFT |
> +   basic.key->n_proto << IPPROTO_SHIFT |
> +   ip_frag << IP_FRAG_SHIFT |
> +   udf_upper_bits(num_udf),

Can you maintain the alignment of the arguments here?

Every else looked okay, but I need to check the ipv6 structure to make
sure we can access it as 4x32 bits words like we did before.
-- 
Florian


Re: [PATCH net-next,v3 03/12] flow_dissector: add flow action infrastructure

2018-11-21 Thread Florian Fainelli



On 11/20/2018 6:51 PM, Pablo Neira Ayuso wrote:
> This new infrastructure defines the nic actions that you can perform
> from existing network drivers. This infrastructure allows us to avoid a
> direct dependency with the native software TC action representation.
> 
> Signed-off-by: Pablo Neira Ayuso 
> ---

[snip]

> +#define flow_action_for_each(__i, __act, __actions)  \
> +for (__i = 0, __act = &(__actions)->entries[0]; __i < 
> (__actions)->num_entries; __act = &(__actions)->entries[++__i])

Post increment is more common in for_each* constructs, any reason why
you are you using a pre increment here?
-- 
Florian


Re: [PATCH v4 net-next 0/6] net: dsa: microchip: Modify KSZ9477 DSA driver in preparation to add other KSZ switch drivers

2018-11-20 Thread Florian Fainelli



On 11/20/2018 3:55 PM, tristram...@microchip.com wrote:
> From: Tristram Ha 
> 
> This series of patches is to modify the original KSZ9477 DSA driver so
> that other KSZ switch drivers can be added and use the common code.
> 
> There are several steps to accomplish this achievement.  First is to
> rename some function names with a prefix to indicate chip specific
> function.  Second is to move common code into header that can be shared.
> Last is to modify tag_ksz.c so that it can handle many tail tag formats
> used by different KSZ switch drivers.
> 
> ksz_common.c will contain the common code used by all KSZ switch drivers.
> ksz9477.c will contain KSZ9477 code from the original ksz_common.c.
> ksz9477_spi.c is renamed from ksz_spi.c.
> ksz9477_reg.h is renamed from ksz_9477_reg.h.
> ksz_common.h is added to provide common code access to KSZ switch
> drivers.
> ksz_spi.h is added to provide common SPI access functions to KSZ SPI
> drivers.

Thanks a lot for getting this series out, hopefully there is no blocker
getting it merged now and you can follow-up with additional features.

> 
> v4
> - Patches were removed to concentrate on changing driver structure without
> adding new code.
> 
> v3
> - The phy_device structure is used to hold port link information
> - A structure is passed in ksz_xmit and ksz_rcv instead of function pointer
> - Switch offload forwarding is supported
> 
> v2
> - Initialize reg_mutex before use
> - The alu_mutex is only used inside chip specific functions
> 
> v1
> - Each patch in the set is self-contained
> - Use ksz9477 prefix to indicate KSZ9477 specific code
> 
> Tristram Ha (6):
>   net: dsa: microchip: replace license with GPL
>   net: dsa: microchip: clean up code
>   net: dsa: microchip: rename some functions with ksz9477 prefix
>   net: dsa: microchip: rename ksz_spi.c to ksz9477_spi.c
>   net: dsa: microchip: break KSZ9477 DSA driver into two files
>   net: dsa: microchip: rename ksz_9477_reg.h to ksz9477_reg.h
> 
>  drivers/net/dsa/microchip/Kconfig  |   16 +-
>  drivers/net/dsa/microchip/Makefile |5 +-
>  drivers/net/dsa/microchip/ksz9477.c| 1316 
> 
>  .../microchip/{ksz_9477_reg.h => ksz9477_reg.h}|   17 +-
>  drivers/net/dsa/microchip/ksz9477_spi.c|  177 +++
>  drivers/net/dsa/microchip/ksz_common.c | 1183 +++---
>  drivers/net/dsa/microchip/ksz_common.h |  214 
>  drivers/net/dsa/microchip/ksz_priv.h   |  245 ++--
>  drivers/net/dsa/microchip/ksz_spi.c|  217 
>  drivers/net/dsa/microchip/ksz_spi.h|   69 +
>  10 files changed, 2039 insertions(+), 1420 deletions(-)
>  create mode 100644 drivers/net/dsa/microchip/ksz9477.c
>  rename drivers/net/dsa/microchip/{ksz_9477_reg.h => ksz9477_reg.h} (98%)
>  create mode 100644 drivers/net/dsa/microchip/ksz9477_spi.c
>  create mode 100644 drivers/net/dsa/microchip/ksz_common.h
>  delete mode 100644 drivers/net/dsa/microchip/ksz_spi.c
>  create mode 100644 drivers/net/dsa/microchip/ksz_spi.h
> 

-- 
Florian


Re: [PATCH v4 net-next 6/6] net: dsa: microchip: rename ksz_9477_reg.h to ksz9477_reg.h

2018-11-20 Thread Florian Fainelli



On 11/20/2018 3:55 PM, tristram...@microchip.com wrote:
> From: Tristram Ha 
> 
> Rename ksz_9477_reg.h to ksz9477_reg.h for consistency as the product
> name is always KSZ.
> 
> Signed-off-by: Tristram Ha 
> Reviewed-by: Woojung Huh 
> Reviewed-by: Andrew Lunn 

Reviewed-by: Florian Fainelli 
-- 
Florian


Re: [PATCH v4 net-next 1/6] net: dsa: microchip: replace license with GPL

2018-11-20 Thread Florian Fainelli



On 11/20/2018 3:55 PM, tristram...@microchip.com wrote:
> From: Tristram Ha 
> 
> Replace license with GPL.
> 
> Signed-off-by: Tristram Ha 
> Reviewed-by: Woojung Huh 
> Reviewed-by: Andrew Lunn 
> Acked-by: Pavel Machek 

Reviewed-by: Florian Fainelli 
-- 
Florian


Re: [PATCH v1 net] lan743x: fix return value for lan743x_tx_napi_poll

2018-11-20 Thread Florian Fainelli
On 11/20/18 1:39 PM, bryan.whiteh...@microchip.com wrote:
>> -Original Message-
>> From: Andrew Lunn 
>> Sent: Tuesday, November 20, 2018 2:31 PM
>> To: Bryan Whitehead - C21958 
>> Cc: da...@davemloft.net; netdev@vger.kernel.org; UNGLinuxDriver
>> 
>> Subject: Re: [PATCH v1 net] lan743x: fix return value for
>> lan743x_tx_napi_poll
>>
>> On Tue, Nov 20, 2018 at 01:26:43PM -0500, Bryan Whitehead wrote:
>>> It has been noticed that under stress the lan743x driver will
>>> sometimes hang or cause a kernel panic. It has been noticed that
>>> returning '0' instead of 'weight' fixes this issue.
>>>
>>> fixes: rare kernel panic under heavy traffic load.
>>> Signed-off-by: Bryan Whitehead 
>>
>> Hi Bryan
>>
>> This sounds like a band aid over something which is broken, not a real fix.
>>
>> Can you show us the stack trace from the panic?
>>
>> Andrew
> 
> Andrew,
> 
> Admittedly, my knowledge of what the kernel is doing behind the scenes is 
> limited.
> 
> But according to documentation found on 
> https://wiki.linuxfoundation.org/networking/napi
> 
> It states the following
> "The poll() function may also process TX completions, in which case if it 
> processes
> the entire TX ring then it should count that work as the rest of the budget.
> Otherwise, TX completions are not counted."
> 
> So based on that, the original driver was returning the full budget. But I 
> was having
> Issues with it. And the above documentation seems to suggest that I could 
> return 0
> As in "not counted" from above.
> 
> I tried it, and my lock up issues disappeared.
> 
> Regarding the kernel panic stack trace. So far its very hard to replicate 
> that on the 
> latest kernel. I've seen it more frequently when back porting to older 
> kernels such
> as 4.14, and 4.9. This same fix caused those kernel panics to disappear.
> Are you interested in seeing a stack dump from older kernels?
> 
> In the latest kernel the issue manifests as a kernel message which states
> "[  945.021101] enp48s0: Budget exhausted after napi rescheduled"
> 
> I'm not sure what that means. But it does not lock up immediately after 
> seeing that
> Message. But it usually locks up with in a minute of seeing that message.
> 
> And the sometimes I get the following warning
> [ 1240.425020] [ cut here ]
> [ 1240.426014] NETDEV WATCHDOG: enp0s25 (e1000e): transmit queue 0 timed out
> [ 1240.430027] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:461 
> dev_watchdog+0x1ef/0x200
> [ 1240.430027] Modules linked in: lan743x
> [ 1240.430027] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G  I   
> 4.19.2 #1
> [ 1240.430027] Hardware name: Hewlett-Packard HP Compaq dc7900 Convertible 
> Minitower/3032h, BIOS 786G1 v01.16 03/05/2009
> [ 1240.430027] RIP: 0010:dev_watchdog+0x1ef/0x200
> [ 1240.430027] Code: 00 48 63 4d e0 eb 93 4c 89 e7 c6 05 68 30 b3 00 01 e8 25 
> 3d fd ff 89 d9 48 89 c2 4c 89 e6 48 c7 c7 98 92 48 ab e8 f1 28 87 ff <0f> 0b 
> eb c0 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 48 c7 47 08 00
> [ 1240.430027] RSP: 0018:98490be03e90 EFLAGS: 00010282
> [ 1240.430027] RAX:  RBX:  RCX: 
> 
> [ 1240.497168] RDX: 00040400 RSI: 00f6 RDI: 
> 0300
> [ 1240.497168] RBP: 984908574440 R08:  R09: 
> 03a4
> [ 1240.497168] R10: 0020 R11: abc928ed R12: 
> 984908574000
> [ 1240.497168] R13:  R14:  R15: 
> 98490be195b0
> [ 1240.497168] FS:  () GS:98490be0() 
> knlGS:
> [ 1240.497168] CS:  0010 DS:  ES:  CR0: 80050033
> [ 1240.497168] CR2: 7f31cd4c CR3: 000109bca000 CR4: 
> 000406f0
> [ 1240.497168] Call Trace:
> [ 1240.497168]  
> [ 1240.497168]  ? qdisc_reset+0xe0/0xe0
> [ 1240.497168]  call_timer_fn+0x26/0x130
> [ 1240.497168]  run_timer_softirq+0x1cd/0x400
> [ 1240.497168]  ? hpet_interrupt_handler+0x10/0x30
> [ 1240.497168]  __do_softirq+0xed/0x2aa
> [ 1240.497168]  irq_exit+0xb7/0xc0
> [ 1240.497168]  do_IRQ+0x45/0xd0
> [ 1240.497168]  common_interrupt+0xf/0xf
> [ 1240.497168]  
> [ 1240.497168] RIP: 0010:cpuidle_enter_state+0xa6/0x330
> [ 1240.497168] Code: 65 8b 3d 1d b0 4d 55 e8 58 6a 95 ff 48 89 c3 66 66 66 66 
> 90 31 ff e8 59 73 95 ff 80 7c 24 0b 00 0f 85 25 02 00 00 fb 4c 29 eb <48> ba 
> cf f7 53 e3 a5 9b c4 20 48 89 d8 48 c1 fb 3f 48 f7 ea b8 ff
> [ 1240.497168] RSP: 0018:ab603e60 EFLAGS: 0216 ORIG_RAX: 
> ffde
> [ 1240.497168] RAX: 98490be20a80 RBX: 0081035c RCX: 
> 0120cf178c49
> [ 1240.497168] RDX: 0120cf178ca0 RSI: 0120cf178ca0 RDI: 
> 
> [ 1240.497168] RBP: 984908fbd000 R08: fffb58ea5f9e R09: 
> 01208e0b48df
> [ 1240.497168] R10: 18c4 R11: 2468 R12: 
> 0002
> [ 1240.497168] R13: 0120ce968944 R14: ab6a68a0 R15: 
> ab611740
> [ 

Re: [PATCH 00/10] add flow_rule infrastructure

2018-11-19 Thread Florian Fainelli
On 11/19/18 1:57 AM, Jiri Pirko wrote:
> Mon, Nov 19, 2018 at 10:19:28AM CET, manish.cho...@cavium.com wrote:
>>> -Original Message-
>>> From: netdev-ow...@vger.kernel.org  On
>>> Behalf Of Pablo Neira Ayuso
>>> Sent: Friday, November 16, 2018 7:11 AM
>>> To: netdev@vger.kernel.org
>>> Cc: da...@davemloft.net; thomas.lenda...@amd.com;
>>> f.faine...@gmail.com; Elior, Ariel ;
>>> michael.c...@broadcom.com; sant...@chelsio.com;
>>> madalin.bu...@nxp.com; yisen.zhu...@huawei.com;
>>> salil.me...@huawei.com; jeffrey.t.kirs...@intel.com; tar...@mellanox.com;
>>> sae...@mellanox.com; j...@mellanox.com; ido...@mellanox.com;
>>> jakub.kicin...@netronome.com; peppe.cavall...@st.com;
>>> grygorii.stras...@ti.com; and...@lunn.ch;
>>> vivien.dide...@savoirfairelinux.com; alexandre.tor...@st.com;
>>> joab...@synopsys.com; linux-net-driv...@solarflare.com;
>>> ganes...@chelsio.com
>>> Subject: [PATCH 00/10] add flow_rule infrastructure
>>>
>>> External Email
>>>
>>> This patchset introduces a kernel intermediate representation (IR) to 
>>> express
>>> ACL hardware offloads, this is heavily based on the existing flow dissector
>>> infrastructure and the TC actions. This IR can be used by different frontend
>>> ACL interfaces such as ethtool_rxnfc and tc to represent ACL hardware
>>> offloads. Main goal is to simplify the development of ACL hardware offloads
>>> for the existing frontend interfaces, the idea is that driver developers do 
>>> not
>>> need to add one specific parser for each ACL frontend, instead each frontend
>>> can just generate this flow_rule IR and pass it to drivers to populate the
>>> hardware IR.
>>>
>>> .   ethtool_rxnfc   tc
>>>|   (ioctl)(netlink)
>>>|  | | translate native
>>>   Frontend |  | |  interface representation
>>>|  | |  to flow_rule IR
>>>|  | |
>>> . \/\/
>>> . flow_rule IR
>>>||
>>>Drivers || parsing of flow_rule IR
>>>||  to populate hardware IR
>>>|   \/
>>> .  hardware IR (driver)
>>>
>>> For design and implementation details, please have a look at:
>>>
>>> https://lwn.net/Articles/766695/
>>>
>>> As an example, with this patchset, it should be possible to simplify the
>>> existing net/qede driver which already has two parsers to populate the
>>> hardware IR, one for ethtool_rxnfc interface and another for tc.
>>>
>>> This batch is composed of 10 patches:
>>>
>>> Patch #1 adds the flow_match structure, this includes the
>>>  flow_rule_match_key() interface to check for existing selectors
>>>  that are in used in the rule and the flow_rule_match_*()
>>>  functions to fetch the selector value and the mask. This
>>>  also introduces the initial flow_rule structure skeleton to
>>>  avoid a follow up patch that would update the same LoCs.
>>>
>>> Patch #2 makes changes to packet edit parser of mlx5e driver, to prepare
>>>  introduction of the new flow_action to mangle packets.
>>>
>>> Patch #3 Introduce flow_action infrastructure. This infrastructure is
>>>  based on the TC actions. Patch #8 extends it so it also
>>>  supports two new actions that are only available through the
>>>  ethtool_rxnfc interface.
>>>
>>> Patch #4 Add function to translate TC action to flow_action from
>>>  cls_flower.
>>>
>>> Patch #5 Add infrastructure to fetch statistics into container structure
>>>  and synchronize them to TC actions from cls_flower. Another
>>>  preparation patch before patch #7, so we can stop exposing the
>>>  TC action native layout to the drivers.
>>>
>>> Patch #6 Use flow_action infrastructure from drivers.
>>>
>>> Patch #7 Do not expose TC actions to drivers anymore, now that drivers
>>>  have been converted to use the flow_action infrastructure after
>>>  patch #5.
>>>
>>> Patch #8 Support to wake-up-on-lan and queue actions for the flow_action
>>>  infrastructure, two actions supported by NICs. This is used by
>>>  the ethtool_rx_flow interface.
>>>
>>> Patch #9 Add a function to translate from ethtool_rx_flow_spec structure
>>>  to the flow_action structure. This is a simple enough for its
>>>  first client: the ethtool_rxnfc interface in the bcm_sf2 driver.
>>>
>>> Patch #10 Update bcm_sf2 to use this new translator function and
>>>   update codebase to configure hardware IR using the
>>>   flow_action representation. This will allow later development
>>>   of cls_flower using the same codebase from the driver.
>>>
>>> This patchset has passed here functional tests of the codepath that 
>>> generates
>>> the flow_rule structure and the 

Re: [PATCH net-next] MAINTAINERS: Add myself as third phylib maintainer

2018-11-19 Thread Florian Fainelli
On 11/18/18 11:01 PM, Heiner Kallweit wrote:
> Add myself as third phylib maintainer.
> 
> Signed-off-by: Heiner Kallweit 

Acked-by: Florian Fainelli 

This could probably go to "net" (not a functional code change).
-- 
Florian


Re: [PATCH net-next] net: phy: check for implementation of both callbacks in phy_drv_supports_irq

2018-11-12 Thread Florian Fainelli
On 11/12/18 12:16 PM, Heiner Kallweit wrote:
> Now that the icplus driver has been fixed all PHY drivers supporting
> interrupts have both callbacks (config_intr and ack_interrupt)
> implemented - as it should be. Therefore phy_drv_supports_irq()
> can be changed now to check for both callbacks being implemented.
> 
> Signed-off-by: Heiner Kallweit 

Acked-by: Florian Fainelli 
-- 
Florian


Re: [PATCH net-next] net: phy: switch to lockdep_assert_held in phylib

2018-11-12 Thread Florian Fainelli
On 11/12/18 9:44 AM, Andrew Lunn wrote:
> On Sun, Nov 11, 2018 at 10:33:08PM +0100, Heiner Kallweit wrote:
>> diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
>> index 2e59a8419..5cb06f021 100644
>> --- a/drivers/net/phy/mdio_bus.c
>> +++ b/drivers/net/phy/mdio_bus.c
>> @@ -541,7 +541,7 @@ int __mdiobus_read(struct mii_bus *bus, int addr, u32 
>> regnum)
>>  {
>>  int retval;
>>  
>> -WARN_ON_ONCE(!mutex_is_locked(>mdio_lock));
>> +lockdep_assert_held_once(>mdio_lock);
> 
> Hi Heiner
> 
> I don't think there is a clear right/wrong here. This is not hot path
> code. The cost for checking the lock is held is very small compared to
> the actual MDIO transaction. So i don't think we really need to
> optimise this. I do sometimes build with lockdep on, but not
> always. So it is good to know when locking is broken on normal builds.
> 
> Florian, what do you think?

lockdep_assert_held_once() also looks at debug_locks (global variable)
so it sounds like in that regard, it would be superior in that it allows
an user-configurable, general debugging facility to behave consistently
as opposed to always having opt-in debugging within the mdio_bus.c file,
but that also has a lot of value. I have to admit debugging MDIO bus
locking issues is not particularly fun, so I would probably stick with
the current code in that regard.
-- 
Florian


Re: [PATCH net-next] net: phy: icplus: add config_intr callback

2018-11-12 Thread Florian Fainelli
On 11/11/18 12:49 PM, Heiner Kallweit wrote:
> Move IRQ configuration for IP101A/G from config_init to config_intr
> callback. Reasons:
> 
> 1. This allows phylib to disable interrupts if needed.
> 2. Icplus was the only driver supporting interrupts w/o defining a
>config_intr callback. Now we can add a phylib plausibility check
>disabling interrupt mode if one of the two irq-related callbacks
>isn't defined.
> 
> I don't own hardware with this PHY, and the change is based on the
> datasheet for IP101A LF (which is supposed to be register-compatible
> with IP101A/G). Change is compile-tested only.
> 
> Signed-off-by: Heiner Kallweit 

Reviewed-by: Florian Fainelli 

A bit surprising that this is not a read/modify/write sequence.
-- 
Florian


Re: [PATCH v2 net-next] net: phy: improve struct phy_device member interrupts handling

2018-11-09 Thread Florian Fainelli
On 11/9/18 9:35 AM, Heiner Kallweit wrote:
> As a heritage from the very early days of phylib member interrupts is
> defined as u32 even though it's just a flag whether interrupts are
> enabled. So we can change it to a bitfield member. In addition change
> the code dealing with this member in a way that it's clear we're
> dealing with a bool value.
> 
> Signed-off-by: Heiner Kallweit 

Reviewed-by: Florian Fainelli 
-- 
Florian


Re: [PATCH net-next 1/2] net: phy: use phy_id_mask value zero for exact match

2018-11-08 Thread Florian Fainelli
On 11/8/18 12:53 PM, Andrew Lunn wrote:
>>> Maybe we can find a clever way with a macro to specify only the PHY OUI
>>> and compute a suitable mask automatically?
>>>
>> I don't think so. For Realtek each driver is specific even to a model
>> revision (therefore mask 0x). Same applies to intel-xway.
>> In broadcom.c we have masks 0xfff0, so for each model, independent
>> of revision number. There is no general rule.
>> Also we can't simply check for the first-bit-set to derive a mask.
> 
> I'm crystal ball gazing, but i think Florian was meaning something like
> 
> #define PHY_ID_UNIQUE(_id) \
>   .phy_id = _id_; \
>   .phy_id_mask = 0x;
> 
> It is the boilerplate setting .phy_id_mask which you don't like. This removes 
> that boilerplate.

Your crystal ball gazing skills are good, that is what I meant, we could
also define another macro which does not match the revision bits, and
that would likely cover everything that is already out there.
-- 
Florian


Re: [PATCH net-next 0/5] net: phy: improve and simplify phylib state machine

2018-11-08 Thread Florian Fainelli
On 11/8/18 2:58 PM, David Miller wrote:
> From: Heiner Kallweit 
> Date: Wed, 7 Nov 2018 20:41:52 +0100
> 
>> This patch series is based on two axioms:
>>
>> - During autoneg a PHY always reports the link being down
>>
>> - Info in clause 22/45 registers doesn't allow to differentiate between
>>   these two states:
>>   1. Link is physically down
>>   2. A link partner is connected and PHY is autonegotiating
>>   In both cases "link up" and "aneg finished" bits aren't set.
>>   One consequence is that having separate states PHY_NOLINK and PHY_AN
>>   isn't needed.
>>
>> By using these two axioms the state machine can be significantly
>> simplified.
> 
> So how are we going to move forward on this?
> 
> Maybe we can apply this series and just watch carefully for any
> problems that get reported or are found?

Given Heiner is always responsive and taking care of fixing what might
be/have broken, no objections with me on that.
-- 
Florian


Re: [PATCH net-next 0/2] net: phy: replace PHY_HAS_INTERRUPT with a check for config_intr and ack_interrupt

2018-11-08 Thread Florian Fainelli
On 11/8/18 2:30 PM, Andrew Lunn wrote:
> On Thu, Nov 08, 2018 at 10:54:50PM +0100, Heiner Kallweit wrote:
>> Flag PHY_HAS_INTERRUPT is used only here for this small check. I think
>> using interrupts isn't possible if a driver defines neither
>> config_intr nor ack_interrupts callback. So we can replace checking
>> flag PHY_HAS_INTERRUPT with checking for these callbacks.
>>
>> This allows to remove this flag from a lot of driver configs, let's
>> start with the Realtek driver.
> 
> Hi Heiner
> 
> Ideally, we should do this to all the drivers, not just one. If we
> leave PHY_HAS_INTERRUPT, people are going to use it. That then makes
> the realtek driver then different to all the other drivers, and at
> some point somebody will do something which breaks it because they
> don't know the realtek driver is special.

Agreed.
-- 
Florian


Re: [PATCH net-next 1/2] net: phy: replace PHY_HAS_INTERRUPT with a check for config_intr and ack_interrupt

2018-11-08 Thread Florian Fainelli
On 11/8/18 1:55 PM, Heiner Kallweit wrote:
> Flag PHY_HAS_INTERRUPT is used only here for this small check. I think
> using interrupts isn't possible if a driver defines neither
> config_intr nor ack_interrupts callback. So we can replace checking
> flag PHY_HAS_INTERRUPT with checking for these callbacks.
> 
> Signed-off-by: Heiner Kallweit 
> ---
>  drivers/net/phy/phy_device.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
> index d165a2c82..33e51b955 100644
> --- a/drivers/net/phy/phy_device.c
> +++ b/drivers/net/phy/phy_device.c
> @@ -2104,8 +2104,9 @@ static int phy_probe(struct device *dev)
>   /* Disable the interrupt if the PHY doesn't support it
>* but the interrupt is still a valid one
>*/
> - if (!(phydrv->flags & PHY_HAS_INTERRUPT) &&
> - phy_interrupt_is_valid(phydev))
> +  if (!phydrv->config_intr &&
> +  !phydrv->ack_interrupt &&
> +  phy_interrupt_is_valid(phydev))
>   phydev->irq = PHY_POLL;

I would introduce an inline helper function which checks for
drv->config_intr and config_ack_interrupt, that way if we ever have to
introduce an additional function in the future, we just update the
helper to check for that.

Other than that, LGTM
-- 
Florian


Re: [PATCH net-next 1/2] net: phy: use phy_id_mask value zero for exact match

2018-11-08 Thread Florian Fainelli
On 11/7/18 12:53 PM, Heiner Kallweit wrote:
> A phy_id_mask value zero means every PHYID matches, therefore
> value zero isn't used. So we can safely redefine the semantics
> of value zero to mean "exact match". This allows to avoid some
> boilerplate code in PHY driver configs.

Having run recently into some ethtool quirkyness about how masks are
supposed to be specified between ntuple/nfc, where the meaning of 0 is
either don't care or match, I would rather we stick with the current
behavior where every bit set to 0 is a don't care and bits set t 1 are not.

Maybe we can find a clever way with a macro to specify only the PHY OUI
and compute a suitable mask automatically?

> 
> Signed-off-by: Heiner Kallweit 
> ---
>  drivers/net/phy/phy_device.c | 21 +++--
>  include/linux/phy.h  |  2 +-
>  2 files changed, 16 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
> index ab33d1777..d165a2c82 100644
> --- a/drivers/net/phy/phy_device.c
> +++ b/drivers/net/phy/phy_device.c
> @@ -483,15 +483,24 @@ static int phy_bus_match(struct device *dev, struct 
> device_driver *drv)
>   if (!(phydev->c45_ids.devices_in_package & (1 << i)))
>   continue;
>  
> - if ((phydrv->phy_id & phydrv->phy_id_mask) ==
> - (phydev->c45_ids.device_ids[i] &
> -  phydrv->phy_id_mask))
> - return 1;
> + if (!phydrv->phy_id_mask) {
> + if (phydrv->phy_id ==
> + phydev->c45_ids.device_ids[i])
> + return 1;
> + } else {
> + if ((phydrv->phy_id & phydrv->phy_id_mask) ==
> + (phydev->c45_ids.device_ids[i] &
> +  phydrv->phy_id_mask))
> + return 1;
> + }
>   }
>   return 0;
>   } else {
> - return (phydrv->phy_id & phydrv->phy_id_mask) ==
> - (phydev->phy_id & phydrv->phy_id_mask);
> + if (!phydrv->phy_id_mask)
> + return phydrv->phy_id == phydev->phy_id;
> + else
> + return (phydrv->phy_id & phydrv->phy_id_mask) ==
> +(phydev->phy_id & phydrv->phy_id_mask);
>   }
>  }
>  
> diff --git a/include/linux/phy.h b/include/linux/phy.h
> index 2090277ea..e30ca2fdd 100644
> --- a/include/linux/phy.h
> +++ b/include/linux/phy.h
> @@ -500,7 +500,7 @@ struct phy_driver {
>   struct mdio_driver_common mdiodrv;
>   u32 phy_id;
>   char *name;
> - u32 phy_id_mask;
> + u32 phy_id_mask; /* value 0 means exact match */
>   const unsigned long * const features;
>   u32 flags;
>   const void *driver_data;
> 


-- 
Florian


Re: [PATCH RFC net-next 0/3] net: phy: sfp: Warn when using generic PHY driver

2018-11-06 Thread Florian Fainelli
On 11/6/18 4:34 PM, Russell King - ARM Linux wrote:
> On Tue, Nov 06, 2018 at 04:09:35PM -0800, Florian Fainelli wrote:
>> On 11/6/18 4:03 PM, Russell King - ARM Linux wrote:
>>> On Tue, Nov 06, 2018 at 03:38:44PM -0800, David Miller wrote:
>>>> From: Florian Fainelli 
>>>> Date: Tue,  6 Nov 2018 15:29:10 -0800
>>>>
>>>>> This patch series allows warning an user that the generic PHY driver(s)
>>>>> are used when a SFP incorporates a PHY (e.g: 1000BaseT SFP) which is
>>>>> likely not going to work at all.
>>>>>
>>>>> Let me know if you would want to do that differently.
>>>>
>>>> Is there ever a possibility that the generic PHY driver could work
>>>> in an SFP situation?
>>>
>>> I don't yet see the reason for Florian's patch series - all the Marvell
>>> 88e based modules I have, or have come across in information from
>>> manufacturers self-configure themselves and don't really need the
>>> Marvell 1G PHY driver.
>>>
>>> For example, the Source Photonics were offering a range of 1GbaseT
>>> modules with the 88e programmed in different modes, but published
>>> instructions for the register accesses to configure them differently
>>> (eg, SGMII vs 1000base-X interface facing the MAC).  Depending on
>>> the module part number determines which mode the PHY has been
>>> programmed to come up in.
>>>
>>> So in theory, you don't need any PHY driver for these modules - but
>>> it's useful to have a functional PHY driver to be able to read out
>>> the negotiated flow control results.
>>>
>>> I'd like more information from Florian about the reasoning behind
>>> this patch series before it's merged.
>>>
>>
>> The module that I am using [1] would not work, as in , no link up being
>> reported without turning on the Marvell PHY driver:
>>
>> https://www.amazon.com/dp/B01LW2P72V/ref=twister_B07F3WQJQX?_encoding=UTF8=1
>>
>> this module uses a 88E PHY as well (OUI: 0x01410cc2).
> 
> From the above URL:
> 
>  * This is 1000M SFP-T Transceiver, not 10/100/1000M Multi-Rate SFP-T. If
>you want to buy 10/100/1000M Multi-Rate SFP-T, pls contact us.10Gtek
>offer more compatible options, if your brands not listed above, pls
>contact us.
> 
> I wonder if this is like the Source Photonics situation, where the
> 1000base-T only variant of their module uses 1000base-X on the MAC
> side, whereas their 10/100/1000base-T variant uses SGMII.  The only
> difference between these are the part numbers and the programming
> of the 88E to tell it which mode to default to for the host
> side.  (There's no true way to know from the EEPROM whether a module
> wants SGMII or 1000base-X.)
> 
> What I also gather is that this is a 10Gtek-manufactured version of
> the Ubiquiti UF-RJ45-1G - the original Ubiquiti version supports
> 10/100/1G speeds which would require the 88e to configure for
> a SGMII host interface by default.
> 
> Now, the reason that modules with an 88E configured to default to
> 1000base-X will work when the marvell PHY driver is present, but not
> with the generic driver is that the marvell PHY driver will see that
> SFP/phylink is wanting to use SGMII mode, and the Marvell PHY driver
> reprograms the PHY to use SGMII.  This is only a problem for these
> modules.
> 
> So, in so far as your patch 3 goes to give a hint that the Marvell
> driver should be selected, that's correct.
> 
> However, where the 88e is configured for SGMII by default, the
> Marvell driver shouldn't be required, and I wonder whether we ought
> to be issuing a warning in that case.  The problem, however, is there
> is no way to know for certain.
> 
> We could have modules that do not use the Marvell PHY, and if we don't
> have a PHY driver for their particular PHY, do we want a warning to be
> issued?

Another approach could be to maintain a list of modules that do not work
with the generic PHY driver and therefore require a specialized driver,
in that case we could even go as far as not letting sfp_sm_probe_phy()
return success. Not sure how well things would scale, probably not too
bad given there are only a handful of users of the SFP framework thus far...

> 
> The whole 1000base-X vs SGMII with SFP modules is all very icky. :(
> 
-- 
Florian


[PATCH net-next] net: phy: bcm7xxx: Add entry for BCM7255

2018-11-06 Thread Florian Fainelli
From: Justin Chen 

Add support for BCM7255 EPHY.

Signed-off-by: Justin Chen 
Signed-off-by: Florian Fainelli 
---
 drivers/net/phy/bcm7xxx.c | 2 ++
 include/linux/brcmphy.h   | 1 +
 2 files changed, 3 insertions(+)

diff --git a/drivers/net/phy/bcm7xxx.c b/drivers/net/phy/bcm7xxx.c
index b2b6307d64a4..712224cc442d 100644
--- a/drivers/net/phy/bcm7xxx.c
+++ b/drivers/net/phy/bcm7xxx.c
@@ -650,6 +650,7 @@ static int bcm7xxx_28nm_probe(struct phy_device *phydev)
 
 static struct phy_driver bcm7xxx_driver[] = {
BCM7XXX_28NM_GPHY(PHY_ID_BCM7250, "Broadcom BCM7250"),
+   BCM7XXX_28NM_EPHY(PHY_ID_BCM7255, "Broadcom BCM7255"),
BCM7XXX_28NM_EPHY(PHY_ID_BCM7260, "Broadcom BCM7260"),
BCM7XXX_28NM_EPHY(PHY_ID_BCM7268, "Broadcom BCM7268"),
BCM7XXX_28NM_EPHY(PHY_ID_BCM7271, "Broadcom BCM7271"),
@@ -670,6 +671,7 @@ static struct phy_driver bcm7xxx_driver[] = {
 
 static struct mdio_device_id __maybe_unused bcm7xxx_tbl[] = {
{ PHY_ID_BCM7250, 0xfff0, },
+   { PHY_ID_BCM7255, 0xfff0, },
{ PHY_ID_BCM7260, 0xfff0, },
{ PHY_ID_BCM7268, 0xfff0, },
{ PHY_ID_BCM7271, 0xfff0, },
diff --git a/include/linux/brcmphy.h b/include/linux/brcmphy.h
index 949e9af8d9d6..9cd00a37b8d3 100644
--- a/include/linux/brcmphy.h
+++ b/include/linux/brcmphy.h
@@ -28,6 +28,7 @@
 #define PHY_ID_BCM896100x03625cd0
 
 #define PHY_ID_BCM7250 0xae025280
+#define PHY_ID_BCM7255 0xae025120
 #define PHY_ID_BCM7260 0xae025190
 #define PHY_ID_BCM7268 0xae025090
 #define PHY_ID_BCM7271 0xae0253b0
-- 
2.17.1



Re: [PATCH RFC net-next 0/3] net: phy: sfp: Warn when using generic PHY driver

2018-11-06 Thread Florian Fainelli
On 11/6/18 4:03 PM, Russell King - ARM Linux wrote:
> On Tue, Nov 06, 2018 at 03:38:44PM -0800, David Miller wrote:
>> From: Florian Fainelli 
>> Date: Tue,  6 Nov 2018 15:29:10 -0800
>>
>>> This patch series allows warning an user that the generic PHY driver(s)
>>> are used when a SFP incorporates a PHY (e.g: 1000BaseT SFP) which is
>>> likely not going to work at all.
>>>
>>> Let me know if you would want to do that differently.
>>
>> Is there ever a possibility that the generic PHY driver could work
>> in an SFP situation?
> 
> I don't yet see the reason for Florian's patch series - all the Marvell
> 88e based modules I have, or have come across in information from
> manufacturers self-configure themselves and don't really need the
> Marvell 1G PHY driver.
> 
> For example, the Source Photonics were offering a range of 1GbaseT
> modules with the 88e programmed in different modes, but published
> instructions for the register accesses to configure them differently
> (eg, SGMII vs 1000base-X interface facing the MAC).  Depending on
> the module part number determines which mode the PHY has been
> programmed to come up in.
> 
> So in theory, you don't need any PHY driver for these modules - but
> it's useful to have a functional PHY driver to be able to read out
> the negotiated flow control results.
> 
> I'd like more information from Florian about the reasoning behind
> this patch series before it's merged.
> 

The module that I am using [1] would not work, as in , no link up being
reported without turning on the Marvell PHY driver:

https://www.amazon.com/dp/B01LW2P72V/ref=twister_B07F3WQJQX?_encoding=UTF8=1

this module uses a 88E PHY as well (OUI: 0x01410cc2).
-- 
Florian


Re: [PATCH RFC net-next 0/3] net: phy: sfp: Warn when using generic PHY driver

2018-11-06 Thread Florian Fainelli
On 11/6/18 3:38 PM, David Miller wrote:
> From: Florian Fainelli 
> Date: Tue,  6 Nov 2018 15:29:10 -0800
> 
>> This patch series allows warning an user that the generic PHY driver(s)
>> are used when a SFP incorporates a PHY (e.g: 1000BaseT SFP) which is
>> likely not going to work at all.
>>
>> Let me know if you would want to do that differently.
> 
> Is there ever a possibility that the generic PHY driver could work
> in an SFP situation?

Given the PHY has to operate in SGMII mode, I doubt it could work
without a specialized driver, Andrew, Russell, would you concur?

> 
> If not, yes emit the message but also fail the load and registry too
> perhaps?
> 

I was not sure this would be acceptable, but it is definitively an easy
change.
-- 
Florian


[PATCH RFC net-next 2/3] net: phy: sfp: Issue warning when using Generic PHY driver(s)

2018-11-06 Thread Florian Fainelli
1000BaseT SFP modules typically include an Ethernet PHY device, and
while the Generic PHY driver will be able to bind to it, it usually will
not work at all without a specialized PHY driver. Issue a warning in
that case to help toubleshoot things.

Signed-off-by: Florian Fainelli 
---
 drivers/net/phy/sfp.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/phy/sfp.c b/drivers/net/phy/sfp.c
index fd8bb998ae52..228205d8ce84 100644
--- a/drivers/net/phy/sfp.c
+++ b/drivers/net/phy/sfp.c
@@ -1203,6 +1203,9 @@ static void sfp_sm_probe_phy(struct sfp *sfp)
}
 
sfp->mod_phy = phy;
+   if (phy_driver_is_genphy(phy) || phy_driver_is_genphy_10g(phy))
+   dev_warn(sfp->dev, "Using Generic PHY driver with a SFP!\n");
+
phy_start(phy);
 }
 
-- 
2.17.1



[PATCH RFC net-next 1/3] net: phy: Add helpers to determine if PHY driver is generic

2018-11-06 Thread Florian Fainelli
We are already checking in phy_detach() that the PHY driver is of
generic kind (1G or 10G) and we are going to make use of that in the SFP
layer as well for 1000BaseT SFP modules, so expose helper functions to
return that information.

Signed-off-by: Florian Fainelli 
---
 drivers/net/phy/phy_device.c | 34 --
 include/linux/phy.h  |  3 +++
 2 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index ab33d1777132..15de7a3263bf 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -1262,6 +1262,36 @@ struct phy_device *phy_attach(struct net_device *dev, 
const char *bus_id,
 }
 EXPORT_SYMBOL(phy_attach);
 
+static bool phy_driver_is_genphy_kind(struct phy_device *phydev,
+ struct device_driver *driver)
+{
+   struct device *d = >mdio.dev;
+   bool ret = false;
+
+   if (!phydev->drv)
+   return ret;
+
+   get_device(d);
+   ret = d->driver == driver;
+   put_device(d);
+
+   return ret;
+}
+
+bool phy_driver_is_genphy(struct phy_device *phydev)
+{
+   return phy_driver_is_genphy_kind(phydev,
+_driver.mdiodrv.driver);
+}
+EXPORT_SYMBOL_GPL(phy_driver_is_genphy);
+
+bool phy_driver_is_genphy_10g(struct phy_device *phydev)
+{
+   return phy_driver_is_genphy_kind(phydev,
+_10g_driver.mdiodrv.driver);
+}
+EXPORT_SYMBOL_GPL(phy_driver_is_genphy_10g);
+
 /**
  * phy_detach - detach a PHY device from its network device
  * @phydev: target phy_device struct
@@ -1293,8 +1323,8 @@ void phy_detach(struct phy_device *phydev)
 * from the generic driver so that there's a chance a
 * real driver could be loaded
 */
-   if (phydev->mdio.dev.driver == _10g_driver.mdiodrv.driver ||
-   phydev->mdio.dev.driver == _driver.mdiodrv.driver)
+   if (phy_driver_is_genphy(phydev) ||
+   phy_driver_is_genphy_10g(phydev))
device_release_driver(>mdio.dev);
 
/*
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 3ea87f774a76..84a6c7efef60 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -1192,4 +1192,7 @@ module_exit(phy_module_exit)
 #define module_phy_driver(__phy_drivers)   \
phy_module_driver(__phy_drivers, ARRAY_SIZE(__phy_drivers))
 
+bool phy_driver_is_genphy(struct phy_device *phydev);
+bool phy_driver_is_genphy_10g(struct phy_device *phydev);
+
 #endif /* __PHY_H */
-- 
2.17.1



[PATCH RFC net-next 3/3] net: phy: Default MARVELL_PHY to the value of SFP

2018-11-06 Thread Florian Fainelli
Marvell PHYs are typically found in 1000BaseT SFP modules, so give a
chance for users to get the correct PHY driver when using SFP modules.

Signed-off-by: Florian Fainelli 
---
 drivers/net/phy/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
index 3d187cd50eb0..cf7d44ba20c5 100644
--- a/drivers/net/phy/Kconfig
+++ b/drivers/net/phy/Kconfig
@@ -350,6 +350,7 @@ config LXT_PHY
 
 config MARVELL_PHY
tristate "Marvell PHYs"
+   default SFP
---help---
  Currently has a driver for the 88E1011S
 
-- 
2.17.1



[PATCH RFC net-next 0/3] net: phy: sfp: Warn when using generic PHY driver

2018-11-06 Thread Florian Fainelli
Hi all,

This patch series allows warning an user that the generic PHY driver(s)
are used when a SFP incorporates a PHY (e.g: 1000BaseT SFP) which is
likely not going to work at all.

Let me know if you would want to do that differently.

Florian Fainelli (3):
  net: phy: Add helpers to determine if PHY driver is generic
  net: phy: sfp: Issue warning when using Generic PHY driver(s)
  net: phy: Default MARVELL_PHY to the value of SFP

 drivers/net/phy/Kconfig  |  1 +
 drivers/net/phy/phy_device.c | 34 --
 drivers/net/phy/sfp.c|  3 +++
 include/linux/phy.h  |  3 +++
 4 files changed, 39 insertions(+), 2 deletions(-)

-- 
2.17.1



[PATCH net-next 2/3] net: systemport: Simplify queue mapping logic

2018-11-06 Thread Florian Fainelli
The use of a bitmap speeds up the finding of the first available queue
to which we could start establishing the mapping for, but we still have
to loop over all slave network devices to set them up. Simplify the
logic to have a single loop, and use the fact that a correctly
configured ring has inspect set to true. This will make things simpler
to unwind during device unregistration.

Signed-off-by: Florian Fainelli 
---
 drivers/net/ethernet/broadcom/bcmsysport.c | 17 +
 drivers/net/ethernet/broadcom/bcmsysport.h |  1 -
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c 
b/drivers/net/ethernet/broadcom/bcmsysport.c
index 0e2d99c737e3..f620c647bb86 100644
--- a/drivers/net/ethernet/broadcom/bcmsysport.c
+++ b/drivers/net/ethernet/broadcom/bcmsysport.c
@@ -2312,7 +2312,7 @@ static int bcm_sysport_map_queues(struct notifier_block 
*nb,
struct bcm_sysport_priv *priv;
struct net_device *slave_dev;
unsigned int num_tx_queues;
-   unsigned int q, start, port;
+   unsigned int q, qp, port;
struct net_device *dev;
 
priv = container_of(nb, struct bcm_sysport_priv, dsa_notifier);
@@ -2351,20 +2351,21 @@ static int bcm_sysport_map_queues(struct notifier_block 
*nb,
 
priv->per_port_num_tx_queues = num_tx_queues;
 
-   start = find_first_zero_bit(>queue_bitmap, dev->num_tx_queues);
-   for (q = 0; q < num_tx_queues; q++) {
-   ring = >tx_rings[q + start];
+   for (q = 0, qp = 0; q < dev->num_tx_queues && qp < num_tx_queues;
+q++) {
+   ring = >tx_rings[q];
+
+   if (ring->inspect)
+   continue;
 
/* Just remember the mapping actual programming done
 * during bcm_sysport_init_tx_ring
 */
-   ring->switch_queue = q;
+   ring->switch_queue = qp;
ring->switch_port = port;
ring->inspect = true;
priv->ring_map[q + port * num_tx_queues] = ring;
-
-   /* Set all queues as being used now */
-   set_bit(q + start, >queue_bitmap);
+   qp++;
}
 
return 0;
diff --git a/drivers/net/ethernet/broadcom/bcmsysport.h 
b/drivers/net/ethernet/broadcom/bcmsysport.h
index a7a230884a87..94d64b203098 100644
--- a/drivers/net/ethernet/broadcom/bcmsysport.h
+++ b/drivers/net/ethernet/broadcom/bcmsysport.h
@@ -795,7 +795,6 @@ struct bcm_sysport_priv {
/* map information between switch port queues and local queues */
struct notifier_block   dsa_notifier;
unsigned intper_port_num_tx_queues;
-   unsigned long   queue_bitmap;
struct bcm_sysport_tx_ring *ring_map[DSA_MAX_PORTS * 8];
 
 };
-- 
2.17.1



[PATCH net-next 3/3] net: systemport: Unmap queues upon DSA unregister event

2018-11-06 Thread Florian Fainelli
Binding and unbinding the switch driver which creates the DSA slave
network devices for which we set-up inspection would lead to
undesireable effects since we were not clearing the port/queue mapping
to the SYSTEMPORT TX queue.

Signed-off-by: Florian Fainelli 
---
 drivers/net/ethernet/broadcom/bcmsysport.c | 56 +++---
 1 file changed, 50 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c 
b/drivers/net/ethernet/broadcom/bcmsysport.c
index f620c647bb86..f8f0a027b3ae 100644
--- a/drivers/net/ethernet/broadcom/bcmsysport.c
+++ b/drivers/net/ethernet/broadcom/bcmsysport.c
@@ -2371,17 +2371,61 @@ static int bcm_sysport_map_queues(struct notifier_block 
*nb,
return 0;
 }
 
+static int bcm_sysport_unmap_queues(struct notifier_block *nb,
+   struct dsa_notifier_register_info *info)
+{
+   struct bcm_sysport_tx_ring *ring;
+   struct bcm_sysport_priv *priv;
+   struct net_device *slave_dev;
+   unsigned int num_tx_queues;
+   struct net_device *dev;
+   unsigned int q, port;
+
+   priv = container_of(nb, struct bcm_sysport_priv, dsa_notifier);
+   if (priv->netdev != info->master)
+   return 0;
+
+   dev = info->master;
+
+   if (dev->netdev_ops != _sysport_netdev_ops)
+   return 0;
+
+   port = info->port_number;
+   slave_dev = info->info.dev;
+
+   num_tx_queues = slave_dev->real_num_tx_queues;
+
+   for (q = 0; q < dev->num_tx_queues; q++) {
+   ring = >tx_rings[q];
+
+   if (ring->switch_port != port)
+   continue;
+
+   if (!ring->inspect)
+   continue;
+
+   ring->inspect = false;
+   priv->ring_map[q + port * num_tx_queues] = NULL;
+   }
+
+   return 0;
+}
+
 static int bcm_sysport_dsa_notifier(struct notifier_block *nb,
unsigned long event, void *ptr)
 {
-   struct dsa_notifier_register_info *info;
+   int ret = NOTIFY_DONE;
 
-   if (event != DSA_PORT_REGISTER)
-   return NOTIFY_DONE;
-
-   info = ptr;
+   switch (event) {
+   case DSA_PORT_REGISTER:
+   ret = bcm_sysport_map_queues(nb, ptr);
+   break;
+   case DSA_PORT_UNREGISTER:
+   ret = bcm_sysport_unmap_queues(nb, ptr);
+   break;
+   }
 
-   return notifier_from_errno(bcm_sysport_map_queues(nb, info));
+   return notifier_from_errno(ret);
 }
 
 #define REV_FMT"v%2x.%02x"
-- 
2.17.1



[PATCH net-next 1/3] net: dsa: bcm_sf2: Turn on PHY to allow successful registration

2018-11-06 Thread Florian Fainelli
We are binding to the PHY using the SF2 slave MDIO bus that we create,
binding involves reading the PHY's MII_PHYSID1/2 which won't be possible
if the PHY is turned off. Temporarily turn it on/off for the bus probing
to succeeed. This fixes unbind/bind problems where the port connecting
to that PHY would be in error since it could not connect to it.

Signed-off-by: Florian Fainelli 
---
 drivers/net/dsa/bcm_sf2.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/dsa/bcm_sf2.c b/drivers/net/dsa/bcm_sf2.c
index 2eb68769562c..2c664aac1e7b 100644
--- a/drivers/net/dsa/bcm_sf2.c
+++ b/drivers/net/dsa/bcm_sf2.c
@@ -1090,12 +1090,16 @@ static int bcm_sf2_sw_probe(struct platform_device 
*pdev)
return ret;
}
 
+   bcm_sf2_gphy_enable_set(priv->dev->ds, true);
+
ret = bcm_sf2_mdio_register(ds);
if (ret) {
pr_err("failed to register MDIO bus\n");
return ret;
}
 
+   bcm_sf2_gphy_enable_set(priv->dev->ds, false);
+
ret = bcm_sf2_cfp_rst(priv);
if (ret) {
pr_err("failed to reset CFP\n");
-- 
2.17.1



[PATCH net-next 0/3] net: systemport: Unmap queues upon DSA unregister event

2018-11-06 Thread Florian Fainelli
Hi all,

This patch series fixes the unbinding/binding of the bcm_sf2 switch
driver along with bcmsysport which monitors the switch port queues.
Because the driver was not processing the DSA_PORT_UNREGISTER event, we
would not be unmapping switch port/queues, which could cause incorrect
decisions to be made by the HW (e.g: queue always back-pressured).

Florian Fainelli (3):
  net: dsa: bcm_sf2: Turn on PHY to allow successful registration
  net: systemport: Simplify queue mapping logic
  net: systemport: Unmap queues upon DSA unregister event

 drivers/net/dsa/bcm_sf2.c  |  4 ++
 drivers/net/ethernet/broadcom/bcmsysport.c | 71 ++
 drivers/net/ethernet/broadcom/bcmsysport.h |  1 -
 3 files changed, 62 insertions(+), 14 deletions(-)

-- 
2.17.1



Re: [PATCH net-next 0/5] net: dsa: bcm_sf2: Store rules in lists

2018-11-06 Thread Florian Fainelli
On 11/6/18 3:06 PM, David Miller wrote:
> From: Florian Fainelli 
> Date: Tue,  6 Nov 2018 12:58:36 -0800
> 
>> Hi all,
>>
>> This patch series changes the bcm-sf2 driver to keep a copy of the
>> inserted rules as opposed to using the HW as a storage area for a number
>> of reasons:
>>
>> - this helps us with doing duplicate rule detection in a faster way, it
>>   would have required a full rule read before
>>
>> - this helps with Pablo's on-going work to convert ethtool_rx_flow_spec
>>   to a more generic flow rule structure by having fewer code paths to
>>   convert to the new structure/helpers
>>
>> - we need to cache copies to restore them during drive resumption,
>>   because depending on the low power mode the system has entered, the
>>   switch may have lost all of its context
> 
> Looks good to me, series applied and build testing right now.
> 
> I will say that the ethtool flow spec comparison should probably
> eventually be broken out into a helper function places somewhere
> common.  It's very likely this approach, and thus the helper, can
> be used by other drivers in a similar situation.

Sure, that could be done, I will check with Pablo how he wants to
approach that as well since he is reworking how flow rules are
represented. Thanks!
-- 
Florian


[PATCH net-next 4/5] net: dsa: bcm_sf2: Get rid of unmarshalling functions

2018-11-06 Thread Florian Fainelli
Now that we have migrated the CFP rule handling to a list with a
software copy, the delete/get operation just returns what is on the
list, no need to read from the hardware which is both slow and more
error prone.

Signed-off-by: Florian Fainelli 
---
 drivers/net/dsa/bcm_sf2_cfp.c | 310 --
 1 file changed, 310 deletions(-)

diff --git a/drivers/net/dsa/bcm_sf2_cfp.c b/drivers/net/dsa/bcm_sf2_cfp.c
index 5034e4f56fd5..8f734abde7b3 100644
--- a/drivers/net/dsa/bcm_sf2_cfp.c
+++ b/drivers/net/dsa/bcm_sf2_cfp.c
@@ -974,316 +974,6 @@ static void bcm_sf2_invert_masks(struct 
ethtool_rx_flow_spec *flow)
flow->m_ext.data[1] ^= cpu_to_be32(~0);
 }
 
-static int __maybe_unused bcm_sf2_cfp_unslice_ipv4(struct bcm_sf2_priv *priv,
-  struct ethtool_tcpip4_spec 
*v4_spec,
-  bool mask)
-{
-   u32 reg, offset, ipv4;
-   u16 src_dst_port;
-
-   if (mask)
-   offset = CORE_CFP_MASK_PORT(3);
-   else
-   offset = CORE_CFP_DATA_PORT(3);
-
-   reg = core_readl(priv, offset);
-   /* src port [15:8] */
-   src_dst_port = reg << 8;
-
-   if (mask)
-   offset = CORE_CFP_MASK_PORT(2);
-   else
-   offset = CORE_CFP_DATA_PORT(2);
-
-   reg = core_readl(priv, offset);
-   /* src port [7:0] */
-   src_dst_port |= (reg >> 24);
-
-   v4_spec->pdst = cpu_to_be16(src_dst_port);
-   v4_spec->psrc = cpu_to_be16((u16)(reg >> 8));
-
-   /* IPv4 dst [15:8] */
-   ipv4 = (reg & 0xff) << 8;
-
-   if (mask)
-   offset = CORE_CFP_MASK_PORT(1);
-   else
-   offset = CORE_CFP_DATA_PORT(1);
-
-   reg = core_readl(priv, offset);
-   /* IPv4 dst [31:16] */
-   ipv4 |= ((reg >> 8) & 0x) << 16;
-   /* IPv4 dst [7:0] */
-   ipv4 |= (reg >> 24) & 0xff;
-   v4_spec->ip4dst = cpu_to_be32(ipv4);
-
-   /* IPv4 src [15:8] */
-   ipv4 = (reg & 0xff) << 8;
-
-   if (mask)
-   offset = CORE_CFP_MASK_PORT(0);
-   else
-   offset = CORE_CFP_DATA_PORT(0);
-   reg = core_readl(priv, offset);
-
-   /* Once the TCAM is programmed, the mask reflects the slice number
-* being matched, don't bother checking it when reading back the
-* mask spec
-*/
-   if (!mask && !(reg & SLICE_VALID))
-   return -EINVAL;
-
-   /* IPv4 src [7:0] */
-   ipv4 |= (reg >> 24) & 0xff;
-   /* IPv4 src [31:16] */
-   ipv4 |= ((reg >> 8) & 0x) << 16;
-   v4_spec->ip4src = cpu_to_be32(ipv4);
-
-   return 0;
-}
-
-static int bcm_sf2_cfp_ipv4_rule_get(struct bcm_sf2_priv *priv, int port,
-struct ethtool_rx_flow_spec *fs)
-{
-   struct ethtool_tcpip4_spec *v4_spec = NULL, *v4_m_spec = NULL;
-   u32 reg;
-   int ret;
-
-   reg = core_readl(priv, CORE_CFP_DATA_PORT(6));
-
-   switch ((reg & IPPROTO_MASK) >> IPPROTO_SHIFT) {
-   case IPPROTO_TCP:
-   fs->flow_type = TCP_V4_FLOW;
-   v4_spec = >h_u.tcp_ip4_spec;
-   v4_m_spec = >m_u.tcp_ip4_spec;
-   break;
-   case IPPROTO_UDP:
-   fs->flow_type = UDP_V4_FLOW;
-   v4_spec = >h_u.udp_ip4_spec;
-   v4_m_spec = >m_u.udp_ip4_spec;
-   break;
-   default:
-   return -EINVAL;
-   }
-
-   fs->m_ext.data[0] = cpu_to_be32((reg >> IP_FRAG_SHIFT) & 1);
-   v4_spec->tos = (reg >> IPTOS_SHIFT) & IPTOS_MASK;
-
-   ret = bcm_sf2_cfp_unslice_ipv4(priv, v4_spec, false);
-   if (ret)
-   return ret;
-
-   return bcm_sf2_cfp_unslice_ipv4(priv, v4_m_spec, true);
-}
-
-static int __maybe_unused bcm_sf2_cfp_unslice_ipv6(struct bcm_sf2_priv *priv,
-  __be32 *ip6_addr,
-  __be16 *port,
-  bool mask)
-{
-   u32 reg, tmp, offset;
-
-   /* C-Tag[31:24]
-* UDF_n_B8 [23:8] (port)
-* UDF_n_B7 (upper) [7:0] (addr[15:8])
-*/
-   if (mask)
-   offset = CORE_CFP_MASK_PORT(4);
-   else
-   offset = CORE_CFP_DATA_PORT(4);
-   reg = core_readl(priv, offset);
-   *port = cpu_to_be32(reg) >> 8;
-   tmp = (u32)(reg & 0xff) << 8;
-
-   /* UDF_n_B7 (lower) [31:24] (addr[7:0])
-* UDF_n_B6 [23:8] (addr[31:16])
-* UDF_n_B5 (upper) [7:0] (addr[47:40])
-*/
-   if (mask)
-   offset = CORE_CFP_MASK_PORT(3);
-   else
-   offset = CORE_CFP_DATA_PORT(3);
-   

[PATCH net-next 5/5] net: systemport: Restore Broadcom tag match filters upon resume

2018-11-06 Thread Florian Fainelli
Some of the system suspend states that we support wipe out entirely the
HW contents. If we had a Wake-on-LAN filter programmed prior to going
into suspend, but we did not actually wake-up from Wake-on-LAN and
instead used a deeper suspend state, make sure we restore the CID number
that we need to match against.

Signed-off-by: Florian Fainelli 
---
 drivers/net/ethernet/broadcom/bcmsysport.c | 12 
 drivers/net/ethernet/broadcom/bcmsysport.h |  1 +
 2 files changed, 13 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c 
b/drivers/net/ethernet/broadcom/bcmsysport.c
index 0e2d99c737e3..2e60dda32adc 100644
--- a/drivers/net/ethernet/broadcom/bcmsysport.c
+++ b/drivers/net/ethernet/broadcom/bcmsysport.c
@@ -1068,6 +1068,7 @@ static void mpd_enable_set(struct bcm_sysport_priv *priv, 
bool enable)
 
 static void bcm_sysport_resume_from_wol(struct bcm_sysport_priv *priv)
 {
+   unsigned int index;
u32 reg;
 
/* Disable RXCHK, active filters and Broadcom tag matching */
@@ -1076,6 +1077,15 @@ static void bcm_sysport_resume_from_wol(struct 
bcm_sysport_priv *priv)
 RXCHK_BRCM_TAG_MATCH_SHIFT | RXCHK_EN | RXCHK_BRCM_TAG_EN);
rxchk_writel(priv, reg, RXCHK_CONTROL);
 
+   /* Make sure we restore correct CID index in case HW lost
+* its context during deep idle state
+*/
+   for_each_set_bit(index, priv->filters, RXCHK_BRCM_TAG_MAX) {
+   rxchk_writel(priv, priv->filters_loc[index] <<
+RXCHK_BRCM_TAG_CID_SHIFT, RXCHK_BRCM_TAG(index));
+   rxchk_writel(priv, 0xff00, RXCHK_BRCM_TAG_MASK(index));
+   }
+
/* Clear the MagicPacket detection logic */
mpd_enable_set(priv, false);
 
@@ -2189,6 +2199,7 @@ static int bcm_sysport_rule_set(struct bcm_sysport_priv 
*priv,
rxchk_writel(priv, reg, RXCHK_BRCM_TAG(index));
rxchk_writel(priv, 0xff00, RXCHK_BRCM_TAG_MASK(index));
 
+   priv->filters_loc[index] = nfc->fs.location;
set_bit(index, priv->filters);
 
return 0;
@@ -2208,6 +2219,7 @@ static int bcm_sysport_rule_del(struct bcm_sysport_priv 
*priv,
 * be taken care of during suspend time by bcm_sysport_suspend_to_wol
 */
clear_bit(index, priv->filters);
+   priv->filters_loc[index] = 0;
 
return 0;
 }
diff --git a/drivers/net/ethernet/broadcom/bcmsysport.h 
b/drivers/net/ethernet/broadcom/bcmsysport.h
index a7a230884a87..7a0b7bfedd19 100644
--- a/drivers/net/ethernet/broadcom/bcmsysport.h
+++ b/drivers/net/ethernet/broadcom/bcmsysport.h
@@ -786,6 +786,7 @@ struct bcm_sysport_priv {
/* Ethtool */
u32 msg_enable;
DECLARE_BITMAP(filters, RXCHK_BRCM_TAG_MAX);
+   u32 filters_loc[RXCHK_BRCM_TAG_MAX];
 
struct bcm_sysport_stats64  stats64;
 
-- 
2.17.1



[PATCH net-next 2/5] net: dsa: bcm_sf2: Split rule handling from HW operation

2018-11-06 Thread Florian Fainelli
In preparation for restoring CFP rules during system wide system
suspend/resume where the hardware loses its context, split the rule
validation from its actual insertion as well as the rule removal from
its actual hardware deletion operation.

Signed-off-by: Florian Fainelli 
---
 drivers/net/dsa/bcm_sf2_cfp.c | 89 +--
 1 file changed, 54 insertions(+), 35 deletions(-)

diff --git a/drivers/net/dsa/bcm_sf2_cfp.c b/drivers/net/dsa/bcm_sf2_cfp.c
index 29b6b4204662..396bfa43c2e1 100644
--- a/drivers/net/dsa/bcm_sf2_cfp.c
+++ b/drivers/net/dsa/bcm_sf2_cfp.c
@@ -789,32 +789,14 @@ static int bcm_sf2_cfp_ipv6_rule_set(struct bcm_sf2_priv 
*priv, int port,
return ret;
 }
 
-static int bcm_sf2_cfp_rule_set(struct dsa_switch *ds, int port,
-   struct ethtool_rx_flow_spec *fs)
+static int bcm_sf2_cfp_rule_insert(struct dsa_switch *ds, int port,
+  struct ethtool_rx_flow_spec *fs)
 {
struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds);
s8 cpu_port = ds->ports[port].cpu_dp->index;
__u64 ring_cookie = fs->ring_cookie;
unsigned int queue_num, port_num;
-   struct cfp_rule *rule = NULL;
-   int ret = -EINVAL;
-
-   /* Check for unsupported extensions */
-   if ((fs->flow_type & FLOW_EXT) && (fs->m_ext.vlan_etype ||
-fs->m_ext.data[1]))
-   return -EINVAL;
-
-   if (fs->location != RX_CLS_LOC_ANY &&
-   test_bit(fs->location, priv->cfp.used))
-   return -EBUSY;
-
-   if (fs->location != RX_CLS_LOC_ANY &&
-   fs->location > bcm_sf2_cfp_rule_size(priv))
-   return -EINVAL;
-
-   ret = bcm_sf2_cfp_rule_cmp(priv, port, fs);
-   if (ret == 0)
-   return -EEXIST;
+   int ret;
 
/* This rule is a Wake-on-LAN filter and we must specifically
 * target the CPU port in order for it to be working.
@@ -841,10 +823,6 @@ static int bcm_sf2_cfp_rule_set(struct dsa_switch *ds, int 
port,
if (port_num >= 7)
port_num -= 1;
 
-   rule = kzalloc(sizeof(*rule), GFP_KERNEL);
-   if (!rule)
-   return -ENOMEM;
-
switch (fs->flow_type & ~FLOW_EXT) {
case TCP_V4_FLOW:
case UDP_V4_FLOW:
@@ -861,6 +839,38 @@ static int bcm_sf2_cfp_rule_set(struct dsa_switch *ds, int 
port,
break;
}
 
+   return ret;
+}
+
+static int bcm_sf2_cfp_rule_set(struct dsa_switch *ds, int port,
+   struct ethtool_rx_flow_spec *fs)
+{
+   struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds);
+   struct cfp_rule *rule = NULL;
+   int ret = -EINVAL;
+
+   /* Check for unsupported extensions */
+   if ((fs->flow_type & FLOW_EXT) && (fs->m_ext.vlan_etype ||
+fs->m_ext.data[1]))
+   return -EINVAL;
+
+   if (fs->location != RX_CLS_LOC_ANY &&
+   test_bit(fs->location, priv->cfp.used))
+   return -EBUSY;
+
+   if (fs->location != RX_CLS_LOC_ANY &&
+   fs->location > bcm_sf2_cfp_rule_size(priv))
+   return -EINVAL;
+
+   ret = bcm_sf2_cfp_rule_cmp(priv, port, fs);
+   if (ret == 0)
+   return -EEXIST;
+
+   rule = kzalloc(sizeof(*rule), GFP_KERNEL);
+   if (!rule)
+   return -ENOMEM;
+
+   ret = bcm_sf2_cfp_rule_insert(ds, port, fs);
if (ret) {
kfree(rule);
return ret;
@@ -910,13 +920,28 @@ static int bcm_sf2_cfp_rule_del_one(struct bcm_sf2_priv 
*priv, int port,
return 0;
 }
 
-static int bcm_sf2_cfp_rule_del(struct bcm_sf2_priv *priv, int port,
-   u32 loc)
+static int bcm_sf2_cfp_rule_remove(struct bcm_sf2_priv *priv, int port,
+  u32 loc)
 {
-   struct cfp_rule *rule;
u32 next_loc = 0;
int ret;
 
+   ret = bcm_sf2_cfp_rule_del_one(priv, port, loc, _loc);
+   if (ret)
+   return ret;
+
+   /* If this was an IPv6 rule, delete is companion rule too */
+   if (next_loc)
+   ret = bcm_sf2_cfp_rule_del_one(priv, port, next_loc, NULL);
+
+   return ret;
+}
+
+static int bcm_sf2_cfp_rule_del(struct bcm_sf2_priv *priv, int port, u32 loc)
+{
+   struct cfp_rule *rule;
+   int ret;
+
/* Refuse deleting unused rules, and those that are not unique since
 * that could leave IPv6 rules with one of the chained rule in the
 * table.
@@ -928,13 +953,7 @@ static int bcm_sf2_cfp_rule_del(struct bcm_sf2_priv *priv, 
int port,
if (!rule)
return -EINVAL;
 
-   ret = bcm_sf2_cfp_rule_del_one(priv, port, loc, _loc);
-   if (ret)
-   return ret;
-
-   /* If this was an IPv6 rule, delete is companion rule too */
-   if (next_

[PATCH net-next 3/5] net: dsa: bcm_sf2: Restore CFP rules during system resume

2018-11-06 Thread Florian Fainelli
The hardware can lose its context during system suspend, and depending
on the switch generation (7445 vs. 7278), while the rules are still
there, they will have their valid bit cleared (because that's the
fastest way for the HW to reset things). Just make sure we re-apply them
coming back from resume. The 7445 switch is an older version of the core
that has some quirky RAM technology requiring a delete then re-inser to
guarantee the RAM entries are properly latched.

Signed-off-by: Florian Fainelli 
---
 drivers/net/dsa/bcm_sf2.c |  4 
 drivers/net/dsa/bcm_sf2.h |  1 +
 drivers/net/dsa/bcm_sf2_cfp.c | 36 +++
 3 files changed, 41 insertions(+)

diff --git a/drivers/net/dsa/bcm_sf2.c b/drivers/net/dsa/bcm_sf2.c
index 89722dbfafb8..d8b93043b789 100644
--- a/drivers/net/dsa/bcm_sf2.c
+++ b/drivers/net/dsa/bcm_sf2.c
@@ -710,6 +710,10 @@ static int bcm_sf2_sw_resume(struct dsa_switch *ds)
return ret;
}
 
+   ret = bcm_sf2_cfp_resume(ds);
+   if (ret)
+   return ret;
+
if (priv->hw_params.num_gphy == 1)
bcm_sf2_gphy_enable_set(ds, true);
 
diff --git a/drivers/net/dsa/bcm_sf2.h b/drivers/net/dsa/bcm_sf2.h
index 03444982c25e..faaef320ec48 100644
--- a/drivers/net/dsa/bcm_sf2.h
+++ b/drivers/net/dsa/bcm_sf2.h
@@ -215,5 +215,6 @@ int bcm_sf2_set_rxnfc(struct dsa_switch *ds, int port,
  struct ethtool_rxnfc *nfc);
 int bcm_sf2_cfp_rst(struct bcm_sf2_priv *priv);
 void bcm_sf2_cfp_exit(struct dsa_switch *ds);
+int bcm_sf2_cfp_resume(struct dsa_switch *ds);
 
 #endif /* __BCM_SF2_H */
diff --git a/drivers/net/dsa/bcm_sf2_cfp.c b/drivers/net/dsa/bcm_sf2_cfp.c
index 396bfa43c2e1..5034e4f56fd5 100644
--- a/drivers/net/dsa/bcm_sf2_cfp.c
+++ b/drivers/net/dsa/bcm_sf2_cfp.c
@@ -1443,3 +1443,39 @@ void bcm_sf2_cfp_exit(struct dsa_switch *ds)
list_for_each_entry_safe_reverse(rule, n, >cfp.rules_list, next)
bcm_sf2_cfp_rule_del(priv, rule->port, rule->fs.location);
 }
+
+int bcm_sf2_cfp_resume(struct dsa_switch *ds)
+{
+   struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds);
+   struct cfp_rule *rule;
+   int ret = 0;
+   u32 reg;
+
+   if (list_empty(>cfp.rules_list))
+   return ret;
+
+   reg = core_readl(priv, CORE_CFP_CTL_REG);
+   reg &= ~CFP_EN_MAP_MASK;
+   core_writel(priv, reg, CORE_CFP_CTL_REG);
+
+   ret = bcm_sf2_cfp_rst(priv);
+   if (ret)
+   return ret;
+
+   list_for_each_entry(rule, >cfp.rules_list, next) {
+   ret = bcm_sf2_cfp_rule_remove(priv, rule->port,
+ rule->fs.location);
+   if (ret) {
+   dev_err(ds->dev, "failed to remove rule\n");
+   return ret;
+   }
+
+   ret = bcm_sf2_cfp_rule_insert(ds, rule->port, >fs);
+   if (ret) {
+   dev_err(ds->dev, "failed to restore rule\n");
+   return ret;
+   }
+   };
+
+   return ret;
+}
-- 
2.17.1



[PATCH net-next 0/5] net: dsa: bcm_sf2: Store rules in lists

2018-11-06 Thread Florian Fainelli
Hi all,

This patch series changes the bcm-sf2 driver to keep a copy of the
inserted rules as opposed to using the HW as a storage area for a number
of reasons:

- this helps us with doing duplicate rule detection in a faster way, it
  would have required a full rule read before

- this helps with Pablo's on-going work to convert ethtool_rx_flow_spec
  to a more generic flow rule structure by having fewer code paths to
  convert to the new structure/helpers

- we need to cache copies to restore them during drive resumption,
  because depending on the low power mode the system has entered, the
  switch may have lost all of its context

Florian Fainelli (5):
  net: dsa: bcm_sf2: Keep copy of inserted rules
  net: dsa: bcm_sf2: Split rule handling from HW operation
  net: dsa: bcm_sf2: Restore CFP rules during system resume
  net: dsa: bcm_sf2: Get rid of unmarshalling functions
  net: systemport: Restore Broadcom tag match filters upon resume

 drivers/net/dsa/bcm_sf2.c  |   6 +
 drivers/net/dsa/bcm_sf2.h  |   3 +
 drivers/net/dsa/bcm_sf2_cfp.c  | 497 -
 drivers/net/ethernet/broadcom/bcmsysport.c |  12 +
 drivers/net/ethernet/broadcom/bcmsysport.h |   1 +
 5 files changed, 204 insertions(+), 315 deletions(-)

-- 
2.17.1



[PATCH net-next 1/5] net: dsa: bcm_sf2: Keep copy of inserted rules

2018-11-06 Thread Florian Fainelli
We tried hard to use the hardware as a storage area, which made things
needlessly complex in that we had to both marshall and unmarshall the
ethtool_rx_flow_spec into what the CFP hardware understands but it did
not require any driver level allocations, so that was nice.

Keep a copy of the ethtool_rx_flow_spec rule we want to insert, and also
make sure we don't have a duplicate rule already. This greatly speeds up
the deletion time since we only need to clear the slice's valid bit and
not perform a full read.

This is a preparatory step for being able to restore rules upon system
resumption where the hardware loses its context partially or entirely.

Signed-off-by: Florian Fainelli 
---
 drivers/net/dsa/bcm_sf2.c |   2 +
 drivers/net/dsa/bcm_sf2.h |   2 +
 drivers/net/dsa/bcm_sf2_cfp.c | 144 +++---
 3 files changed, 137 insertions(+), 11 deletions(-)

diff --git a/drivers/net/dsa/bcm_sf2.c b/drivers/net/dsa/bcm_sf2.c
index 2eb68769562c..89722dbfafb8 100644
--- a/drivers/net/dsa/bcm_sf2.c
+++ b/drivers/net/dsa/bcm_sf2.c
@@ -1061,6 +1061,7 @@ static int bcm_sf2_sw_probe(struct platform_device *pdev)
spin_lock_init(>indir_lock);
mutex_init(>stats_mutex);
mutex_init(>cfp.lock);
+   INIT_LIST_HEAD(>cfp.rules_list);
 
/* CFP rule #0 cannot be used for specific classifications, flag it as
 * permanently used
@@ -1166,6 +1167,7 @@ static int bcm_sf2_sw_remove(struct platform_device *pdev)
 
priv->wol_ports_mask = 0;
dsa_unregister_switch(priv->dev->ds);
+   bcm_sf2_cfp_exit(priv->dev->ds);
/* Disable all ports and interrupts */
bcm_sf2_sw_suspend(priv->dev->ds);
bcm_sf2_mdio_unregister(priv);
diff --git a/drivers/net/dsa/bcm_sf2.h b/drivers/net/dsa/bcm_sf2.h
index cc31e986e6e3..03444982c25e 100644
--- a/drivers/net/dsa/bcm_sf2.h
+++ b/drivers/net/dsa/bcm_sf2.h
@@ -56,6 +56,7 @@ struct bcm_sf2_cfp_priv {
DECLARE_BITMAP(used, CFP_NUM_RULES);
DECLARE_BITMAP(unique, CFP_NUM_RULES);
unsigned int rules_cnt;
+   struct list_head rules_list;
 };
 
 struct bcm_sf2_priv {
@@ -213,5 +214,6 @@ int bcm_sf2_get_rxnfc(struct dsa_switch *ds, int port,
 int bcm_sf2_set_rxnfc(struct dsa_switch *ds, int port,
  struct ethtool_rxnfc *nfc);
 int bcm_sf2_cfp_rst(struct bcm_sf2_priv *priv);
+void bcm_sf2_cfp_exit(struct dsa_switch *ds);
 
 #endif /* __BCM_SF2_H */
diff --git a/drivers/net/dsa/bcm_sf2_cfp.c b/drivers/net/dsa/bcm_sf2_cfp.c
index 47c5f272a084..29b6b4204662 100644
--- a/drivers/net/dsa/bcm_sf2_cfp.c
+++ b/drivers/net/dsa/bcm_sf2_cfp.c
@@ -20,6 +20,12 @@
 #include "bcm_sf2.h"
 #include "bcm_sf2_regs.h"
 
+struct cfp_rule {
+   int port;
+   struct ethtool_rx_flow_spec fs;
+   struct list_head next;
+};
+
 struct cfp_udf_slice_layout {
u8 slices[UDFS_PER_SLICE];
u32 mask_value;
@@ -515,6 +521,61 @@ static void bcm_sf2_cfp_slice_ipv6(struct bcm_sf2_priv 
*priv,
core_writel(priv, reg, offset);
 }
 
+static struct cfp_rule *bcm_sf2_cfp_rule_find(struct bcm_sf2_priv *priv,
+ int port, u32 location)
+{
+   struct cfp_rule *rule = NULL;
+
+   list_for_each_entry(rule, >cfp.rules_list, next) {
+   if (rule->port == port && rule->fs.location == location)
+   break;
+   };
+
+   return rule;
+}
+
+static int bcm_sf2_cfp_rule_cmp(struct bcm_sf2_priv *priv, int port,
+   struct ethtool_rx_flow_spec *fs)
+{
+   struct cfp_rule *rule = NULL;
+   size_t fs_size = 0;
+   int ret = 1;
+
+   if (list_empty(>cfp.rules_list))
+   return ret;
+
+   list_for_each_entry(rule, >cfp.rules_list, next) {
+   ret = 1;
+   if (rule->port != port)
+   continue;
+
+   if (rule->fs.flow_type != fs->flow_type ||
+   rule->fs.ring_cookie != fs->ring_cookie ||
+   rule->fs.m_ext.data[0] != fs->m_ext.data[0])
+   continue;
+
+   switch (fs->flow_type & ~FLOW_EXT) {
+   case TCP_V6_FLOW:
+   case UDP_V6_FLOW:
+   fs_size = sizeof(struct ethtool_tcpip6_spec);
+   break;
+   case TCP_V4_FLOW:
+   case UDP_V4_FLOW:
+   fs_size = sizeof(struct ethtool_tcpip4_spec);
+   break;
+   default:
+   continue;
+   }
+
+   ret = memcmp(>fs.h_u, >h_u, fs_size);
+   ret |= memcmp(>fs.m_u, >m_u, fs_size);
+   if (ret == 0)
+   break;
+   }
+
+   return ret;
+}
+
 static int bcm_sf2_cfp_ipv6_rule_set(struct bcm_sf2_priv *priv, int port,
 

Re: [PATCH] net: phy: realtek: fix RTL8201F sysfs name

2018-11-05 Thread Florian Fainelli



On 11/4/2018 10:43 AM, Andrew Lunn wrote:
> On Sun, Nov 04, 2018 at 07:02:42PM +0100, Holger Hoffstätte wrote:
>> Since 4.19 the following error in sysfs has appeared when using the
>> r8169 NIC driver:
>>
>> $cd /sys/module/realtek/drivers
>> $ls -l
>> ls: cannot access 'mdio_bus:RTL8201F 10/100Mbps Ethernet': No such file or 
>> directory
>> [..garbled dir entries follow..]
>>
>> Apparently the forward slash in "10/100Mbps Ethernet" is interpreted
>> as directory separator that leads nowhere, and was introduced in commit
>> 513588dd44b ("net: phy: realtek: add RTL8201F phy-id and functions").
>>
>> Fix this by removing the offending slash in the driver name.
>>
>> Other drivers in net/phy seem to have the same problem, but I cannot
>> test/verify them.
>>
>> Signed-off-by: Holger Hoffstätte 
> 
> Fixes:513588dd44b ("net: phy: realtek: add RTL8201F phy-id and functions").
> 
> Reviewed-by: Andrew Lunn 
> 
> David, please apply to net.

We should probably seek a more generic solution within sysfs to deny
specific problematic characters from being used, such as ., .., / etc.
-- 
Florian


[PATCH net 0/2] net: timeout fixes for GENET and SYSTEMPORT

2018-11-01 Thread Florian Fainelli
Hi David,

This patch series fixes occasional transmit timeout around the time
the system goes into suspend. GENET and SYSTEMPORT have nearly the same
logic in that regard and were both affected in the same way.

Please queue up for stable, thanks!

Doug Berger (1):
  net: bcmgenet: protect stop from timeout

Florian Fainelli (1):
  net: systemport: Protect stop from timeout

 drivers/net/ethernet/broadcom/bcmsysport.c | 15 +++
 drivers/net/ethernet/broadcom/genet/bcmgenet.c | 13 +++--
 2 files changed, 14 insertions(+), 14 deletions(-)

-- 
2.17.1



[PATCH net 2/2] net: systemport: Protect stop from timeout

2018-11-01 Thread Florian Fainelli
A timing hazard exists when the network interface is stopped that
allows a watchdog timeout to be processed by a separate core in
parallel. This creates the potential for the timeout handler to
wake the queues while the driver is shutting down, or access
registers after their clocks have been removed.

The more common case is that the watchdog timeout will produce a
warning message which doesn't lead to a crash. The chances of this
are greatly increased by the fact that bcm_sysport_netif_stop stops
the transmit queues which can easily precipitate a watchdog time-
out because of stale trans_start data in the queues.

This commit corrects the behavior by ensuring that the watchdog
timeout is disabled before enterring bcm_sysport_netif_stop. There
are currently only two users of the bcm_sysport_netif_stop function:
close and suspend.

The close case already handles the issue by exiting the RUNNING
state before invoking the driver close service.

The suspend case now performs the netif_device_detach to exit the
PRESENT state before the call to bcm_sysport_netif_stop rather than
after it.

These behaviors prevent any future scheduling of the driver timeout
service during the window. The netif_tx_stop_all_queues function
in bcm_sysport_netif_stop is replaced with netif_tx_disable to ensure
synchronization with any transmit or timeout threads that may
already be executing on other cores.

For symmetry, the netif_device_attach call upon resume is moved to
after the call to bcm_sysport_netif_start. Since it wakes the transmit
queues it is not necessary to invoke netif_tx_start_all_queues from
bcm_sysport_netif_start so it is moved into the driver open service.

Fixes: 40755a0fce17 ("net: systemport: add suspend and resume support")
Fixes: 80105befdb4b ("net: systemport: add Broadcom SYSTEMPORT Ethernet MAC 
driver")
Signed-off-by: Florian Fainelli 
---
 drivers/net/ethernet/broadcom/bcmsysport.c | 15 +++
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c 
b/drivers/net/ethernet/broadcom/bcmsysport.c
index 4122553e224b..0e2d99c737e3 100644
--- a/drivers/net/ethernet/broadcom/bcmsysport.c
+++ b/drivers/net/ethernet/broadcom/bcmsysport.c
@@ -1902,9 +1902,6 @@ static void bcm_sysport_netif_start(struct net_device 
*dev)
intrl2_1_mask_clear(priv, 0x);
else
intrl2_0_mask_clear(priv, INTRL2_0_TDMA_MBDONE_MASK);
-
-   /* Last call before we start the real business */
-   netif_tx_start_all_queues(dev);
 }
 
 static void rbuf_init(struct bcm_sysport_priv *priv)
@@ -2048,6 +2045,8 @@ static int bcm_sysport_open(struct net_device *dev)
 
bcm_sysport_netif_start(dev);
 
+   netif_tx_start_all_queues(dev);
+
return 0;
 
 out_clear_rx_int:
@@ -2071,7 +2070,7 @@ static void bcm_sysport_netif_stop(struct net_device *dev)
struct bcm_sysport_priv *priv = netdev_priv(dev);
 
/* stop all software from updating hardware */
-   netif_tx_stop_all_queues(dev);
+   netif_tx_disable(dev);
napi_disable(>napi);
cancel_work_sync(>dim.dim.work);
phy_stop(dev->phydev);
@@ -2658,12 +2657,12 @@ static int __maybe_unused bcm_sysport_suspend(struct 
device *d)
if (!netif_running(dev))
return 0;
 
+   netif_device_detach(dev);
+
bcm_sysport_netif_stop(dev);
 
phy_suspend(dev->phydev);
 
-   netif_device_detach(dev);
-
/* Disable UniMAC RX */
umac_enable_set(priv, CMD_RX_EN, 0);
 
@@ -2746,8 +2745,6 @@ static int __maybe_unused bcm_sysport_resume(struct 
device *d)
goto out_free_rx_ring;
}
 
-   netif_device_attach(dev);
-
/* RX pipe enable */
topctrl_writel(priv, 0, RX_FLUSH_CNTL);
 
@@ -2788,6 +2785,8 @@ static int __maybe_unused bcm_sysport_resume(struct 
device *d)
 
bcm_sysport_netif_start(dev);
 
+   netif_device_attach(dev);
+
return 0;
 
 out_free_rx_ring:
-- 
2.17.1



Re: [PATCH v2 net] net: dsa: microchip: initialize mutex before use

2018-11-01 Thread Florian Fainelli
On 11/1/18 3:08 PM, tristram...@microchip.com wrote:
> From: Tristram Ha 
> 
> Initialize mutex before use.  Avoid kernel complaint when
> CONFIG_DEBUG_LOCK_ALLOC is enabled.
> 
> Fixes: b987e98e50ab90e5 ("dsa: add DSA switch driver for Microchip KSZ9477")
> Signed-off-by: Tristram Ha 
> Reviewed-by: Pavel Machek 
> Reviewed-by: Andrew Lunn 

Reviewed-by: Florian Fainelli 

> ---
> v2
> - Add endorsements

FWIW, David uses patchwork which automatically collects those tags into
the patch whenver we reply with one of the recognized/supported tag. Thanks!
-- 
Florian


Re: [PATCH net] net: ethernet: cadence: fix socket buffer corruption problem

2018-10-25 Thread Florian Fainelli
On 10/25/18 11:32 AM, David Miller wrote:
> From: 
> Date: Wed, 24 Oct 2018 14:51:23 -0700
> 
>> From: Tristram Ha 
>>
>> Socket buffer is not re-created when headroom is 2 and tailroom is 1.
>>
>> Signed-off-by: Tristram Ha 
> 
> Applied.

No fixes tag?
-- 
Florian


Re: ethernet "bus" number in DTS ?

2018-10-23 Thread Florian Fainelli
On 10/23/18 1:02 PM, Joakim Tjernlund wrote:
> On Tue, 2018-10-23 at 11:20 -0700, Florian Fainelli wrote:
>>
>> On 10/23/18 11:02 AM, Joakim Tjernlund wrote:
>>> On Tue, 2018-10-23 at 10:03 -0700, Florian Fainelli wrote:
>>>>
>>>>
>>>> On 10/23/18 9:49 AM, Joakim Tjernlund wrote:
>>>>> SPI (and others) has a way to define bus number in a aliases:
>>>>>   aliases {
>>>>>   ethernet4 = 
>>>>>   ethernet0 = 
>>>>>   ethernet1 = 
>>>>>   ethernet2 = 
>>>>>   ethernet3 = 
>>>>>   spi0 = 
>>>>>   };
>>>>> The 0 in the spi0 alias will translate to bus num 0 so one can control 
>>>>> the /dev nodes, like /dev/spidev0
>>>>> I am looking for the same for ethernet devices:
>>>>>  ethernet4 =   /* should become eth4 */
>>>>>  ethernet0 =   /* should become eth0 */
>>>>> but I cannot find something like that for eth devices.
>>>>>
>>>>> Could such functionality be added?
>>>>
>>>> It could, do we want and need to, no. You have the Ethernet alias in
>>>> /sys/class/net/*/device/uevent already that would allow you to perform
>>>> that (re)naming in user-space:
>>>>
>>>> # cat /sys/class/net/eth0/device/uevent
>>>> DRIVER=bcmgenet
>>>> OF_NAME=ethernet
>>>> OF_FULLNAME=/rdb/ethernet@f048
>>>> OF_TYPE=network
>>>> OF_COMPATIBLE_0=brcm,genet-v5
>>>> OF_COMPATIBLE_N=1
>>>> OF_ALIAS_0=eth0 <==
>>>> MODALIAS=of:NethernetTnetworkCbrcm,genet-v5
>>>
>>> Yes, one can if one uses udev and can find something to identify the hw I/F 
>>> with, my
>>> cat /sys/class/net/eth0/device/uevent looks like:
>>> DRIVER=fsl_dpa
>>> MODALIAS=platform:dpaa-ethernet
>>
>> Does not dpaa have a notion of Ethernet ports and those should have
>> proper information? Maybe that is part of your problem here, it should
>> have the OF_ALIAS information somehow available.
> 
> I cannot say ATM, but this lack of standard does not make it easier to rename 
> I/F's in udev.
> 
>>
>>> not sure mdev supports this, does it?
>>> Our simple installer FS(initramfs) doesn't have either udev or mdev.
>>
>> I don't know, but you could have a simple shell script that looks at
>> specific network device properties to decide on the naming and call
>> ifrename.
> 
> This reinventing of the wheel is what I am trying to avoid.

Embedded is all about being a special snowflake and re-inventing the
wheel, but having some desktop-like distribution user-space would
certainly allow you to re-invent other parts of the wheel.

> 
>>
>>> I also noted that using status = "disabled" didn't work either to create a 
>>> fix name scheme.
>>> Even worse, all the eth I/F after gets renumbered. It seems to me there
>>> is value in having stability in eth I/F naming at boot.
>>> Then userspace(udev) can rename if need be.
>>>
>>> Sure would like to known more about why this feature is not wanted ?
>>>
>>> I found
>>>   https://patchwork.kernel.org/patch/4122441/
>>> You quote policy as reason but surely it must be better to
>>> have something stable, connected to the hardware name, than semirandom 
>>> naming?
>>
>> If the Device Tree nodes are ordered by ascending base register address,
>> my understanding is that you get the same order as far as
>> platform_device creation goes, this may not be true in the future if Rob
>> decides to randomize that, but AFAICT this is still true. This may not
>> work well with status = disabled properties being inserted here and
>> there, but we have used that here and it has worked for as far as I can
>> remember doing it.
> 
> I recall it is the order in which the eth alias appear that controls the 
> naming,
> not 100% sure though.

Aliases are not looked up at all by the platform bus code other that
with of_get_alias() and friends, it is the order in which the nodes are
declared in the Device Tree, preferably ordered by base address that
dictates the order in which platform devices are created.

> 
>>
>> Second, you might want to name network devices ethX, but what if I want
>> to name them ethernetX or fooX or barX? Should we be accepting a
>> mechanism in the kernel that would allow someone to name t

Re: ethernet "bus" number in DTS ?

2018-10-23 Thread Florian Fainelli
On 10/23/18 11:02 AM, Joakim Tjernlund wrote:
> On Tue, 2018-10-23 at 10:03 -0700, Florian Fainelli wrote:
>> CAUTION: This email originated from outside of the organization. Do not 
>> click links or open attachments unless you recognize the sender and know the 
>> content is safe.
>>
>>
>> On 10/23/18 9:49 AM, Joakim Tjernlund wrote:
>>> SPI (and others) has a way to define bus number in a aliases:
>>>   aliases {
>>>   ethernet4 = 
>>>   ethernet0 = 
>>>   ethernet1 = 
>>>   ethernet2 = 
>>>   ethernet3 = 
>>>   spi0 = 
>>>   };
>>> The 0 in the spi0 alias will translate to bus num 0 so one can control the 
>>> /dev nodes, like /dev/spidev0
>>> I am looking for the same for ethernet devices:
>>>  ethernet4 =   /* should become eth4 */
>>>  ethernet0 =   /* should become eth0 */
>>> but I cannot find something like that for eth devices.
>>>
>>> Could such functionality be added?
>>
>> It could, do we want and need to, no. You have the Ethernet alias in
>> /sys/class/net/*/device/uevent already that would allow you to perform
>> that (re)naming in user-space:
>>
>> # cat /sys/class/net/eth0/device/uevent
>> DRIVER=bcmgenet
>> OF_NAME=ethernet
>> OF_FULLNAME=/rdb/ethernet@f048
>> OF_TYPE=network
>> OF_COMPATIBLE_0=brcm,genet-v5
>> OF_COMPATIBLE_N=1
>> OF_ALIAS_0=eth0 <==
>> MODALIAS=of:NethernetTnetworkCbrcm,genet-v5
> 
> Yes, one can if one uses udev and can find something to identify the hw I/F 
> with, my
> cat /sys/class/net/eth0/device/uevent looks like:
> DRIVER=fsl_dpa
> MODALIAS=platform:dpaa-ethernet

Does not dpaa have a notion of Ethernet ports and those should have
proper information? Maybe that is part of your problem here, it should
have the OF_ALIAS information somehow available.

> 
> not sure mdev supports this, does it?
> Our simple installer FS(initramfs) doesn't have either udev or mdev.

I don't know, but you could have a simple shell script that looks at
specific network device properties to decide on the naming and call
ifrename.

> 
> I also noted that using status = "disabled" didn't work either to create a 
> fix name scheme.
> Even worse, all the eth I/F after gets renumbered. It seems to me there
> is value in having stability in eth I/F naming at boot.
> Then userspace(udev) can rename if need be. 
> 
> Sure would like to known more about why this feature is not wanted ?
> 
> I found
>   https://patchwork.kernel.org/patch/4122441/
> You quote policy as reason but surely it must be better to
> have something stable, connected to the hardware name, than semirandom naming?

If the Device Tree nodes are ordered by ascending base register address,
my understanding is that you get the same order as far as
platform_device creation goes, this may not be true in the future if Rob
decides to randomize that, but AFAICT this is still true. This may not
work well with status = disabled properties being inserted here and
there, but we have used that here and it has worked for as far as I can
remember doing it.

Second, you might want to name network devices ethX, but what if I want
to name them ethernetX or fooX or barX? Should we be accepting a
mechanism in the kernel that would allow someone to name the interfaces
the way they want straight from a name being provided in Device Tree?

Aliases are fine for providing relative stability within the Device Tree
itself and boot programs that might need to modify the Device Tree (e.g:
inserting MAC addresses) such that you don't have to encode logic to
search for nodes by compatible strings etc. but outside of that use
case, it seems to me that you can resolve every naming decision in
user-space.
-- 
Florian


Re: [PATCH net-next 3/4] net: phy-c45: Implement reset/suspend/resume callbacks

2018-10-22 Thread Florian Fainelli
On 10/22/18 8:48 AM, Russell King - ARM Linux wrote:
> On Mon, Oct 22, 2018 at 01:47:48PM +0100, Jose Abreu wrote:
>> Hello,
>>
>> On 22-10-2018 13:28, Andrew Lunn wrote:
  EXPORT_SYMBOL_GPL(gen10g_resume);
 @@ -327,7 +381,7 @@ struct phy_driver genphy_10g_driver = {
.phy_id = 0x,
.phy_id_mask= 0x,
.name   = "Generic 10G PHY",
 -  .soft_reset = gen10g_no_soft_reset,
 +  .soft_reset = gen10g_soft_reset,
.config_init= gen10g_config_init,
.features   = 0,
.aneg_done  = genphy_c45_aneg_done,
>>> Hi Jose
>>>
>>> You need to be careful here. There is a reason this is called
>>> gen10g_no_soft_reset, rather than having an empty
>>> gen10g_soft_reset. Some PHYs break when you do a reset.  So adding a
>>> gen10g_soft_reset is fine, but don't change this here, without first
>>> understanding the history, and talking to Russell King.
>>
>> Hmm, the reset function only interacts with standard PCS
>> registers, which should always be available ...
>>
>> >From my tests I need to do at least 1 reset during power-up so in
>> ultimate case I can add a feature quirk or similar.
>>
>> Russell, can you please comment ?
> 
> Setting the reset bit on 88x3310 causes the entire device to become
> completely inaccessible until hardware reset.  Therefore, this bit
> must _never_ be set for these devices.  That said, we have a separate
> driver for these PHYs, but that will only be used for them if it's
> present in the kernel.  If we accidentally fall back to the generic
> driver, then we'll screw the 88x3310 until a full hardware reset.
> 
> We also have a bunch of net devices that make use of this crippled
> "generic" 10G support - we don't know whether resetting the PHY
> for those systems will cause a regression - maybe board firmware
> already configured the PHY?  I can't say either way on that, except
> that we've had crippled 10G support in PHYLIB for a number of years
> now _with_ users, and adding reset support drastically changes the
> subsystem's behaviour for these users.
> 
> I would recommend not touching the generic 10G driver, but instead
> implement your own driver for your PHY to avoid causing regressions.
> 

Agreed.
-- 
Florian


Re: [PATCH net-next 1/4] net: phy: Use C45 Helpers when forcing PHY

2018-10-22 Thread Florian Fainelli
On 10/22/18 3:32 AM, Jose Abreu wrote:
> If PHY is in force state and we have a C45 phy we need to use the
> standard C45 helpers and not the C22 ones.
> 
> Signed-off-by: Jose Abreu 
> Cc: Andrew Lunn 
> Cc: Florian Fainelli 
> Cc: "David S. Miller" 
> Cc: Joao Pinto 
> ---
>  drivers/net/phy/phy.c | 2 +-
>  include/linux/phy.h   | 8 
>  2 files changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
> index 1d73ac3309ce..0ff4946e208e 100644
> --- a/drivers/net/phy/phy.c
> +++ b/drivers/net/phy/phy.c
> @@ -995,7 +995,7 @@ void phy_state_machine(struct work_struct *work)
>   }
>   break;
>   case PHY_FORCING:
> - err = genphy_update_link(phydev);
> + err = phy_update_link(phydev);
>   if (err)
>   break;
>  
> diff --git a/include/linux/phy.h b/include/linux/phy.h
> index 3ea87f774a76..02c2ee8bc05b 100644
> --- a/include/linux/phy.h
> +++ b/include/linux/phy.h
> @@ -1044,6 +1044,14 @@ static inline int phy_read_status(struct phy_device 
> *phydev)
>   return genphy_read_status(phydev);
>  }
>  
> +static inline int phy_update_link(struct phy_device *phydev)
> +{
> + if (phydev->is_c45)
> + return gen10g_read_status(phydev);

Should not this be genphy_c45_read_link() for symmetry with
genphy_update_link() which only updates phydev->link?
-- 
Florian


Re: [PATCH v2 1/2] dt-bindings: net: add MDIO bus multiplexer driven by a regmap device

2018-10-22 Thread Florian Fainelli
On 10/21/18 9:22 PM, Pankaj Bansal wrote:
> 
>> -Original Message-
>> From: Pankaj Bansal
>> Sent: Thursday, October 18, 2018 10:00 AM
>> To: Florian Fainelli ; Andrew Lunn 
>> Cc: netdev@vger.kernel.org
>> Subject: RE: [PATCH v2 1/2] dt-bindings: net: add MDIO bus multiplexer 
>> driven by
>> a regmap device
>>
>> Hi Florian
>>
>>> -Original Message-
>>> From: Florian Fainelli [mailto:f.faine...@gmail.com]
>>> Sent: Sunday, October 7, 2018 11:32 PM
>>> To: Pankaj Bansal ; Andrew Lunn
>>> 
>>> Cc: netdev@vger.kernel.org
>>> Subject: Re: [PATCH v2 1/2] dt-bindings: net: add MDIO bus multiplexer
>>> driven by a regmap device
>>>
>>>
>>>
>>> On 10/07/18 11:24, Pankaj Bansal wrote:
>>>> Add support for an MDIO bus multiplexer controlled by a regmap
>>>> device, like an FPGA.
>>>>
>>>> Tested on a NXP LX2160AQDS board which uses the "QIXIS" FPGA
>>>> attached to the i2c bus.
>>>>
>>>> Signed-off-by: Pankaj Bansal 
>>>> ---
>>>>
>>>> Notes:
>>>> V2:
>>>>  - Fixed formatting error caused by using space instead of tab
>>>>  - Add newline between property list and subnode
>>>>  - Add newline between two subnodes
>>>>
>>>>  .../bindings/net/mdio-mux-regmap.txt | 95 ++
>>>>  1 file changed, 95 insertions(+)
>>>>
>>>> diff --git
>>>> a/Documentation/devicetree/bindings/net/mdio-mux-regmap.txt
>>>> b/Documentation/devicetree/bindings/net/mdio-mux-regmap.txt
>>>> new file mode 100644
>>>> index ..88ebac26c1c5
>>>> --- /dev/null
>>>> +++ b/Documentation/devicetree/bindings/net/mdio-mux-regmap.txt
>>>> @@ -0,0 +1,95 @@
>>>> +Properties for an MDIO bus multiplexer controlled by a regmap
>>>> +
>>>> +This is a special case of a MDIO bus multiplexer.  A regmap device,
>>>> +like an FPGA, is used to control which child bus is connected.  The
>>>> +mdio-mux node must be a child of the device that is controlled by a
>> regmap.
>>>> +The driver currently only supports devices with upto 32-bit registers.
>>>
>>> I would omit any sort of details about Linux constructs designed to
>>> support specific needs (e.g: regmap) as well as putting driver
>>> limitations into a binding document because it really ought to be 
>>> restricted to
>> describing hardware.
>>>
>>
>> Actually the platform driver mdio-mux-regmap.c, is generalization of 
>> mdio-mux-
>> mmioreg.c As such, it doesn't describe any new hardware, so no new properties
>> are described by it.
>> The only new property is compatible field.
>> I don't know how to describe this driver otherwise.  Can you please help?
> 
> I further thought about it. mdio-mux-regmap.c is not meant to control a 
> specific device.
> It is meant to control some registers of parent device. Therefore, IMO this 
> should not be a
> Platform driver and there should not be any "compatible" property to which 
> this driver is associated.
> 
> Rather this driver should expose the APIs, which should be called by parent 
> devices' driver.

Which one is "this driver" in that context? If you already have a driver
that FPGA bus/piece of hardware, then that driver itself could be
offering the MDIO muxing capabilities directly. You might still want to
have some Device Tree representation to properly parent the PHY devices
to the branches of the mux.

> 
> What is your thought on this ?
> 
>>
>>>> +
>>>> +Required properties in addition to the generic multiplexer properties:
>>>> +
>>>> +- compatible : string, must contain "mdio-mux-regmap"
>>>> +
>>>> +- reg : integer, contains the offset of the register that controls the bus
>>>> +  multiplexer. it can be 32 bit number.
>>>
>>> Technically it could be any "reg" property size, the only requirement
>>> is that it must be of the appropriate size with respect to what the
>>> parent node of this "mdio-mux-regmap" node has, determined by #address-
>> cells/#size-cells width.
>>
>> We are reading only single cell of this property using "of_propert_read_u32".
>> That is why I put the size in this.
>>
>>>
>>>> +
>>>> +- mux-mask : integer, contains an

Re: [PATCH 0/9] net: simplify getting .driver_data

2018-10-22 Thread Florian Fainelli
On 10/21/18 1:00 PM, Wolfram Sang wrote:
> I got tired of fixing this in Renesas drivers manually, so I took the big
> hammer. Remove this cumbersome code pattern which got copy-pasted too much
> already:
> 
> - struct platform_device *pdev = to_platform_device(dev);
> - struct ep93xx_keypad *keypad = platform_get_drvdata(pdev);
> + struct ep93xx_keypad *keypad = dev_get_drvdata(dev);
> 
> A branch, tested by buildbot, can be found here:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux.git 
> coccinelle/get_drvdata
> 
> I have been asked if it couldn't be done for dev_set_drvdata as well. I 
> checked
> it and did not find one occasion where it could be simplified like this. Not
> much of a surprise because driver_data is usually set in probe() functions
> which access struct platform_device in many other ways.
> 
> I am open for other comments, suggestions, too, of course.

Would it make sense to annotate platform_get_drvdata() with __deprecated
or something like that to prevent further uses by driver authors?

Do you mind submitting the cocci-script to the maintainers of the
coccinelle scripts that way it can be included and run by automated
tools? Thanks!

> 
> Here is the cocci-script I created:
> 
> @@
> struct device* d;
> identifier pdev;
> expression *ptr;
> @@
> (
> - struct platform_device *pdev = to_platform_device(d);
> |
> - struct platform_device *pdev;
>   ...
> - pdev = to_platform_device(d);
> )
>   <... when != pdev
> - >dev
> + d
>   ...>
> 
>   ptr =
> - platform_get_drvdata(pdev)
> + dev_get_drvdata(d)
> 
>   <... when != pdev
> - >dev
> + d
>   ...>
> 
> Kind regards,
> 
>Wolfram
> 
> 
> Wolfram Sang (9):
>   net: dsa: bcm_sf2: simplify getting .driver_data
>   net: dsa: qca8k: simplify getting .driver_data
>   net: ethernet: cadence: macb_main: simplify getting .driver_data
>   net: ethernet: davicom: dm9000: simplify getting .driver_data
>   net: ethernet: smsc: smc91x: simplify getting .driver_data
>   net: ethernet: ti: cpsw: simplify getting .driver_data
>   net: ethernet: ti: davinci_emac: simplify getting .driver_data
>   net: ethernet: wiznet: w5300: simplify getting .driver_data
>   net: phy: mdio-mux-bcm-iproc: simplify getting .driver_data
> 
>  drivers/net/dsa/bcm_sf2.c| 6 ++
>  drivers/net/dsa/qca8k.c  | 6 ++
>  drivers/net/ethernet/cadence/macb_main.c | 6 ++
>  drivers/net/ethernet/davicom/dm9000.c| 6 ++
>  drivers/net/ethernet/smsc/smc91x.c   | 3 +--
>  drivers/net/ethernet/ti/cpsw.c   | 6 ++
>  drivers/net/ethernet/ti/davinci_emac.c   | 6 ++
>  drivers/net/ethernet/wiznet/w5300.c  | 6 ++
>  drivers/net/phy/mdio-mux-bcm-iproc.c | 6 ++
>  9 files changed, 17 insertions(+), 34 deletions(-)
> 


-- 
Florian


Re: C45 Phys and PHY_FORCING state

2018-10-19 Thread Florian Fainelli
On 10/19/2018 05:02 AM, Jose Abreu wrote:
> Hello Andrew and Florian,
> 
> Currently I have a 10G C45 phy that is fixed at 10G link. This
> version does not support auto negotiation so I'm turning off the
> feature in phydev struct field. I found out that when I do this
> phylib is not composing C45 frames and is instead using C22. This
> is due to call to genphy_udpate_link() which doesn't work on my
> phy because it doesn't support C22.
> 
> If I apply attached patch then things work perfectly fine. Can
> you please review it ?

Looks reasonable, I could not find other functions in the state machine
that were not already abstracting the clause type, or letting a driver
callback be called. Can you submit this as a formal patch against
net-next (and not attached, but inline)?

I would suggest creating a helper, e.g: phy_update_link() that way
everything is well namespaced and clear within the state machine itself.
-- 
Florian


Re: [PATCH v2 net-next 0/8] net: dsa: microchip: Modify KSZ9477 DSA driver in preparation to add other KSZ switch drivers

2018-10-19 Thread Florian Fainelli
On 08/16/2018 02:34 PM, tristram...@microchip.com wrote:
>> -Original Message-
>> From: Florian Fainelli 
>> Sent: Wednesday, August 15, 2018 5:29 PM
>> To: Tristram Ha - C24268 ; Andrew Lunn
>> ; Pavel Machek ; Ruediger Schmitt
>> 
>> Cc: Arkadi Sharshevsky ; UNGLinuxDriver
>> ; netdev@vger.kernel.org
>> Subject: Re: [PATCH v2 net-next 0/8] net: dsa: microchip: Modify KSZ9477
>> DSA driver in preparation to add other KSZ switch drivers
>>
>> On 12/05/2017 05:46 PM, tristram...@microchip.com wrote:
>>> From: Tristram Ha 
>>>
>>> This series of patches is to modify the original KSZ9477 DSA driver so
>>> that other KSZ switch drivers can be added and use the common code.
>>>
>>> There are several steps to accomplish this achievement.  First is to
>>> rename some function names with a prefix to indicate chip specific
>>> function.  Second is to move common code into header that can be shared.
>>> Last is to modify tag_ksz.c so that it can handle many tail tag formats
>>> used by different KSZ switch drivers.
>>>
>>> ksz_common.c will contain the common code used by all KSZ switch drivers.
>>> ksz9477.c will contain KSZ9477 code from the original ksz_common.c.
>>> ksz9477_spi.c is renamed from ksz_spi.c.
>>> ksz9477_reg.h is renamed from ksz_9477_reg.h.
>>> ksz_common.h is added to provide common code access to KSZ switch
>>> drivers.
>>> ksz_spi.h is added to provide common SPI access functions to KSZ SPI
>>> drivers.
>>
>> Is something gating this series from getting included? It's been nearly
>> 8 months now and this has not been include nor resubmitted, any plans to
>> rebase that patch series and work towards inclusion in net-next when it
>> opens back again?
>>
>> Thank you!
> 
> Sorry for the long delay.  I will restart my kernel submission effort next 
> month
> after finishing the work on current development project.
> 

Tristram, any chance of resubmitting this or should someone with access
to those switches take up your series and submit it?
-- 
Florian


Re: [PATCH 2/3] net: socionext: Add dummy PHY register read in phy_write()

2018-10-18 Thread Florian Fainelli



On 10/18/2018 6:08 PM, masahisa.koj...@linaro.org wrote:
> From: Masahisa Kojima 
> 
> There is a compatibility issue between RTL8211E implemented
> in Developerbox and netsec network controller IP(F_GMAC4).
> 
> RTL8211E expects MDC clock must be kept toggling for several
> clock cycle with MDIO high before entering the IDLE state.
> To meet this requirement, netsec driver needs to issue dummy
> read(e.g. read PHYID1(offset 0x2) register) right after write.
> 
> Signed-off-by: Masahisa Kojima 
> Signed-off-by: Yoshitoyo Osaki 
> ---
>  drivers/net/ethernet/socionext/netsec.c | 18 --
>  1 file changed, 16 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/socionext/netsec.c 
> b/drivers/net/ethernet/socionext/netsec.c
> index 273cc5fc07e0..e7faaf8be99e 100644
> --- a/drivers/net/ethernet/socionext/netsec.c
> +++ b/drivers/net/ethernet/socionext/netsec.c
> @@ -431,9 +431,12 @@ static int netsec_mac_update_to_phy_state(struct 
> netsec_priv *priv)
>   return 0;
>  }
>  
> +static int netsec_phy_read(struct mii_bus *bus, int phy_addr, int reg_addr);
> +
>  static int netsec_phy_write(struct mii_bus *bus,
>   int phy_addr, int reg, u16 val)
>  {
> + int status;
>   struct netsec_priv *priv = bus->priv;
>  
>   if (netsec_mac_write(priv, GMAC_REG_GDR, val))
> @@ -446,8 +449,19 @@ static int netsec_phy_write(struct mii_bus *bus,
> GMAC_REG_SHIFT_CR_GAR)))
>   return -ETIMEDOUT;
>  
> - return netsec_mac_wait_while_busy(priv, GMAC_REG_GAR,
> -   NETSEC_GMAC_GAR_REG_GB);
> + status = netsec_mac_wait_while_busy(priv, GMAC_REG_GAR,
> + NETSEC_GMAC_GAR_REG_GB);
> +
> + /* Developerbox implements RTL8211E PHY and there is
> +  * a compatibility problem with F_GMAC4.
> +  * RTL8211E expects MDC clock must be kept toggling for several
> +  * clock cycle with MDIO high before entering the IDLE state.
> +  * To meet this requirement, netsec driver needs to issue dummy
> +  * read(e.g. read PHYID1(offset 0x2) register) right after write.
> +  */
> + netsec_phy_read(bus, phy_addr, 2);

MII_PHYSID1 instead of 0x2 would be preferable. It is not clear to me
from your commit message if this is a problem specific to your MDIO
controller implementation and the RTL8211E PHY or if this is a general
problem of the PHY irrespective of the MDIO controller it is interface
with. If the latter, then we should seek a solution at a different layer
such that other systems don't have that same problem.

Thank you!
-- 
Florian


Re: [PATCH 1/3] net: socionext: Stop PHY before resetting netsec

2018-10-18 Thread Florian Fainelli



On 10/18/2018 6:08 PM, masahisa.koj...@linaro.org wrote:
> From: Masahisa Kojima 
> 
> After resetting netsec IP, driver have to wait until
> netsec mode turns to NRM mode.
> But sometimes mode transition to NRM will not complete
> if the PHY is in normal operation state.
> To avoid this situation, stop PHY before resetting netsec.
> 
> Signed-off-by: Masahisa Kojima 
> Signed-off-by: Yoshitoyo Osaki 
> ---
>  drivers/net/ethernet/socionext/netsec.c | 15 +++
>  1 file changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/ethernet/socionext/netsec.c 
> b/drivers/net/ethernet/socionext/netsec.c
> index 4289ccb26e4e..273cc5fc07e0 100644
> --- a/drivers/net/ethernet/socionext/netsec.c
> +++ b/drivers/net/ethernet/socionext/netsec.c
> @@ -1343,11 +1343,11 @@ static int netsec_netdev_stop(struct net_device *ndev)
>   netsec_uninit_pkt_dring(priv, NETSEC_RING_TX);
>   netsec_uninit_pkt_dring(priv, NETSEC_RING_RX);
>  
> - ret = netsec_reset_hardware(priv, false);
> -
>   phy_stop(ndev->phydev);
>   phy_disconnect(ndev->phydev);
>  
> + ret = netsec_reset_hardware(priv, false);
> +
>   pm_runtime_put_sync(priv->dev);
>  
>   return ret;
> @@ -1415,7 +1415,7 @@ static const struct net_device_ops netsec_netdev_ops = {
>  };
>  
>  static int netsec_of_probe(struct platform_device *pdev,
> -struct netsec_priv *priv)
> +struct netsec_priv *priv, u32 *phy_addr)
>  {
>   priv->phy_np = of_parse_phandle(pdev->dev.of_node, "phy-handle", 0);
>   if (!priv->phy_np) {
> @@ -1423,6 +1423,8 @@ static int netsec_of_probe(struct platform_device *pdev,
>   return -EINVAL;
>   }
>  
> + *phy_addr = of_mdio_parse_addr(>dev, priv->phy_np);
> +
>   priv->clk = devm_clk_get(>dev, NULL); /* get by 'phy_ref_clk' */
>   if (IS_ERR(priv->clk)) {
>   dev_err(>dev, "phy_ref_clk not found\n");
> @@ -1473,6 +1475,7 @@ static int netsec_register_mdio(struct netsec_priv 
> *priv, u32 phy_addr)
>  {
>   struct mii_bus *bus;
>   int ret;
> + u16 data;
>  
>   bus = devm_mdiobus_alloc(priv->dev);
>   if (!bus)
> @@ -1486,6 +1489,10 @@ static int netsec_register_mdio(struct netsec_priv 
> *priv, u32 phy_addr)
>   bus->parent = priv->dev;
>   priv->mii_bus = bus;
>  
> + /* set phy power down */
> + data = netsec_phy_read(bus, phy_addr, 0) | 0x800;
> + netsec_phy_write(bus, phy_addr, 0, data);

This is not explained in your commit message, and this is using open
coded values instead of the symbolic names from include/uapi/linux/mii.h.

Note that depending on the type of PHY you are interfaced with if the
PHY is in power down, the only register it is allowed to accept writes
for is MII_BMCR (per 802.3 clause 22 spec), any other read or write to a
different register can be discarded by its MDIO snooping logic. Since
probing the MDIO bus involves reading MII_PHYSID1/ID2, what is this
trying to achieve?

> +
>   if (dev_of_node(priv->dev)) {
>   struct device_node *mdio_node, *parent = dev_of_node(priv->dev);
>  
> @@ -1623,7 +1630,7 @@ static int netsec_probe(struct platform_device *pdev)
>   }
>  
>   if (dev_of_node(>dev))
> - ret = netsec_of_probe(pdev, priv);
> + ret = netsec_of_probe(pdev, priv, _addr);
>   else
>   ret = netsec_acpi_probe(pdev, priv, _addr);
>   if (ret)
> 

-- 
Florian


Re: [PATCH 0/3] Bugfix for the netsec driver

2018-10-18 Thread Florian Fainelli



On 10/18/2018 6:08 PM, masahisa.koj...@linaro.org wrote:
> From: Masahisa Kojima 
> 
> This patch series include bugfix for the netsec ethernet
> controller driver, fix the problem in interface down/up.

Since all of these are bugfixes you would want to provide a Fixes: tag
for the offending commits, that way you can get automated backporting to
stable trees, also patches should be targeted at the "net" tree, which
is indicated with a subject start with [PATCH net], more comments in the
patches.

> 
> Masahisa Kojima (3):
>   net: socionext: stop PHY before resetting netsec
>   net: socionext: Add dummy PHY register read in phy_write()
>   net: socionext: reset tx queue in ndo_stop
> 
>  drivers/net/ethernet/socionext/netsec.c | 36 
> +++--
>  1 file changed, 30 insertions(+), 6 deletions(-)
> 

-- 
Florian


Re: [PATCH 2/2] net: emac: implement TCP TSO

2018-10-17 Thread Florian Fainelli
On 10/17/2018 12:53 PM, Christian Lamparter wrote:
> This patch enables TSO(v4) hw feature for emac driver.
> As atleast the APM82181's TCP/IP acceleration hardware
> controller (TAH) provides TCP segmentation support in
> the transmit path.
> 
> Signed-off-by: Christian Lamparter 
> ---
>  drivers/net/ethernet/ibm/emac/core.c | 101 ++-
>  drivers/net/ethernet/ibm/emac/core.h |   4 ++
>  drivers/net/ethernet/ibm/emac/emac.h |   7 ++
>  drivers/net/ethernet/ibm/emac/tah.c  |  20 ++
>  drivers/net/ethernet/ibm/emac/tah.h  |   2 +
>  5 files changed, 133 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/ibm/emac/core.c 
> b/drivers/net/ethernet/ibm/emac/core.c
> index be560f9031f4..49ffbd6e1707 100644
> --- a/drivers/net/ethernet/ibm/emac/core.c
> +++ b/drivers/net/ethernet/ibm/emac/core.c
> @@ -38,6 +38,9 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -1410,6 +1413,52 @@ static inline u16 emac_tx_csum(struct emac_instance 
> *dev,
>   return 0;
>  }
>  
> +const u32 tah_ss[TAH_NO_SSR] = { 9000, 4500, 1500, 1300, 576, 176 };
> +
> +static int emac_tx_tso(struct emac_instance *dev, struct sk_buff *skb,
> +u16 *ctrl)
> +{
> + if (emac_has_feature(dev, EMAC_FTR_TAH_HAS_TSO) &&
> + skb_is_gso(skb) && !!(skb_shinfo(skb)->gso_type &
> + (SKB_GSO_TCPV4 | SKB_GSO_TCPV6))) {
> + u32 seg_size = 0, i;
> +
> + /* Get the MTU */
> + seg_size = skb_shinfo(skb)->gso_size + tcp_hdrlen(skb)
> + + skb_network_header_len(skb);
> +
> + /* Restriction applied for the segmentation size
> +  * to use HW segmentation offload feature: the size
> +  * of the segment must not be less than 168 bytes for
> +  * DIX formatted segments, or 176 bytes for
> +  * IEEE formatted segments.
> +  *
> +  * I use value 176 to check for the segment size here
> +  * as it can cover both 2 conditions above.
> +  */
> + if (seg_size < 176)
> + return -ENODEV;
> +
> + /* Get the best suitable MTU */
> + for (i = 0; i < ARRAY_SIZE(tah_ss); i++) {
> + u32 curr_seg = tah_ss[i];
> +
> + if (curr_seg > dev->ndev->mtu ||
> + curr_seg > seg_size)
> + continue;
> +
> + *ctrl &= ~EMAC_TX_CTRL_TAH_CSUM;
> + *ctrl |= EMAC_TX_CTRL_TAH_SSR(i);
> + return 0;

This is something that you can possibly take out of your hot path and
recalculate when the MTU actually changes?

[snip]

> +static netdev_tx_t emac_sw_tso(struct sk_buff *skb, struct net_device *ndev)
> +{
> + struct emac_instance *dev = netdev_priv(ndev);
> + struct sk_buff *segs, *curr;
> +
> + segs = skb_gso_segment(skb, ndev->features &
> + ~(NETIF_F_TSO | NETIF_F_TSO6));
> + if (IS_ERR_OR_NULL(segs)) {
> + goto drop;
> + } else {
> + while (segs) {
> + /* check for overflow */
> + if (dev->tx_cnt >= NUM_TX_BUFF) {
> + dev_kfree_skb_any(segs);
> + goto drop;
> + }

Would setting dev->max_gso_segs somehow help make sure the stack does
not feed you oversized GSO'd skbs?
-- 
Florian


Re: [PATCH 1/2] net: emac: implement 802.1Q VLAN TX tagging support

2018-10-17 Thread Florian Fainelli
On 10/17/2018 01:08 PM, Florian Fainelli wrote:
> On 10/17/2018 12:53 PM, Christian Lamparter wrote:
>> As per' APM82181 Embedded Processor User Manual 26.1 EMAC Features:
>> VLAN:
>>  - Support for VLAN tag ID in compliance with IEEE 802.3ac.
>>  - VLAN tag insertion or replacement for transmit packets
>>
>> This patch completes the missing code for the VLAN tx tagging
>> support, as the the EMAC_MR1_VLE was already enabled.
>>
>> Signed-off-by: Christian Lamparter 
>> ---
>>  drivers/net/ethernet/ibm/emac/core.c | 32 
>>  drivers/net/ethernet/ibm/emac/core.h |  6 +-
>>  2 files changed, 33 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/ibm/emac/core.c 
>> b/drivers/net/ethernet/ibm/emac/core.c
>> index 760b2ad8e295..be560f9031f4 100644
>> --- a/drivers/net/ethernet/ibm/emac/core.c
>> +++ b/drivers/net/ethernet/ibm/emac/core.c
>> @@ -37,6 +37,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  #include 
>>  #include 
>>  #include 
>> @@ -674,7 +675,7 @@ static int emac_configure(struct emac_instance *dev)
>>   ndev->dev_addr[5]);
>>  
>>  /* VLAN Tag Protocol ID */
>> -out_be32(>vtpid, 0x8100);
>> +out_be32(>vtpid, ETH_P_8021Q);
>>  
>>  /* Receive mode register */
>>  r = emac_iff2rmr(ndev);
>> @@ -1435,6 +1436,22 @@ static inline netdev_tx_t emac_xmit_finish(struct 
>> emac_instance *dev, int len)
>>  return NETDEV_TX_OK;
>>  }
>>  
>> +static inline u16 emac_tx_vlan(struct emac_instance *dev, struct sk_buff 
>> *skb)
>> +{
>> +/* Handle VLAN TPID and TCI insert if this is a VLAN skb */
>> +if (emac_has_feature(dev, EMAC_FTR_HAS_VLAN_CTAG_TX) &&
>> +skb_vlan_tag_present(skb)) {
>> +struct emac_regs __iomem *p = dev->emacp;
>> +
>> +/* update the VLAN TCI */
>> +out_be32(>vtci, (u32)skb_vlan_tag_get(skb));
> 
> The only case where this is likely not going to be 0x8100/ETH_P_8021Q is
> if you do 802.1ad (QinQ) and you decided to somehow offload the S-Tag
> instead of the C-Tag.

Sorry, looks like I mixed up TCI and TPID here, this looks obviously
correct ;)
-- 
Florian


Re: [PATCH 1/2] net: emac: implement 802.1Q VLAN TX tagging support

2018-10-17 Thread Florian Fainelli
On 10/17/2018 12:53 PM, Christian Lamparter wrote:
> As per' APM82181 Embedded Processor User Manual 26.1 EMAC Features:
> VLAN:
>  - Support for VLAN tag ID in compliance with IEEE 802.3ac.
>  - VLAN tag insertion or replacement for transmit packets
> 
> This patch completes the missing code for the VLAN tx tagging
> support, as the the EMAC_MR1_VLE was already enabled.
> 
> Signed-off-by: Christian Lamparter 
> ---
>  drivers/net/ethernet/ibm/emac/core.c | 32 
>  drivers/net/ethernet/ibm/emac/core.h |  6 +-
>  2 files changed, 33 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/ethernet/ibm/emac/core.c 
> b/drivers/net/ethernet/ibm/emac/core.c
> index 760b2ad8e295..be560f9031f4 100644
> --- a/drivers/net/ethernet/ibm/emac/core.c
> +++ b/drivers/net/ethernet/ibm/emac/core.c
> @@ -37,6 +37,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -674,7 +675,7 @@ static int emac_configure(struct emac_instance *dev)
>ndev->dev_addr[5]);
>  
>   /* VLAN Tag Protocol ID */
> - out_be32(>vtpid, 0x8100);
> + out_be32(>vtpid, ETH_P_8021Q);
>  
>   /* Receive mode register */
>   r = emac_iff2rmr(ndev);
> @@ -1435,6 +1436,22 @@ static inline netdev_tx_t emac_xmit_finish(struct 
> emac_instance *dev, int len)
>   return NETDEV_TX_OK;
>  }
>  
> +static inline u16 emac_tx_vlan(struct emac_instance *dev, struct sk_buff 
> *skb)
> +{
> + /* Handle VLAN TPID and TCI insert if this is a VLAN skb */
> + if (emac_has_feature(dev, EMAC_FTR_HAS_VLAN_CTAG_TX) &&
> + skb_vlan_tag_present(skb)) {
> + struct emac_regs __iomem *p = dev->emacp;
> +
> + /* update the VLAN TCI */
> + out_be32(>vtci, (u32)skb_vlan_tag_get(skb));

The only case where this is likely not going to be 0x8100/ETH_P_8021Q is
if you do 802.1ad (QinQ) and you decided to somehow offload the S-Tag
instead of the C-Tag.

It would be a shame to slow down your TX path with an expensive register
write, when maybe inserting the VLAN in software amounts to the same
performance result ;)
-- 
Florian


Re: [PATCH net] r8169: fix NAPI handling under high load

2018-10-16 Thread Florian Fainelli



On 10/16/2018 5:23 PM, Eric Dumazet wrote:
> 
> 
> On 10/16/2018 04:08 PM, Florian Fainelli wrote:
> 
>> I had started doing that about a month ago in light of the ixbge
>> ndo_poll_controller vs. napi problem, but have not had time to submit
>> that series yet:
>>
>> https://github.com/ffainelli/linux/commits/napi-check
>>
>> feel free to piggy back on top of that series if you would like to
>> address this.
> 
> 
> But the root cause was different, you remember ?
> 
> We fixed the real issue with netpoll ability to stick all NIC IRQ onto one
> victim CPU.

I do remember, after seeing the resolution I kind of decided to leave
this in a branch but not submit it because it was not particularly
relevant anymore.
-- 
Florian


Re: [PATCH net] r8169: fix NAPI handling under high load

2018-10-16 Thread Florian Fainelli
On 10/16/2018 04:03 PM, Stephen Hemminger wrote:
> On Tue, 16 Oct 2018 23:17:31 +0200
> Holger Hoffstätte  wrote:
> 
>> On 10/16/18 22:37, Heiner Kallweit wrote:
>>> rtl_rx() and rtl_tx() are called only if the respective bits are set
>>> in the interrupt status register. Under high load NAPI may not be
>>> able to process all data (work_done == budget) and it will schedule
>>> subsequent calls to the poll callback.
>>> rtl_ack_events() however resets the bits in the interrupt status
>>> register, therefore subsequent calls to rtl8169_poll() won't call
>>> rtl_rx() and rtl_tx() - chip interrupts are still disabled.  
>>
>> Very interesting! Could this be the reason for the mysterious
>> hangs & resets we experienced when enabling BQL for r8169?
>> They happened more often with TSO/GSO enabled and several people
>> attempted to fix those hangs unsuccessfully; it was later reverted
>> and has been since then (#87cda7cb43).
>> If this bug has been there "forever" it might be tempting to
>> re-apply BQL and see what happens. Any chance you could give that
>> a try? I'll gladly test patches, just like I'll run this one.
>>
>> cheers
>> Holger
> 
> Many drivers have buggy usage of napi_complete_done.
> 
> Might even be worth forcing all network drivers to check the return
> value. But fixing 150 broken drivers will be a nuisance.

I had started doing that about a month ago in light of the ixbge
ndo_poll_controller vs. napi problem, but have not had time to submit
that series yet:

https://github.com/ffainelli/linux/commits/napi-check

feel free to piggy back on top of that series if you would like to
address this.

> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index dc1d9ed33b31..c38bc66ffe74 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -466,7 +466,8 @@ static inline bool napi_reschedule(struct napi_struct 
> *napi)
> return false;
>  }
>  
> -bool napi_complete_done(struct napi_struct *n, int work_done);
> +bool __must_check napi_complete_done(struct napi_struct *n, int work_done);
> +
>  /**
>   * napi_complete - NAPI processing complete
>   * @n: NAPI context
> 


-- 
Florian


Re: gianfar: Implement MAC reset and reconfig procedure

2018-10-16 Thread Florian Fainelli
On 10/16/2018 02:36 PM, Daniel Walker wrote:
> Hi,
> 
> I would like to report an issue in the gianfar driver. The issue is as 
> follows. 
> 
> We have a P2020 board that uses the gianfar driver, and we have a m88e1101
> PHY connect. When the interface is initially brought up traffic flows as
> normal. If you take the interface down then bring it back up traffic stops
> flowing. If you do this sequence over and over up/down/up we find that the
> interface will allow traffic to flow at a low percentage.
> 
> In v4.9 interface allows traffic about %10 of the time.
> 
> In v4.19-rc8 the allows traffic %30 of the time.
> 
> After bisecting I found that in v3.14 the interface was rock solid and never 
> did
> we see this issue. However, v3.15 we started to see this issue. After 
> bisecting I
> found the following change is the first one which causes the issue,
> 
> a328ac9 gianfar: Implement MAC reset and reconfig procedure
> 
> I was able to revert this in v3.15 , however with later development a revert
> doesn't appear to be possible. We have no fix for this currently.
> 
> I can do testing if you have an idea what might cause the issue.

What we have seen being typically the problem is that when you have a
PHY connection whereby the PHY provides the RX clock to the MAC (e.g:
RGMII), it is very easy to get in a situation where the PHY clock is
stopped, and the MAC is asked to be reset, but the HW design does not
like that at all since it e.g: stops on packet boundaries and need some
clock cycles to do that, and that results in all sorts of issues (in our
case it was some FIFO corruption). We solved that in bcmgenet.c with
looping internally the TX clock to the RX clock to make sure the
Ethernet MAC (UniMAC in our designs) was successfully reset:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=28c2d1a7a0bfdf3617800d2beae1c67983c03d15

Could that somehow be the problem here?
-- 
Florian


Re: [PATCH stable 4.9 v2 00/29] backport of IP fragmentation fixes

2018-10-15 Thread Florian Fainelli



On 10/10/2018 12:29 PM, Florian Fainelli wrote:
> This is based on Stephen's v4.14 patches, with the necessary merge
> conflicts, and the lack of timer_setup() on the 4.9 baseline.
> 
> Perf results on a gigabit capable system, before and after are below.
> 
> Series can also be found here:
> 
> https://github.com/ffainelli/linux/commits/fragment-stack-v4.9-v2
> 
> Changes in v2:
> 
> - drop "net: sk_buff rbnode reorg"
> - added original "ip: use rb trees for IP frag queue." commit

Eric, does this look reasonable to you?

> 
> Before patches:
> 
>PerfTop: 180 irqs/sec  kernel:78.9%  exact:  0.0% [4000Hz cycles:ppp], 
>  (all, 4 CPUs)
> ---
> 
> 34.81%  [kernel]   [k] ip_defrag
>  4.57%  [kernel]   [k] arch_cpu_idle
>  2.09%  [kernel]   [k] fib_table_lookup
>  1.74%  [kernel]   [k] finish_task_switch
>  1.57%  [kernel]   [k] v7_dma_inv_range
>  1.47%  [kernel]   [k] __netif_receive_skb_core
>  1.06%  [kernel]   [k] __slab_free
>  1.04%  [kernel]   [k] __netdev_alloc_skb
>  0.99%  [kernel]   [k] ip_route_input_noref
>  0.96%  [kernel]   [k] dev_gro_receive
>  0.96%  [kernel]   [k] tick_nohz_idle_enter
>  0.93%  [kernel]   [k] bcm_sysport_poll
>  0.92%  [kernel]   [k] skb_release_data
>  0.91%  [kernel]   [k] __memzero
>  0.90%  [kernel]   [k] __free_page_frag
>  0.87%  [kernel]   [k] ip_rcv
>  0.77%  [kernel]   [k] eth_type_trans
>  0.71%  [kernel]   [k] _raw_spin_unlock_irqrestore
>  0.68%  [kernel]   [k] tick_nohz_idle_exit
>  0.65%  [kernel]   [k] bcm_sysport_rx_refill
> 
> After patches:
> 
>PerfTop: 214 irqs/sec  kernel:80.4%  exact:  0.0% [4000Hz cycles:ppp], 
>  (all, 4 CPUs)
> ---
> 
>  6.61%  [kernel]   [k] arch_cpu_idle
>  3.77%  [kernel]   [k] ip_defrag
>  3.65%  [kernel]   [k] v7_dma_inv_range
>  3.18%  [kernel]   [k] fib_table_lookup
>  3.04%  [kernel]   [k] __netif_receive_skb_core
>  2.31%  [kernel]   [k] finish_task_switch
>  2.31%  [kernel]   [k] _raw_spin_unlock_irqrestore
>  1.65%  [kernel]   [k] bcm_sysport_poll
>  1.63%  [kernel]   [k] ip_route_input_noref
>  1.63%  [kernel]   [k] __memzero
>  1.58%  [kernel]   [k] __netdev_alloc_skb
>  1.47%  [kernel]   [k] tick_nohz_idle_enter
>  1.40%  [kernel]   [k] __slab_free
>  1.32%  [kernel]   [k] ip_rcv
>  1.32%  [kernel]   [k] __softirqentry_text_start
>  1.30%  [kernel]   [k] dev_gro_receive
>  1.23%  [kernel]   [k] bcm_sysport_rx_refill
>  1.11%  [kernel]   [k] tick_nohz_idle_exit
>  1.06%  [kernel]   [k] memcmp
>  1.02%  [kernel]   [k] dma_cache_maint_page
> 
> 
> Dan Carpenter (1):
>   ipv4: frags: precedence bug in ip_expire()
> 
> Eric Dumazet (21):
>   inet: frags: change inet_frags_init_net() return value
>   inet: frags: add a pointer to struct netns_frags
>   inet: frags: refactor ipfrag_init()
>   inet: frags: refactor ipv6_frag_init()
>   inet: frags: refactor lowpan_net_frag_init()
>   ipv6: export ip6 fragments sysctl to unprivileged users
>   rhashtable: add schedule points
>   inet: frags: use rhashtables for reassembly units
>   inet: frags: remove some helpers
>   inet: frags: get rif of inet_frag_evicting()
>   inet: frags: remove inet_frag_maybe_warn_overflow()
>   inet: frags: break the 2GB limit for frags storage
>   inet: frags: do not clone skb in ip_expire()
>   ipv6: frags: rewrite ip6_expire_frag_queue()
>   rhashtable: reorganize struct rhashtable layout
>   inet: frags: reorganize struct netns_frags
>   inet: frags: get rid of ipfrag_skb_cb/FRAG_CB
>   inet: frags: fix ip6frag_low_thresh boundary
>   net: speed up skb_rbtree_purge()
>   net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends
>   net: add rb_to_skb() and other rb tree helpers
> 
> Florian Westphal (1):
>   ipv6: defrag: drop non-last frags smaller than min mtu
> 
> Peter Oskolkov (5):
>   ip: discard IPv4 datagrams with overlapping segments.
>   net: modify skb_rbtree_purge to return the truesize of all purged
> skbs.
>   ip: use rb trees for IP frag queue.
>   ip: add helpers to process in-order fragments faster.
>   ip: process in-order fragments efficiently
> 
> Taehee Yoo (1):
>   ip: frags: fix crash in ip_do_fragment()
> 
>  Documentation/networking/ip-sysctl.txt  |  13 +-
>  include/linux/rhashtable.h

Re: [PATCH net 0/2] net: dsa: bcm_sf2: Couple of fixes

2018-10-11 Thread Florian Fainelli
On 10/10/2018 10:53 PM, David Miller wrote:
> From: Florian Fainelli 
> Date: Tue,  9 Oct 2018 16:48:56 -0700
> 
>> Here are two fixes for the bcm_sf2 driver that were found during
>> testing unbind and analysing another issue during system
>> suspend/resume.
> 
> Series applied and queued up for -stable, thanks.

Did you push that series yet?

git pull
remote: Counting objects: 171, done.
remote: Compressing objects: 100% (71/71), done.
remote: Total 171 (delta 144), reused 118 (delta 100)
Receiving objects: 100% (171/171), 32.80 KiB | 16.40 MiB/s, done.
Resolving deltas: 100% (144/144), completed with 61 local objects.
>From https://git.kernel.org/pub/scm/linux/kernel/git/davem/net
   52b5d6f5dcf0..052858663db3  master -> davem-net/master

git rebase net/master sf2-fixes
First, rewinding head to replay your work on top of it...
Applying: net: dsa: bcm_sf2: Fix unbind ordering
Applying: net: dsa: bcm_sf2: Call setup during switch resume
-- 
Florian


Re: [PATCH stable 4.9 00/29] backport of IP fragmentation fixes

2018-10-10 Thread Florian Fainelli
On 10/10/2018 04:18 PM, Stephen Hemminger wrote:
> On Tue, 9 Oct 2018 21:15:04 -0700
> Florian Fainelli  wrote:
> 
>>>
>>> Strange, I do not see "ip: use rb trees for IP frag queue." in this list ?  
>>
>> And it was not in Stephen's backport to 4.14 either, wait, looks like it
>> was somehow squashed into "net: sk_buff rbnode reorg". Stephen, was
>> there a reason for that?
>>
>> Let me go back and add bffa72cf7f9df842f0016ba03586039296b4caaf as well
>> as eeea10b83a139451130df1594f26710c8fa390c8 to the rebase todo and see
>> how things go from there.
>>
>> Thanks for taking a look.
> 
> I don't remember, spent time doing cherry-pick and fixups. Maybe the reorg
> commit got squashed as part of one rebase.
> 

No worries, I ended dropping these two commits because they would have
required cherry-picking "udp: copy skb->truesize in the first cache
line" which did not seem appropriate. I would appreciate if you could
take a look at v2 and confirm this does look good, in particular the
struct sk_buff layout.

Thanks!
-- 
Florian


[PATCH stable 4.9 v2 28/29] ip: frags: fix crash in ip_do_fragment()

2018-10-10 Thread Florian Fainelli
From: Taehee Yoo 

commit 5d407b071dc369c26a38398326ee2be53651cfe4 upstream

A kernel crash occurrs when defragmented packet is fragmented
in ip_do_fragment().
In defragment routine, skb_orphan() is called and
skb->ip_defrag_offset is set. but skb->sk and
skb->ip_defrag_offset are same union member. so that
frag->sk is not NULL.
Hence crash occurrs in skb->sk check routine in ip_do_fragment() when
defragmented packet is fragmented.

test commands:
   %iptables -t nat -I POSTROUTING -j MASQUERADE
   %hping3 192.168.4.2 -s 1000 -p 2000 -d 6

splat looks like:
[  261.069429] kernel BUG at net/ipv4/ip_output.c:636!
[  261.075753] invalid opcode:  [#1] SMP DEBUG_PAGEALLOC KASAN PTI
[  261.083854] CPU: 1 PID: 1349 Comm: hping3 Not tainted 4.19.0-rc2+ #3
[  261.100977] RIP: 0010:ip_do_fragment+0x1613/0x2600
[  261.106945] Code: e8 e2 38 e3 fe 4c 8b 44 24 18 48 8b 74 24 08 e9 92 f6 ff 
ff 80 3c 02 00 0f 85 da 07 00 00 48 8b b5 d0 00 00 00 e9 25 f6 ff ff <0f> 0b 0f 
0b 44 8b 54 24 58 4c 8b 4c 24 18 4c 8b 5c 24 60 4c 8b 6c
[  261.127015] RSP: 0018:8801031cf2c0 EFLAGS: 00010202
[  261.134156] RAX: 11002297537b RBX: ed0020639e6e RCX: 0004
[  261.142156] RDX:  RSI:  RDI: 880114ba9bd8
[  261.150157] RBP: 880114ba8a40 R08: ed0022975395 R09: ed0022975395
[  261.158157] R10: 0001 R11: ed0022975394 R12: 880114ba9ca4
[  261.166159] R13: 0010 R14: 880114ba9bc0 R15: dc00
[  261.174169] FS:  7fbae2199700() GS:88011b40() 
knlGS:
[  261.183012] CS:  0010 DS:  ES:  CR0: 80050033
[  261.189013] CR2: 5579244fe000 CR3: 000119bf4000 CR4: 001006e0
[  261.198158] Call Trace:
[  261.199018]  ? dst_output+0x180/0x180
[  261.205011]  ? save_trace+0x300/0x300
[  261.209018]  ? ip_copy_metadata+0xb00/0xb00
[  261.213034]  ? sched_clock_local+0xd4/0x140
[  261.218158]  ? kill_l4proto+0x120/0x120 [nf_conntrack]
[  261.223014]  ? rt_cpu_seq_stop+0x10/0x10
[  261.227014]  ? find_held_lock+0x39/0x1c0
[  261.233008]  ip_finish_output+0x51d/0xb50
[  261.237006]  ? ip_fragment.constprop.56+0x220/0x220
[  261.243011]  ? nf_ct_l4proto_register_one+0x5b0/0x5b0 [nf_conntrack]
[  261.250152]  ? rcu_is_watching+0x77/0x120
[  261.255010]  ? nf_nat_ipv4_out+0x1e/0x2b0 [nf_nat_ipv4]
[  261.261033]  ? nf_hook_slow+0xb1/0x160
[  261.265007]  ip_output+0x1c7/0x710
[  261.269005]  ? ip_mc_output+0x13f0/0x13f0
[  261.273002]  ? __local_bh_enable_ip+0xe9/0x1b0
[  261.278152]  ? ip_fragment.constprop.56+0x220/0x220
[  261.282996]  ? nf_hook_slow+0xb1/0x160
[  261.287007]  raw_sendmsg+0x21f9/0x4420
[  261.291008]  ? dst_output+0x180/0x180
[  261.297003]  ? sched_clock_cpu+0x126/0x170
[  261.301003]  ? find_held_lock+0x39/0x1c0
[  261.306155]  ? stop_critical_timings+0x420/0x420
[  261.311004]  ? check_flags.part.36+0x450/0x450
[  261.315005]  ? _raw_spin_unlock_irq+0x29/0x40
[  261.320995]  ? _raw_spin_unlock_irq+0x29/0x40
[  261.326142]  ? cyc2ns_read_end+0x10/0x10
[  261.330139]  ? raw_bind+0x280/0x280
[  261.334138]  ? sched_clock_cpu+0x126/0x170
[  261.338995]  ? check_flags.part.36+0x450/0x450
[  261.342991]  ? __lock_acquire+0x4500/0x4500
[  261.348994]  ? inet_sendmsg+0x11c/0x500
[  261.352989]  ? dst_output+0x180/0x180
[  261.357012]  inet_sendmsg+0x11c/0x500
[ ... ]

v2:
 - clear skb->sk at reassembly routine.(Eric Dumarzet)

Fixes: fa0f527358bd ("ip: use rb trees for IP frag queue.")
Suggested-by: Eric Dumazet 
Signed-off-by: Taehee Yoo 
Reviewed-by: Eric Dumazet 
Signed-off-by: David S. Miller 
---
 net/ipv4/ip_fragment.c  | 1 +
 net/ipv6/netfilter/nf_conntrack_reasm.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 9961b2102555..09565b14ba6b 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -597,6 +597,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff 
*skb,
nextp = >next;
fp->prev = NULL;
memset(>rbnode, 0, sizeof(fp->rbnode));
+   fp->sk = NULL;
head->data_len += fp->len;
head->len += fp->len;
if (head->ip_summed != fp->ip_summed)
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c 
b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 907c2d5753dd..b9147558a8f2 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -452,6 +452,7 @@ nf_ct_frag6_reasm(struct frag_queue *fq, struct sk_buff 
*prev,  struct net_devic
else if (head->ip_summed == CHECKSUM_COMPLETE)
head->csum = csum_add(head->csum, fp->csum);
head->truesize += fp->truesize;
+   fp->sk = NULL;
}
sub_frag_mem_limit(fq->q.net, head->truesize);
 
-- 
2.17.1



[PATCH stable 4.9 v2 29/29] ipv4: frags: precedence bug in ip_expire()

2018-10-10 Thread Florian Fainelli
From: Dan Carpenter 

(commit 70837ffe3085c9a91488b52ca13ac84424da1042 upstream)

We accidentally removed the parentheses here, but they are required
because '!' has higher precedence than '&'.

Fixes: fa0f527358bd ("ip: use rb trees for IP frag queue.")
Signed-off-by: Dan Carpenter 
Signed-off-by: David S. Miller 
---
 net/ipv4/ip_fragment.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 09565b14ba6b..cc8c6ac84d08 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -203,7 +203,7 @@ static void ip_expire(unsigned long arg)
__IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
__IP_INC_STATS(net, IPSTATS_MIB_REASMTIMEOUT);
 
-   if (!qp->q.flags & INET_FRAG_FIRST_IN)
+   if (!(qp->q.flags & INET_FRAG_FIRST_IN))
goto out;
 
/* sk_buff::dev and sk_buff::rbnode are unionized. So we
-- 
2.17.1



[PATCH stable 4.9 v2 23/29] net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends

2018-10-10 Thread Florian Fainelli
From: Eric Dumazet 

After working on IP defragmentation lately, I found that some large
packets defeat CHECKSUM_COMPLETE optimization because of NIC adding
zero paddings on the last (small) fragment.

While removing the padding with pskb_trim_rcsum(), we set skb->ip_summed
to CHECKSUM_NONE, forcing a full csum validation, even if all prior
fragments had CHECKSUM_COMPLETE set.

We can instead compute the checksum of the part we are trimming,
usually smaller than the part we keep.

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 
(cherry picked from commit 88078d98d1bb085d72af8437707279e203524fa5)
---
 include/linux/skbuff.h |  5 ++---
 net/core/skbuff.c  | 14 ++
 2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 724c6abdb9e6..11974e63a41d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2954,6 +2954,7 @@ static inline unsigned char *skb_push_rcsum(struct 
sk_buff *skb,
return skb->data;
 }
 
+int pskb_trim_rcsum_slow(struct sk_buff *skb, unsigned int len);
 /**
  * pskb_trim_rcsum - trim received skb and update checksum
  * @skb: buffer to trim
@@ -2967,9 +2968,7 @@ static inline int pskb_trim_rcsum(struct sk_buff *skb, 
unsigned int len)
 {
if (likely(len >= skb->len))
return 0;
-   if (skb->ip_summed == CHECKSUM_COMPLETE)
-   skb->ip_summed = CHECKSUM_NONE;
-   return __pskb_trim(skb, len);
+   return pskb_trim_rcsum_slow(skb, len);
 }
 
 static inline int __skb_trim_rcsum(struct sk_buff *skb, unsigned int len)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 4e545e4432eb..038ec74fa131 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1578,6 +1578,20 @@ int ___pskb_trim(struct sk_buff *skb, unsigned int len)
 }
 EXPORT_SYMBOL(___pskb_trim);
 
+/* Note : use pskb_trim_rcsum() instead of calling this directly
+ */
+int pskb_trim_rcsum_slow(struct sk_buff *skb, unsigned int len)
+{
+   if (skb->ip_summed == CHECKSUM_COMPLETE) {
+   int delta = skb->len - len;
+
+   skb->csum = csum_sub(skb->csum,
+skb_checksum(skb, len, delta, 0));
+   }
+   return __pskb_trim(skb, len);
+}
+EXPORT_SYMBOL(pskb_trim_rcsum_slow);
+
 /**
  * __pskb_pull_tail - advance tail of skb header
  * @skb: buffer to reallocate
-- 
2.17.1



[PATCH stable 4.9 v2 22/29] ipv6: defrag: drop non-last frags smaller than min mtu

2018-10-10 Thread Florian Fainelli
From: Florian Westphal 

don't bother with pathological cases, they only waste cycles.
IPv6 requires a minimum MTU of 1280 so we should never see fragments
smaller than this (except last frag).

v3: don't use awkward "-offset + len"
v2: drop IPv4 part, which added same check w. IPV4_MIN_MTU (68).
There were concerns that there could be even smaller frags
generated by intermediate nodes, e.g. on radio networks.

Cc: Peter Oskolkov 
Cc: Eric Dumazet 
Signed-off-by: Florian Westphal 
Signed-off-by: David S. Miller 
(cherry picked from commit 0ed4229b08c13c84a3c301a08defdc9e7f4467e6)
---
 net/ipv6/netfilter/nf_conntrack_reasm.c | 4 
 net/ipv6/reassembly.c   | 4 
 2 files changed, 8 insertions(+)

diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c 
b/net/ipv6/netfilter/nf_conntrack_reasm.c
index ff49d1f2c8cb..b81541701346 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -564,6 +564,10 @@ int nf_ct_frag6_gather(struct net *net, struct sk_buff 
*skb, u32 user)
hdr = ipv6_hdr(skb);
fhdr = (struct frag_hdr *)skb_transport_header(skb);
 
+   if (skb->len - skb_network_offset(skb) < IPV6_MIN_MTU &&
+   fhdr->frag_off & htons(IP6_MF))
+   return -EINVAL;
+
skb_orphan(skb);
fq = fq_find(net, fhdr->identification, user, hdr,
 skb->dev ? skb->dev->ifindex : 0);
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index dbe726c9a2ae..78656bbe50e7 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -516,6 +516,10 @@ static int ipv6_frag_rcv(struct sk_buff *skb)
return 1;
}
 
+   if (skb->len - skb_network_offset(skb) < IPV6_MIN_MTU &&
+   fhdr->frag_off & htons(IP6_MF))
+   goto fail_hdr;
+
iif = skb->dev ? skb->dev->ifindex : 0;
fq = fq_find(net, fhdr->identification, hdr, iif);
if (fq) {
-- 
2.17.1



[PATCH stable 4.9 v2 24/29] net: add rb_to_skb() and other rb tree helpers

2018-10-10 Thread Florian Fainelli
From: Eric Dumazet 

Geeralize private netem_rb_to_skb()

TCP rtx queue will soon be converted to rb-tree,
so we will need skb_rbtree_walk() helpers.

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 
(cherry picked from commit 18a4c0eab2623cc95be98a1e6af1ad18e7695977)
---
 include/linux/skbuff.h | 18 ++
 net/ipv4/tcp_input.c   | 33 -
 2 files changed, 30 insertions(+), 21 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 11974e63a41d..7e7e12aeaf82 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2988,6 +2988,12 @@ static inline int __skb_grow_rcsum(struct sk_buff *skb, 
unsigned int len)
 
 #define rb_to_skb(rb) rb_entry_safe(rb, struct sk_buff, rbnode)
 
+#define rb_to_skb(rb) rb_entry_safe(rb, struct sk_buff, rbnode)
+#define skb_rb_first(root) rb_to_skb(rb_first(root))
+#define skb_rb_last(root)  rb_to_skb(rb_last(root))
+#define skb_rb_next(skb)   rb_to_skb(rb_next(&(skb)->rbnode))
+#define skb_rb_prev(skb)   rb_to_skb(rb_prev(&(skb)->rbnode))
+
 #define skb_queue_walk(queue, skb) \
for (skb = (queue)->next;   
\
 skb != (struct sk_buff *)(queue);  
\
@@ -3002,6 +3008,18 @@ static inline int __skb_grow_rcsum(struct sk_buff *skb, 
unsigned int len)
for (; skb != (struct sk_buff *)(queue);
\
 skb = skb->next)
 
+#define skb_rbtree_walk(skb, root) 
\
+   for (skb = skb_rb_first(root); skb != NULL; 
\
+skb = skb_rb_next(skb))
+
+#define skb_rbtree_walk_from(skb)  
\
+   for (; skb != NULL; 
\
+skb = skb_rb_next(skb))
+
+#define skb_rbtree_walk_from_safe(skb, tmp)
\
+   for (; tmp = skb ? skb_rb_next(skb) : NULL, (skb != NULL);  
\
+skb = tmp)
+
 #define skb_queue_walk_from_safe(queue, skb, tmp)  
\
for (tmp = skb->next;   
\
 skb != (struct sk_buff *)(queue);  
\
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 9d0b73aa649f..c169a2b261be 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4406,7 +4406,7 @@ static void tcp_ofo_queue(struct sock *sk)
 
p = rb_first(>out_of_order_queue);
while (p) {
-   skb = rb_entry(p, struct sk_buff, rbnode);
+   skb = rb_to_skb(p);
if (after(TCP_SKB_CB(skb)->seq, tp->rcv_nxt))
break;
 
@@ -4470,7 +4470,7 @@ static int tcp_try_rmem_schedule(struct sock *sk, struct 
sk_buff *skb,
 static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 {
struct tcp_sock *tp = tcp_sk(sk);
-   struct rb_node **p, *q, *parent;
+   struct rb_node **p, *parent;
struct sk_buff *skb1;
u32 seq, end_seq;
bool fragstolen;
@@ -4529,7 +4529,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct 
sk_buff *skb)
parent = NULL;
while (*p) {
parent = *p;
-   skb1 = rb_entry(parent, struct sk_buff, rbnode);
+   skb1 = rb_to_skb(parent);
if (before(seq, TCP_SKB_CB(skb1)->seq)) {
p = >rb_left;
continue;
@@ -4574,9 +4574,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct 
sk_buff *skb)
 
 merge_right:
/* Remove other segments covered by skb. */
-   while ((q = rb_next(>rbnode)) != NULL) {
-   skb1 = rb_entry(q, struct sk_buff, rbnode);
-
+   while ((skb1 = skb_rb_next(skb)) != NULL) {
if (!after(end_seq, TCP_SKB_CB(skb1)->seq))
break;
if (before(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
@@ -4591,7 +4589,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct 
sk_buff *skb)
tcp_drop(sk, skb1);
}
/* If there is no skb after us, we are the last_skb ! */
-   if (!q)
+   if (!skb1)
tp->ooo_last_skb = skb;
 
 add_sack:
@@ -4792,7 +4790,7 @@ static struct sk_buff *tcp_skb_next(struct sk_buff *skb, 
struct sk_buff_head *li
if (list)
return !skb_queue_is_last(list, skb) ? skb->next : NULL;
 
-   return rb_entry_safe(rb_next(>rbnode), struct sk_buff, rbnode);
+   return skb_rb_next(skb);
 }
 
 static struct sk_buff *tcp_collapse_one(struct sock *sk, struct sk_buff *skb,
@@ -4821,7 +4819,7 @@ static void tcp_rbtree_insert(struct rb_root *root, 
struct sk_buff *skb)
 
while (*p) {
parent = *p;
-   skb1 = rb_entry(parent, struct sk_buff, rbnode);
+

[PATCH stable 4.9 v2 25/29] ip: use rb trees for IP frag queue.

2018-10-10 Thread Florian Fainelli
From: Peter Oskolkov 

(commit fa0f527358bd900ef92f925878ed6bfbd51305cc upstream)

Similar to TCP OOO RX queue, it makes sense to use rb trees to store
IP fragments, so that OOO fragments are inserted faster.

Tested:

- a follow-up patch contains a rather comprehensive ip defrag
  self-test (functional)
- ran neper `udp_stream -c -H  -F 100 -l 300 -T 20`:
netstat --statistics
Ip:
282078937 total packets received
0 forwarded
0 incoming packets discarded
946760 incoming packets delivered
18743456 requests sent out
101 fragments dropped after timeout
282077129 reassemblies required
944952 packets reassembled ok
262734239 packet reassembles failed
   (The numbers/stats above are somewhat better re:
reassemblies vs a kernel without this patchset. More
comprehensive performance testing TBD).

Reported-by: Jann Horn 
Reported-by: Juha-Matti Tilli 
Suggested-by: Eric Dumazet 
Signed-off-by: Peter Oskolkov 
Signed-off-by: Eric Dumazet 
Cc: Florian Westphal 
Signed-off-by: David S. Miller 
---
 include/linux/skbuff.h  |   4 +-
 include/net/inet_frag.h |   3 +-
 net/ipv4/inet_fragment.c|  16 ++-
 net/ipv4/ip_fragment.c  | 182 +---
 net/ipv6/netfilter/nf_conntrack_reasm.c |   1 +
 net/ipv6/reassembly.c   |   1 +
 6 files changed, 117 insertions(+), 90 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 7e7e12aeaf82..e90fe6b83e00 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -643,14 +643,14 @@ struct sk_buff {
struct skb_mstamp skb_mstamp;
};
};
-   struct rb_node  rbnode; /* used in netem & tcp stack */
+   struct rb_node  rbnode; /* used in netem, ip4 defrag, 
and tcp stack */
};
 
union {
+   struct sock *sk;
int ip_defrag_offset;
};
 
-   struct sock *sk;
struct net_device   *dev;
 
/*
diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index f47678d2ccc2..1ff0433d94a7 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -74,7 +74,8 @@ struct inet_frag_queue {
struct timer_list   timer;
spinlock_t  lock;
atomic_trefcnt;
-   struct sk_buff  *fragments;
+   struct sk_buff  *fragments;  /* Used in IPv6. */
+   struct rb_root  rb_fragments; /* Used in IPv4. */
struct sk_buff  *fragments_tail;
ktime_t stamp;
int len;
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 47c240f50b99..535fa57af51e 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -136,12 +136,16 @@ void inet_frag_destroy(struct inet_frag_queue *q)
fp = q->fragments;
nf = q->net;
f = nf->f;
-   while (fp) {
-   struct sk_buff *xp = fp->next;
-
-   sum_truesize += fp->truesize;
-   kfree_skb(fp);
-   fp = xp;
+   if (fp) {
+   do {
+   struct sk_buff *xp = fp->next;
+
+   sum_truesize += fp->truesize;
+   kfree_skb(fp);
+   fp = xp;
+   } while (fp);
+   } else {
+   sum_truesize = skb_rbtree_purge(>rb_fragments);
}
sum = sum_truesize + f->qsize;
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 8bfb34e9ea32..11d3dc649ef0 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -134,7 +134,7 @@ static bool frag_expire_skip_icmp(u32 user)
 static void ip_expire(unsigned long arg)
 {
const struct iphdr *iph;
-   struct sk_buff *head;
+   struct sk_buff *head = NULL;
struct net *net;
struct ipq *qp;
int err;
@@ -150,14 +150,31 @@ static void ip_expire(unsigned long arg)
 
ipq_kill(qp);
__IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
-
-   head = qp->q.fragments;
-
__IP_INC_STATS(net, IPSTATS_MIB_REASMTIMEOUT);
 
-   if (!(qp->q.flags & INET_FRAG_FIRST_IN) || !head)
+   if (!qp->q.flags & INET_FRAG_FIRST_IN)
goto out;
 
+   /* sk_buff::dev and sk_buff::rbnode are unionized. So we
+* pull the head out of the tree in order to be able to
+* deal with head->dev.
+*/
+   if (qp->q.fragments) {
+   head = qp->q.fragments;
+   qp->q.fragments = head->next;
+   } else {
+   head = skb_rb_first(>q.rb_fragments);
+   if (!head)
+   goto out;
+   rb_erase(>rbnode, >q.rb_fragments);
+   memset(>rbnode, 0, sizeof(head->rbnode));
+   

[PATCH stable 4.9 v2 26/29] ip: add helpers to process in-order fragments faster.

2018-10-10 Thread Florian Fainelli
From: Peter Oskolkov 

This patch introduces several helper functions/macros that will be
used in the follow-up patch. No runtime changes yet.

The new logic (fully implemented in the second patch) is as follows:

* Nodes in the rb-tree will now contain not single fragments, but lists
  of consecutive fragments ("runs").

* At each point in time, the current "active" run at the tail is
  maintained/tracked. Fragments that arrive in-order, adjacent
  to the previous tail fragment, are added to this tail run without
  triggering the re-balancing of the rb-tree.

* If a fragment arrives out of order with the offset _before_ the tail run,
  it is inserted into the rb-tree as a single fragment.

* If a fragment arrives after the current tail fragment (with a gap),
  it starts a new "tail" run, as is inserted into the rb-tree
  at the end as the head of the new run.

skb->cb is used to store additional information
needed here (suggested by Eric Dumazet).

Reported-by: Willem de Bruijn 
Signed-off-by: Peter Oskolkov 
Cc: Eric Dumazet 
Cc: Florian Westphal 
Signed-off-by: David S. Miller 
(cherry picked from commit 353c9cb360874e737fb000545f783df756c06f9a)
---
 include/net/inet_frag.h |  6 
 net/ipv4/ip_fragment.c  | 73 +
 2 files changed, 79 insertions(+)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 1ff0433d94a7..a3812e9c8fee 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -56,7 +56,9 @@ struct frag_v6_compare_key {
  * @lock: spinlock protecting this frag
  * @refcnt: reference count of the queue
  * @fragments: received fragments head
+ * @rb_fragments: received fragments rb-tree root
  * @fragments_tail: received fragments tail
+ * @last_run_head: the head of the last "run". see ip_fragment.c
  * @stamp: timestamp of the last received fragment
  * @len: total length of the original datagram
  * @meat: length of received fragments so far
@@ -77,6 +79,7 @@ struct inet_frag_queue {
struct sk_buff  *fragments;  /* Used in IPv6. */
struct rb_root  rb_fragments; /* Used in IPv4. */
struct sk_buff  *fragments_tail;
+   struct sk_buff  *last_run_head;
ktime_t stamp;
int len;
int meat;
@@ -112,6 +115,9 @@ void inet_frag_kill(struct inet_frag_queue *q);
 void inet_frag_destroy(struct inet_frag_queue *q);
 struct inet_frag_queue *inet_frag_find(struct netns_frags *nf, void *key);
 
+/* Free all skbs in the queue; return the sum of their truesizes. */
+unsigned int inet_frag_rbtree_purge(struct rb_root *root);
+
 static inline void inet_frag_put(struct inet_frag_queue *q)
 {
if (atomic_dec_and_test(>refcnt))
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 11d3dc649ef0..5e2121b82588 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -56,6 +56,57 @@
  */
 static const char ip_frag_cache_name[] = "ip4-frags";
 
+/* Use skb->cb to track consecutive/adjacent fragments coming at
+ * the end of the queue. Nodes in the rb-tree queue will
+ * contain "runs" of one or more adjacent fragments.
+ *
+ * Invariants:
+ * - next_frag is NULL at the tail of a "run";
+ * - the head of a "run" has the sum of all fragment lengths in frag_run_len.
+ */
+struct ipfrag_skb_cb {
+   struct inet_skb_parmh;
+   struct sk_buff  *next_frag;
+   int frag_run_len;
+};
+
+#define FRAG_CB(skb)   ((struct ipfrag_skb_cb *)((skb)->cb))
+
+static void ip4_frag_init_run(struct sk_buff *skb)
+{
+   BUILD_BUG_ON(sizeof(struct ipfrag_skb_cb) > sizeof(skb->cb));
+
+   FRAG_CB(skb)->next_frag = NULL;
+   FRAG_CB(skb)->frag_run_len = skb->len;
+}
+
+/* Append skb to the last "run". */
+static void ip4_frag_append_to_last_run(struct inet_frag_queue *q,
+   struct sk_buff *skb)
+{
+   RB_CLEAR_NODE(>rbnode);
+   FRAG_CB(skb)->next_frag = NULL;
+
+   FRAG_CB(q->last_run_head)->frag_run_len += skb->len;
+   FRAG_CB(q->fragments_tail)->next_frag = skb;
+   q->fragments_tail = skb;
+}
+
+/* Create a new "run" with the skb. */
+static void ip4_frag_create_run(struct inet_frag_queue *q, struct sk_buff *skb)
+{
+   if (q->last_run_head)
+   rb_link_node(>rbnode, >last_run_head->rbnode,
+>last_run_head->rbnode.rb_right);
+   else
+   rb_link_node(>rbnode, NULL, >rb_fragments.rb_node);
+   rb_insert_color(>rbnode, >rb_fragments);
+
+   ip4_frag_init_run(skb);
+   q->fragments_tail = skb;
+   q->last_run_head = skb;
+}
+
 /* Describe an entry in the "incomplete datagrams" queue. */
 struct ipq {
struct inet_frag_queue q;
@@ -652,6 +703,28 @@ struct sk_buff *ip_check_defrag(struct net *net, struct 
sk_buff *skb, u32 user)
 }
 EXPORT_SYMBOL(ip_check_defrag);
 
+unsigned int inet_frag_rbtree_purge(struct 

[PATCH stable 4.9 v2 27/29] ip: process in-order fragments efficiently

2018-10-10 Thread Florian Fainelli
From: Peter Oskolkov 

This patch changes the runtime behavior of IP defrag queue:
incoming in-order fragments are added to the end of the current
list/"run" of in-order fragments at the tail.

On some workloads, UDP stream performance is substantially improved:

RX: ./udp_stream -F 10 -T 2 -l 60
TX: ./udp_stream -c -H  -F 10 -T 5 -l 60

with this patchset applied on a 10Gbps receiver:

  throughput=9524.18
  throughput_units=Mbit/s

upstream (net-next):

  throughput=4608.93
  throughput_units=Mbit/s

Reported-by: Willem de Bruijn 
Signed-off-by: Peter Oskolkov 
Cc: Eric Dumazet 
Cc: Florian Westphal 
Signed-off-by: David S. Miller 
(cherry picked from commit a4fd284a1f8fd4b6c59aa59db2185b1e17c5c11c)
---
 net/ipv4/inet_fragment.c |   2 +-
 net/ipv4/ip_fragment.c   | 110 ---
 2 files changed, 70 insertions(+), 42 deletions(-)

diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 535fa57af51e..8323d33c0ce2 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -145,7 +145,7 @@ void inet_frag_destroy(struct inet_frag_queue *q)
fp = xp;
} while (fp);
} else {
-   sum_truesize = skb_rbtree_purge(>rb_fragments);
+   sum_truesize = inet_frag_rbtree_purge(>rb_fragments);
}
sum = sum_truesize + f->qsize;
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 5e2121b82588..9961b2102555 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -125,8 +125,8 @@ static u8 ip4_frag_ecn(u8 tos)
 
 static struct inet_frags ip4_frags;
 
-static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
-struct net_device *dev);
+static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
+struct sk_buff *prev_tail, struct net_device *dev);
 
 
 static void ip4_frag_init(struct inet_frag_queue *q, const void *a)
@@ -217,7 +217,12 @@ static void ip_expire(unsigned long arg)
head = skb_rb_first(>q.rb_fragments);
if (!head)
goto out;
-   rb_erase(>rbnode, >q.rb_fragments);
+   if (FRAG_CB(head)->next_frag)
+   rb_replace_node(>rbnode,
+   _CB(head)->next_frag->rbnode,
+   >q.rb_fragments);
+   else
+   rb_erase(>rbnode, >q.rb_fragments);
memset(>rbnode, 0, sizeof(head->rbnode));
barrier();
}
@@ -318,7 +323,7 @@ static int ip_frag_reinit(struct ipq *qp)
return -ETIMEDOUT;
}
 
-   sum_truesize = skb_rbtree_purge(>q.rb_fragments);
+   sum_truesize = inet_frag_rbtree_purge(>q.rb_fragments);
sub_frag_mem_limit(qp->q.net, sum_truesize);
 
qp->q.flags = 0;
@@ -327,6 +332,7 @@ static int ip_frag_reinit(struct ipq *qp)
qp->q.fragments = NULL;
qp->q.rb_fragments = RB_ROOT;
qp->q.fragments_tail = NULL;
+   qp->q.last_run_head = NULL;
qp->iif = 0;
qp->ecn = 0;
 
@@ -338,7 +344,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
 {
struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
struct rb_node **rbn, *parent;
-   struct sk_buff *skb1;
+   struct sk_buff *skb1, *prev_tail;
struct net_device *dev;
unsigned int fragsize;
int flags, offset;
@@ -416,38 +422,41 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
 */
 
/* Find out where to put this fragment.  */
-   skb1 = qp->q.fragments_tail;
-   if (!skb1) {
-   /* This is the first fragment we've received. */
-   rb_link_node(>rbnode, NULL, >q.rb_fragments.rb_node);
-   qp->q.fragments_tail = skb;
-   } else if ((skb1->ip_defrag_offset + skb1->len) < end) {
-   /* This is the common/special case: skb goes to the end. */
+   prev_tail = qp->q.fragments_tail;
+   if (!prev_tail)
+   ip4_frag_create_run(>q, skb);  /* First fragment. */
+   else if (prev_tail->ip_defrag_offset + prev_tail->len < end) {
+   /* This is the common case: skb goes to the end. */
/* Detect and discard overlaps. */
-   if (offset < (skb1->ip_defrag_offset + skb1->len))
+   if (offset < prev_tail->ip_defrag_offset + prev_tail->len)
goto discard_qp;
-   /* Insert after skb1. */
-   rb_link_node(>rbnode, >rbnode, 
>rbnode.rb_right);
-   qp->q.fragments_tail = skb;
+   if (offset == prev_tail->ip_defrag_offset + prev_tail->len)
+   ip4_frag_append_to_last_run(>q, skb);
+   else
+   ip4_frag_create_run(>q, skb);
} else {
-   /* Binary search. Note that skb can become the first 

[PATCH stable 4.9 v2 20/29] net: speed up skb_rbtree_purge()

2018-10-10 Thread Florian Fainelli
From: Eric Dumazet 

As measured in my prior patch ("sch_netem: faster rb tree removal"),
rbtree_postorder_for_each_entry_safe() is nice looking but much slower
than using rb_next() directly, except when tree is small enough
to fit in CPU caches (then the cost is the same)

Also note that there is not even an increase of text size :
$ size net/core/skbuff.o.before net/core/skbuff.o
   textdata bss dec hex filename
  407111298   0   42009a419 net/core/skbuff.o.before
  407111298   0   42009a419 net/core/skbuff.o

From: Eric Dumazet 

Signed-off-by: David S. Miller 
(cherry picked from commit 7c90584c66cc4b033a3b684b0e0950f79e7b7166)
---
 net/core/skbuff.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 84c731aef0d8..96a553da1518 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2433,12 +2433,15 @@ EXPORT_SYMBOL(skb_queue_purge);
  */
 void skb_rbtree_purge(struct rb_root *root)
 {
-   struct sk_buff *skb, *next;
+   struct rb_node *p = rb_first(root);
 
-   rbtree_postorder_for_each_entry_safe(skb, next, root, rbnode)
-   kfree_skb(skb);
+   while (p) {
+   struct sk_buff *skb = rb_entry(p, struct sk_buff, rbnode);
 
-   *root = RB_ROOT;
+   p = rb_next(p);
+   rb_erase(>rbnode, root);
+   kfree_skb(skb);
+   }
 }
 
 /**
-- 
2.17.1



[PATCH stable 4.9 v2 19/29] ip: discard IPv4 datagrams with overlapping segments.

2018-10-10 Thread Florian Fainelli
From: Peter Oskolkov 

This behavior is required in IPv6, and there is little need
to tolerate overlapping fragments in IPv4. This change
simplifies the code and eliminates potential DDoS attack vectors.

Tested: ran ip_defrag selftest (not yet available uptream).

Suggested-by: David S. Miller 
Signed-off-by: Peter Oskolkov 
Signed-off-by: Eric Dumazet 
Cc: Florian Westphal 
Acked-by: Stephen Hemminger 
Signed-off-by: David S. Miller 
(cherry picked from commit 7969e5c40dfd04799d4341f1b7cd266b6e47f227)
---
 include/uapi/linux/snmp.h |  1 +
 net/ipv4/ip_fragment.c| 75 ++-
 net/ipv4/proc.c   |  1 +
 3 files changed, 21 insertions(+), 56 deletions(-)

diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index e7a31f830690..3442a26d36d9 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -55,6 +55,7 @@ enum
IPSTATS_MIB_ECT1PKTS,   /* InECT1Pkts */
IPSTATS_MIB_ECT0PKTS,   /* InECT0Pkts */
IPSTATS_MIB_CEPKTS, /* InCEPkts */
+   IPSTATS_MIB_REASM_OVERLAPS, /* ReasmOverlaps */
__IPSTATS_MIB_MAX
 };
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 73c0adc61a65..8bfb34e9ea32 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -275,6 +275,7 @@ static int ip_frag_reinit(struct ipq *qp)
 /* Add new segment to existing queue. */
 static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 {
+   struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
struct sk_buff *prev, *next;
struct net_device *dev;
unsigned int fragsize;
@@ -355,65 +356,23 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
}
 
 found:
-   /* We found where to put this one.  Check for overlap with
-* preceding fragment, and, if needed, align things so that
-* any overlaps are eliminated.
+   /* RFC5722, Section 4, amended by Errata ID : 3089
+*  When reassembling an IPv6 datagram, if
+*   one or more its constituent fragments is determined to be an
+*   overlapping fragment, the entire datagram (and any constituent
+*   fragments) MUST be silently discarded.
+*
+* We do the same here for IPv4.
 */
-   if (prev) {
-   int i = (prev->ip_defrag_offset + prev->len) - offset;
 
-   if (i > 0) {
-   offset += i;
-   err = -EINVAL;
-   if (end <= offset)
-   goto err;
-   err = -ENOMEM;
-   if (!pskb_pull(skb, i))
-   goto err;
-   if (skb->ip_summed != CHECKSUM_UNNECESSARY)
-   skb->ip_summed = CHECKSUM_NONE;
-   }
-   }
+   /* Is there an overlap with the previous fragment? */
+   if (prev &&
+   (prev->ip_defrag_offset + prev->len) > offset)
+   goto discard_qp;
 
-   err = -ENOMEM;
-
-   while (next && next->ip_defrag_offset < end) {
-   int i = end - next->ip_defrag_offset; /* overlap is 'i' bytes */
-
-   if (i < next->len) {
-   int delta = -next->truesize;
-
-   /* Eat head of the next overlapped fragment
-* and leave the loop. The next ones cannot overlap.
-*/
-   if (!pskb_pull(next, i))
-   goto err;
-   delta += next->truesize;
-   if (delta)
-   add_frag_mem_limit(qp->q.net, delta);
-   next->ip_defrag_offset += i;
-   qp->q.meat -= i;
-   if (next->ip_summed != CHECKSUM_UNNECESSARY)
-   next->ip_summed = CHECKSUM_NONE;
-   break;
-   } else {
-   struct sk_buff *free_it = next;
-
-   /* Old fragment is completely overridden with
-* new one drop it.
-*/
-   next = next->next;
-
-   if (prev)
-   prev->next = next;
-   else
-   qp->q.fragments = next;
-
-   qp->q.meat -= free_it->len;
-   sub_frag_mem_limit(qp->q.net, free_it->truesize);
-   kfree_skb(free_it);
-   }
-   }
+   /* Is there an overlap with the next fragment? */
+   if (next && next->ip_defrag_offset < end)
+   goto discard_qp;
 
/* Note : skb->ip_defrag_offset and skb->dev share the same location */
dev = skb->dev;
@@ -461,6 +420,10 @@ static int ip_frag_queue(struct 

[PATCH stable 4.9 v2 21/29] net: modify skb_rbtree_purge to return the truesize of all purged skbs.

2018-10-10 Thread Florian Fainelli
From: Peter Oskolkov 

Tested: see the next patch is the series.

Suggested-by: Eric Dumazet 
Signed-off-by: Peter Oskolkov 
Signed-off-by: Eric Dumazet 
Cc: Florian Westphal 
Signed-off-by: David S. Miller 
(cherry picked from commit 385114dec8a49b5e5945e77ba7de6356106713f4)
---
 include/linux/skbuff.h | 2 +-
 net/core/skbuff.c  | 6 +-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index c7cca35fcf6d..724c6abdb9e6 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2418,7 +2418,7 @@ static inline void __skb_queue_purge(struct sk_buff_head 
*list)
kfree_skb(skb);
 }
 
-void skb_rbtree_purge(struct rb_root *root);
+unsigned int skb_rbtree_purge(struct rb_root *root);
 
 void *netdev_alloc_frag(unsigned int fragsz);
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 96a553da1518..4e545e4432eb 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2425,23 +2425,27 @@ EXPORT_SYMBOL(skb_queue_purge);
 /**
  * skb_rbtree_purge - empty a skb rbtree
  * @root: root of the rbtree to empty
+ * Return value: the sum of truesizes of all purged skbs.
  *
  * Delete all buffers on an _buff rbtree. Each buffer is removed from
  * the list and one reference dropped. This function does not take
  * any lock. Synchronization should be handled by the caller (e.g., TCP
  * out-of-order queue is protected by the socket lock).
  */
-void skb_rbtree_purge(struct rb_root *root)
+unsigned int skb_rbtree_purge(struct rb_root *root)
 {
struct rb_node *p = rb_first(root);
+   unsigned int sum = 0;
 
while (p) {
struct sk_buff *skb = rb_entry(p, struct sk_buff, rbnode);
 
p = rb_next(p);
rb_erase(>rbnode, root);
+   sum += skb->truesize;
kfree_skb(skb);
}
+   return sum;
 }
 
 /**
-- 
2.17.1



[PATCH stable 4.9 v2 11/29] inet: frags: remove inet_frag_maybe_warn_overflow()

2018-10-10 Thread Florian Fainelli
From: Eric Dumazet 

This function is obsolete, after rhashtable addition to inet defrag.

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 
(cherry picked from commit 2d44ed22e607f9a285b049de2263e3840673a260)
---
 include/net/inet_frag.h |  2 --
 net/ieee802154/6lowpan/reassembly.c |  5 ++---
 net/ipv4/inet_fragment.c| 11 ---
 net/ipv4/ip_fragment.c  |  5 ++---
 net/ipv6/netfilter/nf_conntrack_reasm.c |  5 ++---
 net/ipv6/reassembly.c   |  5 ++---
 6 files changed, 8 insertions(+), 25 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 7e984045b2b7..23161bf5d899 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -109,8 +109,6 @@ void inet_frags_exit_net(struct netns_frags *nf);
 void inet_frag_kill(struct inet_frag_queue *q);
 void inet_frag_destroy(struct inet_frag_queue *q);
 struct inet_frag_queue *inet_frag_find(struct netns_frags *nf, void *key);
-void inet_frag_maybe_warn_overflow(struct inet_frag_queue *q,
-  const char *prefix);
 
 static inline void inet_frag_put(struct inet_frag_queue *q)
 {
diff --git a/net/ieee802154/6lowpan/reassembly.c 
b/net/ieee802154/6lowpan/reassembly.c
index a63360a05108..b54015981af9 100644
--- a/net/ieee802154/6lowpan/reassembly.c
+++ b/net/ieee802154/6lowpan/reassembly.c
@@ -83,10 +83,9 @@ fq_find(struct net *net, const struct lowpan_802154_cb *cb,
struct inet_frag_queue *q;
 
q = inet_frag_find(_lowpan->frags, );
-   if (IS_ERR_OR_NULL(q)) {
-   inet_frag_maybe_warn_overflow(q, pr_fmt());
+   if (!q)
return NULL;
-   }
+
return container_of(q, struct lowpan_frag_queue, q);
 }
 
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index a50ac25878aa..47c240f50b99 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -217,14 +217,3 @@ struct inet_frag_queue *inet_frag_find(struct netns_frags 
*nf, void *key)
return inet_frag_create(nf, key);
 }
 EXPORT_SYMBOL(inet_frag_find);
-
-void inet_frag_maybe_warn_overflow(struct inet_frag_queue *q,
-  const char *prefix)
-{
-   static const char msg[] = "inet_frag_find: Fragment hash bucket"
-   " list length grew over limit. Dropping fragment.\n";
-
-   if (PTR_ERR(q) == -ENOBUFS)
-   net_dbg_ratelimited("%s%s", prefix, msg);
-}
-EXPORT_SYMBOL(inet_frag_maybe_warn_overflow);
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 308592a8ba97..696bfef06caa 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -219,10 +219,9 @@ static struct ipq *ip_find(struct net *net, struct iphdr 
*iph,
struct inet_frag_queue *q;
 
q = inet_frag_find(>ipv4.frags, );
-   if (IS_ERR_OR_NULL(q)) {
-   inet_frag_maybe_warn_overflow(q, pr_fmt());
+   if (!q)
return NULL;
-   }
+
return container_of(q, struct ipq, q);
 }
 
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c 
b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 314568d8b84a..267f2ae2d05c 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -177,10 +177,9 @@ static struct frag_queue *fq_find(struct net *net, __be32 
id, u32 user,
struct inet_frag_queue *q;
 
q = inet_frag_find(>nf_frag.frags, );
-   if (IS_ERR_OR_NULL(q)) {
-   inet_frag_maybe_warn_overflow(q, pr_fmt());
+   if (!q)
return NULL;
-   }
+
return container_of(q, struct frag_queue, q);
 }
 
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 629a45a4c79f..6de4cec69054 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -154,10 +154,9 @@ fq_find(struct net *net, __be32 id, const struct ipv6hdr 
*hdr, int iif)
key.iif = 0;
 
q = inet_frag_find(>ipv6.frags, );
-   if (IS_ERR_OR_NULL(q)) {
-   inet_frag_maybe_warn_overflow(q, pr_fmt());
+   if (!q)
return NULL;
-   }
+
return container_of(q, struct frag_queue, q);
 }
 
-- 
2.17.1



[PATCH stable 4.9 v2 14/29] ipv6: frags: rewrite ip6_expire_frag_queue()

2018-10-10 Thread Florian Fainelli
From: Eric Dumazet 

Make it similar to IPv4 ip_expire(), and release the lock
before calling icmp functions.

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 
(cherry picked from commit 05c0b86b9696802fd0ce5676a92a63f1b455bdf3)
---
 net/ipv6/reassembly.c | 24 
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 8a4ece339c19..1cb45a0d1a0e 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -92,7 +92,9 @@ EXPORT_SYMBOL(ip6_frag_init);
 void ip6_expire_frag_queue(struct net *net, struct frag_queue *fq)
 {
struct net_device *dev = NULL;
+   struct sk_buff *head;
 
+   rcu_read_lock();
spin_lock(>q.lock);
 
if (fq->q.flags & INET_FRAG_COMPLETE)
@@ -100,28 +102,34 @@ void ip6_expire_frag_queue(struct net *net, struct 
frag_queue *fq)
 
inet_frag_kill(>q);
 
-   rcu_read_lock();
dev = dev_get_by_index_rcu(net, fq->iif);
if (!dev)
-   goto out_rcu_unlock;
+   goto out;
 
__IP6_INC_STATS(net, __in6_dev_get(dev), IPSTATS_MIB_REASMFAILS);
__IP6_INC_STATS(net, __in6_dev_get(dev), IPSTATS_MIB_REASMTIMEOUT);
 
/* Don't send error if the first segment did not arrive. */
-   if (!(fq->q.flags & INET_FRAG_FIRST_IN) || !fq->q.fragments)
-   goto out_rcu_unlock;
+   head = fq->q.fragments;
+   if (!(fq->q.flags & INET_FRAG_FIRST_IN) || !head)
+   goto out;
 
/* But use as source device on which LAST ARRIVED
 * segment was received. And do not use fq->dev
 * pointer directly, device might already disappeared.
 */
-   fq->q.fragments->dev = dev;
-   icmpv6_send(fq->q.fragments, ICMPV6_TIME_EXCEED, ICMPV6_EXC_FRAGTIME, 
0);
-out_rcu_unlock:
-   rcu_read_unlock();
+   head->dev = dev;
+   skb_get(head);
+   spin_unlock(>q.lock);
+
+   icmpv6_send(head, ICMPV6_TIME_EXCEED, ICMPV6_EXC_FRAGTIME, 0);
+   kfree_skb(head);
+   goto out_rcu_unlock;
+
 out:
spin_unlock(>q.lock);
+out_rcu_unlock:
+   rcu_read_unlock();
inet_frag_put(>q);
 }
 EXPORT_SYMBOL(ip6_expire_frag_queue);
-- 
2.17.1



[PATCH stable 4.9 v2 10/29] inet: frags: get rif of inet_frag_evicting()

2018-10-10 Thread Florian Fainelli
From: Eric Dumazet 

This refactors ip_expire() since one indentation level is removed.

Note: in the future, we should try hard to avoid the skb_clone()
since this is a serious performance cost.
Under DDOS, the ICMP message wont be sent because of rate limits.

Fact that ip6_expire_frag_queue() does not use skb_clone() is
disturbing too. Presumably IPv6 should have the same
issue than the one we fixed in commit ec4fbd64751d
("inet: frag: release spinlock before calling icmp_send()")

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 
(cherry picked from commit 399d1404be660d355192ff4df5ccc3f4159ec1e4)
---
 include/net/inet_frag.h |  5 
 net/ipv4/ip_fragment.c  | 65 -
 net/ipv6/reassembly.c   |  4 ---
 3 files changed, 32 insertions(+), 42 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index a49cdf25cef0..7e984045b2b7 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -118,11 +118,6 @@ static inline void inet_frag_put(struct inet_frag_queue *q)
inet_frag_destroy(q);
 }
 
-static inline bool inet_frag_evicting(struct inet_frag_queue *q)
-{
-   return false;
-}
-
 /* Memory Tracking Functions. */
 
 static inline int frag_mem_limit(struct netns_frags *nf)
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 8c4072b19296..308592a8ba97 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -141,8 +141,11 @@ static bool frag_expire_skip_icmp(u32 user)
  */
 static void ip_expire(unsigned long arg)
 {
-   struct ipq *qp;
+   struct sk_buff *clone, *head;
+   const struct iphdr *iph;
struct net *net;
+   struct ipq *qp;
+   int err;
 
qp = container_of((struct inet_frag_queue *) arg, struct ipq, q);
net = container_of(qp->q.net, struct net, ipv4.frags);
@@ -156,45 +159,41 @@ static void ip_expire(unsigned long arg)
ipq_kill(qp);
__IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
 
-   if (!inet_frag_evicting(>q)) {
-   struct sk_buff *clone, *head = qp->q.fragments;
-   const struct iphdr *iph;
-   int err;
+   head = qp->q.fragments;
 
-   __IP_INC_STATS(net, IPSTATS_MIB_REASMTIMEOUT);
+   __IP_INC_STATS(net, IPSTATS_MIB_REASMTIMEOUT);
 
-   if (!(qp->q.flags & INET_FRAG_FIRST_IN) || !qp->q.fragments)
-   goto out;
+   if (!(qp->q.flags & INET_FRAG_FIRST_IN) || !head)
+   goto out;
 
-   head->dev = dev_get_by_index_rcu(net, qp->iif);
-   if (!head->dev)
-   goto out;
+   head->dev = dev_get_by_index_rcu(net, qp->iif);
+   if (!head->dev)
+   goto out;
 
 
-   /* skb has no dst, perform route lookup again */
-   iph = ip_hdr(head);
-   err = ip_route_input_noref(head, iph->daddr, iph->saddr,
+   /* skb has no dst, perform route lookup again */
+   iph = ip_hdr(head);
+   err = ip_route_input_noref(head, iph->daddr, iph->saddr,
   iph->tos, head->dev);
-   if (err)
-   goto out;
+   if (err)
+   goto out;
 
-   /* Only an end host needs to send an ICMP
-* "Fragment Reassembly Timeout" message, per RFC792.
-*/
-   if (frag_expire_skip_icmp(qp->q.key.v4.user) &&
-   (skb_rtable(head)->rt_type != RTN_LOCAL))
-   goto out;
-
-   clone = skb_clone(head, GFP_ATOMIC);
-
-   /* Send an ICMP "Fragment Reassembly Timeout" message. */
-   if (clone) {
-   spin_unlock(>q.lock);
-   icmp_send(clone, ICMP_TIME_EXCEEDED,
- ICMP_EXC_FRAGTIME, 0);
-   consume_skb(clone);
-   goto out_rcu_unlock;
-   }
+   /* Only an end host needs to send an ICMP
+* "Fragment Reassembly Timeout" message, per RFC792.
+*/
+   if (frag_expire_skip_icmp(qp->q.key.v4.user) &&
+   (skb_rtable(head)->rt_type != RTN_LOCAL))
+   goto out;
+
+   clone = skb_clone(head, GFP_ATOMIC);
+
+   /* Send an ICMP "Fragment Reassembly Timeout" message. */
+   if (clone) {
+   spin_unlock(>q.lock);
+   icmp_send(clone, ICMP_TIME_EXCEEDED,
+ ICMP_EXC_FRAGTIME, 0);
+   consume_skb(clone);
+   goto out_rcu_unlock;
}
 out:
spin_unlock(>q.lock);
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 8fe3836d1410..629a45a4c79f 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -106,10 +106,6 @@ void ip6_expire_frag_queue(struct net *net, struct 
frag_queue *fq)
goto out_rcu_unlock;
 
__IP6_INC_STATS(net, __in6_dev_get(dev), 

[PATCH stable 4.9 v2 18/29] inet: frags: fix ip6frag_low_thresh boundary

2018-10-10 Thread Florian Fainelli
From: Eric Dumazet 

Giving an integer to proc_doulongvec_minmax() is dangerous on 64bit arches,
since linker might place next to it a non zero value preventing a change
to ip6frag_low_thresh.

ip6frag_low_thresh is not used anymore in the kernel, but we do not
want to prematuraly break user scripts wanting to change it.

Since specifying a minimal value of 0 for proc_doulongvec_minmax()
is moot, let's remove these zero values in all defrag units.

Fixes: 6e00f7dd5e4e ("ipv6: frags: fix /proc/sys/net/ipv6/ip6frag_low_thresh")
Signed-off-by: Eric Dumazet 
Reported-by: Maciej Żenczykowski 
Signed-off-by: David S. Miller 
(cherry picked from commit 3d23401283e80ceb03f765842787e0e79ff598b7)
---
 net/ieee802154/6lowpan/reassembly.c |  2 --
 net/ipv4/ip_fragment.c  | 40 ++---
 net/ipv6/netfilter/nf_conntrack_reasm.c |  2 --
 net/ipv6/reassembly.c   |  4 +--
 4 files changed, 17 insertions(+), 31 deletions(-)

diff --git a/net/ieee802154/6lowpan/reassembly.c 
b/net/ieee802154/6lowpan/reassembly.c
index 122a625d9a66..6fca75581e13 100644
--- a/net/ieee802154/6lowpan/reassembly.c
+++ b/net/ieee802154/6lowpan/reassembly.c
@@ -410,7 +410,6 @@ int lowpan_frag_rcv(struct sk_buff *skb, u8 frag_type)
 }
 
 #ifdef CONFIG_SYSCTL
-static long zero;
 
 static struct ctl_table lowpan_frags_ns_ctl_table[] = {
{
@@ -427,7 +426,6 @@ static struct ctl_table lowpan_frags_ns_ctl_table[] = {
.maxlen = sizeof(unsigned long),
.mode   = 0644,
.proc_handler   = proc_doulongvec_minmax,
-   .extra1 = ,
.extra2 = _net.ieee802154_lowpan.frags.high_thresh
},
{
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index e235f62dab58..73c0adc61a65 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -56,14 +56,6 @@
  */
 static const char ip_frag_cache_name[] = "ip4-frags";
 
-struct ipfrag_skb_cb
-{
-   struct inet_skb_parmh;
-   int offset;
-};
-
-#define FRAG_CB(skb)   ((struct ipfrag_skb_cb *)((skb)->cb))
-
 /* Describe an entry in the "incomplete datagrams" queue. */
 struct ipq {
struct inet_frag_queue q;
@@ -351,13 +343,13 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
 * this fragment, right?
 */
prev = qp->q.fragments_tail;
-   if (!prev || FRAG_CB(prev)->offset < offset) {
+   if (!prev || prev->ip_defrag_offset < offset) {
next = NULL;
goto found;
}
prev = NULL;
for (next = qp->q.fragments; next != NULL; next = next->next) {
-   if (FRAG_CB(next)->offset >= offset)
+   if (next->ip_defrag_offset >= offset)
break;  /* bingo! */
prev = next;
}
@@ -368,7 +360,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
 * any overlaps are eliminated.
 */
if (prev) {
-   int i = (FRAG_CB(prev)->offset + prev->len) - offset;
+   int i = (prev->ip_defrag_offset + prev->len) - offset;
 
if (i > 0) {
offset += i;
@@ -385,8 +377,8 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
 
err = -ENOMEM;
 
-   while (next && FRAG_CB(next)->offset < end) {
-   int i = end - FRAG_CB(next)->offset; /* overlap is 'i' bytes */
+   while (next && next->ip_defrag_offset < end) {
+   int i = end - next->ip_defrag_offset; /* overlap is 'i' bytes */
 
if (i < next->len) {
int delta = -next->truesize;
@@ -399,7 +391,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
delta += next->truesize;
if (delta)
add_frag_mem_limit(qp->q.net, delta);
-   FRAG_CB(next)->offset += i;
+   next->ip_defrag_offset += i;
qp->q.meat -= i;
if (next->ip_summed != CHECKSUM_UNNECESSARY)
next->ip_summed = CHECKSUM_NONE;
@@ -423,7 +415,13 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
}
}
 
-   FRAG_CB(skb)->offset = offset;
+   /* Note : skb->ip_defrag_offset and skb->dev share the same location */
+   dev = skb->dev;
+   if (dev)
+   qp->iif = dev->ifindex;
+   /* Makes sure compiler wont do silly aliasing games */
+   barrier();
+   skb->ip_defrag_offset = offset;
 
/* Insert this fragment in the chain of fragments. */
skb->next = next;
@@ -434,11 +432,6 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
else
qp->q.fragments = skb;
 
-   dev = skb->dev;
-   if (dev) {
-   qp->iif = dev->ifindex;
-   

[PATCH stable 4.9 v2 13/29] inet: frags: do not clone skb in ip_expire()

2018-10-10 Thread Florian Fainelli
From: Eric Dumazet 

An skb_clone() was added in commit ec4fbd64751d ("inet: frag: release
spinlock before calling icmp_send()")

While fixing the bug at that time, it also added a very high cost
for DDOS frags, as the ICMP rate limit is applied after this
expensive operation (skb_clone() + consume_skb(), implying memory
allocations, copy, and freeing)

We can use skb_get(head) here, all we want is to make sure skb wont
be freed by another cpu.

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 
(cherry picked from commit 1eec5d5670084ee644597bd26c25e22c69b9f748)
---
 net/ipv4/ip_fragment.c | 16 ++--
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 3dd19bebeb55..e235f62dab58 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -141,8 +141,8 @@ static bool frag_expire_skip_icmp(u32 user)
  */
 static void ip_expire(unsigned long arg)
 {
-   struct sk_buff *clone, *head;
const struct iphdr *iph;
+   struct sk_buff *head;
struct net *net;
struct ipq *qp;
int err;
@@ -185,16 +185,12 @@ static void ip_expire(unsigned long arg)
(skb_rtable(head)->rt_type != RTN_LOCAL))
goto out;
 
-   clone = skb_clone(head, GFP_ATOMIC);
+   skb_get(head);
+   spin_unlock(>q.lock);
+   icmp_send(head, ICMP_TIME_EXCEEDED, ICMP_EXC_FRAGTIME, 0);
+   kfree_skb(head);
+   goto out_rcu_unlock;
 
-   /* Send an ICMP "Fragment Reassembly Timeout" message. */
-   if (clone) {
-   spin_unlock(>q.lock);
-   icmp_send(clone, ICMP_TIME_EXCEEDED,
- ICMP_EXC_FRAGTIME, 0);
-   consume_skb(clone);
-   goto out_rcu_unlock;
-   }
 out:
spin_unlock(>q.lock);
 out_rcu_unlock:
-- 
2.17.1



[PATCH stable 4.9 v2 07/29] rhashtable: add schedule points

2018-10-10 Thread Florian Fainelli
From: Eric Dumazet 

Rehashing and destroying large hash table takes a lot of time,
and happens in process context. It is safe to add cond_resched()
in rhashtable_rehash_table() and rhashtable_free_and_destroy()

Signed-off-by: Eric Dumazet 
Acked-by: Herbert Xu 
Signed-off-by: David S. Miller 
(cherry picked from commit ae6da1f503abb5a5081f9f6c4a6881de97830f3e)
---
 lib/rhashtable.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 101dac085c62..fdffd6232365 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -251,8 +251,10 @@ static int rhashtable_rehash_table(struct rhashtable *ht)
if (!new_tbl)
return 0;
 
-   for (old_hash = 0; old_hash < old_tbl->size; old_hash++)
+   for (old_hash = 0; old_hash < old_tbl->size; old_hash++) {
rhashtable_rehash_chain(ht, old_hash);
+   cond_resched();
+   }
 
/* Publish the new table pointer. */
rcu_assign_pointer(ht->tbl, new_tbl);
@@ -993,6 +995,7 @@ void rhashtable_free_and_destroy(struct rhashtable *ht,
for (i = 0; i < tbl->size; i++) {
struct rhash_head *pos, *next;
 
+   cond_resched();
for (pos = rht_dereference(tbl->buckets[i], ht),
 next = !rht_is_a_nulls(pos) ?
rht_dereference(pos->next, ht) : NULL;
-- 
2.17.1



  1   2   3   4   5   6   7   8   9   10   >