Re: [pull request][net-next 00/10] Mellanox, mlx5 and devlink updates 2018-07-31

2018-08-30 Thread Alexander Duyck
I'm dropping all the old quoted text since the conversation was flattened
and only has one level of quote markers for everything.

On Thu, Aug 30, 2018 at 7:43 AM Alex Vesker  wrote:



> To which devlink interfaces are you referring?

All of them. Not just the ones in this patch. If you are exposing an
interface to the user you should have documentation for it somewhere.
You should probably look at adding a patch to make certain you have
all the existing devlink interfaces in the driver documented.

I would like to see something added to the documentation folder that
explains what all the DEVLINK_PARAM_GENERIC interfaces are expected to
do, and maybe why I would use them. Then in addition I would like to
see per-driver documentation added for the DEVLINK_PARAM_DRIVER calls.
So for example I can't find any documentation in the kernel on what
enable_64b_cqe_eqe or enable_4k_uar do in mlx4 or why I would need
them, but you have them exposed as interfaces to userspace.
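
To make the ask concrete, here is a minimal sketch of the registration
side (assuming the 4.19-era devlink param API; illustrative, not the
verbatim mlx4 code). A generic param reuses a well-known id/name from
the devlink core, while a driver-specific param such as
enable_64b_cqe_eqe is free-form, which is exactly why the per-driver
ones need documentation:

#include <linux/bitops.h>
#include <net/devlink.h>

/* driver-specific param ids must sit above the generic id space */
enum {
	EXAMPLE_PARAM_ID_ENABLE_64B_CQE_EQE = DEVLINK_PARAM_GENERIC_ID_MAX + 1,
};

static const struct devlink_param example_params[] = {
	/* generic: name and type are defined by the devlink core */
	DEVLINK_PARAM_GENERIC(INT_ERR_RESET,
			      BIT(DEVLINK_PARAM_CMODE_RUNTIME) |
			      BIT(DEVLINK_PARAM_CMODE_DRIVERINIT),
			      NULL, NULL, NULL),
	/* driver-specific: name and type are whatever the driver picks */
	DEVLINK_PARAM_DRIVER(EXAMPLE_PARAM_ID_ENABLE_64B_CQE_EQE,
			     "enable_64b_cqe_eqe", DEVLINK_PARAM_TYPE_BOOL,
			     BIT(DEVLINK_PARAM_CMODE_DRIVERINIT),
			     NULL, NULL, NULL),
};

/* at probe time:
 *	devlink_params_register(devlink, example_params,
 *				ARRAY_SIZE(example_params));
 */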

> There are 3 patches here that provide the crdump capability,
> these are the patches I would like to resubmit.
>
> net/mlx5: Add Vendor Specific Capability access gateway:
> This is needed so that only the driver reads from the VSC to collect a dump

You should probably work with the linux-pci mailing list on this bit
since you are exposing a new capability and they can probably point
you in the direction of how they want to deal with any potential races
in terms of access to the device versus your capability which you are
adding support for dumping via devlink.

> net/mlx5: Add Crdump FW snapshot support
> This is code that collects the dump and registers a region called crdump
> net/mlx5: Use devlink region_snapshot parameter
> Here I use an already implemented global param that specifies whether
> snapshots are supported.
>
> The devlink region feature is well documented.

Where?

> can it be that you are referring to the devlink region called "crdump" which 
> mlx5 exposes?

I don't care about the internals. I care about user available
documentation for the interface that is exposed. How do you expect the
user to use this functionality? That is what I want documented.



> Will it be sufficient to prevent setpci access using "pci_cfg_access_lock -
> any userspace reads or writes to config space and concurrent lock requests 
> will sleep",
> or do you have a different solution?

That sounds like a step in the right direction, but that is something
you should work with the linux-pci list on. My main concern is that I
don't want us being able to come at this interface from multiple
directions and screw things up.
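
As a rough illustration of the direction I mean, the dump path could be
serialized against userspace config-space access with the existing PCI
core helpers. This is a minimal sketch; example_vsc_read_dump() is a
hypothetical stand-in for whatever routine actually walks the VSC:

#include <linux/pci.h>

/* hypothetical stand-in for the driver's VSC dump routine */
int example_vsc_read_dump(struct pci_dev *pdev, void *buf, size_t len);

static int example_collect_crdump(struct pci_dev *pdev, void *buf, size_t len)
{
	int err;

	/* userspace config reads/writes (e.g. setpci) and concurrent
	 * lock requests sleep until we unlock
	 */
	pci_cfg_access_lock(pdev);
	err = example_vsc_read_dump(pdev, buf, len);
	pci_cfg_access_unlock(pdev);

	return err;
}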


Re: [pull request][net-next 00/10] Mellanox, mlx5 and devlink updates 2018-07-31

2018-08-29 Thread Alexander Duyck
On Wed, Aug 29, 2018 at 8:43 AM Alex Vesker  wrote:
>
>
> > On Wed, Aug 1, 2018 at 4:13 PM, Saeed Mahameed
> >  wrote:
> >> On Wed, Aug 1, 2018 at 3:34 PM, Alexander Duyck
> >>  wrote:
> >>> On Wed, Aug 1, 2018 at 2:52 PM, Saeed Mahameed 
> >>> wrote:
> >>>> Hi Dave,
> >>>>
> >>>> This series provides devlink parameters updates to both devlink API
> >>>> and
> >>>> mlx5 driver, it is a 2nd iteration of the dropped patches sent in a
> >>>> previous
> >>>> mlx5 submission "net/mlx5: Support PCIe buffer congestion handling via
> >>>> Devlink" to address review comments [1].
> >>>>
> >>>> Changes from the original series:
> >>>> - According to the discussion outcome, we are keeping the
> >>>> congestion control
> >>>>   setting as mlx5 device specific for the current HW generation.
> >>>> - Changed the congestion_mode and congestion action param type to
> >>>> string
> >>>> - Added patches to fix devlink handling of param type string
> >>>> - Added a patch which adds extack messages support for param set.
> >>>> - At the end of this series, I've added yet another mlx5 devlink
> >>>> related
> >>>>  feature, firmware snapshot support.
> >>>>
> >>>> For more information please see tag log below.
> >>>>
> >>>> Please pull and let me know if there's any problem.
> >>>>
> >>>> [1] https://patchwork.ozlabs.org/patch/945996/
> >>>>
> >>>> Thanks,
> >>>> Saeed.
> >>>>
> >>>> ---
> >>>>
> >>>> The following changes since commit
> >>>> e6476c21447c4b17c47e476aade6facf050f31e8:
> >>>>
> >>>>   net: remove bogus RCU annotations on socket.wq (2018-07-31
> >>>> 12:40:22 -0700)
> >>>>
> >>>> are available in the Git repository at:
> >>>>
> >>>> git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git
> >>>> tags/mlx5-updates-2018-08-01
> >>>>
> >>>> for you to fetch changes up to
> >>>> 2ac6108c65ffcb1e5eab1fba1fd59272604d1c32:
> >>>>
> >>>>   net/mlx5: Use devlink region_snapshot parameter (2018-08-01
> >>>> 14:49:09 -0700)
> >>>>
> >>>> 
> >>>> mlx5-updates-2018-08-01
> >>>>
> >>>> This series provides devlink parameters updates to both devlink API
> >>>> and
> >>>> mlx5 driver,
> >>>>
> >>>> 1) Devlink changes: (Moshe Shemesh)
> >>>> The first two patches fix devlink param infrastructure for string type
> >>>> params.
> >>>> The third patch adds a devlink helper function to safely copy
> >>>> string from
> >>>> driver to devlink.
> >>>> The fourth patch adds extack support for param set.
> >>>>
> >>>> 2) mlx5 specific congestion parameters: (Eran Ben Elisha)
> >>>> Next three patches add new devlink driver specific params for
> >>>> controlling
> >>>> congestion action and mode, using string type params and extack
> >>>> messages support.
> >>>>
> >>>> This congestion mode enables a hw workaround in specific devices
> >>>> which is
> >>>> controlled by devlink driver-specific params. The workaround is device
> >>>> specific for this NIC generation, the next NIC will not need it.
> >>>>
> >>>> Congestion parameters:
> >>>>  - Congestion action
> >>>> HW W/A mechanism in the PCIe buffer which monitors the
> >>>> amount of
> >>>> consumed PCIe buffer per host.  This mechanism supports
> >>>> the
> >>>> following actions in case of threshold overflow:
> >>>> - Disabled - NOP (Default)
> >>>> - Drop
> >>>> - Mark - Mark CE bit in the CQE of received packet
> >>>> - Congestion mode
> >>>> - Aggressive - Aggressive static trigger threshold
> >>>> (Default)
> >>>> - Dynamic - Dynamically change the trigger threshold

Re: ixgbe hangs when XDP_TX is enabled

2018-08-21 Thread Alexander Duyck
On Tue, Aug 21, 2018 at 9:59 AM Nikita V. Shirokov  wrote:
>
> On Tue, Aug 21, 2018 at 08:58:15AM -0700, Alexander Duyck wrote:
> > On Mon, Aug 20, 2018 at 12:32 PM Nikita V. Shirokov  
> > wrote:
> > >
> > > we are getting such errors:
> > >
> > > [  408.737313] ixgbe 0000:03:00.0 eth0: Detected Tx Unit Hang (XDP)
> > >  Tx Queue <46>
> > >  TDH, TDT <0>, <2>
> > >  next_to_use  <2>
> > >  next_to_clean<0>
> > >tx_buffer_info[next_to_clean]
> > >  time_stamp   <0>
> > >  jiffies  <1000197c0>
> > > [  408.804438] ixgbe 0000:03:00.0 eth0: tx hang 1 detected on queue 46, 
> > > resetting adapter
> > > [  408.804440] ixgbe 0000:03:00.0 eth0: initiating reset due to tx timeout
> > > [  408.817679] ixgbe 0000:03:00.0 eth0: Reset adapter
> > > [  408.866091] ixgbe 0000:03:00.0 eth0: TXDCTL.ENABLE for one or more 
> > > queues not cleared within the polling period
> > > [  409.345289] ixgbe 0000:03:00.0 eth0: detected SFP+: 3
> > > [  409.497232] ixgbe 0000:03:00.0 eth0: NIC Link is Up 10 Gbps, Flow 
> > > Control: RX/TX
> > >
> > > while running an XDP prog on an ixgbe NIC.
> > > right now I'm seeing this on a bpf-next kernel
> > > (latest commit from Wed Aug 15 15:04:25 2018 -0700 ;
> > > 9a76aba02a37718242d7cdc294f0a3901928aa57)
> > >
> > > looks like this is the same issue as reported by Brenden in
> > > https://www.spinics.net/lists/netdev/msg439438.html
> > >
> > > --
> > > Nikita V. Shirokov
> >
> > Could you provide some additional information about your setup?
> > Specifically useful would be "ethtool -i", "ethtool -l", and lspci
> > -vvv info for your device. The total number of CPUs on the system
> > would be useful to know as well. In addition could you try
> > reproducing
> sure:
>
> ethtool -l eth0
> Channel parameters for eth0:
> Pre-set maximums:
> RX: 0
> TX: 0
> Other:  1
> Combined:   63
> Current hardware settings:
> RX: 0
> TX: 0
> Other:  1
> Combined:   48
>
> # ethtool -i eth0
> driver: ixgbe
> version: 5.1.0-k
> firmware-version: 0x86f1
> expansion-rom-version:
> bus-info: 0000:03:00.0
> supports-statistics: yes
> supports-test: yes
> supports-eeprom-access: yes
> supports-register-dump: yes
> supports-priv-flags: yes
>
>
> # nproc
> 48
>
> lspci:
>
> 03:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ 
> Network Connection (rev 01)
> Subsystem: Intel Corporation Device 000d
> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ 
> Stepping- SERR+ FastB2B- DisINTx+
> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- 
> <MAbort- >SERR- <PERR- INTx-
> Latency: 0, Cache Line Size: 32 bytes
> Interrupt: pin A routed to IRQ 30
> NUMA node: 0
> Region 0: Memory at c7d00000 (64-bit, non-prefetchable) [size=1M]
> Region 2: I/O ports at 6000 [size=32]
> Region 4: Memory at c7e80000 (64-bit, non-prefetchable) [size=16K]
> Expansion ROM at c7e00000 [disabled] [size=512K]
> Capabilities: [40] Power Management version 3
> Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA 
> PME(D0+,D1-,D2-,D3hot+,D3cold+)
> Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
> Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
> Address: 0000000000000000  Data: 0000
> Masking: 00000000  Pending: 00000000
> Capabilities: [70] MSI-X: Enable+ Count=64 Masked-
> Vector table: BAR=4 offset=00000000
> PBA: BAR=4 offset=00002000
> Capabilities: [a0] Express (v2) Endpoint, MSI 00
> DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s 
> <512ns, L1 <64us
> ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ 
> SlotPowerLimit 0.000W
> DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ 
> Unsupported+
> RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
> MaxPayload 256 bytes, MaxReadReq 512 bytes
> DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ 
> TransPend+
> LnkCap: Port #2, Speed 5GT/s, Width x8, ASPM L0s, Ex

Re: ixgbe hangs when XDP_TX is enabled

2018-08-21 Thread Alexander Duyck
On Mon, Aug 20, 2018 at 12:32 PM Nikita V. Shirokov  wrote:
>
> we are getting such errors:
>
> [  408.737313] ixgbe 0000:03:00.0 eth0: Detected Tx Unit Hang (XDP)
>  Tx Queue <46>
>  TDH, TDT <0>, <2>
>  next_to_use  <2>
>  next_to_clean<0>
>tx_buffer_info[next_to_clean]
>  time_stamp   <0>
>  jiffies  <1000197c0>
> [  408.804438] ixgbe 0000:03:00.0 eth0: tx hang 1 detected on queue 46, 
> resetting adapter
> [  408.804440] ixgbe 0000:03:00.0 eth0: initiating reset due to tx timeout
> [  408.817679] ixgbe 0000:03:00.0 eth0: Reset adapter
> [  408.866091] ixgbe 0000:03:00.0 eth0: TXDCTL.ENABLE for one or more queues 
> not cleared within the polling period
> [  409.345289] ixgbe 0000:03:00.0 eth0: detected SFP+: 3
> [  409.497232] ixgbe 0000:03:00.0 eth0: NIC Link is Up 10 Gbps, Flow Control: 
> RX/TX
>
> while running an XDP prog on an ixgbe NIC.
> right now I'm seeing this on a bpf-next kernel
> (latest commit from Wed Aug 15 15:04:25 2018 -0700 ;
> 9a76aba02a37718242d7cdc294f0a3901928aa57)
>
> looks like this is the same issue as reported by Brenden in
> https://www.spinics.net/lists/netdev/msg439438.html
>
> --
> Nikita V. Shirokov

Could you provide some additional information about your setup?
Specifically useful would be "ethtool -i", "ethtool -l", and lspci
-vvv info for your device. The total number of CPUs on the system
would be useful to know as well. In addition could you try reproducing
the issue with one of the sample XDP programs provided with the kernel
such as the xdp2 which I believe uses the XDP_TX function. We need to
try and create a similar setup in our own environment for reproduction
and debugging.
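
If it helps, the XDP_TX path can also be exercised with something even
smaller than xdp2. A minimal sketch (assuming a recent libbpf for
bpf_helpers.h; xdp2 additionally rewrites the packet before bouncing
it):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* bounce every received frame back out the ingress interface,
 * which drives the driver's XDP Tx rings just like xdp2 does
 */
SEC("xdp")
int xdp_tx_all(struct xdp_md *ctx)
{
	return XDP_TX;
}

char _license[] SEC("license") = "GPL";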

Thanks.

- Alex


Re: [Intel-wired-lan] [PATCH next-queue 0/8] ixgbe/ixgbevf: IPsec offload support for VFs

2018-08-17 Thread Alexander Duyck
On Fri, Aug 17, 2018 at 4:19 PM Shannon Nelson
 wrote:
>
>
> On 8/16/2018 2:36 PM, Shannon Nelson wrote:
> > On 8/16/2018 2:15 PM, Alexander Duyck wrote:
> >> On Tue, Aug 14, 2018 at 10:10 AM Shannon Nelson
> >>  wrote:
> >>>
> >>> On 8/14/2018 8:30 AM, Alexander Duyck wrote:
> >>>> On Mon, Aug 13, 2018 at 11:43 AM Shannon Nelson
> >>>>  wrote:
> >>>>>
> >>>>> This set of patches implements IPsec hardware offload for VF
> >>>>> devices in
> >>>>> Intel's 10Gbe x540 family of Ethernet devices.
> >>>
> >>> [...]
> >>>
> >>>>
> >>>> So the one question I would have about this patch set is what happens
> >>>> if you are setting up a ipsec connection between the PF and one of the
> >>>> VFs on the same port/function? Do the ipsec offloads get translated
> >>>> across the Tx loopback or do they end up causing issues? Specifically
> >>>> I would be interested in seeing the results of a test either between
> >>>> two VFs, or the PF and one of the VFs on the same port.
> >>>>
> >>>> - Alex
> >>>>
> >>>
> >>> There is definitely something funky in the internal switch connection,
> >>> as messages going from PF to VF with an offloaded encryption don't seem
> >>> to get received by the VF, at least when in a VEB setup.  If I only set
> >>> up offloads on the Rx on both PF and VF, and don't offload the Tx, then
> >>> things work.
> >>>
> >>> I don't have a setup to test this, but I suspect that in a VEPA
> >>> configuration, with packets going out to the switch and turned around
> >>> back in, the Tx encryption offload would happen as expected.
> >>>
> >>> sln
> >>
> >> We should probably look at adding at least one patch to the set then
> >> that disables IPsec Tx offload if SR-IOV is enabled with VEB so that
> >> we don't end up breaking connections should a VF be migrated from a
> >> remote system to a local one that it is connected to.
> >>
> >> - Alex
> >>
> >
> > The problem with this is that someone could set up an IPsec connection
> > on the PF for Tx and Rx use, then set num_vfs, start some VFs, and we
> > still can end up in the same place.  I don't think we want to disallow
> > all Tx IPsec offload.
> >
> > Maybe we can catch it in ixgbe_ipsec_offload_ok()?  If it can find that
> > the dest mac is on the internal switch, perhaps it can NAK the Tx
> > offload?  That would force the XFRM xmit code to do a regular SW encrypt
> > before sending the packet.  I'll look into this.
> >
> > sln
>
> This would be a great idea, but the xdo_state_offload_ok() callback
> happens in the network stack before routing has happened, so there is no
> mac address yet in the skb.  We may be stuck with NAKing *all* Tx
> offloads when num_vfs != 0.  It works, and it is better than no offload
> at all, but it sure harshes the vibe.  Blech.
>
> sln

You can probably just think of the Tx offload as being lumped in with
all the other offloads that don't work when SR-IOV is enabled such as
ATR and RSC.
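
A minimal sketch of the fallback Shannon describes, i.e. NAKing the Tx
offload from the offload_ok callback whenever VFs are active (the
adapter struct here is hypothetical; the real code would check
ixgbe_adapter's VF count):

#include <linux/netdevice.h>
#include <net/xfrm.h>

/* hypothetical private struct; the real ixgbe_adapter tracks num_vfs */
struct example_adapter {
	unsigned int num_vfs;
};

static bool example_ipsec_offload_ok(struct sk_buff *skb, struct xfrm_state *xs)
{
	struct example_adapter *adapter = netdev_priv(xs->xso.dev);

	/* no dst MAC is known this early in the stack, so we cannot tell
	 * whether the peer sits behind the internal VEB switch; refuse
	 * every Tx offload while VFs are active and let XFRM fall back
	 * to software encryption for this skb
	 */
	if (adapter->num_vfs)
		return false;

	return true;
}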

- Alex


Re: [Intel-wired-lan] [PATCH next-queue 0/8] ixgbe/ixgbevf: IPsec offload support for VFs

2018-08-16 Thread Alexander Duyck
On Tue, Aug 14, 2018 at 10:10 AM Shannon Nelson
 wrote:
>
> On 8/14/2018 8:30 AM, Alexander Duyck wrote:
> > On Mon, Aug 13, 2018 at 11:43 AM Shannon Nelson
> >  wrote:
> >>
> >> This set of patches implements IPsec hardware offload for VF devices in
> >> Intel's 10Gbe x540 family of Ethernet devices.
>
> [...]
>
> >
> > So the one question I would have about this patch set is what happens
> > if you are setting up a ipsec connection between the PF and one of the
> > VFs on the same port/function? Do the ipsec offloads get translated
> > across the Tx loopback or do they end up causing issues? Specifically
> > I would be interested in seeing the results of a test either between
> > two VFs, or the PF and one of the VFs on the same port.
> >
> > - Alex
> >
>
> There is definitely something funky in the internal switch connection,
> as messages going from PF to VF with an offloaded encryption don't seem
> to get received by the VF, at least when in a VEB setup.  If I only set
> up offloads on the Rx on both PF and VF, and don't offload the Tx, then
> things work.
>
> I don't have a setup to test this, but I suspect that in a VEPA
> configuration, with packets going out to the switch and turned around
> back in, the Tx encryption offload would happen as expected.
>
> sln

We should probably look at adding at least one patch to the set then
that disables IPsec Tx offload if SR-IOV is enabled with VEB so that
we don't end up breaking connections should a VF be migrated from a
remote system to a local one that it is connected to.

- Alex


Re: [Intel-wired-lan] [PATCH next-queue 0/8] ixgbe/ixgbevf: IPsec offload support for VFs

2018-08-14 Thread Alexander Duyck
On Mon, Aug 13, 2018 at 11:43 AM Shannon Nelson
 wrote:
>
> This set of patches implements IPsec hardware offload for VF devices in
> Intel's 10Gbe x540 family of Ethernet devices.
>
> The IPsec HW offload feature has been in the x540/Niantic family of
> network devices since their release in 2009, but there was no Linux
> kernel support for the offload until 2017.  After the XFRM code added
> support for the offload last year, the hw offload was added to the ixgbe
> PF driver.
>
> Since the related x540 VF device uses the same setup as the PF for implementing
> the offload, adding the feature to the ixgbevf seemed like a good idea.
> In this case, the PF owns the device registers, so the VF simply packages
> up the request information into a VF<->PF message and the PF does the
> device configuration.  The resulting IPsec throughput is roughly equivalent
> to what we see in the PF - nearly line-rate, with the expected drop in CPU
> cycles burned.  (I'm not great at performance statistics, I'll let better
> folks do the actual measurements as they pertain to their own usage)
>
> To make use of the capability, two things are needed first: the PF must
> be told to enable the offload for VFs (it is off by default) and the VF
> must be trusted.  A new ethtool priv-flag for ixgbe is added to control
> VF offload support.  For example:
>
> ethtool --set-priv-flags eth0 vf-ipsec on
> ip link set eth0 vf 1 trust on
>
> Once those are set up and the VF device is UP, the user can add SAs the
> same as for PFs, whether the VF is in the host or has been assigned to
> a VM.
>
> Note that the x540 chip supports a total of 1024 Rx plus 1024 Tx Security
> Associations (SAs), shared among the PF and VFs that might request them.
> It is entirely possible for a single VF to soak up all the offload
> capability, which would likely annoy some people.  It seems rather
> arbitrary to try to set a limit for how many a VF could be allowed,
> but this is mitigated somewhat by the need for "trust" and "vf-ipsec"
> to be enabled.  I suppose we could come up with a way to make a limit
> configurable, but there is no existing method for adding that kind
> configuration so I'll leave that to a future discussion.
>
> Currently this doesn't support Tx offload as the hardware encryption
> engine doesn't seem to engage on the Tx packets.  This may be a lingering
> driver bug, more investigation is needed.  Until then, requests for a Tx
> offload are failed and the userland requester will need to add Tx SAs
> without the offload attribute.
>
> Given that we don't have Tx offload support, the benefit here is less
> than it could be, but is definitely still noticeable.  For example, with
> informal iperf testing over a 10Gbps link, with full offload in a PF on
> one side and a VF in a VM on the other side on a CPU with AES instructions:
>
> Reference:
> No IPsec: 9.4 Gbps
> IPsec offload btwn two PFs:   9.2 Gbps
> VF as the iperf receiver:
> IPsec offload on PF, none on VF:  6.8 Gbps
> IPsec offload on PF and VF:   9.2 Gbps   << biggest benefit
> VF as the iperf sender:
> IPsec offload on PF, none on VF:  4.8 Gbps
> IPsec offload on PF and VF:   4.8 Gbps
>
> The iperf traffic is primarily uni-directional, and we can see the most
> benefit when VF is the iperf server and is receiving the test traffic.
> Watching output from sar also shows a nice decrease in CPU utilization.
>

So the one question I would have about this patch set is what happens
if you are setting up a ipsec connection between the PF and one of the
VFs on the same port/function? Do the ipsec offloads get translated
across the Tx loopback or do they end up causing issues? Specifically
I would be interested in seeing the results of a test either between
two VFs, or the PF and one of the VFs on the same port.

- Alex


Re: e1000e driver stuck at 10Mbps after reconnection

2018-08-06 Thread Alexander Duyck
On Mon, Aug 6, 2018 at 4:59 AM, Camille Bordignon
 wrote:
> Hello,
>
> Recently we experienced some issues with Intel NICs (I219-LM and I219-V).
> It seems that after a wire reconnection, auto-negotiation "fails" and
> link speed drops to 10 Mbps.
>
> From kernel logs:
> [17616.346150] e1000e: enp0s31f6 NIC Link is Down
> [17627.003322] e1000e: enp0s31f6 NIC Link is Up 10 Mbps Full Duplex, Flow 
> Control: None
> [17627.003325] e1000e 0000:00:1f.6 enp0s31f6: 10/100 speed: disabling TSO
>
>
> $ethtool enp0s31f6
> Settings for enp0s31f6:
> Supported ports: [ TP ]
> Supported link modes:   10baseT/Half 10baseT/Full
> 100baseT/Half 100baseT/Full
> 1000baseT/Full
> Supported pause frame use: No
> Supports auto-negotiation: Yes
> Supported FEC modes: Not reported
> Advertised link modes:  10baseT/Half 10baseT/Full
> 100baseT/Half 100baseT/Full
> 1000baseT/Full
> Advertised pause frame use: No
> Advertised auto-negotiation: Yes
> Advertised FEC modes: Not reported
> Speed: 10Mb/s
> Duplex: Full
> Port: Twisted Pair
> PHYAD: 1
> Transceiver: internal
> Auto-negotiation: on
> MDI-X: on (auto)
> Supports Wake-on: pumbg
> Wake-on: g
> Current message level: 0x00000007 (7)
>drv probe link
> Link detected: yes
>
>
> Notice that if the disconnection lasts less than about 5 seconds,
> nothing wrong happens.
> And if, after the last failure, a disconnection / connection occurs again
> and lasts less than 5 seconds, link speed is back to 1000 Mbps.
>
> [18075.350678] e1000e: enp0s31f6 NIC Link is Down
> [18078.716245] e1000e: enp0s31f6 NIC Link is Up 1000 Mbps Full Duplex, Flow 
> Control: None
>
> The following patch seems to fix this issue.
> However I don't clearly understand why.
>
> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
> b/drivers/net/ethernet/intel/e1000e/netdev.c
> index 3ba0c90e7055..763c013960f1 100644
> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> @@ -5069,7 +5069,7 @@ static bool e1000e_has_link(struct e1000_adapter 
> *adapter)
> case e1000_media_type_copper:
> if (hw->mac.get_link_status) {
> ret_val = hw->mac.ops.check_for_link(hw);
> -   link_active = !hw->mac.get_link_status;
> +   link_active = false;
> } else {
> link_active = true;
> }
>
> Maybe this is related to the watchdog task.
>
> I found this fix by comparing with the last commit that works fine:
> commit 0b76aae741abb9d16d2c0e67f8b1e766576f897d.
> However I don't know if this information is relevant.
>
> Thank you.
> Camille Bordignon

What kernel were you testing this on? I know there have been a number
of changes in this area over the past few months, and it would be
useful to know exactly what code base you started out with and what
the latest kernel version you have tested is.

Looking over the code change, the net effect of it should be to add a
2-second delay from the time the link has changed until you actually
check the speed/duplex configuration. It is possible we are seeing
some sort of timing issue, and adding the 2-second delay after the
link event gives things enough time to stabilize and detect the link
at 1000 instead of 10/100.
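
In other words, a simplified view of the patched flow (a sketch only,
not the complete driver logic):

/* called from the ~2 second watchdog; with the patch applied it never
 * reports "up" on the same pass that serviced a link change, so the
 * speed/duplex read happens one watchdog period later
 */
static bool example_has_link(struct e1000_hw *hw)
{
	if (hw->mac.get_link_status) {
		/* reads PHY status; clears get_link_status on link up */
		hw->mac.ops.check_for_link(hw);
		return false;	/* report link up on the next poll only */
	}
	return true;
}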

- Alex


Re: [pull request][net-next 00/10] Mellanox, mlx5 and devlink updates 2018-07-31

2018-08-01 Thread Alexander Duyck
On Wed, Aug 1, 2018 at 4:13 PM, Saeed Mahameed
 wrote:
> On Wed, Aug 1, 2018 at 3:34 PM, Alexander Duyck
>  wrote:
>> On Wed, Aug 1, 2018 at 2:52 PM, Saeed Mahameed  wrote:
>>> Hi Dave,
>>>
>>> This series provides devlink parameters updates to both devlink API and
>>> mlx5 driver, it is a 2nd iteration of the dropped patches sent in a previous
>>> mlx5 submission "net/mlx5: Support PCIe buffer congestion handling via
>>> Devlink" to address review comments [1].
>>>
>>> Changes from the original series:
>>> - According to the discussion outcome, we are keeping the congestion control
>>>   setting as mlx5 device specific for the current HW generation.
>>> - Changed the congestion_mode and congestion action param type to string
>>> - Added patches to fix devlink handling of param type string
>>> - Added a patch which adds extack messages support for param set.
>>> - At the end of this series, I've added yet another mlx5 devlink related
>>>  feature, firmware snapshot support.
>>>
>>> For more information please see tag log below.
>>>
>>> Please pull and let me know if there's any problem.
>>>
>>> [1] https://patchwork.ozlabs.org/patch/945996/
>>>
>>> Thanks,
>>> Saeed.
>>>
>>> ---
>>>
>>> The following changes since commit e6476c21447c4b17c47e476aade6facf050f31e8:
>>>
>>>   net: remove bogus RCU annotations on socket.wq (2018-07-31 12:40:22 -0700)
>>>
>>> are available in the Git repository at:
>>>
>>>   git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git 
>>> tags/mlx5-updates-2018-08-01
>>>
>>> for you to fetch changes up to 2ac6108c65ffcb1e5eab1fba1fd59272604d1c32:
>>>
>>>   net/mlx5: Use devlink region_snapshot parameter (2018-08-01 14:49:09 
>>> -0700)
>>>
>>> 
>>> mlx5-updates-2018-08-01
>>>
>>> This series provides devlink parameters updates to both devlink API and
>>> mlx5 driver,
>>>
>>> 1) Devlink changes: (Moshe Shemesh)
>>> The first two patches fix devlink param infrastructure for string type
>>> params.
>>> The third patch adds a devlink helper function to safely copy string from
>>> driver to devlink.
>>> The fourth patch adds extack support for param set.
>>>
>>> 2) mlx5 specific congestion parameters: (Eran Ben Elisha)
>>> Next three patches add new devlink driver specific params for controlling
>>> congestion action and mode, using string type params and extack messages 
>>> support.
>>>
>>> This congestion mode enables a hw workaround in specific devices which is
>>> controlled by devlink driver-specific params. The workaround is device
>>> specific for this NIC generation, the next NIC will not need it.
>>>
>>> Congestion parameters:
>>>  - Congestion action
>>> HW W/A mechanism in the PCIe buffer which monitors the amount of
>>> consumed PCIe buffer per host.  This mechanism supports the
>>> following actions in case of threshold overflow:
>>> - Disabled - NOP (Default)
>>> - Drop
>>> - Mark - Mark CE bit in the CQE of received packet
>>> - Congestion mode
>>> - Aggressive - Aggressive static trigger threshold (Default)
>>> - Dynamic - Dynamically change the trigger threshold
>>>
>>> 3) mlx5 firmware snapshot support via devlink: (Alex Vesker)
>>> Last three patches, add the support for capturing region snapshot of the
>>> firmware crspace during critical errors, using devlink region_snapshot
>>> parameter.
>>>
>>> -Saeed.
>>>
>>> 
>>> Alex Vesker (3):
>>>   net/mlx5: Add Vendor Specific Capability access gateway
>>>   net/mlx5: Add Crdump FW snapshot support
>>>   net/mlx5: Use devlink region_snapshot parameter
>>>
>>> Eran Ben Elisha (3):
>>>   net/mlx5: Move all devlink related functions calls to devlink.c
>>>   net/mlx5: Add MPEGC register configuration functionality
>>>   net/mlx5: Enable PCIe buffer congestion handling workaround via 
>>> devlink
>>>
>>> Moshe Shemesh (4):
>>>   devlink: Fix param set handling for string type
>>>   de

Re: [pull request][net-next 00/10] Mellanox, mlx5 and devlink updates 2018-07-31

2018-08-01 Thread Alexander Duyck
On Wed, Aug 1, 2018 at 2:52 PM, Saeed Mahameed  wrote:
> Hi Dave,
>
> This series provides devlink parameters updates to both devlink API and
> mlx5 driver, it is a 2nd iteration of the dropped patches sent in a previous
> mlx5 submission "net/mlx5: Support PCIe buffer congestion handling via
> Devlink" to address review comments [1].
>
> Changes from the original series:
> - According to the discussion outcome, we are keeping the congestion control
>   setting as mlx5 device specific for the current HW generation.
> - Changed the congestion_mode and congestion action param type to string
> - Added patches to fix devlink handling of param type string
> - Added a patch which adds extack messages support for param set.
> - At the end of this series, I've added yet another mlx5 devlink related
>  feature, firmware snapshot support.
>
> For more information please see tag log below.
>
> Please pull and let me know if there's any problem.
>
> [1] https://patchwork.ozlabs.org/patch/945996/
>
> Thanks,
> Saeed.
>
> ---
>
> The following changes since commit e6476c21447c4b17c47e476aade6facf050f31e8:
>
>   net: remove bogus RCU annotations on socket.wq (2018-07-31 12:40:22 -0700)
>
> are available in the Git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git 
> tags/mlx5-updates-2018-08-01
>
> for you to fetch changes up to 2ac6108c65ffcb1e5eab1fba1fd59272604d1c32:
>
>   net/mlx5: Use devlink region_snapshot parameter (2018-08-01 14:49:09 -0700)
>
> 
> mlx5-updates-2018-08-01
>
> This series provides devlink parameters updates to both devlink API and
> mlx5 driver,
>
> 1) Devlink changes: (Moshe Shemesh)
> The first two patches fix devlink param infrastructure for string type
> params.
> The third patch adds a devlink helper function to safely copy string from
> driver to devlink.
> The fourth patch adds extack support for param set.
>
> 2) mlx5 specific congestion parameters: (Eran Ben Elisha)
> Next three patches add new devlink driver specific params for controlling
> congestion action and mode, using string type params and extack messages 
> support.
>
> This congestion mode enables a hw workaround in specific devices which is
> controlled by devlink driver-specific params. The workaround is device
> specific for this NIC generation, the next NIC will not need it.
>
> Congestion parameters:
>  - Congestion action
> HW W/A mechanism in the PCIe buffer which monitors the amount of
> consumed PCIe buffer per host.  This mechanism supports the
> following actions in case of threshold overflow:
> - Disabled - NOP (Default)
> - Drop
> - Mark - Mark CE bit in the CQE of received packet
> - Congestion mode
> - Aggressive - Aggressive static trigger threshold (Default)
> - Dynamic - Dynamically change the trigger threshold
>
> 3) mlx5 firmware snapshot support via devlink: (Alex Vesker)
> Last three patches, add the support for capturing region snapshot of the
> firmware crspace during critical errors, using devlink region_snapshot
> parameter.
>
> -Saeed.
>
> 
> Alex Vesker (3):
>   net/mlx5: Add Vendor Specific Capability access gateway
>   net/mlx5: Add Crdump FW snapshot support
>   net/mlx5: Use devlink region_snapshot parameter
>
> Eran Ben Elisha (3):
>   net/mlx5: Move all devlink related functions calls to devlink.c
>   net/mlx5: Add MPEGC register configuration functionality
>   net/mlx5: Enable PCIe buffer congestion handling workaround via devlink
>
> Moshe Shemesh (4):
>   devlink: Fix param set handling for string type
>   devlink: Fix param cmode driverinit for string type
>   devlink: Add helper function for safely copy string param
>   devlink: Add extack messages support to param set
>
>  drivers/net/ethernet/broadcom/bnxt/bnxt_devlink.c  |   3 +-
>  drivers/net/ethernet/mellanox/mlx4/main.c  |   6 +-
>  drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   3 +-
>  drivers/net/ethernet/mellanox/mlx5/core/devlink.c  | 388 
> +
>  drivers/net/ethernet/mellanox/mlx5/core/devlink.h  |  13 +
>  .../net/ethernet/mellanox/mlx5/core/diag/crdump.c  | 223 
>  drivers/net/ethernet/mellanox/mlx5/core/health.c   |   3 +
>  drivers/net/ethernet/mellanox/mlx5/core/lib/mlx5.h |   4 +
>  .../net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c  | 320 +
>  .../net/ethernet/mellanox/mlx5/core/lib/pci_vsc.h  |  56 +++
>  drivers/net/ethernet/mellanox/mlx5/core/main.c |  10 +-
>  include/linux/mlx5/driver.h|   5 +
>  include/net/devlink.h  |  15 +-
>  net/core/devlink.c |  44 ++-
>  14 files changed, 1076 insertions(+), 17 deletions(-)
>  create mode 100644 

Re: [net-next 07/10] net/mlx5: Enable PCIe buffer congestion handling workaround via devlink

2018-08-01 Thread Alexander Duyck
On Wed, Aug 1, 2018 at 2:52 PM, Saeed Mahameed  wrote:
> From: Eran Ben Elisha 
>
> Add support for two driver parameters via devlink params interface:
> - Congestion action
> HW mechanism in the PCIe buffer which monitors the amount of
> consumed PCIe buffer per host. This mechanism supports the
> following actions in case of threshold overflow:
> - disabled - NOP (Default)
> - drop
> - mark - Mark CE bit in the CQE of received packet

Any chance you could clarify the differences between "disabled" and
"drop"? I am assuming the "drop" is a head-of-line drop versus the
"disabled" being a incoming packet drop?

Also I still don't see this as necessarily being all that unique of a
feature/issue. Basically being PCIe bus limited is not all that
uncommon of a thing and has existed since the early days of PCI. In
the case of the Intel NICs we just throw a warning and end up dropping
the incoming packets instead of providing the two other options you
have listed.

> - Congestion mode
> - aggressive - Aggressive static trigger threshold (Default)
> - dynamic - Dynamically change the trigger threshold
>
> These driver-specific params enable the NIC HW workaround to handle
> buffer congestion on the current NIC generation.

Is there any documentation anywhere for any of these features? In the
patch set I see you adding interfaces, but I don't see them documented
anywhere.

> Signed-off-by: Eran Ben Elisha 
> Reviewed-by: Moshe Shemesh 
> Reviewed-by: Jiri Pirko 
> Signed-off-by: Saeed Mahameed 
> ---
>  .../net/ethernet/mellanox/mlx5/core/devlink.c | 204 +-
>  1 file changed, 203 insertions(+), 1 deletion(-)

Ideally there should be some documentation going into the kernel when
you extend the devlink interface at least so that I know how to use
your new interfaces when you define them. Just updating devlink.c
seems like a messy way to do things.

- Alex


Re: [net-next 10/16] net/mlx5: Support PCIe buffer congestion handling via Devlink

2018-07-30 Thread Alexander Duyck
On Mon, Jul 30, 2018 at 7:33 PM, Bjorn Helgaas  wrote:
> On Mon, Jul 30, 2018 at 08:02:48AM -0700, Alexander Duyck wrote:
>> On Mon, Jul 30, 2018 at 7:07 AM, Bjorn Helgaas  wrote:
>> > On Sun, Jul 29, 2018 at 03:00:28PM -0700, Alexander Duyck wrote:
>> >> On Sun, Jul 29, 2018 at 2:23 AM, Moshe Shemesh  
>> >> wrote:
>> >> > On Sat, Jul 28, 2018 at 7:06 PM, Bjorn Helgaas  
>> >> > wrote:
>> >> >> On Thu, Jul 26, 2018 at 07:00:20AM -0700, Alexander Duyck wrote:
>> >> >> > On Thu, Jul 26, 2018 at 12:14 AM, Jiri Pirko  
>> >> >> > wrote:
>> >> >> > > Thu, Jul 26, 2018 at 02:43:59AM CEST, jakub.kicin...@netronome.com
>> >> >> > > wrote:
>> >> >> > >>On Wed, 25 Jul 2018 08:23:26 -0700, Alexander Duyck wrote:
>> >> >> > >>> On Wed, Jul 25, 2018 at 5:31 AM, Eran Ben Elisha wrote:
>> >> >> > >>> > On 7/24/2018 10:51 PM, Jakub Kicinski wrote:
>> >> >> > >>> >>>> The devlink params haven't been upstream even for a full 
>> >> >> > >>> >>>> cycle
>> >> >> > >>> >>>> and
>> >> >> > >>> >>>> already you guys are starting to use them to configure 
>> >> >> > >>> >>>> standard
>> >> >> > >>> >>>> features like queuing.
>> >> >> > >>> >>>
>> >> >> > >>> >>> We developed the devlink params in order to support 
>> >> >> > >>> >>> non-standard
>> >> >> > >>> >>> configuration only. And for non-standard, there are generic 
>> >> >> > >>> >>> and
>> >> >> > >>> >>> vendor
>> >> >> > >>> >>> specific options.
>> >> >> > >>> >>
>> >> >> > >>> >> I thought it was developed for performing non-standard and
>> >> >> > >>> >> possibly
>> >> >> > >>> >> vendor specific configuration.  Look at 
>> >> >> > >>> >> DEVLINK_PARAM_GENERIC_*
>> >> >> > >>> >> for
>> >> >> > >>> >> examples of well justified generic options for which we have 
>> >> >> > >>> >> no
>> >> >> > >>> >> other API.  The vendor mlx4 options look fairly vendor 
>> >> >> > >>> >> specific
>> >> >> > >>> >> if you
>> >> >> > >>> >> ask me, too.
>> >> >> > >>> >>
>> >> >> > >>> >> Configuring queuing has an API.  The question is it 
>> >> >> > >>> >> acceptable to
>> >> >> > >>> >> enter
>> >> >> > >>> >> into the risky territory of controlling offloads via devlink
>> >> >> > >>> >> parameters
>> >> >> > >>> >> or would we rather make vendors take the time and effort to 
>> >> >> > >>> >> model
>> >> >> > >>> >> things to (a subset) of existing APIs.  The HW never fits the
>> >> >> > >>> >> APIs
>> >> >> > >>> >> perfectly.
>> >> >> > >>> >
>> >> >> > >>> > I understand what you meant here, I would like to highlight 
>> >> >> > >>> > that
>> >> >> > >>> > this
>> >> >> > >>> > mechanism was not meant to handle SRIOV, Representors, etc.
>> >> >> > >>> > The vendor specific configuration suggested here is to handle a
>> >> >> > >>> > congestion
>> >> >> > >>> > state in Multi Host environment (which includes PF and multiple
>> >> >> > >>> > VFs per
>> >> >> > >>> > host), where one host is not aware to the other hosts, and 
>> >> >> > >>> > each is
>> >> >> > >>> > running
>> >>

Re: [PATCH v8 0/7] netdev: intel: Eliminate duplicate barriers on weakly-ordered archs

2018-07-30 Thread Alexander Duyck
On Mon, Jul 30, 2018 at 12:20 PM, Sinan Kaya  wrote:
> +netdev,
>
> On 7/30/2018 9:45 AM, Alexander Duyck wrote:
>>
>> I haven't had a chance to work on this much myself. My understanding
>> is that igb has had the barriers updated, but I don't think any of the
>> other drivers have been worked over yet.
>
>
> Unfortunately, I have recently changed jobs and I no longer have the
> hardware to test my changes. I thought that you wanted to handle this
> yourself.
>
> I haven't seen any follow ups. I wanted to double check.

As I said, so far igb has been the only one updated, and that was done
by a third party:
73017f4e051c8 igb: Use dma_wmb() instead of wmb() before doorbell writes

> I worked with several architecture leads on 4.17. All architectures
> support the updated barrier semantics now. It is time to clean up the
> network drivers.

Thanks for that update. As I said for now igb has the barriers
updated. The idea being that igb is the test vehicle for this so if we
go a kernel version or so without triggering any issues then we can
follow up with the other drivers.

The other thing we have to keep in mind is that unlike many other NICs
we have to also deal with emulations of our devices (e1000 and e1000e)
that may rely on certain barriers being used to enforce things like
SMP synchronization between CPUs, so we have to be careful as we roll
this out.
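
For reference, the pattern in question looks roughly like this (a
minimal sketch with a hypothetical ring struct, not the actual igb
code):

#include <linux/io.h>
#include <linux/types.h>

/* hypothetical ring; real drivers keep the tail register in their ring struct */
struct example_ring {
	void __iomem *tail;
};

static inline void example_ring_doorbell(struct example_ring *ring, u16 next_to_use)
{
	/* order descriptor writes (coherent DMA memory) before the MMIO
	 * doorbell write; dma_wmb() is sufficient for that and cheaper
	 * than a full wmb() on weakly-ordered architectures
	 */
	dma_wmb();
	writel(next_to_use, ring->tail);
}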

Thanks.

- Alex


Re: [net-next 10/16] net/mlx5: Support PCIe buffer congestion handling via Devlink

2018-07-30 Thread Alexander Duyck
On Mon, Jul 30, 2018 at 7:07 AM, Bjorn Helgaas  wrote:
> On Sun, Jul 29, 2018 at 03:00:28PM -0700, Alexander Duyck wrote:
>> On Sun, Jul 29, 2018 at 2:23 AM, Moshe Shemesh  wrote:
>> > On Sat, Jul 28, 2018 at 7:06 PM, Bjorn Helgaas  wrote:
>> >> On Thu, Jul 26, 2018 at 07:00:20AM -0700, Alexander Duyck wrote:
>> >> > On Thu, Jul 26, 2018 at 12:14 AM, Jiri Pirko  wrote:
>> >> > > Thu, Jul 26, 2018 at 02:43:59AM CEST, jakub.kicin...@netronome.com
>> >> > > wrote:
>> >> > >>On Wed, 25 Jul 2018 08:23:26 -0700, Alexander Duyck wrote:
>> >> > >>> On Wed, Jul 25, 2018 at 5:31 AM, Eran Ben Elisha wrote:
>> >> > >>> > On 7/24/2018 10:51 PM, Jakub Kicinski wrote:
>> >> > >>> >>>> The devlink params haven't been upstream even for a full cycle
>> >> > >>> >>>> and
>> >> > >>> >>>> already you guys are starting to use them to configure standard
>> >> > >>> >>>> features like queuing.
>> >> > >>> >>>
>> >> > >>> >>> We developed the devlink params in order to support non-standard
>> >> > >>> >>> configuration only. And for non-standard, there are generic and
>> >> > >>> >>> vendor
>> >> > >>> >>> specific options.
>> >> > >>> >>
>> >> > >>> >> I thought it was developed for performing non-standard and
>> >> > >>> >> possibly
>> >> > >>> >> vendor specific configuration.  Look at DEVLINK_PARAM_GENERIC_*
>> >> > >>> >> for
>> >> > >>> >> examples of well justified generic options for which we have no
>> >> > >>> >> other API.  The vendor mlx4 options look fairly vendor specific
>> >> > >>> >> if you
>> >> > >>> >> ask me, too.
>> >> > >>> >>
>> >> > >>> >> Configuring queuing has an API.  The question is it acceptable to
>> >> > >>> >> enter
>> >> > >>> >> into the risky territory of controlling offloads via devlink
>> >> > >>> >> parameters
>> >> > >>> >> or would we rather make vendors take the time and effort to model
>> >> > >>> >> things to (a subset) of existing APIs.  The HW never fits the
>> >> > >>> >> APIs
>> >> > >>> >> perfectly.
>> >> > >>> >
>> >> > >>> > I understand what you meant here, I would like to highlight that
>> >> > >>> > this
>> >> > >>> > mechanism was not meant to handle SRIOV, Representors, etc.
>> >> > >>> > The vendor specific configuration suggested here is to handle a
>> >> > >>> > congestion
>> >> > >>> > state in Multi Host environment (which includes PF and multiple
>> >> > >>> > VFs per
>> >> > >>> > host), where one host is not aware to the other hosts, and each is
>> >> > >>> > running
>> >> > >>> > on its own pci/driver. It is a device working mode configuration.
>> >> > >>> >
>> >> > >>> > This  couldn't fit into any existing API, thus creating this
>> >> > >>> > vendor specific
>> >> > >>> > unique API is needed.
>> >> > >>>
>> >> > >>> If we are just going to start creating devlink interfaces for
>> >> > >>> every
>> >> > >>> one-off option a device wants to add why did we even bother with
>> >> > >>> trying to prevent drivers from using sysfs? This just feels like we
>> >> > >>> are back to the same arguments we had back in the day with it.
>> >> > >>>
>> >> > >>> I feel like the bigger question here is if devlink is how we are
>> >> > >>> going
>> >> > >>> to deal with all PCIe related features going forward, or should we
>> >> > >>> start looking at creating a new interface/tool for PCI/PCIe related
>> >> > >>

Re: [net-next 10/16] net/mlx5: Support PCIe buffer congestion handling via Devlink

2018-07-29 Thread Alexander Duyck
On Sun, Jul 29, 2018 at 2:23 AM, Moshe Shemesh  wrote:
>
>
> On Sat, Jul 28, 2018 at 7:06 PM, Bjorn Helgaas  wrote:
>>
>> On Thu, Jul 26, 2018 at 07:00:20AM -0700, Alexander Duyck wrote:
>> > On Thu, Jul 26, 2018 at 12:14 AM, Jiri Pirko  wrote:
>> > > Thu, Jul 26, 2018 at 02:43:59AM CEST, jakub.kicin...@netronome.com
>> > > wrote:
>> > >>On Wed, 25 Jul 2018 08:23:26 -0700, Alexander Duyck wrote:
>> > >>> On Wed, Jul 25, 2018 at 5:31 AM, Eran Ben Elisha wrote:
>> > >>> > On 7/24/2018 10:51 PM, Jakub Kicinski wrote:
>> > >>> >>>> The devlink params haven't been upstream even for a full cycle
>> > >>> >>>> and
>> > >>> >>>> already you guys are starting to use them to configure standard
>> > >>> >>>> features like queuing.
>> > >>> >>>
>> > >>> >>> We developed the devlink params in order to support non-standard
>> > >>> >>> configuration only. And for non-standard, there are generic and
>> > >>> >>> vendor
>> > >>> >>> specific options.
>> > >>> >>
>> > >>> >> I thought it was developed for performing non-standard and
>> > >>> >> possibly
>> > >>> >> vendor specific configuration.  Look at DEVLINK_PARAM_GENERIC_*
>> > >>> >> for
>> > >>> >> examples of well justified generic options for which we have no
>> > >>> >> other API.  The vendor mlx4 options look fairly vendor specific
>> > >>> >> if you
>> > >>> >> ask me, too.
>> > >>> >>
>> > >>> >> Configuring queuing has an API.  The question is it acceptable to
>> > >>> >> enter
>> > >>> >> into the risky territory of controlling offloads via devlink
>> > >>> >> parameters
>> > >>> >> or would we rather make vendors take the time and effort to model
>> > >>> >> things to (a subset) of existing APIs.  The HW never fits the
>> > >>> >> APIs
>> > >>> >> perfectly.
>> > >>> >
>> > >>> > I understand what you meant here, I would like to highlight that
>> > >>> > this
>> > >>> > mechanism was not meant to handle SRIOV, Representors, etc.
>> > >>> > The vendor specific configuration suggested here is to handle a
>> > >>> > congestion
>> > >>> > state in Multi Host environment (which includes PF and multiple
>> > >>> > VFs per
>> > >>> > host), where one host is not aware to the other hosts, and each is
>> > >>> > running
>> > >>> > on its own pci/driver. It is a device working mode configuration.
>> > >>> >
>> > >>> > This  couldn't fit into any existing API, thus creating this
>> > >>> > vendor specific
>> > >>> > unique API is needed.
>> > >>>
>> > >>> If we are just going to start creating devlink interfaces for
>> > >>> every
>> > >>> one-off option a device wants to add why did we even bother with
>> > >>> trying to prevent drivers from using sysfs? This just feels like we
>> > >>> are back to the same arguments we had back in the day with it.
>> > >>>
>> > >>> I feel like the bigger question here is if devlink is how we are
>> > >>> going
>> > >>> to deal with all PCIe related features going forward, or should we
>> > >>> start looking at creating a new interface/tool for PCI/PCIe related
>> > >>> features? My concern is that we have already had features such as
>> > >>> DMA
>> > >>> Coalescing that didn't really fit into anything and now we are
>> > >>> starting to see other things related to DMA and PCIe bus credits.
>> > >>> I'm
>> > >>> wondering if we shouldn't start looking at a tool/interface to
>> > >>> configure all the PCIe related features such as interrupts, error
>> > >>> reporting, DMA configuration, power management, etc. Maybe we could
>> > >>> even look at sharing it across subsystems and include 

Re: [net-next 10/16] net/mlx5: Support PCIe buffer congestion handling via Devlink

2018-07-26 Thread Alexander Duyck
On Thu, Jul 26, 2018 at 12:14 AM, Jiri Pirko  wrote:
> Thu, Jul 26, 2018 at 02:43:59AM CEST, jakub.kicin...@netronome.com wrote:
>>On Wed, 25 Jul 2018 08:23:26 -0700, Alexander Duyck wrote:
>>> On Wed, Jul 25, 2018 at 5:31 AM, Eran Ben Elisha wrote:
>>> > On 7/24/2018 10:51 PM, Jakub Kicinski wrote:
>>> >>>> The devlink params haven't been upstream even for a full cycle and
>>> >>>> already you guys are starting to use them to configure standard
>>> >>>> features like queuing.
>>> >>>
>>> >>> We developed the devlink params in order to support non-standard
>>> >>> configuration only. And for non-standard, there are generic and vendor
>>> >>> specific options.
>>> >>
>>> >> I thought it was developed for performing non-standard and possibly
>>> >> vendor specific configuration.  Look at DEVLINK_PARAM_GENERIC_* for
>>> >> examples of well justified generic options for which we have no
>>> >> other API.  The vendor mlx4 options look fairly vendor specific if you
>>> >> ask me, too.
>>> >>
>>> >> Configuring queuing has an API.  The question is it acceptable to enter
>>> >> into the risky territory of controlling offloads via devlink parameters
>>> >> or would we rather make vendors take the time and effort to model
>>> >> things to (a subset) of existing APIs.  The HW never fits the APIs
>>> >> perfectly.
>>> >
>>> > I understand what you meant here, I would like to highlight that this
>>> > mechanism was not meant to handle SRIOV, Representors, etc.
>>> > The vendor specific configuration suggested here is to handle a congestion
>>> > state in Multi Host environment (which includes PF and multiple VFs per
>>> > host), where one host is not aware to the other hosts, and each is running
>>> > on its own pci/driver. It is a device working mode configuration.
>>> >
>>> > This  couldn't fit into any existing API, thus creating this vendor 
>>> > specific
>>> > unique API is needed.
>>>
>>> If we are just going to start creating devlink interfaces for every
>>> one-off option a device wants to add why did we even bother with
>>> trying to prevent drivers from using sysfs? This just feels like we
>>> are back to the same arguments we had back in the day with it.
>>>
>>> I feel like the bigger question here is if devlink is how we are going
>>> to deal with all PCIe related features going forward, or should we
>>> start looking at creating a new interface/tool for PCI/PCIe related
>>> features? My concern is that we have already had features such as DMA
>>> Coalescing that didn't really fit into anything and now we are
>>> starting to see other things related to DMA and PCIe bus credits. I'm
>>> wondering if we shouldn't start looking at a tool/interface to
>>> configure all the PCIe related features such as interrupts, error
>>> reporting, DMA configuration, power management, etc. Maybe we could
>>> even look at sharing it across subsystems and include things like
>>> storage, graphics, and other subsystems in the conversation.
>>
>>Agreed, for actual PCIe configuration (i.e. not ECN marking) we do need
>>to build up an API.  Sharing it across subsystems would be very cool!
>
> I wonder how come there isn't such an API in place already. Or is there?
> If it is not, do you have any idea how should it look like? Should it be
> an extension of the existing PCI uapi or something completely new?
> It would be probably good to loop some PCI people in...

The closest thing I can think of in terms of answering your questions
as to why we haven't seen anything like that would be setpci.
Basically with that tool you can go through the PCI configuration
space and update any piece you want. The problem is it can have
effects on the driver and I don't recall there ever being any sort of
notification mechanism added to make a driver aware of configuration
updates.

As far as the interface I don't know if we would want to use something
like netlink or look at something completely new.

I've gone ahead and added the linux-pci mailing list to the thread.

- Alex


Re: [Intel-wired-lan] [PATCH v1 1/4] igb: Remove unnecessary include of

2018-07-25 Thread Alexander Duyck
On Wed, Jul 25, 2018 at 12:52 PM, Bjorn Helgaas  wrote:
> From: Bjorn Helgaas 
>
> The igb driver doesn't need anything provided by pci-aspm.h, so remove
> the unnecessary include of it.
>
> Signed-off-by: Bjorn Helgaas 

Looks good to me.

Acked-by: Alexander Duyck 

> ---
>  drivers/net/ethernet/intel/igb/igb_main.c |1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
> b/drivers/net/ethernet/intel/igb/igb_main.c
> index f707709969ac..c77fda05f683 100644
> --- a/drivers/net/ethernet/intel/igb/igb_main.c
> +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> @@ -22,7 +22,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
>  #include 
>  #include 
>


Re: [net-next 10/16] net/mlx5: Support PCIe buffer congestion handling via Devlink

2018-07-25 Thread Alexander Duyck
On Wed, Jul 25, 2018 at 5:31 AM, Eran Ben Elisha  wrote:
>
>
> On 7/24/2018 10:51 PM, Jakub Kicinski wrote:


 The devlink params haven't been upstream even for a full cycle and
 already you guys are starting to use them to configure standard
 features like queuing.
>>>
>>>
>>> We developed the devlink params in order to support non-standard
>>> configuration only. And for non-standard, there are generic and vendor
>>> specific options.
>>
>>
>> I thought it was developed for performing non-standard and possibly
>> vendor specific configuration.  Look at DEVLINK_PARAM_GENERIC_* for
>> examples of well justified generic options for which we have no
>> other API.  The vendor mlx4 options look fairly vendor specific if you
>> ask me, too.
>>
>> Configuring queuing has an API.  The question is it acceptable to enter
>> into the risky territory of controlling offloads via devlink parameters
>> or would we rather make vendors take the time and effort to model
>> things to (a subset) of existing APIs.  The HW never fits the APIs
>> perfectly.
>
>
> I understand what you meant here, I would like to highlight that this
> mechanism was not meant to handle SRIOV, Representors, etc.
> The vendor specific configuration suggested here is to handle a congestion
> state in Multi Host environment (which includes PF and multiple VFs per
> host), where one host is not aware to the other hosts, and each is running
> on its own pci/driver. It is a device working mode configuration.
>
> This  couldn't fit into any existing API, thus creating this vendor specific
> unique API is needed.

If we are just going to start creating devlink interfaces for every
one-off option a device wants to add why did we even bother with
trying to prevent drivers from using sysfs? This just feels like we
are back to the same arguments we had back in the day with it.

I feel like the bigger question here is if devlink is how we are going
to deal with all PCIe related features going forward, or should we
start looking at creating a new interface/tool for PCI/PCIe related
features? My concern is that we have already had features such as DMA
Coalescing that didn't really fit into anything and now we are
starting to see other things related to DMA and PCIe bus credits. I'm
wondering if we shouldn't start looking at a tool/interface to
configure all the PCIe related features such as interrupts, error
reporting, DMA configuration, power management, etc. Maybe we could
even look at sharing it across subsystems and include things like
storage, graphics, and other subsystems in the conversation.

- Alex


Re: [net-next V2 12/12] net/mlx5e: Use PARTIAL_GSO for UDP segmentation

2018-07-24 Thread Alexander Duyck
On Mon, Jul 23, 2018 at 3:11 PM, Saeed Mahameed  wrote:
> From: Boris Pismenny 
>
> This patch removes the splitting of UDP_GSO_L4 packets in the driver,
> and exposes UDP_GSO_L4 as a PARTIAL_GSO feature. Thus, the network stack
> is not responsible for splitting the packet into two.
>
> Signed-off-by: Boris Pismenny 
> Signed-off-by: Saeed Mahameed 
> ---
>  .../net/ethernet/mellanox/mlx5/core/Makefile  |   4 +-
>  .../mellanox/mlx5/core/en_accel/en_accel.h|  27 +++--
>  .../mellanox/mlx5/core/en_accel/rxtx.c| 109 --
>  .../mellanox/mlx5/core/en_accel/rxtx.h|  14 ---
>  .../net/ethernet/mellanox/mlx5/core/en_main.c |   9 +-
>  5 files changed, 23 insertions(+), 140 deletions(-)
>  delete mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/rxtx.c
>  delete mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/rxtx.h
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile 
> b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> index 55d5a5c2e9d8..fa7fcca5dc78 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> @@ -14,8 +14,8 @@ mlx5_core-$(CONFIG_MLX5_FPGA) += fpga/cmd.o fpga/core.o 
> fpga/conn.o fpga/sdk.o \
> fpga/ipsec.o fpga/tls.o
>
>  mlx5_core-$(CONFIG_MLX5_CORE_EN) += en_main.o en_common.o en_fs.o 
> en_ethtool.o \
> -   en_tx.o en_rx.o en_dim.o en_txrx.o en_accel/rxtx.o en_stats.o 
>  \
> -   vxlan.o en_arfs.o en_fs_ethtool.o en_selftest.o en/port.o
> +   en_tx.o en_rx.o en_dim.o en_txrx.o en_stats.o vxlan.o 
>  \
> +   en_arfs.o en_fs_ethtool.o en_selftest.o en/port.o
>
>  mlx5_core-$(CONFIG_MLX5_MPFS) += lib/mpfs.o
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h 
> b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
> index 39a5d13ba459..1dd225380a66 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
> @@ -38,14 +38,22 @@
>  #include 
>  #include "en_accel/ipsec_rxtx.h"
>  #include "en_accel/tls_rxtx.h"
> -#include "en_accel/rxtx.h"
>  #include "en.h"
>
> -static inline struct sk_buff *mlx5e_accel_handle_tx(struct sk_buff *skb,
> -   struct mlx5e_txqsq *sq,
> -   struct net_device *dev,
> -   struct mlx5e_tx_wqe **wqe,
> -   u16 *pi)
> +static inline void
> +mlx5e_udp_gso_handle_tx_skb(struct sk_buff *skb)
> +{
> +   int payload_len = skb_shinfo(skb)->gso_size + sizeof(struct udphdr);
> +
> +   udp_hdr(skb)->len = htons(payload_len);
> +}
> +

So it looks like you decided to just update the length here. Do you
still have plans to update GSO_PARTIAL to set the length this way or
have you decided to just leave it as it is?

> +static inline struct sk_buff *
> +mlx5e_accel_handle_tx(struct sk_buff *skb,
> + struct mlx5e_txqsq *sq,
> + struct net_device *dev,
> + struct mlx5e_tx_wqe **wqe,
> + u16 *pi)
>  {
>  #ifdef CONFIG_MLX5_EN_TLS
> if (test_bit(MLX5E_SQ_STATE_TLS, >state)) {
> @@ -63,11 +71,8 @@ static inline struct sk_buff *mlx5e_accel_handle_tx(struct 
> sk_buff *skb,
> }
>  #endif
>
> -   if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4) {
> -   skb = mlx5e_udp_gso_handle_tx_skb(dev, sq, skb, wqe, pi);
> -   if (unlikely(!skb))
> -   return NULL;
> -   }
> +   if (skb_is_gso(skb) && skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4)
> +   mlx5e_udp_gso_handle_tx_skb(skb);
>
> return skb;
>  }
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/rxtx.c 
> b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/rxtx.c
> deleted file mode 100644
> index 7b7ec3998e84..
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/rxtx.c
> +++ /dev/null
> @@ -1,109 +0,0 @@
> -#include "en_accel/rxtx.h"
> -
> -static void mlx5e_udp_gso_prepare_last_skb(struct sk_buff *skb,
> -  struct sk_buff *nskb,
> -  int remaining)
> -{
> -   int bytes_needed = remaining, remaining_headlen, 
> remaining_page_offset;
> -   int headlen = skb_transport_offset(skb) + sizeof(struct udphdr);
> -   int payload_len = remaining + sizeof(struct udphdr);
> -   int k = 0, i, j;
> -
> -   skb_copy_bits(skb, 0, nskb->data, headlen);
> -   nskb->dev = skb->dev;
> -   skb_reset_mac_header(nskb);
> -   skb_set_network_header(nskb, skb_network_offset(skb));
> -   skb_set_transport_header(nskb, skb_transport_offset(skb));
> -   skb_set_tail_pointer(nskb, headlen);
> -
> -   /* How many frags do we need? */
> 

Re: [PATCH net-next,v2] net: rename ndo_setup_tc to ndo_setup_offload

2018-07-20 Thread Alexander Duyck
On Fri, Jul 20, 2018 at 3:09 AM, Pablo Neira Ayuso  wrote:
> On Thu, Jul 19, 2018 at 02:04:16PM -0700, Alexander Duyck wrote:
>> On Thu, Jul 19, 2018 at 1:52 PM, Pablo Neira Ayuso  
>> wrote:
>> > On Thu, Jul 19, 2018 at 08:18:20AM -0700, Alexander Duyck wrote:
>> >> On Wed, Jul 18, 2018 at 5:11 PM, Pablo Neira Ayuso  
>> >> wrote:
>> >> > One of the recurring complaints is that we do not have, as a driver
>> >> > writer, a central location from which we would be fed offloading rules
>> >> > into a NIC. This was brought up again during Netconf'18 in Boston.
>> >> >
>> >> > This patch just renames ndo_setup_tc to ndo_setup_offload as a very
>> >> > early initial work to prepare for a follow-up patch that discusses unified
>> >> > flow representation for the existing offload programming APIs.
>> >> >
>> >> > Signed-off-by: Pablo Neira Ayuso 
>> >> > Acked-by: Jiri Pirko 
>> >> > Acked-by: Jakub Kicinski 
>> >>
>> >> One request I would have here is to not bother updating the individual
>> >> driver function names. For now I would say we could leave the
>> >> "_setup_tc" in the naming of the driver functions itself and just
>> >> update the name of the net device operation. Renaming the driver
>> >> functions just adds unnecessary overhead and complexity to the patch
>> >> and will make it more difficult to maintain. When we get around to
>> >> adding additional functionality that relates to the rename we could
>> >> address renaming the function on a per driver basis in the future.
>> >
>> > The plan was for a follow-up patch to rename enum tc_setup_type too:
>> >
>> > https://marc.info/?l=linux-netdev&m=153193158512556&w=2
>> >
>> > that will result in more renames in the driver side.
>> >
>> > I would expect this will happen sooner or later, and out of tree
>> > patches will end up needing a rebase sooner or later, if that is the
>> > concern.
>>
>> I was just thinking that renaming the functions themselves adds noise
>> and makes it harder to debug functions later when they get renamed. As
>> far as the out-of-tree driver I agree we will still have to deal with
>> it due to the enum and NDO function rename. I just figured that using
>> things like LXR is a bit easier when the function name stays the same
>> and you have to move between versions.
>
> Semantic changes in this interface are expected in follow up patches.
> Specifically, this interface will not be exclusively dedicated to 'tc'
> anymore. The function rename will provide a hint on this semantic change
> going on. I understand your concern, and I also tend to dislike renaming
> for the sake of renaming, but in this case this rename conveys useful
> information to developers.
>
> Thanks.

I kind of figured semantic changes were coming. I just thought we
could wait until then for the function renames. There isn't any point
in renaming the function until it changes what it is actually doing. I
suspect you may only have a few drivers that do the update as there
probably isn't much value in updating the function name on drivers
such as fm10k or igb as they are unlikely to ever support anything
other than the tc offloads. That way it is clear that currently none
of the drivers are supporting anything other than the tc offloads
until the new functionality to support netfilter offloads is added on
a per driver basis.

- Alex


Re: [PATCH net] net: skb_segment() should not return NULL

2018-07-19 Thread Alexander Duyck
0246 ORIG_RAX: 002c
> RAX: ffda RBX: 7fcae64d66d4 RCX: 00455ab9
> RDX: 0001 RSI: 2200 RDI: 0013
> RBP: 0072bea0 R08:  R09: 
> R10:  R11: 0246 R12: 0014
> R13: 004c1145 R14: 004d1818 R15: 0006
> Modules linked in:
> Dumping ftrace buffer:
>(ftrace buffer empty)
>
> Fixes: ddff00d42043 ("net: Move skb_has_shared_frag check out of GRE code and 
> into segmentation")
> Signed-off-by: Eric Dumazet 
> Cc: Alexander Duyck 
> Reported-by: syzbot 

Thanks for fixing this.

Acked-by: Alexander Duyck 

> ---
>  net/core/skbuff.c | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 
> 8e51f8555e11b95bc48ab334f50571048f705101..fb35b62af2724025f743d61de24f9fb7eb9186a8
>  100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -3720,6 +3720,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
> net_warn_ratelimited(
> "skb_segment: too many frags: %u 
> %u\n",
> pos, mss);
> +   err = -EINVAL;
> goto err;
> }
>
> @@ -3753,11 +3754,10 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>
>  perform_csum_check:
> if (!csum) {
> -   if (skb_has_shared_frag(nskb)) {
> -   err = __skb_linearize(nskb);
> -   if (err)
> -   goto err;
> -   }
> +   if (skb_has_shared_frag(nskb) &&
> +   __skb_linearize(nskb))
> +   goto err;
> +
> if (!nskb->remcsum_offload)
> nskb->ip_summed = CHECKSUM_NONE;
> SKB_GSO_CB(nskb)->csum =
> --
> 2.18.0.233.g985f88cf7e-goog
>


Re: [PATCH net-next,v2] net: rename ndo_setup_tc to ndo_setup_offload

2018-07-19 Thread Alexander Duyck
On Thu, Jul 19, 2018 at 1:52 PM, Pablo Neira Ayuso  wrote:
> On Thu, Jul 19, 2018 at 08:18:20AM -0700, Alexander Duyck wrote:
>> On Wed, Jul 18, 2018 at 5:11 PM, Pablo Neira Ayuso  
>> wrote:
>> > One of the recurring complaints is that we do not have, as a driver
>> > writer, a central location from which we would be fed offloading rules
>> > into a NIC. This was brought up again during Netconf'18 in Boston.
>> >
>> > This patch just renames ndo_setup_tc to ndo_setup_offload as a very
>> > early initial work to prepare for a follow-up patch that discusses unified
>> > flow representation for the existing offload programming APIs.
>> >
>> > Signed-off-by: Pablo Neira Ayuso 
>> > Acked-by: Jiri Pirko 
>> > Acked-by: Jakub Kicinski 
>>
>> One request I would have here is to not bother updating the individual
>> driver function names. For now I would say we could leave the
>> "_setup_tc" in the naming of the driver functions itself and just
>> update the name of the net device operation. Renaming the driver
>> functions just adds unnecessary overhead and complexity to the patch
>> and will make it more difficult to maintain. When we get around to
>> adding additional functionality that relates to the rename we could
>> address renaming the function on a per driver basis in the future.
>
> The plan was for a follow-up patch to rename enum tc_setup_type too:
>
> https://marc.info/?l=linux-netdev&m=153193158512556&w=2
>
> that will result in more renames in the driver side.
>
> I would expect this will happen sooner or later, and out of tree
> patches will end up needing a rebase sooner or later, if that is the
> concern.

I was just thinking that renaming the functions themselves adds noise
and makes it harder to debug functions later when they get renamed. As
far as the out-of-tree driver I agree we will still have to deal with
it due to the enum and NDO function rename. I just figured that using
things like LXR is a bit easier when the function name stays the same
and you have to move between versions.

- Alex


Re: [PATCH net-next,v2] net: rename ndo_setup_tc to ndo_setup_offload

2018-07-19 Thread Alexander Duyck
On Wed, Jul 18, 2018 at 5:11 PM, Pablo Neira Ayuso  wrote:
> One of the recurring complaints is that we do not have, as a driver
> writer, a central location from which we would be fed offloading rules
> into a NIC. This was brought up again during Netconf'18 in Boston.
>
> This patch just renames ndo_setup_tc to ndo_setup_offload as a very
> early initial work to prepare for a follow-up patch that discusses unified
> flow representation for the existing offload programming APIs.
>
> Signed-off-by: Pablo Neira Ayuso 
> Acked-by: Jiri Pirko 
> Acked-by: Jakub Kicinski 

One request I would have here is to not bother updating the individual
driver function names. For now I would say we could leave the
"_setup_tc" in the naming of the driver functions itself and just
update the name of the net device operation. Renaming the driver
functions just adds unnecessary overhead and complexity to the patch
and will make it more difficult to maintain. When we get around to
adding additional functionality that relates to the rename we could
address renaming the function on a per driver basis in the future.


Re: tc mqprio offload command error

2018-07-16 Thread Alexander Duyck
On Sun, Jul 15, 2018 at 6:30 PM, Chopra, Manish
 wrote:
> Hello Folks,
>
> I am trying to run the command below to try mqprio offload on a 4.18 kernel.
> It is throwing the following error.
>
> # tc qdisc add dev eth0 root mqprio num_tc 2 map 1 1 1 1 0 0 0 0
> RTNETLINK answers: Numerical result out of range
>
> I can't really make out what's wrong with the above command, since this works 
> fine with other OS kernels.
> Any thoughts if it is something broken on upstream kernel ?
>
> Thanks,
> Manish

You might need to specify the traffic class for the 8 remaining
priorities. The full map size is 16 entries, not just 8. The default
value for the last 4 mapping entries is TC 3 which would be out of
range if you only have 2 TCs specified.
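
For example (untested, and assuming you want the remaining priorities
mapped to TC 0), spelling out all 16 map entries so that nothing defaults
to an out-of-range TC should work:

# tc qdisc add dev eth0 root mqprio num_tc 2 \
      map 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0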

- Alex


[jkirsher/next-queue PATCH v2 0/7] Add support for L2 Fwd Offload w/o ndo_select_queue

2018-07-09 Thread Alexander Duyck
This patch series is meant to allow support for the L2 forward offload, aka
MACVLAN offload without the need for using ndo_select_queue.

The existing solution currently requires that we use ndo_select_queue in
the transmit path if we want to associate specific Tx queues with a given
MACVLAN interface. In order to get away from this we need to repurpose the
tc_to_txq array and XPS pointer for the MACVLAN interface and use those as
a means of accessing the queues on the lower device. As a result we cannot
offload a device that is configured as multiqueue. However, it doesn't
really make sense to configure a macvlan interface as multiqueue anyway,
since it doesn't really have a qdisc of its own in the first place.

I am submitting this as an RFC for the netdev mailing list, and officially
submitting it for testing to Jeff Kirsher's next-queue in order to validate
the ixgbe specific bits.

The big changes in this set are:
  Allow lower device to update tc_to_txq and XPS map of offloaded MACVLAN
  Disable XPS for single queue devices
  Replace accel_priv with sb_dev in ndo_select_queue
  Add sb_dev parameter to fallback function for ndo_select_queue
  Consolidated ndo_select_queue functions that appeared to be duplicates

v2:
Updated patch set to rebase the netdev_pick_tx logic off of recent Rx
symmetric queue changes.

---

Alexander Duyck (7):
  net-sysfs: Drop support for XPS and traffic_class on single queue device
  net: Add support for subordinate device traffic classes
  ixgbe: Add code to populate and use macvlan tc to Tx queue map
  net: Add support for subordinate traffic classes to netdev_pick_tx
  net: Add generic ndo_select_queue functions
  net: allow ndo_select_queue to pass netdev
  net: allow fallback function to pass netdev


 drivers/infiniband/hw/hfi1/vnic_main.c|2 
 drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c |4 -
 drivers/net/bonding/bond_main.c   |3 
 drivers/net/ethernet/amazon/ena/ena_netdev.c  |5 -
 drivers/net/ethernet/broadcom/bcmsysport.c|6 -
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c   |6 +
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h   |3 
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c   |5 -
 drivers/net/ethernet/hisilicon/hns/hns_enet.c |5 -
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   62 ++--
 drivers/net/ethernet/lantiq_etop.c|   10 -
 drivers/net/ethernet/mellanox/mlx4/en_tx.c|7 +
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h  |3 
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |3 
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c   |5 -
 drivers/net/ethernet/renesas/ravb_main.c  |3 
 drivers/net/ethernet/sun/ldmvsw.c |3 
 drivers/net/ethernet/sun/sunvnet.c|3 
 drivers/net/ethernet/ti/netcp_core.c  |9 -
 drivers/net/hyperv/netvsc_drv.c   |6 -
 drivers/net/macvlan.c |   10 -
 drivers/net/net_failover.c|7 +
 drivers/net/team/team.c   |3 
 drivers/net/tun.c |3 
 drivers/net/wireless/marvell/mwifiex/main.c   |3 
 drivers/net/xen-netback/interface.c   |4 -
 drivers/net/xen-netfront.c|3 
 drivers/staging/netlogic/xlr_net.c|9 -
 drivers/staging/rtl8188eu/os_dep/os_intfs.c   |3 
 drivers/staging/rtl8723bs/os_dep/os_intfs.c   |7 -
 include/linux/netdevice.h |   34 -
 net/core/dev.c|  157 ++---
 net/core/net-sysfs.c  |   36 -
 net/mac80211/iface.c  |4 -
 net/packet/af_packet.c|7 +
 35 files changed, 312 insertions(+), 131 deletions(-)

--


[jkirsher/next-queue PATCH v2 5/7] net: Add generic ndo_select_queue functions

2018-07-09 Thread Alexander Duyck
This patch adds a generic version of the ndo_select_queue functions for
either returning 0 or selecting a queue based on the processor ID. This is
generally meant to just reduce the number of functions we have to change
in the future when we have to deal with ndo_select_queue changes.

Signed-off-by: Alexander Duyck 
---
 drivers/net/ethernet/lantiq_etop.c   |   10 +-
 drivers/net/ethernet/ti/netcp_core.c |9 +
 drivers/staging/netlogic/xlr_net.c   |9 +
 include/linux/netdevice.h|4 
 net/core/dev.c   |   14 ++
 net/packet/af_packet.c   |2 +-
 6 files changed, 22 insertions(+), 26 deletions(-)

diff --git a/drivers/net/ethernet/lantiq_etop.c 
b/drivers/net/ethernet/lantiq_etop.c
index afc8100..7a637b5 100644
--- a/drivers/net/ethernet/lantiq_etop.c
+++ b/drivers/net/ethernet/lantiq_etop.c
@@ -563,14 +563,6 @@ struct ltq_etop_priv {
spin_unlock_irqrestore(&priv->lock, flags);
 }
 
-static u16
-ltq_etop_select_queue(struct net_device *dev, struct sk_buff *skb,
- void *accel_priv, select_queue_fallback_t fallback)
-{
-   /* we are currently only using the first queue */
-   return 0;
-}
-
 static int
 ltq_etop_init(struct net_device *dev)
 {
@@ -641,7 +633,7 @@ struct ltq_etop_priv {
.ndo_set_mac_address = ltq_etop_set_mac_address,
.ndo_validate_addr = eth_validate_addr,
.ndo_set_rx_mode = ltq_etop_set_multicast_list,
-   .ndo_select_queue = ltq_etop_select_queue,
+   .ndo_select_queue = dev_pick_tx_zero,
.ndo_init = ltq_etop_init,
.ndo_tx_timeout = ltq_etop_tx_timeout,
 };
diff --git a/drivers/net/ethernet/ti/netcp_core.c 
b/drivers/net/ethernet/ti/netcp_core.c
index 6ebf110..a1d335a 100644
--- a/drivers/net/ethernet/ti/netcp_core.c
+++ b/drivers/net/ethernet/ti/netcp_core.c
@@ -1889,13 +1889,6 @@ static int netcp_rx_kill_vid(struct net_device *ndev, 
__be16 proto, u16 vid)
return err;
 }
 
-static u16 netcp_select_queue(struct net_device *dev, struct sk_buff *skb,
- void *accel_priv,
- select_queue_fallback_t fallback)
-{
-   return 0;
-}
-
 static int netcp_setup_tc(struct net_device *dev, enum tc_setup_type type,
  void *type_data)
 {
@@ -1972,7 +1965,7 @@ static int netcp_setup_tc(struct net_device *dev, enum 
tc_setup_type type,
.ndo_vlan_rx_add_vid= netcp_rx_add_vid,
.ndo_vlan_rx_kill_vid   = netcp_rx_kill_vid,
.ndo_tx_timeout = netcp_ndo_tx_timeout,
-   .ndo_select_queue   = netcp_select_queue,
+   .ndo_select_queue   = dev_pick_tx_zero,
.ndo_setup_tc   = netcp_setup_tc,
 };
 
diff --git a/drivers/staging/netlogic/xlr_net.c 
b/drivers/staging/netlogic/xlr_net.c
index e461168..4e6611e 100644
--- a/drivers/staging/netlogic/xlr_net.c
+++ b/drivers/staging/netlogic/xlr_net.c
@@ -290,13 +290,6 @@ static netdev_tx_t xlr_net_start_xmit(struct sk_buff *skb,
return NETDEV_TX_OK;
 }
 
-static u16 xlr_net_select_queue(struct net_device *ndev, struct sk_buff *skb,
-   void *accel_priv,
-   select_queue_fallback_t fallback)
-{
-   return (u16)smp_processor_id();
-}
-
 static void xlr_hw_set_mac_addr(struct net_device *ndev)
 {
struct xlr_net_priv *priv = netdev_priv(ndev);
@@ -403,7 +396,7 @@ static void xlr_stats(struct net_device *ndev, struct 
rtnl_link_stats64 *stats)
.ndo_open = xlr_net_open,
.ndo_stop = xlr_net_stop,
.ndo_start_xmit = xlr_net_start_xmit,
-   .ndo_select_queue = xlr_net_select_queue,
+   .ndo_select_queue = dev_pick_tx_cpu_id,
.ndo_set_mac_address = xlr_net_set_mac_addr,
.ndo_set_rx_mode = xlr_set_rx_mode,
.ndo_get_stats64 = xlr_stats,
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5729bc80..2e056bc 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2567,6 +2567,10 @@ struct net_device *__dev_get_by_flags(struct net *net, 
unsigned short flags,
 void dev_close_many(struct list_head *head, bool unlink);
 void dev_disable_lro(struct net_device *dev);
 int dev_loopback_xmit(struct net *net, struct sock *sk, struct sk_buff 
*newskb);
+u16 dev_pick_tx_zero(struct net_device *dev, struct sk_buff *skb,
+void *accel_priv, select_queue_fallback_t fallback);
+u16 dev_pick_tx_cpu_id(struct net_device *dev, struct sk_buff *skb,
+  void *accel_priv, select_queue_fallback_t fallback);
 int dev_queue_xmit(struct sk_buff *skb);
 int dev_queue_xmit_accel(struct sk_buff *skb, struct net_device *sb_dev);
 int dev_direct_xmit(struct sk_buff *skb, u16 queue_id);
diff --git a/net/core/dev.c b/net/core/dev.c
index 09a7cc2..b5e5380 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3617,6 +3617,20 @@ static int get_xps_queue(struct net_device *dev, str
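
The net/core/dev.c hunk is cut off in this digest. For reference, the two
generic helpers it adds plausibly look like this (a minimal sketch matching
the declarations added to netdevice.h above, not the exact applied hunk):

u16 dev_pick_tx_zero(struct net_device *dev, struct sk_buff *skb,
		     void *accel_priv, select_queue_fallback_t fallback)
{
	/* single queue devices always transmit on queue 0 */
	return 0;
}
EXPORT_SYMBOL(dev_pick_tx_zero);

u16 dev_pick_tx_cpu_id(struct net_device *dev, struct sk_buff *skb,
		       void *accel_priv, select_queue_fallback_t fallback)
{
	/* spread Tx across queues based on the CPU submitting the packet */
	return (u16)raw_smp_processor_id() % dev->real_num_tx_queues;
}
EXPORT_SYMBOL(dev_pick_tx_cpu_id);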

[jkirsher/next-queue PATCH v2 6/7] net: allow ndo_select_queue to pass netdev

2018-07-09 Thread Alexander Duyck
This patch makes it so that instead of passing a void pointer as the
accel_priv we instead pass a net_device pointer as sb_dev. Making this
change allows us to pass the subordinate device through to the fallback
function eventually so that we can keep the actual code in the
ndo_select_queue call as focused as possible on the exception cases.

Signed-off-by: Alexander Duyck 
---
 drivers/infiniband/hw/hfi1/vnic_main.c|2 +-
 drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c |4 ++--
 drivers/net/bonding/bond_main.c   |3 ++-
 drivers/net/ethernet/amazon/ena/ena_netdev.c  |3 ++-
 drivers/net/ethernet/broadcom/bcmsysport.c|2 +-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c   |3 ++-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h   |3 ++-
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c   |3 ++-
 drivers/net/ethernet/hisilicon/hns/hns_enet.c |3 ++-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |7 ---
 drivers/net/ethernet/mellanox/mlx4/en_tx.c|3 ++-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h  |3 ++-
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |3 ++-
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c   |3 ++-
 drivers/net/ethernet/renesas/ravb_main.c  |3 ++-
 drivers/net/ethernet/sun/ldmvsw.c |3 ++-
 drivers/net/ethernet/sun/sunvnet.c|3 ++-
 drivers/net/hyperv/netvsc_drv.c   |4 ++--
 drivers/net/net_failover.c|5 +++--
 drivers/net/team/team.c   |3 ++-
 drivers/net/tun.c |3 ++-
 drivers/net/wireless/marvell/mwifiex/main.c   |3 ++-
 drivers/net/xen-netback/interface.c   |2 +-
 drivers/net/xen-netfront.c|3 ++-
 drivers/staging/rtl8188eu/os_dep/os_intfs.c   |3 ++-
 drivers/staging/rtl8723bs/os_dep/os_intfs.c   |7 +++
 include/linux/netdevice.h |   11 +++
 net/core/dev.c|6 --
 net/mac80211/iface.c  |4 ++--
 29 files changed, 66 insertions(+), 42 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/vnic_main.c 
b/drivers/infiniband/hw/hfi1/vnic_main.c
index 5d65582..616fc9b 100644
--- a/drivers/infiniband/hw/hfi1/vnic_main.c
+++ b/drivers/infiniband/hw/hfi1/vnic_main.c
@@ -423,7 +423,7 @@ static netdev_tx_t hfi1_netdev_start_xmit(struct sk_buff 
*skb,
 
 static u16 hfi1_vnic_select_queue(struct net_device *netdev,
  struct sk_buff *skb,
- void *accel_priv,
+ struct net_device *sb_dev,
  select_queue_fallback_t fallback)
 {
struct hfi1_vnic_vport_info *vinfo = opa_vnic_dev_priv(netdev);
diff --git a/drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c 
b/drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c
index 0c8aec6..6155878 100644
--- a/drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c
+++ b/drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c
@@ -95,7 +95,7 @@ static netdev_tx_t opa_netdev_start_xmit(struct sk_buff *skb,
 }
 
 static u16 opa_vnic_select_queue(struct net_device *netdev, struct sk_buff 
*skb,
-void *accel_priv,
+struct net_device *sb_dev,
 select_queue_fallback_t fallback)
 {
struct opa_vnic_adapter *adapter = opa_vnic_priv(netdev);
@@ -107,7 +107,7 @@ static u16 opa_vnic_select_queue(struct net_device *netdev, 
struct sk_buff *skb,
mdata->entropy = opa_vnic_calc_entropy(skb);
mdata->vl = opa_vnic_get_vl(adapter, skb);
rc = adapter->rn_ops->ndo_select_queue(netdev, skb,
-  accel_priv, fallback);
+  sb_dev, fallback);
skb_pull(skb, sizeof(*mdata));
return rc;
 }
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 63e3844..9a2ea3c 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -4094,7 +4094,8 @@ static inline int bond_slave_override(struct bonding 
*bond,
 
 
 static u16 bond_select_queue(struct net_device *dev, struct sk_buff *skb,
-void *accel_priv, select_queue_fallback_t fallback)
+struct net_device *sb_dev,
+select_queue_fallback_t fallback)
 {
/* This helper function exists to help dev_pick_tx get the correct
 * destination queue.  Using a helper function skips a call to
diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c 
b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index f2af87d..e3befb1 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/
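
The diff is cut off here, and the include/linux/netdevice.h hunk does not
appear above. Based on the driver conversions shown, the ndo prototype
after this change would read roughly as follows (a sketch, not the exact
hunk):

	u16 (*ndo_select_queue)(struct net_device *dev,
				struct sk_buff *skb,
				struct net_device *sb_dev,
				select_queue_fallback_t fallback);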

[jkirsher/next-queue PATCH v2 3/7] ixgbe: Add code to populate and use macvlan tc to Tx queue map

2018-07-09 Thread Alexander Duyck
This patch makes it so that we use the tc_to_txq mapping in the macvlan
device in order to select the Tx queue for outgoing packets.

The idea here is to try and move away from using ixgbe_select_queue and to
come up with a generic way to make this work for devices going forward. By
encoding this information in the netdev, this can become a generic solution
for similar setups.

Signed-off-by: Alexander Duyck 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   44 ++---
 1 file changed, 38 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index c265963..3ff34ca 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -5275,6 +5275,8 @@ static void ixgbe_clean_rx_ring(struct ixgbe_ring 
*rx_ring)
 static int ixgbe_fwd_ring_up(struct ixgbe_adapter *adapter,
 struct ixgbe_fwd_adapter *accel)
 {
+   u16 rss_i = adapter->ring_feature[RING_F_RSS].indices;
+   int num_tc = netdev_get_num_tc(adapter->netdev);
struct net_device *vdev = accel->netdev;
int i, baseq, err;
 
@@ -5286,6 +5288,11 @@ static int ixgbe_fwd_ring_up(struct ixgbe_adapter 
*adapter,
accel->rx_base_queue = baseq;
accel->tx_base_queue = baseq;
 
+   /* record configuration for macvlan interface in vdev */
+   for (i = 0; i < num_tc; i++)
+   netdev_bind_sb_channel_queue(adapter->netdev, vdev,
+i, rss_i, baseq + (rss_i * i));
+
for (i = 0; i < adapter->num_rx_queues_per_pool; i++)
adapter->rx_ring[baseq + i]->netdev = vdev;
 
@@ -5310,6 +5317,10 @@ static int ixgbe_fwd_ring_up(struct ixgbe_adapter 
*adapter,
 
netdev_err(vdev, "L2FW offload disabled due to L2 filter error\n");
 
+   /* unbind the queues and drop the subordinate channel config */
+   netdev_unbind_sb_channel(adapter->netdev, vdev);
+   netdev_set_sb_channel(vdev, 0);
+
clear_bit(accel->pool, adapter->fwd_bitmask);
kfree(accel);
 
@@ -8206,18 +8217,22 @@ static u16 ixgbe_select_queue(struct net_device *dev, 
struct sk_buff *skb,
  void *accel_priv, select_queue_fallback_t 
fallback)
 {
struct ixgbe_fwd_adapter *fwd_adapter = accel_priv;
-   struct ixgbe_adapter *adapter;
-   int txq;
 #ifdef IXGBE_FCOE
+   struct ixgbe_adapter *adapter;
struct ixgbe_ring_feature *f;
 #endif
+   int txq;
 
if (fwd_adapter) {
-   adapter = netdev_priv(dev);
-   txq = reciprocal_scale(skb_get_hash(skb),
-  adapter->num_rx_queues_per_pool);
+   u8 tc = netdev_get_num_tc(dev) ?
+   netdev_get_prio_tc_map(dev, skb->priority) : 0;
+   struct net_device *vdev = fwd_adapter->netdev;
+
+   txq = vdev->tc_to_txq[tc].offset;
+   txq += reciprocal_scale(skb_get_hash(skb),
+   vdev->tc_to_txq[tc].count);
 
-   return txq + fwd_adapter->tx_base_queue;
+   return txq;
}
 
 #ifdef IXGBE_FCOE
@@ -8771,6 +8786,11 @@ static int ixgbe_reassign_macvlan_pool(struct net_device 
*vdev, void *data)
/* if we cannot find a free pool then disable the offload */
netdev_err(vdev, "L2FW offload disabled due to lack of queue 
resources\n");
macvlan_release_l2fw_offload(vdev);
+
+   /* unbind the queues and drop the subordinate channel config */
+   netdev_unbind_sb_channel(adapter->netdev, vdev);
+   netdev_set_sb_channel(vdev, 0);
+
kfree(accel);
 
return 0;
@@ -9779,6 +9799,13 @@ static void *ixgbe_fwd_add(struct net_device *pdev, 
struct net_device *vdev)
if (!macvlan_supports_dest_filter(vdev))
return ERR_PTR(-EMEDIUMTYPE);
 
+   /* We need to lock down the macvlan to be a single queue device so that
+* we can reuse the tc_to_txq field in the macvlan netdev to represent
+* the queue mapping to our netdev.
+*/
+   if (netif_is_multiqueue(vdev))
+   return ERR_PTR(-ERANGE);
+
pool = find_first_zero_bit(adapter->fwd_bitmask, adapter->num_rx_pools);
if (pool == adapter->num_rx_pools) {
u16 used_pools = adapter->num_vfs + adapter->num_rx_pools;
@@ -9835,6 +9862,7 @@ static void *ixgbe_fwd_add(struct net_device *pdev, 
struct net_device *vdev)
return ERR_PTR(-ENOMEM);
 
set_bit(pool, adapter->fwd_bitmask);
+   netdev_set_sb_channel(vdev, pool);
accel->pool = pool;
accel->netdev = vdev;
 
@@ -9876,6 +9904,10 @@ static void ixgbe_fwd_del(struct net_device *pdev, void 
*priv)
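
To illustrate the netdev_bind_sb_channel_queue() calls made in
ixgbe_fwd_ring_up() above: with hypothetical values rss_i = 4 and baseq = 8
on a 2-TC setup, the loop would bind the macvlan's TCs to contiguous slices
of the PF's queues like so:

	netdev_bind_sb_channel_queue(adapter->netdev, vdev, 0, 4, 8);	/* TC 0 -> queues 8-11 */
	netdev_bind_sb_channel_queue(adapter->netdev, vdev, 1, 4, 12);	/* TC 1 -> queues 12-15 */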
   

[jkirsher/next-queue PATCH v2 7/7] net: allow fallback function to pass netdev

2018-07-09 Thread Alexander Duyck
For most of these calls we can just pass NULL through to the fallback
function as the sb_dev. The only cases where we cannot are the cases where
we might be dealing with either an upper device or a driver that would
have configured things to support an sb_dev itself.

The only driver that has any significant change in this patchset should be
ixgbe as we can drop the redundant functionality that existed in both the
ndo_select_queue function and the fallback function that was passed through
to us.

Signed-off-by: Alexander Duyck 
---
 drivers/net/ethernet/amazon/ena/ena_netdev.c|2 +-
 drivers/net/ethernet/broadcom/bcmsysport.c  |4 ++--
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c |3 ++-
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c |2 +-
 drivers/net/ethernet/hisilicon/hns/hns_enet.c   |2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c   |4 ++--
 drivers/net/ethernet/mellanox/mlx4/en_tx.c  |4 ++--
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c |2 +-
 drivers/net/hyperv/netvsc_drv.c |2 +-
 drivers/net/net_failover.c  |2 +-
 drivers/net/xen-netback/interface.c |2 +-
 include/linux/netdevice.h   |3 ++-
 net/core/dev.c  |   12 +++-
 net/packet/af_packet.c  |7 ---
 14 files changed, 24 insertions(+), 27 deletions(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c 
b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index e3befb1..c673ac2 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -2224,7 +2224,7 @@ static u16 ena_select_queue(struct net_device *dev, 
struct sk_buff *skb,
if (skb_rx_queue_recorded(skb))
qid = skb_get_rx_queue(skb);
else
-   qid = fallback(dev, skb);
+   qid = fallback(dev, skb, NULL);
 
return qid;
 }
diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c 
b/drivers/net/ethernet/broadcom/bcmsysport.c
index 32f548e..eb890c4 100644
--- a/drivers/net/ethernet/broadcom/bcmsysport.c
+++ b/drivers/net/ethernet/broadcom/bcmsysport.c
@@ -2116,7 +2116,7 @@ static u16 bcm_sysport_select_queue(struct net_device 
*dev, struct sk_buff *skb,
unsigned int q, port;
 
if (!netdev_uses_dsa(dev))
-   return fallback(dev, skb);
+   return fallback(dev, skb, NULL);
 
/* DSA tagging layer will have configured the correct queue */
q = BRCM_TAG_GET_QUEUE(queue);
@@ -2124,7 +2124,7 @@ static u16 bcm_sysport_select_queue(struct net_device 
*dev, struct sk_buff *skb,
tx_ring = priv->ring_map[q + port * priv->per_port_num_tx_queues];
 
if (unlikely(!tx_ring))
-   return fallback(dev, skb);
+   return fallback(dev, skb, NULL);
 
return tx_ring->index;
 }
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index e4e1cf9..5a727d4 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -1933,7 +1933,8 @@ u16 bnx2x_select_queue(struct net_device *dev, struct 
sk_buff *skb,
}
 
/* select a non-FCoE queue */
-   return fallback(dev, skb) % (BNX2X_NUM_ETH_QUEUES(bp) * bp->max_cos);
+   return fallback(dev, skb, NULL) %
+  (BNX2X_NUM_ETH_QUEUES(bp) * bp->max_cos);
 }
 
 void bnx2x_set_num_queues(struct bnx2x *bp)
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 5dc5e56..40cf8dc 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -973,7 +973,7 @@ static u16 cxgb_select_queue(struct net_device *dev, struct 
sk_buff *skb,
return txq;
}
 
-   return fallback(dev, skb) % dev->real_num_tx_queues;
+   return fallback(dev, skb, NULL) % dev->real_num_tx_queues;
 }
 
 static int closest_timer(const struct sge *s, int time)
diff --git a/drivers/net/ethernet/hisilicon/hns/hns_enet.c 
b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
index ff7a74e..948b3e0 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
@@ -2033,7 +2033,7 @@ static void hns_nic_get_stats64(struct net_device *ndev,
is_multicast_ether_addr(eth_hdr->h_dest))
return 0;
else
-   return fallback(ndev, skb);
+   return fallback(ndev, skb, NULL);
 }
 
 static const struct net_device_ops hns_nic_netdev_ops = {
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index a0cf33d..bdaecae 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -8242,11 +8242,11 @@ stati
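
The message truncates here, and the netdevice.h hunk is not visible. Based
on the call sites above, which now pass a third argument, the fallback
typedef plausibly becomes:

	typedef u16 (*select_queue_fallback_t)(struct net_device *dev,
					       struct sk_buff *skb,
					       struct net_device *sb_dev);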

[jkirsher/next-queue PATCH v2 4/7] net: Add support for subordinate traffic classes to netdev_pick_tx

2018-07-09 Thread Alexander Duyck
This change makes it so that we can support the concept of subordinate
device traffic classes to the core networking code. In doing this we can
start pulling out the driver specific bits needed to support selecting a
queue based on an upper device.

The solution as it currently stands is only partially implemented. I have
the start of some XPS bits in here, but I would still need to allow for
configuration of the XPS maps on the queues reserved for the subordinate
devices. For now I am using the reference to the sb_dev XPS map just as a
way to skip the lookup of the lower device's XPS map, as that would
result in the wrong queue being picked.

Signed-off-by: Alexander Duyck 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   19 +++-
 drivers/net/macvlan.c |   10 +---
 include/linux/netdevice.h |4 +-
 net/core/dev.c|   58 +++--
 4 files changed, 45 insertions(+), 46 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 3ff34ca..41ef58f 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -8213,20 +8213,17 @@ static void ixgbe_atr(struct ixgbe_ring *ring,
  input, common, ring->queue_index);
 }
 
+#ifdef IXGBE_FCOE
 static u16 ixgbe_select_queue(struct net_device *dev, struct sk_buff *skb,
  void *accel_priv, select_queue_fallback_t 
fallback)
 {
-   struct ixgbe_fwd_adapter *fwd_adapter = accel_priv;
-#ifdef IXGBE_FCOE
struct ixgbe_adapter *adapter;
struct ixgbe_ring_feature *f;
-#endif
int txq;
 
-   if (fwd_adapter) {
-   u8 tc = netdev_get_num_tc(dev) ?
-   netdev_get_prio_tc_map(dev, skb->priority) : 0;
-   struct net_device *vdev = fwd_adapter->netdev;
+   if (accel_priv) {
+   u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
+   struct net_device *vdev = accel_priv;
 
txq = vdev->tc_to_txq[tc].offset;
txq += reciprocal_scale(skb_get_hash(skb),
@@ -8235,8 +8232,6 @@ static u16 ixgbe_select_queue(struct net_device *dev, 
struct sk_buff *skb,
return txq;
}
 
-#ifdef IXGBE_FCOE
-
/*
 * only execute the code below if protocol is FCoE
 * or FIP and we have FCoE enabled on the adapter
@@ -8262,11 +8257,9 @@ static u16 ixgbe_select_queue(struct net_device *dev, 
struct sk_buff *skb,
txq -= f->indices;
 
return txq + f->offset;
-#else
-   return fallback(dev, skb);
-#endif
 }
 
+#endif
 static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
   struct xdp_frame *xdpf)
 {
@@ -10068,7 +10061,6 @@ static int ixgbe_xdp_xmit(struct net_device *dev, int n,
.ndo_open   = ixgbe_open,
.ndo_stop   = ixgbe_close,
.ndo_start_xmit = ixgbe_xmit_frame,
-   .ndo_select_queue   = ixgbe_select_queue,
.ndo_set_rx_mode= ixgbe_set_rx_mode,
.ndo_validate_addr  = eth_validate_addr,
.ndo_set_mac_address= ixgbe_set_mac,
@@ -10091,6 +10083,7 @@ static int ixgbe_xdp_xmit(struct net_device *dev, int n,
.ndo_poll_controller= ixgbe_netpoll,
 #endif
 #ifdef IXGBE_FCOE
+   .ndo_select_queue   = ixgbe_select_queue,
.ndo_fcoe_ddp_setup = ixgbe_fcoe_ddp_get,
.ndo_fcoe_ddp_target = ixgbe_fcoe_ddp_target,
.ndo_fcoe_ddp_done = ixgbe_fcoe_ddp_put,
diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index adde8fc..401e1d1 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -514,7 +514,6 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct 
net_device *dev)
const struct macvlan_dev *vlan = netdev_priv(dev);
const struct macvlan_port *port = vlan->port;
const struct macvlan_dev *dest;
-   void *accel_priv = NULL;
 
if (vlan->mode == MACVLAN_MODE_BRIDGE) {
const struct ethhdr *eth = (void *)skb->data;
@@ -533,15 +532,10 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct 
net_device *dev)
return NET_XMIT_SUCCESS;
}
}
-
-   /* For packets that are non-multicast and not bridged we will pass
-* the necessary information so that the lowerdev can distinguish
-* the source of the packets via the accel_priv value.
-*/
-   accel_priv = vlan->accel_priv;
 xmit_world:
skb->dev = vlan->lowerdev;
-   return dev_queue_xmit_accel(skb, accel_priv);
+   return dev_queue_xmit_accel(skb,
+   netdev_get_sb_channel(dev) ? dev : NULL);
 }
 
 static inline netdev_tx_t macvlan_netpoll_send_skb(struct macvlan_dev 
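
The net/core/dev.c portion of this patch is cut off in this digest. The
heart of the mechanism is the Tx hashing path consulting the subordinate
device's tc_to_txq slice; roughly (a sketch of the idea, not the exact
hunk):

static u16 skb_tx_hash(const struct net_device *dev,
		       const struct net_device *sb_dev,
		       struct sk_buff *skb)
{
	u16 qoffset = 0, qcount = dev->real_num_tx_queues;

	if (dev->num_tc) {
		u8 tc = netdev_get_prio_tc_map(dev, skb->priority);

		/* pull the queue slice from the subordinate device */
		qoffset = sb_dev->tc_to_txq[tc].offset;
		qcount = sb_dev->tc_to_txq[tc].count;
	}

	if (skb_rx_queue_recorded(skb)) {
		u16 hash = skb_get_rx_queue(skb);

		while (unlikely(hash >= qcount))
			hash -= qcount;
		return hash + qoffset;
	}

	return (u16)reciprocal_scale(skb_get_hash(skb), qcount) + qoffset;
}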

[jkirsher/next-queue PATCH v2 1/7] net-sysfs: Drop support for XPS and traffic_class on single queue device

2018-07-09 Thread Alexander Duyck
This patch makes it so that we do not report the traffic class or allow XPS
configuration on single queue devices. This is mostly to avoid unnecessary
complexity with changes I have planned that will allow us to reuse
the unused tc_to_txq and XPS configuration on a single queue device to
allow it to make use of a subset of queues on an underlying device.

Signed-off-by: Alexander Duyck 
---
 net/core/net-sysfs.c |   15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index f25ac5f..dce3ae0 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1047,9 +1047,14 @@ static ssize_t traffic_class_show(struct netdev_queue 
*queue,
  char *buf)
 {
struct net_device *dev = queue->dev;
-   int index = get_netdev_queue_index(queue);
-   int tc = netdev_txq_to_tc(dev, index);
+   int index;
+   int tc;
 
+   if (!netif_is_multiqueue(dev))
+   return -ENOENT;
+
+   index = get_netdev_queue_index(queue);
+   tc = netdev_txq_to_tc(dev, index);
if (tc < 0)
return -EINVAL;
 
@@ -1214,6 +1219,9 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
cpumask_var_t mask;
unsigned long index;
 
+   if (!netif_is_multiqueue(dev))
+   return -ENOENT;
+
index = get_netdev_queue_index(queue);
 
if (dev->num_tc) {
@@ -1260,6 +1268,9 @@ static ssize_t xps_cpus_store(struct netdev_queue *queue,
cpumask_var_t mask;
int err;
 
+   if (!netif_is_multiqueue(dev))
+   return -ENOENT;
+
if (!capable(CAP_NET_ADMIN))
return -EPERM;
 



[jkirsher/next-queue PATCH v2 2/7] net: Add support for subordinate device traffic classes

2018-07-09 Thread Alexander Duyck
This patch is meant to provide the basic tools needed to allow us to create
subordinate device traffic classes. The general idea here is to allow
subdividing the queues of a device into queue groups accessible through an
upper device such as a macvlan.

The idea here is to enforce the idea that an upper device has to be a
single queue device, ideally with IFF_NO_QUEUE set. With that being the
case we can pretty much guarantee that the tc_to_txq mappings and XPS maps
for the upper device are unused. As such we could reuse those in order to
support subdividing the lower device and distributing those queues between
the subordinate devices.

In order to distinguish between a regular set of traffic classes and a
device carrying subordinate traffic classes I changed num_tc from a u8
to a s16 value and use the negative values to represent the subordinate
pool values. So starting at -1 and running to -32768 we can encode those as
pool values, and the existing values of 0 to 15 can be maintained.

Signed-off-by: Alexander Duyck 
---
 include/linux/netdevice.h |   16 
 net/core/dev.c|   89 +
 net/core/net-sysfs.c  |   21 ++-
 3 files changed, 124 insertions(+), 2 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index b683971..4648a9a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -575,6 +575,9 @@ struct netdev_queue {
 * (/sys/class/net/DEV/Q/trans_timeout)
 */
unsigned long   trans_timeout;
+
+   /* Subordinate device that the queue has been assigned to */
+   struct net_device   *sb_dev;
 /*
  * write-mostly part
  */
@@ -1991,7 +1994,7 @@ struct net_device {
 #ifdef CONFIG_DCB
const struct dcbnl_rtnl_ops *dcbnl_ops;
 #endif
-   u8  num_tc;
+   s16 num_tc;
struct netdev_tc_txqtc_to_txq[TC_MAX_QUEUE];
u8  prio_tc_map[TC_BITMASK + 1];
 
@@ -2045,6 +2048,17 @@ int netdev_get_num_tc(struct net_device *dev)
return dev->num_tc;
 }
 
+void netdev_unbind_sb_channel(struct net_device *dev,
+ struct net_device *sb_dev);
+int netdev_bind_sb_channel_queue(struct net_device *dev,
+struct net_device *sb_dev,
+u8 tc, u16 count, u16 offset);
+int netdev_set_sb_channel(struct net_device *dev, u16 channel);
+static inline int netdev_get_sb_channel(struct net_device *dev)
+{
+   return max_t(int, -dev->num_tc, 0);
+}
+
 static inline
 struct netdev_queue *netdev_get_tx_queue(const struct net_device *dev,
 unsigned int index)
diff --git a/net/core/dev.c b/net/core/dev.c
index 89825c1..cc1d6bb 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2067,11 +2067,13 @@ int netdev_txq_to_tc(struct net_device *dev, unsigned 
int txq)
struct netdev_tc_txq *tc = &dev->tc_to_txq[0];
int i;
 
+   /* walk through the TCs and see if it falls into any of them */
for (i = 0; i < TC_MAX_QUEUE; i++, tc++) {
if ((txq - tc->offset) < tc->count)
return i;
}
 
+   /* didn't find it, just return -1 to indicate no match */
return -1;
}
 
@@ -2260,7 +2262,14 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
unsigned int nr_ids;
 
if (dev->num_tc) {
+   /* Do not allow XPS on subordinate device directly */
num_tc = dev->num_tc;
+   if (num_tc < 0)
+   return -EINVAL;
+
+   /* If queue belongs to subordinate dev use its map */
+   dev = netdev_get_tx_queue(dev, index)->sb_dev ? : dev;
+
tc = netdev_txq_to_tc(dev, index);
if (tc < 0)
return -EINVAL;
@@ -2448,11 +2457,25 @@ int netif_set_xps_queue(struct net_device *dev, const 
struct cpumask *mask,
 EXPORT_SYMBOL(netif_set_xps_queue);
 
 #endif
+static void netdev_unbind_all_sb_channels(struct net_device *dev)
+{
+   struct netdev_queue *txq = &dev->_tx[dev->num_tx_queues];
+
+   /* Unbind any subordinate channels */
+   while (txq-- != &dev->_tx[0]) {
+   if (txq->sb_dev)
+   netdev_unbind_sb_channel(dev, txq->sb_dev);
+   }
+}
+
 void netdev_reset_tc(struct net_device *dev)
 {
 #ifdef CONFIG_XPS
netif_reset_xps_queues_gt(dev, 0);
 #endif
+   netdev_unbind_all_sb_channels(dev);
+
+   /* Reset TC configuration of device */
dev->num_tc = 0;
memset(dev->tc_to_txq, 0, sizeof(dev->tc_to_txq));
memset(dev->prio_tc_map, 0, sizeof(dev->prio_tc_map));
@@ -2481,11 +2504,77 @@ int netde
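
The dev.c hunk truncates here. For reference, the setter that stores the
negative encoding described above plausibly looks something like this (a
sketch matching the netdev_get_sb_channel() inline shown earlier, not the
exact applied hunk):

int netdev_set_sb_channel(struct net_device *dev, u16 channel)
{
	/* Do not use a multiqueue device to represent a subordinate channel */
	if (netif_is_multiqueue(dev))
		return -ENODEV;

	/* Channels 1 - 32767 are usable as subordinate pools; writing 0
	 * resets the device to "native" mode. Storing the channel as a
	 * negative num_tc keeps it distinct from real traffic classes.
	 */
	if (channel > S16_MAX)
		return -EINVAL;

	dev->num_tc = -channel;

	return 0;
}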

Re: [net-next PATCH v4 3/7] net: sock: Change tx_queue_mapping in sock_common to unsigned short

2018-06-25 Thread Alexander Duyck
On Mon, Jun 25, 2018 at 6:34 PM, Tom Herbert  wrote:
>
>
> On Mon, Jun 25, 2018 at 11:04 AM, Amritha Nambiar
>  wrote:
>>
>> Change 'skc_tx_queue_mapping' field in sock_common structure from
>> 'int' to 'unsigned short' type with 0 indicating unset and
>> a positive queue value being set. This way it is consistent with
>> the queue_mapping field in the sk_buff. This will also accommodate
>> adding a new 'unsigned short' field in sock_common in the next
>> patch for rx_queue_mapping.
>>
>> Signed-off-by: Amritha Nambiar 
>> ---
>>  include/net/sock.h |   10 ++
>>  1 file changed, 6 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/net/sock.h b/include/net/sock.h
>> index b3b7541..009fd30 100644
>> --- a/include/net/sock.h
>> +++ b/include/net/sock.h
>> @@ -214,7 +214,7 @@ struct sock_common {
>> struct hlist_node   skc_node;
>> struct hlist_nulls_node skc_nulls_node;
>> };
>> -   int skc_tx_queue_mapping;
>> +   unsigned short  skc_tx_queue_mapping;
>> union {
>> int skc_incoming_cpu;
>> u32 skc_rcv_wnd;
>> @@ -1681,17 +1681,19 @@ static inline int sk_receive_skb(struct sock *sk,
>> struct sk_buff *skb,
>>
>>  static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
>>  {
>> -   sk->sk_tx_queue_mapping = tx_queue;
>> +   /* sk_tx_queue_mapping accepts only up to a 16-bit value */
>> +   WARN_ON((unsigned short)tx_queue > USHRT_MAX);
>
>
> Shouldn't this be USHRT_MAX - 1 ?

Actually just a ">=" would probably do as well.

>
>> +   sk->sk_tx_queue_mapping = tx_queue + 1;
>>  }
>>
>>  static inline void sk_tx_queue_clear(struct sock *sk)
>>  {
>> -   sk->sk_tx_queue_mapping = -1;
>>
>> +   sk->sk_tx_queue_mapping = 0;
>
>
> I think it's slightly better to define a new constant like NO_QUEUE_MAPPING
> to be USHRT_MAX. That avoids needing to do the arithmetic every time the
> value is accessed.
>>
>>  }
>>
>>  static inline int sk_tx_queue_get(const struct sock *sk)
>>  {
>> -   return sk ? sk->sk_tx_queue_mapping : -1;
>> +   return sk ? sk->sk_tx_queue_mapping - 1 : -1;
>
>
> Doesn't the comparison in __netdev_pick_tx need to be simultaneously changed
> for this?

This doesn't change the result. It is still -1 if the queue mapping
is not set. The field is just initialized to 0 instead of -1, so we have
to perform the operation to get there.

Also in regards to the comment above about needing an extra operation
I am not sure it makes much difference.

In the case of us starting with 0 as a reserved value I think the
instruction count should be about the same. We move the unsigned short
into an unsigned int, then decrement, and if the value is non-negative
we can assume it is valid. Although maybe I should double check the
code to make certain it is doing what I thought it was supposed to be
doing.
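
Combining the two suggestions above (a ">=" check plus a NO_QUEUE_MAPPING
constant in place of the +1 arithmetic), the helpers would plausibly end up
looking something like this sketch:

#define NO_QUEUE_MAPPING	USHRT_MAX

static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
{
	/* only 0 .. USHRT_MAX - 1 are storable; USHRT_MAX means "unset" */
	if (WARN_ON_ONCE((unsigned short)tx_queue >= USHRT_MAX))
		return;
	sk->sk_tx_queue_mapping = tx_queue;
}

static inline void sk_tx_queue_clear(struct sock *sk)
{
	sk->sk_tx_queue_mapping = NO_QUEUE_MAPPING;
}

static inline int sk_tx_queue_get(const struct sock *sk)
{
	if (sk && sk->sk_tx_queue_mapping != NO_QUEUE_MAPPING)
		return sk->sk_tx_queue_mapping;

	return -1;
}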



Re: [PATCH v2 net-next] net/sched: add skbprio scheduler

2018-06-23 Thread Alexander Duyck
On Sat, Jun 23, 2018 at 1:47 PM, Nishanth Devarajan  wrote:
> net/sched: add skbprio scheduler
>
> Skbprio (SKB Priority Queue) is a queueing discipline that prioritizes packets
> according to their skb->priority field. Although Skbprio can be employed in 
> any
> scenario in which a higher skb->priority field means a higher priority packet,
> Skbprio was conceived as a solution for denial-of-service defenses that need
> to
> route packets with different priorities.

Really this description is not very good. Reading it I was thinking to
myself "why do we need this, prio already does this". It wasn't until
I read through the code that I figured out that you are basically
adding dropping of lower priority frames.

>
> v2
> *Use skb->priority field rather than DS field. Rename queueing discipline as
> SKB Priority Queue (previously Gatekeeper Priority Queue).
>
> *Queueing discipline is made classful to expose Skbprio's internal priority
> queues.
>
> Signed-off-by: Nishanth Devarajan 
> Reviewed-by: Sachin Paryani 
> Reviewed-by: Cody Doucette 
> Reviewed-by: Michel Machado 
> ---
>  include/uapi/linux/pkt_sched.h |  15 ++
>  net/sched/Kconfig  |  13 ++
>  net/sched/Makefile |   1 +
>  net/sched/sch_skbprio.c| 347 
> +
>  4 files changed, 376 insertions(+)
>  create mode 100644 net/sched/sch_skbprio.c
>
> diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
> index 37b5096..6fd07e8 100644
> --- a/include/uapi/linux/pkt_sched.h
> +++ b/include/uapi/linux/pkt_sched.h
> @@ -124,6 +124,21 @@ struct tc_fifo_qopt {
> __u32   limit;  /* Queue length: bytes for bfifo, packets for pfifo */
>  };
>
> +/* SKBPRIO section */
> +
> +/*
> + * Priorities go from zero to (SKBPRIO_MAX_PRIORITY - 1).
> + * SKBPRIO_MAX_PRIORITY should be at least 64 in order for skbprio to be able
> + * to map one to one the DS field of IPV4 and IPV6 headers.
> + * Memory allocation grows linearly with SKBPRIO_MAX_PRIORITY.
> + */
> +
> +#define SKBPRIO_MAX_PRIORITY 64
> +
> +struct tc_skbprio_qopt {
> +   __u32   limit;  /* Queue length in packets. */
> +};
> +
>  /* PRIO section */
>
>  #define TCQ_PRIO_BANDS 16
> diff --git a/net/sched/Kconfig b/net/sched/Kconfig
> index a01169f..9ac4b53 100644
> --- a/net/sched/Kconfig
> +++ b/net/sched/Kconfig
> @@ -240,6 +240,19 @@ config NET_SCH_MQPRIO
>
>   If unsure, say N.
>
> +config NET_SCH_SKBPRIO
> +   tristate "SKB priority queue scheduler (SKBPRIO)"
> +   help
> + Say Y here if you want to use the SKB priority queue
> + scheduler. This schedules packets according to skb->priority,
> + which is useful for request packets in DoS mitigation systems such
> + as Gatekeeper.
> +
> + To compile this driver as a module, choose M here: the module will
> + be called sch_skbprio.
> +
> + If unsure, say N.
> +
>  config NET_SCH_CHOKE
> tristate "CHOose and Keep responsive flow scheduler (CHOKE)"
> help
> diff --git a/net/sched/Makefile b/net/sched/Makefile
> index 8811d38..a4d8893 100644
> --- a/net/sched/Makefile
> +++ b/net/sched/Makefile
> @@ -46,6 +46,7 @@ obj-$(CONFIG_NET_SCH_NETEM)   += sch_netem.o
>  obj-$(CONFIG_NET_SCH_DRR)  += sch_drr.o
>  obj-$(CONFIG_NET_SCH_PLUG) += sch_plug.o
>  obj-$(CONFIG_NET_SCH_MQPRIO)   += sch_mqprio.o
> +obj-$(CONFIG_NET_SCH_SKBPRIO)  += sch_skbprio.o
>  obj-$(CONFIG_NET_SCH_CHOKE)+= sch_choke.o
>  obj-$(CONFIG_NET_SCH_QFQ)  += sch_qfq.o
>  obj-$(CONFIG_NET_SCH_CODEL)+= sch_codel.o
> diff --git a/net/sched/sch_skbprio.c b/net/sched/sch_skbprio.c
> new file mode 100644
> index 000..5e89446
> --- /dev/null
> +++ b/net/sched/sch_skbprio.c
> @@ -0,0 +1,347 @@
> +/*
> + * net/sched/sch_skbprio.c  SKB Priority Queue.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version
> + * 2 of the License, or (at your option) any later version.
> + *
> + * Authors:Nishanth Devarajan, 
> + * Cody Doucette, 
> + * original idea by Michel Machado, Cody Doucette, and Qiaobin Fu
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +
> +/*   SKB Priority Queue
> + * =
> + *
> + * This qdisc schedules a packet according to skb->priority, where a higher
> + * value places the packet closer to the exit of the queue. When the queue is
> + * full, the lowest priority packet in the queue is dropped to make room for
> + * the packet to be added if it has higher priority. If the packet to be 
> added
> + * has lower priority than all packets in the queue, it is dropped.
> + *
> + * Without the SKB priority queue, queue 
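
The quoted comment truncates above. To make the drop behavior concrete (the
point earlier about skbprio adding dropping of lower priority frames, which
plain prio does not do), here is a tiny standalone userspace sketch of the
policy; it illustrates the idea only and is not the qdisc code:

#include <stdbool.h>
#include <stdio.h>

#define NPRIO	4	/* distinct priority values in this toy model */
#define LIMIT	4	/* total queue capacity in packets */

static int qlen[NPRIO];	/* per-priority FIFO depth */
static int total;

/* higher prio value = higher priority, as with skb->priority in skbprio */
static bool enqueue(int prio)
{
	int lowest = 0;

	if (total < LIMIT) {	/* room left: plain enqueue */
		qlen[prio]++;
		total++;
		return true;
	}

	while (lowest < NPRIO && !qlen[lowest])
		lowest++;	/* find the lowest non-empty priority */

	if (prio <= lowest)	/* nothing queued is lower: drop the arrival */
		return false;

	qlen[lowest]--;		/* evict a low-priority packet instead */
	qlen[prio]++;		/* total stays at LIMIT */
	return true;
}

int main(void)
{
	static const int pattern[] = { 0, 0, 1, 2, 3, 0, 3 };
	unsigned int i;

	for (i = 0; i < sizeof(pattern) / sizeof(pattern[0]); i++)
		printf("prio %d -> %s\n", pattern[i],
		       enqueue(pattern[i]) ? "queued" : "dropped");
	return 0;
}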

Re: [Intel-wired-lan] [PATCH net/jkirsher] bpf, xdp, i40e: fix i40e_build_skb skb reserve and truesize

2018-06-13 Thread Alexander Duyck
On Wed, Jun 13, 2018 at 9:26 AM, John Fastabend
 wrote:
> On 06/13/2018 02:04 AM, Daniel Borkmann wrote:
>> Using skb_reserve(skb, I40E_SKB_PAD + (xdp->data - xdp->data_hard_start))
>> is clearly wrong since I40E_SKB_PAD already points to the offset where
>> the original xdp->data was sitting since xdp->data_hard_start is defined
>> as xdp->data - i40e_rx_offset(rx_ring), where the latter offsets to I40E_SKB_PAD
>> when build skb is used.
>>
>> However, also before cc5b114dcf98 ("bpf, i40e: add meta data support")
>> this seems broken since bpf_xdp_adjust_head() helper could have been used
>> to alter headroom and enlarge / shrink the frame and with that the assumption
>> that the xdp->data remains unchanged does not hold and would push a bogus
>> packet to upper stack.
>>
>> ixgbe got this right in 924708081629 ("ixgbe: add XDP support for pass and
>> drop actions"). In any case, fix it by removing the I40E_SKB_PAD from both
>> skb_reserve() and truesize calculation.
>>
>> Fixes: cc5b114dcf98 ("bpf, i40e: add meta data support")
>> Fixes: 0c8493d90b6b ("i40e: add XDP support for pass and drop actions")
>> Reported-by: Keith Busch 
>> Reported-by: Toshiaki Makita 
>> Signed-off-by: Daniel Borkmann 
>> Cc: Björn Töpel 
>> Cc: John Fastabend 
>> ---
>>  drivers/net/ethernet/intel/i40e/i40e_txrx.c | 7 +++
>>  1 file changed, 3 insertions(+), 4 deletions(-)
>>
>
> Thanks! I missed this during review.
>
> Acked-by: John Fastabend 

Looks good to me. Thanks for taking care of this.

Acked-by: Alexander Duyck 
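
For reference, the fix described in the commit message plausibly boils down
to the following in i40e_build_skb() (a sketch of the result, not the
applied diff):

	/* headroom and truesize come straight from the XDP offsets instead
	 * of re-adding I40E_SKB_PAD, which xdp->data already accounts for
	 */
	unsigned int truesize = SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) +
				SKB_DATA_ALIGN(xdp->data_end -
					       xdp->data_hard_start);
	...
	skb_reserve(skb, xdp->data - xdp->data_hard_start);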


Re: [Intel-wired-lan] [jkirsher/next-queue PATCH v2 0/7] Add support for L2 Fwd Offload w/o ndo_select_queue

2018-06-12 Thread Alexander Duyck
On Tue, Jun 12, 2018 at 10:50 AM, Stephen Hemminger
 wrote:
> On Tue, 12 Jun 2018 11:18:25 -0400
> Alexander Duyck  wrote:
>
>> This patch series is meant to allow support for the L2 forward offload, aka
>> MACVLAN offload without the need for using ndo_select_queue.
>>
>> The existing solution currently requires that we use ndo_select_queue in
>> the transmit path if we want to associate specific Tx queues with a given
>> MACVLAN interface. In order to get away from this we need to repurpose the
>> tc_to_txq array and XPS pointer for the MACVLAN interface and use those as
>> a means of accessing the queues on the lower device. As a result we cannot
>> offload a device that is configured as multiqueue. However, it doesn't
>> really make sense to configure a macvlan interface as multiqueue anyway,
>> since it doesn't really have a qdisc of its own in the first place.
>>
>> I am submitting this as an RFC for the netdev mailing list, and officially
>> submitting it for testing to Jeff Kirsher's next-queue in order to validate
>> the ixgbe specific bits.
>>
>> The big changes in this set are:
>>   Allow lower device to update tc_to_txq and XPS map of offloaded MACVLAN
>>   Disable XPS for single queue devices
>>   Replace accel_priv with sb_dev in ndo_select_queue
>>   Add sb_dev parameter to fallback function for ndo_select_queue
>>   Consolidated ndo_select_queue functions that appeared to be duplicates
>>
>> v2: Implement generic "select_queue" functions instead of "fallback" 
>> functions.
>> Tweak last two patches to account for changes in dev_pick_tx_xxx 
>> functions.
>>
>> ---
>>
>> Alexander Duyck (7):
>>   net-sysfs: Drop support for XPS and traffic_class on single queue 
>> device
>>   net: Add support for subordinate device traffic classes
>>   ixgbe: Add code to populate and use macvlan tc to Tx queue map
>>   net: Add support for subordinate traffic classes to netdev_pick_tx
>>   net: Add generic ndo_select_queue functions
>>   net: allow ndo_select_queue to pass netdev
>>   net: allow fallback function to pass netdev
>>
>>
>>  drivers/infiniband/hw/hfi1/vnic_main.c|2
>>  drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c |4 -
>>  drivers/net/bonding/bond_main.c   |3
>>  drivers/net/ethernet/amazon/ena/ena_netdev.c  |5 -
>>  drivers/net/ethernet/broadcom/bcmsysport.c|6 -
>>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c   |6 +
>>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h   |3
>>  drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c   |5 -
>>  drivers/net/ethernet/hisilicon/hns/hns_enet.c |5 -
>>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   62 ++--
>>  drivers/net/ethernet/lantiq_etop.c|   10 -
>>  drivers/net/ethernet/mellanox/mlx4/en_tx.c|7 +
>>  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h  |3
>>  drivers/net/ethernet/mellanox/mlx5/core/en.h  |3
>>  drivers/net/ethernet/mellanox/mlx5/core/en_tx.c   |5 -
>>  drivers/net/ethernet/renesas/ravb_main.c  |3
>>  drivers/net/ethernet/sun/ldmvsw.c |3
>>  drivers/net/ethernet/sun/sunvnet.c|3
>>  drivers/net/ethernet/ti/netcp_core.c  |9 -
>>  drivers/net/hyperv/netvsc_drv.c   |6 -
>>  drivers/net/macvlan.c |   10 -
>>  drivers/net/net_failover.c|7 +
>>  drivers/net/team/team.c   |3
>>  drivers/net/tun.c |3
>>  drivers/net/wireless/marvell/mwifiex/main.c   |3
>>  drivers/net/xen-netback/interface.c   |4 -
>>  drivers/net/xen-netfront.c|3
>>  drivers/staging/netlogic/xlr_net.c|9 -
>>  drivers/staging/rtl8188eu/os_dep/os_intfs.c   |3
>>  drivers/staging/rtl8723bs/os_dep/os_intfs.c   |7 -
>>  include/linux/netdevice.h |   34 -
>>  net/core/dev.c|  156 
>> ++---
>>  net/core/net-sysfs.c  |   36 -
>>  net/mac80211/iface.c  |4 -
>>  net/packet/af_packet.c|7 +
>>  35 files changed, 312 insertions(+), 130 deletions(-)
>>
>> --
>
> This makes sense. I thought you were hoping to get rid of select queue in 
> future.

Re: [Intel-wired-lan] [jkirsher/next-queue PATCH v2 0/7] Add support for L2 Fwd Offload w/o ndo_select_queue

2018-06-12 Thread Alexander Duyck
On Tue, Jun 12, 2018 at 10:56 AM, Florian Fainelli  wrote:
> On 06/12/2018 08:18 AM, Alexander Duyck wrote:
>> This patch series is meant to allow support for the L2 forward offload, aka
>> MACVLAN offload without the need for using ndo_select_queue.
>>
>> The existing solution currently requires that we use ndo_select_queue in
>> the transmit path if we want to associate specific Tx queues with a given
>> MACVLAN interface. In order to get away from this we need to repurpose the
>> tc_to_txq array and XPS pointer for the MACVLAN interface and use those as
>> a means of accessing the queues on the lower device. As a result we cannot
>> offload a device that is configured as multiqueue. However, it doesn't
>> really make sense to configure a macvlan interface as multiqueue anyway,
>> since it doesn't really have a qdisc of its own in the first place.
>
> Interesting, so at some point I had come up with the following for
> mapping queues between the DSA slave network devices and the DSA master
> network device (doing the actual transmission). The DSA master network
> device driver is just a normal network device driver.
>
> The set-up is as follows: 4 external Ethernet switch ports, each with 8
> egress queues, and the DSA master (bcmsysport.c), aka the CPU Ethernet
> controller, which has 32 output queues, so you can do a 1:1 mapping of
> those; that's actually what we want.
> provides 16 output queues, so we can still do 2:1 mapping.
>
> The implementation is done like this:
>
> - DSA slave network devices are always created after the DSA master
> network device so we can leverage that
>
> - a specific notifier is running from the DSA core and tells the DSA
> master about the switch position in the tree (position 0 = directly
> attached), and the switch port number and a pointer to the slave network
> device
>
> - we establish the mapping between the queues within the bcmsysport
> driver as a simple array
>
> - when transmitting, DSA slave network devices set a specific queue/port
> number within the 16-bits that skb->queue_mapping permits
>
> - this gets re-used by bcmsysport.c to extract the correct queue number
> during ndo_select_queue such that the appropriate queue number gets used
> and congestion works end-to-end.
>
> The reason why we do that is because there is some out of band HW that
> monitors the queue depth of the switch port's egress queue and
> back-pressure the Ethernet controller directly when trying to transmit
> to a congested queue.
>
> I had initially considered establishing the mapping using tc and some
> custom "bind" argument of some kind, but ended up doing things the way
> they are, which is more automatic though it leaves less configuration
> to a user. This has a number of caveats though:
>
> - this is made generic within the context of DSA in that nothing is
> switch driver or Ethernet MAC driver specific and the notifier
> represents the contract between these two seemingly independent subsystems
>
> - the queue indicated between DSA slave and master is unfortunately
> switch driver/controller specific (BRCM_TAG_SET_PORT_QUEUE,
> BRCM_TAG_GET_PORT, BRCM_TAG_GET_QUEUE)
>
> What I like about your patchset is the mapping establishment, but as you
> will read from my reply in patch 2, I think the (upper) 1:N (lower)
> mapping might not work for my specific use case.
>
> Anyhow, not intended to be blocking this, as it seems to be going in the
> right direction anyway.

I think I am still not getting why the 1:N would be an issue. At least
the way I have the code implemented here the lower queues all have a
qdisc associated with them, just not the upper device. Generally I am
using the macvlan as a bump in the wire to take care of filtering for
the bridging mode. If I have to hairpin packets and send them back up
on one of the upper interfaces I want to do that in software rather
than hardware, so I try to take care of it there instead of routing it
through the hardware.

>>
>> I am submitting this as an RFC for the netdev mailing list, and officially
>> submitting it for testing to Jeff Kirsher's next-queue in order to validate
>> the ixgbe specific bits.
>>
>> The big changes in this set are:
>>   Allow lower device to update tc_to_txq and XPS map of offloaded MACVLAN
>>   Disable XPS for single queue devices
>>   Replace accel_priv with sb_dev in ndo_select_queue
>>   Add sb_dev parameter to fallback function for ndo_select_queue
>>   Consolidated ndo_select_queue functions that appeared to be duplicates
>
> Interesting, turns out I had a possibly similar use case with DSA with
> the slave network devices need to select a

Re: [Intel-wired-lan] [jkirsher/next-queue PATCH v2 2/7] net: Add support for subordinate device traffic classes

2018-06-12 Thread Alexander Duyck
On Tue, Jun 12, 2018 at 10:49 AM, Florian Fainelli  wrote:
> On 06/12/2018 08:18 AM, Alexander Duyck wrote:
>> This patch is meant to provide the basic tools needed to allow us to create
>> subordinate device traffic classes. The general idea here is to allow
>> subdividing the queues of a device into queue groups accessible through an
>> upper device such as a macvlan.
>>
>> The idea here is to enforce the idea that an upper device has to be a
>> single queue device, ideally with IFF_NO_QUEUE set. With that being the
>> case we can pretty much guarantee that the tc_to_txq mappings and XPS maps
>> for the upper device are unused. As such we could reuse those in order to
>> support subdividing the lower device and distributing those queues between
>> the subordinate devices.
>
> This is not necessarily a valid paradigm to work with. For instance in
> DSA we have IFF_NO_QUEUE devices, but we still expose multiple egress
> queues because that is how an application can choose how it wants to get
> packets transmitted at the switch level. We have a 1:1 representation
> between a queue at the net_device level, and what an egress queue at the
> switch level is, so things like buffer reservation etc. can be configured.

I'm not saying that IFF_NO_QUEUE implies that a device is single
queue, but in this case we enforce that the upper device has to be a
single queue device so that the code in netdev_pick_tx will ignore the
XPS and tc_to_txq mappings for that netdev. I had mentioned
IFF_NO_QUEUE as a suggestion as that allows us to avoid head-of-line
blocking if the lower device starts to apply back-pressure.

> I think you should consider that an upper device might want to have a
> 1:1 mapping to the lower device's queues and make that permissible.
> Thoughts?

I had considered that. However, the issue is that it makes the setup
much more rigid. With this approach I can enable and
disable the offload without needing to stop the upper device to either
create or remove qdiscs. I would much rather keep the upper device
generic and leave it to the lower device to populate the rings and
such.


[jkirsher/next-queue PATCH v2 3/7] ixgbe: Add code to populate and use macvlan tc to Tx queue map

2018-06-12 Thread Alexander Duyck
This patch makes it so that we use the tc_to_txq mapping in the macvlan
device in order to select the Tx queue for outgoing packets.

The idea here is to try and move away from using ixgbe_select_queue and to
come up with a generic way to make this work for devices going forward. By
encoding this information in the netdev this can become something that can
be used generically as a solution for similar setups going forward.

Signed-off-by: Alexander Duyck 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   44 ++---
 1 file changed, 38 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index fc23e36..6e27848 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -5271,6 +5271,8 @@ static void ixgbe_clean_rx_ring(struct ixgbe_ring 
*rx_ring)
 static int ixgbe_fwd_ring_up(struct ixgbe_adapter *adapter,
 struct ixgbe_fwd_adapter *accel)
 {
+   u16 rss_i = adapter->ring_feature[RING_F_RSS].indices;
+   int num_tc = netdev_get_num_tc(adapter->netdev);
struct net_device *vdev = accel->netdev;
int i, baseq, err;
 
@@ -5282,6 +5284,11 @@ static int ixgbe_fwd_ring_up(struct ixgbe_adapter 
*adapter,
accel->rx_base_queue = baseq;
accel->tx_base_queue = baseq;
 
+   /* record configuration for macvlan interface in vdev */
+   for (i = 0; i < num_tc; i++)
+   netdev_bind_sb_channel_queue(adapter->netdev, vdev,
+i, rss_i, baseq + (rss_i * i));
+
for (i = 0; i < adapter->num_rx_queues_per_pool; i++)
adapter->rx_ring[baseq + i]->netdev = vdev;
 
@@ -5306,6 +5313,10 @@ static int ixgbe_fwd_ring_up(struct ixgbe_adapter 
*adapter,
 
netdev_err(vdev, "L2FW offload disabled due to L2 filter error\n");
 
+   /* unbind the queues and drop the subordinate channel config */
+   netdev_unbind_sb_channel(adapter->netdev, vdev);
+   netdev_set_sb_channel(vdev, 0);
+
clear_bit(accel->pool, adapter->fwd_bitmask);
kfree(accel);
 
@@ -8212,18 +8223,22 @@ static u16 ixgbe_select_queue(struct net_device *dev, 
struct sk_buff *skb,
  void *accel_priv, select_queue_fallback_t 
fallback)
 {
struct ixgbe_fwd_adapter *fwd_adapter = accel_priv;
-   struct ixgbe_adapter *adapter;
-   int txq;
 #ifdef IXGBE_FCOE
+   struct ixgbe_adapter *adapter;
struct ixgbe_ring_feature *f;
 #endif
+   int txq;
 
if (fwd_adapter) {
-   adapter = netdev_priv(dev);
-   txq = reciprocal_scale(skb_get_hash(skb),
-  adapter->num_rx_queues_per_pool);
+   u8 tc = netdev_get_num_tc(dev) ?
+   netdev_get_prio_tc_map(dev, skb->priority) : 0;
+   struct net_device *vdev = fwd_adapter->netdev;
+
+   txq = vdev->tc_to_txq[tc].offset;
+   txq += reciprocal_scale(skb_get_hash(skb),
+   vdev->tc_to_txq[tc].count);
 
-   return txq + fwd_adapter->tx_base_queue;
+   return txq;
}
 
 #ifdef IXGBE_FCOE
@@ -8777,6 +8792,11 @@ static int ixgbe_reassign_macvlan_pool(struct net_device 
*vdev, void *data)
/* if we cannot find a free pool then disable the offload */
netdev_err(vdev, "L2FW offload disabled due to lack of queue 
resources\n");
macvlan_release_l2fw_offload(vdev);
+
+   /* unbind the queues and drop the subordinate channel config */
+   netdev_unbind_sb_channel(adapter->netdev, vdev);
+   netdev_set_sb_channel(vdev, 0);
+
kfree(accel);
 
return 0;
@@ -9785,6 +9805,13 @@ static void *ixgbe_fwd_add(struct net_device *pdev, 
struct net_device *vdev)
if (!macvlan_supports_dest_filter(vdev))
return ERR_PTR(-EMEDIUMTYPE);
 
+   /* We need to lock down the macvlan to be a single queue device so that
+* we can reuse the tc_to_txq field in the macvlan netdev to represent
+* the queue mapping to our netdev.
+*/
+   if (netif_is_multiqueue(vdev))
+   return ERR_PTR(-ERANGE);
+
pool = find_first_zero_bit(adapter->fwd_bitmask, adapter->num_rx_pools);
if (pool == adapter->num_rx_pools) {
u16 used_pools = adapter->num_vfs + adapter->num_rx_pools;
@@ -9841,6 +9868,7 @@ static void *ixgbe_fwd_add(struct net_device *pdev, 
struct net_device *vdev)
return ERR_PTR(-ENOMEM);
 
set_bit(pool, adapter->fwd_bitmask);
+   netdev_set_sb_channel(vdev, pool);
accel->pool = pool;
accel->netdev = vdev;
 
@@ -9882,6 +9910,10 @@ static void ixgbe_fwd_del(struct net_device *pdev, void 
*priv)
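For a concrete feel of the mapping (numbers invented for illustration, not
taken from the patch): if a macvlan is bound to pool queues 16..23 on a PF
with num_tc = 2 and rss_i = 4, the bind loop in ixgbe_fwd_ring_up() above
leaves the macvlan's netdev with

	/* hypothetical layout recorded by netdev_bind_sb_channel_queue() */
	vdev->tc_to_txq[0].count = 4;  vdev->tc_to_txq[0].offset = 16;
	vdev->tc_to_txq[1].count = 4;  vdev->tc_to_txq[1].offset = 20;

so the selection logic in ixgbe_select_queue() reduces to

	u8 tc = netdev_get_num_tc(dev) ?
		netdev_get_prio_tc_map(dev, skb->priority) : 0;

	/* a flow hashing into tc 1 lands on one of queues 20..23 */
	txq = vdev->tc_to_txq[tc].offset +
	      reciprocal_scale(skb_get_hash(skb), vdev->tc_to_txq[tc].count);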

[jkirsher/next-queue PATCH v2 5/7] net: Add generic ndo_select_queue functions

2018-06-12 Thread Alexander Duyck
This patch adds a generic version of the ndo_select_queue functions for
either returning 0 or selecting a queue based on the processor ID. This is
generally meant to just reduce the number of functions we have to change
in the future when we have to deal with ndo_select_queue changes.

Signed-off-by: Alexander Duyck 
---
 drivers/net/ethernet/lantiq_etop.c   |   10 +-
 drivers/net/ethernet/ti/netcp_core.c |9 +
 drivers/staging/netlogic/xlr_net.c   |9 +
 include/linux/netdevice.h|4 
 net/core/dev.c   |   14 ++
 net/packet/af_packet.c   |2 +-
 6 files changed, 22 insertions(+), 26 deletions(-)

diff --git a/drivers/net/ethernet/lantiq_etop.c 
b/drivers/net/ethernet/lantiq_etop.c
index afc8100..7a637b5 100644
--- a/drivers/net/ethernet/lantiq_etop.c
+++ b/drivers/net/ethernet/lantiq_etop.c
@@ -563,14 +563,6 @@ struct ltq_etop_priv {
	spin_unlock_irqrestore(&priv->lock, flags);
 }
 
-static u16
-ltq_etop_select_queue(struct net_device *dev, struct sk_buff *skb,
- void *accel_priv, select_queue_fallback_t fallback)
-{
-   /* we are currently only using the first queue */
-   return 0;
-}
-
 static int
 ltq_etop_init(struct net_device *dev)
 {
@@ -641,7 +633,7 @@ struct ltq_etop_priv {
.ndo_set_mac_address = ltq_etop_set_mac_address,
.ndo_validate_addr = eth_validate_addr,
.ndo_set_rx_mode = ltq_etop_set_multicast_list,
-   .ndo_select_queue = ltq_etop_select_queue,
+   .ndo_select_queue = dev_pick_tx_zero,
.ndo_init = ltq_etop_init,
.ndo_tx_timeout = ltq_etop_tx_timeout,
 };
diff --git a/drivers/net/ethernet/ti/netcp_core.c 
b/drivers/net/ethernet/ti/netcp_core.c
index e40aa3e..2c455bd 100644
--- a/drivers/net/ethernet/ti/netcp_core.c
+++ b/drivers/net/ethernet/ti/netcp_core.c
@@ -1889,13 +1889,6 @@ static int netcp_rx_kill_vid(struct net_device *ndev, 
__be16 proto, u16 vid)
return err;
 }
 
-static u16 netcp_select_queue(struct net_device *dev, struct sk_buff *skb,
- void *accel_priv,
- select_queue_fallback_t fallback)
-{
-   return 0;
-}
-
 static int netcp_setup_tc(struct net_device *dev, enum tc_setup_type type,
  void *type_data)
 {
@@ -1972,7 +1965,7 @@ static int netcp_setup_tc(struct net_device *dev, enum 
tc_setup_type type,
.ndo_vlan_rx_add_vid= netcp_rx_add_vid,
.ndo_vlan_rx_kill_vid   = netcp_rx_kill_vid,
.ndo_tx_timeout = netcp_ndo_tx_timeout,
-   .ndo_select_queue   = netcp_select_queue,
+   .ndo_select_queue   = dev_pick_tx_zero,
.ndo_setup_tc   = netcp_setup_tc,
 };
 
diff --git a/drivers/staging/netlogic/xlr_net.c 
b/drivers/staging/netlogic/xlr_net.c
index e461168..4e6611e 100644
--- a/drivers/staging/netlogic/xlr_net.c
+++ b/drivers/staging/netlogic/xlr_net.c
@@ -290,13 +290,6 @@ static netdev_tx_t xlr_net_start_xmit(struct sk_buff *skb,
return NETDEV_TX_OK;
 }
 
-static u16 xlr_net_select_queue(struct net_device *ndev, struct sk_buff *skb,
-   void *accel_priv,
-   select_queue_fallback_t fallback)
-{
-   return (u16)smp_processor_id();
-}
-
 static void xlr_hw_set_mac_addr(struct net_device *ndev)
 {
struct xlr_net_priv *priv = netdev_priv(ndev);
@@ -403,7 +396,7 @@ static void xlr_stats(struct net_device *ndev, struct 
rtnl_link_stats64 *stats)
.ndo_open = xlr_net_open,
.ndo_stop = xlr_net_stop,
.ndo_start_xmit = xlr_net_start_xmit,
-   .ndo_select_queue = xlr_net_select_queue,
+   .ndo_select_queue = dev_pick_tx_cpu_id,
.ndo_set_mac_address = xlr_net_set_mac_addr,
.ndo_set_rx_mode = xlr_set_rx_mode,
.ndo_get_stats64 = xlr_stats,
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 91b3ca9..70f7ee3 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2551,6 +2551,10 @@ struct net_device *__dev_get_by_flags(struct net *net, 
unsigned short flags,
 void dev_close_many(struct list_head *head, bool unlink);
 void dev_disable_lro(struct net_device *dev);
 int dev_loopback_xmit(struct net *net, struct sock *sk, struct sk_buff 
*newskb);
+u16 dev_pick_tx_zero(struct net_device *dev, struct sk_buff *skb,
+void *accel_priv, select_queue_fallback_t fallback);
+u16 dev_pick_tx_cpu_id(struct net_device *dev, struct sk_buff *skb,
+  void *accel_priv, select_queue_fallback_t fallback);
 int dev_queue_xmit(struct sk_buff *skb);
 int dev_queue_xmit_accel(struct sk_buff *skb, struct net_device *sb_dev);
 int dev_direct_xmit(struct sk_buff *skb, u16 queue_id);
diff --git a/net/core/dev.c b/net/core/dev.c
index 2249294..1a1cf2c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3505,6 +3505,20 @@ static inline int get_xps_queue(struct net_device *dev,
#endif
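
The archive truncates the rest of the net/core/dev.c hunk here. Based on
the declarations added to netdevice.h above, the two generic helpers
plausibly read as follows (a sketch; dev_pick_tx_zero matches the v1
posting below, while the dev_pick_tx_cpu_id body is inferred from what
later landed upstream, not quoted from this patch):

	u16 dev_pick_tx_zero(struct net_device *dev, struct sk_buff *skb,
			     void *accel_priv, select_queue_fallback_t fallback)
	{
		/* single queue devices: everything goes to queue 0 */
		return 0;
	}
	EXPORT_SYMBOL(dev_pick_tx_zero);

	u16 dev_pick_tx_cpu_id(struct net_device *dev, struct sk_buff *skb,
			       void *accel_priv, select_queue_fallback_t fallback)
	{
		/* spread flows by submitting CPU, wrapped to the queue count */
		return (u16)raw_smp_processor_id() % dev->real_num_tx_queues;
	}
	EXPORT_SYMBOL(dev_pick_tx_cpu_id);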

[jkirsher/next-queue PATCH v2 6/7] net: allow ndo_select_queue to pass netdev

2018-06-12 Thread Alexander Duyck
This patch makes it so that instead of passing a void pointer as the
accel_priv we instead pass a net_device pointer as sb_dev. Making this
change allows us to pass the subordinate device through to the fallback
function eventually so that we can keep the actual code in the
ndo_select_queue call as focused as possible on the exception cases.

Signed-off-by: Alexander Duyck 
---
 drivers/infiniband/hw/hfi1/vnic_main.c|2 +-
 drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c |4 ++--
 drivers/net/bonding/bond_main.c   |3 ++-
 drivers/net/ethernet/amazon/ena/ena_netdev.c  |3 ++-
 drivers/net/ethernet/broadcom/bcmsysport.c|2 +-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c   |3 ++-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h   |3 ++-
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c   |3 ++-
 drivers/net/ethernet/hisilicon/hns/hns_enet.c |3 ++-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |7 ---
 drivers/net/ethernet/mellanox/mlx4/en_tx.c|3 ++-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h  |3 ++-
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |3 ++-
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c   |3 ++-
 drivers/net/ethernet/renesas/ravb_main.c  |3 ++-
 drivers/net/ethernet/sun/ldmvsw.c |3 ++-
 drivers/net/ethernet/sun/sunvnet.c|3 ++-
 drivers/net/hyperv/netvsc_drv.c   |4 ++--
 drivers/net/net_failover.c|5 +++--
 drivers/net/team/team.c   |3 ++-
 drivers/net/tun.c |3 ++-
 drivers/net/wireless/marvell/mwifiex/main.c   |3 ++-
 drivers/net/xen-netback/interface.c   |2 +-
 drivers/net/xen-netfront.c|3 ++-
 drivers/staging/rtl8188eu/os_dep/os_intfs.c   |3 ++-
 drivers/staging/rtl8723bs/os_dep/os_intfs.c   |7 +++
 include/linux/netdevice.h |   11 +++
 net/core/dev.c|6 --
 net/mac80211/iface.c  |4 ++--
 29 files changed, 66 insertions(+), 42 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/vnic_main.c 
b/drivers/infiniband/hw/hfi1/vnic_main.c
index 5d65582..616fc9b 100644
--- a/drivers/infiniband/hw/hfi1/vnic_main.c
+++ b/drivers/infiniband/hw/hfi1/vnic_main.c
@@ -423,7 +423,7 @@ static netdev_tx_t hfi1_netdev_start_xmit(struct sk_buff 
*skb,
 
 static u16 hfi1_vnic_select_queue(struct net_device *netdev,
  struct sk_buff *skb,
- void *accel_priv,
+ struct net_device *sb_dev,
  select_queue_fallback_t fallback)
 {
struct hfi1_vnic_vport_info *vinfo = opa_vnic_dev_priv(netdev);
diff --git a/drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c 
b/drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c
index 0c8aec6..6155878 100644
--- a/drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c
+++ b/drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c
@@ -95,7 +95,7 @@ static netdev_tx_t opa_netdev_start_xmit(struct sk_buff *skb,
 }
 
 static u16 opa_vnic_select_queue(struct net_device *netdev, struct sk_buff 
*skb,
-void *accel_priv,
+struct net_device *sb_dev,
 select_queue_fallback_t fallback)
 {
struct opa_vnic_adapter *adapter = opa_vnic_priv(netdev);
@@ -107,7 +107,7 @@ static u16 opa_vnic_select_queue(struct net_device *netdev, 
struct sk_buff *skb,
mdata->entropy = opa_vnic_calc_entropy(skb);
mdata->vl = opa_vnic_get_vl(adapter, skb);
rc = adapter->rn_ops->ndo_select_queue(netdev, skb,
-  accel_priv, fallback);
+  sb_dev, fallback);
skb_pull(skb, sizeof(*mdata));
return rc;
 }
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index bd53a71..e33f689 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -4094,7 +4094,8 @@ static inline int bond_slave_override(struct bonding 
*bond,
 
 
 static u16 bond_select_queue(struct net_device *dev, struct sk_buff *skb,
-void *accel_priv, select_queue_fallback_t fallback)
+struct net_device *sb_dev,
+select_queue_fallback_t fallback)
 {
/* This helper function exists to help dev_pick_tx get the correct
 * destination queue.  Using a helper function skips a call to
diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c 
b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index f2af87d..e3befb1 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/
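
The remaining hunks of this patch are cut off by the archive but follow
the same mechanical pattern as the drivers shown above. The net effect on
the ndo prototype itself, as it reads in netdevice.h after this patch, is:

	u16 (*ndo_select_queue)(struct net_device *dev, struct sk_buff *skb,
				struct net_device *sb_dev,
				select_queue_fallback_t fallback);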

[jkirsher/next-queue PATCH v2 7/7] net: allow fallback function to pass netdev

2018-06-12 Thread Alexander Duyck
For most of these calls we can just pass NULL through to the fallback
function as the sb_dev. The only cases where we cannot are the cases where
we might be dealing with either an upper device or a driver that would
have configured things to support an sb_dev itself.

The only driver that has any significant change in this patchset should be
ixgbe as we can drop the redundant functionality that existed in both the
ndo_select_queue function and the fallback function that was passed through
to us.

Signed-off-by: Alexander Duyck 
---
 drivers/net/ethernet/amazon/ena/ena_netdev.c|2 +-
 drivers/net/ethernet/broadcom/bcmsysport.c  |4 ++--
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c |3 ++-
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c |2 +-
 drivers/net/ethernet/hisilicon/hns/hns_enet.c   |2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c   |4 ++--
 drivers/net/ethernet/mellanox/mlx4/en_tx.c  |4 ++--
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c |2 +-
 drivers/net/hyperv/netvsc_drv.c |2 +-
 drivers/net/net_failover.c  |2 +-
 drivers/net/xen-netback/interface.c |2 +-
 include/linux/netdevice.h   |3 ++-
 net/core/dev.c  |   12 +++-
 net/packet/af_packet.c  |7 ---
 14 files changed, 24 insertions(+), 27 deletions(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c 
b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index e3befb1..c673ac2 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -2224,7 +2224,7 @@ static u16 ena_select_queue(struct net_device *dev, 
struct sk_buff *skb,
if (skb_rx_queue_recorded(skb))
qid = skb_get_rx_queue(skb);
else
-   qid = fallback(dev, skb);
+   qid = fallback(dev, skb, NULL);
 
return qid;
 }
diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c 
b/drivers/net/ethernet/broadcom/bcmsysport.c
index 32f548e..eb890c4 100644
--- a/drivers/net/ethernet/broadcom/bcmsysport.c
+++ b/drivers/net/ethernet/broadcom/bcmsysport.c
@@ -2116,7 +2116,7 @@ static u16 bcm_sysport_select_queue(struct net_device 
*dev, struct sk_buff *skb,
unsigned int q, port;
 
if (!netdev_uses_dsa(dev))
-   return fallback(dev, skb);
+   return fallback(dev, skb, NULL);
 
/* DSA tagging layer will have configured the correct queue */
q = BRCM_TAG_GET_QUEUE(queue);
@@ -2124,7 +2124,7 @@ static u16 bcm_sysport_select_queue(struct net_device 
*dev, struct sk_buff *skb,
tx_ring = priv->ring_map[q + port * priv->per_port_num_tx_queues];
 
if (unlikely(!tx_ring))
-   return fallback(dev, skb);
+   return fallback(dev, skb, NULL);
 
return tx_ring->index;
 }
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index 969dcc9..7a1b99f 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -1928,7 +1928,8 @@ u16 bnx2x_select_queue(struct net_device *dev, struct 
sk_buff *skb,
}
 
/* select a non-FCoE queue */
-   return fallback(dev, skb) % (BNX2X_NUM_ETH_QUEUES(bp) * bp->max_cos);
+   return fallback(dev, skb, NULL) %
+  (BNX2X_NUM_ETH_QUEUES(bp) * bp->max_cos);
 }
 
 void bnx2x_set_num_queues(struct bnx2x *bp)
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 8de3039..380931d 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -972,7 +972,7 @@ static u16 cxgb_select_queue(struct net_device *dev, struct 
sk_buff *skb,
return txq;
}
 
-   return fallback(dev, skb) % dev->real_num_tx_queues;
+   return fallback(dev, skb, NULL) % dev->real_num_tx_queues;
 }
 
 static int closest_timer(const struct sge *s, int time)
diff --git a/drivers/net/ethernet/hisilicon/hns/hns_enet.c 
b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
index c36a231..8327254 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
@@ -2033,7 +2033,7 @@ static void hns_nic_get_stats64(struct net_device *ndev,
is_multicast_ether_addr(eth_hdr->h_dest))
return 0;
else
-   return fallback(ndev, skb);
+   return fallback(ndev, skb, NULL);
 }
 
 static const struct net_device_ops hns_nic_netdev_ops = {
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 5d9867e..eef64d0 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -8248,11 +8248,11 @@ stati
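
The rest of the diff is truncated by the archive. The key type change it
carries is that the fallback gains the same sb_dev argument, so callers
with no subordinate device simply pass NULL, as the hunks above show; the
resulting typedef is roughly:

	typedef u16 (*select_queue_fallback_t)(struct net_device *dev,
					       struct sk_buff *skb,
					       struct net_device *sb_dev);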

[jkirsher/next-queue PATCH v2 4/7] net: Add support for subordinate traffic classes to netdev_pick_tx

2018-06-12 Thread Alexander Duyck
This change adds support for the concept of subordinate device traffic
classes to the core networking code. In doing this we can start pulling
out the driver specific bits needed to support selecting a queue based on
an upper device.

The solution as it currently stands is only partially implemented. I have
the start of some XPS bits in here, but I would still need to allow for
configuration of the XPS maps on the queues reserved for the subordinate
devices. For now I am using the reference to the sb_dev XPS map just as a
way to skip the lookup of the lower device XPS map, as that would result
in the wrong queue being picked.

Signed-off-by: Alexander Duyck 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   19 +++-
 drivers/net/macvlan.c |   10 +---
 include/linux/netdevice.h |4 +-
 net/core/dev.c|   57 +++--
 4 files changed, 45 insertions(+), 45 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 6e27848..053a54c 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -8219,20 +8219,17 @@ static void ixgbe_atr(struct ixgbe_ring *ring,
  input, common, ring->queue_index);
 }
 
+#ifdef IXGBE_FCOE
 static u16 ixgbe_select_queue(struct net_device *dev, struct sk_buff *skb,
  void *accel_priv, select_queue_fallback_t 
fallback)
 {
-   struct ixgbe_fwd_adapter *fwd_adapter = accel_priv;
-#ifdef IXGBE_FCOE
struct ixgbe_adapter *adapter;
struct ixgbe_ring_feature *f;
-#endif
int txq;
 
-   if (fwd_adapter) {
-   u8 tc = netdev_get_num_tc(dev) ?
-   netdev_get_prio_tc_map(dev, skb->priority) : 0;
-   struct net_device *vdev = fwd_adapter->netdev;
+   if (accel_priv) {
+   u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
+   struct net_device *vdev = accel_priv;
 
txq = vdev->tc_to_txq[tc].offset;
txq += reciprocal_scale(skb_get_hash(skb),
@@ -8241,8 +8238,6 @@ static u16 ixgbe_select_queue(struct net_device *dev, 
struct sk_buff *skb,
return txq;
}
 
-#ifdef IXGBE_FCOE
-
/*
 * only execute the code below if protocol is FCoE
 * or FIP and we have FCoE enabled on the adapter
@@ -8268,11 +8263,9 @@ static u16 ixgbe_select_queue(struct net_device *dev, 
struct sk_buff *skb,
txq -= f->indices;
 
return txq + f->offset;
-#else
-   return fallback(dev, skb);
-#endif
 }
 
+#endif
 static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
   struct xdp_frame *xdpf)
 {
@@ -10076,7 +10069,6 @@ static int ixgbe_xdp_xmit(struct net_device *dev, int n,
.ndo_open   = ixgbe_open,
.ndo_stop   = ixgbe_close,
.ndo_start_xmit = ixgbe_xmit_frame,
-   .ndo_select_queue   = ixgbe_select_queue,
.ndo_set_rx_mode= ixgbe_set_rx_mode,
.ndo_validate_addr  = eth_validate_addr,
.ndo_set_mac_address= ixgbe_set_mac,
@@ -10099,6 +10091,7 @@ static int ixgbe_xdp_xmit(struct net_device *dev, int n,
.ndo_poll_controller= ixgbe_netpoll,
 #endif
 #ifdef IXGBE_FCOE
+   .ndo_select_queue   = ixgbe_select_queue,
.ndo_fcoe_ddp_setup = ixgbe_fcoe_ddp_get,
.ndo_fcoe_ddp_target = ixgbe_fcoe_ddp_target,
.ndo_fcoe_ddp_done = ixgbe_fcoe_ddp_put,
diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index adde8fc..401e1d1 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -514,7 +514,6 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct 
net_device *dev)
const struct macvlan_dev *vlan = netdev_priv(dev);
const struct macvlan_port *port = vlan->port;
const struct macvlan_dev *dest;
-   void *accel_priv = NULL;
 
if (vlan->mode == MACVLAN_MODE_BRIDGE) {
const struct ethhdr *eth = (void *)skb->data;
@@ -533,15 +532,10 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct 
net_device *dev)
return NET_XMIT_SUCCESS;
}
}
-
-   /* For packets that are non-multicast and not bridged we will pass
-* the necessary information so that the lowerdev can distinguish
-* the source of the packets via the accel_priv value.
-*/
-   accel_priv = vlan->accel_priv;
 xmit_world:
skb->dev = vlan->lowerdev;
-   return dev_queue_xmit_accel(skb, accel_priv);
+   return dev_queue_xmit_accel(skb,
+   netdev_get_sb_channel(dev) ? dev : NULL);
 }
 
 static inline netdev_tx_t macvlan_netpoll_send_skb(struct macvlan_dev 
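
The net/core/dev.c hunks of this patch are truncated above. The heart of
the change can be sketched as follows (inferred from the commit message
and from what later landed upstream, not quoted verbatim from the patch):
callers substitute the device itself when no sb_dev is given, and the hash
helper indexes the subordinate device's tc_to_txq map while scaling within
the lower device's queues.

	u16 skb_tx_hash(const struct net_device *dev,
			const struct net_device *sb_dev,
			struct sk_buff *skb)
	{
		u32 hash;
		u16 qoffset = 0;
		u16 qcount = dev->real_num_tx_queues;

		if (dev->num_tc) {
			u8 tc = netdev_get_prio_tc_map(dev, skb->priority);

			/* the sb_dev map (populated by the lower driver)
			 * picks the queue group within the lower device
			 */
			qoffset = sb_dev->tc_to_txq[tc].offset;
			qcount = sb_dev->tc_to_txq[tc].count;
		}

		if (skb_rx_queue_recorded(skb)) {
			hash = skb_get_rx_queue(skb);
			while (unlikely(hash >= qcount))
				hash -= qcount;
			return hash + qoffset;
		}

		return (u16)reciprocal_scale(skb_get_hash(skb), qcount) + qoffset;
	}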

[jkirsher/next-queue PATCH v2 1/7] net-sysfs: Drop support for XPS and traffic_class on single queue device

2018-06-12 Thread Alexander Duyck
This patch makes it so that we do not report the traffic class or allow XPS
configuration on single queue devices. This is mostly to avoid unnecessary
complexity with changes I have planned that will allow us to reuse
the unused tc_to_txq and XPS configuration on a single queue device to
allow it to make use of a subset of queues on an underlying device.

Signed-off-by: Alexander Duyck 
---
 net/core/net-sysfs.c |   15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index bb7e80f..335c6a4 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1047,9 +1047,14 @@ static ssize_t traffic_class_show(struct netdev_queue 
*queue,
  char *buf)
 {
struct net_device *dev = queue->dev;
-   int index = get_netdev_queue_index(queue);
-   int tc = netdev_txq_to_tc(dev, index);
+   int index;
+   int tc;
 
+   if (!netif_is_multiqueue(dev))
+   return -ENOENT;
+
+   index = get_netdev_queue_index(queue);
+   tc = netdev_txq_to_tc(dev, index);
if (tc < 0)
return -EINVAL;
 
@@ -1214,6 +1219,9 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
cpumask_var_t mask;
unsigned long index;
 
+   if (!netif_is_multiqueue(dev))
+   return -ENOENT;
+
index = get_netdev_queue_index(queue);
 
if (dev->num_tc) {
@@ -1260,6 +1268,9 @@ static ssize_t xps_cpus_store(struct netdev_queue *queue,
cpumask_var_t mask;
int err;
 
+   if (!netif_is_multiqueue(dev))
+   return -ENOENT;
+
if (!capable(CAP_NET_ADMIN))
return -EPERM;
 

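For context, the predicate used to gate the sysfs attributes above is the
existing helper in netdevice.h:

	static inline bool netif_is_multiqueue(const struct net_device *dev)
	{
		return dev->num_tx_queues > 1;
	}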


[jkirsher/next-queue PATCH v2 2/7] net: Add support for subordinate device traffic classes

2018-06-12 Thread Alexander Duyck
This patch is meant to provide the basic tools needed to allow us to create
subordinate device traffic classes. The general idea here is to allow
subdividing the queues of a device into queue groups accessible through an
upper device such as a macvlan.

The idea here is to enforce the idea that an upper device has to be a
single queue device, ideally with IFF_NO_QUEUE set. With that being the
case we can pretty much guarantee that the tc_to_txq mappings and XPS maps
for the upper device are unused. As such we could reuse those in order to
support subdividing the lower device and distributing those queues between
the subordinate devices.

In order to distinguish between a regular set of traffic classes and if a
device is carrying subordinate traffic classes I changed num_tc from a u8
to an s16 value and use the negative values to represent the subordinate
pool values. So starting at -1 and running to -32768 we can encode those as
pool values, and the existing values of 0 to 15 can be maintained.

Signed-off-by: Alexander Duyck 
---
 include/linux/netdevice.h |   16 
 net/core/dev.c|   89 +
 net/core/net-sysfs.c  |   21 ++-
 3 files changed, 124 insertions(+), 2 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3ec9850..41b4660 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -569,6 +569,9 @@ struct netdev_queue {
 * (/sys/class/net/DEV/Q/trans_timeout)
 */
unsigned long   trans_timeout;
+
+   /* Subordinate device that the queue has been assigned to */
+   struct net_device   *sb_dev;
 /*
  * write-mostly part
  */
@@ -1978,7 +1981,7 @@ struct net_device {
 #ifdef CONFIG_DCB
const struct dcbnl_rtnl_ops *dcbnl_ops;
 #endif
-   u8  num_tc;
+   s16 num_tc;
struct netdev_tc_txqtc_to_txq[TC_MAX_QUEUE];
u8  prio_tc_map[TC_BITMASK + 1];
 
@@ -2032,6 +2035,17 @@ int netdev_get_num_tc(struct net_device *dev)
return dev->num_tc;
 }
 
+void netdev_unbind_sb_channel(struct net_device *dev,
+ struct net_device *sb_dev);
+int netdev_bind_sb_channel_queue(struct net_device *dev,
+struct net_device *sb_dev,
+u8 tc, u16 count, u16 offset);
+int netdev_set_sb_channel(struct net_device *dev, u16 channel);
+static inline int netdev_get_sb_channel(struct net_device *dev)
+{
+   return max_t(int, -dev->num_tc, 0);
+}
+
 static inline
 struct netdev_queue *netdev_get_tx_queue(const struct net_device *dev,
 unsigned int index)
diff --git a/net/core/dev.c b/net/core/dev.c
index 6e18242..27fe4f2 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2068,11 +2068,13 @@ int netdev_txq_to_tc(struct net_device *dev, unsigned 
int txq)
	struct netdev_tc_txq *tc = &dev->tc_to_txq[0];
int i;
 
+   /* walk through the TCs and see if it falls into any of them */
for (i = 0; i < TC_MAX_QUEUE; i++, tc++) {
if ((txq - tc->offset) < tc->count)
return i;
}
 
+   /* didn't find it, just return -1 to indicate no match */
return -1;
}
 
@@ -2215,7 +2217,14 @@ int netif_set_xps_queue(struct net_device *dev, const 
struct cpumask *mask,
bool active = false;
 
if (dev->num_tc) {
+   /* Do not allow XPS on subordinate device directly */
num_tc = dev->num_tc;
+   if (num_tc < 0)
+   return -EINVAL;
+
+   /* If queue belongs to subordinate dev use its map */
+   dev = netdev_get_tx_queue(dev, index)->sb_dev ? : dev;
+
tc = netdev_txq_to_tc(dev, index);
if (tc < 0)
return -EINVAL;
@@ -2366,11 +2375,25 @@ int netif_set_xps_queue(struct net_device *dev, const 
struct cpumask *mask,
 EXPORT_SYMBOL(netif_set_xps_queue);
 
 #endif
+static void netdev_unbind_all_sb_channels(struct net_device *dev)
+{
+   struct netdev_queue *txq = &dev->_tx[dev->num_tx_queues];
+
+   /* Unbind any subordinate channels */
+   while (txq-- != &dev->_tx[0]) {
+   if (txq->sb_dev)
+   netdev_unbind_sb_channel(dev, txq->sb_dev);
+   }
+}
+
 void netdev_reset_tc(struct net_device *dev)
 {
 #ifdef CONFIG_XPS
netif_reset_xps_queues_gt(dev, 0);
 #endif
+   netdev_unbind_all_sb_channels(dev);
+
+   /* Reset TC configuration of device */
dev->num_tc = 0;
memset(dev->tc_to_txq, 0, sizeof(dev->tc_to_txq));
memset(dev->prio_tc_map, 0, sizeof(dev->prio_tc_map));
@@ -2399,11 +2422,77 @@ int netde
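
The archive truncates the remaining hunks here. From the declarations
above and the upstream result, the two new helpers can be sketched roughly
as follows (not the verbatim patch):

	int netdev_set_sb_channel(struct net_device *dev, u16 channel)
	{
		/* do not use a multiqueue device as a subordinate channel */
		if (netif_is_multiqueue(dev))
			return -ENODEV;

		/* channels 1..32767 are subordinate pools, encoded as
		 * negative num_tc per the commit message; 0 resets the
		 * device back to "native" mode
		 */
		if (channel > S16_MAX)
			return -EINVAL;

		dev->num_tc = -channel;

		return 0;
	}
	EXPORT_SYMBOL(netdev_set_sb_channel);

	int netdev_bind_sb_channel_queue(struct net_device *dev,
					 struct net_device *sb_dev,
					 u8 tc, u16 count, u16 offset)
	{
		/* sb_dev must already be a channel, and tc must exist on dev */
		if (sb_dev->num_tc >= 0 || tc >= dev->num_tc)
			return -EINVAL;

		/* we cannot hand out queues we don't have */
		if ((offset + count) > dev->real_num_tx_queues)
			return -EINVAL;

		/* record the mapping in the subordinate device's tc_to_txq */
		sb_dev->tc_to_txq[tc].count = count;
		sb_dev->tc_to_txq[tc].offset = offset;

		/* let each Tx queue find the tc_to_txq/XPS map that owns it */
		while (count--)
			netdev_get_tx_queue(dev, count + offset)->sb_dev = sb_dev;

		return 0;
	}
	EXPORT_SYMBOL(netdev_bind_sb_channel_queue);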

[jkirsher/next-queue PATCH v2 0/7] Add support for L2 Fwd Offload w/o ndo_select_queue

2018-06-12 Thread Alexander Duyck
This patch series is meant to allow support for the L2 forward offload, aka
MACVLAN offload without the need for using ndo_select_queue.

The existing solution currently requires that we use ndo_select_queue in
the transmit path if we want to associate specific Tx queues with a given
MACVLAN interface. In order to get away from this we need to repurpose the
tc_to_txq array and XPS pointer for the MACVLAN interface and use those as
a means of accessing the queues on the lower device. As a result we cannot
offload a device that is configured as multiqueue, however it doesn't
really make sense to configure a macvlan interface as multiqueue
anyway since it doesn't really have a qdisc of its own in the first place.

I am submitting this as an RFC for the netdev mailing list, and officially
submitting it for testing to Jeff Kirsher's next-queue in order to validate
the ixgbe specific bits.

The big changes in this set are:
  Allow lower device to update tc_to_txq and XPS map of offloaded MACVLAN
  Disable XPS for single queue devices
  Replace accel_priv with sb_dev in ndo_select_queue
  Add sb_dev parameter to fallback function for ndo_select_queue
  Consolidated ndo_select_queue functions that appeared to be duplicates

v2: Implement generic "select_queue" functions instead of "fallback" functions.
Tweak last two patches to account for changes in dev_pick_tx_xxx functions.

---

Alexander Duyck (7):
  net-sysfs: Drop support for XPS and traffic_class on single queue device
  net: Add support for subordinate device traffic classes
  ixgbe: Add code to populate and use macvlan tc to Tx queue map
  net: Add support for subordinate traffic classes to netdev_pick_tx
  net: Add generic ndo_select_queue functions
  net: allow ndo_select_queue to pass netdev
  net: allow fallback function to pass netdev


 drivers/infiniband/hw/hfi1/vnic_main.c|2 
 drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c |4 -
 drivers/net/bonding/bond_main.c   |3 
 drivers/net/ethernet/amazon/ena/ena_netdev.c  |5 -
 drivers/net/ethernet/broadcom/bcmsysport.c|6 -
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c   |6 +
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h   |3 
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c   |5 -
 drivers/net/ethernet/hisilicon/hns/hns_enet.c |5 -
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   62 ++--
 drivers/net/ethernet/lantiq_etop.c|   10 -
 drivers/net/ethernet/mellanox/mlx4/en_tx.c|7 +
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h  |3 
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |3 
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c   |5 -
 drivers/net/ethernet/renesas/ravb_main.c  |3 
 drivers/net/ethernet/sun/ldmvsw.c |3 
 drivers/net/ethernet/sun/sunvnet.c|3 
 drivers/net/ethernet/ti/netcp_core.c  |9 -
 drivers/net/hyperv/netvsc_drv.c   |6 -
 drivers/net/macvlan.c |   10 -
 drivers/net/net_failover.c|7 +
 drivers/net/team/team.c   |3 
 drivers/net/tun.c |3 
 drivers/net/wireless/marvell/mwifiex/main.c   |3 
 drivers/net/xen-netback/interface.c   |4 -
 drivers/net/xen-netfront.c|3 
 drivers/staging/netlogic/xlr_net.c|9 -
 drivers/staging/rtl8188eu/os_dep/os_intfs.c   |3 
 drivers/staging/rtl8723bs/os_dep/os_intfs.c   |7 -
 include/linux/netdevice.h |   34 -
 net/core/dev.c|  156 ++---
 net/core/net-sysfs.c  |   36 -
 net/mac80211/iface.c  |4 -
 net/packet/af_packet.c|7 +
 35 files changed, 312 insertions(+), 130 deletions(-)

--


Re: [Intel-wired-lan] [jkirsher/next-queue PATCH 5/7] net: Add generic ndo_select_queue functions

2018-06-11 Thread Alexander Duyck
Jeff,

Looks like I have some bugs on non-x86 architecture. I need to address
these in a v2 of this set so please hold off on applying until I can
get that submitted either tonight or tomorrow morning.

Thanks.

- Alex

On Mon, Jun 11, 2018 at 4:10 PM, kbuild test robot  wrote:
> Hi Alexander,
>
> Thank you for the patch! Yet something to improve:
>
> [auto build test ERROR on net-next/master]
> [also build test ERROR on next-20180608]
> [cannot apply to v4.17]
> [if your patch is applied to the wrong git tree, please drop us a note to 
> help improve the system]
>
> url:
> https://github.com/0day-ci/linux/commits/Alexander-Duyck/Add-support-for-L2-Fwd-Offload-w-o-ndo_select_queue/20180612-015220
> config: arm-allmodconfig (attached as .config)
> compiler: arm-linux-gnueabi-gcc (Debian 7.2.0-11) 7.2.0
> reproduce:
> wget 
> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
> ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> GCC_VERSION=7.2.0 make.cross ARCH=arm
>
> All errors (new ones prefixed by >>):
>
>>> drivers/net//ethernet/ti/netcp_core.c:1968:22: error: initialization from 
>>> incompatible pointer type [-Werror=incompatible-pointer-types]
>  .ndo_select_queue = dev_pick_tx_zero,
>  ^~~~
>drivers/net//ethernet/ti/netcp_core.c:1968:22: note: (near initialization 
> for 'netcp_netdev_ops.ndo_select_queue')
>cc1: some warnings being treated as errors
>
> vim +1968 drivers/net//ethernet/ti/netcp_core.c
>
>   1955
>   1956  static const struct net_device_ops netcp_netdev_ops = {
>   1957  .ndo_open   = netcp_ndo_open,
>   1958  .ndo_stop   = netcp_ndo_stop,
>   1959  .ndo_start_xmit = netcp_ndo_start_xmit,
>   1960  .ndo_set_rx_mode= netcp_set_rx_mode,
>   1961  .ndo_do_ioctl   = netcp_ndo_ioctl,
>   1962  .ndo_get_stats64= netcp_get_stats,
>   1963  .ndo_set_mac_address= eth_mac_addr,
>   1964  .ndo_validate_addr  = eth_validate_addr,
>   1965  .ndo_vlan_rx_add_vid= netcp_rx_add_vid,
>   1966  .ndo_vlan_rx_kill_vid   = netcp_rx_kill_vid,
>   1967  .ndo_tx_timeout = netcp_ndo_tx_timeout,
>> 1968  .ndo_select_queue   = dev_pick_tx_zero,
>   1969  .ndo_setup_tc   = netcp_setup_tc,
>   1970  };
>   1971
>
> ---
> 0-DAY kernel test infrastructure            Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all   Intel Corporation
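
The incompatible-pointer error appears to be a signature mismatch rather
than anything architecture specific (it surfaced on arm presumably because
that is where the affected drivers are built): the v1 helpers were
declared with only two arguments, while the .ndo_select_queue member
expects four. In outline:

	/* v1 helper, as declared in the posting below: */
	u16 dev_pick_tx_zero(struct net_device *dev, struct sk_buff *skb);

	/* but the ndo slot it is assigned to expects: */
	u16 (*ndo_select_queue)(struct net_device *dev, struct sk_buff *skb,
				void *accel_priv,
				select_queue_fallback_t fallback);

v2 of the series resolves this by giving the generic helpers the full
ndo_select_queue signature.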


[jkirsher/next-queue PATCH 2/7] net: Add support for subordinate device traffic classes

2018-06-11 Thread Alexander Duyck
This patch is meant to provide the basic tools needed to allow us to create
subordinate device traffic classes. The general idea here is to allow
subdividing the queues of a device into queue groups accessible through an
upper device such as a macvlan.

The idea here is to enforce the idea that an upper device has to be a
single queue device, ideally with IFF_NO_QUEUE set. With that being the
case we can pretty much guarantee that the tc_to_txq mappings and XPS maps
for the upper device are unused. As such we could reuse those in order to
support subdividing the lower device and distributing those queues between
the subordinate devices.

In order to distinguish between a regular set of traffic classes and if a
device is carrying subordinate traffic classes I changed num_tc from a u8
to an s16 value and use the negative values to represent the subordinate
pool values. So starting at -1 and running to -32768 we can encode those as
pool values, and the existing values of 0 to 15 can be maintained.

Signed-off-by: Alexander Duyck 
---
 include/linux/netdevice.h |   16 
 net/core/dev.c|   89 +
 net/core/net-sysfs.c  |   21 ++-
 3 files changed, 124 insertions(+), 2 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3ec9850..41b4660 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -569,6 +569,9 @@ struct netdev_queue {
 * (/sys/class/net/DEV/Q/trans_timeout)
 */
unsigned long   trans_timeout;
+
+   /* Subordinate device that the queue has been assigned to */
+   struct net_device   *sb_dev;
 /*
  * write-mostly part
  */
@@ -1978,7 +1981,7 @@ struct net_device {
 #ifdef CONFIG_DCB
const struct dcbnl_rtnl_ops *dcbnl_ops;
 #endif
-   u8  num_tc;
+   s16 num_tc;
struct netdev_tc_txqtc_to_txq[TC_MAX_QUEUE];
u8  prio_tc_map[TC_BITMASK + 1];
 
@@ -2032,6 +2035,17 @@ int netdev_get_num_tc(struct net_device *dev)
return dev->num_tc;
 }
 
+void netdev_unbind_sb_channel(struct net_device *dev,
+ struct net_device *sb_dev);
+int netdev_bind_sb_channel_queue(struct net_device *dev,
+struct net_device *sb_dev,
+u8 tc, u16 count, u16 offset);
+int netdev_set_sb_channel(struct net_device *dev, u16 channel);
+static inline int netdev_get_sb_channel(struct net_device *dev)
+{
+   return max_t(int, -dev->num_tc, 0);
+}
+
 static inline
 struct netdev_queue *netdev_get_tx_queue(const struct net_device *dev,
 unsigned int index)
diff --git a/net/core/dev.c b/net/core/dev.c
index 6e18242..27fe4f2 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2068,11 +2068,13 @@ int netdev_txq_to_tc(struct net_device *dev, unsigned 
int txq)
	struct netdev_tc_txq *tc = &dev->tc_to_txq[0];
int i;
 
+   /* walk through the TCs and see if it falls into any of them */
for (i = 0; i < TC_MAX_QUEUE; i++, tc++) {
if ((txq - tc->offset) < tc->count)
return i;
}
 
+   /* didn't find it, just return -1 to indicate no match */
return -1;
}
 
@@ -2215,7 +2217,14 @@ int netif_set_xps_queue(struct net_device *dev, const 
struct cpumask *mask,
bool active = false;
 
if (dev->num_tc) {
+   /* Do not allow XPS on subordinate device directly */
num_tc = dev->num_tc;
+   if (num_tc < 0)
+   return -EINVAL;
+
+   /* If queue belongs to subordinate dev use its map */
+   dev = netdev_get_tx_queue(dev, index)->sb_dev ? : dev;
+
tc = netdev_txq_to_tc(dev, index);
if (tc < 0)
return -EINVAL;
@@ -2366,11 +2375,25 @@ int netif_set_xps_queue(struct net_device *dev, const 
struct cpumask *mask,
 EXPORT_SYMBOL(netif_set_xps_queue);
 
 #endif
+static void netdev_unbind_all_sb_channels(struct net_device *dev)
+{
+   struct netdev_queue *txq = &dev->_tx[dev->num_tx_queues];
+
+   /* Unbind any subordinate channels */
+   while (txq-- != &dev->_tx[0]) {
+   if (txq->sb_dev)
+   netdev_unbind_sb_channel(dev, txq->sb_dev);
+   }
+}
+
 void netdev_reset_tc(struct net_device *dev)
 {
 #ifdef CONFIG_XPS
netif_reset_xps_queues_gt(dev, 0);
 #endif
+   netdev_unbind_all_sb_channels(dev);
+
+   /* Reset TC configuration of device */
dev->num_tc = 0;
memset(dev->tc_to_txq, 0, sizeof(dev->tc_to_txq));
memset(dev->prio_tc_map, 0, sizeof(dev->prio_tc_map));
@@ -2399,11 +2422,77 @@ int netde

[jkirsher/next-queue PATCH 0/7] Add support for L2 Fwd Offload w/o ndo_select_queue

2018-06-11 Thread Alexander Duyck
This patch series is meant to allow support for the L2 forward offload, aka
MACVLAN offload without the need for using ndo_select_queue.

The existing solution currently requires that we use ndo_select_queue in
the transmit path if we want to associate specific Tx queues with a given
MACVLAN interface. In order to get away from this we need to repurpose the
tc_to_txq array and XPS pointer for the MACVLAN interface and use those as
a means of accessing the queues on the lower device. As a result we cannot
offload a device that is configured as multiqueue, however it doesn't
really make sense to configure a macvlan interface as multiqueue
anyway since it doesn't really have a qdisc of its own in the first place.

I am submitting this as an RFC for the netdev mailing list, and officially
submitting it for testing to Jeff Kirsher's next-queue in order to validate
the ixgbe specific bits.

The big changes in this set are:
  Allow lower device to update tc_to_txq and XPS map of offloaded MACVLAN
  Disable XPS for single queue devices
  Replace accel_priv with sb_dev in ndo_select_queue
  Add sb_dev parameter to fallback function for ndo_select_queue
  Consolidated ndo_select_queue functions that appeared to be duplicates

---

Alexander Duyck (7):
  net-sysfs: Drop support for XPS and traffic_class on single queue device
  net: Add support for subordinate device traffic classes
  ixgbe: Add code to populate and use macvlan tc to Tx queue map
  net: Add support for subordinate traffic classes to netdev_pick_tx
  net: Add generic ndo_select_queue functions
  net: allow ndo_select_queue to pass netdev
  net: allow fallback function to pass netdev


 drivers/infiniband/hw/hfi1/vnic_main.c|2 
 drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c |4 -
 drivers/net/bonding/bond_main.c   |3 
 drivers/net/ethernet/amazon/ena/ena_netdev.c  |5 -
 drivers/net/ethernet/broadcom/bcmsysport.c|6 -
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c   |6 +
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h   |3 
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c   |5 -
 drivers/net/ethernet/hisilicon/hns/hns_enet.c |5 -
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   62 ++--
 drivers/net/ethernet/lantiq_etop.c|   10 -
 drivers/net/ethernet/mellanox/mlx4/en_tx.c|7 +
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h  |3 
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |3 
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c   |5 -
 drivers/net/ethernet/renesas/ravb_main.c  |3 
 drivers/net/ethernet/sun/ldmvsw.c |3 
 drivers/net/ethernet/sun/sunvnet.c|3 
 drivers/net/ethernet/ti/netcp_core.c  |9 -
 drivers/net/hyperv/netvsc_drv.c   |6 -
 drivers/net/macvlan.c |   10 -
 drivers/net/net_failover.c|7 +
 drivers/net/team/team.c   |3 
 drivers/net/tun.c |3 
 drivers/net/wireless/marvell/mwifiex/main.c   |3 
 drivers/net/xen-netback/interface.c   |4 -
 drivers/net/xen-netfront.c|3 
 drivers/staging/netlogic/xlr_net.c|9 -
 drivers/staging/rtl8188eu/os_dep/os_intfs.c   |3 
 drivers/staging/rtl8723bs/os_dep/os_intfs.c   |7 -
 include/linux/netdevice.h |   32 
 net/core/dev.c|  154 ++---
 net/core/net-sysfs.c  |   36 +
 net/mac80211/iface.c  |4 -
 net/packet/af_packet.c|9 -
 35 files changed, 306 insertions(+), 134 deletions(-)

--


[jkirsher/next-queue PATCH 3/7] ixgbe: Add code to populate and use macvlan tc to Tx queue map

2018-06-11 Thread Alexander Duyck
This patch makes it so that we use the tc_to_txq mapping in the macvlan
device in order to select the Tx queue for outgoing packets.

The idea here is to try and move away from using ixgbe_select_queue and to
come up with a generic way to make this work for devices going forward. By
encoding this information in the netdev this can become something that can
be used generically as a solution for similar setups going forward.

Signed-off-by: Alexander Duyck 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   44 ++---
 1 file changed, 38 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index fc23e36..6e27848 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -5271,6 +5271,8 @@ static void ixgbe_clean_rx_ring(struct ixgbe_ring 
*rx_ring)
 static int ixgbe_fwd_ring_up(struct ixgbe_adapter *adapter,
 struct ixgbe_fwd_adapter *accel)
 {
+   u16 rss_i = adapter->ring_feature[RING_F_RSS].indices;
+   int num_tc = netdev_get_num_tc(adapter->netdev);
struct net_device *vdev = accel->netdev;
int i, baseq, err;
 
@@ -5282,6 +5284,11 @@ static int ixgbe_fwd_ring_up(struct ixgbe_adapter 
*adapter,
accel->rx_base_queue = baseq;
accel->tx_base_queue = baseq;
 
+   /* record configuration for macvlan interface in vdev */
+   for (i = 0; i < num_tc; i++)
+   netdev_bind_sb_channel_queue(adapter->netdev, vdev,
+i, rss_i, baseq + (rss_i * i));
+
for (i = 0; i < adapter->num_rx_queues_per_pool; i++)
adapter->rx_ring[baseq + i]->netdev = vdev;
 
@@ -5306,6 +5313,10 @@ static int ixgbe_fwd_ring_up(struct ixgbe_adapter 
*adapter,
 
netdev_err(vdev, "L2FW offload disabled due to L2 filter error\n");
 
+   /* unbind the queues and drop the subordinate channel config */
+   netdev_unbind_sb_channel(adapter->netdev, vdev);
+   netdev_set_sb_channel(vdev, 0);
+
clear_bit(accel->pool, adapter->fwd_bitmask);
kfree(accel);
 
@@ -8212,18 +8223,22 @@ static u16 ixgbe_select_queue(struct net_device *dev, 
struct sk_buff *skb,
  void *accel_priv, select_queue_fallback_t 
fallback)
 {
struct ixgbe_fwd_adapter *fwd_adapter = accel_priv;
-   struct ixgbe_adapter *adapter;
-   int txq;
 #ifdef IXGBE_FCOE
+   struct ixgbe_adapter *adapter;
struct ixgbe_ring_feature *f;
 #endif
+   int txq;
 
if (fwd_adapter) {
-   adapter = netdev_priv(dev);
-   txq = reciprocal_scale(skb_get_hash(skb),
-  adapter->num_rx_queues_per_pool);
+   u8 tc = netdev_get_num_tc(dev) ?
+   netdev_get_prio_tc_map(dev, skb->priority) : 0;
+   struct net_device *vdev = fwd_adapter->netdev;
+
+   txq = vdev->tc_to_txq[tc].offset;
+   txq += reciprocal_scale(skb_get_hash(skb),
+   vdev->tc_to_txq[tc].count);
 
-   return txq + fwd_adapter->tx_base_queue;
+   return txq;
}
 
 #ifdef IXGBE_FCOE
@@ -8777,6 +8792,11 @@ static int ixgbe_reassign_macvlan_pool(struct net_device 
*vdev, void *data)
/* if we cannot find a free pool then disable the offload */
netdev_err(vdev, "L2FW offload disabled due to lack of queue 
resources\n");
macvlan_release_l2fw_offload(vdev);
+
+   /* unbind the queues and drop the subordinate channel config */
+   netdev_unbind_sb_channel(adapter->netdev, vdev);
+   netdev_set_sb_channel(vdev, 0);
+
kfree(accel);
 
return 0;
@@ -9785,6 +9805,13 @@ static void *ixgbe_fwd_add(struct net_device *pdev, 
struct net_device *vdev)
if (!macvlan_supports_dest_filter(vdev))
return ERR_PTR(-EMEDIUMTYPE);
 
+   /* We need to lock down the macvlan to be a single queue device so that
+* we can reuse the tc_to_txq field in the macvlan netdev to represent
+* the queue mapping to our netdev.
+*/
+   if (netif_is_multiqueue(vdev))
+   return ERR_PTR(-ERANGE);
+
pool = find_first_zero_bit(adapter->fwd_bitmask, adapter->num_rx_pools);
if (pool == adapter->num_rx_pools) {
u16 used_pools = adapter->num_vfs + adapter->num_rx_pools;
@@ -9841,6 +9868,7 @@ static void *ixgbe_fwd_add(struct net_device *pdev, 
struct net_device *vdev)
return ERR_PTR(-ENOMEM);
 
set_bit(pool, adapter->fwd_bitmask);
+   netdev_set_sb_channel(vdev, pool);
accel->pool = pool;
accel->netdev = vdev;
 
@@ -9882,6 +9910,10 @@ static void ixgbe_fwd_del(struct net_device *pdev, void 
*priv)

[jkirsher/next-queue PATCH 5/7] net: Add generic ndo_select_queue functions

2018-06-11 Thread Alexander Duyck
This patch adds a generic version of the ndo_select_queue functions for
either returning 0 or selecting a queue based on the processor ID. This is
generally meant to just reduce the number of functions we have to change
in the future when we have to deal with ndo_select_queue changes.

Signed-off-by: Alexander Duyck 
---
 drivers/net/ethernet/lantiq_etop.c   |   10 +-
 drivers/net/ethernet/ti/netcp_core.c |9 +
 drivers/staging/netlogic/xlr_net.c   |9 +
 include/linux/netdevice.h|2 ++
 net/core/dev.c   |   12 
 net/packet/af_packet.c   |9 ++---
 6 files changed, 19 insertions(+), 32 deletions(-)

diff --git a/drivers/net/ethernet/lantiq_etop.c 
b/drivers/net/ethernet/lantiq_etop.c
index afc8100..7a637b5 100644
--- a/drivers/net/ethernet/lantiq_etop.c
+++ b/drivers/net/ethernet/lantiq_etop.c
@@ -563,14 +563,6 @@ struct ltq_etop_priv {
	spin_unlock_irqrestore(&priv->lock, flags);
 }
 
-static u16
-ltq_etop_select_queue(struct net_device *dev, struct sk_buff *skb,
- void *accel_priv, select_queue_fallback_t fallback)
-{
-   /* we are currently only using the first queue */
-   return 0;
-}
-
 static int
 ltq_etop_init(struct net_device *dev)
 {
@@ -641,7 +633,7 @@ struct ltq_etop_priv {
.ndo_set_mac_address = ltq_etop_set_mac_address,
.ndo_validate_addr = eth_validate_addr,
.ndo_set_rx_mode = ltq_etop_set_multicast_list,
-   .ndo_select_queue = ltq_etop_select_queue,
+   .ndo_select_queue = dev_pick_tx_zero,
.ndo_init = ltq_etop_init,
.ndo_tx_timeout = ltq_etop_tx_timeout,
 };
diff --git a/drivers/net/ethernet/ti/netcp_core.c 
b/drivers/net/ethernet/ti/netcp_core.c
index e40aa3e..2c455bd 100644
--- a/drivers/net/ethernet/ti/netcp_core.c
+++ b/drivers/net/ethernet/ti/netcp_core.c
@@ -1889,13 +1889,6 @@ static int netcp_rx_kill_vid(struct net_device *ndev, 
__be16 proto, u16 vid)
return err;
 }
 
-static u16 netcp_select_queue(struct net_device *dev, struct sk_buff *skb,
- void *accel_priv,
- select_queue_fallback_t fallback)
-{
-   return 0;
-}
-
 static int netcp_setup_tc(struct net_device *dev, enum tc_setup_type type,
  void *type_data)
 {
@@ -1972,7 +1965,7 @@ static int netcp_setup_tc(struct net_device *dev, enum 
tc_setup_type type,
.ndo_vlan_rx_add_vid= netcp_rx_add_vid,
.ndo_vlan_rx_kill_vid   = netcp_rx_kill_vid,
.ndo_tx_timeout = netcp_ndo_tx_timeout,
-   .ndo_select_queue   = netcp_select_queue,
+   .ndo_select_queue   = dev_pick_tx_zero,
.ndo_setup_tc   = netcp_setup_tc,
 };
 
diff --git a/drivers/staging/netlogic/xlr_net.c 
b/drivers/staging/netlogic/xlr_net.c
index e461168..4e6611e 100644
--- a/drivers/staging/netlogic/xlr_net.c
+++ b/drivers/staging/netlogic/xlr_net.c
@@ -290,13 +290,6 @@ static netdev_tx_t xlr_net_start_xmit(struct sk_buff *skb,
return NETDEV_TX_OK;
 }
 
-static u16 xlr_net_select_queue(struct net_device *ndev, struct sk_buff *skb,
-   void *accel_priv,
-   select_queue_fallback_t fallback)
-{
-   return (u16)smp_processor_id();
-}
-
 static void xlr_hw_set_mac_addr(struct net_device *ndev)
 {
struct xlr_net_priv *priv = netdev_priv(ndev);
@@ -403,7 +396,7 @@ static void xlr_stats(struct net_device *ndev, struct 
rtnl_link_stats64 *stats)
.ndo_open = xlr_net_open,
.ndo_stop = xlr_net_stop,
.ndo_start_xmit = xlr_net_start_xmit,
-   .ndo_select_queue = xlr_net_select_queue,
+   .ndo_select_queue = dev_pick_tx_cpu_id,
.ndo_set_mac_address = xlr_net_set_mac_addr,
.ndo_set_rx_mode = xlr_set_rx_mode,
.ndo_get_stats64 = xlr_stats,
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 91b3ca9..f277149 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2551,6 +2551,8 @@ struct net_device *__dev_get_by_flags(struct net *net, 
unsigned short flags,
 void dev_close_many(struct list_head *head, bool unlink);
 void dev_disable_lro(struct net_device *dev);
 int dev_loopback_xmit(struct net *net, struct sock *sk, struct sk_buff 
*newskb);
+u16 dev_pick_tx_zero(struct net_device *dev, struct sk_buff *skb);
+u16 dev_pick_tx_cpu_id(struct net_device *dev, struct sk_buff *skb);
 int dev_queue_xmit(struct sk_buff *skb);
 int dev_queue_xmit_accel(struct sk_buff *skb, struct net_device *sb_dev);
 int dev_direct_xmit(struct sk_buff *skb, u16 queue_id);
diff --git a/net/core/dev.c b/net/core/dev.c
index 2249294..d746fdd 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3505,6 +3505,18 @@ static inline int get_xps_queue(struct net_device *dev,
 #endif
 }
 
+u16 dev_pick_tx_zero(struct net_device *dev, struct sk_buff *skb)
+{
+   return 0;
+}
+EXPORT_SYMBOL(dev_pick_tx_zero);
+
+

[jkirsher/next-queue PATCH 7/7] net: allow fallback function to pass netdev

2018-06-11 Thread Alexander Duyck
For most of these calls we can just pass NULL through to the fallback
function as the sb_dev. The only cases where we cannot are the cases where
we might be dealing with either an upper device or a driver that would
have configured things to support an sb_dev itself.

The only driver that has any significant change in this patchset should be
ixgbe as we can drop the redundant functionality that existed in both the
ndo_select_queue function and the fallback function that was passed through
to us.

Signed-off-by: Alexander Duyck 
---
 drivers/net/ethernet/amazon/ena/ena_netdev.c|2 +-
 drivers/net/ethernet/broadcom/bcmsysport.c  |4 ++--
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c |3 ++-
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c |2 +-
 drivers/net/ethernet/hisilicon/hns/hns_enet.c   |2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c   |4 ++--
 drivers/net/ethernet/mellanox/mlx4/en_tx.c  |4 ++--
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c |2 +-
 drivers/net/hyperv/netvsc_drv.c |2 +-
 drivers/net/net_failover.c  |2 +-
 drivers/net/xen-netback/interface.c |2 +-
 include/linux/netdevice.h   |3 ++-
 net/core/dev.c  |   12 +++-
 net/packet/af_packet.c  |7 +--
 14 files changed, 21 insertions(+), 30 deletions(-)
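
For context, the fallback type after this patch presumably gains a
subordinate-device argument along these lines (sketch of the assumed
netdevice.h typedef, not quoted from the diff below):

/* sb_dev carries the subordinate (upper) device, or NULL */
typedef u16 (*select_queue_fallback_t)(struct net_device *dev,
				       struct sk_buff *skb,
				       struct net_device *sb_dev);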

diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c 
b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index e3befb1..c673ac2 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -2224,7 +2224,7 @@ static u16 ena_select_queue(struct net_device *dev, 
struct sk_buff *skb,
if (skb_rx_queue_recorded(skb))
qid = skb_get_rx_queue(skb);
else
-   qid = fallback(dev, skb);
+   qid = fallback(dev, skb, NULL);
 
return qid;
 }
diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c 
b/drivers/net/ethernet/broadcom/bcmsysport.c
index 32f548e..eb890c4 100644
--- a/drivers/net/ethernet/broadcom/bcmsysport.c
+++ b/drivers/net/ethernet/broadcom/bcmsysport.c
@@ -2116,7 +2116,7 @@ static u16 bcm_sysport_select_queue(struct net_device 
*dev, struct sk_buff *skb,
unsigned int q, port;
 
if (!netdev_uses_dsa(dev))
-   return fallback(dev, skb);
+   return fallback(dev, skb, NULL);
 
/* DSA tagging layer will have configured the correct queue */
q = BRCM_TAG_GET_QUEUE(queue);
@@ -2124,7 +2124,7 @@ static u16 bcm_sysport_select_queue(struct net_device 
*dev, struct sk_buff *skb,
tx_ring = priv->ring_map[q + port * priv->per_port_num_tx_queues];
 
if (unlikely(!tx_ring))
-   return fallback(dev, skb);
+   return fallback(dev, skb, NULL);
 
return tx_ring->index;
 }
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index 969dcc9..7a1b99f 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -1928,7 +1928,8 @@ u16 bnx2x_select_queue(struct net_device *dev, struct 
sk_buff *skb,
}
 
/* select a non-FCoE queue */
-   return fallback(dev, skb) % (BNX2X_NUM_ETH_QUEUES(bp) * bp->max_cos);
+   return fallback(dev, skb, NULL) %
+  (BNX2X_NUM_ETH_QUEUES(bp) * bp->max_cos);
 }
 
 void bnx2x_set_num_queues(struct bnx2x *bp)
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 8de3039..380931d 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -972,7 +972,7 @@ static u16 cxgb_select_queue(struct net_device *dev, struct 
sk_buff *skb,
return txq;
}
 
-   return fallback(dev, skb) % dev->real_num_tx_queues;
+   return fallback(dev, skb, NULL) % dev->real_num_tx_queues;
 }
 
 static int closest_timer(const struct sge *s, int time)
diff --git a/drivers/net/ethernet/hisilicon/hns/hns_enet.c 
b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
index c36a231..8327254 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
@@ -2033,7 +2033,7 @@ static void hns_nic_get_stats64(struct net_device *ndev,
is_multicast_ether_addr(eth_hdr->h_dest))
return 0;
else
-   return fallback(ndev, skb);
+   return fallback(ndev, skb, NULL);
 }
 
 static const struct net_device_ops hns_nic_netdev_ops = {
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 5d9867e..eef64d0 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -8248,11 +8248,11 @@ stati

[jkirsher/next-queue PATCH 6/7] net: allow ndo_select_queue to pass netdev

2018-06-11 Thread Alexander Duyck
This patch makes it so that instead of passing a void pointer as the
accel_priv we instead pass a net_device pointer as sb_dev. Making this
change allows us to pass the subordinate device through to the fallback
function eventually so that we can keep the actual code in the
ndo_select_queue call as focused as possible on the exception cases.

Signed-off-by: Alexander Duyck 
---
 drivers/infiniband/hw/hfi1/vnic_main.c|2 +-
 drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c |4 ++--
 drivers/net/bonding/bond_main.c   |3 ++-
 drivers/net/ethernet/amazon/ena/ena_netdev.c  |3 ++-
 drivers/net/ethernet/broadcom/bcmsysport.c|2 +-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c   |3 ++-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h   |3 ++-
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c   |3 ++-
 drivers/net/ethernet/hisilicon/hns/hns_enet.c |3 ++-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |7 ---
 drivers/net/ethernet/mellanox/mlx4/en_tx.c|3 ++-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h  |3 ++-
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |3 ++-
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c   |3 ++-
 drivers/net/ethernet/renesas/ravb_main.c  |3 ++-
 drivers/net/ethernet/sun/ldmvsw.c |3 ++-
 drivers/net/ethernet/sun/sunvnet.c|3 ++-
 drivers/net/hyperv/netvsc_drv.c   |4 ++--
 drivers/net/net_failover.c|5 +++--
 drivers/net/team/team.c   |3 ++-
 drivers/net/tun.c |3 ++-
 drivers/net/wireless/marvell/mwifiex/main.c   |3 ++-
 drivers/net/xen-netback/interface.c   |2 +-
 drivers/net/xen-netfront.c|3 ++-
 drivers/staging/rtl8188eu/os_dep/os_intfs.c   |3 ++-
 drivers/staging/rtl8723bs/os_dep/os_intfs.c   |7 +++
 include/linux/netdevice.h |   11 +++
 net/core/dev.c|6 --
 net/mac80211/iface.c  |4 ++--
 net/packet/af_packet.c|9 +++--
 30 files changed, 73 insertions(+), 44 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/vnic_main.c 
b/drivers/infiniband/hw/hfi1/vnic_main.c
index 5d65582..616fc9b 100644
--- a/drivers/infiniband/hw/hfi1/vnic_main.c
+++ b/drivers/infiniband/hw/hfi1/vnic_main.c
@@ -423,7 +423,7 @@ static netdev_tx_t hfi1_netdev_start_xmit(struct sk_buff 
*skb,
 
 static u16 hfi1_vnic_select_queue(struct net_device *netdev,
  struct sk_buff *skb,
- void *accel_priv,
+ struct net_device *sb_dev,
  select_queue_fallback_t fallback)
 {
struct hfi1_vnic_vport_info *vinfo = opa_vnic_dev_priv(netdev);
diff --git a/drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c 
b/drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c
index 0c8aec6..6155878 100644
--- a/drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c
+++ b/drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c
@@ -95,7 +95,7 @@ static netdev_tx_t opa_netdev_start_xmit(struct sk_buff *skb,
 }
 
 static u16 opa_vnic_select_queue(struct net_device *netdev, struct sk_buff 
*skb,
-void *accel_priv,
+struct net_device *sb_dev,
 select_queue_fallback_t fallback)
 {
struct opa_vnic_adapter *adapter = opa_vnic_priv(netdev);
@@ -107,7 +107,7 @@ static u16 opa_vnic_select_queue(struct net_device *netdev, 
struct sk_buff *skb,
mdata->entropy = opa_vnic_calc_entropy(skb);
mdata->vl = opa_vnic_get_vl(adapter, skb);
rc = adapter->rn_ops->ndo_select_queue(netdev, skb,
-  accel_priv, fallback);
+  sb_dev, fallback);
skb_pull(skb, sizeof(*mdata));
return rc;
 }
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index bd53a71..e33f689 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -4094,7 +4094,8 @@ static inline int bond_slave_override(struct bonding 
*bond,
 
 
 static u16 bond_select_queue(struct net_device *dev, struct sk_buff *skb,
-void *accel_priv, select_queue_fallback_t fallback)
+struct net_device *sb_dev,
+select_queue_fallback_t fallback)
 {
/* This helper function exists to help dev_pick_tx get the correct
 * destination queue.  Using a helper function skips a call to
diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c 
b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index f2af87d..e3befb1 100644
--- a/drivers/net/eth
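
The rest of the diff is truncated by the archive, but the conversion
repeated across the drivers above is mechanical; a minimal sketch using a
hypothetical driver (the foo_ name and the FCoE check are illustrative
only):

static u16 foo_select_queue(struct net_device *dev, struct sk_buff *skb,
			    struct net_device *sb_dev,
			    select_queue_fallback_t fallback)
{
	/* Driver-specific exception cases stay in the ndo... */
	if (skb->protocol == htons(ETH_P_FCOE))
		return 0;

	/* ...everything else defers to the core, now passing the
	 * subordinate device through instead of a void *accel_priv.
	 */
	return fallback(dev, skb, sb_dev);
}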

[jkirsher/next-queue PATCH 4/7] net: Add support for subordinate traffic classes to netdev_pick_tx

2018-06-11 Thread Alexander Duyck
This change makes it so that we can support the concept of subordinate
device traffic classes to the core networking code. In doing this we can
start pulling out the driver specific bits needed to support selecting a
queue based on an upper device.

The solution as it currently stands is only partially implemented. I have
the start of some XPS bits in here, but I would still need to allow for
configuration of the XPS maps on the queues reserved for the subordinate
devices. For now I am using the reference to the sb_dev XPS map just as a
way to skip the lookup of the lower device XPS map, as that would
result in the wrong queue being picked.

Signed-off-by: Alexander Duyck 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   19 +++-
 drivers/net/macvlan.c |   10 +---
 include/linux/netdevice.h |4 +-
 net/core/dev.c|   57 +++--
 4 files changed, 45 insertions(+), 45 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 6e27848..053a54c 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -8219,20 +8219,17 @@ static void ixgbe_atr(struct ixgbe_ring *ring,
  input, common, ring->queue_index);
 }
 
+#ifdef IXGBE_FCOE
 static u16 ixgbe_select_queue(struct net_device *dev, struct sk_buff *skb,
  void *accel_priv, select_queue_fallback_t 
fallback)
 {
-   struct ixgbe_fwd_adapter *fwd_adapter = accel_priv;
-#ifdef IXGBE_FCOE
struct ixgbe_adapter *adapter;
struct ixgbe_ring_feature *f;
-#endif
int txq;
 
-   if (fwd_adapter) {
-   u8 tc = netdev_get_num_tc(dev) ?
-   netdev_get_prio_tc_map(dev, skb->priority) : 0;
-   struct net_device *vdev = fwd_adapter->netdev;
+   if (accel_priv) {
+   u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
+   struct net_device *vdev = accel_priv;
 
txq = vdev->tc_to_txq[tc].offset;
txq += reciprocal_scale(skb_get_hash(skb),
@@ -8241,8 +8238,6 @@ static u16 ixgbe_select_queue(struct net_device *dev, 
struct sk_buff *skb,
return txq;
}
 
-#ifdef IXGBE_FCOE
-
/*
 * only execute the code below if protocol is FCoE
 * or FIP and we have FCoE enabled on the adapter
@@ -8268,11 +8263,9 @@ static u16 ixgbe_select_queue(struct net_device *dev, 
struct sk_buff *skb,
txq -= f->indices;
 
return txq + f->offset;
-#else
-   return fallback(dev, skb);
-#endif
 }
 
+#endif
 static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
   struct xdp_frame *xdpf)
 {
@@ -10076,7 +10069,6 @@ static int ixgbe_xdp_xmit(struct net_device *dev, int n,
.ndo_open   = ixgbe_open,
.ndo_stop   = ixgbe_close,
.ndo_start_xmit = ixgbe_xmit_frame,
-   .ndo_select_queue   = ixgbe_select_queue,
.ndo_set_rx_mode= ixgbe_set_rx_mode,
.ndo_validate_addr  = eth_validate_addr,
.ndo_set_mac_address= ixgbe_set_mac,
@@ -10099,6 +10091,7 @@ static int ixgbe_xdp_xmit(struct net_device *dev, int n,
.ndo_poll_controller= ixgbe_netpoll,
 #endif
 #ifdef IXGBE_FCOE
+   .ndo_select_queue   = ixgbe_select_queue,
.ndo_fcoe_ddp_setup = ixgbe_fcoe_ddp_get,
.ndo_fcoe_ddp_target = ixgbe_fcoe_ddp_target,
.ndo_fcoe_ddp_done = ixgbe_fcoe_ddp_put,
diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index adde8fc..401e1d1 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -514,7 +514,6 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct 
net_device *dev)
const struct macvlan_dev *vlan = netdev_priv(dev);
const struct macvlan_port *port = vlan->port;
const struct macvlan_dev *dest;
-   void *accel_priv = NULL;
 
if (vlan->mode == MACVLAN_MODE_BRIDGE) {
const struct ethhdr *eth = (void *)skb->data;
@@ -533,15 +532,10 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct 
net_device *dev)
return NET_XMIT_SUCCESS;
}
}
-
-   /* For packets that are non-multicast and not bridged we will pass
-* the necessary information so that the lowerdev can distinguish
-* the source of the packets via the accel_priv value.
-*/
-   accel_priv = vlan->accel_priv;
 xmit_world:
skb->dev = vlan->lowerdev;
-   return dev_queue_xmit_accel(skb, accel_priv);
+   return dev_queue_xmit_accel(skb,
+   netdev_get_sb_channel(dev) ? dev : NULL);
 }
 
 static inline netdev_tx_t macvlan_netpoll_send_skb(struct macvlan_dev 
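
The macvlan hunk above is cut short by the archive; to make the queue-pick
flow concrete, here is a simplified sketch of what the core does with a
subordinate device after this patch (illustrative, not the literal patched
netdev_pick_tx):

static u16 sb_pick_tx_sketch(struct net_device *dev, struct sk_buff *skb,
			     struct net_device *sb_dev)
{
	u8 tc;
	u16 count;

	/* No subordinate device: use the lower device's own state */
	sb_dev = sb_dev ? : dev;

	/* The upper device's tc_to_txq[] maps its traffic class onto
	 * the slice of real Tx queues reserved for it.
	 */
	tc = netdev_get_prio_tc_map(sb_dev, skb->priority);
	count = sb_dev->tc_to_txq[tc].count;
	if (!count)
		return 0;

	return sb_dev->tc_to_txq[tc].offset +
	       reciprocal_scale(skb_get_hash(skb), count);
}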

[jkirsher/next-queue PATCH 1/7] net-sysfs: Drop support for XPS and traffic_class on single queue device

2018-06-11 Thread Alexander Duyck
This patch makes it so that we do not report the traffic class or allow XPS
configuration on single queue devices. This is mostly to avoid unnecessary
complexity with changes I have planned that will allow us to reuse
the unused tc_to_txq and XPS configuration on a single queue device to
allow it to make use of a subset of queues on an underlying device.

Signed-off-by: Alexander Duyck 
---
 net/core/net-sysfs.c |   15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index bb7e80f..335c6a4 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1047,9 +1047,14 @@ static ssize_t traffic_class_show(struct netdev_queue 
*queue,
  char *buf)
 {
struct net_device *dev = queue->dev;
-   int index = get_netdev_queue_index(queue);
-   int tc = netdev_txq_to_tc(dev, index);
+   int index;
+   int tc;
 
+   if (!netif_is_multiqueue(dev))
+   return -ENOENT;
+
+   index = get_netdev_queue_index(queue);
+   tc = netdev_txq_to_tc(dev, index);
if (tc < 0)
return -EINVAL;
 
@@ -1214,6 +1219,9 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
cpumask_var_t mask;
unsigned long index;
 
+   if (!netif_is_multiqueue(dev))
+   return -ENOENT;
+
index = get_netdev_queue_index(queue);
 
if (dev->num_tc) {
@@ -1260,6 +1268,9 @@ static ssize_t xps_cpus_store(struct netdev_queue *queue,
cpumask_var_t mask;
int err;
 
+   if (!netif_is_multiqueue(dev))
+   return -ENOENT;
+
if (!capable(CAP_NET_ADMIN))
return -EPERM;
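
The user-visible effect of the patch (illustrative summary):

/* On a single-queue device after this change:
 *
 *   read  queues/tx-0/traffic_class  ->  -ENOENT  (was: a tc value)
 *   read  queues/tx-0/xps_cpus       ->  -ENOENT  (was: a cpu mask)
 *   write queues/tx-0/xps_cpus       ->  -ENOENT  (was: accepted)
 *
 * Multiqueue devices are unaffected; the freed-up tc_to_txq/XPS state
 * on single-queue uppers is what the later patches reuse.
 */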
 



Re: [PATCH net] failover: eliminate callback hell

2018-06-07 Thread Alexander Duyck
On Wed, Jun 6, 2018 at 3:25 PM, Stephen Hemminger
 wrote:
> On Wed, 6 Jun 2018 14:54:04 -0700
> "Samudrala, Sridhar"  wrote:
>
>> On 6/6/2018 2:24 PM, Stephen Hemminger wrote:
>> > On Wed, 6 Jun 2018 15:30:27 +0300
>> > "Michael S. Tsirkin"  wrote:
>> >
>> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
>> >>> Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org wrote:
>>  The net failover should be a simple library, not a virtual
>>  object with function callbacks (see callback hell).
>> >>> Why just a library? It should do a common things. I think it should be a
>> >>> virtual object. Looks like your patch again splits the common
>> >>> functionality into multiple drivers. That is kind of backwards attitude.
>> >>> I don't get it. We should rather focus on fixing the mess the
>> >>> introduction of netvsc-bonding caused and switch netvsc to 3-netdev
>> >>> model.
>> >> So it seems that at least one benefit for netvsc would be better
>> >> handling of renames.
>> >>
>> >> Question is how can this change to 3-netdev happen?  Stephen is
>> >> concerned about risk of breaking some userspace.
>> >>
>> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
>> >> address, and you said then "why not use existing network namespaces
>> >> rather than inventing a new abstraction". So how about it then? Do you
>> >> want to find a way to use namespaces to hide the PV device for netvsc
>> >> compatibility?
>> >>
>> > Netvsc can't work with 3 dev model. MS has worked with enough distro's and
>> > startups that all demand eth0 always be present. And VF may come and go.
>> > After this history, there is a strong motivation not to change how kernel
>> > behaves. Switching to 3 device model would be perceived as breaking
>> > existing userspace.
>>
>> I think it should be possible for netvsc to work with 3 dev model if the only
>> requirement is that eth0 will always be present. With net_failover, you will
>> see eth0 and eth0nsby OR with older distros eth0 and eth1.  It may be an 
>> issue
>> if somehow there is userspace requirement that there can be only 2 netdevs, 
>> not 3
>> when VF is plugged.
>>
>> eth0 will be the net_failover device and eth0nsby/eth1 will be the netvsc 
>> device
>> and the IP address gets configured on eth0. Will this be an issue?
>
> DPDK drivers in 18.05 depend on 2 device model. Yes it is a bit of mess
> but that is the way it is.

Why would DPDK care what we do in the kernel? Isn't it just slapping
vfio-pci on the netdevs it sees?


Re: AF_XDP. Was: [net-next 00/12][pull request] Intel Wired LAN Driver Updates 2018-06-04

2018-06-04 Thread Alexander Duyck
On Mon, Jun 4, 2018 at 4:32 PM, Alexei Starovoitov
 wrote:
> On Mon, Jun 04, 2018 at 03:02:31PM -0700, Alexander Duyck wrote:
>> On Mon, Jun 4, 2018 at 2:27 PM, David Miller  wrote:
>> > From: Or Gerlitz 
>> > Date: Tue, 5 Jun 2018 00:11:35 +0300
>> >
>> >> Just to make sure, is the AF_XDP ZC (Zero Copy) UAPI going to be
>> >> merged for this window -- AFAIU from [1], it's still under
>> >> examination/development/research for non Intel HWs, am I correct or
>> >> this is going to get in now?
>> >
>> > All of the pending AF_XDP changes will be merged this merge window.
>> >
>> > I think Intel folks need to review things as fast as possible because
>> > I pretty much refuse to revert the series or disable it in Kconfig at
>> > this point.
>> >
>> > Thank you.
>>
>> My understanding of things is that the current AF_XDP patches were
>> going to be updated to have more of a model agnostic API such that
>> they would work for either the "typewriter" mode or the descriptor
>> ring based approach. The current plan was to have the zero copy
>> patches be a follow-on after the vendor agnostic API bits in the
>> descriptors and such had been sorted out. I believe you guys have the
>> descriptor fixes already right?
>>
>> In my opinion the i40e code isn't mature enough yet to really go into
>> anything other than maybe net-next in a couple weeks. We are going to
>> need a while to get adequate testing in order to flush out all the
>> bugs and performance regressions we are likely to see coming out of
>> this change.
>
> I think the work everyone did in this release cycle increased my confidence
> that the way descriptors are defined and the rest of uapi are stable enough
> and i40e zero copy bits can land in the next release without uapi changes.
> In that sense even if we merge i40e parts now, the other nic vendors
> will be in the same situation and may find things that they would like
> to improve in uapi.
> So I propose we merge the first 7 patches of the last series now and
> let 3 remaining i40e patches go via intel trees for the next release.
> In the mean time other NIC vendors should start actively working
> on AF_XDP support as well.
> If somehow uapi would need tweaks, we can still do minor adjustments
> since 4.18 won't be released for ~10 weeks.
>

That works for me. Actually I think patch 11 can probably be included
as well since that is just sample code and could probably be used by
whatever drivers end up implementing this.

Thanks.

- Alex


Re: [net-next 00/12][pull request] Intel Wired LAN Driver Updates 2018-06-04

2018-06-04 Thread Alexander Duyck
On Mon, Jun 4, 2018 at 2:27 PM, David Miller  wrote:
> From: Or Gerlitz 
> Date: Tue, 5 Jun 2018 00:11:35 +0300
>
>> Just to make sure, is the AF_XDP ZC (Zero Copy) UAPI going to be
>> merged for this window -- AFAIU from [1], it's still under
>> examination/development/research for non Intel HWs, am I correct or
>> this is going to get in now?
>
> All of the pending AF_XDP changes will be merged this merge window.
>
> I think Intel folks need to review things as fast as possible because
> I pretty much refuse to revert the series or disable it in Kconfig at
> this point.
>
> Thank you.

My understanding of things is that the current AF_XDP patches were
going to be updated to have more of a model agnostic API such that
they would work for either the "typewriter" mode or the descriptor
ring based approach. The current plan was to have the zero copy
patches be a follow-on after the vendor agnostic API bits in the
descriptors and such had been sorted out. I believe you guys have the
descriptor fixes already right?

In my opinion the i40e code isn't mature enough yet to really go into
anything other than maybe net-next in a couple weeks. We are going to
need a while to get adequate testing in order to flush out all the
bugs and performance regressions we are likely to see coming out of
this change.

- Alex


Re: [PATCH bpf-next 10/11] i40e: implement AF_XDP zero-copy support for Tx

2018-06-04 Thread Alexander Duyck
On Mon, Jun 4, 2018 at 5:06 AM, Björn Töpel  wrote:
> From: Magnus Karlsson 
>
> Here, ndo_xsk_async_xmit is implemented. As a shortcut, the existing
> XDP Tx rings are used for zero-copy. This will result in other devices
> doing XDP_REDIRECT to an AF_XDP enabled queue having their packets
> dropped.
>
> Signed-off-by: Magnus Karlsson 
> ---
>  drivers/net/ethernet/intel/i40e/i40e_main.c |   7 +-
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c |  93 +++---
>  drivers/net/ethernet/intel/i40e/i40e_txrx.h |  23 +
>  drivers/net/ethernet/intel/i40e/i40e_xsk.c  | 140 
> 
>  drivers/net/ethernet/intel/i40e/i40e_xsk.h  |   2 +
>  include/net/xdp_sock.h  |  14 +++
>  6 files changed, 242 insertions(+), 37 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
> b/drivers/net/ethernet/intel/i40e/i40e_main.c
> index 8c602424d339..98c18c41809d 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
> @@ -3073,8 +3073,12 @@ static int i40e_configure_tx_ring(struct i40e_ring 
> *ring)
> i40e_status err = 0;
> u32 qtx_ctl = 0;
>
> -   if (ring_is_xdp(ring))
> +   ring->clean_tx_irq = i40e_clean_tx_irq;
> +   if (ring_is_xdp(ring)) {
> ring->xsk_umem = i40e_xsk_umem(ring);
> +   if (ring->xsk_umem)
> +   ring->clean_tx_irq = i40e_clean_tx_irq_zc;

Again, I am worried about the performance penalty this will incur given
the retpoline overhead of indirect function calls.
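
For example, the usual way to sidestep that cost is to branch on the ring
state at the call site instead of caching a function pointer (sketch; the
i40e clean-routine signatures are assumed):

static bool i40e_clean_tx_dispatch(struct i40e_vsi *vsi,
				   struct i40e_ring *ring, int budget)
{
	/* A predictable conditional branch is cheap; an indirect call
	 * goes through a retpoline on affected CPUs.
	 */
	if (ring->xsk_umem)
		return i40e_clean_tx_irq_zc(vsi, ring, budget);

	return i40e_clean_tx_irq(vsi, ring, budget);
}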

> +   }
>
> /* some ATR related tx ring init */
> if (vsi->back->flags & I40E_FLAG_FD_ATR_ENABLED) {
> @@ -12162,6 +12166,7 @@ static const struct net_device_ops i40e_netdev_ops = {
> .ndo_bpf= i40e_xdp,
> .ndo_xdp_xmit   = i40e_xdp_xmit,
> .ndo_xdp_flush  = i40e_xdp_flush,
> +   .ndo_xsk_async_xmit = i40e_xsk_async_xmit,
>  };
>
>  /**
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c 
> b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> index 6b1142fbc697..923bb84a93ab 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> @@ -10,16 +10,6 @@
>  #include "i40e_trace.h"
>  #include "i40e_prototype.h"
>
> -static inline __le64 build_ctob(u32 td_cmd, u32 td_offset, unsigned int size,
> -   u32 td_tag)
> -{
> -   return cpu_to_le64(I40E_TX_DESC_DTYPE_DATA |
> -  ((u64)td_cmd  << I40E_TXD_QW1_CMD_SHIFT) |
> -  ((u64)td_offset << I40E_TXD_QW1_OFFSET_SHIFT) |
> -  ((u64)size  << I40E_TXD_QW1_TX_BUF_SZ_SHIFT) |
> -  ((u64)td_tag  << I40E_TXD_QW1_L2TAG1_SHIFT));
> -}
> -
>  #define I40E_TXD_CMD (I40E_TX_DESC_CMD_EOP | I40E_TX_DESC_CMD_RS)
>  /**
>   * i40e_fdir - Generate a Flow Director descriptor based on fdata
> @@ -649,9 +639,13 @@ void i40e_clean_tx_ring(struct i40e_ring *tx_ring)
> if (!tx_ring->tx_bi)
> return;
>
> -   /* Free all the Tx ring sk_buffs */
> -   for (i = 0; i < tx_ring->count; i++)
> -   i40e_unmap_and_free_tx_resource(tx_ring, &tx_ring->tx_bi[i]);
> +   /* Cleanup only needed for non XSK TX ZC rings */
> +   if (!tx_ring->xsk_umem) {
> +   /* Free all the Tx ring sk_buffs */
> +   for (i = 0; i < tx_ring->count; i++)
> +   i40e_unmap_and_free_tx_resource(tx_ring,
> +   &tx_ring->tx_bi[i]);
> +   }
>
> bi_size = sizeof(struct i40e_tx_buffer) * tx_ring->count;
> memset(tx_ring->tx_bi, 0, bi_size);
> @@ -768,8 +762,40 @@ void i40e_detect_recover_hung(struct i40e_vsi *vsi)
> }
>  }
>
> +void i40e_update_tx_stats(struct i40e_ring *tx_ring,
> + unsigned int total_packets,
> + unsigned int total_bytes)
> +{
> +   u64_stats_update_begin(&tx_ring->syncp);
> +   tx_ring->stats.bytes += total_bytes;
> +   tx_ring->stats.packets += total_packets;
> +   u64_stats_update_end(&tx_ring->syncp);
> +   tx_ring->q_vector->tx.total_bytes += total_bytes;
> +   tx_ring->q_vector->tx.total_packets += total_packets;
> +}
> +
>  #define WB_STRIDE 4
>
> +void i40e_arm_wb(struct i40e_ring *tx_ring,
> +struct i40e_vsi *vsi,
> +int budget)
> +{
> +   if (tx_ring->flags & I40E_TXR_FLAGS_WB_ON_ITR) {
> +   /* check to see if there are < 4 descriptors
> +* waiting to be written back, then kick the hardware to force
> +* them to be written back in case we stay in NAPI.
> +* In this mode on X722 we do not enable Interrupt.
> +*/
> +   unsigned int j = i40e_get_tx_pending(tx_ring, false);
> +
> +   if (budget &&
> +   

Re: [PATCH bpf-next 09/11] i40e: implement AF_XDP zero-copy support for Rx

2018-06-04 Thread Alexander Duyck
On Mon, Jun 4, 2018 at 5:05 AM, Björn Töpel  wrote:
> From: Björn Töpel 
>
> This commit adds initial AF_XDP zero-copy support for i40e-based
> NICs. First we add support for the new XDP_QUERY_XSK_UMEM and
> XDP_SETUP_XSK_UMEM commands in ndo_bpf. This allows the AF_XDP socket
> to pass a UMEM to the driver. The driver will then DMA map all the
> frames in the UMEM for the driver. Next, the Rx code will allocate
> frames from the UMEM fill queue, instead of the regular page
> allocator.
>
> Externally, for the rest of the XDP code, the driver internal UMEM
> allocator will appear as a MEM_TYPE_ZERO_COPY.
>
> The commit also introduces a completely new clean_rx_irq/allocator
> functions for zero-copy, and means (function pointers) to set
> allocators and clean_rx functions.
>
> This first version does not support:
> * passing frames to the stack via XDP_PASS (clone/copy to skb).
> * doing XDP redirect to other than AF_XDP sockets
>   (convert_to_xdp_frame does not clone the frame yet).
>
> Signed-off-by: Björn Töpel 
> ---
>  drivers/net/ethernet/intel/i40e/Makefile|   3 +-
>  drivers/net/ethernet/intel/i40e/i40e.h  |  23 ++
>  drivers/net/ethernet/intel/i40e/i40e_main.c |  35 +-
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c | 163 ++---
>  drivers/net/ethernet/intel/i40e/i40e_txrx.h | 128 ++-
>  drivers/net/ethernet/intel/i40e/i40e_xsk.c  | 537 
> 
>  drivers/net/ethernet/intel/i40e/i40e_xsk.h  |  17 +
>  include/net/xdp_sock.h  |  19 +
>  net/xdp/xdp_umem.h  |  10 -
>  9 files changed, 789 insertions(+), 146 deletions(-)
>  create mode 100644 drivers/net/ethernet/intel/i40e/i40e_xsk.c
>  create mode 100644 drivers/net/ethernet/intel/i40e/i40e_xsk.h
>
> diff --git a/drivers/net/ethernet/intel/i40e/Makefile 
> b/drivers/net/ethernet/intel/i40e/Makefile
> index 14397e7e9925..50590e8d1fd1 100644
> --- a/drivers/net/ethernet/intel/i40e/Makefile
> +++ b/drivers/net/ethernet/intel/i40e/Makefile
> @@ -22,6 +22,7 @@ i40e-objs := i40e_main.o \
> i40e_txrx.o \
> i40e_ptp.o  \
> i40e_client.o   \
> -   i40e_virtchnl_pf.o
> +   i40e_virtchnl_pf.o \
> +   i40e_xsk.o
>
>  i40e-$(CONFIG_I40E_DCB) += i40e_dcb.o i40e_dcb_nl.o
> diff --git a/drivers/net/ethernet/intel/i40e/i40e.h 
> b/drivers/net/ethernet/intel/i40e/i40e.h
> index 7a80652e2500..20955e5dce02 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e.h
> +++ b/drivers/net/ethernet/intel/i40e/i40e.h
> @@ -786,6 +786,12 @@ struct i40e_vsi {
>
> /* VSI specific handlers */
> irqreturn_t (*irq_handler)(int irq, void *data);
> +
> +   /* AF_XDP zero-copy */
> +   struct xdp_umem **xsk_umems;
> +   u16 num_xsk_umems_used;
> +   u16 num_xsk_umems;
> +
>  } cacheline_internodealigned_in_smp;
>
>  struct i40e_netdev_priv {
> @@ -1090,6 +1096,20 @@ static inline bool i40e_enabled_xdp_vsi(struct 
> i40e_vsi *vsi)
> return !!vsi->xdp_prog;
>  }
>
> +static inline struct xdp_umem *i40e_xsk_umem(struct i40e_ring *ring)
> +{
> +   bool xdp_on = i40e_enabled_xdp_vsi(ring->vsi);
> +   int qid = ring->queue_index;
> +
> +   if (ring_is_xdp(ring))
> +   qid -= ring->vsi->alloc_queue_pairs;
> +
> +   if (!ring->vsi->xsk_umems || !ring->vsi->xsk_umems[qid] || !xdp_on)
> +   return NULL;
> +
> +   return ring->vsi->xsk_umems[qid];
> +}
> +
>  int i40e_create_queue_channel(struct i40e_vsi *vsi, struct i40e_channel *ch);
>  int i40e_set_bw_limit(struct i40e_vsi *vsi, u16 seid, u64 max_tx_rate);
>  int i40e_add_del_cloud_filter(struct i40e_vsi *vsi,
> @@ -1098,4 +1118,7 @@ int i40e_add_del_cloud_filter(struct i40e_vsi *vsi,
>  int i40e_add_del_cloud_filter_big_buf(struct i40e_vsi *vsi,
>   struct i40e_cloud_filter *filter,
>   bool add);
> +int i40e_queue_pair_enable(struct i40e_vsi *vsi, int queue_pair);
> +int i40e_queue_pair_disable(struct i40e_vsi *vsi, int queue_pair);
> +
>  #endif /* _I40E_H_ */
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
> b/drivers/net/ethernet/intel/i40e/i40e_main.c
> index 369a116edaa1..8c602424d339 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
> @@ -5,6 +5,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  /* Local includes */
>  #include "i40e.h"
> @@ -16,6 +17,7 @@
>   */
>  #define CREATE_TRACE_POINTS
>  #include "i40e_trace.h"
> +#include "i40e_xsk.h"
>
>  const char i40e_driver_name[] = "i40e";
>  static const char i40e_driver_string[] =
> @@ -3071,6 +3073,9 @@ static int i40e_configure_tx_ring(struct i40e_ring 
> *ring)
> i40e_status err = 0;
> u32 qtx_ctl = 0;
>
> +   if (ring_is_xdp(ring))
> +   ring->xsk_umem = i40e_xsk_umem(ring);
> +
> /* some ATR related tx ring init */
> if (vsi->back->flags & I40E_FLAG_FD_ATR_ENABLED) {
>  
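
To make the i40e_xsk_umem() indexing quoted above concrete (values
hypothetical):

/* With vsi->alloc_queue_pairs = 8, the XDP Tx rings sit behind the
 * regular rings, so an XDP ring with queue_index = 10 resolves to
 * qid = 10 - 8 = 2 and shares vsi->xsk_umems[2] with Rx queue 2.
 */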

[RFC PATCH 4/4] net: Add support for subordinate traffic classes to netdev_pick_tx

2018-06-04 Thread Alexander Duyck
This change makes it so that we can support the concept of subordinate
device traffic classes to the core networking code. In doing this we can
start pulling out the driver specific bits needed to support selecting a
queue based on an upper device.

The solution as it currently stands is only partially implemented. I have
the start of some XPS bits in here, but I would still need to allow for
configuration of the XPS maps on the queues reserved for the subordinate
devices. For now I am using the reference to the sb_dev XPS map just as a
way to skip the lookup of the lower device XPS map, as that would
result in the wrong queue being picked.

Signed-off-by: Alexander Duyck 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   15 ++-
 drivers/net/macvlan.c |   10 +
 include/linux/netdevice.h |4 +-
 net/core/dev.c|   55 +++--
 4 files changed, 42 insertions(+), 42 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 05d6d48..c42498d 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -8218,20 +8218,18 @@ static void ixgbe_atr(struct ixgbe_ring *ring,
  input, common, ring->queue_index);
 }
 
+#ifdef IXGBE_FCOE
 static u16 ixgbe_select_queue(struct net_device *dev, struct sk_buff *skb,
  void *accel_priv, select_queue_fallback_t 
fallback)
 {
-   struct ixgbe_fwd_adapter *fwd_adapter = accel_priv;
-#ifdef IXGBE_FCOE
struct ixgbe_adapter *adapter;
struct ixgbe_ring_feature *f;
-#endif
int txq;
 
-   if (fwd_adapter) {
+   if (accel_priv) {
u8 tc = netdev_get_num_tc(dev) ?
netdev_get_prio_tc_map(dev, skb->priority) : 0;
-   struct net_device *vdev = fwd_adapter->netdev;
+   struct net_device *vdev = accel_priv;
 
txq = vdev->tc_to_txq[tc].offset;
txq += reciprocal_scale(skb_get_hash(skb),
@@ -8240,7 +8238,6 @@ static u16 ixgbe_select_queue(struct net_device *dev, 
struct sk_buff *skb,
return txq;
}
 
-#ifdef IXGBE_FCOE
 
/*
 * only execute the code below if protocol is FCoE
@@ -8267,11 +8264,9 @@ static u16 ixgbe_select_queue(struct net_device *dev, 
struct sk_buff *skb,
txq -= f->indices;
 
return txq + f->offset;
-#else
-   return fallback(dev, skb);
-#endif
 }
 
+#endif
 static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
   struct xdp_frame *xdpf)
 {
@@ -10071,7 +10066,6 @@ static void ixgbe_xdp_flush(struct net_device *dev)
.ndo_open   = ixgbe_open,
.ndo_stop   = ixgbe_close,
.ndo_start_xmit = ixgbe_xmit_frame,
-   .ndo_select_queue   = ixgbe_select_queue,
.ndo_set_rx_mode= ixgbe_set_rx_mode,
.ndo_validate_addr  = eth_validate_addr,
.ndo_set_mac_address= ixgbe_set_mac,
@@ -10094,6 +10088,7 @@ static void ixgbe_xdp_flush(struct net_device *dev)
.ndo_poll_controller= ixgbe_netpoll,
 #endif
 #ifdef IXGBE_FCOE
+   .ndo_select_queue   = ixgbe_select_queue,
.ndo_fcoe_ddp_setup = ixgbe_fcoe_ddp_get,
.ndo_fcoe_ddp_target = ixgbe_fcoe_ddp_target,
.ndo_fcoe_ddp_done = ixgbe_fcoe_ddp_put,
diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index adde8fc..401e1d1 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -514,7 +514,6 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct 
net_device *dev)
const struct macvlan_dev *vlan = netdev_priv(dev);
const struct macvlan_port *port = vlan->port;
const struct macvlan_dev *dest;
-   void *accel_priv = NULL;
 
if (vlan->mode == MACVLAN_MODE_BRIDGE) {
const struct ethhdr *eth = (void *)skb->data;
@@ -533,15 +532,10 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct 
net_device *dev)
return NET_XMIT_SUCCESS;
}
}
-
-   /* For packets that are non-multicast and not bridged we will pass
-* the necessary information so that the lowerdev can distinguish
-* the source of the packets via the accel_priv value.
-*/
-   accel_priv = vlan->accel_priv;
 xmit_world:
skb->dev = vlan->lowerdev;
-   return dev_queue_xmit_accel(skb, accel_priv);
+   return dev_queue_xmit_accel(skb,
+   netdev_get_sb_channel(dev) ? dev : NULL);
 }
 
 static inline netdev_tx_t macvlan_netpoll_send_skb(struct macvlan_dev *vlan, 
struct sk_buff *skb)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index bbc8045..df5a582 100644
--- a/include/l

[RFC PATCH 3/4] ixgbe: Add code to populate and use macvlan tc to Tx queue map

2018-06-04 Thread Alexander Duyck
This patch makes it so that we use the tc_to_txq mapping in the macvlan
device in order to select the Tx queue for outgoing packets.

The idea here is to try and move away from using ixgbe_select_queue and to
come up with a generic way to make this work for devices going forward. By
encoding this information in the netdev, this can become a generic
solution for similar setups.

Signed-off-by: Alexander Duyck 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   44 ++---
 1 file changed, 38 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index f460c16..05d6d48 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -5271,6 +5271,8 @@ static void ixgbe_clean_rx_ring(struct ixgbe_ring 
*rx_ring)
 static int ixgbe_fwd_ring_up(struct ixgbe_adapter *adapter,
 struct ixgbe_fwd_adapter *accel)
 {
+   u16 rss_i = adapter->ring_feature[RING_F_RSS].indices;
+   int num_tc = netdev_get_num_tc(adapter->netdev);
struct net_device *vdev = accel->netdev;
int i, baseq, err;
 
@@ -5282,6 +5284,11 @@ static int ixgbe_fwd_ring_up(struct ixgbe_adapter 
*adapter,
accel->rx_base_queue = baseq;
accel->tx_base_queue = baseq;
 
+   /* record configuration for macvlan interface in vdev */
+   for (i = 0; i < num_tc; i++)
+   netdev_bind_sb_channel_queue(adapter->netdev, vdev,
+                                    i, rss_i, baseq + (rss_i * i));
+
for (i = 0; i < adapter->num_rx_queues_per_pool; i++)
adapter->rx_ring[baseq + i]->netdev = vdev;
 
@@ -5306,6 +5313,10 @@ static int ixgbe_fwd_ring_up(struct ixgbe_adapter 
*adapter,
 
netdev_err(vdev, "L2FW offload disabled due to L2 filter error\n");
 
+   /* unbind the queues and drop the subordinate channel config */
+   netdev_unbind_sb_channel(adapter->netdev, vdev);
+   netdev_set_sb_channel(vdev, 0);
+
clear_bit(accel->pool, adapter->fwd_bitmask);
kfree(accel);
 
@@ -8211,18 +8222,22 @@ static u16 ixgbe_select_queue(struct net_device *dev, 
struct sk_buff *skb,
  void *accel_priv, select_queue_fallback_t 
fallback)
 {
struct ixgbe_fwd_adapter *fwd_adapter = accel_priv;
-   struct ixgbe_adapter *adapter;
-   int txq;
 #ifdef IXGBE_FCOE
+   struct ixgbe_adapter *adapter;
struct ixgbe_ring_feature *f;
 #endif
+   int txq;
 
if (fwd_adapter) {
-   adapter = netdev_priv(dev);
-   txq = reciprocal_scale(skb_get_hash(skb),
-  adapter->num_rx_queues_per_pool);
+   u8 tc = netdev_get_num_tc(dev) ?
+   netdev_get_prio_tc_map(dev, skb->priority) : 0;
+   struct net_device *vdev = fwd_adapter->netdev;
+
+   txq = vdev->tc_to_txq[tc].offset;
+   txq += reciprocal_scale(skb_get_hash(skb),
+   vdev->tc_to_txq[tc].count);
 
-   return txq + fwd_adapter->tx_base_queue;
+   return txq;
}
 
 #ifdef IXGBE_FCOE
@@ -8776,6 +8791,11 @@ static int ixgbe_reassign_macvlan_pool(struct net_device 
*vdev, void *data)
/* if we cannot find a free pool then disable the offload */
netdev_err(vdev, "L2FW offload disabled due to lack of queue 
resources\n");
macvlan_release_l2fw_offload(vdev);
+
+   /* unbind the queues and drop the subordinate channel config */
+   netdev_unbind_sb_channel(adapter->netdev, vdev);
+   netdev_set_sb_channel(vdev, 0);
+
kfree(accel);
 
return 0;
@@ -9780,6 +9800,13 @@ static void *ixgbe_fwd_add(struct net_device *pdev, 
struct net_device *vdev)
if (!macvlan_supports_dest_filter(vdev))
return ERR_PTR(-EMEDIUMTYPE);
 
+   /* We need to lock down the macvlan to be a single queue device so that
+* we can reuse the tc_to_txq field in the macvlan netdev to represent
+* the queue mapping to our netdev.
+*/
+   if (netif_is_multiqueue(vdev))
+   return ERR_PTR(-ERANGE);
+
pool = find_first_zero_bit(adapter->fwd_bitmask, adapter->num_rx_pools);
if (pool == adapter->num_rx_pools) {
u16 used_pools = adapter->num_vfs + adapter->num_rx_pools;
@@ -9836,6 +9863,7 @@ static void *ixgbe_fwd_add(struct net_device *pdev, 
struct net_device *vdev)
return ERR_PTR(-ENOMEM);
 
set_bit(pool, adapter->fwd_bitmask);
+   netdev_set_sb_channel(vdev, pool);
accel->pool = pool;
accel->netdev = vdev;
 
@@ -9877,6 +9905,10 @@ static void ixgbe_fwd_del(struct net_device *pdev, void 
*priv)
   

[RFC PATCH 0/4] Add support for DCB and XPS with MACVLAN offload

2018-06-04 Thread Alexander Duyck
This patch set is meant to address a number of shortcomings in the current
implementation of MACVLAN offload. The main issue being the fact that the
current Tx queue implementation for ixgbe required the use of
ndo_select_queue which doesn't necessarily play well with DCB or XPS.

I started this patch set to address that and thought I would submit the
start of the series as an RFC just to make sure I am headed in the right
direction and to see if there is anything I am overlooking. With these
patches applied we now start seeing what I am calling "subordinate
channels" which show themselves via a "-x" suffix where the "-x" can be
anything from "-1" to "-32767". So for example traffic class 0 with a
subordinate class of 5 will be called out as "0-5" if you dump the traffic
class via the sysfs value for the queue.

The main pieces that still need to be handled are the updating of the
ndo_select_queue function and the fallback call to allow support for
passing the accel_priv which I am changing to a netdev. I figure those
patches will be mostly noise so I didn't see the need to include them in
this RFC.

---

Alexander Duyck (4):
  net-sysfs: Drop support for XPS and traffic_class on single queue device
  net: Add support for subordinate device traffic classes
  ixgbe: Add code to populate and use macvlan tc to Tx queue map
  net: Add support for subordinate traffic classes to netdev_pick_tx


 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   55 +++---
 drivers/net/macvlan.c |   10 --
 include/linux/netdevice.h |   20 +++
 net/core/dev.c|  144 +
 net/core/net-sysfs.c  |   36 ++
 5 files changed, 215 insertions(+), 50 deletions(-)

--


[RFC PATCH 1/4] net-sysfs: Drop support for XPS and traffic_class on single queue device

2018-06-04 Thread Alexander Duyck
This patch makes it so that we do not report the traffic class or allow XPS
configuration on single queue devices. This is mostly to avoid unnecessary
complexity with changes I have planned that will allow us to reuse
the unused tc_to_txq and XPS configuration on a single queue device to
allow it to make use of a subset of queues on an underlying device.

Signed-off-by: Alexander Duyck 
---
 net/core/net-sysfs.c |   15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index bb7e80f..335c6a4 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1047,9 +1047,14 @@ static ssize_t traffic_class_show(struct netdev_queue 
*queue,
  char *buf)
 {
struct net_device *dev = queue->dev;
-   int index = get_netdev_queue_index(queue);
-   int tc = netdev_txq_to_tc(dev, index);
+   int index;
+   int tc;
 
+   if (!netif_is_multiqueue(dev))
+   return -ENOENT;
+
+   index = get_netdev_queue_index(queue);
+   tc = netdev_txq_to_tc(dev, index);
if (tc < 0)
return -EINVAL;
 
@@ -1214,6 +1219,9 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
cpumask_var_t mask;
unsigned long index;
 
+   if (!netif_is_multiqueue(dev))
+   return -ENOENT;
+
index = get_netdev_queue_index(queue);
 
if (dev->num_tc) {
@@ -1260,6 +1268,9 @@ static ssize_t xps_cpus_store(struct netdev_queue *queue,
cpumask_var_t mask;
int err;
 
+   if (!netif_is_multiqueue(dev))
+   return -ENOENT;
+
if (!capable(CAP_NET_ADMIN))
return -EPERM;
 



[RFC PATCH 2/4] net: Add support for subordinate device traffic classes

2018-06-04 Thread Alexander Duyck
This patch is meant to provide the basic tools needed to allow us to create
subordinate device traffic classes. The general idea here is to allow
subdividing the queues of a device into queue groups accessible through an
upper device such as a macvlan.

The idea here is to require that an upper device be a
single queue device, ideally with IFF_NO_QUEUE set. With that being the
case we can pretty much guarantee that the tc_to_txq mappings and XPS maps
for the upper device are unused. As such we could reuse those in order to
support subdividing the lower device and distributing those queues between
the subordinate devices.

In order to distinguish between a regular set of traffic classes and if a
device is carrying subordinate traffic classes I changed num_tc from a u8
to an s16 value and use the negative values to represent the subordinate
pool values. So starting at -1 and running to -32768 we can encode those as
pool values, and the existing values of 0 to 15 can be maintained.

Signed-off-by: Alexander Duyck 
---
 include/linux/netdevice.h |   16 
 net/core/dev.c|   89 +
 net/core/net-sysfs.c  |   21 ++-
 3 files changed, 124 insertions(+), 2 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 03ed492..bbc8045 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -569,6 +569,9 @@ struct netdev_queue {
 * (/sys/class/net/DEV/Q/trans_timeout)
 */
unsigned long   trans_timeout;
+
+   /* Subordinate device that the queue has been assigned to */
+   struct net_device   *sb_dev;
 /*
  * write-mostly part
  */
@@ -1960,7 +1963,7 @@ struct net_device {
 #ifdef CONFIG_DCB
const struct dcbnl_rtnl_ops *dcbnl_ops;
 #endif
-   u8  num_tc;
+   s16 num_tc;
struct netdev_tc_txqtc_to_txq[TC_MAX_QUEUE];
u8  prio_tc_map[TC_BITMASK + 1];
 
@@ -2014,6 +2017,17 @@ int netdev_get_num_tc(struct net_device *dev)
return dev->num_tc;
 }
 
+void netdev_unbind_sb_channel(struct net_device *dev,
+ struct net_device *sb_dev);
+int netdev_bind_sb_channel_queue(struct net_device *dev,
+struct net_device *sb_dev,
+u8 tc, u16 count, u16 offset);
+int netdev_set_sb_channel(struct net_device *dev, u16 channel);
+static inline int netdev_get_sb_channel(struct net_device *dev)
+{
+   return max_t(int, -dev->num_tc, 0);
+}
+
 static inline
 struct netdev_queue *netdev_get_tx_queue(const struct net_device *dev,
 unsigned int index)
diff --git a/net/core/dev.c b/net/core/dev.c
index 034c2a9..eba255d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2068,11 +2068,13 @@ int netdev_txq_to_tc(struct net_device *dev, unsigned 
int txq)
	struct netdev_tc_txq *tc = &dev->tc_to_txq[0];
int i;
 
+   /* walk through the TCs and see if it falls into any of them */
for (i = 0; i < TC_MAX_QUEUE; i++, tc++) {
if ((txq - tc->offset) < tc->count)
return i;
}
 
+   /* didn't find it, just return -1 to indicate no match */
return -1;
}
 
@@ -2215,7 +2217,14 @@ int netif_set_xps_queue(struct net_device *dev, const 
struct cpumask *mask,
bool active = false;
 
if (dev->num_tc) {
+   /* Do not allow XPS on subordinate device directly */
num_tc = dev->num_tc;
+   if (num_tc < 0)
+   return -EINVAL;
+
+   /* If queue belongs to subordinate dev use its map */
+   dev = netdev_get_tx_queue(dev, index)->sb_dev ? : dev;
+
tc = netdev_txq_to_tc(dev, index);
if (tc < 0)
return -EINVAL;
@@ -2366,11 +2375,25 @@ int netif_set_xps_queue(struct net_device *dev, const 
struct cpumask *mask,
 EXPORT_SYMBOL(netif_set_xps_queue);
 
 #endif
+static void netdev_unbind_all_sb_channels(struct net_device *dev)
+{
+   struct netdev_queue *txq = &dev->_tx[dev->num_tx_queues];
+
+   /* Unbind any subordinate channels */
+   while (txq-- != &dev->_tx[0]) {
+   if (txq->sb_dev)
+   netdev_unbind_sb_channel(dev, txq->sb_dev);
+   }
+}
+
 void netdev_reset_tc(struct net_device *dev)
 {
 #ifdef CONFIG_XPS
netif_reset_xps_queues_gt(dev, 0);
 #endif
+   netdev_unbind_all_sb_channels(dev);
+
+   /* Reset TC configuration of device */
dev->num_tc = 0;
memset(dev->tc_to_txq, 0, sizeof(dev->tc_to_txq));
memset(dev->prio_tc_map, 0, sizeof(dev->prio_tc_map));
@@ -2399,11 +2422,77 @@ int netde
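
A quick sketch of the resulting encoding (the helper is quoted in the
hunk above):

/* dev->num_tc as an s16 now encodes two things:
 *
 *   num_tc =  4   regular device with 4 traffic classes
 *   num_tc =  0   no TCs configured
 *   num_tc = -5   subordinate channel 5
 *
 * netdev_get_sb_channel() returns max_t(int, -dev->num_tc, 0), so any
 * non-negative num_tc reports channel 0, i.e. not a subordinate device.
 */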

[net PATCH] net-sysfs: Fix memory leak in XPS configuration

2018-05-31 Thread Alexander Duyck
This patch reorders the error cases in showing the XPS configuration so
that we hold off on memory allocation until after we have verified that we
can support XPS on a given ring.

Fixes: 184c449f91fe ("net: Add support for XPS with QoS via traffic classes")
Signed-off-by: Alexander Duyck 
---
 net/core/net-sysfs.c |6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index c476f07..bb7e80f 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1214,9 +1214,6 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
cpumask_var_t mask;
unsigned long index;
 
-   if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
-   return -ENOMEM;
-
index = get_netdev_queue_index(queue);
 
if (dev->num_tc) {
@@ -1226,6 +1223,9 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
return -EINVAL;
}
 
+   if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
+   return -ENOMEM;
+
rcu_read_lock();
dev_maps = rcu_dereference(dev->xps_maps);
if (dev_maps) {
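
For clarity, the leak being closed (sketch of the old control flow in
xps_cpus_show, simplified):

/*
 *	if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
 *		return -ENOMEM;
 *	...
 *	if (tc < 0)
 *		return -EINVAL;		// mask leaked here
 *
 * Allocating only after the tc validation removes the early-return
 * leak path.
 */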



Re: [net-next] i40iw/i40e: Remove link dependency on i40e

2018-05-23 Thread Alexander Duyck
On Tue, May 22, 2018 at 11:19 PM, Christoph Hellwig  wrote:
> On Tue, May 22, 2018 at 02:04:06PM -0700, Jeff Kirsher wrote:
>> > Why would you want to do this? The rdma driver is non-functional
>> > without the ethernet driver, so why on earth would we want to defeat
>> > the module dependency mechanism?
>>
>> This change is driven by the OSV's like Red Hat, where customer's were
>> updating the i40e driver, which in turn broke i40iw.
>
> Doctor it hurts when I do this..
>
> There is no reason to make a mess of our drivers because people are
> doing things they should haver never done and that aren't supported
> in Linux.
>
> If Intel didn't offer any out of tree drivers I'm pretty sure no
> customer would even attempt this.  So fix this where the problem is.

Are you serious? You are never going to see out-of-tree drivers go
away. They exist for the simple reason that most customers/OSVs are
slow to upgrade their kernels so we have people running on a 3.10
something kernel on their RHEL 7.X and want to use the latest greatest
hardware.

I find it ridiculous that you expect us to support a product with
in-kernel drivers only, and it is pretty short-sighted to insist that
that is the only way to support a product.

With that said it probably wouldn't hurt to throw a WARN_ONCE in here
somewhere that basically informs the user they are doing something
stupid and essentially disabling the i40iw driver if they remove i40e.

- Alex


Re: Regression: Approximate 34% performance hit in receive throughput over ixgbe seen due to build_skb patch

2018-05-22 Thread Alexander Duyck
On Tue, May 22, 2018 at 12:29 PM, William Kucharski
<william.kuchar...@oracle.com> wrote:
>
>
>> On May 22, 2018, at 12:23 PM, Alexander Duyck <alexander.du...@gmail.com> 
>> wrote:
>>
>> 3. There should be a private flag that can be updated via "ethtool
>> --set-priv-flags" called "legacy-rx" that you can enable that will
>> roll back to the original that did the copy-break type approach for
>> small packets and the headers of the frame.
>
> With legacy-rx enabled, most of the regression goes away, but it's still 
> present
> as compared to the code without the patch; the regression then drops to about 
> 6%:
>
> # ethtool --show-priv-flags eno1
> Private flags for eno1:
> legacy-rx: on
>
> Socket  Message  Elapsed      Messages
> Size    Size     Time         Okay       Errors   Throughput
> bytes   bytes    secs         #          #        10^6bits/sec
>
>  65536      64   60.00        35934709   0        306.64
>  65536           60.00        33791739            288.35
>
> Socket  Message  Elapsed      Messages
> Size    Size     Time         Okay       Errors   Throughput
> bytes   bytes    secs         #          #        10^6bits/sec
>
>  65536      64   60.00        39254351   0        334.97
>  65536           60.00        36761069            313.69
>
> Is this variance to be expected, or do you think modification of the
> interrupt delay would achieve better results?
>
>
> William Kucharski
>

I would think with modification of interrupt delay you could probably
do much better if my assumption is correct and the issue is us sitting
on packets for too long so we overrun the socket buffer and start
dropping packets or stalling the Tx.

Thanks.

- Alex


Re: Regression: Approximate 34% performance hit in receive throughput over ixgbe seen due to build_skb patch

2018-05-22 Thread Alexander Duyck
On Tue, May 22, 2018 at 11:00 AM, William Kucharski
 wrote:
> A performance hit of approximately 34% in receive numbers for some packet 
> sizes is
> seen when testing traffic over ixgbe links using the network test netperf.
>
> Starting with the top of tree commit 7addb3e4ad3db6a95a953c59884921b5883dcdec,
> a git bisect narrowed the issue down to:
>
> commit 6f429223b31c550b835b4f066ac034d0cf0cc71e
>
> ixgbe: Add support for build_skb
>
> This patch adds build_skb support to the Rx path.  There are several
> advantages to this change.
>
> 1.  It avoids the memcpy and skb->head allocation for small packets which
> improves performance by about 5% in my tests.
> 2.  It avoids the memcpy, skb->head allocation, and eth_get_headlen
> for larger packets improving performance by about 10% in my tests.
> 3.  For VXLAN packets it allows the full header to be in skb->data which
> improves the performance by as much as 30% in some of my tests.
>
> Netperf was sourced from:
>
> https://hewlettpackard.github.io/netperf/
>
> Two machines were directly connected via ixgbe links.
>
> The process "netserver" was started on 10.196.11.8, and running this test:
>
> # netperf -l 60 -H 10.196.11.8 -i 10,2 -I 99,10 -t UDP_STREAM -- -m 64 -s 
> 32768 -S 32768

Okay, so I can already see what the most likely issue is. The
build_skb code is more CPU efficient, but it will consume more memory
in the process since it is avoiding the memcpy and is instead using a
full 2K block of memory for a small frame. I'm suspecting any
performance issue you are seeing may be due to a slow interrupt rate
causing us to either exhaust available Tx memory, or overrun the
available Rx memory.

There end up being multiple ways to address this.
1. Use a larger value for your "-s/-S" values to allow for more memory
to be handled in the queues.
2. Update the interrupt moderation code for the driver. You can either
manually decrease the per-interrupt delay via "ethtool -C" or just
update the adaptive ITR code, see commit b4ded8327fea ("ixgbe: Update
adaptive ITR algorithm").
3. There should be a private flag that can be updated via "ethtool
--set-priv-flags" called "legacy-rx" that you can enable that will
roll back to the original that did the copy-break type approach for
small packets and the headers of the frame.
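
The memory math behind the overrun concern above (numbers approximate):

/* With build_skb each 64-byte datagram charges a full 2K Rx buffer
 * (plus skb overhead) of truesize against the receiver's SO_RCVBUF,
 * so a 32768-byte -S value absorbs only on the order of 14 packets
 * before drops start, versus roughly 40 with the copy-break path.
 * Raising -s/-S (remedy 1) simply deepens that cushion between
 * interrupts.
 */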

Thanks.

- Alex


Re: i40e - Is i40e_force_link_state doing the right thing ?

2018-05-15 Thread Alexander Duyck
On Tue, May 15, 2018 at 1:15 PM, Chaitanya Lala
 wrote:
> Hi,
>
> I am trying to bring up a Intel XL710 4x10G Intel card using the
> latest mainline top-of-tree.
> The problem is that "ifconfig up" and "ifconfig down" do not take
> effect at the link state level.
> I tracked the problem down to i40e_force_link_state() when it is
> called from i40e_down().
> It calls i40e_force_link_state with "is_up" == false. In-turn it
> calls, i40e_aq_set_link_restart_an(hw, true, NULL).
>
> Should the second argument of  i40e_aq_set_link_restart_an be "is_up"
> vs the current "true"
> i.e. i40e_aq_set_link_restart_an(hw, is_up, NULL)? When I make this
> change, the link state syncs-up with
> the interface administrative state.
>
> Is this a bug ?
>
> Thanks,
> Chaitanya

If you are calling i40e_down the assumption is you are bringing the
interface down, so as not to pass traffic. That is why the "is_up" is
set to false.

Could you provide the dmesg results for when you load the driver and
execute the "ifconfig up" and "ifconfig down" commands? That would
help us to understand what might be going on.

Thanks.

- Alex


Re: [RFC PATCH bpf-next 11/12] i40e: implement AF_XDP zero-copy support for Rx

2018-05-15 Thread Alexander Duyck
On Tue, May 15, 2018 at 12:06 PM, Björn Töpel  wrote:
> From: Björn Töpel 
>
> A lot of things here. First we add support for the new
> XDP_SETUP_XSK_UMEM command in ndo_bpf. This allows the AF_XDP socket
> to pass a UMEM to the driver. The driver will then DMA map all the
> frames in the UMEM for the driver. Next, the Rx code will allocate
> frames from the UMEM fill queue, instead of the regular page
> allocator.
>
> Externally, for the rest of the XDP code, the driver-internal UMEM
> allocator will appear as a MEM_TYPE_ZERO_COPY.
>
> Keep in mind that having frames coming from userland requires some
> extra care taken when passing them to the regular kernel stack. In
> these cases the ZC frame must be copied.
>
> The commit also introduces a completely new clean_rx_irq/allocator
> functions for zero-copy, and means (function pointers) to set
> allocators and clean_rx functions.
>
> Finally, a lot of this are *not* implemented here. To mention some:
>
> * No passing to the stack via XDP_PASS (clone/copy to skb).
> * No XDP redirect to other than sockets (convert_to_xdp_frame does not
>   clone the frame yet).
>
> And yes, too much C and too big commit. :-)
>
> Signed-off-by: Björn Töpel 

A few minor comments below.

> ---
>  drivers/net/ethernet/intel/i40e/i40e.h  |  20 ++
>  drivers/net/ethernet/intel/i40e/i40e_main.c | 202 +-
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c | 400 
> ++--
>  drivers/net/ethernet/intel/i40e/i40e_txrx.h |  30 ++-
>  4 files changed, 619 insertions(+), 33 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/i40e/i40e.h 
> b/drivers/net/ethernet/intel/i40e/i40e.h
> index 7a80652e2500..e6ee6c9bf094 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e.h
> +++ b/drivers/net/ethernet/intel/i40e/i40e.h
> @@ -786,6 +786,12 @@ struct i40e_vsi {
>
> /* VSI specific handlers */
> irqreturn_t (*irq_handler)(int irq, void *data);
> +
> +   /* AF_XDP zero-copy */
> +   struct xdp_umem **xsk_umems;
> +   u16 num_xsk_umems_used;
> +   u16 num_xsk_umems;
> +
>  } cacheline_internodealigned_in_smp;
>
>  struct i40e_netdev_priv {
> @@ -1090,6 +1096,20 @@ static inline bool i40e_enabled_xdp_vsi(struct 
> i40e_vsi *vsi)
> return !!vsi->xdp_prog;
>  }
>
> +static inline struct xdp_umem *i40e_xsk_umem(struct i40e_ring *ring)
> +{
> +   bool xdp_on = i40e_enabled_xdp_vsi(ring->vsi);
> +   int qid = ring->queue_index;
> +
> +   if (ring_is_xdp(ring))
> +   qid -= ring->vsi->alloc_queue_pairs;
> +
> +   if (!ring->vsi->xsk_umems || !ring->vsi->xsk_umems[qid] || !xdp_on)
> +   return NULL;
> +
> +   return ring->vsi->xsk_umems[qid];
> +}
> +
>  int i40e_create_queue_channel(struct i40e_vsi *vsi, struct i40e_channel *ch);
>  int i40e_set_bw_limit(struct i40e_vsi *vsi, u16 seid, u64 max_tx_rate);
>  int i40e_add_del_cloud_filter(struct i40e_vsi *vsi,
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
> b/drivers/net/ethernet/intel/i40e/i40e_main.c
> index b4c23cf3979c..dc3d668a741e 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
> @@ -5,6 +5,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  /* Local includes */
>  #include "i40e.h"
> @@ -3054,6 +3055,9 @@ static int i40e_configure_tx_ring(struct i40e_ring 
> *ring)
> i40e_status err = 0;
> u32 qtx_ctl = 0;
>
> +   if (ring_is_xdp(ring))
> +   ring->xsk_umem = i40e_xsk_umem(ring);
> +
> /* some ATR related tx ring init */
> if (vsi->back->flags & I40E_FLAG_FD_ATR_ENABLED) {
> ring->atr_sample_rate = vsi->back->atr_sample_rate;
> @@ -3163,13 +3167,31 @@ static int i40e_configure_rx_ring(struct i40e_ring 
> *ring)
> struct i40e_hw *hw = &vsi->back->hw;
> struct i40e_hmc_obj_rxq rx_ctx;
> i40e_status err = 0;
> +   int ret;
>
> bitmap_zero(ring->state, __I40E_RING_STATE_NBITS);
>
> /* clear the context structure first */
> memset(&rx_ctx, 0, sizeof(rx_ctx));
>
> -   ring->rx_buf_len = vsi->rx_buf_len;
> +   ring->xsk_umem = i40e_xsk_umem(ring);
> +   if (ring->xsk_umem) {
> +   ring->clean_rx_irq = i40e_clean_rx_irq_zc;
> +   ring->alloc_rx_buffers = i40e_alloc_rx_buffers_zc;
> +   ring->rx_buf_len = ring->xsk_umem->props.frame_size -
> +  ring->xsk_umem->frame_headroom -
> +  XDP_PACKET_HEADROOM;
> +   ring->zca.free = i40e_zca_free;
> +   ret = xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
> +MEM_TYPE_ZERO_COPY,
> +&ring->zca);
> +   if (ret)
> +   return ret;
> +   } else {
> +   

Re: [net-next PATCH v2 0/8] UDP GSO Segmentation clean-ups

2018-05-09 Thread Alexander Duyck
On Wed, May 9, 2018 at 8:39 AM, Willem de Bruijn
<willemdebruijn.ker...@gmail.com> wrote:
> On Mon, May 7, 2018 at 2:02 PM, Alexander Duyck
> <alexander.du...@gmail.com> wrote:
>> On Sat, May 5, 2018 at 3:06 AM, Willem de Bruijn
>> <willemdebruijn.ker...@gmail.com> wrote:
>>> On Fri, May 4, 2018 at 8:28 PM, Alexander Duyck
>>> <alexander.du...@gmail.com> wrote:
>>>> This patch set addresses a number of issues I found while sorting out
>>>> enabling UDP GSO Segmentation support for ixgbe/ixgbevf. Specifically there
>>>> were a number of issues related to the checksum and such that seemed to
>>>> cause either minor irregularities or kernel panics in the case of the
>>>> offload request being allowed to traverse between name spaces.
>>>
>>> Were you able to traverse GSO packets between network namespace before
>>> adding to NETIF_F_GSO_SOFTWARE? It does appear that veth includes
>>> NETIF_F_GSO_ENCAP_ALL, which also allows GSO.
>>
>> Without that change the tunnel wouldn't pass the requests between
>> namespaces. However with it I was able to easily test the software
>> checksum code as otherwise the socket was returning EIO when the
>> hardware checksum was disabled.
>>
>>> In either case, it should not be possible for GSO packets to arrive on a 
>>> veth
>>> device, as that can result in queuing the GSO packet to a recipient socket.
>>> In this regard veth is like loopback and must exclude GSO support.
>>>
>>> I'll take a look.
>>
>> I suspect it was probably sending veth UDP segmentation offload
>> requests. For now I can probably drop the patch that was adding it and
>> it can be added later to individual drivers if needed.
>
> I just tested udpgso_bench_tx over veth (on a commit without your
> patchset).
>
> Having NETIF_F_GSO_UDP_L4 in NETIF_F_GSO_ENCAP_ALL
> and NETIF_F_GSO_ENCAP_ALL in VETH_FEATURES is
> sufficient to receive large packets on the veth peer.
>
> This is clearly a bug, as is for any device that may loop packets
> onto a local socket. Such as macvlan in bridge mode.
>
> I will have to revise commit 83aa025f535f ("udp: add gso support
> to virtual devices")
>
> It remains useful to have this capability on the bonding device. I
> might remove the flag from NETIF_F_GSO_ENCAP_ALL and add
> it specifically to that device.
>
> This is also all relevant to future work of NETIF_F_GSO_SOFTWARE.

Sounds like a plan. In the meantime I am going to see about getting
some internal paperwork taken care of to get UDP GSO added to
ixgbe/ixgbevf as an official feature.

I need to finish up some work I am doing on macvlan over the next
couple of weeks so I won't be focusing on this code for the next month
or so.

Thanks.

- Alex


Re: [net-next PATCH 0/3] Symmetric queue selection using XPS for Rx queues

2018-05-08 Thread Alexander Duyck
On Tue, May 8, 2018 at 10:07 AM, Eric Dumazet <eric.duma...@gmail.com> wrote:
>
>
> On 05/08/2018 09:02 AM, Alexander Duyck wrote:
>> On Tue, May 8, 2018 at 8:15 AM, Tom Herbert <t...@herbertland.com> wrote:
>>> On Thu, Apr 19, 2018 at 7:41 PM, Eric Dumazet <eduma...@google.com> wrote:
>>>> On Thu, Apr 19, 2018 at 6:07 PM Amritha Nambiar <amritha.namb...@intel.com>
>>>> wrote:
>>>>
>>>>> This patch series implements support for Tx queue selection based on
>>>>> Rx queue map. This is done by configuring Rx queue map per Tx-queue
>>>>> using sysfs attribute. If the user configuration for Rx queues does
>>>>> not apply, then the Tx queue selection falls back to XPS using CPUs and
>>>>> finally to hashing.
>>>>
>>>>> XPS is refactored to support Tx queue selection based on either the
>>>>> CPU map or the Rx-queue map. The config option CONFIG_XPS needs to be
>>>>> enabled. By default no receive queues are configured for the Tx queue.
>>>>
>>>>> - /sys/class/net/eth0/queues/tx-*/xps_rxqs
>>>>
>>>>> This is to enable sending packets on the same Tx-Rx queue pair as this
>>>>> is useful for busy polling multi-threaded workloads where it is not
>>>>> possible to pin the threads to a CPU. This is a rework of Sridhar's
>>>>> patch for symmetric queueing via socket option:
>>>>> https://www.spinics.net/lists/netdev/msg453106.html
>>>>
>>> I suspect this is an artifact of flow director which I believe
>>> required queue pairs to be able to work (i.e. the receive queue chosen by
>>> hardware is determined by the send queue). But that was only required because
>>> of hardware design, I don't see the rationale for introducing queue
>>> pairs in the software stack. There's no need to correlate the transmit
>>> path with receive path, no need to enforce a 1-1 mapping between RX
>>> and TX queues, and the OOO mitigations should be sufficient when TX
>>> queue changes for a flow.
>>>
>>> Tom
>>
>> If I am not mistaken I think there are benefits to doing this sort of
>> thing with polling as it keeps the Tx work locked into the same queue
>> pair that a given application is polling on. So as a result you can
>> keep the interrupts contained to the queue pair that is being busy
>> polled on and if the application cleans up the packets during the busy
>> poll it ends up being a net savings in terms of both latency and power
>> since the Tx clean-up happens sooner, and it happens on the queue that
>> is already busy polling instead of possibly triggering an interrupt on
>> another CPU.
>>
>> So for example in the case of routing and bridging workloads we
>> already had code that would take the Rx queue and associate it to a Tx
>> queue. One of the ideas behind doing this is to try and keep the CPU
>> overhead low by having a 1:1 mapping. In the case of this code we
>> allow for a little more flexibility in that you could have
>> many-to-many mappings but the general idea and common use case is the
>> same which is a 1:1 mapping.
>
>
> I thought we had everything in place to be able to have this already.
>
> Setting IRQ affinities and XPS is certainly something doable.
>
> This is why I wanted a proper documentation of yet another way to reach the
> same behavior.

IRQ affinities and XPS work for pure NAPI setups, but the problem is
you have to also do application affinity in the case of busy polling
which can provide some additional challenges since then you have to
add code in your application to associate a given queue/CPU to a given
application thread. I believe this is a way of simplifying this.

I agree on the documentation aspect. The usage of this should be well
documented as well as the why of using it.
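
To sketch the fallback order the cover letter describes (the helper
names here are hypothetical, not the actual net-next code):

    /* Hypothetical helpers standing in for the series' map lookups;
     * each returns a queue index, or -1 when its map has no entry.
     */
    static int example_xps_rxqs_lookup(struct net_device *dev,
                                       struct sk_buff *skb);
    static int example_xps_cpus_lookup(struct net_device *dev,
                                       struct sk_buff *skb);
    static u16 example_flow_hash(struct net_device *dev,
                                 struct sk_buff *skb);

    static u16 example_select_queue(struct net_device *dev,
                                    struct sk_buff *skb)
    {
            /* 1. per-Tx-queue Rx-queue map (the new xps_rxqs file) */
            int txq = example_xps_rxqs_lookup(dev, skb);

            /* 2. classic XPS based on the sending CPU */
            if (txq < 0)
                    txq = example_xps_cpus_lookup(dev, skb);

            /* 3. finally fall back to flow hashing */
            return txq < 0 ? example_flow_hash(dev, skb) : txq;
    }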

- Alex


Re: [net-next PATCH 0/3] Symmetric queue selection using XPS for Rx queues

2018-05-08 Thread Alexander Duyck
On Tue, May 8, 2018 at 8:15 AM, Tom Herbert  wrote:
> On Thu, Apr 19, 2018 at 7:41 PM, Eric Dumazet  wrote:
>> On Thu, Apr 19, 2018 at 6:07 PM Amritha Nambiar 
>> wrote:
>>
>>> This patch series implements support for Tx queue selection based on
>>> Rx queue map. This is done by configuring Rx queue map per Tx-queue
>>> using sysfs attribute. If the user configuration for Rx queues does
>>> not apply, then the Tx queue selection falls back to XPS using CPUs and
>>> finally to hashing.
>>
>>> XPS is refactored to support Tx queue selection based on either the
>>> CPU map or the Rx-queue map. The config option CONFIG_XPS needs to be
>>> enabled. By default no receive queues are configured for the Tx queue.
>>
>>> - /sys/class/net/eth0/queues/tx-*/xps_rxqs
>>
>>> This is to enable sending packets on the same Tx-Rx queue pair as this
>>> is useful for busy polling multi-threaded workloads where it is not
>>> possible to pin the threads to a CPU. This is a rework of Sridhar's
>>> patch for symmetric queueing via socket option:
>>> https://www.spinics.net/lists/netdev/msg453106.html
>>
> I suspect this is an artifact of flow director which I believe
> required queue pairs to be able to work (i.e. the receive queue chosen by
> hardware is determined by the send queue). But that was only required because
> of hardware design, I don't see the rationale for introducing queue
> pairs in the software stack. There's no need to correlate the transmit
> path with receive path, no need to enforce a 1-1 mapping between RX
> and TX queues, and the OOO mitigations should be sufficient when TX
> queue changes for a flow.
>
> Tom

If I am not mistaken I think there are benefits to doing this sort of
thing with polling as it keeps the Tx work locked into the same queue
pair that a given application is polling on. So as a result you can
keep the interrupts contained to the queue pair that is being busy
polled on and if the application cleans up the packets during the busy
poll it ends up being a net savings in terms of both latency and power
since the Tx clean-up happens sooner, and it happens on the queue that
is already busy polling instead of possibly triggering an interrupt on
another CPU.

So for example in the case of routing and bridging workloads we
already had code that would take the Rx queue and associate it to a Tx
queue. One of the ideas behind doing this is to try and keep the CPU
overhead low by having a 1:1 mapping. In the case of this code we
allow for a little more flexibility in that you could have
many-to-many mappings but the general idea and common use case is the
same which is a 1:1 mapping.

- Alex


Re: [net-next PATCH v3 0/6] UDP GSO Segmentation clean-ups

2018-05-07 Thread Alexander Duyck
On Mon, May 7, 2018 at 11:08 AM, Alexander Duyck
<alexander.du...@gmail.com> wrote:
> This patch set addresses a number of issues I found while sorting out
> enabling UDP GSO Segmentation support for ixgbe/ixgbevf. Specifically there
> were a number of issues related to the checksum and such that seemed to
> cause either minor irregularities or kernel panics in the case of the
> offload request being allowed to traverse between name spaces.
>
> With this set applied I was able to get UDP GSO traffic to pass over
> vxlan tunnels in both offloaded modes and non-offloaded modes for ixgbe and
> ixgbevf.
>
> I submitted the driver specific patches earlier as an RFC:
> https://patchwork.ozlabs.org/project/netdev/list/?series=42477=both=*
>
> v2: Updated patches based on feedback from Eric Dumazet
> Split first patch into several patches based on feedback from Eric
> v3: Drop patch that was calling pskb_may_pull as it was redundant.
> Added code to use MANGLED_0 in case of UDP checksum
> Drop patch adding NETIF_F_GSO_UDP_L4 to list of GSO software offloads
> Added Acked-by for patches reviewed by Willem and not changed

Just noticed I forgot to update the subject before sending out the
cover page. I updated it for this reply. If needed I will submit a v4,
but for now I will leave this out here to finish up review.

Thanks.

- Alex


Re: [net-next PATCH v3 4/6] udp: Partially unroll handling of first segment and last segment

2018-05-07 Thread Alexander Duyck
On Mon, May 7, 2018 at 12:54 PM, Willem de Bruijn
<willemdebruijn.ker...@gmail.com> wrote:
> On Mon, May 7, 2018 at 2:57 PM, Willem de Bruijn
> <willemdebruijn.ker...@gmail.com> wrote:
>> On Mon, May 7, 2018 at 2:08 PM, Alexander Duyck
>> <alexander.du...@gmail.com> wrote:
>>> From: Alexander Duyck <alexander.h.du...@intel.com>
>>>
>>> This patch allows us to take care of unrolling the first segment and the
>>> last segment of the loop for processing the segmented skb. Part of the
>>> motivation for this is that it makes it easier to process the fact that the
>>> first frame and all of the frames in between should be mostly identical
>>> in terms of header data, and the last frame has differences in the length
>>> and partial checksum.
>>>
>>> In addition I am dropping the header length calculation since we don't
>>> really need it for anything but the last frame and it can be easily
>>> obtained by just pulling the data_len and offset of tail from the transport
>>> header.
>>>
>>> Signed-off-by: Alexander Duyck <alexander.h.du...@intel.com>
>>
>> I'm not a fan of the more complicated control flow, as I pointed out
>> before. It only seems to save one assignment to uh from segs.
>>
>> Both follow-up patches are now more complex, because they need
>> to add the same code in two locations.
>
> With that said, if you feel strongly, I don't object.
>
> The removal of hdrlen and simplification of arguments is definitely
> an improvement.

Thanks for being understanding about this.

My preference is to keep the loop unrolled as it is since that way it
is not too different from the way we handle this for TCP so it will
make maintenance of the two easier. Otherwise I have to add a bunch of
conditional checks inside the loop.

The other advantage to unrolling it as I did is that I don't have to
deal with a ton of extra indentation for an if statement inside of a
while loop.

- Alex


[net-next PATCH v3 6/6] udp: Do not copy destructor if one is not present

2018-05-07 Thread Alexander Duyck
From: Alexander Duyck <alexander.h.du...@intel.com>

This patch makes it so that if a destructor is not present we avoid trying
to update the skb socket or any reference counting that would be associated
with the NULL socket and/or descriptor. By doing this we can support
traffic coming from another namespace without any issues.

Acked-by: Willem de Bruijn <will...@google.com>
Signed-off-by: Alexander Duyck <alexander.h.du...@intel.com>
---
 net/ipv4/udp_offload.c |   22 ++
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index d4f2daca0c33..ede2a7305b90 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -195,6 +195,7 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
struct sk_buff *segs, *seg;
struct udphdr *uh;
unsigned int mss;
+   bool copy_dtor;
__sum16 check;
__be16 newlen;
 
@@ -205,12 +206,14 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
skb_pull(gso_skb, sizeof(*uh));
 
/* clear destructor to avoid skb_segment assigning it to tail */
-   WARN_ON_ONCE(gso_skb->destructor != sock_wfree);
-   gso_skb->destructor = NULL;
+   copy_dtor = gso_skb->destructor == sock_wfree;
+   if (copy_dtor)
+   gso_skb->destructor = NULL;
 
segs = skb_segment(gso_skb, features);
if (unlikely(IS_ERR_OR_NULL(segs))) {
-   gso_skb->destructor = sock_wfree;
+   if (copy_dtor)
+   gso_skb->destructor = sock_wfree;
return segs;
}
 
@@ -229,9 +232,11 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
check = csum16_add(csum16_sub(uh->check, uh->len), newlen);
 
for (;;) {
-   seg->destructor = sock_wfree;
-   seg->sk = sk;
-   sum_truesize += seg->truesize;
+   if (copy_dtor) {
+   seg->destructor = sock_wfree;
+   seg->sk = sk;
+   sum_truesize += seg->truesize;
+   }
 
if (!seg->next)
break;
@@ -263,8 +268,9 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
uh->check = gso_make_checksum(seg, ~check) ? : CSUM_MANGLED_0;
 
/* update refcount for the packet */
-   refcount_add(sum_truesize - gso_skb->truesize, &sk->sk_wmem_alloc);
-
+   if (copy_dtor)
+   refcount_add(sum_truesize - gso_skb->truesize,
+&sk->sk_wmem_alloc);
return segs;
 }
 EXPORT_SYMBOL_GPL(__udp_gso_segment);



[net-next PATCH v3 5/6] udp: Add support for software checksum and GSO_PARTIAL with GSO offload

2018-05-07 Thread Alexander Duyck
From: Alexander Duyck <alexander.h.du...@intel.com>

This patch adds support for a software provided checksum and GSO_PARTIAL
segmentation support. With this we can offload UDP segmentation on devices
that only have partial support for tunnels.

Since we are no longer needing the hardware checksum we can drop the checks
in the segmentation code that were verifying if it was present.

Signed-off-by: Alexander Duyck <alexander.h.du...@intel.com>
---
 net/ipv4/udp_offload.c |   29 +++--
 net/ipv6/udp_offload.c |   11 +--
 2 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index b15c78ac3f23..d4f2daca0c33 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -214,6 +214,13 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
return segs;
}
 
+   /* GSO partial and frag_list segmentation only requires splitting
+* the frame into an MSS multiple and possibly a remainder, both
+* cases return a GSO skb. So update the mss now.
+*/
+   if (skb_is_gso(segs))
+   mss *= skb_shinfo(segs)->gso_segs;
+
seg = segs;
uh = udp_hdr(seg);
 
@@ -232,6 +239,12 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
uh->len = newlen;
uh->check = check;
 
+   if (seg->ip_summed == CHECKSUM_PARTIAL)
+   gso_reset_checksum(seg, ~check);
+   else
+   uh->check = gso_make_checksum(seg, ~check) ? :
+   CSUM_MANGLED_0;
+
seg = seg->next;
uh = udp_hdr(seg);
}
@@ -244,6 +257,11 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
uh->len = newlen;
uh->check = check;
 
+   if (seg->ip_summed == CHECKSUM_PARTIAL)
+   gso_reset_checksum(seg, ~check);
+   else
+   uh->check = gso_make_checksum(seg, ~check) ? : CSUM_MANGLED_0;
+
/* update refcount for the packet */
refcount_add(sum_truesize - gso_skb->truesize, &sk->sk_wmem_alloc);
 
@@ -251,15 +269,6 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
 }
 EXPORT_SYMBOL_GPL(__udp_gso_segment);
 
-static struct sk_buff *__udp4_gso_segment(struct sk_buff *gso_skb,
- netdev_features_t features)
-{
-   if (!can_checksum_protocol(features, htons(ETH_P_IP)))
-   return ERR_PTR(-EIO);
-
-   return __udp_gso_segment(gso_skb, features);
-}
-
 static struct sk_buff *udp4_ufo_fragment(struct sk_buff *skb,
 netdev_features_t features)
 {
@@ -283,7 +292,7 @@ static struct sk_buff *udp4_ufo_fragment(struct sk_buff 
*skb,
goto out;
 
if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4)
-   return __udp4_gso_segment(skb, features);
+   return __udp_gso_segment(skb, features);
 
mss = skb_shinfo(skb)->gso_size;
if (unlikely(skb->len <= mss))
diff --git a/net/ipv6/udp_offload.c b/net/ipv6/udp_offload.c
index 61e34f1d2fa2..03a2ff3fe1e6 100644
--- a/net/ipv6/udp_offload.c
+++ b/net/ipv6/udp_offload.c
@@ -17,15 +17,6 @@
 #include 
 #include "ip6_offload.h"
 
-static struct sk_buff *__udp6_gso_segment(struct sk_buff *gso_skb,
- netdev_features_t features)
-{
-   if (!can_checksum_protocol(features, htons(ETH_P_IPV6)))
-   return ERR_PTR(-EIO);
-
-   return __udp_gso_segment(gso_skb, features);
-}
-
 static struct sk_buff *udp6_ufo_fragment(struct sk_buff *skb,
 netdev_features_t features)
 {
@@ -58,7 +49,7 @@ static struct sk_buff *udp6_ufo_fragment(struct sk_buff *skb,
goto out;
 
if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4)
-   return __udp6_gso_segment(skb, features);
+   return __udp_gso_segment(skb, features);
 
/* Do software UFO. Complete and fill in the UDP checksum as HW 
cannot
 * do checksum of UDP packets sent as multiple IP fragments.



[net-next PATCH v3 4/6] udp: Partially unroll handling of first segment and last segment

2018-05-07 Thread Alexander Duyck
From: Alexander Duyck <alexander.h.du...@intel.com>

This patch allows us to take care of unrolling the first segment and the
last segment of the loop for processing the segmented skb. Part of the
motivation for this is that it makes it easier to process the fact that the
first frame and all of the frames in between should be mostly identical
in terms of header data, and the last frame has differences in the length
and partial checksum.

In addition I am dropping the header length calculation since we don't
really need it for anything but the last frame and it can be easily
obtained by just pulling the data_len and offset of tail from the transport
header.

Signed-off-by: Alexander Duyck <alexander.h.du...@intel.com>
---
 net/ipv4/udp_offload.c |   33 +++--
 1 file changed, 19 insertions(+), 14 deletions(-)

diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 92c182e99ddc..b15c78ac3f23 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -193,7 +193,6 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
struct sock *sk = gso_skb->sk;
unsigned int sum_truesize = 0;
struct sk_buff *segs, *seg;
-   unsigned int hdrlen;
struct udphdr *uh;
unsigned int mss;
__sum16 check;
@@ -203,7 +202,6 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
if (gso_skb->len <= sizeof(*uh) + mss)
return ERR_PTR(-EINVAL);
 
-   hdrlen = gso_skb->data - skb_mac_header(gso_skb);
skb_pull(gso_skb, sizeof(*uh));
 
/* clear destructor to avoid skb_segment assigning it to tail */
@@ -216,30 +214,37 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
return segs;
}
 
-   uh = udp_hdr(segs);
+   seg = segs;
+   uh = udp_hdr(seg);
 
/* compute checksum adjustment based on old length versus new */
newlen = htons(sizeof(*uh) + mss);
check = csum16_add(csum16_sub(uh->check, uh->len), newlen);
 
-   for (seg = segs; seg; seg = seg->next) {
-   uh = udp_hdr(seg);
+   for (;;) {
+   seg->destructor = sock_wfree;
+   seg->sk = sk;
+   sum_truesize += seg->truesize;
 
-   /* last packet can be partial gso_size */
-   if (!seg->next) {
-   newlen = htons(seg->len - hdrlen);
-   check = csum16_add(csum16_sub(uh->check, uh->len),
-  newlen);
-   }
+   if (!seg->next)
+   break;
 
uh->len = newlen;
uh->check = check;
 
-   seg->destructor = sock_wfree;
-   seg->sk = sk;
-   sum_truesize += seg->truesize;
+   seg = seg->next;
+   uh = udp_hdr(seg);
}
 
+   /* last packet can be partial gso_size, account for that in checksum */
+   newlen = htons(skb_tail_pointer(seg) - skb_transport_header(seg) +
+  seg->data_len);
+   check = csum16_add(csum16_sub(uh->check, uh->len), newlen);
+
+   uh->len = newlen;
+   uh->check = check;
+
+   /* update refcount for the packet */
refcount_add(sum_truesize - gso_skb->truesize, &sk->sk_wmem_alloc);
 
return segs;



[net-next PATCH v3 3/6] udp: Do not pass checksum as a parameter to GSO segmentation

2018-05-07 Thread Alexander Duyck
From: Alexander Duyck <alexander.h.du...@intel.com>

This patch is meant to allow us to avoid having to recompute the checksum
from scratch and have it passed as a parameter.

Instead of taking that approach we can take advantage of the fact that the
length that was used to compute the existing checksum is included in the
UDP header.

Finally to avoid the need to invert the result we can just call csum16_add
and csum16_sub directly. By doing this we can avoid a number of
instructions in the loop that is handling segmentation.

Signed-off-by: Alexander Duyck <alexander.h.du...@intel.com>
---
 include/net/udp.h  |3 +--
 net/ipv4/udp_offload.c |   32 ++--
 net/ipv6/udp_offload.c |7 +--
 3 files changed, 20 insertions(+), 22 deletions(-)

diff --git a/include/net/udp.h b/include/net/udp.h
index 8bd83b044ecd..9289b6425032 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -175,8 +175,7 @@ struct sk_buff **udp_gro_receive(struct sk_buff **head, 
struct sk_buff *skb,
 int udp_gro_complete(struct sk_buff *skb, int nhoff, udp_lookup_t lookup);
 
 struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
- netdev_features_t features,
- __sum16 check);
+ netdev_features_t features);
 
 static inline struct udphdr *udp_gro_udphdr(struct sk_buff *skb)
 {
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index c1afcd2f1a76..92c182e99ddc 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -188,8 +188,7 @@ struct sk_buff *skb_udp_tunnel_segment(struct sk_buff *skb,
 EXPORT_SYMBOL(skb_udp_tunnel_segment);
 
 struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
- netdev_features_t features,
- __sum16 check)
+ netdev_features_t features)
 {
struct sock *sk = gso_skb->sk;
unsigned int sum_truesize = 0;
@@ -197,6 +196,8 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
unsigned int hdrlen;
struct udphdr *uh;
unsigned int mss;
+   __sum16 check;
+   __be16 newlen;
 
mss = skb_shinfo(gso_skb)->gso_size;
if (gso_skb->len <= sizeof(*uh) + mss)
@@ -215,17 +216,25 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
return segs;
}
 
+   uh = udp_hdr(segs);
+
+   /* compute checksum adjustment based on old length versus new */
+   newlen = htons(sizeof(*uh) + mss);
+   check = csum16_add(csum16_sub(uh->check, uh->len), newlen);
+
for (seg = segs; seg; seg = seg->next) {
uh = udp_hdr(seg);
-   uh->len = htons(seg->len - hdrlen);
-   uh->check = check;
 
/* last packet can be partial gso_size */
-   if (!seg->next)
-   csum_replace2(&uh->check, htons(mss),
- htons(seg->len - hdrlen - sizeof(*uh)));
+   if (!seg->next) {
+   newlen = htons(seg->len - hdrlen);
+   check = csum16_add(csum16_sub(uh->check, uh->len),
+  newlen);
+   }
+
+   uh->len = newlen;
+   uh->check = check;
 
-   uh->check = ~uh->check;
seg->destructor = sock_wfree;
seg->sk = sk;
sum_truesize += seg->truesize;
@@ -240,15 +249,10 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
 static struct sk_buff *__udp4_gso_segment(struct sk_buff *gso_skb,
  netdev_features_t features)
 {
-   const struct iphdr *iph = ip_hdr(gso_skb);
-   unsigned int mss = skb_shinfo(gso_skb)->gso_size;
-
if (!can_checksum_protocol(features, htons(ETH_P_IP)))
return ERR_PTR(-EIO);
 
-   return __udp_gso_segment(gso_skb, features,
-udp_v4_check(sizeof(struct udphdr) + mss,
- iph->saddr, iph->daddr, 0));
+   return __udp_gso_segment(gso_skb, features);
 }
 
 static struct sk_buff *udp4_ufo_fragment(struct sk_buff *skb,
diff --git a/net/ipv6/udp_offload.c b/net/ipv6/udp_offload.c
index dea03ec09715..61e34f1d2fa2 100644
--- a/net/ipv6/udp_offload.c
+++ b/net/ipv6/udp_offload.c
@@ -20,15 +20,10 @@
 static struct sk_buff *__udp6_gso_segment(struct sk_buff *gso_skb,
  netdev_features_t features)
 {
-   const struct ipv6hdr *ip6h = ipv6_hdr(gso_skb);
-   unsigned int mss = skb_shinfo(gso_skb)->gso_size;
-
if (!can_checksum_protocol(features, htons(ETH_P_IPV6)))
return ERR_PTR(-EIO);
 
-   return __udp_gso_segment(gso_skb, features,
-  

[net-next PATCH v3 0/6] Series short description

2018-05-07 Thread Alexander Duyck
This patch set addresses a number of issues I found while sorting out
enabling UDP GSO Segmentation support for ixgbe/ixgbevf. Specifically there
were a number of issues related to the checksum and such that seemed to
cause either minor irregularities or kernel panics in the case of the
offload request being allowed to traverse between name spaces.

With this set applied I was able to get UDP GSO traffic to pass over
vxlan tunnels in both offloaded modes and non-offloaded modes for ixgbe and
ixgbevf.

I submitted the driver specific patches earlier as an RFC:
https://patchwork.ozlabs.org/project/netdev/list/?series=42477=both=*

v2: Updated patches based on feedback from Eric Dumazet
Split first patch into several patches based on feedback from Eric
v3: Drop patch that was calling pskb_may_pull as it was redundant.
Added code to use MANGLED_0 in case of UDP checksum
Drop patch adding NETIF_F_GSO_UDP_L4 to list of GSO software offloads
Added Acked-by for patches reviewed by Willem and not changed

---

Alexander Duyck (6):
  udp: Record gso_segs when supporting UDP segmentation offload
  udp: Do not pass MSS as parameter to GSO segmentation
  udp: Do not pass checksum as a parameter to GSO segmentation
  udp: Partially unroll handling of first segment and last segment
  udp: Add support for software checksum and GSO_PARTIAL with GSO offload
  udp: Do not copy destructor if one is not present


 include/net/udp.h  |3 +-
 net/ipv4/udp.c |2 +
 net/ipv4/udp_offload.c |   94 +++-
 net/ipv6/udp_offload.c |   16 +---
 4 files changed, 64 insertions(+), 51 deletions(-)

--


[net-next PATCH v3 2/6] udp: Do not pass MSS as parameter to GSO segmentation

2018-05-07 Thread Alexander Duyck
From: Alexander Duyck <alexander.h.du...@intel.com>

There is no point in passing MSS as a parameter for the GSO
segmentation call as it is already available via the shared info for the
skb itself.

Reviewed-by: Eric Dumazet <eduma...@google.com>
Acked-by: Willem de Bruijn <will...@google.com>
Signed-off-by: Alexander Duyck <alexander.h.du...@intel.com>
---
 include/net/udp.h  |2 +-
 net/ipv4/udp_offload.c |6 --
 net/ipv6/udp_offload.c |2 +-
 3 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/net/udp.h b/include/net/udp.h
index 05990746810e..8bd83b044ecd 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -176,7 +176,7 @@ struct sk_buff **udp_gro_receive(struct sk_buff **head, 
struct sk_buff *skb,
 
 struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
  netdev_features_t features,
- unsigned int mss, __sum16 check);
+ __sum16 check);
 
 static inline struct udphdr *udp_gro_udphdr(struct sk_buff *skb)
 {
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 006257092f06..c1afcd2f1a76 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -189,14 +189,16 @@ struct sk_buff *skb_udp_tunnel_segment(struct sk_buff 
*skb,
 
 struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
  netdev_features_t features,
- unsigned int mss, __sum16 check)
+ __sum16 check)
 {
struct sock *sk = gso_skb->sk;
unsigned int sum_truesize = 0;
struct sk_buff *segs, *seg;
unsigned int hdrlen;
struct udphdr *uh;
+   unsigned int mss;
 
+   mss = skb_shinfo(gso_skb)->gso_size;
if (gso_skb->len <= sizeof(*uh) + mss)
return ERR_PTR(-EINVAL);
 
@@ -244,7 +246,7 @@ static struct sk_buff *__udp4_gso_segment(struct sk_buff 
*gso_skb,
if (!can_checksum_protocol(features, htons(ETH_P_IP)))
return ERR_PTR(-EIO);
 
-   return __udp_gso_segment(gso_skb, features, mss,
+   return __udp_gso_segment(gso_skb, features,
 udp_v4_check(sizeof(struct udphdr) + mss,
  iph->saddr, iph->daddr, 0));
 }
diff --git a/net/ipv6/udp_offload.c b/net/ipv6/udp_offload.c
index f7b85b1e6b3e..dea03ec09715 100644
--- a/net/ipv6/udp_offload.c
+++ b/net/ipv6/udp_offload.c
@@ -26,7 +26,7 @@ static struct sk_buff *__udp6_gso_segment(struct sk_buff 
*gso_skb,
if (!can_checksum_protocol(features, htons(ETH_P_IPV6)))
return ERR_PTR(-EIO);
 
-   return __udp_gso_segment(gso_skb, features, mss,
+   return __udp_gso_segment(gso_skb, features,
 udp_v6_check(sizeof(struct udphdr) + mss,
  >saddr, >daddr, 0));
 }



[net-next PATCH v3 1/6] udp: Record gso_segs when supporting UDP segmentation offload

2018-05-07 Thread Alexander Duyck
From: Alexander Duyck <alexander.h.du...@intel.com>

We need to record the number of segments that will be generated when this
frame is segmented. The expectation is that if gso_size is set then
gso_segs is set as well. Without this some drivers such as ixgbe get
confused if they attempt to offload this as they record 0 segments for the
entire packet instead of the correct value.

Reviewed-by: Eric Dumazet <eduma...@google.com>
Acked-by: Willem de Bruijn <will...@google.com>
Signed-off-by: Alexander Duyck <alexander.h.du...@intel.com>
---
 net/ipv4/udp.c |2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index dd3102a37ef9..e07db83b311e 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -793,6 +793,8 @@ static int udp_send_skb(struct sk_buff *skb, struct flowi4 
*fl4,
 
skb_shinfo(skb)->gso_size = cork->gso_size;
skb_shinfo(skb)->gso_type = SKB_GSO_UDP_L4;
+   skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(len - sizeof(uh),
+cork->gso_size);
goto csum_partial;
}
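
For example (illustrative numbers): a send of 4008 bytes, i.e. an
8-byte UDP header plus 4000 bytes of payload, with cork->gso_size =
1500 records DIV_ROUND_UP(4000, 1500) = 3 in gso_segs, matching the
three wire segments the device is expected to produce.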
 



Re: [net-next PATCH v2 0/8] UDP GSO Segmentation clean-ups

2018-05-07 Thread Alexander Duyck
On Sat, May 5, 2018 at 3:06 AM, Willem de Bruijn
<willemdebruijn.ker...@gmail.com> wrote:
> On Fri, May 4, 2018 at 8:28 PM, Alexander Duyck
> <alexander.du...@gmail.com> wrote:
>> This patch set addresses a number of issues I found while sorting out
>> enabling UDP GSO Segmentation support for ixgbe/ixgbevf. Specifically there
>> were a number of issues related to the checksum and such that seemed to
>> cause either minor irregularities or kernel panics in the case of the
>> offload request being allowed to traverse between name spaces.
>
> Were you able to traverse GSO packets between network namespace before
> adding to NETIF_F_GSO_SOFTWARE? It does appear that veth includes
> NETIF_F_GSO_ENCAP_ALL, which also allows GSO.

Without that change the tunnel wouldn't pass the requests between
namespaces. However with it I was able to easily test the software
checksum code as otherwise the socket was returning EIO when the
hardware checksum was disabled.

> In either case, it should not be possible for GSO packets to arrive on a veth
> device, as that can result in queuing the GSO packet to a recipient socket.
> In this regard veth is like loopback and must exclude GSO support.
>
> I'll take a look.

I suspect it was probably sending veth UDP segmentation offload
requests. For now I can probably drop the patch that was adding it and
it can be added later to individual drivers if needed.

Thanks.

- Alex


Re: [net-next PATCH v2 4/8] udp: Do not pass checksum as a parameter to GSO segmentation

2018-05-06 Thread Alexander Duyck
On Sun, May 6, 2018 at 10:17 AM, Willem de Bruijn
<willemdebruijn.ker...@gmail.com> wrote:
> On Sat, May 5, 2018 at 7:39 PM, Alexander Duyck
> <alexander.du...@gmail.com> wrote:
>> On Sat, May 5, 2018 at 3:01 AM, Willem de Bruijn
>> <willemdebruijn.ker...@gmail.com> wrote:
>>> On Fri, May 4, 2018 at 8:30 PM, Alexander Duyck
>>> <alexander.du...@gmail.com> wrote:
>>>> From: Alexander Duyck <alexander.h.du...@intel.com>
>>>>
>>>> This patch is meant to allow us to avoid having to recompute the checksum
>>>> from scratch and have it passed as a parameter.
>>>>
>>>> Instead of taking that approach we can take advantage of the fact that the
>>>> length that was used to compute the existing checksum is included in the
>>>> UDP header. If we cancel that out by adding the value XOR with 0xffff we
>>>> can then just add the new length in and fold that into the new result.
>>>>
>>>> I think this may be fixing a checksum bug in the original code as well
>>>> since the checksum that was passed included the UDP header in the checksum
>>>> computation, but then excluded it for the adjustment on the last frame. I
>>>> believe this may have an effect on things in the cases where the two differ
>>>> by bits that would result in things crossing the byte boundaries.
>>>
>>> The replacement code, below, subtracts original payload size then adds
>>> the new payload size. mss here excludes the udp header size.
>>>
>>>> /* last packet can be partial gso_size */
>>>> -   if (!seg->next)
>>>> -   csum_replace2(&uh->check, htons(mss),
>>>> - htons(seg->len - hdrlen - 
>>>> sizeof(*uh)));
>>
>> That is my point. When you calculated your checksum you included the
>> UDP header in the calculation.
>>
>> -   return __udp_gso_segment(gso_skb, features,
>> -udp_v4_check(sizeof(struct udphdr) + mss,
>> - iph->saddr, iph->daddr, 0));
>>
>> Basically the problem is in one spot you are adding the sizeof(struct
>> udphdr) + mss and then in another you are cancelling it out as mss and
>> trying to account for it by also dropping the UDP header from the
>> payload length of the value you are adding. That works in the cases
>> where the effect doesn't cause any issues with the byte ordering,
>> however I think when mss + 8 crosses a byte boundary it can lead to
>> issues since the calculation is done on a byte swapped value.
>
> Do you mean that the issue is that the arithmetic operations
> on a __be16 in csum_replace2 may be incorrect if they exceed
> the least significant byte?
>
> csum_replace2 is used in many locations in the stack to adjust a network
> byte order csum when the payload length changes (e.g., iph->tot_len in
> inet_gro_complete).
>
> Or am I missing something specific about the udphdr calculations?

Actually it looks like the math I was applying to test this was off.

Basically the part I wasn't a fan of is the fact that we account for
the UDP header in the first calculation but not in the next. I guess
in the grand scheme of things though you are just dropping it from
both the value being removed and the value being added so it works out
due to the fact that the checksum math is associative.

I guess I was just being too literal in my thinking. Still an
expensive way of doing this though. I'll update the patch description.

Thanks.

- Alex


Re: [net-next PATCH v2 6/8] udp: Add support for software checksum and GSO_PARTIAL with GSO offload

2018-05-06 Thread Alexander Duyck
On Sun, May 6, 2018 at 2:50 PM, Willem de Bruijn
<willemdebruijn.ker...@gmail.com> wrote:
> On Sat, May 5, 2018 at 3:31 AM, Alexander Duyck
> <alexander.du...@gmail.com> wrote:
>> From: Alexander Duyck <alexander.h.du...@intel.com>
>>
>> This patch adds support for a software provided checksum and GSO_PARTIAL
>> segmentation support. With this we can offload UDP segmentation on devices
>> that only have partial support for tunnels.
>>
>> Since we are no longer needing the hardware checksum we can drop the checks
>> in the segmentation code that were verifying if it was present.
>>
>> Signed-off-by: Alexander Duyck <alexander.h.du...@intel.com>
>> ---
>>  net/ipv4/udp_offload.c |   28 ++--
>>  net/ipv6/udp_offload.c |   11 +--
>>  2 files changed, 19 insertions(+), 20 deletions(-)
>>
>> diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
>> index 946d06d2aa0c..fd94bbb369b2 100644
>> --- a/net/ipv4/udp_offload.c
>> +++ b/net/ipv4/udp_offload.c
>> @@ -217,6 +217,13 @@ struct sk_buff *__udp_gso_segment(struct sk_buff 
>> *gso_skb,
>> return segs;
>> }
>>
>> +   /* GSO partial and frag_list segmentation only requires splitting
>> +* the frame into an MSS multiple and possibly a remainder, both
>> +* cases return a GSO skb. So update the mss now.
>> +*/
>> +   if (skb_is_gso(segs))
>> +   mss *= skb_shinfo(segs)->gso_segs;
>> +
>> seg = segs;
>> uh = udp_hdr(seg);
>>
>> @@ -237,6 +244,11 @@ struct sk_buff *__udp_gso_segment(struct sk_buff 
>> *gso_skb,
>> uh->len = newlen;
>> uh->check = check;
>>
>> +   if (seg->ip_summed == CHECKSUM_PARTIAL)
>> +   gso_reset_checksum(seg, ~check);
>> +   else
>> +   uh->check = gso_make_checksum(seg, ~check);
>
> Here and below, this needs
>
>   if (uh->check == 0)
> uh->check = CSUM_MANGLED_0;
>
> similar to __skb_udp_tunnel_segment?

Good call, though I think I might change this up a bit and do something like:
uh->check = gso_make_checksum(seg, ~check) ? : CSUM_MANGLED_0;

That way I can avoid the extra read.
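
Spelled out without the "?:" shorthand, and assuming nothing beyond
what is quoted above, that is equivalent to:

    __sum16 tmp = gso_make_checksum(seg, ~check);

    /* a zero UDP checksum means "no checksum present", so substitute
     * the all-ones CSUM_MANGLED_0 when the sum folds to zero
     */
    uh->check = tmp ? tmp : CSUM_MANGLED_0;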

Thanks.

- Alex


Re: Locking in network code

2018-05-06 Thread Alexander Duyck
On Sun, May 6, 2018 at 6:43 AM, Jacob S. Moroni  wrote:
> Hello,
>
> I have a stupid question regarding which variant of spin_lock to use
> throughout the network stack, and inside RX handlers specifically.
>
> It's my understanding that skbuffs are normally passed into the stack
> from soft IRQ context if the device is using NAPI, and hard IRQ
> context if it's not using NAPI (and I guess process context too if the
> driver does its own workqueue thing).
>
> So, that means that handlers registered with netdev_rx_handler_register
> may end up being called from any context.

I am pretty sure the Rx handlers are all called from softirq context.
The hard IRQ will just call netif_rx which will queue the packet up to
be handled in the soft IRQ later.

> However, the RX handler in the macvlan code calls ip_check_defrag,
> which could eventually lead to a call to ip_defrag, which ends
> up taking a regular spin_lock around the call to ip_frag_queue.
>
> Is this a risk of deadlock, and if not, why?
>
> What if you're running a system with one CPU and a packet fragment
> arrives on a NAPI interface, then, while the spin_lock is held,
> another fragment somehow arrives on another interface which does
> its processing in hard IRQ context?
>
> --
>   Jacob S. Moroni
>   m...@jakemoroni.com

Take a look at the netif_rx code and it should answer most of your
questions. Basically everything is handed off from the hard IRQ to the
soft IRQ via a backlog queue and then handled in net_rx_action.
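
Roughly, the hand-off looks like this for a hypothetical non-NAPI
driver (toy names, not a real driver):

    static struct sk_buff *toy_fetch_skb(void *dev_id); /* hypothetical */

    /* Hard-IRQ handler: only queue the skb. netif_rx() places it on
     * the per-CPU backlog and raises NET_RX_SOFTIRQ; protocol code and
     * rx_handlers (such as macvlan's) then run later from
     * net_rx_action() in softirq context, never nested inside the
     * hard IRQ.
     */
    static irqreturn_t toy_isr(int irq, void *dev_id)
    {
            struct sk_buff *skb = toy_fetch_skb(dev_id);

            if (skb)
                    netif_rx(skb);
            return IRQ_HANDLED;
    }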

- Alex


Re: [net-next PATCH v2 4/8] udp: Do not pass checksum as a parameter to GSO segmentation

2018-05-05 Thread Alexander Duyck
On Sat, May 5, 2018 at 3:01 AM, Willem de Bruijn
<willemdebruijn.ker...@gmail.com> wrote:
> On Fri, May 4, 2018 at 8:30 PM, Alexander Duyck
> <alexander.du...@gmail.com> wrote:
>> From: Alexander Duyck <alexander.h.du...@intel.com>
>>
>> This patch is meant to allow us to avoid having to recompute the checksum
>> from scratch and have it passed as a parameter.
>>
>> Instead of taking that approach we can take advantage of the fact that the
>> length that was used to compute the existing checksum is included in the
>> UDP header. If we cancel that out by adding the value XOR with 0xffff we
>> can then just add the new length in and fold that into the new result.
>>
>> I think this may be fixing a checksum bug in the original code as well
>> since the checksum that was passed included the UDP header in the checksum
>> computation, but then excluded it for the adjustment on the last frame. I
>> believe this may have an effect on things in the cases where the two differ
>> by bits that would result in things crossing the byte boundaries.
>
> The replacement code, below, subtracts original payload size then adds
> the new payload size. mss here excludes the udp header size.
>
>> /* last packet can be partial gso_size */
>> -   if (!seg->next)
>> -   csum_replace2(&uh->check, htons(mss),
>> - htons(seg->len - hdrlen - 
>> sizeof(*uh)));

That is my point. When you calculated your checksum you included the
UDP header in the calculation.

-   return __udp_gso_segment(gso_skb, features,
-udp_v4_check(sizeof(struct udphdr) + mss,
- iph->saddr, iph->daddr, 0));

Basically the problem is in one spot you are adding the sizeof(struct
udphdr) + mss and then in another you are cancelling it out as mss and
trying to account for it by also dropping the UDP header from the
payload length of the value you are adding. That works in the cases
where the effect doesn't cause any issues with the byte ordering,
however I think when mss + 8 crosses a byte boundary it can lead to
issues since the calculation is done on a byte swapped value.


Re: [net-next PATCH v2 5/8] udp: Partially unroll handling of first segment and last segment

2018-05-05 Thread Alexander Duyck
On Sat, May 5, 2018 at 1:37 AM, Willem de Bruijn
<willemdebruijn.ker...@gmail.com> wrote:
> On Fri, May 4, 2018 at 8:30 PM, Alexander Duyck
> <alexander.du...@gmail.com> wrote:
>> From: Alexander Duyck <alexander.h.du...@intel.com>
>>
>> This patch allows us to take care of unrolling the first segment and the
>> last segment
>
> Only the last segment needs to be unrolled, right?

I need the first segment as I have to get access to the UDP header to
recompute what the checksum should actually be for the first frame and
all the frames in between.

>> of the loop for processing the segmented skb. Part of the
>> motivation for this is that it makes it easier to process the fact that the
>> first frame and all of the frames in between should be mostly identical
>> in terms of header data, and the last frame has differences in the length
>> and partial checksum.
>>
>> In addition I am dropping the header length calculation since we don't
>> really need it for anything but the last frame and it can be easily
>> obtained by just pulling the data_len and offset of tail from the transport
>> header.
>>
>> Signed-off-by: Alexander Duyck <alexander.h.du...@intel.com>
>> ---
>>
>> v2: New break-out patch based on one patch from earlier series
>>
>>  net/ipv4/udp_offload.c |   35 ---
>>  1 file changed, 20 insertions(+), 15 deletions(-)
>>
>> diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
>> index 5c4bb8b9e83e..946d06d2aa0c 100644
>> --- a/net/ipv4/udp_offload.c
>> +++ b/net/ipv4/udp_offload.c
>> @@ -193,7 +193,6 @@ struct sk_buff *__udp_gso_segment(struct sk_buff 
>> *gso_skb,
>> struct sk_buff *seg, *segs = ERR_PTR(-EINVAL);
>> struct sock *sk = gso_skb->sk;
>> unsigned int sum_truesize = 0;
>> -   unsigned int hdrlen;
>> struct udphdr *uh;
>> unsigned int mss;
>> __sum16 check;
>> @@ -206,7 +205,6 @@ struct sk_buff *__udp_gso_segment(struct sk_buff 
>> *gso_skb,
>> if (!pskb_may_pull(gso_skb, sizeof(*uh)))
>> goto out;
>>
>> -   hdrlen = gso_skb->data - skb_mac_header(gso_skb);
>> skb_pull(gso_skb, sizeof(*uh));
>>
>> /* clear destructor to avoid skb_segment assigning it to tail */
>> @@ -219,7 +217,8 @@ struct sk_buff *__udp_gso_segment(struct sk_buff 
>> *gso_skb,
>> return segs;
>> }
>>
>> -   uh = udp_hdr(segs);
>> +   seg = segs;
>> +   uh = udp_hdr(seg);
>>
>> /* compute checksum adjustment based on old length versus new */
>> newlen = htons(sizeof(*uh) + mss);
>> @@ -227,25 +226,31 @@ struct sk_buff *__udp_gso_segment(struct sk_buff 
>> *gso_skb,
>> ((__force u32)uh->len ^ 0xffff) +
>> (__force u32)newlen));
>>
>> -   for (seg = segs; seg; seg = seg->next) {
>> -   uh = udp_hdr(seg);
>> +   for (;;) {
>> +   seg->destructor = sock_wfree;
>> +   seg->sk = sk;
>> +   sum_truesize += seg->truesize;
>>
>> -   /* last packet can be partial gso_size */
>> -   if (!seg->next) {
>> -   newlen = htons(seg->len - hdrlen);
>> -   check = ~csum_fold((__force __wsum)((__force 
>> u32)uh->check +
>> -   ((__force 
>> u32)uh->len ^ 0xffff) +
>> -   (__force 
>> u32)newlen));
>> -   }
>> +   if (!seg->next)
>> +   break;
>
> Not critical, but I find replacing
>
>   for (seg = segs; seg; seg = seg->next) {
> uh = udp_hdr(seg);
> ...
>   }
>
> with
>
>   uh = udp_hdr(seg);
>   for (;;) {
> ...
> if (!seg->next)
>   break;
>
> uh = udp_hdr(seg);
>   }
>
> much less easy to parse and it inflates patch size. How about just
>
>   - for (seg = segs; seg; seg = seg->next)
>   + for( seg = segs; seg->next; seg = seg->next)
>
>

The problem is I need access to the UDP header before I start the
loop. That was why I pulled seg = segs and the "uh = udp_hdr(seg)" out
separately.

>>
>> uh->len = newlen;
>> uh->check = check;
>>
>> -   seg->

Re: [net-next PATCH v2 2/8] udp: Verify that pulling UDP header in GSO segmentation doesn't fail

2018-05-05 Thread Alexander Duyck
On Sat, May 5, 2018 at 1:12 AM, Willem de Bruijn
<willemdebruijn.ker...@gmail.com> wrote:
> On Fri, May 4, 2018 at 8:29 PM, Alexander Duyck
> <alexander.du...@gmail.com> wrote:
>> From: Alexander Duyck <alexander.h.du...@intel.com>
>>
>> We should verify that we can pull the UDP header before we attempt to do
>> so. Otherwise if this fails we have no way of knowing and GSO will not work
>> correctly.
>
> This already happened in the callers udp[46]_ufo_fragment

Ah. Okay I didn't see that. I may add a comment somewhere indicating
that is the case as it is a bit buried with all the calls.

Thanks.

- Alex


Re: [net-next PATCH v2 4/8] udp: Do not pass checksum as a parameter to GSO segmentation

2018-05-04 Thread Alexander Duyck
On Fri, May 4, 2018 at 1:19 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:
>
>
> On 05/04/2018 11:30 AM, Alexander Duyck wrote:
>> From: Alexander Duyck <alexander.h.du...@intel.com>
>>
>> This patch is meant to allow us to avoid having to recompute the checksum
>> from scratch and have it passed as a parameter.
>>
>> Instead of taking that approach we can take advantage of the fact that the
>> length that was used to compute the existing checksum is included in the
>> UDP header. If we cancel that out by adding the value XOR with 0xffff we
>> can then just add the new length in and fold that into the new result.
>>
>
>>
>> + uh = udp_hdr(segs);
>> +
>> + /* compute checksum adjustment based on old length versus new */
>> + newlen = htons(sizeof(*uh) + mss);
>> + check = ~csum_fold((__force __wsum)((__force u32)uh->check +
>> + ((__force u32)uh->len ^ 0xffff) +
>> + (__force u32)newlen));
>> +
>
>
> Can't this use csum_sub() instead of this ^ 0xffff trick ?

I could but that actually adds more instructions to all this since
csum_sub will perform the inversion across a 32b checksum when we only
need to bitflip a 16-bit value. I had considered doing (u16)(~uh->len)
but thought type casting it more than once would be a pain as well.

What I wanted to avoid is having to do the extra math to account for
the rollover. Adding 3 16-bit values will result in at most an 18-bit
value which can then be folded. Doing it this way we avoid that extra
add w/ carry logic that is needed for csum_add/sub.
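
To make that concrete, a small userspace sketch of the adjustment with
made-up values (the kernel does the same thing with __be16/__sum16
types and csum_fold):

    #include <stdint.h>
    #include <stdio.h>

    /* Fold a 32-bit accumulator back into a 16-bit ones'-complement sum. */
    static uint16_t fold16(uint32_t sum)
    {
            while (sum >> 16)
                    sum = (sum & 0xffff) + (sum >> 16);
            return (uint16_t)sum;
    }

    int main(void)
    {
            uint16_t check  = 0x8d2e; /* current uh->check (made up) */
            uint16_t oldlen = 0x05dc; /* uh->len the check was built with */
            uint16_t newlen = 0x0030; /* length of this segment */

            /* (oldlen ^ 0xffff) cancels the old length out of the
             * ones'-complement sum; adding newlen puts the new one in.
             * Three 16-bit values sum to at most an 18-bit value, so
             * the fold is cheap and needs no add-with-carry chain.
             */
            uint32_t sum = (uint32_t)check + (oldlen ^ 0xffff) + newlen;

            printf("adjusted check = 0x%04x\n", fold16(sum));
            return 0;
    }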


[net-next PATCH v2 8/8] net: Add NETIF_F_GSO_UDP_L4 to list of GSO offloads with fallback

2018-05-04 Thread Alexander Duyck
From: Alexander Duyck <alexander.h.du...@intel.com>

Enable UDP offload as a generic software offload since we can now handle it
for multiple cases including if the checksum isn't present and via
gso_partial in the case of tunnels.

Signed-off-by: Alexander Duyck <alexander.h.du...@intel.com>
---
 include/linux/netdev_features.h |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index c87c3a3453c1..efbd8b2c0197 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -184,7 +184,8 @@ enum {
 
 /* List of features with software fallbacks. */
 #define NETIF_F_GSO_SOFTWARE   (NETIF_F_ALL_TSO | \
-NETIF_F_GSO_SCTP)
+NETIF_F_GSO_SCTP | \
+NETIF_F_GSO_UDP_L4)
 
 /*
  * If one device supports one of these features, then enable them



[net-next PATCH v2 7/8] udp: Do not copy destructor if one is not present

2018-05-04 Thread Alexander Duyck
From: Alexander Duyck <alexander.h.du...@intel.com>

This patch makes it so that if a destructor is not present we avoid trying
to update the skb socket or any reference counting that would be associated
with the NULL socket and/or descriptor. By doing this we can support
traffic coming from another namespace without any issues.

Signed-off-by: Alexander Duyck <alexander.h.du...@intel.com>
---

v2: Do not overwrite destructor if not sock_wfree as per Eric
Drop WARN_ON as per Eric

 net/ipv4/udp_offload.c |   21 ++---
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index fd94bbb369b2..e802f6331589 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -195,6 +195,7 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
unsigned int sum_truesize = 0;
struct udphdr *uh;
unsigned int mss;
+   bool copy_dtor;
__sum16 check;
__be16 newlen;
 
@@ -208,12 +209,14 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
skb_pull(gso_skb, sizeof(*uh));
 
/* clear destructor to avoid skb_segment assigning it to tail */
-   WARN_ON_ONCE(gso_skb->destructor != sock_wfree);
-   gso_skb->destructor = NULL;
+   copy_dtor = gso_skb->destructor == sock_wfree;
+   if (copy_dtor)
+   gso_skb->destructor = NULL;
 
segs = skb_segment(gso_skb, features);
if (unlikely(IS_ERR_OR_NULL(segs))) {
-   gso_skb->destructor = sock_wfree;
+   if (copy_dtor)
+   gso_skb->destructor = sock_wfree;
return segs;
}
 
@@ -234,9 +237,11 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
(__force u32)newlen));
 
for (;;) {
-   seg->destructor = sock_wfree;
-   seg->sk = sk;
-   sum_truesize += seg->truesize;
+   if (copy_dtor) {
+   seg->destructor = sock_wfree;
+   seg->sk = sk;
+   sum_truesize += seg->truesize;
+   }
 
if (!seg->next)
break;
@@ -268,7 +273,9 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
uh->check = gso_make_checksum(seg, ~check);
 
/* update refcount for the packet */
-   refcount_add(sum_truesize - gso_skb->truesize, &sk->sk_wmem_alloc);
+   if (copy_dtor)
+   refcount_add(sum_truesize - gso_skb->truesize,
+&sk->sk_wmem_alloc);
 out:
return segs;
 }



[net-next PATCH v2 6/8] udp: Add support for software checksum and GSO_PARTIAL with GSO offload

2018-05-04 Thread Alexander Duyck
From: Alexander Duyck <alexander.h.du...@intel.com>

This patch adds support for a software provided checksum and GSO_PARTIAL
segmentation support. With this we can offload UDP segmentation on devices
that only have partial support for tunnels.

Since we are no longer needing the hardware checksum we can drop the checks
in the segmentation code that were verifying if it was present.

Signed-off-by: Alexander Duyck <alexander.h.du...@intel.com>
---
 net/ipv4/udp_offload.c |   28 ++--
 net/ipv6/udp_offload.c |   11 +--
 2 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 946d06d2aa0c..fd94bbb369b2 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -217,6 +217,13 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
return segs;
}
 
+   /* GSO partial and frag_list segmentation only requires splitting
+* the frame into an MSS multiple and possibly a remainder, both
+* cases return a GSO skb. So update the mss now.
+*/
+   if (skb_is_gso(segs))
+   mss *= skb_shinfo(segs)->gso_segs;
+
seg = segs;
uh = udp_hdr(seg);
 
@@ -237,6 +244,11 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
uh->len = newlen;
uh->check = check;
 
+   if (seg->ip_summed == CHECKSUM_PARTIAL)
+   gso_reset_checksum(seg, ~check);
+   else
+   uh->check = gso_make_checksum(seg, ~check);
+
seg = seg->next;
uh = udp_hdr(seg);
}
@@ -250,6 +262,11 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
uh->len = newlen;
uh->check = check;
 
+   if (seg->ip_summed == CHECKSUM_PARTIAL)
+   gso_reset_checksum(seg, ~check);
+   else
+   uh->check = gso_make_checksum(seg, ~check);
+
/* update refcount for the packet */
refcount_add(sum_truesize - gso_skb->truesize, &sk->sk_wmem_alloc);
 out:
@@ -257,15 +274,6 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
 }
 EXPORT_SYMBOL_GPL(__udp_gso_segment);
 
-static struct sk_buff *__udp4_gso_segment(struct sk_buff *gso_skb,
- netdev_features_t features)
-{
-   if (!can_checksum_protocol(features, htons(ETH_P_IP)))
-   return ERR_PTR(-EIO);
-
-   return __udp_gso_segment(gso_skb, features);
-}
-
 static struct sk_buff *udp4_ufo_fragment(struct sk_buff *skb,
 netdev_features_t features)
 {
@@ -289,7 +297,7 @@ static struct sk_buff *udp4_ufo_fragment(struct sk_buff 
*skb,
goto out;
 
if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4)
-   return __udp4_gso_segment(skb, features);
+   return __udp_gso_segment(skb, features);
 
mss = skb_shinfo(skb)->gso_size;
if (unlikely(skb->len <= mss))
diff --git a/net/ipv6/udp_offload.c b/net/ipv6/udp_offload.c
index 61e34f1d2fa2..03a2ff3fe1e6 100644
--- a/net/ipv6/udp_offload.c
+++ b/net/ipv6/udp_offload.c
@@ -17,15 +17,6 @@
 #include 
 #include "ip6_offload.h"
 
-static struct sk_buff *__udp6_gso_segment(struct sk_buff *gso_skb,
- netdev_features_t features)
-{
-   if (!can_checksum_protocol(features, htons(ETH_P_IPV6)))
-   return ERR_PTR(-EIO);
-
-   return __udp_gso_segment(gso_skb, features);
-}
-
 static struct sk_buff *udp6_ufo_fragment(struct sk_buff *skb,
 netdev_features_t features)
 {
@@ -58,7 +49,7 @@ static struct sk_buff *udp6_ufo_fragment(struct sk_buff *skb,
goto out;
 
if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4)
-   return __udp6_gso_segment(skb, features);
+   return __udp_gso_segment(skb, features);
 
/* Do software UFO. Complete and fill in the UDP checksum as HW 
cannot
 * do checksum of UDP packets sent as multiple IP fragments.



[net-next PATCH v2 4/8] udp: Do not pass checksum as a parameter to GSO segmentation

2018-05-04 Thread Alexander Duyck
From: Alexander Duyck <alexander.h.du...@intel.com>

This patch is meant to allow us to avoid having to recompute the checksum
from scratch and have it passed as a parameter.

Instead of taking that approach we can take advantage of the fact that the
length that was used to compute the existing checksum is included in the
UDP header. If we cancel that out by adding the value XOR with 0xffff we
can then just add the new length in and fold that into the new result.

I think this may be fixing a checksum bug in the original code as well
since the checksum that was passed included the UDP header in the checksum
computation, but then excluded it for the adjustment on the last frame. I
believe this may matter in cases where the two differ by bits that would
carry across byte boundaries.
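
To make the arithmetic concrete, here is a small standalone sketch of the
update (plain C with made-up values; fold() is an illustrative helper, and
~csum_fold() in the kernel amounts to the same plain end-around fold used
here). It is an RFC 1624 style incremental update, not patch code:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* fold a 32-bit accumulator into a 16-bit ones' complement sum */
static uint16_t fold(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);	/* end-around carry */
	return (uint16_t)sum;
}

int main(void)
{
	uint16_t old_len = 0x05dc, new_len = 0x0123;	/* made-up lengths */
	uint16_t rest = 0x4321;	/* rest of the pseudo-header sum, made up */

	/* what the check field held: a sum that already includes old_len */
	uint16_t check = fold((uint32_t)rest + old_len);

	/* the update from the patch: cancel old_len by adding its
	 * complement (old_len ^ 0xffff), add new_len, refold
	 */
	uint16_t updated = fold((uint32_t)check + (old_len ^ 0xffffu) + new_len);

	assert(updated == fold((uint32_t)rest + new_len));
	printf("updated sum: 0x%04x\n", updated);
	return 0;
}

The XOR with 0xffff is ones' complement negation, which is why no separate
subtract helper is needed.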

Signed-off-by: Alexander Duyck <alexander.h.du...@intel.com>
---

v2: New break-out patch based on one patch from earlier series

 include/net/udp.h  |3 +--
 net/ipv4/udp_offload.c |   35 +--
 net/ipv6/udp_offload.c |7 +--
 3 files changed, 23 insertions(+), 22 deletions(-)

diff --git a/include/net/udp.h b/include/net/udp.h
index 8bd83b044ecd..9289b6425032 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -175,8 +175,7 @@ struct sk_buff **udp_gro_receive(struct sk_buff **head, struct sk_buff *skb,
 int udp_gro_complete(struct sk_buff *skb, int nhoff, udp_lookup_t lookup);
 
 struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
- netdev_features_t features,
- __sum16 check);
+ netdev_features_t features);
 
 static inline struct udphdr *udp_gro_udphdr(struct sk_buff *skb)
 {
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index f21b63adcbbc..5c4bb8b9e83e 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -188,8 +188,7 @@ struct sk_buff *skb_udp_tunnel_segment(struct sk_buff *skb,
 EXPORT_SYMBOL(skb_udp_tunnel_segment);
 
 struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
- netdev_features_t features,
- __sum16 check)
+ netdev_features_t features)
 {
struct sk_buff *seg, *segs = ERR_PTR(-EINVAL);
struct sock *sk = gso_skb->sk;
@@ -197,6 +196,8 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
unsigned int hdrlen;
struct udphdr *uh;
unsigned int mss;
+   __sum16 check;
+   __be16 newlen;
 
mss = skb_shinfo(gso_skb)->gso_size;
if (gso_skb->len <= sizeof(*uh) + mss)
@@ -218,17 +219,28 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
return segs;
}
 
+   uh = udp_hdr(segs);
+
+   /* compute checksum adjustment based on old length versus new */
+   newlen = htons(sizeof(*uh) + mss);
+   check = ~csum_fold((__force __wsum)((__force u32)uh->check +
+   ((__force u32)uh->len ^ 0xffff) +
+   (__force u32)newlen));
+
for (seg = segs; seg; seg = seg->next) {
uh = udp_hdr(seg);
-   uh->len = htons(seg->len - hdrlen);
-   uh->check = check;
 
/* last packet can be partial gso_size */
-   if (!seg->next)
-   csum_replace2(&uh->check, htons(mss),
- htons(seg->len - hdrlen - sizeof(*uh)));
+   if (!seg->next) {
+   newlen = htons(seg->len - hdrlen);
+   check = ~csum_fold((__force __wsum)((__force u32)uh->check +
+   ((__force u32)uh->len ^ 0xffff) +
+   (__force u32)newlen));
+   }
+
+   uh->len = newlen;
+   uh->check = check;
 
-   uh->check = ~uh->check;
seg->destructor = sock_wfree;
seg->sk = sk;
sum_truesize += seg->truesize;
@@ -243,15 +255,10 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
 static struct sk_buff *__udp4_gso_segment(struct sk_buff *gso_skb,
  netdev_features_t features)
 {
-   const struct iphdr *iph = ip_hdr(gso_skb);
-   unsigned int mss = skb_shinfo(gso_skb)->gso_size;
-
if (!can_checksum_protocol(features, htons(ETH_P_IP)))
return ERR_PTR(-EIO);
 
-   return __udp_gso_segment(gso_skb, features,
-udp_v4_check(sizeof(struct udphdr) + mss,
- iph->saddr, iph->daddr, 0));
+   return __udp_gso_segment(gso_skb, features);
 }
 
 static struct s

[net-next PATCH v2 5/8] udp: Partially unroll handling of first segment and last segment

2018-05-04 Thread Alexander Duyck
From: Alexander Duyck <alexander.h.du...@intel.com>

This patch allows us to take care of unrolling the first segment and the
last segment of the loop for processing the segmented skb. Part of the
motivation for this is that it makes it easier to process the fact that the
first frame and all of the frames in between should be mostly identical
in terms of header data, and the last frame has differences in the length
and partial checksum.

In addition I am dropping the header length calculation since we don't
really need it for anything but the last frame and it can be easily
obtained by just pulling the data_len and offset of tail from the transport
header.
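
For anyone checking that replacement, the equivalence goes roughly like
this (a sketch of the skb geometry as I read it, not patch code):

/*
 * hdrlen was the MAC + IP header size, and each segment produced by
 * skb_segment() has seg->data at the MAC header, so:
 *
 *   seg->len = (skb_tail_pointer(seg) - seg->data) + seg->data_len
 *   skb_transport_header(seg) == seg->data + hdrlen
 *
 * which gives the UDP length of the last segment as:
 *
 *   seg->len - hdrlen == skb_tail_pointer(seg)
 *                        - skb_transport_header(seg) + seg->data_len
 */

i.e. the linear bytes from the UDP header to the tail plus whatever sits
in paged fragments.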

Signed-off-by: Alexander Duyck <alexander.h.du...@intel.com>
---

v2: New break-out patch based on one patch from earlier series

 net/ipv4/udp_offload.c |   35 ---
 1 file changed, 20 insertions(+), 15 deletions(-)

diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 5c4bb8b9e83e..946d06d2aa0c 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -193,7 +193,6 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
struct sk_buff *seg, *segs = ERR_PTR(-EINVAL);
struct sock *sk = gso_skb->sk;
unsigned int sum_truesize = 0;
-   unsigned int hdrlen;
struct udphdr *uh;
unsigned int mss;
__sum16 check;
@@ -206,7 +205,6 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
if (!pskb_may_pull(gso_skb, sizeof(*uh)))
goto out;
 
-   hdrlen = gso_skb->data - skb_mac_header(gso_skb);
skb_pull(gso_skb, sizeof(*uh));
 
/* clear destructor to avoid skb_segment assigning it to tail */
@@ -219,7 +217,8 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
return segs;
}
 
-   uh = udp_hdr(segs);
+   seg = segs;
+   uh = udp_hdr(seg);
 
/* compute checksum adjustment based on old length versus new */
newlen = htons(sizeof(*uh) + mss);
@@ -227,25 +226,31 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
((__force u32)uh->len ^ 0x) +
(__force u32)newlen));
 
-   for (seg = segs; seg; seg = seg->next) {
-   uh = udp_hdr(seg);
+   for (;;) {
+   seg->destructor = sock_wfree;
+   seg->sk = sk;
+   sum_truesize += seg->truesize;
 
-   /* last packet can be partial gso_size */
-   if (!seg->next) {
-   newlen = htons(seg->len - hdrlen);
-   check = ~csum_fold((__force __wsum)((__force u32)uh->check +
-   ((__force u32)uh->len ^ 0xffff) +
-   (__force u32)newlen));
-   }
+   if (!seg->next)
+   break;
 
uh->len = newlen;
uh->check = check;
 
-   seg->destructor = sock_wfree;
-   seg->sk = sk;
-   sum_truesize += seg->truesize;
+   seg = seg->next;
+   uh = udp_hdr(seg);
}
 
+   /* last packet can be partial gso_size, account for that in checksum */
+   newlen = htons(skb_tail_pointer(seg) - skb_transport_header(seg) +
+  seg->data_len);
+   check = ~csum_fold((__force __wsum)((__force u32)uh->check +
+   ((__force u32)uh->len ^ 0xffff) +
+   (__force u32)newlen));
+   uh->len = newlen;
+   uh->check = check;
+
+   /* update refcount for the packet */
refcount_add(sum_truesize - gso_skb->truesize, &sk->sk_wmem_alloc);
 out:
return segs;



[net-next PATCH v2 3/8] udp: Do not pass MSS as parameter to GSO segmentation

2018-05-04 Thread Alexander Duyck
From: Alexander Duyck <alexander.h.du...@intel.com>

There is no point in passing MSS as a parameter to the GSO
segmentation call as it is already available via the shared info for the
skb itself.

Signed-off-by: Alexander Duyck <alexander.h.du...@intel.com>
---

v2: New break-out patch based on one patch from earlier series

 include/net/udp.h  |2 +-
 net/ipv4/udp_offload.c |6 --
 net/ipv6/udp_offload.c |2 +-
 3 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/net/udp.h b/include/net/udp.h
index 05990746810e..8bd83b044ecd 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -176,7 +176,7 @@ struct sk_buff **udp_gro_receive(struct sk_buff **head, struct sk_buff *skb,
 
 struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
  netdev_features_t features,
- unsigned int mss, __sum16 check);
+ __sum16 check);
 
 static inline struct udphdr *udp_gro_udphdr(struct sk_buff *skb)
 {
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 8303fff42940..f21b63adcbbc 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -189,14 +189,16 @@ struct sk_buff *skb_udp_tunnel_segment(struct sk_buff *skb,
 
 struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
  netdev_features_t features,
- unsigned int mss, __sum16 check)
+ __sum16 check)
 {
struct sk_buff *seg, *segs = ERR_PTR(-EINVAL);
struct sock *sk = gso_skb->sk;
unsigned int sum_truesize = 0;
unsigned int hdrlen;
struct udphdr *uh;
+   unsigned int mss;
 
+   mss = skb_shinfo(gso_skb)->gso_size;
if (gso_skb->len <= sizeof(*uh) + mss)
goto out;
 
@@ -247,7 +249,7 @@ static struct sk_buff *__udp4_gso_segment(struct sk_buff *gso_skb,
if (!can_checksum_protocol(features, htons(ETH_P_IP)))
return ERR_PTR(-EIO);
 
-   return __udp_gso_segment(gso_skb, features, mss,
+   return __udp_gso_segment(gso_skb, features,
 udp_v4_check(sizeof(struct udphdr) + mss,
  iph->saddr, iph->daddr, 0));
 }
diff --git a/net/ipv6/udp_offload.c b/net/ipv6/udp_offload.c
index f7b85b1e6b3e..dea03ec09715 100644
--- a/net/ipv6/udp_offload.c
+++ b/net/ipv6/udp_offload.c
@@ -26,7 +26,7 @@ static struct sk_buff *__udp6_gso_segment(struct sk_buff *gso_skb,
if (!can_checksum_protocol(features, htons(ETH_P_IPV6)))
return ERR_PTR(-EIO);
 
-   return __udp_gso_segment(gso_skb, features, mss,
+   return __udp_gso_segment(gso_skb, features,
 udp_v6_check(sizeof(struct udphdr) + mss,
  &ip6h->saddr, &ip6h->daddr, 0));
 }



[net-next PATCH v2 2/8] udp: Verify that pulling UDP header in GSO segmentation doesn't fail

2018-05-04 Thread Alexander Duyck
From: Alexander Duyck <alexander.h.du...@intel.com>

We should verify that we can pull the UDP header before we attempt to do
so. Otherwise if this fails we have no way of knowing and GSO will not work
correctly.

Signed-off-by: Alexander Duyck <alexander.h.du...@intel.com>
---

v2: New break-out patch based on one patch from earlier series

 net/ipv4/udp_offload.c |9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 006257092f06..8303fff42940 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -191,14 +191,17 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
  netdev_features_t features,
  unsigned int mss, __sum16 check)
 {
+   struct sk_buff *seg, *segs = ERR_PTR(-EINVAL);
struct sock *sk = gso_skb->sk;
unsigned int sum_truesize = 0;
-   struct sk_buff *segs, *seg;
unsigned int hdrlen;
struct udphdr *uh;
 
if (gso_skb->len <= sizeof(*uh) + mss)
-   return ERR_PTR(-EINVAL);
+   goto out;
+
+   if (!pskb_may_pull(gso_skb, sizeof(*uh)))
+   goto out;
 
hdrlen = gso_skb->data - skb_mac_header(gso_skb);
skb_pull(gso_skb, sizeof(*uh));
@@ -230,7 +233,7 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
}
 
refcount_add(sum_truesize - gso_skb->truesize, &sk->sk_wmem_alloc);
-
+out:
return segs;
 }
 EXPORT_SYMBOL_GPL(__udp_gso_segment);



[net-next PATCH v2 1/8] udp: Record gso_segs when supporting UDP segmentation offload

2018-05-04 Thread Alexander Duyck
From: Alexander Duyck <alexander.h.du...@intel.com>

We need to record the number of segments that will be generated when this
frame is segmented. The expectation is that if gso_size is set then
gso_segs is set as well. Without this some drivers such as ixgbe get
confused if they attempt to offload this as they record 0 segments for the
entire packet instead of the correct value.
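
A quick sanity check of the formula with made-up numbers (len here counts
the UDP header plus payload, as in udp_send_skb; not patch code):

#include <stdio.h>

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

int main(void)
{
	unsigned int len = 4508;	/* 8-byte UDP header + 4500 payload */
	unsigned int gso_size = 1500;	/* cork->gso_size, made up */

	/* 4500 payload bytes at 1500 per segment -> 3 segments */
	printf("gso_segs = %u\n", DIV_ROUND_UP(len - 8, gso_size));
	return 0;
}

so the driver sees gso_segs = 3 instead of 0 and can size its work
accordingly.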

Reviewed-by: Eric Dumazet <eduma...@google.com>
Signed-off-by: Alexander Duyck <alexander.h.du...@intel.com>
---
 net/ipv4/udp.c |2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index dd3102a37ef9..e07db83b311e 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -793,6 +793,8 @@ static int udp_send_skb(struct sk_buff *skb, struct flowi4 *fl4,
 
skb_shinfo(skb)->gso_size = cork->gso_size;
skb_shinfo(skb)->gso_type = SKB_GSO_UDP_L4;
+   skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(len - sizeof(uh),
+cork->gso_size);
goto csum_partial;
}
 



[net-next PATCH v2 0/8] UDP GSO Segmentation clean-ups

2018-05-04 Thread Alexander Duyck
This patch set addresses a number of issues I found while sorting out
enabling UDP GSO Segmentation support for ixgbe/ixgbevf. Specifically there
were a number of issues related to the checksum and such that seemed to
cause either minor irregularities or kernel panics in the case of the
offload request being allowed to traverse between name spaces.

With this set applied I was able to get UDP GSO traffic to pass over
vxlan tunnels in both offloaded modes and non-offloaded modes for ixgbe and
ixgbevf.

I submitted the driver specific patches earlier as an RFC:
https://patchwork.ozlabs.org/project/netdev/list/?series=42477=both=*

v2: Updated patches based on feedback from Eric Dumazet
Split first patch into several patches based on feedback from Eric

---

Alexander Duyck (8):
  udp: Record gso_segs when supporting UDP segmentation offload
  udp: Verify that pulling UDP header in GSO segmentation doesn't fail
  udp: Do not pass MSS as parameter to GSO segmentation
  udp: Do not pass checksum as a parameter to GSO segmentation
  udp: Partially unroll handling of first segment and last segment
  udp: Add support for software checksum and GSO_PARTIAL with GSO offload
  udp: Do not copy destructor if one is not present
  net: Add NETIF_F_GSO_UDP_L4 to list of GSO offloads with fallback


 include/linux/netdev_features.h |3 +
 include/net/udp.h   |3 -
 net/ipv4/udp.c  |2 +
 net/ipv4/udp_offload.c  |  104 ++-
 net/ipv6/udp_offload.c  |   16 --
 5 files changed, 74 insertions(+), 54 deletions(-)

--

