Re: [PATCH net-next 08/16] net/devlink: Add E-Switch mode control

2016-06-30 Thread John Fastabend
On 16-06-30 08:53 AM, Jiri Pirko wrote:
> Thu, Jun 30, 2016 at 05:40:57PM CEST, john.fastab...@gmail.com wrote:
>> On 16-06-30 03:52 AM, Jiri Pirko wrote:
>>> Thu, Jun 30, 2016 at 09:57:21AM CEST, john.fastab...@gmail.com wrote:
 On 16-06-30 12:41 AM, Jiri Pirko wrote:
> Thu, Jun 30, 2016 at 09:13:55AM CEST, sridhar.samudr...@intel.com wrote:
>>
>>
>> On 6/29/2016 11:25 PM, Jiri Pirko wrote:
>>> Thu, Jun 30, 2016 at 06:04:39AM CEST, john.fastab...@gmail.com wrote:
 On 16-06-29 08:35 PM, John Fastabend wrote:
> On 16-06-29 03:09 PM, John Fastabend wrote:
>> On 16-06-29 02:33 PM, Or Gerlitz wrote:
>>> On Wed, Jun 29, 2016 at 7:35 PM, John Fastabend
>>>  wrote:
 On 16-06-29 07:48 AM, Or Gerlitz wrote:
> On 6/28/2016 10:31 PM, John Fastabend wrote:
>> On 16-06-28 12:12 PM, Jiri Pirko wrote:
>>> Why?! Please, leave legacy be legacy. Use the new mode for
>>> implementing new features. Don't make things any more 
>>> complicated :(
>>> [...]
>> Maybe I'm reading to much into the devlink flag names and if 
>> instead
>> you use a switch like the following,
>>VF representer : enable/disable the creation VF netdev's to 
>> represent
>> the virtual functions on the PF
>> Much less complicated then magic switching between forwarding 
>> logic IMO
>> and you don't whack a default configuration that an entire stack 
>> (e.g.
>> libvirt) has been built to use.
> Re letting the user to observe/modify the rules added by the
> driver/firmware while legacy mode. Even if possible with 
> bridge/fdb, it
> will be really pragmatical and doesn't make sense to get that 
> donefor
> the TC subsystem. So this isn't a well defined solution and 
> anyway, as
> you said, legacy mode enhancements is a different exercise. 
> Personally,
> I agree with Jiri, that we should legacy be legacyand focus on 
> adding
> the new model.
 The ixgbe driver already supports bridge and tc commands without 
 the VF
 representer.  Adding the VF representer to these drivers just 
 extends
 the existing support so we have an identifier for VFs and now the
 redirect action works and the fdb commands can specify the VF 
 netdevs.
 I don't see this as a problem because we already do it today with
 'ip' and bridge tools.
>>> To be precise, for both ixgbe and mlx5, the existing tc support
>>> (u32/ixgbe, flower/mlx5) is not for switching functionality but 
>>> rather
>>> for NIC-ish one, e.g drop, mark, etc. Indeed in ixgbe you added
>>> redirect to VF, but this is only for south --> north (wire --> VF)
>>> traffic, w.o the VF rep you can't do the other way around.
>>>
>> Correct which is why we need the VF rep. So we are completely in
>> sync there.
>>
>>> Just to clarify, to what exact bridge command support did you refer 
>>> for ixgbe?
>> 'bridge fdb' commands are supported today on the PF. But its the
>> same story as above we need the VF rep to also use it on the
>> VF representer
>>
>> Also 'bridge link' command for veb/vepa modes is supported and the
>> other link attributes could be supported with additional driver
>> support. No need for core changes here. But again yes only on the
>> PF so again we need the VF reps.
>>
>>> The forwarding done in the legacy mode is not well defined, and
>>> different across vendors, adding there the VF reps will not make it
>>> any better b/c some steering rules will be set by tc/bridge offloads
>>> while other rules will be put by the driver.
>>> I don't see how this takes us to better place.
>> In legacy mode or any other mode you are defining some default policy
>> and rules.
>>
>> In the legacy mode we use mac/vlan assigned l2 forwarding entries in 
>> the
>> hardware fdb which are seen when you query 'ip link' and 'bridge fdb'
>> today. And similarly can be modified today using 'ip link' and 
>> 'bridge
>> fdb' at least on the intel devices. Its not undefined in any way with
>> a quick query of the tools we can learn exactly what the 
>> configuration
>> is and even change it. This works fairly well with existing 
>> controllers
>> and stacks.
>>
>> The limitations are 'ip' only supports a single MAC address per 

Re: [PATCH net-next 08/16] net/devlink: Add E-Switch mode control

2016-06-30 Thread Jiri Pirko
Thu, Jun 30, 2016 at 05:40:57PM CEST, john.fastab...@gmail.com wrote:
>On 16-06-30 03:52 AM, Jiri Pirko wrote:
>> Thu, Jun 30, 2016 at 09:57:21AM CEST, john.fastab...@gmail.com wrote:
>>> On 16-06-30 12:41 AM, Jiri Pirko wrote:
 Thu, Jun 30, 2016 at 09:13:55AM CEST, sridhar.samudr...@intel.com wrote:
>
>
> On 6/29/2016 11:25 PM, Jiri Pirko wrote:
>> Thu, Jun 30, 2016 at 06:04:39AM CEST, john.fastab...@gmail.com wrote:
>>> On 16-06-29 08:35 PM, John Fastabend wrote:
 On 16-06-29 03:09 PM, John Fastabend wrote:
> On 16-06-29 02:33 PM, Or Gerlitz wrote:
>> On Wed, Jun 29, 2016 at 7:35 PM, John Fastabend
>>  wrote:
>>> On 16-06-29 07:48 AM, Or Gerlitz wrote:
 On 6/28/2016 10:31 PM, John Fastabend wrote:
> On 16-06-28 12:12 PM, Jiri Pirko wrote:
>> Why?! Please, leave legacy be legacy. Use the new mode for
>> implementing new features. Don't make things any more 
>> complicated :(
>> [...]
> Maybe I'm reading to much into the devlink flag names and if 
> instead
> you use a switch like the following,
>VF representer : enable/disable the creation VF netdev's to 
> represent
> the virtual functions on the PF
> Much less complicated then magic switching between forwarding 
> logic IMO
> and you don't whack a default configuration that an entire stack 
> (e.g.
> libvirt) has been built to use.
 Re letting the user to observe/modify the rules added by the
 driver/firmware while legacy mode. Even if possible with 
 bridge/fdb, it
 will be really pragmatical and doesn't make sense to get that 
 donefor
 the TC subsystem. So this isn't a well defined solution and 
 anyway, as
 you said, legacy mode enhancements is a different exercise. 
 Personally,
 I agree with Jiri, that we should legacy be legacyand focus on 
 adding
 the new model.
>>> The ixgbe driver already supports bridge and tc commands without 
>>> the VF
>>> representer.  Adding the VF representer to these drivers just 
>>> extends
>>> the existing support so we have an identifier for VFs and now the
>>> redirect action works and the fdb commands can specify the VF 
>>> netdevs.
>>> I don't see this as a problem because we already do it today with
>>> 'ip' and bridge tools.
>> To be precise, for both ixgbe and mlx5, the existing tc support
>> (u32/ixgbe, flower/mlx5) is not for switching functionality but 
>> rather
>> for NIC-ish one, e.g drop, mark, etc. Indeed in ixgbe you added
>> redirect to VF, but this is only for south --> north (wire --> VF)
>> traffic, w.o the VF rep you can't do the other way around.
>>
> Correct which is why we need the VF rep. So we are completely in
> sync there.
>
>> Just to clarify, to what exact bridge command support did you refer 
>> for ixgbe?
> 'bridge fdb' commands are supported today on the PF. But its the
> same story as above we need the VF rep to also use it on the
> VF representer
>
> Also 'bridge link' command for veb/vepa modes is supported and the
> other link attributes could be supported with additional driver
> support. No need for core changes here. But again yes only on the
> PF so again we need the VF reps.
>
>> The forwarding done in the legacy mode is not well defined, and
>> different across vendors, adding there the VF reps will not make it
>> any better b/c some steering rules will be set by tc/bridge offloads
>> while other rules will be put by the driver.
>> I don't see how this takes us to better place.
> In legacy mode or any other mode you are defining some default policy
> and rules.
>
> In the legacy mode we use mac/vlan assigned l2 forwarding entries in 
> the
> hardware fdb which are seen when you query 'ip link' and 'bridge fdb'
> today. And similarly can be modified today using 'ip link' and 'bridge
> fdb' at least on the intel devices. Its not undefined in any way with
> a quick query of the tools we can learn exactly what the configuration
> is and even change it. This works fairly well with existing 
> controllers
> and stacks.
>
> The limitations are 'ip' only supports a single MAC address per VF and
> 'tc' doesn't work on VF ports because when the VF is assigned to a VM
> or namespace we lose visibility of it. Providing a VF rep for 

Re: [PATCH net-next 08/16] net/devlink: Add E-Switch mode control

2016-06-30 Thread John Fastabend
On 16-06-30 03:52 AM, Jiri Pirko wrote:
> Thu, Jun 30, 2016 at 09:57:21AM CEST, john.fastab...@gmail.com wrote:
>> On 16-06-30 12:41 AM, Jiri Pirko wrote:
>>> Thu, Jun 30, 2016 at 09:13:55AM CEST, sridhar.samudr...@intel.com wrote:


 On 6/29/2016 11:25 PM, Jiri Pirko wrote:
> Thu, Jun 30, 2016 at 06:04:39AM CEST, john.fastab...@gmail.com wrote:
>> On 16-06-29 08:35 PM, John Fastabend wrote:
>>> On 16-06-29 03:09 PM, John Fastabend wrote:
 On 16-06-29 02:33 PM, Or Gerlitz wrote:
> On Wed, Jun 29, 2016 at 7:35 PM, John Fastabend
>  wrote:
>> On 16-06-29 07:48 AM, Or Gerlitz wrote:
>>> On 6/28/2016 10:31 PM, John Fastabend wrote:
 On 16-06-28 12:12 PM, Jiri Pirko wrote:
> Why?! Please, leave legacy be legacy. Use the new mode for
> implementing new features. Don't make things any more complicated 
> :(
> [...]
 Maybe I'm reading to much into the devlink flag names and if 
 instead
 you use a switch like the following,
VF representer : enable/disable the creation VF netdev's to 
 represent
 the virtual functions on the PF
 Much less complicated then magic switching between forwarding 
 logic IMO
 and you don't whack a default configuration that an entire stack 
 (e.g.
 libvirt) has been built to use.
>>> Re letting the user to observe/modify the rules added by the
>>> driver/firmware while legacy mode. Even if possible with 
>>> bridge/fdb, it
>>> will be really pragmatical and doesn't make sense to get that 
>>> donefor
>>> the TC subsystem. So this isn't a well defined solution and anyway, 
>>> as
>>> you said, legacy mode enhancements is a different exercise. 
>>> Personally,
>>> I agree with Jiri, that we should legacy be legacyand focus on 
>>> adding
>>> the new model.
>> The ixgbe driver already supports bridge and tc commands without the 
>> VF
>> representer.  Adding the VF representer to these drivers just extends
>> the existing support so we have an identifier for VFs and now the
>> redirect action works and the fdb commands can specify the VF 
>> netdevs.
>> I don't see this as a problem because we already do it today with
>> 'ip' and bridge tools.
> To be precise, for both ixgbe and mlx5, the existing tc support
> (u32/ixgbe, flower/mlx5) is not for switching functionality but rather
> for NIC-ish one, e.g drop, mark, etc. Indeed in ixgbe you added
> redirect to VF, but this is only for south --> north (wire --> VF)
> traffic, w.o the VF rep you can't do the other way around.
>
 Correct which is why we need the VF rep. So we are completely in
 sync there.

> Just to clarify, to what exact bridge command support did you refer 
> for ixgbe?
 'bridge fdb' commands are supported today on the PF. But its the
 same story as above we need the VF rep to also use it on the
 VF representer

 Also 'bridge link' command for veb/vepa modes is supported and the
 other link attributes could be supported with additional driver
 support. No need for core changes here. But again yes only on the
 PF so again we need the VF reps.

> The forwarding done in the legacy mode is not well defined, and
> different across vendors, adding there the VF reps will not make it
> any better b/c some steering rules will be set by tc/bridge offloads
> while other rules will be put by the driver.
> I don't see how this takes us to better place.
 In legacy mode or any other mode you are defining some default policy
 and rules.

 In the legacy mode we use mac/vlan assigned l2 forwarding entries in 
 the
 hardware fdb which are seen when you query 'ip link' and 'bridge fdb'
 today. And similarly can be modified today using 'ip link' and 'bridge
 fdb' at least on the intel devices. Its not undefined in any way with
 a quick query of the tools we can learn exactly what the configuration
 is and even change it. This works fairly well with existing controllers
 and stacks.

 The limitations are 'ip' only supports a single MAC address per VF and
 'tc' doesn't work on VF ports because when the VF is assigned to a VM
 or namespace we lose visibility of it. Providing a VF rep for this
 solves both of those problems.

 In this new mode the default policy is to create a default miss rule
 and implement no l2 forwarding rules. Unfortunately not 

Re: [PATCH net-next 08/16] net/devlink: Add E-Switch mode control

2016-06-30 Thread Or Gerlitz
On Thu, Jun 30, 2016 at 1:52 PM, Jiri Pirko  wrote:

> Why not to have 2 modes:

> 1) lagacy - the current solution, blackbox eswitch, undefined behaviour
> 2) switchdev - with representors, all features possible as on physical
> switches, whitebox eswitch configured using standard tools?

yep, this makes sense to me. I will rename the offloads mode to be
called switchdev
and we'll respin V2 with few more small fixes and clarification on the
change log of this patch


Re: [PATCH net-next 08/16] net/devlink: Add E-Switch mode control

2016-06-30 Thread Jiri Pirko
Thu, Jun 30, 2016 at 09:57:21AM CEST, john.fastab...@gmail.com wrote:
>On 16-06-30 12:41 AM, Jiri Pirko wrote:
>> Thu, Jun 30, 2016 at 09:13:55AM CEST, sridhar.samudr...@intel.com wrote:
>>>
>>>
>>> On 6/29/2016 11:25 PM, Jiri Pirko wrote:
 Thu, Jun 30, 2016 at 06:04:39AM CEST, john.fastab...@gmail.com wrote:
> On 16-06-29 08:35 PM, John Fastabend wrote:
>> On 16-06-29 03:09 PM, John Fastabend wrote:
>>> On 16-06-29 02:33 PM, Or Gerlitz wrote:
 On Wed, Jun 29, 2016 at 7:35 PM, John Fastabend
  wrote:
> On 16-06-29 07:48 AM, Or Gerlitz wrote:
>> On 6/28/2016 10:31 PM, John Fastabend wrote:
>>> On 16-06-28 12:12 PM, Jiri Pirko wrote:
 Why?! Please, leave legacy be legacy. Use the new mode for
 implementing new features. Don't make things any more complicated 
 :(
 [...]
>>> Maybe I'm reading to much into the devlink flag names and if instead
>>> you use a switch like the following,
>>>VF representer : enable/disable the creation VF netdev's to 
>>> represent
>>> the virtual functions on the PF
>>> Much less complicated then magic switching between forwarding logic 
>>> IMO
>>> and you don't whack a default configuration that an entire stack 
>>> (e.g.
>>> libvirt) has been built to use.
>> Re letting the user to observe/modify the rules added by the
>> driver/firmware while legacy mode. Even if possible with bridge/fdb, 
>> it
>> will be really pragmatical and doesn't make sense to get that donefor
>> the TC subsystem. So this isn't a well defined solution and anyway, 
>> as
>> you said, legacy mode enhancements is a different exercise. 
>> Personally,
>> I agree with Jiri, that we should legacy be legacyand focus on adding
>> the new model.
> The ixgbe driver already supports bridge and tc commands without the 
> VF
> representer.  Adding the VF representer to these drivers just extends
> the existing support so we have an identifier for VFs and now the
> redirect action works and the fdb commands can specify the VF netdevs.
> I don't see this as a problem because we already do it today with
> 'ip' and bridge tools.
 To be precise, for both ixgbe and mlx5, the existing tc support
 (u32/ixgbe, flower/mlx5) is not for switching functionality but rather
 for NIC-ish one, e.g drop, mark, etc. Indeed in ixgbe you added
 redirect to VF, but this is only for south --> north (wire --> VF)
 traffic, w.o the VF rep you can't do the other way around.

>>> Correct which is why we need the VF rep. So we are completely in
>>> sync there.
>>>
 Just to clarify, to what exact bridge command support did you refer 
 for ixgbe?
>>> 'bridge fdb' commands are supported today on the PF. But its the
>>> same story as above we need the VF rep to also use it on the
>>> VF representer
>>>
>>> Also 'bridge link' command for veb/vepa modes is supported and the
>>> other link attributes could be supported with additional driver
>>> support. No need for core changes here. But again yes only on the
>>> PF so again we need the VF reps.
>>>
 The forwarding done in the legacy mode is not well defined, and
 different across vendors, adding there the VF reps will not make it
 any better b/c some steering rules will be set by tc/bridge offloads
 while other rules will be put by the driver.
 I don't see how this takes us to better place.
>>> In legacy mode or any other mode you are defining some default policy
>>> and rules.
>>>
>>> In the legacy mode we use mac/vlan assigned l2 forwarding entries in the
>>> hardware fdb which are seen when you query 'ip link' and 'bridge fdb'
>>> today. And similarly can be modified today using 'ip link' and 'bridge
>>> fdb' at least on the intel devices. Its not undefined in any way with
>>> a quick query of the tools we can learn exactly what the configuration
>>> is and even change it. This works fairly well with existing controllers
>>> and stacks.
>>>
>>> The limitations are 'ip' only supports a single MAC address per VF and
>>> 'tc' doesn't work on VF ports because when the VF is assigned to a VM
>>> or namespace we lose visibility of it. Providing a VF rep for this
>>> solves both of those problems.
>>>
>>> In this new mode the default policy is to create a default miss rule
>>> and implement no l2 forwarding rules. Unfortunately not all hardware
>>> in use supports this default miss rule case but would still benefit
>> >from having a VF rep. So we shouldn't make this a stipulation for
>>> enabling VF reps. It 

Re: [PATCH net-next 08/16] net/devlink: Add E-Switch mode control

2016-06-30 Thread John Fastabend
On 16-06-30 12:41 AM, Jiri Pirko wrote:
> Thu, Jun 30, 2016 at 09:13:55AM CEST, sridhar.samudr...@intel.com wrote:
>>
>>
>> On 6/29/2016 11:25 PM, Jiri Pirko wrote:
>>> Thu, Jun 30, 2016 at 06:04:39AM CEST, john.fastab...@gmail.com wrote:
 On 16-06-29 08:35 PM, John Fastabend wrote:
> On 16-06-29 03:09 PM, John Fastabend wrote:
>> On 16-06-29 02:33 PM, Or Gerlitz wrote:
>>> On Wed, Jun 29, 2016 at 7:35 PM, John Fastabend
>>>  wrote:
 On 16-06-29 07:48 AM, Or Gerlitz wrote:
> On 6/28/2016 10:31 PM, John Fastabend wrote:
>> On 16-06-28 12:12 PM, Jiri Pirko wrote:
>>> Why?! Please, leave legacy be legacy. Use the new mode for
>>> implementing new features. Don't make things any more complicated :(
>>> [...]
>> Maybe I'm reading to much into the devlink flag names and if instead
>> you use a switch like the following,
>>VF representer : enable/disable the creation VF netdev's to 
>> represent
>> the virtual functions on the PF
>> Much less complicated then magic switching between forwarding logic 
>> IMO
>> and you don't whack a default configuration that an entire stack 
>> (e.g.
>> libvirt) has been built to use.
> Re letting the user to observe/modify the rules added by the
> driver/firmware while legacy mode. Even if possible with bridge/fdb, 
> it
> will be really pragmatical and doesn't make sense to get that donefor
> the TC subsystem. So this isn't a well defined solution and anyway, as
> you said, legacy mode enhancements is a different exercise. 
> Personally,
> I agree with Jiri, that we should legacy be legacyand focus on adding
> the new model.
 The ixgbe driver already supports bridge and tc commands without the VF
 representer.  Adding the VF representer to these drivers just extends
 the existing support so we have an identifier for VFs and now the
 redirect action works and the fdb commands can specify the VF netdevs.
 I don't see this as a problem because we already do it today with
 'ip' and bridge tools.
>>> To be precise, for both ixgbe and mlx5, the existing tc support
>>> (u32/ixgbe, flower/mlx5) is not for switching functionality but rather
>>> for NIC-ish one, e.g drop, mark, etc. Indeed in ixgbe you added
>>> redirect to VF, but this is only for south --> north (wire --> VF)
>>> traffic, w.o the VF rep you can't do the other way around.
>>>
>> Correct which is why we need the VF rep. So we are completely in
>> sync there.
>>
>>> Just to clarify, to what exact bridge command support did you refer for 
>>> ixgbe?
>> 'bridge fdb' commands are supported today on the PF. But its the
>> same story as above we need the VF rep to also use it on the
>> VF representer
>>
>> Also 'bridge link' command for veb/vepa modes is supported and the
>> other link attributes could be supported with additional driver
>> support. No need for core changes here. But again yes only on the
>> PF so again we need the VF reps.
>>
>>> The forwarding done in the legacy mode is not well defined, and
>>> different across vendors, adding there the VF reps will not make it
>>> any better b/c some steering rules will be set by tc/bridge offloads
>>> while other rules will be put by the driver.
>>> I don't see how this takes us to better place.
>> In legacy mode or any other mode you are defining some default policy
>> and rules.
>>
>> In the legacy mode we use mac/vlan assigned l2 forwarding entries in the
>> hardware fdb which are seen when you query 'ip link' and 'bridge fdb'
>> today. And similarly can be modified today using 'ip link' and 'bridge
>> fdb' at least on the intel devices. Its not undefined in any way with
>> a quick query of the tools we can learn exactly what the configuration
>> is and even change it. This works fairly well with existing controllers
>> and stacks.
>>
>> The limitations are 'ip' only supports a single MAC address per VF and
>> 'tc' doesn't work on VF ports because when the VF is assigned to a VM
>> or namespace we lose visibility of it. Providing a VF rep for this
>> solves both of those problems.
>>
>> In this new mode the default policy is to create a default miss rule
>> and implement no l2 forwarding rules. Unfortunately not all hardware
>> in use supports this default miss rule case but would still benefit
> >from having a VF rep. So we shouldn't make this a stipulation for
>> enabling VF reps. It also changes a default policy that has been in
>> place for years without IMO at least any compelling reason. It will
>> be easy enough to change the default l2 policy to a flow based model

Re: [PATCH net-next 08/16] net/devlink: Add E-Switch mode control

2016-06-30 Thread Jiri Pirko
Thu, Jun 30, 2016 at 09:13:55AM CEST, sridhar.samudr...@intel.com wrote:
>
>
>On 6/29/2016 11:25 PM, Jiri Pirko wrote:
>>Thu, Jun 30, 2016 at 06:04:39AM CEST, john.fastab...@gmail.com wrote:
>>>On 16-06-29 08:35 PM, John Fastabend wrote:
On 16-06-29 03:09 PM, John Fastabend wrote:
>On 16-06-29 02:33 PM, Or Gerlitz wrote:
>>On Wed, Jun 29, 2016 at 7:35 PM, John Fastabend
>> wrote:
>>>On 16-06-29 07:48 AM, Or Gerlitz wrote:
On 6/28/2016 10:31 PM, John Fastabend wrote:
>On 16-06-28 12:12 PM, Jiri Pirko wrote:
>>Why?! Please, leave legacy be legacy. Use the new mode for
>>implementing new features. Don't make things any more complicated :(
>>[...]
>Maybe I'm reading to much into the devlink flag names and if instead
>you use a switch like the following,
>VF representer : enable/disable the creation VF netdev's to 
> represent
> the virtual functions on the PF
>Much less complicated then magic switching between forwarding logic IMO
>and you don't whack a default configuration that an entire stack (e.g.
>libvirt) has been built to use.
Re letting the user to observe/modify the rules added by the
driver/firmware while legacy mode. Even if possible with bridge/fdb, it
will be really pragmatical and doesn't make sense to get that donefor
the TC subsystem. So this isn't a well defined solution and anyway, as
you said, legacy mode enhancements is a different exercise. Personally,
I agree with Jiri, that we should legacy be legacyand focus on adding
the new model.
>>>The ixgbe driver already supports bridge and tc commands without the VF
>>>representer.  Adding the VF representer to these drivers just extends
>>>the existing support so we have an identifier for VFs and now the
>>>redirect action works and the fdb commands can specify the VF netdevs.
>>>I don't see this as a problem because we already do it today with
>>>'ip' and bridge tools.
>>To be precise, for both ixgbe and mlx5, the existing tc support
>>(u32/ixgbe, flower/mlx5) is not for switching functionality but rather
>>for NIC-ish one, e.g drop, mark, etc. Indeed in ixgbe you added
>>redirect to VF, but this is only for south --> north (wire --> VF)
>>traffic, w.o the VF rep you can't do the other way around.
>>
>Correct which is why we need the VF rep. So we are completely in
>sync there.
>
>>Just to clarify, to what exact bridge command support did you refer for 
>>ixgbe?
>'bridge fdb' commands are supported today on the PF. But its the
>same story as above we need the VF rep to also use it on the
>VF representer
>
>Also 'bridge link' command for veb/vepa modes is supported and the
>other link attributes could be supported with additional driver
>support. No need for core changes here. But again yes only on the
>PF so again we need the VF reps.
>
>>The forwarding done in the legacy mode is not well defined, and
>>different across vendors, adding there the VF reps will not make it
>>any better b/c some steering rules will be set by tc/bridge offloads
>>while other rules will be put by the driver.
>>I don't see how this takes us to better place.
>In legacy mode or any other mode you are defining some default policy
>and rules.
>
>In the legacy mode we use mac/vlan assigned l2 forwarding entries in the
>hardware fdb which are seen when you query 'ip link' and 'bridge fdb'
>today. And similarly can be modified today using 'ip link' and 'bridge
>fdb' at least on the intel devices. Its not undefined in any way with
>a quick query of the tools we can learn exactly what the configuration
>is and even change it. This works fairly well with existing controllers
>and stacks.
>
>The limitations are 'ip' only supports a single MAC address per VF and
>'tc' doesn't work on VF ports because when the VF is assigned to a VM
>or namespace we lose visibility of it. Providing a VF rep for this
>solves both of those problems.
>
>In this new mode the default policy is to create a default miss rule
>and implement no l2 forwarding rules. Unfortunately not all hardware
>in use supports this default miss rule case but would still benefit
>from having a VF rep. So we shouldn't make this a stipulation for
>enabling VF reps. It also changes a default policy that has been in
>place for years without IMO at least any compelling reason. It will
>be easy enough to change the default l2 policy to a flow based model
>with a few bridge/tc commands.
>
>>>We are also slightly in disagreement about what the default should be
>>>with VF netdevs. I think the default should be the same L2 mac/vlan
>>>switch behavior and see no reason to 

Re: [PATCH net-next 08/16] net/devlink: Add E-Switch mode control

2016-06-30 Thread Samudrala, Sridhar



On 6/29/2016 11:25 PM, Jiri Pirko wrote:

Thu, Jun 30, 2016 at 06:04:39AM CEST, john.fastab...@gmail.com wrote:

On 16-06-29 08:35 PM, John Fastabend wrote:

On 16-06-29 03:09 PM, John Fastabend wrote:

On 16-06-29 02:33 PM, Or Gerlitz wrote:

On Wed, Jun 29, 2016 at 7:35 PM, John Fastabend
 wrote:

On 16-06-29 07:48 AM, Or Gerlitz wrote:

On 6/28/2016 10:31 PM, John Fastabend wrote:

On 16-06-28 12:12 PM, Jiri Pirko wrote:

Why?! Please, leave legacy be legacy. Use the new mode for
implementing new features. Don't make things any more complicated :(

[...]

Maybe I'm reading to much into the devlink flag names and if instead
you use a switch like the following,
VF representer : enable/disable the creation VF netdev's to represent
 the virtual functions on the PF
Much less complicated then magic switching between forwarding logic IMO
and you don't whack a default configuration that an entire stack (e.g.
libvirt) has been built to use.

Re letting the user to observe/modify the rules added by the
driver/firmware while legacy mode. Even if possible with bridge/fdb, it
will be really pragmatical and doesn't make sense to get that donefor
the TC subsystem. So this isn't a well defined solution and anyway, as
you said, legacy mode enhancements is a different exercise. Personally,
I agree with Jiri, that we should legacy be legacyand focus on adding
the new model.

The ixgbe driver already supports bridge and tc commands without the VF
representer.  Adding the VF representer to these drivers just extends
the existing support so we have an identifier for VFs and now the
redirect action works and the fdb commands can specify the VF netdevs.
I don't see this as a problem because we already do it today with
'ip' and bridge tools.

To be precise, for both ixgbe and mlx5, the existing tc support
(u32/ixgbe, flower/mlx5) is not for switching functionality but rather
for NIC-ish one, e.g drop, mark, etc. Indeed in ixgbe you added
redirect to VF, but this is only for south --> north (wire --> VF)
traffic, w.o the VF rep you can't do the other way around.


Correct which is why we need the VF rep. So we are completely in
sync there.


Just to clarify, to what exact bridge command support did you refer for ixgbe?

'bridge fdb' commands are supported today on the PF. But its the
same story as above we need the VF rep to also use it on the
VF representer

Also 'bridge link' command for veb/vepa modes is supported and the
other link attributes could be supported with additional driver
support. No need for core changes here. But again yes only on the
PF so again we need the VF reps.


The forwarding done in the legacy mode is not well defined, and
different across vendors, adding there the VF reps will not make it
any better b/c some steering rules will be set by tc/bridge offloads
while other rules will be put by the driver.
I don't see how this takes us to better place.

In legacy mode or any other mode you are defining some default policy
and rules.

In the legacy mode we use mac/vlan assigned l2 forwarding entries in the
hardware fdb which are seen when you query 'ip link' and 'bridge fdb'
today. And similarly can be modified today using 'ip link' and 'bridge
fdb' at least on the intel devices. Its not undefined in any way with
a quick query of the tools we can learn exactly what the configuration
is and even change it. This works fairly well with existing controllers
and stacks.

The limitations are 'ip' only supports a single MAC address per VF and
'tc' doesn't work on VF ports because when the VF is assigned to a VM
or namespace we lose visibility of it. Providing a VF rep for this
solves both of those problems.

In this new mode the default policy is to create a default miss rule
and implement no l2 forwarding rules. Unfortunately not all hardware
in use supports this default miss rule case but would still benefit
from having a VF rep. So we shouldn't make this a stipulation for
enabling VF reps. It also changes a default policy that has been in
place for years without IMO at least any compelling reason. It will
be easy enough to change the default l2 policy to a flow based model
with a few bridge/tc commands.


We are also slightly in disagreement about what the default should be
with VF netdevs. I think the default should be the same L2 mac/vlan
switch behavior and see no reason to change it by default just because
we added VF netdevs. The infrastructure libvirt/openstack/etc are built
around this default today. But I guess nothing in this series specifies
what the defaults of any given driver will be. VF netdevs are still
useful even on older hardware that only supports mac/vlan forwarding to
expose statistics and send/receive control frames such as lldp.

Again, this is not about default engineering... and using the VF reps
(not VF netdevs) in legacy mode only make it more cryptic to my
opinion. I agree some changes would be needed in openstack to support
the 

Re: [PATCH net-next 08/16] net/devlink: Add E-Switch mode control

2016-06-30 Thread Jiri Pirko
Thu, Jun 30, 2016 at 06:04:39AM CEST, john.fastab...@gmail.com wrote:
>On 16-06-29 08:35 PM, John Fastabend wrote:
>> On 16-06-29 03:09 PM, John Fastabend wrote:
>>> On 16-06-29 02:33 PM, Or Gerlitz wrote:
 On Wed, Jun 29, 2016 at 7:35 PM, John Fastabend
  wrote:
> On 16-06-29 07:48 AM, Or Gerlitz wrote:
>> On 6/28/2016 10:31 PM, John Fastabend wrote:
>>> On 16-06-28 12:12 PM, Jiri Pirko wrote:

 Why?! Please, leave legacy be legacy. Use the new mode for
 implementing new features. Don't make things any more complicated :(

 [...]
>>> Maybe I'm reading to much into the devlink flag names and if instead
>>> you use a switch like the following,

>>>VF representer : enable/disable the creation VF netdev's to represent
>>> the virtual functions on the PF

>>> Much less complicated then magic switching between forwarding logic IMO
>>> and you don't whack a default configuration that an entire stack (e.g.
>>> libvirt) has been built to use.

>> Re letting the user to observe/modify the rules added by the
>> driver/firmware while legacy mode. Even if possible with bridge/fdb, it
>> will be really pragmatical and doesn't make sense to get that donefor
>> the TC subsystem. So this isn't a well defined solution and anyway, as
>> you said, legacy mode enhancements is a different exercise. Personally,
>> I agree with Jiri, that we should legacy be legacyand focus on adding
>> the new model.

> The ixgbe driver already supports bridge and tc commands without the VF
> representer.  Adding the VF representer to these drivers just extends
> the existing support so we have an identifier for VFs and now the
> redirect action works and the fdb commands can specify the VF netdevs.
> I don't see this as a problem because we already do it today with
> 'ip' and bridge tools.

 To be precise, for both ixgbe and mlx5, the existing tc support
 (u32/ixgbe, flower/mlx5) is not for switching functionality but rather
 for NIC-ish one, e.g drop, mark, etc. Indeed in ixgbe you added
 redirect to VF, but this is only for south --> north (wire --> VF)
 traffic, w.o the VF rep you can't do the other way around.

>>>
>>> Correct which is why we need the VF rep. So we are completely in
>>> sync there.
>>>
 Just to clarify, to what exact bridge command support did you refer for 
 ixgbe?
>>>
>>> 'bridge fdb' commands are supported today on the PF. But its the
>>> same story as above we need the VF rep to also use it on the
>>> VF representer
>>>
>>> Also 'bridge link' command for veb/vepa modes is supported and the
>>> other link attributes could be supported with additional driver
>>> support. No need for core changes here. But again yes only on the
>>> PF so again we need the VF reps.
>>>

 The forwarding done in the legacy mode is not well defined, and
 different across vendors, adding there the VF reps will not make it
 any better b/c some steering rules will be set by tc/bridge offloads
 while other rules will be put by the driver.
 I don't see how this takes us to better place.
>>>
>>> In legacy mode or any other mode you are defining some default policy
>>> and rules.
>>>
>>> In the legacy mode we use mac/vlan assigned l2 forwarding entries in the
>>> hardware fdb which are seen when you query 'ip link' and 'bridge fdb'
>>> today. And similarly can be modified today using 'ip link' and 'bridge
>>> fdb' at least on the intel devices. Its not undefined in any way with
>>> a quick query of the tools we can learn exactly what the configuration
>>> is and even change it. This works fairly well with existing controllers
>>> and stacks.
>>>
>>> The limitations are 'ip' only supports a single MAC address per VF and
>>> 'tc' doesn't work on VF ports because when the VF is assigned to a VM
>>> or namespace we lose visibility of it. Providing a VF rep for this
>>> solves both of those problems.
>>>
>>> In this new mode the default policy is to create a default miss rule
>>> and implement no l2 forwarding rules. Unfortunately not all hardware
>>> in use supports this default miss rule case but would still benefit
>>> from having a VF rep. So we shouldn't make this a stipulation for
>>> enabling VF reps. It also changes a default policy that has been in
>>> place for years without IMO at least any compelling reason. It will
>>> be easy enough to change the default l2 policy to a flow based model
>>> with a few bridge/tc commands.
>>>

> We are also slightly in disagreement about what the default should be
> with VF netdevs. I think the default should be the same L2 mac/vlan
> switch behavior and see no reason to change it by default just because
> we added VF netdevs. The infrastructure libvirt/openstack/etc are built
> around this default today. But I guess nothing in this 

Re: [PATCH net-next 08/16] net/devlink: Add E-Switch mode control

2016-06-29 Thread John Fastabend
On 16-06-29 08:35 PM, John Fastabend wrote:
> On 16-06-29 03:09 PM, John Fastabend wrote:
>> On 16-06-29 02:33 PM, Or Gerlitz wrote:
>>> On Wed, Jun 29, 2016 at 7:35 PM, John Fastabend
>>>  wrote:
 On 16-06-29 07:48 AM, Or Gerlitz wrote:
> On 6/28/2016 10:31 PM, John Fastabend wrote:
>> On 16-06-28 12:12 PM, Jiri Pirko wrote:
>>>
>>> Why?! Please, leave legacy be legacy. Use the new mode for
>>> implementing new features. Don't make things any more complicated :(
>>>
>>> [...]
>> Maybe I'm reading to much into the devlink flag names and if instead
>> you use a switch like the following,
>>>
>>VF representer : enable/disable the creation VF netdev's to represent
>> the virtual functions on the PF
>>>
>> Much less complicated then magic switching between forwarding logic IMO
>> and you don't whack a default configuration that an entire stack (e.g.
>> libvirt) has been built to use.
>>>
> Re letting the user to observe/modify the rules added by the
> driver/firmware while legacy mode. Even if possible with bridge/fdb, it
> will be really pragmatical and doesn't make sense to get that donefor
> the TC subsystem. So this isn't a well defined solution and anyway, as
> you said, legacy mode enhancements is a different exercise. Personally,
> I agree with Jiri, that we should legacy be legacyand focus on adding
> the new model.
>>>
 The ixgbe driver already supports bridge and tc commands without the VF
 representer.  Adding the VF representer to these drivers just extends
 the existing support so we have an identifier for VFs and now the
 redirect action works and the fdb commands can specify the VF netdevs.
 I don't see this as a problem because we already do it today with
 'ip' and bridge tools.
>>>
>>> To be precise, for both ixgbe and mlx5, the existing tc support
>>> (u32/ixgbe, flower/mlx5) is not for switching functionality but rather
>>> for NIC-ish one, e.g drop, mark, etc. Indeed in ixgbe you added
>>> redirect to VF, but this is only for south --> north (wire --> VF)
>>> traffic, w.o the VF rep you can't do the other way around.
>>>
>>
>> Correct which is why we need the VF rep. So we are completely in
>> sync there.
>>
>>> Just to clarify, to what exact bridge command support did you refer for 
>>> ixgbe?
>>
>> 'bridge fdb' commands are supported today on the PF. But its the
>> same story as above we need the VF rep to also use it on the
>> VF representer
>>
>> Also 'bridge link' command for veb/vepa modes is supported and the
>> other link attributes could be supported with additional driver
>> support. No need for core changes here. But again yes only on the
>> PF so again we need the VF reps.
>>
>>>
>>> The forwarding done in the legacy mode is not well defined, and
>>> different across vendors, adding there the VF reps will not make it
>>> any better b/c some steering rules will be set by tc/bridge offloads
>>> while other rules will be put by the driver.
>>> I don't see how this takes us to better place.
>>
>> In legacy mode or any other mode you are defining some default policy
>> and rules.
>>
>> In the legacy mode we use mac/vlan assigned l2 forwarding entries in the
>> hardware fdb which are seen when you query 'ip link' and 'bridge fdb'
>> today. And similarly can be modified today using 'ip link' and 'bridge
>> fdb' at least on the intel devices. Its not undefined in any way with
>> a quick query of the tools we can learn exactly what the configuration
>> is and even change it. This works fairly well with existing controllers
>> and stacks.
>>
>> The limitations are 'ip' only supports a single MAC address per VF and
>> 'tc' doesn't work on VF ports because when the VF is assigned to a VM
>> or namespace we lose visibility of it. Providing a VF rep for this
>> solves both of those problems.
>>
>> In this new mode the default policy is to create a default miss rule
>> and implement no l2 forwarding rules. Unfortunately not all hardware
>> in use supports this default miss rule case but would still benefit
>> from having a VF rep. So we shouldn't make this a stipulation for
>> enabling VF reps. It also changes a default policy that has been in
>> place for years without IMO at least any compelling reason. It will
>> be easy enough to change the default l2 policy to a flow based model
>> with a few bridge/tc commands.
>>
>>>
 We are also slightly in disagreement about what the default should be
 with VF netdevs. I think the default should be the same L2 mac/vlan
 switch behavior and see no reason to change it by default just because
 we added VF netdevs. The infrastructure libvirt/openstack/etc are built
 around this default today. But I guess nothing in this series specifies
 what the defaults of any given driver will be. VF netdevs are still
 useful even on older hardware that only supports mac/vlan forwarding to

Re: [PATCH net-next 08/16] net/devlink: Add E-Switch mode control

2016-06-29 Thread John Fastabend
On 16-06-29 03:09 PM, John Fastabend wrote:
> On 16-06-29 02:33 PM, Or Gerlitz wrote:
>> On Wed, Jun 29, 2016 at 7:35 PM, John Fastabend
>>  wrote:
>>> On 16-06-29 07:48 AM, Or Gerlitz wrote:
 On 6/28/2016 10:31 PM, John Fastabend wrote:
> On 16-06-28 12:12 PM, Jiri Pirko wrote:
>>
>> Why?! Please, leave legacy be legacy. Use the new mode for
>> implementing new features. Don't make things any more complicated :(
>>
>> [...]
> Maybe I'm reading to much into the devlink flag names and if instead
> you use a switch like the following,
>>
>VF representer : enable/disable the creation VF netdev's to represent
> the virtual functions on the PF
>>
> Much less complicated then magic switching between forwarding logic IMO
> and you don't whack a default configuration that an entire stack (e.g.
> libvirt) has been built to use.
>>
 Re letting the user to observe/modify the rules added by the
 driver/firmware while legacy mode. Even if possible with bridge/fdb, it
 will be really pragmatical and doesn't make sense to get that donefor
 the TC subsystem. So this isn't a well defined solution and anyway, as
 you said, legacy mode enhancements is a different exercise. Personally,
 I agree with Jiri, that we should legacy be legacyand focus on adding
 the new model.
>>
>>> The ixgbe driver already supports bridge and tc commands without the VF
>>> representer.  Adding the VF representer to these drivers just extends
>>> the existing support so we have an identifier for VFs and now the
>>> redirect action works and the fdb commands can specify the VF netdevs.
>>> I don't see this as a problem because we already do it today with
>>> 'ip' and bridge tools.
>>
>> To be precise, for both ixgbe and mlx5, the existing tc support
>> (u32/ixgbe, flower/mlx5) is not for switching functionality but rather
>> for NIC-ish one, e.g drop, mark, etc. Indeed in ixgbe you added
>> redirect to VF, but this is only for south --> north (wire --> VF)
>> traffic, w.o the VF rep you can't do the other way around.
>>
> 
> Correct which is why we need the VF rep. So we are completely in
> sync there.
> 
>> Just to clarify, to what exact bridge command support did you refer for 
>> ixgbe?
> 
> 'bridge fdb' commands are supported today on the PF. But its the
> same story as above we need the VF rep to also use it on the
> VF representer
> 
> Also 'bridge link' command for veb/vepa modes is supported and the
> other link attributes could be supported with additional driver
> support. No need for core changes here. But again yes only on the
> PF so again we need the VF reps.
> 
>>
>> The forwarding done in the legacy mode is not well defined, and
>> different across vendors, adding there the VF reps will not make it
>> any better b/c some steering rules will be set by tc/bridge offloads
>> while other rules will be put by the driver.
>> I don't see how this takes us to better place.
> 
> In legacy mode or any other mode you are defining some default policy
> and rules.
> 
> In the legacy mode we use mac/vlan assigned l2 forwarding entries in the
> hardware fdb which are seen when you query 'ip link' and 'bridge fdb'
> today. And similarly can be modified today using 'ip link' and 'bridge
> fdb' at least on the intel devices. Its not undefined in any way with
> a quick query of the tools we can learn exactly what the configuration
> is and even change it. This works fairly well with existing controllers
> and stacks.
> 
> The limitations are 'ip' only supports a single MAC address per VF and
> 'tc' doesn't work on VF ports because when the VF is assigned to a VM
> or namespace we lose visibility of it. Providing a VF rep for this
> solves both of those problems.
> 
> In this new mode the default policy is to create a default miss rule
> and implement no l2 forwarding rules. Unfortunately not all hardware
> in use supports this default miss rule case but would still benefit
> from having a VF rep. So we shouldn't make this a stipulation for
> enabling VF reps. It also changes a default policy that has been in
> place for years without IMO at least any compelling reason. It will
> be easy enough to change the default l2 policy to a flow based model
> with a few bridge/tc commands.
> 
>>
>>> We are also slightly in disagreement about what the default should be
>>> with VF netdevs. I think the default should be the same L2 mac/vlan
>>> switch behavior and see no reason to change it by default just because
>>> we added VF netdevs. The infrastructure libvirt/openstack/etc are built
>>> around this default today. But I guess nothing in this series specifies
>>> what the defaults of any given driver will be. VF netdevs are still
>>> useful even on older hardware that only supports mac/vlan forwarding to
>>> expose statistics and send/receive control frames such as lldp.
>>
>> Again, this is not about default engineering... and using the 

Re: [PATCH net-next 08/16] net/devlink: Add E-Switch mode control

2016-06-29 Thread John Fastabend
On 16-06-29 02:33 PM, Or Gerlitz wrote:
> On Wed, Jun 29, 2016 at 7:35 PM, John Fastabend
>  wrote:
>> On 16-06-29 07:48 AM, Or Gerlitz wrote:
>>> On 6/28/2016 10:31 PM, John Fastabend wrote:
 On 16-06-28 12:12 PM, Jiri Pirko wrote:
> 
> Why?! Please, leave legacy be legacy. Use the new mode for
> implementing new features. Don't make things any more complicated :(
> 
> [...]
 Maybe I'm reading to much into the devlink flag names and if instead
 you use a switch like the following,
> 
VF representer : enable/disable the creation VF netdev's to represent
 the virtual functions on the PF
> 
 Much less complicated then magic switching between forwarding logic IMO
 and you don't whack a default configuration that an entire stack (e.g.
 libvirt) has been built to use.
> 
>>> Re letting the user to observe/modify the rules added by the
>>> driver/firmware while legacy mode. Even if possible with bridge/fdb, it
>>> will be really pragmatical and doesn't make sense to get that donefor
>>> the TC subsystem. So this isn't a well defined solution and anyway, as
>>> you said, legacy mode enhancements is a different exercise. Personally,
>>> I agree with Jiri, that we should legacy be legacyand focus on adding
>>> the new model.
> 
>> The ixgbe driver already supports bridge and tc commands without the VF
>> representer.  Adding the VF representer to these drivers just extends
>> the existing support so we have an identifier for VFs and now the
>> redirect action works and the fdb commands can specify the VF netdevs.
>> I don't see this as a problem because we already do it today with
>> 'ip' and bridge tools.
> 
> To be precise, for both ixgbe and mlx5, the existing tc support
> (u32/ixgbe, flower/mlx5) is not for switching functionality but rather
> for NIC-ish one, e.g drop, mark, etc. Indeed in ixgbe you added
> redirect to VF, but this is only for south --> north (wire --> VF)
> traffic, w.o the VF rep you can't do the other way around.
> 

Correct which is why we need the VF rep. So we are completely in
sync there.

> Just to clarify, to what exact bridge command support did you refer for ixgbe?

'bridge fdb' commands are supported today on the PF. But its the
same story as above we need the VF rep to also use it on the
VF representer

Also 'bridge link' command for veb/vepa modes is supported and the
other link attributes could be supported with additional driver
support. No need for core changes here. But again yes only on the
PF so again we need the VF reps.

> 
> The forwarding done in the legacy mode is not well defined, and
> different across vendors, adding there the VF reps will not make it
> any better b/c some steering rules will be set by tc/bridge offloads
> while other rules will be put by the driver.
> I don't see how this takes us to better place.

In legacy mode or any other mode you are defining some default policy
and rules.

In the legacy mode we use mac/vlan assigned l2 forwarding entries in the
hardware fdb which are seen when you query 'ip link' and 'bridge fdb'
today. And similarly can be modified today using 'ip link' and 'bridge
fdb' at least on the intel devices. Its not undefined in any way with
a quick query of the tools we can learn exactly what the configuration
is and even change it. This works fairly well with existing controllers
and stacks.

The limitations are 'ip' only supports a single MAC address per VF and
'tc' doesn't work on VF ports because when the VF is assigned to a VM
or namespace we lose visibility of it. Providing a VF rep for this
solves both of those problems.

In this new mode the default policy is to create a default miss rule
and implement no l2 forwarding rules. Unfortunately not all hardware
in use supports this default miss rule case but would still benefit
from having a VF rep. So we shouldn't make this a stipulation for
enabling VF reps. It also changes a default policy that has been in
place for years without IMO at least any compelling reason. It will
be easy enough to change the default l2 policy to a flow based model
with a few bridge/tc commands.

> 
>> We are also slightly in disagreement about what the default should be
>> with VF netdevs. I think the default should be the same L2 mac/vlan
>> switch behavior and see no reason to change it by default just because
>> we added VF netdevs. The infrastructure libvirt/openstack/etc are built
>> around this default today. But I guess nothing in this series specifies
>> what the defaults of any given driver will be. VF netdevs are still
>> useful even on older hardware that only supports mac/vlan forwarding to
>> expose statistics and send/receive control frames such as lldp.
> 
> Again, this is not about default engineering... and using the VF reps
> (not VF netdevs) in legacy mode only make it more cryptic to my
> opinion. I agree some changes would be needed in openstack to support
> the new model, but this is how 

Re: [PATCH net-next 08/16] net/devlink: Add E-Switch mode control

2016-06-29 Thread Or Gerlitz
On Wed, Jun 29, 2016 at 7:35 PM, John Fastabend
 wrote:
> On 16-06-29 07:48 AM, Or Gerlitz wrote:
>> On 6/28/2016 10:31 PM, John Fastabend wrote:
>>> On 16-06-28 12:12 PM, Jiri Pirko wrote:

 Why?! Please, leave legacy be legacy. Use the new mode for
 implementing new features. Don't make things any more complicated :(

[...]
>>> Maybe I'm reading to much into the devlink flag names and if instead
>>> you use a switch like the following,

>>>VF representer : enable/disable the creation VF netdev's to represent
>>> the virtual functions on the PF

>>> Much less complicated then magic switching between forwarding logic IMO
>>> and you don't whack a default configuration that an entire stack (e.g.
>>> libvirt) has been built to use.

>> Re letting the user to observe/modify the rules added by the
>> driver/firmware while legacy mode. Even if possible with bridge/fdb, it
>> will be really pragmatical and doesn't make sense to get that donefor
>> the TC subsystem. So this isn't a well defined solution and anyway, as
>> you said, legacy mode enhancements is a different exercise. Personally,
>> I agree with Jiri, that we should legacy be legacyand focus on adding
>> the new model.

> The ixgbe driver already supports bridge and tc commands without the VF
> representer.  Adding the VF representer to these drivers just extends
> the existing support so we have an identifier for VFs and now the
> redirect action works and the fdb commands can specify the VF netdevs.
> I don't see this as a problem because we already do it today with
> 'ip' and bridge tools.

To be precise, for both ixgbe and mlx5, the existing tc support
(u32/ixgbe, flower/mlx5) is not for switching functionality but rather
for NIC-ish one, e.g drop, mark, etc. Indeed in ixgbe you added
redirect to VF, but this is only for south --> north (wire --> VF)
traffic, w.o the VF rep you can't do the other way around.

Just to clarify, to what exact bridge command support did you refer for ixgbe?

The forwarding done in the legacy mode is not well defined, and
different across vendors, adding there the VF reps will not make it
any better b/c some steering rules will be set by tc/bridge offloads
while other rules will be put by the driver.
I don't see how this takes us to better place.

> We are also slightly in disagreement about what the default should be
> with VF netdevs. I think the default should be the same L2 mac/vlan
> switch behavior and see no reason to change it by default just because
> we added VF netdevs. The infrastructure libvirt/openstack/etc are built
> around this default today. But I guess nothing in this series specifies
> what the defaults of any given driver will be. VF netdevs are still
> useful even on older hardware that only supports mac/vlan forwarding to
> expose statistics and send/receive control frames such as lldp.

Again, this is not about default engineering... and using the VF reps
(not VF netdevs) in legacy mode only make it more cryptic to my
opinion. I agree some changes would be needed in openstack to support
the new model, but this is how progress is made... you can't always
make all layer above you unchanged. Note that the VF reps behave the
same as tap devices (v-switch doing xmit on tap --> recv in VM, VM
sends --> recv on tap into the v-switch), so the change in open-stack
would not be that big.

[...]

> Why I think the VF representer is a per port ethtool flag and not a
> devlink option is my use case might be to assign a PF into a VM or
> namespace where I don't want VF netdevs.

again, we think the correct place to set how the eswitch is managed is
through eswitch manager PCI devices and not net devices and hence
ethtool is not the way to go.

Also, how do you want your e-switch to be managed in this case?


Re: [PATCH net-next 08/16] net/devlink: Add E-Switch mode control

2016-06-29 Thread John Fastabend
On 16-06-29 07:48 AM, Or Gerlitz wrote:
> On 6/28/2016 10:31 PM, John Fastabend wrote:
>> On 16-06-28 12:12 PM, Jiri Pirko wrote:
>>>
>>> Why?! Please, leave legacy be legacy. Use the new mode for
>>> implementing new features. Don't make things any more complicated :(
>>>
>> OK so how I read this is there are two things going on that are being
>> conflated together. Creating VF netdev's is linked to the PCIe
>> subsystems and brings VFs into the netdev model. This is a good thing
>> but doesn't need to be a global nic policy it can be per port hence
>> the ethtool flag vs devlink discussion. I don't actually have a use case
>> to have one port with VF netdevs and another without it so I'm not too
>> particular on this. Logically it looks like a per port setting because
>> the hardware has no issues with making one physical function create
>> a netdev for each of its VFs and the other one run without these
>> netdevs. This is why I called it out.
>>
>> How this relates to bridge, tc, etc. is now you have a identifier to
>> configure instead of using strange 'ip link set ... vf#' commands. This
>> is great. But I see no reason the hardware has to make changes to
>> the existing tables or any of this. Before we used 'bridge fdb' and 'ip
>> link' now we can use bridge tools more effectively and can deprecate
>> the overloaded use of ip. But again I see no reason to thrash the
>> forwarding state of the switch because we happen to be adding VFs.
>> Having a set of fdb rules to forward MAC/Vlan pairs (as we do now)
>> seems like a perfectly reasonable default. Add with this patch now
>> when I run 'fdb show' I can see the defaults.
>>
>> Maybe I'm reading to much into the devlink flag names and if instead
>> you use a switch like the following,
>>
>>VF representer : enable/disable the creation VF netdev's to represent
>> the virtual functions on the PF
>>
>>
>> Much less complicated then magic switching between forwarding logic IMO
>> and you don't whack a default configuration that an entire stack (e.g.
>> libvirt) has been built to use.
> 
> 
> John,
> 
> I'll try to address here the core questions and arguments you brought.
> 

thanks. Also just to reiterate I really like the series just a few
details here.

> Re letting the user to observe/modify the rules added by the
> driver/firmware while legacy mode. Even if possible with bridge/fdb, it
> will be really pragmatical and doesn't make sense to get that donefor
> the TC subsystem. So this isn't a well defined solution and anyway, as
> you said, legacy mode enhancements is a different exercise. Personally,
> I agree with Jiri, that we should legacy be legacyand focus on adding
> the new model.
> 

The ixgbe driver already supports bridge and tc commands without the VF
representer.  Adding the VF representer to these drivers just extends
the existing support so we have an identifier for VFs and now the
redirect action works and the fdb commands can specify the VF netdevs.
I don't see this as a problem because we already do it today with
'ip' and bridge tools.

We are also slightly in disagreement about what the default should be
with VF netdevs. I think the default should be the same L2 mac/vlan
switch behavior and see no reason to change it by default just because
we added VF netdevs. The infrastructure libvirt/openstack/etc are built
around this default today. But I guess nothing in this series specifies
what the defaults of any given driver will be. VF netdevs are still
useful even on older hardware that only supports mac/vlan forwarding to
expose statistics and send/receive control frames such as lldp.

> The new model has few building blocks, and by all means, have the VF
> representors is not the full story, which is not magic but rather the
> following:
> 
> 1. VF (vport) representors netdevices + the needed mechanics
> (send-to-vport rules that makes xmit on VF rep --> recv on VF)
> 

We all agree on this. For me this should be its own knob VF netdevs or
no VF netdevs.

There is also my point that this is really a port attribute of the PCIe
configuration not a switch attribute.

> 2. handling HW data-patch misses --> send to CPU or drop

Yep need this also but we have a standard way to configure this already
with bridge and 'tc' so why have a toggle for it? Also you don't know
in the driver where I want to send missed packets. In some use cases I
have the VM is managing the system and in these cases I want to send
missed packets to a VF.

In ixgbe we get this for free (with the vf identifier netdevs) because
we have 'tc' and 'bridge' already hooked up. With 'tc' you can define
a wild card match with low priority and with 'bridge' model you can
setup the flood ports to do this.

> 
> 3. ability to offload SW rules (tc/bridge/etc) using VF representors and
> ingress qdiscs / bridge fdb rules / switchdev fdb rule, etc
> 
> The knob we suggested says that the system is put into a state where
> 1,2,3 are needed to make it full 

Re: [PATCH net-next 08/16] net/devlink: Add E-Switch mode control

2016-06-29 Thread Or Gerlitz

On 6/28/2016 10:31 PM, John Fastabend wrote:

On 16-06-28 12:12 PM, Jiri Pirko wrote:


Why?! Please, leave legacy be legacy. Use the new mode for implementing new 
features. Don't make things any more complicated :(


OK so how I read this is there are two things going on that are being
conflated together. Creating VF netdev's is linked to the PCIe
subsystems and brings VFs into the netdev model. This is a good thing
but doesn't need to be a global nic policy it can be per port hence
the ethtool flag vs devlink discussion. I don't actually have a use case
to have one port with VF netdevs and another without it so I'm not too
particular on this. Logically it looks like a per port setting because
the hardware has no issues with making one physical function create
a netdev for each of its VFs and the other one run without these
netdevs. This is why I called it out.

How this relates to bridge, tc, etc. is now you have a identifier to
configure instead of using strange 'ip link set ... vf#' commands. This
is great. But I see no reason the hardware has to make changes to
the existing tables or any of this. Before we used 'bridge fdb' and 'ip
link' now we can use bridge tools more effectively and can deprecate
the overloaded use of ip. But again I see no reason to thrash the
forwarding state of the switch because we happen to be adding VFs.
Having a set of fdb rules to forward MAC/Vlan pairs (as we do now)
seems like a perfectly reasonable default. Add with this patch now
when I run 'fdb show' I can see the defaults.

Maybe I'm reading to much into the devlink flag names and if instead
you use a switch like the following,

   VF representer : enable/disable the creation VF netdev's to represent
the virtual functions on the PF


Much less complicated then magic switching between forwarding logic IMO
and you don't whack a default configuration that an entire stack (e.g.
libvirt) has been built to use.



John,

I'll try to address here the core questions and arguments you brought.

Re letting the user to observe/modify the rules added by the 
driver/firmware while legacy mode. Even if possible with bridge/fdb, it 
will be really pragmatical and doesn't make sense to get that donefor 
the TC subsystem. So this isn't a well defined solution and anyway, as 
you said, legacy mode enhancements is a different exercise. Personally, 
I agree with Jiri, that we should legacy be legacyand focus on adding 
the new model.


The new model has few building blocks, and by all means, have the VF 
representors is not the full story, which is not magic but rather the 
following:


1. VF (vport) representors netdevices + the needed mechanics 
(send-to-vport rules that makes xmit on VF rep --> recv on VF)


2. handling HW data-patch misses --> send to CPU or drop

3. ability to offload SW rules (tc/bridge/etc) using VF representors and 
ingress qdiscs / bridge fdb rules / switchdev fdb rule, etc


The knob we suggested says that the system is put into a state where 
1,2,3 are needed to make it full performance and functional one. This 
submission includes parts 1 and 2, so the offloading of SW rules will 
done in successive submission which uses TC offloads which are already 
upstream (u32 or flower).


So... we're almost in agreement, do you have another name for the knob 
that goes beyond creation/deletion of VF reps? maybe that would be it 
for making a progress...


Or.



Re: [PATCH net-next 08/16] net/devlink: Add E-Switch mode control

2016-06-29 Thread Or Gerlitz

On 6/28/2016 7:19 PM, John Fastabend wrote:

On 16-06-28 03:25 AM, Or Gerlitz wrote:

On 6/28/2016 8:57 AM, John Fastabend wrote:


hmm so in the hardware I have there is actually a l2 table and various
other tables so I don't have any issue with doing table setup. I would
like to see a table_create/table_delete/table_show devlink commands at
some point though but I'm not there yet. This would allow users to
optimize the table slices if they cared to. But that is future work
IMO. Certainly not needed in this series at least.


Agree that we could do that and agree that we need not do that now, as 
was agreed (...) in Seville,we are not yet to the geography (== HW 
tables and table graph) advertisement and setup class.






The offloads mode needs to create a black hole miss rule and
send-to-vport rules and create the tables so they can contain later
rules set by the kernel in a way which is HW/driver dependent.

Agreed a black hole miss rule needs to be applied but rather than apply
it automatically with some toggle I would prefer to just add a 'tc' rule
for this. Or alternatively it can be added by configuring flooding
ports so that only a single port is in the flooding mode. This could
all be done via 'bridge fdb ...' and 'bridge link ...' today I believe.
Then the user defines the state and not the driver writer. It really is
cleaner in my opinion.


The black hole serves for throwing packets arriving from **anywhere** 
and not matched to any other HW rule towards the CPU where the e-switch 
manager runs.  Hence, it would be correct in my opinion to have it set 
by the e-switch manager and it means when some API/knob is applied on 
PCI device and not network device, so tc and Co will not really serve 
nicely for that.



And send-to-vport rules I'm not entirely clear on what these actually
are used for. Is this a rule to match packets sent from a VF representer
netdev to the actual VF pcie device? If this is the case its seems to
me that any packet sent on a VF representer should be sent to the VF
directly and these rules can be created when the VF is created. Or did
you mean some other rule by this?


YES, send-to-vports rule serve for having the functionality which is 
described in the cover letter and on the relevant commit/s: doing xmit 
on VF rep netdevice always ends up with the packet to arrive the VF PCI 
device. We create these HW rules when in the offloads mode per each 
VF/rep indeed.


So when a driver SRIOV logic wakes up in offloads mode (hopefully will 
happen a lot soon...), they would (1) create VF reps (2) set these 
rules, sure.  Currently a transition from legacy to offloads is defined 
and these two acts are than on the transition, e.g for mlx5 whose 
current default is legacy.


Or.


Re: [PATCH net-next 08/16] net/devlink: Add E-Switch mode control

2016-06-28 Thread John Fastabend
On 16-06-28 12:12 PM, Jiri Pirko wrote:
> Tue, Jun 28, 2016 at 09:04:00PM CEST, sridhar.samudr...@intel.com wrote:
>>
>>
>> On 6/28/2016 11:46 AM, Jiri Pirko wrote:
>>> Tue, Jun 28, 2016 at 07:19:06PM CEST, john.fastab...@gmail.com wrote:
 On 16-06-28 09:19 AM, John Fastabend wrote:
> On 16-06-28 03:25 AM, Or Gerlitz wrote:
>> On 6/28/2016 8:57 AM, John Fastabend wrote:
>>> On 16-06-27 09:07 AM, Saeed Mahameed wrote:
 Add the commands to set and show the mode of SRIOV E-Switch, two
 modes are supported:

 * legacy   : operating in the "old" L2 based mode (DMAC --> VF vport)
 * offloads : offloading SW rules/policy (e.g Bridge/FDB or TC/Flows
 based) set by the host OS

 Nice work overall also I really appreciated that the core networking
 interfaces appear to able to support this without any change.
>> thanks..
>>
 On this patch though do we really need modes like this? My concern with
 modes is two fold. One its another knob that some controller will have
 to get right which I would prefer to avoid. And two I suspect switching
 between the two modes flushes the tables or leaves them in some
 unexpected state? At least I can't figure out what the expected should
 be off-hand.
>> Re the 1st concern (another knob), I think we do want that, see below
>>
>> Re the 2nd concern, I will re-read the cover letter and change logs and
>> if needed clarify/improve: the transition is clean! When you are moving
> >from legacy to offloads or the other way around, nothing is left in
>> unexpected state,  all HW forwarding tables as filled by the current
>> mode are flushed and next they are set as needed for the new mode.
>>
> OK if I had read the entire patch series maybe I would have caught this
> :)
>
 Could we instead continue to use the "legacy" mode by default by just
 populating the fdb table correctly and then if users want to enable
 the "offloads" mode they can modify the fdb tables by deleting entries
 or adding them or just extending the dmac/vf mapping via 'tc'. This
 would seem natural to me. The flooding rules in fdb might need to be
 exposed a bit more cleanly to get the right default flooding behavior
 etc. But to me at least this would be much cleaner. Everything will be
 nicely defined and we wont have issues with drivers doing slightly
 and subtle different defaults between legacy/offload and the 
 transitions
 between the states or on resets or etc. If users need to discover the
 current configuration then they just query fdb, query tc, and the state
 is known no need for any magic toggle switch as best I can see.
>>
>> Few comments here:
>>
>> Each mode has it's own way of the driver doing setup for the HW tables
>> and how population of the HW tables is done.
> hmm so in the hardware I have there is actually a l2 table and various
> other tables so I don't have any issue with doing table setup. I would
> like to see a table_create/table_delete/table_show devlink commands at
> some point though but I'm not there yet. This would allow users to
> optimize the table slices if they cared to. But that is future work
> IMO. Certainly not needed in this series at least. If you want I can
> show you a patch I had for this against rocker but it was before devlink
> so it would need some porting.
>
>> The offloads mode needs to create a black hole miss rule and
>> send-to-vport rules and create the tables so they can contain later
>> rules set by the kernel in a way which is HW/driver dependent.
> Agreed a black hole miss rule needs to be applied but rather than apply
> it automatically with some toggle I would prefer to just add a 'tc' rule
> for this. Or alternatively it can be added by configuring flooding
> ports so that only a single port is in the flooding mode. This could
> all be done via 'bridge fdb ...' and 'bridge link ...' today I believe.
> Then the user defines the state and not the driver writer. It really is
> cleaner in my opinion.
>
> One oddball case I have is if I have two PF functions behind a single
> network facing port. Yes its a bit strange but in this case its nice to
> pick which host facing PF to flood on vs the driver picking one.
>
> And send-to-vport rules I'm not entirely clear on what these actually
> are used for. Is this a rule to match packets sent from a VF representer
> netdev to the actual VF pcie device? If this is the case its seems to
> me that any packet sent on a VF representer should be sent to the VF
> directly and these rules can be created when the VF is created. Or did
> you mean some other rule by this?
>
>> The legacy mode 

Re: [PATCH net-next 08/16] net/devlink: Add E-Switch mode control

2016-06-28 Thread Jiri Pirko
Tue, Jun 28, 2016 at 09:04:00PM CEST, sridhar.samudr...@intel.com wrote:
>
>
>On 6/28/2016 11:46 AM, Jiri Pirko wrote:
>>Tue, Jun 28, 2016 at 07:19:06PM CEST, john.fastab...@gmail.com wrote:
>>>On 16-06-28 09:19 AM, John Fastabend wrote:
On 16-06-28 03:25 AM, Or Gerlitz wrote:
>On 6/28/2016 8:57 AM, John Fastabend wrote:
>>On 16-06-27 09:07 AM, Saeed Mahameed wrote:
>>>Add the commands to set and show the mode of SRIOV E-Switch, two
>>>modes are supported:
>>>
>>>* legacy   : operating in the "old" L2 based mode (DMAC --> VF vport)
>>>* offloads : offloading SW rules/policy (e.g Bridge/FDB or TC/Flows
>>>based) set by the host OS
>>>
>>>Nice work overall also I really appreciated that the core networking
>>>interfaces appear to able to support this without any change.
>thanks..
>
>>>On this patch though do we really need modes like this? My concern with
>>>modes is two fold. One its another knob that some controller will have
>>>to get right which I would prefer to avoid. And two I suspect switching
>>>between the two modes flushes the tables or leaves them in some
>>>unexpected state? At least I can't figure out what the expected should
>>>be off-hand.
>Re the 1st concern (another knob), I think we do want that, see below
>
>Re the 2nd concern, I will re-read the cover letter and change logs and
>if needed clarify/improve: the transition is clean! When you are moving
>from legacy to offloads or the other way around, nothing is left in
>unexpected state,  all HW forwarding tables as filled by the current
>mode are flushed and next they are set as needed for the new mode.
>
OK if I had read the entire patch series maybe I would have caught this
:)

>>>Could we instead continue to use the "legacy" mode by default by just
>>>populating the fdb table correctly and then if users want to enable
>>>the "offloads" mode they can modify the fdb tables by deleting entries
>>>or adding them or just extending the dmac/vf mapping via 'tc'. This
>>>would seem natural to me. The flooding rules in fdb might need to be
>>>exposed a bit more cleanly to get the right default flooding behavior
>>>etc. But to me at least this would be much cleaner. Everything will be
>>>nicely defined and we wont have issues with drivers doing slightly
>>>and subtle different defaults between legacy/offload and the transitions
>>>between the states or on resets or etc. If users need to discover the
>>>current configuration then they just query fdb, query tc, and the state
>>>is known no need for any magic toggle switch as best I can see.
>
>Few comments here:
>
>Each mode has it's own way of the driver doing setup for the HW tables
>and how population of the HW tables is done.
hmm so in the hardware I have there is actually a l2 table and various
other tables so I don't have any issue with doing table setup. I would
like to see a table_create/table_delete/table_show devlink commands at
some point though but I'm not there yet. This would allow users to
optimize the table slices if they cared to. But that is future work
IMO. Certainly not needed in this series at least. If you want I can
show you a patch I had for this against rocker but it was before devlink
so it would need some porting.

>The offloads mode needs to create a black hole miss rule and
>send-to-vport rules and create the tables so they can contain later
>rules set by the kernel in a way which is HW/driver dependent.
Agreed a black hole miss rule needs to be applied but rather than apply
it automatically with some toggle I would prefer to just add a 'tc' rule
for this. Or alternatively it can be added by configuring flooding
ports so that only a single port is in the flooding mode. This could
all be done via 'bridge fdb ...' and 'bridge link ...' today I believe.
Then the user defines the state and not the driver writer. It really is
cleaner in my opinion.

One oddball case I have is if I have two PF functions behind a single
network facing port. Yes its a bit strange but in this case its nice to
pick which host facing PF to flood on vs the driver picking one.

And send-to-vport rules I'm not entirely clear on what these actually
are used for. Is this a rule to match packets sent from a VF representer
netdev to the actual VF pcie device? If this is the case its seems to
me that any packet sent on a VF representer should be sent to the VF
directly and these rules can be created when the VF is created. Or did
you mean some other rule by this?

>The legacy mode creates the tables differently and populates them later
>with rule set by
>the driver and not the kernel.
>
>Even if we put the different table setup issue a side, I don't think it
>would be 

Re: [PATCH net-next 08/16] net/devlink: Add E-Switch mode control

2016-06-28 Thread Samudrala, Sridhar



On 6/28/2016 11:46 AM, Jiri Pirko wrote:

Tue, Jun 28, 2016 at 07:19:06PM CEST, john.fastab...@gmail.com wrote:

On 16-06-28 09:19 AM, John Fastabend wrote:

On 16-06-28 03:25 AM, Or Gerlitz wrote:

On 6/28/2016 8:57 AM, John Fastabend wrote:

On 16-06-27 09:07 AM, Saeed Mahameed wrote:

Add the commands to set and show the mode of SRIOV E-Switch, two
modes are supported:

* legacy   : operating in the "old" L2 based mode (DMAC --> VF vport)
* offloads : offloading SW rules/policy (e.g Bridge/FDB or TC/Flows
based) set by the host OS

Nice work overall also I really appreciated that the core networking
interfaces appear to able to support this without any change.

thanks..


On this patch though do we really need modes like this? My concern with
modes is two fold. One its another knob that some controller will have
to get right which I would prefer to avoid. And two I suspect switching
between the two modes flushes the tables or leaves them in some
unexpected state? At least I can't figure out what the expected should
be off-hand.

Re the 1st concern (another knob), I think we do want that, see below

Re the 2nd concern, I will re-read the cover letter and change logs and
if needed clarify/improve: the transition is clean! When you are moving
from legacy to offloads or the other way around, nothing is left in
unexpected state,  all HW forwarding tables as filled by the current
mode are flushed and next they are set as needed for the new mode.


OK if I had read the entire patch series maybe I would have caught this
:)


Could we instead continue to use the "legacy" mode by default by just
populating the fdb table correctly and then if users want to enable
the "offloads" mode they can modify the fdb tables by deleting entries
or adding them or just extending the dmac/vf mapping via 'tc'. This
would seem natural to me. The flooding rules in fdb might need to be
exposed a bit more cleanly to get the right default flooding behavior
etc. But to me at least this would be much cleaner. Everything will be
nicely defined and we wont have issues with drivers doing slightly
and subtle different defaults between legacy/offload and the transitions
between the states or on resets or etc. If users need to discover the
current configuration then they just query fdb, query tc, and the state
is known no need for any magic toggle switch as best I can see.


Few comments here:

Each mode has it's own way of the driver doing setup for the HW tables
and how population of the HW tables is done.

hmm so in the hardware I have there is actually a l2 table and various
other tables so I don't have any issue with doing table setup. I would
like to see a table_create/table_delete/table_show devlink commands at
some point though but I'm not there yet. This would allow users to
optimize the table slices if they cared to. But that is future work
IMO. Certainly not needed in this series at least. If you want I can
show you a patch I had for this against rocker but it was before devlink
so it would need some porting.


The offloads mode needs to create a black hole miss rule and
send-to-vport rules and create the tables so they can contain later
rules set by the kernel in a way which is HW/driver dependent.

Agreed a black hole miss rule needs to be applied but rather than apply
it automatically with some toggle I would prefer to just add a 'tc' rule
for this. Or alternatively it can be added by configuring flooding
ports so that only a single port is in the flooding mode. This could
all be done via 'bridge fdb ...' and 'bridge link ...' today I believe.
Then the user defines the state and not the driver writer. It really is
cleaner in my opinion.

One oddball case I have is if I have two PF functions behind a single
network facing port. Yes its a bit strange but in this case its nice to
pick which host facing PF to flood on vs the driver picking one.

And send-to-vport rules I'm not entirely clear on what these actually
are used for. Is this a rule to match packets sent from a VF representer
netdev to the actual VF pcie device? If this is the case its seems to
me that any packet sent on a VF representer should be sent to the VF
directly and these rules can be created when the VF is created. Or did
you mean some other rule by this?


The legacy mode creates the tables differently and populates them later
with rule set by
the driver and not the kernel.

Even if we put the different table setup issue a side, I don't think it
would be correct for bridge/tc to remove rules they didn't add, which is
needed under your proposal when moving from legacy type rules to
offloads mode. Querying is problematic too, since legacy could (and
does) involve some default rules set by the FW, e.g that deals with
outer world (== not belonging to VM on this host) MACs which are
invisible to the driver.

But even legacy mode should report the correct fdb table and setup.
I don't think querying should be a problem if the driver reports the
configuration 

Re: [PATCH net-next 08/16] net/devlink: Add E-Switch mode control

2016-06-28 Thread Jiri Pirko
Tue, Jun 28, 2016 at 07:19:06PM CEST, john.fastab...@gmail.com wrote:
>On 16-06-28 09:19 AM, John Fastabend wrote:
>> On 16-06-28 03:25 AM, Or Gerlitz wrote:
>>> On 6/28/2016 8:57 AM, John Fastabend wrote:
 On 16-06-27 09:07 AM, Saeed Mahameed wrote:
> Add the commands to set and show the mode of SRIOV E-Switch, two
> modes are supported:
>
> * legacy   : operating in the "old" L2 based mode (DMAC --> VF vport)
> * offloads : offloading SW rules/policy (e.g Bridge/FDB or TC/Flows
> based) set by the host OS
>
> Nice work overall also I really appreciated that the core networking
> interfaces appear to able to support this without any change.
>>>
>>> thanks..
>>>
> On this patch though do we really need modes like this? My concern with
> modes is two fold. One its another knob that some controller will have
> to get right which I would prefer to avoid. And two I suspect switching
> between the two modes flushes the tables or leaves them in some
> unexpected state? At least I can't figure out what the expected should
> be off-hand.
>>>
>>> Re the 1st concern (another knob), I think we do want that, see below
>>>
>>> Re the 2nd concern, I will re-read the cover letter and change logs and
>>> if needed clarify/improve: the transition is clean! When you are moving
>>> from legacy to offloads or the other way around, nothing is left in
>>> unexpected state,  all HW forwarding tables as filled by the current
>>> mode are flushed and next they are set as needed for the new mode.
>>>
>> 
>> OK if I had read the entire patch series maybe I would have caught this
>> :)
>> 
> Could we instead continue to use the "legacy" mode by default by just
> populating the fdb table correctly and then if users want to enable
> the "offloads" mode they can modify the fdb tables by deleting entries
> or adding them or just extending the dmac/vf mapping via 'tc'. This
> would seem natural to me. The flooding rules in fdb might need to be
> exposed a bit more cleanly to get the right default flooding behavior
> etc. But to me at least this would be much cleaner. Everything will be
> nicely defined and we wont have issues with drivers doing slightly
> and subtle different defaults between legacy/offload and the transitions
> between the states or on resets or etc. If users need to discover the
> current configuration then they just query fdb, query tc, and the state
> is known no need for any magic toggle switch as best I can see.
>>>
>>>
>>> Few comments here:
>>>
>>> Each mode has it's own way of the driver doing setup for the HW tables
>>> and how population of the HW tables is done.
>> 
>> hmm so in the hardware I have there is actually a l2 table and various
>> other tables so I don't have any issue with doing table setup. I would
>> like to see a table_create/table_delete/table_show devlink commands at
>> some point though but I'm not there yet. This would allow users to
>> optimize the table slices if they cared to. But that is future work
>> IMO. Certainly not needed in this series at least. If you want I can
>> show you a patch I had for this against rocker but it was before devlink
>> so it would need some porting.
>> 
>>>
>>> The offloads mode needs to create a black hole miss rule and
>>> send-to-vport rules and create the tables so they can contain later
>>> rules set by the kernel in a way which is HW/driver dependent.
>> 
>> Agreed a black hole miss rule needs to be applied but rather than apply
>> it automatically with some toggle I would prefer to just add a 'tc' rule
>> for this. Or alternatively it can be added by configuring flooding
>> ports so that only a single port is in the flooding mode. This could
>> all be done via 'bridge fdb ...' and 'bridge link ...' today I believe.
>> Then the user defines the state and not the driver writer. It really is
>> cleaner in my opinion.
>> 
>> One oddball case I have is if I have two PF functions behind a single
>> network facing port. Yes its a bit strange but in this case its nice to
>> pick which host facing PF to flood on vs the driver picking one.
>> 
>> And send-to-vport rules I'm not entirely clear on what these actually
>> are used for. Is this a rule to match packets sent from a VF representer
>> netdev to the actual VF pcie device? If this is the case its seems to
>> me that any packet sent on a VF representer should be sent to the VF
>> directly and these rules can be created when the VF is created. Or did
>> you mean some other rule by this?
>> 
>>>
>>> The legacy mode creates the tables differently and populates them later
>>> with rule set by
>>> the driver and not the kernel.
>>>
>>> Even if we put the different table setup issue a side, I don't think it
>>> would be correct for bridge/tc to remove rules they didn't add, which is
>>> needed under your proposal when moving from legacy type rules to
>>> offloads mode. Querying is problematic too, 

Re: [PATCH net-next 08/16] net/devlink: Add E-Switch mode control

2016-06-28 Thread John Fastabend
On 16-06-28 09:19 AM, John Fastabend wrote:
> On 16-06-28 03:25 AM, Or Gerlitz wrote:
>> On 6/28/2016 8:57 AM, John Fastabend wrote:
>>> On 16-06-27 09:07 AM, Saeed Mahameed wrote:
 Add the commands to set and show the mode of SRIOV E-Switch, two
 modes are supported:

 * legacy   : operating in the "old" L2 based mode (DMAC --> VF vport)
 * offloads : offloading SW rules/policy (e.g Bridge/FDB or TC/Flows
 based) set by the host OS

 Nice work overall also I really appreciated that the core networking
 interfaces appear to able to support this without any change.
>>
>> thanks..
>>
 On this patch though do we really need modes like this? My concern with
 modes is two fold. One its another knob that some controller will have
 to get right which I would prefer to avoid. And two I suspect switching
 between the two modes flushes the tables or leaves them in some
 unexpected state? At least I can't figure out what the expected should
 be off-hand.
>>
>> Re the 1st concern (another knob), I think we do want that, see below
>>
>> Re the 2nd concern, I will re-read the cover letter and change logs and
>> if needed clarify/improve: the transition is clean! When you are moving
>> from legacy to offloads or the other way around, nothing is left in
>> unexpected state,  all HW forwarding tables as filled by the current
>> mode are flushed and next they are set as needed for the new mode.
>>
> 
> OK if I had read the entire patch series maybe I would have caught this
> :)
> 
 Could we instead continue to use the "legacy" mode by default by just
 populating the fdb table correctly and then if users want to enable
 the "offloads" mode they can modify the fdb tables by deleting entries
 or adding them or just extending the dmac/vf mapping via 'tc'. This
 would seem natural to me. The flooding rules in fdb might need to be
 exposed a bit more cleanly to get the right default flooding behavior
 etc. But to me at least this would be much cleaner. Everything will be
 nicely defined and we wont have issues with drivers doing slightly
 and subtle different defaults between legacy/offload and the transitions
 between the states or on resets or etc. If users need to discover the
 current configuration then they just query fdb, query tc, and the state
 is known no need for any magic toggle switch as best I can see.
>>
>>
>> Few comments here:
>>
>> Each mode has it's own way of the driver doing setup for the HW tables
>> and how population of the HW tables is done.
> 
> hmm so in the hardware I have there is actually a l2 table and various
> other tables so I don't have any issue with doing table setup. I would
> like to see a table_create/table_delete/table_show devlink commands at
> some point though but I'm not there yet. This would allow users to
> optimize the table slices if they cared to. But that is future work
> IMO. Certainly not needed in this series at least. If you want I can
> show you a patch I had for this against rocker but it was before devlink
> so it would need some porting.
> 
>>
>> The offloads mode needs to create a black hole miss rule and
>> send-to-vport rules and create the tables so they can contain later
>> rules set by the kernel in a way which is HW/driver dependent.
> 
> Agreed a black hole miss rule needs to be applied but rather than apply
> it automatically with some toggle I would prefer to just add a 'tc' rule
> for this. Or alternatively it can be added by configuring flooding
> ports so that only a single port is in the flooding mode. This could
> all be done via 'bridge fdb ...' and 'bridge link ...' today I believe.
> Then the user defines the state and not the driver writer. It really is
> cleaner in my opinion.
> 
> One oddball case I have is if I have two PF functions behind a single
> network facing port. Yes its a bit strange but in this case its nice to
> pick which host facing PF to flood on vs the driver picking one.
> 
> And send-to-vport rules I'm not entirely clear on what these actually
> are used for. Is this a rule to match packets sent from a VF representer
> netdev to the actual VF pcie device? If this is the case its seems to
> me that any packet sent on a VF representer should be sent to the VF
> directly and these rules can be created when the VF is created. Or did
> you mean some other rule by this?
> 
>>
>> The legacy mode creates the tables differently and populates them later
>> with rule set by
>> the driver and not the kernel.
>>
>> Even if we put the different table setup issue a side, I don't think it
>> would be correct for bridge/tc to remove rules they didn't add, which is
>> needed under your proposal when moving from legacy type rules to
>> offloads mode. Querying is problematic too, since legacy could (and
>> does) involve some default rules set by the FW, e.g that deals with
>> outer world (== not belonging to VM on this host) MACs which are
>> 

Re: [PATCH net-next 08/16] net/devlink: Add E-Switch mode control

2016-06-28 Thread John Fastabend
On 16-06-28 03:25 AM, Or Gerlitz wrote:
> On 6/28/2016 8:57 AM, John Fastabend wrote:
>> On 16-06-27 09:07 AM, Saeed Mahameed wrote:
>>> Add the commands to set and show the mode of SRIOV E-Switch, two
>>> modes are supported:
>>>
>>> * legacy   : operating in the "old" L2 based mode (DMAC --> VF vport)
>>> * offloads : offloading SW rules/policy (e.g Bridge/FDB or TC/Flows
>>> based) set by the host OS
>>>
>>> Nice work overall also I really appreciated that the core networking
>>> interfaces appear to able to support this without any change.
> 
> thanks..
> 
>>> On this patch though do we really need modes like this? My concern with
>>> modes is two fold. One its another knob that some controller will have
>>> to get right which I would prefer to avoid. And two I suspect switching
>>> between the two modes flushes the tables or leaves them in some
>>> unexpected state? At least I can't figure out what the expected should
>>> be off-hand.
> 
> Re the 1st concern (another knob), I think we do want that, see below
> 
> Re the 2nd concern, I will re-read the cover letter and change logs and
> if needed clarify/improve: the transition is clean! When you are moving
> from legacy to offloads or the other way around, nothing is left in
> unexpected state,  all HW forwarding tables as filled by the current
> mode are flushed and next they are set as needed for the new mode.
> 

OK if I had read the entire patch series maybe I would have caught this
:)

>>> Could we instead continue to use the "legacy" mode by default by just
>>> populating the fdb table correctly and then if users want to enable
>>> the "offloads" mode they can modify the fdb tables by deleting entries
>>> or adding them or just extending the dmac/vf mapping via 'tc'. This
>>> would seem natural to me. The flooding rules in fdb might need to be
>>> exposed a bit more cleanly to get the right default flooding behavior
>>> etc. But to me at least this would be much cleaner. Everything will be
>>> nicely defined and we wont have issues with drivers doing slightly
>>> and subtle different defaults between legacy/offload and the transitions
>>> between the states or on resets or etc. If users need to discover the
>>> current configuration then they just query fdb, query tc, and the state
>>> is known no need for any magic toggle switch as best I can see.
> 
> 
> Few comments here:
> 
> Each mode has it's own way of the driver doing setup for the HW tables
> and how population of the HW tables is done.

hmm so in the hardware I have there is actually a l2 table and various
other tables so I don't have any issue with doing table setup. I would
like to see a table_create/table_delete/table_show devlink commands at
some point though but I'm not there yet. This would allow users to
optimize the table slices if they cared to. But that is future work
IMO. Certainly not needed in this series at least. If you want I can
show you a patch I had for this against rocker but it was before devlink
so it would need some porting.

> 
> The offloads mode needs to create a black hole miss rule and
> send-to-vport rules and create the tables so they can contain later
> rules set by the kernel in a way which is HW/driver dependent.

Agreed a black hole miss rule needs to be applied but rather than apply
it automatically with some toggle I would prefer to just add a 'tc' rule
for this. Or alternatively it can be added by configuring flooding
ports so that only a single port is in the flooding mode. This could
all be done via 'bridge fdb ...' and 'bridge link ...' today I believe.
Then the user defines the state and not the driver writer. It really is
cleaner in my opinion.

One oddball case I have is if I have two PF functions behind a single
network facing port. Yes its a bit strange but in this case its nice to
pick which host facing PF to flood on vs the driver picking one.

And send-to-vport rules I'm not entirely clear on what these actually
are used for. Is this a rule to match packets sent from a VF representer
netdev to the actual VF pcie device? If this is the case its seems to
me that any packet sent on a VF representer should be sent to the VF
directly and these rules can be created when the VF is created. Or did
you mean some other rule by this?

> 
> The legacy mode creates the tables differently and populates them later
> with rule set by
> the driver and not the kernel.
> 
> Even if we put the different table setup issue a side, I don't think it
> would be correct for bridge/tc to remove rules they didn't add, which is
> needed under your proposal when moving from legacy type rules to
> offloads mode. Querying is problematic too, since legacy could (and
> does) involve some default rules set by the FW, e.g that deals with
> outer world (== not belonging to VM on this host) MACs which are
> invisible to the driver.

But even legacy mode should report the correct fdb table and setup.
I don't think querying should be a problem if the driver reports the

Re: [PATCH net-next 08/16] net/devlink: Add E-Switch mode control

2016-06-28 Thread Jiri Pirko
Mon, Jun 27, 2016 at 06:07:21PM CEST, sae...@mellanox.com wrote:
>From: Or Gerlitz 
>
>Add the commands to set and show the mode of SRIOV E-Switch,
>two modes are supported:
>
>* legacy   : operating in the "old" L2 based mode (DMAC --> VF vport)
>* offloads : offloading SW rules/policy (e.g Bridge/FDB or TC/Flows based) set 
>by the host OS
>
>Signed-off-by: Or Gerlitz 
>Signed-off-by: Saeed Mahameed 

Acked-by: Jiri Pirko 

Looks fine to me. Usable for many drivers of devices containing embedded
switch. We need this for clean transition from legacy handling of embedded
switches we have in drivers currently to new switchdev model.

Thanks!


Re: [PATCH net-next 08/16] net/devlink: Add E-Switch mode control

2016-06-28 Thread Or Gerlitz

On 6/28/2016 8:57 AM, John Fastabend wrote:

On 16-06-27 09:07 AM, Saeed Mahameed wrote:

Add the commands to set and show the mode of SRIOV E-Switch, two modes are 
supported:

* legacy   : operating in the "old" L2 based mode (DMAC --> VF vport)
* offloads : offloading SW rules/policy (e.g Bridge/FDB or TC/Flows based) set 
by the host OS

Nice work overall also I really appreciated that the core networking
interfaces appear to able to support this without any change.


thanks..


On this patch though do we really need modes like this? My concern with
modes is two fold. One its another knob that some controller will have
to get right which I would prefer to avoid. And two I suspect switching
between the two modes flushes the tables or leaves them in some
unexpected state? At least I can't figure out what the expected should
be off-hand.


Re the 1st concern (another knob), I think we do want that, see below

Re the 2nd concern, I will re-read the cover letter and change logs and 
if needed clarify/improve: the transition is clean! When you are moving 
from legacy to offloads or the other way around, nothing is left in 
unexpected state,  all HW forwarding tables as filled by the current 
mode are flushed and next they are set as needed for the new mode.



Could we instead continue to use the "legacy" mode by default by just
populating the fdb table correctly and then if users want to enable
the "offloads" mode they can modify the fdb tables by deleting entries
or adding them or just extending the dmac/vf mapping via 'tc'. This
would seem natural to me. The flooding rules in fdb might need to be
exposed a bit more cleanly to get the right default flooding behavior
etc. But to me at least this would be much cleaner. Everything will be
nicely defined and we wont have issues with drivers doing slightly
and subtle different defaults between legacy/offload and the transitions
between the states or on resets or etc. If users need to discover the
current configuration then they just query fdb, query tc, and the state
is known no need for any magic toggle switch as best I can see.



Few comments here:

Each mode has it's own way of the driver doing setup for the HW tables 
and how population of the HW tables is done.


The offloads mode needs to create a black hole miss rule and 
send-to-vport rules and create the tables so they can contain later 
rules set by the kernel in a way which is HW/driver dependent.


The legacy mode creates the tables differently and populates them later 
with rule set by

the driver and not the kernel.

Even if we put the different table setup issue a side, I don't think it 
would be correct for bridge/tc to remove rules they didn't add, which is 
needed under your proposal when moving from legacy type rules to 
offloads mode. Querying is problematic too, since legacy could (and 
does) involve some default rules set by the FW, e.g that deals with 
outer world (== not belonging to VM on this host) MACs which are 
invisible to the driver.


That legacy was here and we can't avoid handling it properly for which 
this knob is needed. Note that a vendor can choose to put their default 
to be offloads, hopefully over time, we will all go there :)



Otherwise I didn't review the mlx code but read the commit msgs and
it looks good. I'll take a closer look in the morning.


appreciated



Re: [PATCH net-next 08/16] net/devlink: Add E-Switch mode control

2016-06-27 Thread John Fastabend
On 16-06-27 09:07 AM, Saeed Mahameed wrote:
> From: Or Gerlitz 
> 
> Add the commands to set and show the mode of SRIOV E-Switch,
> two modes are supported:
> 
> * legacy   : operating in the "old" L2 based mode (DMAC --> VF vport)
> * offloads : offloading SW rules/policy (e.g Bridge/FDB or TC/Flows based) 
> set by the host OS
> 
> Signed-off-by: Or Gerlitz 
> Signed-off-by: Saeed Mahameed 
> ---


Hi,

Nice work overall also I really appreciated that the core networking
interfaces appear to able to support this without any change.

On this patch though do we really need modes like this? My concern with
modes is two fold. One its another knob that some controller will have
to get right which I would prefer to avoid. And two I suspect switching
between the two modes flushes the tables or leaves them in some
unexpected state? At least I can't figure out what the expected should
be off-hand.

Could we instead continue to use the "legacy" mode by default by just
populating the fdb table correctly and then if users want to enable
the "offloads" mode they can modify the fdb tables by deleting entries
or adding them or just extending the dmac/vf mapping via 'tc'. This
would seem natural to me. The flooding rules in fdb might need to be
exposed a bit more cleanly to get the right default flooding behavior
etc. But to me at least this would be much cleaner. Everything will be
nicely defined and we wont have issues with drivers doing slightly
and subtle different defaults between legacy/offload and the transitions
between the states or on resets or etc. If users need to discover the
current configuration then they just query fdb, query tc, and the state
is known no need for any magic toggle switch as best I can see.

Otherwise I didn't review the mlx code but read the commit msgs and
it looks good. I'll take a closer look in the morning.

Thanks,
John




[PATCH net-next 08/16] net/devlink: Add E-Switch mode control

2016-06-27 Thread Saeed Mahameed
From: Or Gerlitz 

Add the commands to set and show the mode of SRIOV E-Switch,
two modes are supported:

* legacy   : operating in the "old" L2 based mode (DMAC --> VF vport)
* offloads : offloading SW rules/policy (e.g Bridge/FDB or TC/Flows based) set 
by the host OS

Signed-off-by: Or Gerlitz 
Signed-off-by: Saeed Mahameed 
---
 include/net/devlink.h|  3 ++
 include/uapi/linux/devlink.h |  9 +
 net/core/devlink.c   | 87 
 3 files changed, 99 insertions(+)

diff --git a/include/net/devlink.h b/include/net/devlink.h
index 1d45b61..c99ffe8 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -90,6 +90,9 @@ struct devlink_ops {
   u16 tc_index,
   enum devlink_sb_pool_type pool_type,
   u32 *p_cur, u32 *p_max);
+
+   int (*eswitch_mode_get)(struct devlink *devlink, u16 *p_mode);
+   int (*eswitch_mode_set)(struct devlink *devlink, u16 mode);
 };
 
 static inline void *devlink_priv(struct devlink *devlink)
diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
index ba0073b..dd7c1b4 100644
--- a/include/uapi/linux/devlink.h
+++ b/include/uapi/linux/devlink.h
@@ -57,6 +57,8 @@ enum devlink_command {
DEVLINK_CMD_SB_OCC_SNAPSHOT,
DEVLINK_CMD_SB_OCC_MAX_CLEAR,
 
+   DEVLINK_CMD_ESWITCH_MODE_GET,
+   DEVLINK_CMD_ESWITCH_MODE_SET,
/* add new commands above here */
 
__DEVLINK_CMD_MAX,
@@ -95,6 +97,12 @@ enum devlink_sb_threshold_type {
 
 #define DEVLINK_SB_THRESHOLD_TO_ALPHA_MAX 20
 
+enum devlink_eswitch_mode {
+   DEVLINK_ESWITCH_MODE_NONE,
+   DEVLINK_ESWITCH_MODE_LEGACY,
+   DEVLINK_ESWITCH_MODE_OFFLOADS,
+};
+
 enum devlink_attr {
/* don't change the order or add anything between, this is ABI! */
DEVLINK_ATTR_UNSPEC,
@@ -125,6 +133,7 @@ enum devlink_attr {
DEVLINK_ATTR_SB_TC_INDEX,   /* u16 */
DEVLINK_ATTR_SB_OCC_CUR,/* u32 */
DEVLINK_ATTR_SB_OCC_MAX,/* u32 */
+   DEVLINK_ATTR_ESWITCH_MODE,  /* u16 */
 
/* add new attributes above here, update the policy in devlink.c */
 
diff --git a/net/core/devlink.c b/net/core/devlink.c
index 933e8d4..b2e592a 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -1394,6 +1394,78 @@ static int devlink_nl_cmd_sb_occ_max_clear_doit(struct 
sk_buff *skb,
return -EOPNOTSUPP;
 }
 
+static int devlink_eswitch_fill(struct sk_buff *msg, struct devlink *devlink,
+   enum devlink_command cmd, u32 portid,
+   u32 seq, int flags, u16 mode)
+{
+   void *hdr;
+
+   hdr = genlmsg_put(msg, portid, seq, _nl_family, flags, cmd);
+   if (!hdr)
+   return -EMSGSIZE;
+
+   if (devlink_nl_put_handle(msg, devlink))
+   goto nla_put_failure;
+
+   if (nla_put_u16(msg, DEVLINK_ATTR_ESWITCH_MODE, mode))
+   goto nla_put_failure;
+
+   genlmsg_end(msg, hdr);
+   return 0;
+
+nla_put_failure:
+   genlmsg_cancel(msg, hdr);
+   return -EMSGSIZE;
+}
+
+static int devlink_nl_cmd_eswitch_mode_get_doit(struct sk_buff *skb,
+   struct genl_info *info)
+{
+   struct devlink *devlink = info->user_ptr[0];
+   const struct devlink_ops *ops = devlink->ops;
+   struct sk_buff *msg;
+   u16 mode;
+   int err;
+
+   if (!ops || !ops->eswitch_mode_get)
+   return -EOPNOTSUPP;
+
+   err = ops->eswitch_mode_get(devlink, );
+   if (err)
+   return err;
+
+   msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+   if (!msg)
+   return -ENOMEM;
+
+   err = devlink_eswitch_fill(msg, devlink, DEVLINK_CMD_ESWITCH_MODE_GET,
+  info->snd_portid, info->snd_seq, 0, mode);
+
+   if (err) {
+   nlmsg_free(msg);
+   return err;
+   }
+
+   return genlmsg_reply(msg, info);
+}
+
+static int devlink_nl_cmd_eswitch_mode_set_doit(struct sk_buff *skb,
+   struct genl_info *info)
+{
+   struct devlink *devlink = info->user_ptr[0];
+   const struct devlink_ops *ops = devlink->ops;
+   u16 mode;
+
+   if (!info->attrs[DEVLINK_ATTR_ESWITCH_MODE])
+   return -EINVAL;
+
+   mode = nla_get_u16(info->attrs[DEVLINK_ATTR_ESWITCH_MODE]);
+
+   if (ops && ops->eswitch_mode_set)
+   return ops->eswitch_mode_set(devlink, mode);
+   return -EOPNOTSUPP;
+}
+
 static const struct nla_policy devlink_nl_policy[DEVLINK_ATTR_MAX + 1] = {
[DEVLINK_ATTR_BUS_NAME] = { .type = NLA_NUL_STRING },
[DEVLINK_ATTR_DEV_NAME] = { .type = NLA_NUL_STRING },
@@ -1407,6 +1479,7 @@ static const struct