Re: [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond
On Wed, Mar 14, 2018 at 5:56 PM, Jiri Pirko wrote:
> Wed, Mar 14, 2018 at 12:23:59PM CET, gerlitz...@gmail.com wrote:
>>On Wed, Mar 14, 2018 at 11:50 AM, Jiri Pirko wrote:
>>> Tue, Mar 13, 2018 at 04:51:02PM CET, gerlitz...@gmail.com wrote:
>>>>On Wed, Mar 7, 2018 at 12:57 PM, Jiri Pirko wrote:
>>
>>>>This sounds nice for the case where one installs ingress tc rules on
>>>>the bond (let's call them type 1, see next)
>>>>
>>>>One obstacle pointed out by my colleague, Rabie, is that when the upper layer
>>>>issues a stat call on the filter, they will get two replies, this can confuse
>>>>them and lead to wrong decisions (aging). I wonder if/how we can set a knob
>>>
>>> The bonding itself would not do anything on stats update
>>> command (TC_CLSFLOWER_STATS for example). Only the slaves would do
>>> update. So there will be only reply from slaves.
>>>
>>> Bond/team is just going to propagate block bind/unbind down. Nothing else.
>>
>>Do we agree that user space will get the replies of all lower (slave) devices,
>>or am I missing something here?
>
> "user space will get the replies" - not sure what exactly do you mean by
> this. The stats would be accumulated over all devices/drivers who
> registered block callback.

OK, this is probably something I have to check, thanks

>>>>2. bond being egress port of a rule
>>>>2.1 VF rep --> uplink 0
>>>>2.2 VF rep --> uplink 1
>>>>
>>>>and we do that in the driver (add/del two HW rules, combine the stat
>>>>results, etc)
>>>
>>> That is up to the driver. If the driver can share block between 2
>>> devices, he can do that. If he cannot share, it will just report stats
>>> for every device separately (2 block cbs registered) and tc will see
>>> them both together. No need to do anything in driver.
>>
>>right
>>
>>>>3. ingress rule on VF rep port with shared tunnel device being the
>>>>egress (encap) and where the routing of the underlay (tunnel) goes
>>>>through LAG.
>>
>>> Same as "2."
>>
>>ok
>>
>>>>4. ingress rule shared tunnel device being the ingress and VF rep port
>>>>being the egress (decap)

>>> I don't follow :(

>> the way tunneling is handled in tc classifier/action is
>> encap: ingress: net port, action1: tunnel key set action2: mirred to
>> shared-tunnel device
>> decap: ingress: shared tunnel device, action1: tunnel key unset
>> action2: mirred to net port

>> type 4 are the decap rules, when we offload it as a HW ACL we stretch
>> the line and the ingress is a HW port too (e.g uplink port in NICs)

> Okay, I see. But where's the bond here? Is it the one I mentioned as
> "mirred redirect to lag"?

since the ingress port is not a HW port, we will use the egdev approach and
offload the rule with the uplink of this VF rep port being the ingress.
Since we will see that this uplink is in a LAG, we will offload another
rule with the 2nd uplink being the ingress

>>> I see another thing we need to sanitize: vxlan rule ingress match action
>>> mirred redirect to lag

>>right, we don't have for NIC but for switch ASIC, I guess it is applicable

> Yes, it is. For future NICs I guess it is going to be as well.

might
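For illustration only, the decap case described in this message can be sketched in plain userspace C. This is not kernel or driver code: the structures and the function `toy_expand_decap` are hypothetical names invented here, and the sketch assumes the driver already knows which other uplinks share a LAG with the VF rep's uplink. It shows one software rule (ingress on the shared tunnel device) turning into one HW rule per uplink.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical software view of a decap rule: tc sees the shared tunnel
 * device as ingress and a VF representor as egress. */
struct toy_decap_rule {
    int vf_rep;   /* egress VF representor id */
    int uplink;   /* uplink port that belongs to this VF rep */
};

/* Hypothetical HW rule: the ingress has to be a real HW port. */
struct toy_hw_rule {
    int ingress_uplink;
    int egress_vf_rep;
};

/*
 * Expand one decap rule into HW rules. lag_peers lists the other uplinks
 * that are in the same LAG as rule->uplink (n_peers == 0 when the uplink
 * is not bonded). out must have room for 1 + n_peers entries.
 * Returns the number of HW rules written.
 */
size_t toy_expand_decap(const struct toy_decap_rule *rule,
                        const int *lag_peers, size_t n_peers,
                        struct toy_hw_rule *out)
{
    size_t n = 0;

    assert(rule && out);

    /* First rule: the VF rep's own uplink is the ingress. */
    out[n].ingress_uplink = rule->uplink;
    out[n].egress_vf_rep = rule->vf_rep;
    n++;

    /* One more rule per LAG peer, so traffic is matched on any member. */
    for (size_t i = 0; i < n_peers; i++) {
        out[n].ingress_uplink = lag_peers[i];
        out[n].egress_vf_rep = rule->vf_rep;
        n++;
    }
    return n;
}
```

With two bonded uplinks this yields two HW rules for a single tc rule, which is why the stats of both then have to be combined before they are reported upwards.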
Re: [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond
Wed, Mar 14, 2018 at 12:23:59PM CET, gerlitz...@gmail.com wrote:
>On Wed, Mar 14, 2018 at 11:50 AM, Jiri Pirko wrote:
>> Tue, Mar 13, 2018 at 04:51:02PM CET, gerlitz...@gmail.com wrote:
>>>On Wed, Mar 7, 2018 at 12:57 PM, Jiri Pirko wrote:
>
>>>This sounds nice for the case where one installs ingress tc rules on
>>>the bond (let's call them type 1, see next)
>>>
>>>One obstacle pointed out by my colleague, Rabie, is that when the upper layer
>>>issues a stat call on the filter, they will get two replies, this can confuse
>>>them and lead to wrong decisions (aging). I wonder if/how we can set a knob
>>
>> The bonding itself would not do anything on stats update
>> command (TC_CLSFLOWER_STATS for example). Only the slaves would do
>> update. So there will be only reply from slaves.
>>
>> Bond/team is just going to propagate block bind/unbind down. Nothing else.
>
>Do we agree that user space will get the replies of all lower (slave) devices,
>or am I missing something here?

"user space will get the replies" - not sure what exactly do you mean by
this. The stats would be accumulated over all devices/drivers who
registered block callback.

>
>>>2. bond being egress port of a rule
>>>2.1 VF rep --> uplink 0
>>>2.2 VF rep --> uplink 1
>>>
>>>and we do that in the driver (add/del two HW rules, combine the stat
>>>results, etc)
>>
>> That is up to the driver. If the driver can share block between 2
>> devices, he can do that. If he cannot share, it will just report stats
>> for every device separately (2 block cbs registered) and tc will see
>> them both together. No need to do anything in driver.
>
>right
>
>>>3. ingress rule on VF rep port with shared tunnel device being the
>>>egress (encap) and where the routing of the underlay (tunnel) goes
>>>through LAG.
>
>> Same as "2."
>
>ok
>
>>>4. ingress rule shared tunnel device being the ingress and VF rep port
>>>being the egress (decap)
>
>> I don't follow :(
>
>the way tunneling is handled in tc classifier/action is
>
>encap: ingress: net port, action1: tunnel key set action2: mirred to
>shared-tunnel device
>
>decap: ingress: shared tunnel device, action1: tunnel key unset
>action2: mirred to net port
>
>type 4 are the decap rules, when we offload it as a HW ACL we stretch
>the line and the ingress is a HW port too (e.g uplink port in NICs)

Okay, I see. But where's the bond here? Is it the one I mentioned as
"mirred redirect to lag"?

>
>>>this uses the egdev facility to be offloaded into our driver, and
>>>then in the driver we will treat it like type 1, two rules need to be
>>>installed into HW, but now, we can't delegate them from the vxlan
>>>device b/c it has no direct connection with the bond.
>
>> I see another thing we need to sanitize: vxlan rule ingress match action
>> mirred redirect to lag
>
>right, we don't have for NIC but for switch ASIC, I guess it is applicable

Yes, it is. For future NICs I guess it is going to be as well.
Re: [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond
Wed, Mar 14, 2018 at 02:50:02AM CET, jakub.kicin...@netronome.com wrote:
>On Tue, 13 Mar 2018 17:53:39 +0200, Or Gerlitz wrote:
>> > Starting with type 2, in our current NIC HW APIs we have to duplicate
>> > these rules into two rules set to HW:
>> >
>> > 2.1 VF rep --> uplink 0
>> > 2.2 VF rep --> uplink 1
>> >
>> > and we do that in the driver (add/del two HW rules, combine the stat
>> > results, etc)
>
>Ack, I think our HW API also will require us to duplicate the rules
>today, but IMHO we should implement some common helper module in the
>core that would work for any block sharing rather than bond specific
>solution.

But how? Only the driver knows, in case it has 2 netdevices, whether the
HW is capable of sharing or not. Accordingly, it registers 1 cb instance
or 2 cb instances (1 for each netdev). I don't see how you can move it
into core...

>
>> > 3. ingress rule on VF rep port with shared tunnel device being the
>> > egress (encap) and where the routing of the underlay (tunnel) goes
>> > through LAG.
>> >
>> > in our case, this is like 2.1/2.2 above, offload two rules, combine stats
>> >
>> > 4. ingress rule shared tunnel device being the ingress and VF rep port
>> > being the egress (decap)
>> >
>> > this uses the egdev facility to be offloaded into our driver, and
>> > then in the driver we will treat it like type 1, two rules need to be
>> > installed into HW, but now, we can't delegate them from the vxlan
>> > device b/c it has no direct connection with the bond.
>
>Let's get rid of the egdev crutch first then :]

I don't see how you can do it. Note that this exists to catch insertions
of rules that have "mirred redirect" to the dev which is interested in
the rules. Originally it was done in a very ugly way (please see git
history), and I converted it to egdev - I was not able to find any nicer
solution :/ Any ideas for improvement?
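A rough userspace illustration of the "1 cb instance or 2 cb instances" point, not taken from any driver. The `toy_hw` structure and both functions are hypothetical, and the sketch assumes that every registered callback instance installs exactly one HW rule when a rule is inserted into the block.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical description of what the driver knows about its HW. */
struct toy_hw {
    bool can_share_block;  /* one HW rule can serve all of its netdevs */
    int hw_rules;          /* HW rules installed so far */
};

/* How many callback instances the driver registers for n_netdevs ports. */
int toy_cbs_to_register(const struct toy_hw *hw, int n_netdevs)
{
    assert(hw && n_netdevs > 0);
    return hw->can_share_block ? 1 : n_netdevs;
}

/*
 * Model of one rule insertion into the block: each registered callback
 * instance is invoked once and installs one HW rule (an assumption of
 * this sketch). Returns the number of HW rules added.
 */
int toy_rule_insert(struct toy_hw *hw, int n_netdevs)
{
    int n_cbs = toy_cbs_to_register(hw, n_netdevs);

    hw->hw_rules += n_cbs;
    return n_cbs;
}
```

The decision sits entirely in `can_share_block`, a fact about the hardware, which is the reason given above for keeping it in the driver.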
Re: [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond
On Wed, Mar 14, 2018 at 11:50 AM, Jiri Pirko wrote:
> Tue, Mar 13, 2018 at 04:51:02PM CET, gerlitz...@gmail.com wrote:
>>On Wed, Mar 7, 2018 at 12:57 PM, Jiri Pirko wrote:

>>This sounds nice for the case where one installs ingress tc rules on
>>the bond (let's call them type 1, see next)
>>
>>One obstacle pointed out by my colleague, Rabie, is that when the upper layer
>>issues a stat call on the filter, they will get two replies, this can confuse
>>them and lead to wrong decisions (aging). I wonder if/how we can set a knob
>
> The bonding itself would not do anything on stats update
> command (TC_CLSFLOWER_STATS for example). Only the slaves would do
> update. So there will be only reply from slaves.
>
> Bond/team is just going to propagate block bind/unbind down. Nothing else.

Do we agree that user space will get the replies of all lower (slave) devices,
or am I missing something here?

>>2. bond being egress port of a rule
>>2.1 VF rep --> uplink 0
>>2.2 VF rep --> uplink 1
>>
>>and we do that in the driver (add/del two HW rules, combine the stat
>>results, etc)
>
> That is up to the driver. If the driver can share block between 2
> devices, he can do that. If he cannot share, it will just report stats
> for every device separately (2 block cbs registered) and tc will see
> them both together. No need to do anything in driver.

right

>>3. ingress rule on VF rep port with shared tunnel device being the
>>egress (encap) and where the routing of the underlay (tunnel) goes
>>through LAG.

> Same as "2."

ok

>>4. ingress rule shared tunnel device being the ingress and VF rep port
>>being the egress (decap)

> I don't follow :(

the way tunneling is handled in tc classifier/action is

encap: ingress: net port, action1: tunnel key set action2: mirred to
shared-tunnel device

decap: ingress: shared tunnel device, action1: tunnel key unset
action2: mirred to net port

type 4 are the decap rules, when we offload it as a HW ACL we stretch
the line and the ingress is a HW port too (e.g uplink port in NICs)

>>this uses the egdev facility to be offloaded into our driver, and
>>then in the driver we will treat it like type 1, two rules need to be
>>installed into HW, but now, we can't delegate them from the vxlan
>>device b/c it has no direct connection with the bond.

> I see another thing we need to sanitize: vxlan rule ingress match action
> mirred redirect to lag

right, we don't have for NIC but for switch ASIC, I guess it is applicable
Re: [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond
Tue, Mar 13, 2018 at 04:51:02PM CET, gerlitz...@gmail.com wrote:
>On Wed, Mar 7, 2018 at 12:57 PM, Jiri Pirko wrote:
>> Mon, Mar 05, 2018 at 02:28:30PM CET, john.hur...@netronome.com wrote:
>>>Allow drivers to register netdev callbacks for tc offload in linux bonds.
>>>If a netdev has registered and is a slave of a given bond, then any tc
>>>rules offloaded to the bond will be relayed to it if both the bond and the
>>>slave permit hw offload.
>
>>>Because the bond itself is not offloaded, just the rules, we don't care
>>>about whether the bond ports are on the same device or whether some of
>>>slaves are representor ports and some are not.
>
>John, I think we must design here for the case where the bond IS offloaded.
>E.g. some sort of HW LAG. For example, the mlxsw driver supports
>LAG offload and supports tcflower offload, we need to see how these
>two live together, mlx5 supports tcflower offload and we are working on
>bond offload, etc.
>
>>>+EXPORT_SYMBOL_GPL(tc_setup_cb_bond_register);
>>
>> Please, no "bond" specific calls from drivers. That would be wrong.
>> The idea behind block callbacks was that anyone who is interested could
>> register to receive those. In this case, slave device is interested.
>> So it should register to receive block callbacks in the same way as if
>> the block was directly on top of the slave device. The only thing you
>> need to handle is to propagate block bind/unbind from master down to the
>> slaves.
>
>Jiri,
>
>This sounds nice for the case where one installs ingress tc rules on
>the bond (let's call them type 1, see next)
>
>One obstacle pointed out by my colleague, Rabie, is that when the upper layer
>issues a stat call on the filter, they will get two replies, this can confuse
>them and lead to wrong decisions (aging). I wonder if/how we can set a knob

The bonding itself would not do anything on stats update
command (TC_CLSFLOWER_STATS for example). Only the slaves would do
update. So there will be only reply from slaves.

Bond/team is just going to propagate block bind/unbind down. Nothing else.

>somewhere that unifies the stats (add packet/bytes, use the latest lastuse).
>
>Also, let's see what other rules have to be offloaded in that scheme
>(call them type 2/3/4) where one bonded two HW ports
>
>2. bond being egress port of a rule
>
>TC rules for overlay networks scheme, e.g in NIC SRIOV
>scheme where one bonds the two uplink representors
>
>Starting with type 2, in our current NIC HW APIs we have to duplicate
>these rules into two rules set to HW:
>
>2.1 VF rep --> uplink 0
>2.2 VF rep --> uplink 1
>
>and we do that in the driver (add/del two HW rules, combine the stat
>results, etc)

That is up to the driver. If the driver can share block between 2
devices, he can do that. If he cannot share, it will just report stats
for every device separately (2 block cbs registered) and tc will see
them both together. No need to do anything in driver.

>
>3. ingress rule on VF rep port with shared tunnel device being the
>egress (encap) and where the routing of the underlay (tunnel) goes
>through LAG.
>
>in our case, this is like 2.1/2.2 above, offload two rules, combine stats
>

Same as "2."

>4. ingress rule shared tunnel device being the ingress and VF rep port
>being the egress (decap)

I don't follow :(

>
>this uses the egdev facility to be offloaded into our driver, and
>then in the driver we will treat it like type 1, two rules need to be
>installed into HW, but now, we can't delegate them from the vxlan
>device b/c it has no direct connection with the bond.

I see another thing we need to sanitize: vxlan rule ingress match action
mirred redirect to lag

>
>All in all, for the mlx5 use case, seems we have elegant solution only
>for type 1.
>
>I think we should do the elegant solution for the case where it is applicable.
>
>In parallel if/when newer HW APIs are there such that type 2 and 3 can be set
>using one HW rule whose dest is the bond, we are good. As for type 4,
>need to see if/how it can be nicer.
>
>Or.
Re: [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond
On Wed, Mar 14, 2018 at 3:50 AM, Jakub Kicinski wrote:
> On Tue, 13 Mar 2018 17:53:39 +0200, Or Gerlitz wrote:
>> > Starting with type 2, in our current NIC HW APIs we have to duplicate
>> > these rules into two rules set to HW:
>> >
>> > 2.1 VF rep --> uplink 0
>> > 2.2 VF rep --> uplink 1
>> >
>> > and we do that in the driver (add/del two HW rules, combine the stat
>> > results, etc)
>
> Ack, I think our HW API also will require us to duplicate the rules
> today, but IMHO we should implement some common helper module in the
> core that would work for any block sharing rather than bond specific
> solution.

To be clear, you refer to the case where the bond is the egress device
of the rule?

For the case where the bond is the ingress device, are you OK with the
approach Jiri suggested, to propagate the tc setup ndo call into the
lower devices, so they bind/unbind to any block the upper is bound to?

This approach is applicable for bond/team/vlan devices for both NIC and
Switch ASIC (or NPU...) drivers. You want to make a helper out of this?
Re: [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond
On Tue, 13 Mar 2018 17:53:39 +0200, Or Gerlitz wrote:
> > Starting with type 2, in our current NIC HW APIs we have to duplicate
> > these rules into two rules set to HW:
> >
> > 2.1 VF rep --> uplink 0
> > 2.2 VF rep --> uplink 1
> >
> > and we do that in the driver (add/del two HW rules, combine the stat
> > results, etc)

Ack, I think our HW API also will require us to duplicate the rules
today, but IMHO we should implement some common helper module in the
core that would work for any block sharing rather than bond specific
solution.

> > 3. ingress rule on VF rep port with shared tunnel device being the
> > egress (encap) and where the routing of the underlay (tunnel) goes
> > through LAG.
> >
> > in our case, this is like 2.1/2.2 above, offload two rules, combine stats
> >
> > 4. ingress rule shared tunnel device being the ingress and VF rep port
> > being the egress (decap)
> >
> > this uses the egdev facility to be offloaded into our driver, and
> > then in the driver we will treat it like type 1, two rules need to be
> > installed into HW, but now, we can't delegate them from the vxlan
> > device b/c it has no direct connection with the bond.

Let's get rid of the egdev crutch first then :]
Re: [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond
On Tue, Mar 13, 2018 at 5:51 PM, Or Gerlitz wrote:

Sorry ppl, I added an MLNX alias (asap_direct_...@mellanox.com) which is
not open to outside posts, please remove it from your replies, otherwise
your mail will bounce back..

Or.

> On Wed, Mar 7, 2018 at 12:57 PM, Jiri Pirko wrote:
>> Mon, Mar 05, 2018 at 02:28:30PM CET, john.hur...@netronome.com wrote:
>>>Allow drivers to register netdev callbacks for tc offload in linux bonds.
>>>If a netdev has registered and is a slave of a given bond, then any tc
>>>rules offloaded to the bond will be relayed to it if both the bond and the
>>>slave permit hw offload.
>
>>>Because the bond itself is not offloaded, just the rules, we don't care
>>>about whether the bond ports are on the same device or whether some of
>>>slaves are representor ports and some are not.
>
> John, I think we must design here for the case where the bond IS offloaded.
> E.g. some sort of HW LAG. For example, the mlxsw driver supports
> LAG offload and supports tcflower offload, we need to see how these
> two live together, mlx5 supports tcflower offload and we are working on
> bond offload, etc.
>
>>>+EXPORT_SYMBOL_GPL(tc_setup_cb_bond_register);
>>
>> Please, no "bond" specific calls from drivers. That would be wrong.
>> The idea behind block callbacks was that anyone who is interested could
>> register to receive those. In this case, slave device is interested.
>> So it should register to receive block callbacks in the same way as if
>> the block was directly on top of the slave device. The only thing you
>> need to handle is to propagate block bind/unbind from master down to the
>> slaves.
>
> Jiri,
>
> This sounds nice for the case where one installs ingress tc rules on
> the bond (let's call them type 1, see next)
>
> One obstacle pointed out by my colleague, Rabie, is that when the upper layer
> issues a stat call on the filter, they will get two replies, this can confuse
> them and lead to wrong decisions (aging). I wonder if/how we can set a knob
> somewhere that unifies the stats (add packet/bytes, use the latest lastuse).
>
> Also, let's see what other rules have to be offloaded in that scheme
> (call them type 2/3/4) where one bonded two HW ports
>
> 2. bond being egress port of a rule
>
> TC rules for overlay networks scheme, e.g in NIC SRIOV
> scheme where one bonds the two uplink representors
>
> Starting with type 2, in our current NIC HW APIs we have to duplicate
> these rules into two rules set to HW:
>
> 2.1 VF rep --> uplink 0
> 2.2 VF rep --> uplink 1
>
> and we do that in the driver (add/del two HW rules, combine the stat
> results, etc)
>
> 3. ingress rule on VF rep port with shared tunnel device being the
> egress (encap) and where the routing of the underlay (tunnel) goes
> through LAG.
>
> in our case, this is like 2.1/2.2 above, offload two rules, combine stats
>
> 4. ingress rule shared tunnel device being the ingress and VF rep port
> being the egress (decap)
>
> this uses the egdev facility to be offloaded into our driver, and
> then in the driver we will treat it like type 1, two rules need to be
> installed into HW, but now, we can't delegate them from the vxlan
> device b/c it has no direct connection with the bond.
>
> All in all, for the mlx5 use case, seems we have elegant solution only
> for type 1.
>
> I think we should do the elegant solution for the case where it is applicable.
>
> In parallel if/when newer HW APIs are there such that type 2 and 3 can be set
> using one HW rule whose dest is the bond, we are good. As for type 4,
> need to see if/how it can be nicer.
>
> Or.
Re: [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond
On Wed, Mar 7, 2018 at 12:57 PM, Jiri Pirko wrote:
> Mon, Mar 05, 2018 at 02:28:30PM CET, john.hur...@netronome.com wrote:
>>Allow drivers to register netdev callbacks for tc offload in linux bonds.
>>If a netdev has registered and is a slave of a given bond, then any tc
>>rules offloaded to the bond will be relayed to it if both the bond and the
>>slave permit hw offload.

>>Because the bond itself is not offloaded, just the rules, we don't care
>>about whether the bond ports are on the same device or whether some of
>>slaves are representor ports and some are not.

John, I think we must design here for the case where the bond IS offloaded.
E.g. some sort of HW LAG. For example, the mlxsw driver supports
LAG offload and supports tcflower offload, we need to see how these
two live together, mlx5 supports tcflower offload and we are working on
bond offload, etc.

>>+EXPORT_SYMBOL_GPL(tc_setup_cb_bond_register);
>
> Please, no "bond" specific calls from drivers. That would be wrong.
> The idea behind block callbacks was that anyone who is interested could
> register to receive those. In this case, slave device is interested.
> So it should register to receive block callbacks in the same way as if
> the block was directly on top of the slave device. The only thing you
> need to handle is to propagate block bind/unbind from master down to the
> slaves.

Jiri,

This sounds nice for the case where one installs ingress tc rules on
the bond (let's call them type 1, see next)

One obstacle pointed out by my colleague, Rabie, is that when the upper layer
issues a stat call on the filter, they will get two replies, this can confuse
them and lead to wrong decisions (aging). I wonder if/how we can set a knob
somewhere that unifies the stats (add packet/bytes, use the latest lastuse).

Also, let's see what other rules have to be offloaded in that scheme
(call them type 2/3/4) where one bonded two HW ports

2. bond being egress port of a rule

TC rules for overlay networks scheme, e.g in NIC SRIOV
scheme where one bonds the two uplink representors

Starting with type 2, in our current NIC HW APIs we have to duplicate
these rules into two rules set to HW:

2.1 VF rep --> uplink 0
2.2 VF rep --> uplink 1

and we do that in the driver (add/del two HW rules, combine the stat
results, etc)

3. ingress rule on VF rep port with shared tunnel device being the
egress (encap) and where the routing of the underlay (tunnel) goes
through LAG.

in our case, this is like 2.1/2.2 above, offload two rules, combine stats

4. ingress rule shared tunnel device being the ingress and VF rep port
being the egress (decap)

this uses the egdev facility to be offloaded into our driver, and
then in the driver we will treat it like type 1, two rules need to be
installed into HW, but now, we can't delegate them from the vxlan
device b/c it has no direct connection with the bond.

All in all, for the mlx5 use case, seems we have elegant solution only
for type 1.

I think we should do the elegant solution for the case where it is applicable.

In parallel if/when newer HW APIs are there such that type 2 and 3 can be set
using one HW rule whose dest is the bond, we are good. As for type 4,
need to see if/how it can be nicer.

Or.
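The stats unification asked for in this message (add packets/bytes, use the latest lastuse) could look roughly like the sketch below. It is plain userspace C for illustration, not kernel code. The struct and function names are hypothetical, and `lastuse` is assumed to be a simple monotonic timestamp with no wraparound handling.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-rule counters as reported by one lower device. */
struct toy_rule_stats {
    uint64_t packets;
    uint64_t bytes;
    uint64_t lastuse;  /* time of the last hit, arbitrary monotonic unit */
};

/* Fold src into dst: counters add up, lastuse keeps the most recent hit. */
void toy_stats_merge(struct toy_rule_stats *dst,
                     const struct toy_rule_stats *src)
{
    assert(dst && src);
    dst->packets += src->packets;
    dst->bytes += src->bytes;
    if (src->lastuse > dst->lastuse)
        dst->lastuse = src->lastuse;
}

/* Combine the reports of n lower devices into one view of the rule. */
struct toy_rule_stats toy_stats_combine(const struct toy_rule_stats *per_dev,
                                        int n)
{
    struct toy_rule_stats total = { 0, 0, 0 };

    for (int i = 0; i < n; i++)
        toy_stats_merge(&total, &per_dev[i]);
    return total;
}
```

An aging decision based on the combined `lastuse` then sees the rule as active as long as either uplink has seen traffic recently.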
Re: [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond
Mon, Mar 05, 2018 at 02:28:30PM CET, john.hur...@netronome.com wrote:
>Allow drivers to register netdev callbacks for tc offload in linux bonds.
>If a netdev has registered and is a slave of a given bond, then any tc
>rules offloaded to the bond will be relayed to it if both the bond and the
>slave permit hw offload.
>
>Because the bond itself is not offloaded, just the rules, we don't care
>about whether the bond ports are on the same device or whether some of
>slaves are representor ports and some are not.
>
>Signed-off-by: John Hurley
>---
> drivers/net/bonding/bond_main.c | 195 +++-
> include/net/bonding.h | 7 ++
> 2 files changed, 201 insertions(+), 1 deletion(-)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index e6415f6..d9e41cf 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c

[...]

>+EXPORT_SYMBOL_GPL(tc_setup_cb_bond_register);

Please, no "bond" specific calls from drivers. That would be wrong.
The idea behind block callbacks was that anyone who is interested could
register to receive those. In this case, slave device is interested.
So it should register to receive block callbacks in the same way as if
the block was directly on top of the slave device. The only thing you
need to handle is to propagate block bind/unbind from master down to the
slaves.
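As a rough model of the scheme suggested above, here is a small userspace sketch. It is not kernel code and does not use the real tc block API: `toy_block`, `toy_dev` and all functions are hypothetical. The point it illustrates is that the master registers nothing itself. It only forwards bind/unbind, and each slave registers its own callback exactly as it would for a block bound directly to it.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_CBS 8

/* Hypothetical callback: invoked once per rule command on the block. */
typedef int (*toy_block_cb)(const char *rule, void *priv);

struct toy_block {
    toy_block_cb cb[TOY_MAX_CBS];
    void *priv[TOY_MAX_CBS];
    size_t n_cbs;
};

struct toy_dev {
    toy_block_cb cb;   /* NULL if the device does not offload */
    int rules_seen;
};

int toy_block_cb_register(struct toy_block *b, toy_block_cb cb, void *priv)
{
    if (b->n_cbs == TOY_MAX_CBS)
        return -1;
    b->cb[b->n_cbs] = cb;
    b->priv[b->n_cbs] = priv;
    b->n_cbs++;
    return 0;
}

void toy_block_cb_unregister(struct toy_block *b, toy_block_cb cb, void *priv)
{
    for (size_t i = 0; i < b->n_cbs; i++) {
        if (b->cb[i] == cb && b->priv[i] == priv) {
            b->n_cbs--;
            b->cb[i] = b->cb[b->n_cbs];      /* move last entry into gap */
            b->priv[i] = b->priv[b->n_cbs];
            return;
        }
    }
}

/* Master bind: forward to every slave, undo everything on failure. */
int toy_master_bind(struct toy_dev **slaves, size_t n, struct toy_block *b)
{
    for (size_t i = 0; i < n; i++) {
        if (slaves[i]->cb &&
            toy_block_cb_register(b, slaves[i]->cb, slaves[i])) {
            while (i-- > 0)
                if (slaves[i]->cb)
                    toy_block_cb_unregister(b, slaves[i]->cb, slaves[i]);
            return -1;
        }
    }
    return 0;
}

void toy_master_unbind(struct toy_dev **slaves, size_t n, struct toy_block *b)
{
    for (size_t i = 0; i < n; i++)
        if (slaves[i]->cb)
            toy_block_cb_unregister(b, slaves[i]->cb, slaves[i]);
}

/* A rule added to the block reaches every registered callback. */
int toy_block_rule_add(struct toy_block *b, const char *rule)
{
    int handled = 0;

    assert(b && rule);
    for (size_t i = 0; i < b->n_cbs; i++)
        if (b->cb[i](rule, b->priv[i]) == 0)
            handled++;
    return handled;
}

/* Example slave callback: just counts the rules it was offered. */
int toy_count_rule(const char *rule, void *priv)
{
    (void)rule;
    ((struct toy_dev *)priv)->rules_seen++;
    return 0;
}
```

In this model the slave driver has no bond-specific entry point at all: the callback it registers is the same one it would register for a block bound directly to it.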