Re: [PATCH net-next,v3 00/12] add flow_rule infrastructure
On Mon, Nov 26, 2018 at 08:33:36PM +0100, Pablo Neira Ayuso wrote:
> Hi Marcelo,

Hello!

> On Thu, Nov 22, 2018 at 07:08:32PM -0200, Marcelo Ricardo Leitner wrote:
> > On Thu, Nov 22, 2018 at 02:22:20PM -0200, Marcelo Ricardo Leitner wrote:
> > > On Wed, Nov 21, 2018 at 03:51:20AM +0100, Pablo Neira Ayuso wrote:
> > > > Hi,
> > > >
> > > > This patchset is the third iteration [1] [2] [3] to introduce a kernel
> > > > intermediate representation (IR) to express ACL hardware offloads.
> > >
> > > On the v2 cover letter you had:
> > >
> > > """
> > > However, the cost of this layer is very small; adding 1 million rules via
> > > tc -batch, perf shows:
> > >
> > >   0.06%  tc  [kernel.vmlinux]  [k] tc_setup_flow_action
> > > """
> > >
> > > The above doesn't include time spent on children calls, and I'm worried
> > > about the new allocation done by flow_rule_alloc(), as it can impact
> > > rule insertion rate. I'll run some tests here and report back.
> >
> > I'm seeing +60ms on 1.75s (~3.4%) to add 40k flower rules on ingress
> > with skip_hw and tc in batch mode, with flows like:
> >
> >   filter add dev p6p2 parent : protocol ip prio 1 flower skip_hw
> >     src_mac ec:13:db:00:00:00 dst_mac ec:14:c2:00:00:00 src_ip
> >     56.0.0.0 dst_ip 55.0.0.0 action drop
> >
> > Only 20ms out of those 60ms were consumed within fl_change() calls
> > (considering children calls), though.
> >
> > Do you see something similar? I used current net-next (d59da3fbfe3f)
> > and with this patchset applied.
>
> I see lots of send() and recv() in tc -batch via strace, using this
> example rule, repeating it N times:
>
>   filter add dev eth0 parent : protocol ip pref 1 flower dst_mac
>     f4:52:14:10:df:92 action mirred egress redirect dev eth1
>
> This is taking ~8 seconds for 40k rules on my old laptop [*], which
> is already not too fast (without my patchset).

On an E5-2643 v3 @ 3.40GHz I see a total of 1.17s with an old iproute2
(4.11) (more below).
> I remember we discussed adding support for real batching for tc -
> probably we can do this transparently by assuming that if the skbuff
> length mismatches the nlmsghdr->len field, then we enter batch mode
> from the kernel. This would require updating iproute2 to use libmnl
> batching routines, or code that follows a similar approach otherwise.

Yes, I believe you're referring to:

  commit 485d0c6001c4aa134b99c86913d6a7089b7b2ab0
  Author: Chris Mi
  Date:   Fri Jan 12 14:13:16 2018 +0900

      tc: Add batchsize feature for filter and actions

which is present in iproute2 4.16. It does transparent batching on the
application side. With tc from today's tip, I get 1.05s for 40k rules
(both numbers with this patchset applied).

> [*] 0.5 seconds in nft (similar ruleset), this is using netlink
> batching.

Nice.

Cheers,
Marcelo
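For reference, a batch run like the ones timed in this thread can be reproduced along these lines. This is a sketch, not the exact script used by either poster: the device name, parent handle, and MAC pattern are placeholders, and the final `tc -batch` invocation is commented out because it needs the device and root privileges.

```shell
# Generate 40k unique flower rules, one per line, in tc -batch syntax,
# varying dst_mac so each rule is distinct (device/handle are examples).
for i in $(seq 0 39999); do
    printf 'filter add dev eth0 parent ffff: protocol ip pref 1 flower skip_hw dst_mac f4:52:14:%02x:%02x:%02x action drop\n' \
        $(((i >> 16) & 255)) $(((i >> 8) & 255)) $((i & 255))
done > flower.batch

wc -l flower.batch   # -> 40000 flower.batch
# time tc -batch flower.batch   # one tc invocation for all rules
```

With iproute2 >= 4.16 the batchsize feature makes this single invocation coalesce several rules per sendmsg() transparently, which is what closes most of the gap against nft's native netlink batching.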
Re: [PATCH net-next,v3 00/12] add flow_rule infrastructure
Hi Marcelo,

On Thu, Nov 22, 2018 at 07:08:32PM -0200, Marcelo Ricardo Leitner wrote:
> On Thu, Nov 22, 2018 at 02:22:20PM -0200, Marcelo Ricardo Leitner wrote:
> > On Wed, Nov 21, 2018 at 03:51:20AM +0100, Pablo Neira Ayuso wrote:
> > > Hi,
> > >
> > > This patchset is the third iteration [1] [2] [3] to introduce a kernel
> > > intermediate representation (IR) to express ACL hardware offloads.
> >
> > On the v2 cover letter you had:
> >
> > """
> > However, the cost of this layer is very small; adding 1 million rules via
> > tc -batch, perf shows:
> >
> >   0.06%  tc  [kernel.vmlinux]  [k] tc_setup_flow_action
> > """
> >
> > The above doesn't include time spent on children calls, and I'm worried
> > about the new allocation done by flow_rule_alloc(), as it can impact
> > rule insertion rate. I'll run some tests here and report back.
>
> I'm seeing +60ms on 1.75s (~3.4%) to add 40k flower rules on ingress
> with skip_hw and tc in batch mode, with flows like:
>
>   filter add dev p6p2 parent : protocol ip prio 1 flower skip_hw
>     src_mac ec:13:db:00:00:00 dst_mac ec:14:c2:00:00:00 src_ip
>     56.0.0.0 dst_ip 55.0.0.0 action drop
>
> Only 20ms out of those 60ms were consumed within fl_change() calls
> (considering children calls), though.
>
> Do you see something similar? I used current net-next (d59da3fbfe3f)
> and with this patchset applied.

I see lots of send() and recv() in tc -batch via strace, using this
example rule, repeating it N times:

  filter add dev eth0 parent : protocol ip pref 1 flower dst_mac
    f4:52:14:10:df:92 action mirred egress redirect dev eth1

This is taking ~8 seconds for 40k rules on my old laptop [*], which
is already not too fast (without my patchset).

I remember we discussed adding support for real batching for tc -
probably we can do this transparently by assuming that if the skbuff
length mismatches the nlmsghdr->len field, then we enter batch mode
from the kernel. This would require updating iproute2 to use libmnl
batching routines, or code that follows a similar approach otherwise.

[*] 0.5 seconds in nft (similar ruleset), this is using netlink
batching.
Re: [PATCH net-next,v3 00/12] add flow_rule infrastructure
On Thu, Nov 22, 2018 at 02:22:20PM -0200, Marcelo Ricardo Leitner wrote:
> On Wed, Nov 21, 2018 at 03:51:20AM +0100, Pablo Neira Ayuso wrote:
> > Hi,
> >
> > This patchset is the third iteration [1] [2] [3] to introduce a kernel
> > intermediate representation (IR) to express ACL hardware offloads.
>
> On the v2 cover letter you had:
>
> """
> However, the cost of this layer is very small; adding 1 million rules via
> tc -batch, perf shows:
>
>   0.06%  tc  [kernel.vmlinux]  [k] tc_setup_flow_action
> """
>
> The above doesn't include time spent on children calls, and I'm worried
> about the new allocation done by flow_rule_alloc(), as it can impact
> rule insertion rate. I'll run some tests here and report back.

I'm seeing +60ms on 1.75s (~3.4%) to add 40k flower rules on ingress
with skip_hw and tc in batch mode, with flows like:

  filter add dev p6p2 parent : protocol ip prio 1 flower skip_hw
    src_mac ec:13:db:00:00:00 dst_mac ec:14:c2:00:00:00 src_ip
    56.0.0.0 dst_ip 55.0.0.0 action drop

Only 20ms out of those 60ms were consumed within fl_change() calls
(considering children calls), though.

Do you see something similar? I used current net-next (d59da3fbfe3f)
and with this patchset applied.
Re: [PATCH net-next,v3 00/12] add flow_rule infrastructure
On Wed, Nov 21, 2018 at 03:51:20AM +0100, Pablo Neira Ayuso wrote:
> Hi,
>
> This patchset is the third iteration [1] [2] [3] to introduce a kernel
> intermediate representation (IR) to express ACL hardware offloads.

On the v2 cover letter you had:

"""
However, the cost of this layer is very small; adding 1 million rules via
tc -batch, perf shows:

  0.06%  tc  [kernel.vmlinux]  [k] tc_setup_flow_action
"""

The above doesn't include time spent on children calls, and I'm worried
about the new allocation done by flow_rule_alloc(), as it can impact
rule insertion rate. I'll run some tests here and report back.
[PATCH net-next,v3 00/12] add flow_rule infrastructure
Hi,

This patchset is the third iteration [1] [2] [3] to introduce a kernel
intermediate representation (IR) to express ACL hardware offloads.
This round addresses feedback from Jiri Pirko:

* Add net/core/flow_offload.c and include/net/flow_offload.h.

* Add the flow_rule_alloc() helper function.

* Remove the _key postfix and _KEY_ infix in flow_action definitions.

* Use enum flow_action_mangle_base for consistency.

* Rename the key field to entries and num_keys to num_entries.

* Rename struct flow_action_key to flow_action_entry.

* Use a placeholder in struct flow_action to store the array of actions
  from flow_rule_alloc().

* Add tcf_exts_num_actions() and pass it to flow_rule_alloc() to
  calculate the size of the array of actions.

* Rename to struct flow_stats and to function flow_stats_update().

* Add struct ethtool_rx_flow_rule, keeping a placeholder for private
  dissector information.

* Pass struct flow_rule *rule to all parser functions in the
  qlogic/qede driver. This also fixes a bug reported by Manish Chopra
  in the ethtool_rx_spec to flow_rule translator.

Making all these changes has been an exercise to review the existing
infrastructure, to understand what has been done, and to propose
improvements to the _great work_ that core driver developers have done
so far to introduce HW offloads through the existing frontend APIs. I
still have more feedback and technical ideas that I'm very much looking
forward to discussing with them in the future.

The main goal of this patchset is to avoid code duplication for driver
developers. There are no netfilter changes coming in this batch. I
would like to explore Netfilter hardware offloads in the future.

Thanks a lot for reviewing!
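To make the renames above concrete, here is a rough driver-side sketch of how a rule would be consumed through this IR. This is kernel-context illustration only, not compilable standalone; helper names follow the patchset's conventions (flow_rule_match_*(), flow_action_for_each(), struct flow_action_entry), but exact signatures should be checked against the patches themselves.

```c
/* Sketch: how a driver might walk a struct flow_rule.
 * Kernel-context pseudocode; not a standalone program. */
static int example_parse(struct flow_rule *rule)
{
	const struct flow_action_entry *act;
	int i;

	/* Match side: dissector keys are read through typed helpers
	 * instead of open-coded skb_flow_dissector lookups. */
	if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_ETH_ADDRS)) {
		struct flow_match_eth_addrs match;

		flow_rule_match_eth_addrs(rule, &match);
		/* match.key->dst / match.mask->dst give the dst_mac
		 * value and mask to program into hardware. */
	}

	/* Action side: entries were renamed from flow_action_key to
	 * flow_action_entry in this round. */
	flow_action_for_each(i, act, &rule->action) {
		switch (act->id) {
		case FLOW_ACTION_DROP:
			/* program a drop rule */
			break;
		default:
			return -EOPNOTSUPP;
		}
	}
	return 0;
}
```

The point of the IR is that this same walk works whether the rule came from cls_flower or from the ethtool_rx_flow_spec translator, which is where the driver-side duplication savings come from.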
[1] https://lwn.net/Articles/766695/
[2] https://marc.info/?l=linux-netdev&m=154233253114506&w=2
[3] https://marc.info/?l=linux-netdev&m=154258780717036&w=2

Pablo Neira Ayuso (12):
  flow_dissector: add flow_rule and flow_match structures and use them
  net/mlx5e: support for two independent packet edit actions
  flow_dissector: add flow action infrastructure
  cls_api: add translator to flow_action representation
  cls_flower: add statistics retrieval infrastructure and use it
  drivers: net: use flow action infrastructure
  cls_flower: don't expose TC actions to drivers anymore
  flow_dissector: add wake-up-on-lan and queue to flow_action
  flow_dissector: add basic ethtool_rx_flow_spec to flow_rule
    structure translator
  dsa: bcm_sf2: use flow_rule infrastructure
  qede: place ethtool_rx_flow_spec after code after TC flower codebase
  qede: use ethtool_rx_flow_rule() to remove duplicated parser code

 drivers/net/dsa/bcm_sf2_cfp.c                      | 109 +--
 drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c       | 252 +++
 .../net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c   | 450 ++---
 drivers/net/ethernet/intel/i40e/i40e_main.c        | 178 ++---
 drivers/net/ethernet/intel/iavf/iavf_main.c        | 195 +++---
 drivers/net/ethernet/intel/igb/igb_main.c          |  64 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c    | 743 ++---
 drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c |   2 +-
 .../net/ethernet/mellanox/mlxsw/spectrum_flower.c  | 258 ---
 drivers/net/ethernet/netronome/nfp/flower/action.c | 196 +++---
 drivers/net/ethernet/netronome/nfp/flower/match.c  | 417 ++--
 .../net/ethernet/netronome/nfp/flower/offload.c    | 150 ++---
 drivers/net/ethernet/qlogic/qede/qede_filter.c     | 560 ++--
 include/linux/ethtool.h                            |  10 +
 include/net/flow_offload.h                         | 199 ++
 include/net/pkt_cls.h                              |  18 +-
 net/core/Makefile                                  |   2 +-
 net/core/ethtool.c                                 | 189 ++
 net/core/flow_offload.c                            | 153 +
 net/sched/cls_api.c                                | 116
 net/sched/cls_flower.c                             |  69 +-
 21 files changed, 2339 insertions(+), 1991 deletions(-)
 create mode 100644 include/net/flow_offload.h
 create mode 100644 net/core/flow_offload.c

-- 
2.11.0