Re: [PATCH net 0/4] net/sched: forbid 'goto_chain' on fallback actions
On 2018-10-20 5:33 p.m., Davide Caratti wrote: the following command: # tc actions add action police rate 1mbit burst 1k conform-exceed \ > pass / goto chain 42 generates a NULL pointer dereference when packets exceed the configured rate. Similarly, the following command: # tc actions add action pass random determ goto chain 42 2 makes the kernel crash with NULL dereference when the first packet does not match the 'pass' action. gact and police allow users to specify a fallback control action, that is stored in the action private data. 'goto chain x' never worked for these cases, since a->goto_chain handle was never initialized. There is only one goto_chain handle per TC action, and it is designed to be non-NULL only if tcf_action contains a 'goto chain' command. So, let's forbid 'goto chain' on fallback actions. Patch 1/4 and 2/4 change the .init() functions of police and gact, to let them return an error when users try to set 'goto chain x' in the fallback action. Patch 3/4 and 4/4 add TDC selftest coverage to this new behavior. For the series, Acked-by: Jamal Hadi Salim cheers, jamal
Re: [PATCH net v2] net/sched: act_gact: properly init 'goto chain'
;tcf_action) +#endif + static inline bool __is_tcf_gact_act(const struct tc_action *a, int act, bool is_ext) { @@ -26,8 +33,8 @@ static inline bool __is_tcf_gact_act(const struct tc_action *a, int act, return false; gact = to_gact(a); - if ((!is_ext && gact->tcf_action == act) || - (is_ext && TC_ACT_EXT_CMP(gact->tcf_action, act))) + if ((!is_ext && GACT_PRIMARY_ACTION(gact) == act) || + (is_ext && TC_ACT_EXT_CMP(GACT_PRIMARY_ACTION(gact), act))) return true; #endif diff --git a/net/sched/act_gact.c b/net/sched/act_gact.c index cd1d9bd32ef9..49b32650efb9 100644 --- a/net/sched/act_gact.c +++ b/net/sched/act_gact.c @@ -31,7 +31,7 @@ static int gact_net_rand(struct tcf_gact *gact) { smp_rmb(); /* coupled with smp_wmb() in tcf_gact_init() */ if (prandom_u32() % gact->tcfg_pval) - return gact->tcf_action; + return gact->tcfg_caction; return gact->tcfg_paction; } @@ -41,7 +41,7 @@ static int gact_determ(struct tcf_gact *gact) smp_rmb(); /* coupled with smp_wmb() in tcf_gact_init() */ if (pack % gact->tcfg_pval) - return gact->tcf_action; + return gact->tcfg_caction; return gact->tcfg_paction; } @@ -88,6 +88,9 @@ static int tcf_gact_init(struct net *net, struct nlattr *nla, p_parm = nla_data(tb[TCA_GACT_PROB]); if (p_parm->ptype >= MAX_RAND) return -EINVAL; + if (TC_ACT_EXT_CMP(p_parm->paction, TC_ACT_GOTO_CHAIN) && + TC_ACT_EXT_CMP(parm->action, TC_ACT_GOTO_CHAIN)) + return -EINVAL; Rejection is a good solution[1]. Would be helpful to set an ext_ack to something like "only one goto chain is supported currently" I didnt follow why you needed to introduce tcfg_caction... cheers, jamal [1]actions should allow multitude return opcodes, not just two. The proper solution seems to be to just let the caller (cls_api - who is aware of chains) to deal with the chain selection. Sticking them in actions speeds up lookup (and deal with refcnts) but i wonder given this breakage in abstraction whether they belong there...
Re: [PATCH net] net/sched: properly init chain in case of multiple control actions
On 2018-10-13 11:23 a.m., Davide Caratti wrote: On Fri, 2018-10-12 at 13:57 -0700, Cong Wang wrote: Why not just validate the fallback action in each action init()? For example, checking tcfg_paction in tcf_gact_init(). I don't see the need of making it generic. hello Cong, once again thanks for looking at this. what you say is doable, and I evaluated doing it before proposing this patch. But I felt unconfortable, because I needed to pass struct tcf_proto *tp in tcf_gact_init() to initialize a->goto_chain with the chain_idx encoded in the fallback action. So, I would have changed all the init() functions in all TC actions, just to fix two of them. A (legal?) trick is to let tcf_action store the fallback action when it contains a 'goto chain' command, I just posted a proposal for gact. If you think it's ok, I will test and post the same for act_police. Need some more thought. So the issue here is the goto chain failed the configured chain doesnt exist? cheers, jamal
Re: [PATCH net-next 0/9] net: Kernel side filtering for route dumps
On 2018-10-11 2:44 p.m., David Ahern wrote: On 10/11/18 12:05 PM, Jamal Hadi Salim wrote: On 2018-10-11 1:04 p.m., David Ahern wrote: I meant the general API of users passing filter arguments as attributes to the dump (or values in the header) -- KIND, MASTER, device index, etc. This is an existing API and existing capability. which i referred to as use-case driven. It is not unreasonable to optimize for the most common - but every time somebody comes up with something new you need to patch the kernel. I disagree with your overall premise of bpf the end-all hammer. It is a tool but not the only tool. For starters, you are proposing building the message, run the filter on it, and potentially back the message up to drop the recently added piece because the filter does not want it included. That is still wasting a lot of cpu cycles to build and drop. I am thinking about scale to 1 million routes -- I do not need the dump loop building a message for 1 million entries only to drop 99% of them. That is crazy. My earlier suggestion was for somewhere before the skb is formed. In the vicinity of xxx_fill_info() The "create skb and drop" kind works already today with some acrobatics needed for some cases with cbpf. Is it unfeasible to add an ebpf hook at that point and ask a user supplied code "is this ok to send?" - this is no different than doing a "get by key" operation where key/filter is any arbitrary construction of fields rtm understands (including the ones you provided like table index) that are passed in the user program. Classical "select" mechanism in database tables The way the kernel manages route tables says I should pass in the table id as it is a major differentiator on what is returned. From there lookup the specific table (I need to fix this part per my response to Andrew), and then only walk it. The existing semantics, capabilities that exist for other dump commands is the most efficient for some of these high level, big hammer filters. Sure. What you want gets into the tiniest of details and yes the imagination can go wild with combinations of filter options. So maybe this scanning of post-built messages is reasonable *after* the high level sorting is done. That doesnt require any change. cheers, jamal
Re: [PATCH net-next 0/9] net: Kernel side filtering for route dumps
On 2018-10-11 1:04 p.m., David Ahern wrote: You can already filter link dumps by kind. How? By passing in the KIND attribute on a dump request. This type of filtering exists for link dumps, neighbor dumps, fdb dumps. Why is there a push to make route dumps different? Why can't they be consistent and use existing semantics? I think you meant filtering by ifindex in neighbor. note: I would argue that there are already "adhoc" ways of filtering in place, mostly use case driven). Otherwise Sowmini wouldnt have to craft that bpf filter. There are netlink users who have none or some weird filtering involved. There is no arguement that your approach works for rtm. But the rest of the users missing filters will require similar kernel changes. Could this be made generic enough to benefit other netlink users? The problem is there's always one new attribute that would make sense for some use case which requires a kernel change ("send me an event only if you get link down" or "dump all ports with link down"). cheers, jamal
Re: [PATCH net-next 0/9] net: Kernel side filtering for route dumps
On 2018-10-11 12:16 p.m., David Ahern wrote: IMO, bpf at the fill_info stage is not appropriate. Somewhere before the skb is formed (and nlmsg is built). If you go as far as constructing it, then cBPF per what Sowmini should work; but there will be constructs which are trickier. skb->sk (hence attached filter) should be available at that point. Classical bpf per Sowmini's example maybe trickier. Better - why dont we have an ebpf hook at this stage and then we dont have to make changes to the kernel when someone adds one more field to the filter? BTW: useful for events as well - not just dumps (as the name fib_dump_filter suggests) you mean kernel notifications on internal events? 1. there is no user socket when notifications are created and the *_fill_info is invoked 2. notifications are global going to potentially many sockets. For these cases the existing sk_filter is appropriate. #2 - netlink being a broad/multicast bus with attached listeners. Yes, you can do it with cBPF but some complexity may occur. Example: if i was interested to netdevice events of "kind = vxlan && admin flag is down" then that is non trivial to do with classical but would be reasonably comfortable to do with ebpf. Note: That filter will work fine for dumps as well since the semantics are the same. the win is: in the future when you just wanna add that one new filter attribute you dont need a kernel patch in and roll in a new production kernel. cheers, jamal
Re: [PATCH net-next 0/9] net: Kernel side filtering for route dumps
On 2018-10-11 11:46 a.m., Sowmini Varadhan wrote: On (10/11/18 08:26), Stephen Hemminger wrote: You can do the something like this already with BPF socket filters. But writing BPF for multi-part messages is hard. Indeed. And I was just experimenting with this for ARP just last week. So to handle the caes of "ip neigh show a.b.c.d" without walking through the entire arp table and filtering in userspace, you could add a sk_filter() hook like this: If this could be done a lot earlier aka at xxx_fill_info() bpf would be a very good answer. skb->sk (hence attached filter) should be available at that point. Classical bpf per Sowmini's example maybe trickier. Better - why dont we have an ebpf hook at this stage and then we dont have to make changes to the kernel when someone adds one more field to the filter? BTW: useful for events as well - not just dumps (as the name fib_dump_filter suggests) cheers, jamal
[PATCH net-next v2 10/12] net: sched: cls_u32: get rid of tp_c
From: Al Viro Both hnode ->tp_c and tp_c argument of u32_set_parms() the latter is redundant, the former - never read... Signed-off-by: Al Viro Signed-off-by: Jamal Hadi Salim --- net/sched/cls_u32.c | 11 --- 1 file changed, 4 insertions(+), 7 deletions(-) diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c index 3ed2c9866b36..3d4c360f9b0c 100644 --- a/net/sched/cls_u32.c +++ b/net/sched/cls_u32.c @@ -79,7 +79,6 @@ struct tc_u_hnode { struct tc_u_hnode __rcu *next; u32 handle; u32 prio; - struct tc_u_common *tp_c; int refcnt; unsigned intdivisor; struct idr handle_idr; @@ -390,7 +389,6 @@ static int u32_init(struct tcf_proto *tp) tp_c->refcnt++; RCU_INIT_POINTER(root_ht->next, tp_c->hlist); rcu_assign_pointer(tp_c->hlist, root_ht); - root_ht->tp_c = tp_c; rcu_assign_pointer(tp->root, root_ht); tp->data = tp_c; @@ -761,7 +759,7 @@ static const struct nla_policy u32_policy[TCA_U32_MAX + 1] = { }; static int u32_set_parms(struct net *net, struct tcf_proto *tp, -unsigned long base, struct tc_u_common *tp_c, +unsigned long base, struct tc_u_knode *n, struct nlattr **tb, struct nlattr *est, bool ovr, struct netlink_ext_ack *extack) @@ -782,7 +780,7 @@ static int u32_set_parms(struct net *net, struct tcf_proto *tp, } if (handle) { - ht_down = u32_lookup_ht(tp_c, handle); + ht_down = u32_lookup_ht(tp->data, handle); if (!ht_down) { NL_SET_ERR_MSG_MOD(extack, "Link hash table not found"); @@ -956,7 +954,7 @@ static int u32_change(struct net *net, struct sk_buff *in_skb, if (!new) return -ENOMEM; - err = u32_set_parms(net, tp, base, tp_c, new, tb, + err = u32_set_parms(net, tp, base, new, tb, tca[TCA_RATE], ovr, extack); if (err) { @@ -1012,7 +1010,6 @@ static int u32_change(struct net *net, struct sk_buff *in_skb, return err; } } - ht->tp_c = tp_c; ht->refcnt = 1; ht->divisor = divisor; ht->handle = handle; @@ -1123,7 +1120,7 @@ static int u32_change(struct net *net, struct sk_buff *in_skb, } #endif - err = u32_set_parms(net, tp, base, tp_c, n, tb, tca[TCA_RATE], ovr, + err = u32_set_parms(net, tp, base, n, tb, tca[TCA_RATE], ovr, extack); if (err == 0) { struct tc_u_knode __rcu **ins; -- 2.11.0
[PATCH net-next v2 07/12] net: sched: cls_u32: clean tc_u_common hashtable
From: Al Viro * calculate key *once*, not for each hash chain element * let tc_u_hash() return the pointer to chain head rather than index - callers are cleaner that way. Signed-off-by: Al Viro Signed-off-by: Jamal Hadi Salim --- net/sched/cls_u32.c | 24 +--- 1 file changed, 9 insertions(+), 15 deletions(-) diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c index c378168f4562..3f6fba831c57 100644 --- a/net/sched/cls_u32.c +++ b/net/sched/cls_u32.c @@ -343,19 +343,16 @@ static void *tc_u_common_ptr(const struct tcf_proto *tp) return block->q; } -static unsigned int tc_u_hash(const struct tcf_proto *tp) +static struct hlist_head *tc_u_hash(void *key) { - return hash_ptr(tc_u_common_ptr(tp), U32_HASH_SHIFT); + return tc_u_common_hash + hash_ptr(key, U32_HASH_SHIFT); } -static struct tc_u_common *tc_u_common_find(const struct tcf_proto *tp) +static struct tc_u_common *tc_u_common_find(void *key) { struct tc_u_common *tc; - unsigned int h; - - h = tc_u_hash(tp); - hlist_for_each_entry(tc, _u_common_hash[h], hnode) { - if (tc->ptr == tc_u_common_ptr(tp)) + hlist_for_each_entry(tc, tc_u_hash(key), hnode) { + if (tc->ptr == key) return tc; } return NULL; @@ -364,10 +361,8 @@ static struct tc_u_common *tc_u_common_find(const struct tcf_proto *tp) static int u32_init(struct tcf_proto *tp) { struct tc_u_hnode *root_ht; - struct tc_u_common *tp_c; - unsigned int h; - - tp_c = tc_u_common_find(tp); + void *key = tc_u_common_ptr(tp); + struct tc_u_common *tp_c = tc_u_common_find(key); root_ht = kzalloc(sizeof(*root_ht), GFP_KERNEL); if (root_ht == NULL) @@ -385,12 +380,11 @@ static int u32_init(struct tcf_proto *tp) kfree(root_ht); return -ENOBUFS; } - tp_c->ptr = tc_u_common_ptr(tp); + tp_c->ptr = key; INIT_HLIST_NODE(_c->hnode); idr_init(_c->handle_idr); - h = tc_u_hash(tp); - hlist_add_head(_c->hnode, _u_common_hash[h]); + hlist_add_head(_c->hnode, tc_u_hash(key)); } tp_c->refcnt++; -- 2.11.0
[PATCH net-next v2 05/12] net: sched: cls_u32: get rid of tc_u_knode ->tp
From: Al Viro not used anymore Signed-off-by: Al Viro Signed-off-by: Jamal Hadi Salim --- net/sched/cls_u32.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c index ef0f2e6ec422..810c49ac1bbe 100644 --- a/net/sched/cls_u32.c +++ b/net/sched/cls_u32.c @@ -68,7 +68,6 @@ struct tc_u_knode { u32 mask; u32 __percpu*pcpu_success; #endif - struct tcf_proto*tp; struct rcu_work rwork; /* The 'sel' field MUST be the last field in structure to allow for * tc_u32_keys allocated at end of structure. @@ -896,7 +895,6 @@ static struct tc_u_knode *u32_init_knode(struct tcf_proto *tp, /* Similarly success statistics must be moved as pointers */ new->pcpu_success = n->pcpu_success; #endif - new->tp = tp; memcpy(>sel, s, sizeof(*s) + s->nkeys*sizeof(struct tc_u32_key)); if (tcf_exts_init(>exts, TCA_U32_ACT, TCA_U32_POLICE)) { @@ -1112,7 +1110,6 @@ static int u32_change(struct net *net, struct sk_buff *in_skb, n->handle = handle; n->fshift = s->hmask ? ffs(ntohl(s->hmask)) - 1 : 0; n->flags = flags; - n->tp = tp; err = tcf_exts_init(>exts, TCA_U32_ACT, TCA_U32_POLICE); if (err < 0) -- 2.11.0
[PATCH net-next v2 02/12] net: sched: cls_u32: disallow linking to root hnode
From: Al Viro Operation makes no sense. Nothing will actually break if we do so (depth limit in u32_classify() will prevent infinite loops), but according to maintainers it's best prohibited outright. NOTE: doing so guarantees that u32_destroy() will trigger the call of u32_destroy_hnode(); we might want to make that unconditional. Test: tc qdisc add dev eth0 ingress tc filter add dev eth0 parent : protocol ip prio 100 u32 \ link 800: offset at 0 mask 0f00 shift 6 plus 0 eat match ip protocol 6 ff should fail with Error: cls_u32: Not linking to root node Signed-off-by: Al Viro Signed-off-by: Jamal Hadi Salim --- net/sched/cls_u32.c | 4 1 file changed, 4 insertions(+) diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c index 622f4657da94..3357331a80a2 100644 --- a/net/sched/cls_u32.c +++ b/net/sched/cls_u32.c @@ -797,6 +797,10 @@ static int u32_set_parms(struct net *net, struct tcf_proto *tp, NL_SET_ERR_MSG_MOD(extack, "Link hash table not found"); return -EINVAL; } + if (ht_down->is_root) { + NL_SET_ERR_MSG_MOD(extack, "Not linking to root node"); + return -EINVAL; + } ht_down->refcnt++; } -- 2.11.0
[PATCH net-next v2 06/12] net: sched: cls_u32: get rid of tc_u_common ->rcu
From: Al Viro unused Signed-off-by: Al Viro Signed-off-by: Jamal Hadi Salim --- net/sched/cls_u32.c | 1 - 1 file changed, 1 deletion(-) diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c index 810c49ac1bbe..c378168f4562 100644 --- a/net/sched/cls_u32.c +++ b/net/sched/cls_u32.c @@ -98,7 +98,6 @@ struct tc_u_common { int refcnt; struct idr handle_idr; struct hlist_node hnode; - struct rcu_head rcu; }; static inline unsigned int u32_hash_fold(__be32 key, -- 2.11.0
[PATCH net-next v2 01/12] net: sched: cls_u32: mark root hnode explicitly
From: Al Viro ... and produce consistent error on attempt to delete such. Existing check in u32_delete() is inconsistent - after tc qdisc add dev eth0 ingress tc filter add dev eth0 parent : protocol ip prio 100 handle 1: u32 \ divisor 1 tc filter add dev eth0 parent : protocol ip prio 200 handle 2: u32 \ divisor 1 both tc filter delete dev eth0 parent : protocol ip prio 100 handle 801: u32 and tc filter delete dev eth0 parent : protocol ip prio 100 handle 800: u32 will fail (at least with refcounting fixes), but the former will complain about an attempt to remove a busy table, while the latter will recognize it as root and yield "Not allowed to delete root node" instead. The problem with the existing check is that several tcf_proto instances might share the same tp->data and handle-to-hnode lookup will be the same for all of them. So comparing an hnode to be deleted with tp->root won't catch the case when one tp is used to try deleting the root of another. Solution is trivial - mark the root hnodes explicitly upon allocation and check for that. Signed-off-by: Al Viro Signed-off-by: Jamal Hadi Salim --- net/sched/cls_u32.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c index f218ccf1e2d9..622f4657da94 100644 --- a/net/sched/cls_u32.c +++ b/net/sched/cls_u32.c @@ -84,6 +84,7 @@ struct tc_u_hnode { int refcnt; unsigned intdivisor; struct idr handle_idr; + boolis_root; struct rcu_head rcu; u32 flags; /* The 'ht' field MUST be the last field in structure to allow for @@ -377,6 +378,7 @@ static int u32_init(struct tcf_proto *tp) root_ht->refcnt++; root_ht->handle = tp_c ? gen_new_htid(tp_c, root_ht) : 0x8000; root_ht->prio = tp->prio; + root_ht->is_root = true; idr_init(_ht->handle_idr); if (tp_c == NULL) { @@ -692,7 +694,7 @@ static int u32_delete(struct tcf_proto *tp, void *arg, bool *last, goto out; } - if (root_ht == ht) { + if (ht->is_root) { NL_SET_ERR_MSG_MOD(extack, "Not allowed to delete root node"); return -EINVAL; } -- 2.11.0
[PATCH net-next v2 12/12] net: sched: cls_u32: simplify the hell out u32_delete() emptiness check
From: Al Viro Now that we have the knode count, we can instantly check if any hnodes are non-empty. And that kills the check for extra references to root hnode - those could happen only if there was a knode to carry such a link. Signed-off-by: Al Viro Signed-off-by: Jamal Hadi Salim --- net/sched/cls_u32.c | 48 +--- 1 file changed, 1 insertion(+), 47 deletions(-) diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c index 61593bee08db..ac79a40a0392 100644 --- a/net/sched/cls_u32.c +++ b/net/sched/cls_u32.c @@ -627,17 +627,6 @@ static int u32_destroy_hnode(struct tcf_proto *tp, struct tc_u_hnode *ht, return -ENOENT; } -static bool ht_empty(struct tc_u_hnode *ht) -{ - unsigned int h; - - for (h = 0; h <= ht->divisor; h++) - if (rcu_access_pointer(ht->ht[h])) - return false; - - return true; -} - static void u32_destroy(struct tcf_proto *tp, struct netlink_ext_ack *extack) { struct tc_u_common *tp_c = tp->data; @@ -675,13 +664,9 @@ static int u32_delete(struct tcf_proto *tp, void *arg, bool *last, struct netlink_ext_ack *extack) { struct tc_u_hnode *ht = arg; - struct tc_u_hnode *root_ht = rtnl_dereference(tp->root); struct tc_u_common *tp_c = tp->data; int ret = 0; - if (ht == NULL) - goto out; - if (TC_U32_KEY(ht->handle)) { u32_remove_hw_knode(tp, (struct tc_u_knode *)ht, extack); ret = u32_delete_key(tp, (struct tc_u_knode *)ht); @@ -702,38 +687,7 @@ static int u32_delete(struct tcf_proto *tp, void *arg, bool *last, } out: - *last = true; - if (root_ht) { - if (root_ht->refcnt > 1) { - *last = false; - goto ret; - } - if (root_ht->refcnt == 1) { - if (!ht_empty(root_ht)) { - *last = false; - goto ret; - } - } - } - - if (tp_c->refcnt > 1) { - *last = false; - goto ret; - } - - if (tp_c->refcnt == 1) { - struct tc_u_hnode *ht; - - for (ht = rtnl_dereference(tp_c->hlist); -ht; -ht = rtnl_dereference(ht->next)) - if (!ht_empty(ht)) { - *last = false; - break; - } - } - -ret: + *last = tp_c->refcnt == 1 && tp_c->knodes == 0; return ret; } -- 2.11.0
[PATCH net-next v2 04/12] net: sched: cls_u32: get rid of unused argument of u32_destroy_key()
From: Al Viro Signed-off-by: Al Viro Signed-off-by: Jamal Hadi Salim --- net/sched/cls_u32.c | 13 ++--- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c index ce55eea448a0..ef0f2e6ec422 100644 --- a/net/sched/cls_u32.c +++ b/net/sched/cls_u32.c @@ -405,8 +405,7 @@ static int u32_init(struct tcf_proto *tp) return 0; } -static int u32_destroy_key(struct tcf_proto *tp, struct tc_u_knode *n, - bool free_pf) +static int u32_destroy_key(struct tc_u_knode *n, bool free_pf) { struct tc_u_hnode *ht = rtnl_dereference(n->ht_down); @@ -440,7 +439,7 @@ static void u32_delete_key_work(struct work_struct *work) struct tc_u_knode, rwork); rtnl_lock(); - u32_destroy_key(key->tp, key, false); + u32_destroy_key(key, false); rtnl_unlock(); } @@ -457,7 +456,7 @@ static void u32_delete_key_freepf_work(struct work_struct *work) struct tc_u_knode, rwork); rtnl_lock(); - u32_destroy_key(key->tp, key, true); + u32_destroy_key(key, true); rtnl_unlock(); } @@ -600,7 +599,7 @@ static void u32_clear_hnode(struct tcf_proto *tp, struct tc_u_hnode *ht, if (tcf_exts_get_net(>exts)) tcf_queue_work(>rwork, u32_delete_key_freepf_work); else - u32_destroy_key(n->tp, n, true); + u32_destroy_key(n, true); } } } @@ -971,13 +970,13 @@ static int u32_change(struct net *net, struct sk_buff *in_skb, tca[TCA_RATE], ovr, extack); if (err) { - u32_destroy_key(tp, new, false); + u32_destroy_key(new, false); return err; } err = u32_replace_hw_knode(tp, new, flags, extack); if (err) { - u32_destroy_key(tp, new, false); + u32_destroy_key(new, false); return err; } -- 2.11.0
[PATCH net-next v2 08/12] net: sched: cls_u32: pass tc_u_common to u32_set_parms() instead of tc_u_hnode
From: Al Viro the only thing we used ht for was ht->tp_c and callers can get that without going through ->tp_c at all; start with lifting that into the callers, next commits will massage those, eventually removing ->tp_c altogether. Signed-off-by: Al Viro Signed-off-by: Jamal Hadi Salim --- net/sched/cls_u32.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c index 3f6fba831c57..53f34f8cde8b 100644 --- a/net/sched/cls_u32.c +++ b/net/sched/cls_u32.c @@ -761,7 +761,7 @@ static const struct nla_policy u32_policy[TCA_U32_MAX + 1] = { }; static int u32_set_parms(struct net *net, struct tcf_proto *tp, -unsigned long base, struct tc_u_hnode *ht, +unsigned long base, struct tc_u_common *tp_c, struct tc_u_knode *n, struct nlattr **tb, struct nlattr *est, bool ovr, struct netlink_ext_ack *extack) @@ -782,7 +782,7 @@ static int u32_set_parms(struct net *net, struct tcf_proto *tp, } if (handle) { - ht_down = u32_lookup_ht(ht->tp_c, handle); + ht_down = u32_lookup_ht(tp_c, handle); if (!ht_down) { NL_SET_ERR_MSG_MOD(extack, "Link hash table not found"); @@ -957,7 +957,7 @@ static int u32_change(struct net *net, struct sk_buff *in_skb, return -ENOMEM; err = u32_set_parms(net, tp, base, - rtnl_dereference(n->ht_up), new, tb, + rtnl_dereference(n->ht_up)->tp_c, new, tb, tca[TCA_RATE], ovr, extack); if (err) { @@ -1124,7 +1124,7 @@ static int u32_change(struct net *net, struct sk_buff *in_skb, } #endif - err = u32_set_parms(net, tp, base, ht, n, tb, tca[TCA_RATE], ovr, + err = u32_set_parms(net, tp, base, ht->tp_c, n, tb, tca[TCA_RATE], ovr, extack); if (err == 0) { struct tc_u_knode __rcu **ins; -- 2.11.0
[PATCH net-next v2 03/12] net: sched: cls_u32: make sure that divisor is a power of 2
From: Al Viro Tested by modifying iproute2 to allow sending a divisor > 255 Tested-by: Jamal Hadi Salim Signed-off-by: Al Viro Signed-off-by: Jamal Hadi Salim --- net/sched/cls_u32.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c index 3357331a80a2..ce55eea448a0 100644 --- a/net/sched/cls_u32.c +++ b/net/sched/cls_u32.c @@ -994,7 +994,11 @@ static int u32_change(struct net *net, struct sk_buff *in_skb, if (tb[TCA_U32_DIVISOR]) { unsigned int divisor = nla_get_u32(tb[TCA_U32_DIVISOR]); - if (--divisor > 0x100) { + if (!is_power_of_2(divisor)) { + NL_SET_ERR_MSG_MOD(extack, "Divisor is not a power of 2"); + return -EINVAL; + } + if (divisor-- > 0x100) { NL_SET_ERR_MSG_MOD(extack, "Exceeded maximum 256 hash buckets"); return -EINVAL; } -- 2.11.0
[PATCH net-next v2 11/12] net: sched: cls_u32: keep track of knodes count in tc_u_common
From: Al Viro allows to simplify u32_delete() considerably Signed-off-by: Al Viro Signed-off-by: Jamal Hadi Salim --- net/sched/cls_u32.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c index 3d4c360f9b0c..61593bee08db 100644 --- a/net/sched/cls_u32.c +++ b/net/sched/cls_u32.c @@ -97,6 +97,7 @@ struct tc_u_common { int refcnt; struct idr handle_idr; struct hlist_node hnode; + longknodes; }; static inline unsigned int u32_hash_fold(__be32 key, @@ -452,6 +453,7 @@ static void u32_delete_key_freepf_work(struct work_struct *work) static int u32_delete_key(struct tcf_proto *tp, struct tc_u_knode *key) { + struct tc_u_common *tp_c = tp->data; struct tc_u_knode __rcu **kp; struct tc_u_knode *pkp; struct tc_u_hnode *ht = rtnl_dereference(key->ht_up); @@ -462,6 +464,7 @@ static int u32_delete_key(struct tcf_proto *tp, struct tc_u_knode *key) kp = >next, pkp = rtnl_dereference(*kp)) { if (pkp == key) { RCU_INIT_POINTER(*kp, key->next); + tp_c->knodes--; tcf_unbind_filter(tp, >res); idr_remove(>handle_idr, key->handle); @@ -576,6 +579,7 @@ static int u32_replace_hw_knode(struct tcf_proto *tp, struct tc_u_knode *n, static void u32_clear_hnode(struct tcf_proto *tp, struct tc_u_hnode *ht, struct netlink_ext_ack *extack) { + struct tc_u_common *tp_c = tp->data; struct tc_u_knode *n; unsigned int h; @@ -583,6 +587,7 @@ static void u32_clear_hnode(struct tcf_proto *tp, struct tc_u_hnode *ht, while ((n = rtnl_dereference(ht->ht[h])) != NULL) { RCU_INIT_POINTER(ht->ht[h], rtnl_dereference(n->next)); + tp_c->knodes--; tcf_unbind_filter(tp, >res); u32_remove_hw_knode(tp, n, extack); idr_remove(>handle_idr, n->handle); @@ -1141,6 +1146,7 @@ static int u32_change(struct net *net, struct sk_buff *in_skb, RCU_INIT_POINTER(n->next, pins); rcu_assign_pointer(*ins, n); + tp_c->knodes++; *arg = n; return 0; } -- 2.11.0
[PATCH net-next v2 00/12] net: sched: cls_u32 Various improvements
From: Jamal Hadi Salim Various improvements from Al. Changes from version 1: Add missing commit Al Viro (11): net: sched: cls_u32: mark root hnode explicitly net: sched: cls_u32: disallow linking to root hnode net: sched: cls_u32: make sure that divisor is a power of 2 net: sched: cls_u32: get rid of unused argument of u32_destroy_key() net: sched: cls_u32: get rid of tc_u_knode ->tp net: sched: cls_u32: get rid of tc_u_common ->rcu net: sched: cls_u32: clean tc_u_common hashtable net: sched: cls_u32: pass tc_u_common to u32_set_parms() instead of tc_u_hnode net: sched: cls_u32: the tp_c argument of u32_set_parms() is always tp->data net: sched: cls_u32: get rid of tp_c net: sched: cls_u32: keep track of knodes count in tc_u_common net: sched: cls_u32: simplify the hell out u32_delete() emptiness check net/sched/cls_u32.c | 121 +--- 1 file changed, 38 insertions(+), 83 deletions(-) -- 2.11.0
[PATCH net-next v2 09/12] net: sched: cls_u32: the tp_c argument of u32_set_parms() is always tp->data
From: Al Viro It must be tc_u_common associated with that tp (i.e. tp->data). Proof: * both ->ht_up and ->tp_c are assign-once * ->tp_c of anything inserted into tp_c->hlist is tp_c * hnodes never get reinserted into the lists or moved between those, so anything found by u32_lookup_ht(tp->data, ...) will have ->tp_c equal to tp->data. * tp->root->tp_c == tp->data. * ->ht_up of anything inserted into hnode->ht[...] is equal to hnode. * knodes never get reinserted into hash chains or moved between those, so anything returned by u32_lookup_key(ht, ...) will have ->ht_up equal to ht. * any knode returned by u32_get(tp, ...) will have ->ht_up->tp_c point to tp->data Signed-off-by: Al Viro Signed-off-by: Jamal Hadi Salim --- net/sched/cls_u32.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c index 53f34f8cde8b..3ed2c9866b36 100644 --- a/net/sched/cls_u32.c +++ b/net/sched/cls_u32.c @@ -956,8 +956,7 @@ static int u32_change(struct net *net, struct sk_buff *in_skb, if (!new) return -ENOMEM; - err = u32_set_parms(net, tp, base, - rtnl_dereference(n->ht_up)->tp_c, new, tb, + err = u32_set_parms(net, tp, base, tp_c, new, tb, tca[TCA_RATE], ovr, extack); if (err) { @@ -1124,7 +1123,7 @@ static int u32_change(struct net *net, struct sk_buff *in_skb, } #endif - err = u32_set_parms(net, tp, base, ht->tp_c, n, tb, tca[TCA_RATE], ovr, + err = u32_set_parms(net, tp, base, tp_c, n, tb, tca[TCA_RATE], ovr, extack); if (err == 0) { struct tc_u_knode __rcu **ins; -- 2.11.0
Re: [PATCH net-next 02/11] net: sched: cls_u32: make sure that divisor is a power of 2
On 2018-10-08 4:46 a.m., Sergei Shtylyov wrote: Hello! On 07.10.2018 19:38, Jamal Hadi Salim wrote: From: Al Viro Tested by modifying iproute2 to to allow One "to" is enough, no? :-) Will fix in updated version. cheers, jamal
Re: [PATCH net-next 00/11] net: sched: cls_u32 Various improvements
On 2018-10-08 2:11 a.m., Al Viro wrote: On Sun, Oct 07, 2018 at 10:55:52PM -0700, David Miller wrote: From: Al Viro Date: Mon, 8 Oct 2018 06:45:15 +0100 Er... Both are due to missing in the very beginning of the series (well, on top of "net: sched: cls_u32: fix hnode refcounting") commit All dependencies like this must be explicitly stated. And in such situations you actually should wait for the dependency to get into 'net', eventually get merged into 'net-next', and then you can submit the stuff that depends upon it. Not the way this was done. Point (and this commit is simply missing - it's not that it went into net). FWIW, this was simply "this is what the breakage is caused by", not an attempt to amend the submission or anything like that... My bad. I was wondering why it compiled for me. There is no dependency on the single patch that went to net. I will submit an updated version. cheers, jamal
[PATCH net-next 08/11] net: sched: cls_u32: the tp_c argument of u32_set_parms() is always tp->data
From: Al Viro It must be tc_u_common associated with that tp (i.e. tp->data). Proof: * both ->ht_up and ->tp_c are assign-once * ->tp_c of anything inserted into tp_c->hlist is tp_c * hnodes never get reinserted into the lists or moved between those, so anything found by u32_lookup_ht(tp->data, ...) will have ->tp_c equal to tp->data. * tp->root->tp_c == tp->data. * ->ht_up of anything inserted into hnode->ht[...] is equal to hnode. * knodes never get reinserted into hash chains or moved between those, so anything returned by u32_lookup_key(ht, ...) will have ->ht_up equal to ht. * any knode returned by u32_get(tp, ...) will have ->ht_up->tp_c point to tp->data Signed-off-by: Al Viro Signed-off-by: Jamal Hadi Salim --- net/sched/cls_u32.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c index 53f34f8cde8b..3ed2c9866b36 100644 --- a/net/sched/cls_u32.c +++ b/net/sched/cls_u32.c @@ -956,8 +956,7 @@ static int u32_change(struct net *net, struct sk_buff *in_skb, if (!new) return -ENOMEM; - err = u32_set_parms(net, tp, base, - rtnl_dereference(n->ht_up)->tp_c, new, tb, + err = u32_set_parms(net, tp, base, tp_c, new, tb, tca[TCA_RATE], ovr, extack); if (err) { @@ -1124,7 +1123,7 @@ static int u32_change(struct net *net, struct sk_buff *in_skb, } #endif - err = u32_set_parms(net, tp, base, ht->tp_c, n, tb, tca[TCA_RATE], ovr, + err = u32_set_parms(net, tp, base, tp_c, n, tb, tca[TCA_RATE], ovr, extack); if (err == 0) { struct tc_u_knode __rcu **ins; -- 2.11.0
[PATCH net-next 00/11] net: sched: cls_u32 Various improvements
From: Jamal Hadi Salim Various improvements from Al. Al Viro (11): net: sched: cls_u32: disallow linking to root hnode net: sched: cls_u32: make sure that divisor is a power of 2 net: sched: cls_u32: get rid of unused argument of u32_destroy_key() net: sched: cls_u32: get rid of tc_u_knode ->tp net: sched: cls_u32: get rid of tc_u_common ->rcu net: sched: cls_u32: clean tc_u_common hashtable net: sched: cls_u32: pass tc_u_common to u32_set_parms() instead of tc_u_hnode net: sched: cls_u32: the tp_c argument of u32_set_parms() is always tp->data net: sched: cls_u32: keep track of knodes count in tc_u_common net: sched: cls_u32: simplify the hell out u32_delete() emptiness check net: sched: cls_u32: get rid of tp_c net/sched/cls_u32.c | 117 1 file changed, 35 insertions(+), 82 deletions(-) -- 2.11.0
[PATCH net-next 05/11] net: sched: cls_u32: get rid of tc_u_common ->rcu
From: Al Viro unused Signed-off-by: Al Viro Signed-off-by: Jamal Hadi Salim --- net/sched/cls_u32.c | 1 - 1 file changed, 1 deletion(-) diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c index 810c49ac1bbe..c378168f4562 100644 --- a/net/sched/cls_u32.c +++ b/net/sched/cls_u32.c @@ -98,7 +98,6 @@ struct tc_u_common { int refcnt; struct idr handle_idr; struct hlist_node hnode; - struct rcu_head rcu; }; static inline unsigned int u32_hash_fold(__be32 key, -- 2.11.0
[PATCH net-next 04/11] net: sched: cls_u32: get rid of tc_u_knode ->tp
From: Al Viro not used anymore Signed-off-by: Al Viro Signed-off-by: Jamal Hadi Salim --- net/sched/cls_u32.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c index ef0f2e6ec422..810c49ac1bbe 100644 --- a/net/sched/cls_u32.c +++ b/net/sched/cls_u32.c @@ -68,7 +68,6 @@ struct tc_u_knode { u32 mask; u32 __percpu*pcpu_success; #endif - struct tcf_proto*tp; struct rcu_work rwork; /* The 'sel' field MUST be the last field in structure to allow for * tc_u32_keys allocated at end of structure. @@ -896,7 +895,6 @@ static struct tc_u_knode *u32_init_knode(struct tcf_proto *tp, /* Similarly success statistics must be moved as pointers */ new->pcpu_success = n->pcpu_success; #endif - new->tp = tp; memcpy(>sel, s, sizeof(*s) + s->nkeys*sizeof(struct tc_u32_key)); if (tcf_exts_init(>exts, TCA_U32_ACT, TCA_U32_POLICE)) { @@ -1112,7 +1110,6 @@ static int u32_change(struct net *net, struct sk_buff *in_skb, n->handle = handle; n->fshift = s->hmask ? ffs(ntohl(s->hmask)) - 1 : 0; n->flags = flags; - n->tp = tp; err = tcf_exts_init(>exts, TCA_U32_ACT, TCA_U32_POLICE); if (err < 0) -- 2.11.0
[PATCH net-next 09/11] net: sched: cls_u32: get rid of tp_c
From: Al Viro Both hnode ->tp_c and tp_c argument of u32_set_parms() the latter is redundant, the former - never read... Signed-off-by: Al Viro Signed-off-by: Jamal Hadi Salim --- net/sched/cls_u32.c | 11 --- 1 file changed, 4 insertions(+), 7 deletions(-) diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c index 3ed2c9866b36..3d4c360f9b0c 100644 --- a/net/sched/cls_u32.c +++ b/net/sched/cls_u32.c @@ -79,7 +79,6 @@ struct tc_u_hnode { struct tc_u_hnode __rcu *next; u32 handle; u32 prio; - struct tc_u_common *tp_c; int refcnt; unsigned intdivisor; struct idr handle_idr; @@ -390,7 +389,6 @@ static int u32_init(struct tcf_proto *tp) tp_c->refcnt++; RCU_INIT_POINTER(root_ht->next, tp_c->hlist); rcu_assign_pointer(tp_c->hlist, root_ht); - root_ht->tp_c = tp_c; rcu_assign_pointer(tp->root, root_ht); tp->data = tp_c; @@ -761,7 +759,7 @@ static const struct nla_policy u32_policy[TCA_U32_MAX + 1] = { }; static int u32_set_parms(struct net *net, struct tcf_proto *tp, -unsigned long base, struct tc_u_common *tp_c, +unsigned long base, struct tc_u_knode *n, struct nlattr **tb, struct nlattr *est, bool ovr, struct netlink_ext_ack *extack) @@ -782,7 +780,7 @@ static int u32_set_parms(struct net *net, struct tcf_proto *tp, } if (handle) { - ht_down = u32_lookup_ht(tp_c, handle); + ht_down = u32_lookup_ht(tp->data, handle); if (!ht_down) { NL_SET_ERR_MSG_MOD(extack, "Link hash table not found"); @@ -956,7 +954,7 @@ static int u32_change(struct net *net, struct sk_buff *in_skb, if (!new) return -ENOMEM; - err = u32_set_parms(net, tp, base, tp_c, new, tb, + err = u32_set_parms(net, tp, base, new, tb, tca[TCA_RATE], ovr, extack); if (err) { @@ -1012,7 +1010,6 @@ static int u32_change(struct net *net, struct sk_buff *in_skb, return err; } } - ht->tp_c = tp_c; ht->refcnt = 1; ht->divisor = divisor; ht->handle = handle; @@ -1123,7 +1120,7 @@ static int u32_change(struct net *net, struct sk_buff *in_skb, } #endif - err = u32_set_parms(net, tp, base, tp_c, n, tb, tca[TCA_RATE], ovr, + err = u32_set_parms(net, tp, base, n, tb, tca[TCA_RATE], ovr, extack); if (err == 0) { struct tc_u_knode __rcu **ins; -- 2.11.0
[PATCH net-next 01/11] net: sched: cls_u32: disallow linking to root hnode
From: Al Viro Operation makes no sense. Nothing will actually break if we do so (depth limit in u32_classify() will prevent infinite loops), but according to maintainers it's best prohibited outright. NOTE: doing so guarantees that u32_destroy() will trigger the call of u32_destroy_hnode(); we might want to make that unconditional. Test: tc qdisc add dev eth0 ingress tc filter add dev eth0 parent : protocol ip prio 100 u32 \ link 800: offset at 0 mask 0f00 shift 6 plus 0 eat match ip protocol 6 ff should fail with Error: cls_u32: Not linking to root node Signed-off-by: Al Viro Signed-off-by: Jamal Hadi Salim --- net/sched/cls_u32.c | 4 1 file changed, 4 insertions(+) diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c index 622f4657da94..3357331a80a2 100644 --- a/net/sched/cls_u32.c +++ b/net/sched/cls_u32.c @@ -797,6 +797,10 @@ static int u32_set_parms(struct net *net, struct tcf_proto *tp, NL_SET_ERR_MSG_MOD(extack, "Link hash table not found"); return -EINVAL; } + if (ht_down->is_root) { + NL_SET_ERR_MSG_MOD(extack, "Not linking to root node"); + return -EINVAL; + } ht_down->refcnt++; } -- 2.11.0
[PATCH net-next 06/11] net: sched: cls_u32: clean tc_u_common hashtable
From: Al Viro * calculate key *once*, not for each hash chain element * let tc_u_hash() return the pointer to chain head rather than index - callers are cleaner that way. Signed-off-by: Al Viro Signed-off-by: Jamal Hadi Salim --- net/sched/cls_u32.c | 24 +--- 1 file changed, 9 insertions(+), 15 deletions(-) diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c index c378168f4562..3f6fba831c57 100644 --- a/net/sched/cls_u32.c +++ b/net/sched/cls_u32.c @@ -343,19 +343,16 @@ static void *tc_u_common_ptr(const struct tcf_proto *tp) return block->q; } -static unsigned int tc_u_hash(const struct tcf_proto *tp) +static struct hlist_head *tc_u_hash(void *key) { - return hash_ptr(tc_u_common_ptr(tp), U32_HASH_SHIFT); + return tc_u_common_hash + hash_ptr(key, U32_HASH_SHIFT); } -static struct tc_u_common *tc_u_common_find(const struct tcf_proto *tp) +static struct tc_u_common *tc_u_common_find(void *key) { struct tc_u_common *tc; - unsigned int h; - - h = tc_u_hash(tp); - hlist_for_each_entry(tc, _u_common_hash[h], hnode) { - if (tc->ptr == tc_u_common_ptr(tp)) + hlist_for_each_entry(tc, tc_u_hash(key), hnode) { + if (tc->ptr == key) return tc; } return NULL; @@ -364,10 +361,8 @@ static struct tc_u_common *tc_u_common_find(const struct tcf_proto *tp) static int u32_init(struct tcf_proto *tp) { struct tc_u_hnode *root_ht; - struct tc_u_common *tp_c; - unsigned int h; - - tp_c = tc_u_common_find(tp); + void *key = tc_u_common_ptr(tp); + struct tc_u_common *tp_c = tc_u_common_find(key); root_ht = kzalloc(sizeof(*root_ht), GFP_KERNEL); if (root_ht == NULL) @@ -385,12 +380,11 @@ static int u32_init(struct tcf_proto *tp) kfree(root_ht); return -ENOBUFS; } - tp_c->ptr = tc_u_common_ptr(tp); + tp_c->ptr = key; INIT_HLIST_NODE(_c->hnode); idr_init(_c->handle_idr); - h = tc_u_hash(tp); - hlist_add_head(_c->hnode, _u_common_hash[h]); + hlist_add_head(_c->hnode, tc_u_hash(key)); } tp_c->refcnt++; -- 2.11.0
[PATCH net-next 07/11] net: sched: cls_u32: pass tc_u_common to u32_set_parms() instead of tc_u_hnode
From: Al Viro the only thing we used ht for was ht->tp_c and callers can get that without going through ->tp_c at all; start with lifting that into the callers, next commits will massage those, eventually removing ->tp_c altogether. Signed-off-by: Al Viro Signed-off-by: Jamal Hadi Salim --- net/sched/cls_u32.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c index 3f6fba831c57..53f34f8cde8b 100644 --- a/net/sched/cls_u32.c +++ b/net/sched/cls_u32.c @@ -761,7 +761,7 @@ static const struct nla_policy u32_policy[TCA_U32_MAX + 1] = { }; static int u32_set_parms(struct net *net, struct tcf_proto *tp, -unsigned long base, struct tc_u_hnode *ht, +unsigned long base, struct tc_u_common *tp_c, struct tc_u_knode *n, struct nlattr **tb, struct nlattr *est, bool ovr, struct netlink_ext_ack *extack) @@ -782,7 +782,7 @@ static int u32_set_parms(struct net *net, struct tcf_proto *tp, } if (handle) { - ht_down = u32_lookup_ht(ht->tp_c, handle); + ht_down = u32_lookup_ht(tp_c, handle); if (!ht_down) { NL_SET_ERR_MSG_MOD(extack, "Link hash table not found"); @@ -957,7 +957,7 @@ static int u32_change(struct net *net, struct sk_buff *in_skb, return -ENOMEM; err = u32_set_parms(net, tp, base, - rtnl_dereference(n->ht_up), new, tb, + rtnl_dereference(n->ht_up)->tp_c, new, tb, tca[TCA_RATE], ovr, extack); if (err) { @@ -1124,7 +1124,7 @@ static int u32_change(struct net *net, struct sk_buff *in_skb, } #endif - err = u32_set_parms(net, tp, base, ht, n, tb, tca[TCA_RATE], ovr, + err = u32_set_parms(net, tp, base, ht->tp_c, n, tb, tca[TCA_RATE], ovr, extack); if (err == 0) { struct tc_u_knode __rcu **ins; -- 2.11.0
[PATCH net-next 10/11] net: sched: cls_u32: keep track of knodes count in tc_u_common
From: Al Viro allows to simplify u32_delete() considerably Signed-off-by: Al Viro Signed-off-by: Jamal Hadi Salim --- net/sched/cls_u32.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c index 3d4c360f9b0c..61593bee08db 100644 --- a/net/sched/cls_u32.c +++ b/net/sched/cls_u32.c @@ -97,6 +97,7 @@ struct tc_u_common { int refcnt; struct idr handle_idr; struct hlist_node hnode; + longknodes; }; static inline unsigned int u32_hash_fold(__be32 key, @@ -452,6 +453,7 @@ static void u32_delete_key_freepf_work(struct work_struct *work) static int u32_delete_key(struct tcf_proto *tp, struct tc_u_knode *key) { + struct tc_u_common *tp_c = tp->data; struct tc_u_knode __rcu **kp; struct tc_u_knode *pkp; struct tc_u_hnode *ht = rtnl_dereference(key->ht_up); @@ -462,6 +464,7 @@ static int u32_delete_key(struct tcf_proto *tp, struct tc_u_knode *key) kp = >next, pkp = rtnl_dereference(*kp)) { if (pkp == key) { RCU_INIT_POINTER(*kp, key->next); + tp_c->knodes--; tcf_unbind_filter(tp, >res); idr_remove(>handle_idr, key->handle); @@ -576,6 +579,7 @@ static int u32_replace_hw_knode(struct tcf_proto *tp, struct tc_u_knode *n, static void u32_clear_hnode(struct tcf_proto *tp, struct tc_u_hnode *ht, struct netlink_ext_ack *extack) { + struct tc_u_common *tp_c = tp->data; struct tc_u_knode *n; unsigned int h; @@ -583,6 +587,7 @@ static void u32_clear_hnode(struct tcf_proto *tp, struct tc_u_hnode *ht, while ((n = rtnl_dereference(ht->ht[h])) != NULL) { RCU_INIT_POINTER(ht->ht[h], rtnl_dereference(n->next)); + tp_c->knodes--; tcf_unbind_filter(tp, >res); u32_remove_hw_knode(tp, n, extack); idr_remove(>handle_idr, n->handle); @@ -1141,6 +1146,7 @@ static int u32_change(struct net *net, struct sk_buff *in_skb, RCU_INIT_POINTER(n->next, pins); rcu_assign_pointer(*ins, n); + tp_c->knodes++; *arg = n; return 0; } -- 2.11.0
[PATCH net-next 02/11] net: sched: cls_u32: make sure that divisor is a power of 2
From: Al Viro Tested by modifying iproute2 to to allow sending a divisor > 255 Tested-by: Jamal Hadi Salim Signed-off-by: Al Viro Signed-off-by: Jamal Hadi Salim --- net/sched/cls_u32.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c index 3357331a80a2..ce55eea448a0 100644 --- a/net/sched/cls_u32.c +++ b/net/sched/cls_u32.c @@ -994,7 +994,11 @@ static int u32_change(struct net *net, struct sk_buff *in_skb, if (tb[TCA_U32_DIVISOR]) { unsigned int divisor = nla_get_u32(tb[TCA_U32_DIVISOR]); - if (--divisor > 0x100) { + if (!is_power_of_2(divisor)) { + NL_SET_ERR_MSG_MOD(extack, "Divisor is not a power of 2"); + return -EINVAL; + } + if (divisor-- > 0x100) { NL_SET_ERR_MSG_MOD(extack, "Exceeded maximum 256 hash buckets"); return -EINVAL; } -- 2.11.0
[PATCH net-next 11/11] net: sched: cls_u32: simplify the hell out u32_delete() emptiness check
From: Al Viro Now that we have the knode count, we can instantly check if any hnodes are non-empty. And that kills the check for extra references to root hnode - those could happen only if there was a knode to carry such a link. Signed-off-by: Al Viro Signed-off-by: Jamal Hadi Salim --- net/sched/cls_u32.c | 48 +--- 1 file changed, 1 insertion(+), 47 deletions(-) diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c index 61593bee08db..ac79a40a0392 100644 --- a/net/sched/cls_u32.c +++ b/net/sched/cls_u32.c @@ -627,17 +627,6 @@ static int u32_destroy_hnode(struct tcf_proto *tp, struct tc_u_hnode *ht, return -ENOENT; } -static bool ht_empty(struct tc_u_hnode *ht) -{ - unsigned int h; - - for (h = 0; h <= ht->divisor; h++) - if (rcu_access_pointer(ht->ht[h])) - return false; - - return true; -} - static void u32_destroy(struct tcf_proto *tp, struct netlink_ext_ack *extack) { struct tc_u_common *tp_c = tp->data; @@ -675,13 +664,9 @@ static int u32_delete(struct tcf_proto *tp, void *arg, bool *last, struct netlink_ext_ack *extack) { struct tc_u_hnode *ht = arg; - struct tc_u_hnode *root_ht = rtnl_dereference(tp->root); struct tc_u_common *tp_c = tp->data; int ret = 0; - if (ht == NULL) - goto out; - if (TC_U32_KEY(ht->handle)) { u32_remove_hw_knode(tp, (struct tc_u_knode *)ht, extack); ret = u32_delete_key(tp, (struct tc_u_knode *)ht); @@ -702,38 +687,7 @@ static int u32_delete(struct tcf_proto *tp, void *arg, bool *last, } out: - *last = true; - if (root_ht) { - if (root_ht->refcnt > 1) { - *last = false; - goto ret; - } - if (root_ht->refcnt == 1) { - if (!ht_empty(root_ht)) { - *last = false; - goto ret; - } - } - } - - if (tp_c->refcnt > 1) { - *last = false; - goto ret; - } - - if (tp_c->refcnt == 1) { - struct tc_u_hnode *ht; - - for (ht = rtnl_dereference(tp_c->hlist); -ht; -ht = rtnl_dereference(ht->next)) - if (!ht_empty(ht)) { - *last = false; - break; - } - } - -ret: + *last = tp_c->refcnt == 1 && tp_c->knodes == 0; return ret; } -- 2.11.0
[PATCH net-next 03/11] net: sched: cls_u32: get rid of unused argument of u32_destroy_key()
From: Al Viro Signed-off-by: Al Viro Signed-off-by: Jamal Hadi Salim --- net/sched/cls_u32.c | 13 ++--- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c index ce55eea448a0..ef0f2e6ec422 100644 --- a/net/sched/cls_u32.c +++ b/net/sched/cls_u32.c @@ -405,8 +405,7 @@ static int u32_init(struct tcf_proto *tp) return 0; } -static int u32_destroy_key(struct tcf_proto *tp, struct tc_u_knode *n, - bool free_pf) +static int u32_destroy_key(struct tc_u_knode *n, bool free_pf) { struct tc_u_hnode *ht = rtnl_dereference(n->ht_down); @@ -440,7 +439,7 @@ static void u32_delete_key_work(struct work_struct *work) struct tc_u_knode, rwork); rtnl_lock(); - u32_destroy_key(key->tp, key, false); + u32_destroy_key(key, false); rtnl_unlock(); } @@ -457,7 +456,7 @@ static void u32_delete_key_freepf_work(struct work_struct *work) struct tc_u_knode, rwork); rtnl_lock(); - u32_destroy_key(key->tp, key, true); + u32_destroy_key(key, true); rtnl_unlock(); } @@ -600,7 +599,7 @@ static void u32_clear_hnode(struct tcf_proto *tp, struct tc_u_hnode *ht, if (tcf_exts_get_net(>exts)) tcf_queue_work(>rwork, u32_delete_key_freepf_work); else - u32_destroy_key(n->tp, n, true); + u32_destroy_key(n, true); } } } @@ -971,13 +970,13 @@ static int u32_change(struct net *net, struct sk_buff *in_skb, tca[TCA_RATE], ovr, extack); if (err) { - u32_destroy_key(tp, new, false); + u32_destroy_key(new, false); return err; } err = u32_replace_hw_knode(tp, new, flags, extack); if (err) { - u32_destroy_key(tp, new, false); + u32_destroy_key(new, false); return err; } -- 2.11.0
[PATCH net 1/1] net: sched: cls_u32: fix hnode refcounting
From: Al Viro cls_u32.c misuses refcounts for struct tc_u_hnode - it counts references via ->hlist and via ->tp_root together. u32_destroy() drops the former and, in case when there had been links, leaves the sucker on the list. As the result, there's nothing to protect it from getting freed once links are dropped. That also makes the "is it busy" check incapable of catching the root hnode - it *is* busy (there's a reference from tp), but we don't see it as something separate. "Is it our root?" check partially covers that, but the problem exists for others' roots as well. AFAICS, the minimal fix preserving the existing behaviour (where it doesn't include oopsen, that is) would be this: * count tp->root and tp_c->hlist as separate references. I.e. have u32_init() set refcount to 2, not 1. * in u32_destroy() we always drop the former; in u32_destroy_hnode() - the latter. That way we have *all* references contributing to refcount. List removal happens in u32_destroy_hnode() (called only when ->refcnt is 1) an in u32_destroy() in case of tc_u_common going away, along with everything reachable from it. IOW, that way we know that u32_destroy_key() won't free something still on the list (or pointed to by someone's ->root). Reproducer: tc qdisc add dev eth0 ingress tc filter add dev eth0 parent : protocol ip prio 100 handle 1: \ u32 divisor 1 tc filter add dev eth0 parent : protocol ip prio 200 handle 2: \ u32 divisor 1 tc filter add dev eth0 parent : protocol ip prio 100 \ handle 1:0:11 u32 ht 1: link 801: offset at 0 mask 0f00 shift 6 \ plus 0 eat match ip protocol 6 ff tc filter delete dev eth0 parent : protocol ip prio 200 tc filter change dev eth0 parent : protocol ip prio 100 \ handle 1:0:11 u32 ht 1: link 0: offset at 0 mask 0f00 shift 6 plus 0 \ eat match ip protocol 6 ff tc filter delete dev eth0 parent : protocol ip prio 100 Signed-off-by: Al Viro Signed-off-by: Jamal Hadi Salim --- net/sched/cls_u32.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c index f218ccf1e2d9..b2c3406a2cf2 100644 --- a/net/sched/cls_u32.c +++ b/net/sched/cls_u32.c @@ -398,6 +398,7 @@ static int u32_init(struct tcf_proto *tp) rcu_assign_pointer(tp_c->hlist, root_ht); root_ht->tp_c = tp_c; + root_ht->refcnt++; rcu_assign_pointer(tp->root, root_ht); tp->data = tp_c; return 0; @@ -610,7 +611,7 @@ static int u32_destroy_hnode(struct tcf_proto *tp, struct tc_u_hnode *ht, struct tc_u_hnode __rcu **hn; struct tc_u_hnode *phn; - WARN_ON(ht->refcnt); + WARN_ON(--ht->refcnt); u32_clear_hnode(tp, ht, extack); @@ -649,7 +650,7 @@ static void u32_destroy(struct tcf_proto *tp, struct netlink_ext_ack *extack) WARN_ON(root_ht == NULL); - if (root_ht && --root_ht->refcnt == 0) + if (root_ht && --root_ht->refcnt == 1) u32_destroy_hnode(tp, root_ht, extack); if (--tp_c->refcnt == 0) { @@ -698,7 +699,6 @@ static int u32_delete(struct tcf_proto *tp, void *arg, bool *last, } if (ht->refcnt == 1) { - ht->refcnt--; u32_destroy_hnode(tp, ht, extack); } else { NL_SET_ERR_MSG_MOD(extack, "Can not delete in-use filter"); @@ -708,11 +708,11 @@ static int u32_delete(struct tcf_proto *tp, void *arg, bool *last, out: *last = true; if (root_ht) { - if (root_ht->refcnt > 1) { + if (root_ht->refcnt > 2) { *last = false; goto ret; } - if (root_ht->refcnt == 1) { + if (root_ht->refcnt == 2) { if (!ht_empty(root_ht)) { *last = false; goto ret; -- 2.11.0
Re: Offloaded u32 classifier tables WAS (Re: [PATCH net 00/13] cls_u32 cleanups and fixes.
On 2018-09-10 8:25 a.m., Jamal Hadi Salim wrote: On 2018-09-09 11:48 a.m., Al Viro wrote: BTW, shouldn't we issue u32_clear_hw_hnode() every time we destroy an hnode? It's done on u32_delete(), it's done (for root ht) on u32_destroy(), but it's not done for any other hnodes when you remove the entire (not shared) filter. Looks fishy... What you are saying makes sense, but that doesnt seem like a new thing. All hardware offload examples I have seen use root tables only[1]. So I am not sure if they use any other tables although the intel hardware at least seems very capable. Look at ixgbe_main.c for example. Theres an explicit assumption that root is always 0x800 but oresence of "..but presence of other tables can be handled" cheers, jamal
Offloaded u32 classifier tables WAS (Re: [PATCH net 00/13] cls_u32 cleanups and fixes.
On 2018-09-09 11:48 a.m., Al Viro wrote: BTW, shouldn't we issue u32_clear_hw_hnode() every time we destroy an hnode? It's done on u32_delete(), it's done (for root ht) on u32_destroy(), but it's not done for any other hnodes when you remove the entire (not shared) filter. Looks fishy... What you are saying makes sense, but that doesnt seem like a new thing. All hardware offload examples I have seen use root tables only[1]. So I am not sure if they use any other tables although the intel hardware at least seems very capable. Look at ixgbe_main.c for example. Theres an explicit assumption that root is always 0x800 but oresence of +Cc some of the NIC vendor folks.. cheers, jamal [1] Here's a script posted by someone at Intel(Sridhar?) a while back that adds 2 filters, one with skip-sw and the other with skip-hw flag. --- # add ingress qdisc tc qdisc add dev p4p1 ingress # enable hw tc offload. ethtool -K p4p1 hw-tc-offload on # add u32 filter with skip-sw flag. tc filter add dev p4p1 parent : protocol ip prio 99 \ handle 800:0:1 u32 ht 800: flowid 800:1 \ skip-sw \ match ip src 192.168.1.0/24 \ action drop # add u32 filter with skip-hw flag. tc filter add dev p4p1 parent : protocol ip prio 99 \ handle 800:0:2 u32 ht 800: flowid 800:2 \ skip-hw \ match ip src 192.168.2.0/24 \ action drop
Re: [PATCH net 00/13] cls_u32 cleanups and fixes.
On 2018-09-09 10:15 a.m., Al Viro wrote: [..] Umm... Interesting - TCA_U32_SEL is not the only thing that gets ignored there; TCA_U32_MARK gets the same treatment. And then there's a lovely question what to do with n->pf - it's an array of n->sel.nkeys counters, and apparently we want (at least in common cases) to avoid resetting those. *If* we declare that ->nkeys mismatch means failure, it's all relatively easy to implement. Alternatively, we could declare that selector change means resetting the stats. Preferences? I am now conflicted. I have sample scripts which showed that at one point that worked. All of them seem to be indicating the nkeys and stats never changed. So that could be a starting point; however, if a policy changes the match tuples then it is no longer the same rule imo. Note: you can change the "actions" - of which the most primitive is "set classid x:y" without changing what is being matched. i.e changing the classid in that example would work. cheers, jamal
Re: [PATCH net 00/13] cls_u32 cleanups and fixes.
Since you have the momentum here: i noticed something unusual while i was trying to craft a test that would vet some of your changes. This has nothing to do with your changes, same happens on my stock debian laptop with kernel: 4.17.0-0.bpo.3-amd64 #1 SMP Debian 4.17.17-1~bpo9+1 (2018-08-27) Looking at git - possibly introduced around the time u32 lockless was being introduced and maybe even earlier than that. Unfortunately i dont have time to dig further. To reproduce what i am referring to, here's a setup: $tc filter add dev eth0 parent : protocol ip prio 102 u32 \ classid 1:2 match ip src 192.168.8.0/8 $tc filter replace dev eth0 parent : protocol ip prio 102 \ handle 800:0:800 u32 classid 1:2 match ip src 1.1.0.0/24 u32_change() code path should have allowed changing of the keynode. cheers, jamal
Re: [PATCH net 13/13] net: sched: cls_u32: simplify the hell out u32_delete() emptiness check
On 2018-09-08 9:31 p.m., Al Viro wrote: From: Al Viro Now that we have the knode count, we can instantly check if any hnodes are non-empty. And that kills the check for extra references to root hnode - those could happen only if there was a knode to carry such a link. Signed-off-by: Al Viro Acked-by: Jamal Hadi Salim cheers, jamal
Re: [PATCH net 12/13] net: sched: cls_u32: keep track of knodes count in tc_u_common
On 2018-09-08 9:31 p.m., Al Viro wrote: From: Al Viro allows to simplify u32_delete() considerably Signed-off-by: Al Viro Acked-by: Jamal Hadi Salim cheers, jamal
Re: [PATCH net 11/13] net: sched: cls_u32: get rid of hnode ->tp_c and tp_c argument of u32_set_parms()
On 2018-09-08 9:31 p.m., Al Viro wrote: From: Al Viro Acked-by: Jamal Hadi Salim cheers, jamal
Re: [PATCH net 10/13] net: sched: cls_u32: the tp_c argument of u32_set_parms() is always tp->data
On 2018-09-08 9:31 p.m., Al Viro wrote: From: Al Viro Acked-by: Jamal Hadi Salim cheers, jamal
Re: [PATCH net 09/13] net: sched: cls_u32: pass tc_u_common to u32_set_parms() instead of tc_u_hnode
On 2018-09-08 9:31 p.m., Al Viro wrote: From: Al Viro the only thing we used ht for was ht->tp_c and callers can get that without going through ->tp_c at all; start with lifting that into the callers, next commits will massage those, eventually removing ->tp_c altogether. Signed-off-by: Al Viro Acked-by: Jamal Hadi Salim cheers, jamal
Re: [PATCH net 08/13] net: sched: cls_u32: clean tc_u_common hashtable
On 2018-09-08 9:31 p.m., Al Viro wrote: From: Al Viro * calculate key *once*, not for each hash chain element * let tc_u_hash() return the pointer to chain head rather than index - callers are cleaner that way. Signed-off-by: Al Viro Acked-by: Jamal Hadi Salim cheers, jamal
Re: [PATCH net 07/13] net: sched: cls_u32: get rid of tc_u_common ->rcu
On 2018-09-08 9:31 p.m., Al Viro wrote: From: Al Viro unused Signed-off-by: Al Viro Acked-by: Jamal Hadi Salim cheers, jamal
Re: [PATCH net 06/13] net: sched: cls_u32: get rid of tc_u_knode ->tp
On 2018-09-08 9:31 p.m., Al Viro wrote: From: Al Viro not used anymore Signed-off-by: Al Viro Acked-by: Jamal Hadi Salim cheers, jamal
Re: [PATCH net 05/13] net: sched: cls_u32: get rid of unused argument of u32_destroy_key()
On 2018-09-08 9:31 p.m., Al Viro wrote: From: Al Viro Signed-off-by: Al Viro Acked-by: Jamal Hadi Salim cheers, jamal
Re: [PATCH net 04/13] net: sched: cls_u32: make sure that divisor is a power of 2
On 2018-09-08 9:31 p.m., Al Viro wrote: From: Al Viro > Signed-off-by: Al Viro Test is by hacking user space iproute2 to allow sending a divisor > 255 Tested-by: Jamal Hadi Salim Acked-by: Jamal Hadi Salim cheers, jamal
Re: [PATCH net 02/13] net: sched: cls_u32: mark root hnode explicitly
On 2018-09-08 9:31 p.m., Al Viro wrote: From: Al Viro Acked-by: Jamal Hadi Salim cheers, jamal
Re: [PATCH net 03/13] net: sched: cls_u32: disallow linking to root hnode
On 2018-09-08 9:31 p.m., Al Viro wrote: From: Al Viro Operation makes no sense. Nothing will actually break if we do so (depth limit in u32_classify() will prevent infinite loops), but according to maintainers it's best prohibited outright. NOTE: doing so guarantees that u32_destroy() will trigger the call of u32_destroy_hnode(); we might want to make that unconditional. Test: tc qdisc add dev eth0 ingress tc filter add dev eth0 parent : protocol ip prio 100 u32 \ link 800: offset at 0 mask 0f00 shift 6 plus 0 eat match ip protocol 6 ff should fail with Error: cls_u32: Not linking to root node Signed-off-by: Al Viro Tested-by: Jamal Hadi Salim Acked-by: Jamal Hadi Salim cheers, jamal
Re: [PATCH net 01/13] net: sched: cls_u32: fix hnode refcounting
On 2018-09-08 9:31 p.m., Al Viro wrote: From: Al Viro cls_u32.c misuses refcounts for struct tc_u_hnode - it counts references via ->hlist and via ->tp_root together. u32_destroy() drops the former and, in case when there had been links, leaves the sucker on the list. As the result, there's nothing to protect it from getting freed once links are dropped. That also makes the "is it busy" check incapable of catching the root hnode - it *is* busy (there's a reference from tp), but we don't see it as something separate. "Is it our root?" check partially covers that, but the problem exists for others' roots as well. AFAICS, the minimal fix preserving the existing behaviour (where it doesn't include oopsen, that is) would be this: * count tp->root and tp_c->hlist as separate references. I.e. have u32_init() set refcount to 2, not 1. * in u32_destroy() we always drop the former; in u32_destroy_hnode() - the latter. That way we have *all* references contributing to refcount. List removal happens in u32_destroy_hnode() (called only when ->refcnt is 1) an in u32_destroy() in case of tc_u_common going away, along with everything reachable from it. IOW, that way we know that u32_destroy_key() won't free something still on the list (or pointed to by someone's ->root). Cc: sta...@vger.kernel.org Signed-off-by: Al Viro per Cong's earlier remark - the reproducer going in the changelog would be nice to show i.e this you posted earlier, otherwise: Tested-by: Jamal Hadi Salim Acked-by: Jamal Hadi Salim Reminder: -- tc qdisc add dev eth0 ingress tc filter add dev eth0 parent : protocol ip prio 100 handle 1: u32 divisor 1 tc filter add dev eth0 parent : protocol ip prio 200 handle 2: u32 divisor 1 tc filter add dev eth0 parent : protocol ip prio 100 handle 1:0:11 u32 ht 1: link 801: offset at 0 mask 0f00 shift 6 plus 0 eat match ip protocol 6 ff tc filter delete dev eth0 parent : protocol ip prio 200 tc filter change dev eth0 parent : protocol ip prio 100 handle 1:0:11 u32 ht 1: link 0: offset at 0 mask 0f00 shift 6 plus 0 eat match ip protocol 6 ff tc filter delete dev eth0 parent ffff: protocol ip prio 100 --- cheers, jamal
Re: [PATCH 1/7] fix hnode refcounting
To clarify with an example i used to test your patches: #0 add ingress filter $TC qdisc add dev $P ingress #1 add filter $TC filter add dev $P parent : protocol ip prio 10 \ u32 match ip protocol 1 0xff #2 display $TC filter ls dev $P parent : #3 try to delete root $TC filter delete dev $P parent : protocol ip prio 10 \ handle 800: u32 #4 nothing changes.. $TC filter ls dev $P parent : #5 delete filter $TC filter delete dev $P parent : protocol ip prio 10 \ handle 800:0:800 u32 #6 filter gone but hash table still there.. $TC filter ls dev $P parent : #7 delete tp $TC filter delete dev $P parent : protocol ip prio 10 \ u32 #8 now it is gone.. $TC filter ls dev $P parent : your patches show #6 filter as still active. We want it to look like #8 Hope this helps. cheers, jamal
Re: [PATCH 1/7] fix hnode refcounting
On 2018-09-06 10:35 p.m., Al Viro wrote: On Thu, Sep 06, 2018 at 06:21:09AM -0400, Jamal Hadi Salim wrote: [..] Argh... Unfortunately, there's this: in u32_delete() we have if (root_ht) { if (root_ht->refcnt > 1) { *last = false; goto ret; } if (root_ht->refcnt == 1) { if (!ht_empty(root_ht)) { *last = false; goto ret; } } } and that would need to be updated. It is not detrimental as you have it right now but you are right an adjustment is needed... Deleting of a root directly should not be allowed. But you can flush a whole tp. Consider this: -- sudo tc qdisc add dev $P ingress sudo tc filter add dev $P parent : protocol ip prio 10 \ u32 match ip protocol 1 0xff Which creates root ht 800 You shouldnt be allowed to do this: -- tc filter delete dev $P parent : protocol ip prio 10 handle 800: u32 --- But you can delete the tp entirely as such: --- tc filter delete dev $P parent : protocol ip prio 10 u32 -- The later will go via the destroy() path and flush all filters. You should also be able to delete individual filters. ex: $tc filter del dev $P parent : prio 10 handle 800:0:800 u32 Where that code you are referring to is important is when the last filter deleted - we need the caller to know and it destroys root. i.e you should return last=true when the last filter is deleted so root gets auto deleted (just like it was autocreated) However, that logics is bloody odd to start with. First of all, root_ht has come from struct tc_u_hnode *root_ht = rtnl_dereference(tp->root); and the only place where it's ever modified is rcu_assign_pointer(tp->root, root_ht); in u32_init(), where we'd bloody well checked that root_ht is non-NULL (see if (root_ht == NULL) return -ENOBUFS; upstream of that place) and where that assignment is inevitable on the way to returning 0. No matter what, if tp has passed u32_init() it will have non-NULL ->root, forever. And there is no way for tcf_proto to be seen outside of tcf_proto_create() without ->init() having returned 0 - it gets freed before anyone sees it. Yes, the check for root_ht is not necessary - but the check for the last filter (and testing for last) is needed. So this 'if (root_ht)' can't be false. What's more, what the hell is the whole thing checking? We are in u32_delete(). It's called (as ->delete()) from tfilter_del_notify(), which is called from tc_del_tfilter(). If we return 0 with *last true, we follow up calling tcf_proto_destroy(). OK, let's look at the logics in there: * if there are links to root hnode => false * if there's no links to root hnode and it has knodes => false (BTW, if we ever get there with root_ht->refcnt < 1, we are obviously screwed) * if there is a tcf_proto sharing tp->data => false (i.e. any filters with different prio - don't bother) * if tp is the only one with reference to tp->data and there are *any* knodes => false. Any extra links can come only from knodes in a non-empty hnode. And it's not a common case. Shouldn't thIe whole thing be * shared tp->data => false * any non-empty hnode => false instead? Perhaps even with the knode counter in tp->data, avoiding any loops in there, as well as the entire ht_empty()... Now, in the very beginning of u32_delete() we have this: struct tc_u_hnode *ht = arg; if (ht == NULL) goto out; OK, but the call of ->delete() is err = tp->ops->delete(tp, fh, last, extack); and arg == NULL seen in u32_delete() means fh == NULL in tfilter_del_notify(). Which is called in if (!fh) { ... } else { bool last; err = tfilter_del_notify(net, skb, n, tp, block, q, parent, fh, false, , extack); How can we ever get there with NULL fh? Try: tc filter delete dev $P parent : protocol ip prio 10 u32 tcm handle is 0, so will hit that code path. The whole thing makes very little sense; looks like it used to live in u32_destroy() prior to commit 763dbf6328e41 ("net_sched: move the empty tp check from ->destroy() to ->delete()"), but looking at the rationale in that commit... I don't see how it fixes anything - sure, now we remove tcf_proto from the list before calling ->destroy(). Without any RCU delays in between. How could it possibly solve any issues with ->classify() called in parallel with ->destroy()? cls_u32 (at least these days) does try to survive u32_destroy() in parallel with u32_classify(); if any other classifiers do
Re: [PATCH 2/7] mark root hnode explicitly
On 2018-09-06 6:59 a.m., Al Viro wrote: On Thu, Sep 06, 2018 at 06:34:00AM -0400, Jamal Hadi Salim wrote: On 2018-09-06 6:28 a.m., Jamal Hadi Salim wrote: [..] Point, and that one is IMO enough to give up on using ->flags for that. How about simply diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c index 3f985f29ef30..d14048e38b5c 100644 --- a/net/sched/cls_u32.c +++ b/net/sched/cls_u32.c @@ -84,6 +84,7 @@ struct tc_u_hnode { int refcnt; unsigned intdivisor; struct idr handle_idr; + boolis_root; struct rcu_head rcu; u32 flags; /* The 'ht' field MUST be the last field in structure to allow for @@ -377,6 +378,7 @@ static int u32_init(struct tcf_proto *tp) root_ht->refcnt++; root_ht->handle = tp_c ? gen_new_htid(tp_c, root_ht) : 0x8000; root_ht->prio = tp->prio; + root_ht->is_root = true; idr_init(_ht->handle_idr); if (tp_c == NULL) { @@ -693,7 +695,7 @@ static int u32_delete(struct tcf_proto *tp, void *arg, bool *last, goto out; } - if (root_ht == ht) { + if (ht->is_root) { NL_SET_ERR_MSG_MOD(extack, "Not allowed to delete root node"); return -EINVAL; } @@ -795,6 +797,10 @@ static int u32_set_parms(struct net *net, struct tcf_proto *tp, NL_SET_ERR_MSG_MOD(extack, "Link hash table not found"); return -EINVAL; } + if (ht_down->is_root) { + NL_SET_ERR_MSG_MOD(extack, "Not linking to root node"); + return -EINVAL; + } ht_down->refcnt++; Looks good to me ;-> cheers, jamal
Re: [PATCH 2/7] mark root hnode explicitly
And a bunch of indentations... cheers, jamal diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c index 6d45ec4c218c..cb3bee12af78 100644 --- a/net/sched/cls_u32.c +++ b/net/sched/cls_u32.c @@ -485,7 +485,8 @@ static void u32_clear_hw_hnode(struct tcf_proto *tp, struct tc_u_hnode *h, struct tc_cls_u32_offload cls_u32 = {}; tc_cls_common_offload_init(_u32.common, tp, - h->flags & ~TCA_CLS_FLAGS_U32_ROOT, extack); + h->flags & ~TCA_CLS_FLAGS_U32_ROOT, + extack); cls_u32.command = TC_CLSU32_DELETE_HNODE; cls_u32.hnode.divisor = h->divisor; cls_u32.hnode.handle = h->handle; @@ -1215,7 +1216,7 @@ static int u32_reoffload_hnode(struct tcf_proto *tp, struct tc_u_hnode *ht, int err; tc_cls_common_offload_init(_u32.common, tp, - ht->flags & ~TCA_CLS_FLAGS_U32_ROOT, extack); + ht->flags & ~TCA_CLS_FLAGS_U32_ROOT, extack); cls_u32.command = add ? TC_CLSU32_NEW_HNODE : TC_CLSU32_DELETE_HNODE; cls_u32.hnode.divisor = ht->divisor; cls_u32.hnode.handle = ht->handle;
Re: [PATCH 7/7] clean tc_u_common hashtable
On 2018-09-05 3:04 p.m., Al Viro wrote: From: Al Viro * calculate key *once*, not for each hash chain element * let tc_u_hash() return the pointer to chain head rather than index - callers are cleaner that way. Signed-off-by: Al Viro Acked-by: Jamal Hadi Salim cheers, jamal
Re: [PATCH 6/7] get rid of tc_u_common ->rcu
On 2018-09-05 3:04 p.m., Al Viro wrote: From: Al Viro unused Signed-off-by: Al Viro Acked-by: Jamal Hadi Salim cheers, jamal
Re: [PATCH 5/7] get rid of tc_u_knode ->tp
On 2018-09-05 3:04 p.m., Al Viro wrote: From: Al Viro not used anymore Signed-off-by: Al Viro Acked-by: Jamal Hadi Salim cheers, jamal
Re: [PATCH 4/7] get rid of unused argument of u32_destroy_key()
On 2018-09-05 3:04 p.m., Al Viro wrote: From: Al Viro Signed-off-by: Al Viro Acked-by: Jamal Hadi Salim cheers, jamal
Re: [PATCH 2/7] mark root hnode explicitly
On 2018-09-06 6:28 a.m., Jamal Hadi Salim wrote: On 2018-09-05 3:04 p.m., Al Viro wrote: From: Al Viro ... and disallow deleting or linking to such Signed-off-by: Al Viro Same comment as other one in regards to subject Since the flag space is coming from htnode which is exposed via uapi it makes sense to keep this one here because it is for private use; but a comment in include/uapi/linux/pkt_cls.h that this flag or maybe a set of bits is reserved for internal use. Otherwise: Acked-by: Jamal Hadi Salim Sorry, additional comment: It makes sense to reject user space attempt to set TCA_CLS_FLAGS_U32_ROOT So my suggestion is to update tc_flags_valid() to check for this. cheers, jamal
Re: [PATCH 3/7] make sure that divisor is a power of 2
On 2018-09-05 3:04 p.m., Al Viro wrote: From: Al Viro Signed-off-by: Al Viro Acked-by: Jamal Hadi Salim cheers, jamal
Re: [PATCH 2/7] mark root hnode explicitly
On 2018-09-05 3:04 p.m., Al Viro wrote: From: Al Viro ... and disallow deleting or linking to such Signed-off-by: Al Viro Same comment as other one in regards to subject Since the flag space is coming from htnode which is exposed via uapi it makes sense to keep this one here because it is for private use; but a comment in include/uapi/linux/pkt_cls.h that this flag or maybe a set of bits is reserved for internal use. Otherwise: Acked-by: Jamal Hadi Salim cheers, jamal
Re: [PATCH 1/7] fix hnode refcounting
On 2018-09-05 3:04 p.m., Al Viro wrote: From: Al Viro cls_u32.c misuses refcounts for struct tc_u_hnode - it counts references via ->hlist and via ->tp_root together. u32_destroy() drops the former and, in case when there had been links, leaves the sucker on the list. As the result, there's nothing to protect it from getting freed once links are dropped. That also makes the "is it busy" check incapable of catching the root hnode - it *is* busy (there's a reference from tp), but we don't see it as something separate. "Is it our root?" check partially covers that, but the problem exists for others' roots as well. AFAICS, the minimal fix preserving the existing behaviour (where it doesn't include oopsen, that is) would be this: * count tp->root and tp_c->hlist as separate references. I.e. have u32_init() set refcount to 2, not 1. * in u32_destroy() we always drop the former; in u32_destroy_hnode() - the latter. That way we have *all* references contributing to refcount. List removal happens in u32_destroy_hnode() (called only when ->refcnt is 1) an in u32_destroy() in case of tc_u_common going away, along with everything reachable from it. IOW, that way we know that u32_destroy_key() won't free something still on the list (or pointed to by someone's ->root). Cc: sta...@vger.kernel.org Signed-off-by: Al Viro For networking patches, subject should be reflective of tree and subsystem. Example for this one: "[PATCH net 1/7]:net: sched: cls_u32: fix hnode refcounting" Also useful to have a cover letter summarizing the patchset in 0/7. Otherwise Acked-by: Jamal Hadi Salim cheers, jamal
Re: broken behaviour of TC filter delete
On 2018-08-25 9:02 a.m., Jiri Pirko wrote: Fri, Aug 24, 2018 at 08:11:07PM CEST, xiyou.wangc...@gmail.com wrote: ENOENT seems to be more logical to return when there's no more filter to delete. Yeah, at least we should keep ENOENT for compatibility. The bug here is chain 0 is gone after the last filter is gone, so when you delete the filter again, it treats it as you specify chain 0 which does not exist, so it hits EINVAL before ENOENT. I understand. My concern is about consistency with other chains. Perhaps -ENOENT for all chains in this case would be doable. What do you think? ENOENT with extack describing whether chain or filter not found. cheers, jamal
Re: [PATCH net-next] net: sched: act_ife: disable bh when taking ife_mod_lock
On 2018-08-13 1:20 p.m., Vlad Buslov wrote: Lockdep reports deadlock for following locking scenario in ife action: Task one: 1) Executes ife action update. 2) Takes tcfa_lock. 3) Waits on ife_mod_lock which is already taken by task two. Task two: 1) Executes any path that obtains ife_mod_lock without disabling bh (any path that takes ife_mod_lock while holding tcfa_lock has bh disabled) like loading a meta module, or creating new action. 2) Takes ife_mod_lock. 3) Task is preempted by rate estimator timer. 4) Timer callback waits on tcfa_lock which is taken by task one. In described case tasks deadlock because they take same two locks in different order. To prevent potential deadlock reported by lockdep, always disable bh when obtaining ife_mod_lock. Looks like your recent changes on net-next exposed this. Acked-by: Jamal Hadi Salim cheers, jamal
[PATCH net-next 04/13] net: sched: act_gact method rename for grep-ability and consistency
From: Jamal Hadi Salim Signed-off-by: Jamal Hadi Salim --- net/sched/act_gact.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/net/sched/act_gact.c b/net/sched/act_gact.c index bfccd34a3968..52a3e474d822 100644 --- a/net/sched/act_gact.c +++ b/net/sched/act_gact.c @@ -133,8 +133,8 @@ static int tcf_gact_init(struct net *net, struct nlattr *nla, return ret; } -static int tcf_gact(struct sk_buff *skb, const struct tc_action *a, - struct tcf_result *res) +static int tcf_gact_act(struct sk_buff *skb, const struct tc_action *a, + struct tcf_result *res) { struct tcf_gact *gact = to_gact(a); int action = READ_ONCE(gact->tcf_action); @@ -254,7 +254,7 @@ static struct tc_action_ops act_gact_ops = { .kind = "gact", .type = TCA_ACT_GACT, .owner = THIS_MODULE, - .act= tcf_gact, + .act= tcf_gact_act, .stats_update = tcf_gact_stats_update, .dump = tcf_gact_dump, .init = tcf_gact_init, -- 2.11.0
[PATCH net-next 11/13] net: sched: act_skbmod method rename for grep-ability and consistency
From: Jamal Hadi Salim Signed-off-by: Jamal Hadi Salim --- net/sched/act_skbmod.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/sched/act_skbmod.c b/net/sched/act_skbmod.c index e9c86ade3b40..d6a1af0c4171 100644 --- a/net/sched/act_skbmod.c +++ b/net/sched/act_skbmod.c @@ -24,7 +24,7 @@ static unsigned int skbmod_net_id; static struct tc_action_ops act_skbmod_ops; #define MAX_EDIT_LEN ETH_HLEN -static int tcf_skbmod_run(struct sk_buff *skb, const struct tc_action *a, +static int tcf_skbmod_act(struct sk_buff *skb, const struct tc_action *a, struct tcf_result *res) { struct tcf_skbmod *d = to_skbmod(a); @@ -270,7 +270,7 @@ static struct tc_action_ops act_skbmod_ops = { .kind = "skbmod", .type = TCA_ACT_SKBMOD, .owner = THIS_MODULE, - .act= tcf_skbmod_run, + .act= tcf_skbmod_act, .dump = tcf_skbmod_dump, .init = tcf_skbmod_init, .cleanup= tcf_skbmod_cleanup, -- 2.11.0
[PATCH net-next 07/13] net: sched: act_pedit method rename for grep-ability and consistency
From: Jamal Hadi Salim Signed-off-by: Jamal Hadi Salim --- net/sched/act_pedit.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/net/sched/act_pedit.c b/net/sched/act_pedit.c index 3f62da72ab6a..8a7a7cb94e83 100644 --- a/net/sched/act_pedit.c +++ b/net/sched/act_pedit.c @@ -288,8 +288,8 @@ static int pedit_skb_hdr_offset(struct sk_buff *skb, return ret; } -static int tcf_pedit(struct sk_buff *skb, const struct tc_action *a, -struct tcf_result *res) +static int tcf_pedit_act(struct sk_buff *skb, const struct tc_action *a, +struct tcf_result *res) { struct tcf_pedit *p = to_pedit(a); int i; @@ -471,7 +471,7 @@ static struct tc_action_ops act_pedit_ops = { .kind = "pedit", .type = TCA_ACT_PEDIT, .owner = THIS_MODULE, - .act= tcf_pedit, + .act= tcf_pedit_act, .dump = tcf_pedit_dump, .cleanup= tcf_pedit_cleanup, .init = tcf_pedit_init, -- 2.11.0
[PATCH net-next 10/13] net: sched: act_skbedit method rename for grep-ability and consistency
From: Jamal Hadi Salim Signed-off-by: Jamal Hadi Salim --- net/sched/act_skbedit.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/net/sched/act_skbedit.c b/net/sched/act_skbedit.c index a6db47ebec11..926d7bc4a89d 100644 --- a/net/sched/act_skbedit.c +++ b/net/sched/act_skbedit.c @@ -33,8 +33,8 @@ static unsigned int skbedit_net_id; static struct tc_action_ops act_skbedit_ops; -static int tcf_skbedit(struct sk_buff *skb, const struct tc_action *a, - struct tcf_result *res) +static int tcf_skbedit_act(struct sk_buff *skb, const struct tc_action *a, + struct tcf_result *res) { struct tcf_skbedit *d = to_skbedit(a); struct tcf_skbedit_params *params; @@ -310,7 +310,7 @@ static struct tc_action_ops act_skbedit_ops = { .kind = "skbedit", .type = TCA_ACT_SKBEDIT, .owner = THIS_MODULE, - .act= tcf_skbedit, + .act= tcf_skbedit_act, .dump = tcf_skbedit_dump, .init = tcf_skbedit_init, .cleanup= tcf_skbedit_cleanup, -- 2.11.0
[PATCH net-next 00/13] net: sched: actions rename for grep-ability and consistency
From: Jamal Hadi Salim Having a structure (example tcf_mirred) and a function with the same name is not good for readability or grepability. This long overdue patchset improves it and make sure there is consistency across all actions Jamal Hadi Salim (13): net: sched: act_connmark method rename for grep-ability and consistency net: sched: act_bpf method rename for grep-ability and consistency net: sched: act_sum method rename for grep-ability and consistency net: sched: act_gact method rename for grep-ability and consistency net: sched: act_ipt method rename for grep-ability and consistency net: sched: act_nat method rename for grep-ability and consistency net: sched: act_pedit method rename for grep-ability and consistency net: sched: act_police method rename for grep-ability and consistency net: sched: act_simple method rename for grep-ability and consistency net: sched: act_skbedit method rename for grep-ability and consistency net: sched: act_skbmod method rename for grep-ability and consistency net: sched: act_vlan method rename for grep-ability and consistency net: sched: act_mirred method rename for grep-ability and consistency net/sched/act_bpf.c | 6 +++--- net/sched/act_connmark.c | 6 +++--- net/sched/act_csum.c | 6 +++--- net/sched/act_gact.c | 6 +++--- net/sched/act_ipt.c | 8 net/sched/act_mirred.c | 6 +++--- net/sched/act_nat.c | 6 +++--- net/sched/act_pedit.c| 6 +++--- net/sched/act_police.c | 16 net/sched/act_simple.c | 6 +++--- net/sched/act_skbedit.c | 6 +++--- net/sched/act_skbmod.c | 4 ++-- net/sched/act_vlan.c | 6 +++--- 13 files changed, 44 insertions(+), 44 deletions(-) -- 2.11.0
[PATCH net-next 03/13] net: sched: act_sum method rename for grep-ability and consistency
From: Jamal Hadi Salim Signed-off-by: Jamal Hadi Salim --- net/sched/act_csum.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/net/sched/act_csum.c b/net/sched/act_csum.c index f01c59ba6d12..5596fae4e478 100644 --- a/net/sched/act_csum.c +++ b/net/sched/act_csum.c @@ -555,8 +555,8 @@ static int tcf_csum_ipv6(struct sk_buff *skb, u32 update_flags) return 0; } -static int tcf_csum(struct sk_buff *skb, const struct tc_action *a, - struct tcf_result *res) +static int tcf_csum_act(struct sk_buff *skb, const struct tc_action *a, + struct tcf_result *res) { struct tcf_csum *p = to_tcf_csum(a); struct tcf_csum_params *params; @@ -670,7 +670,7 @@ static struct tc_action_ops act_csum_ops = { .kind = "csum", .type = TCA_ACT_CSUM, .owner = THIS_MODULE, - .act= tcf_csum, + .act= tcf_csum_act, .dump = tcf_csum_dump, .init = tcf_csum_init, .cleanup= tcf_csum_cleanup, -- 2.11.0
[PATCH net-next 01/13] net: sched: act_connmark method rename for grep-ability and consistency
From: Jamal Hadi Salim Signed-off-by: Jamal Hadi Salim --- net/sched/act_connmark.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/net/sched/act_connmark.c b/net/sched/act_connmark.c index 2f9bc833d046..54c0bf54f2ac 100644 --- a/net/sched/act_connmark.c +++ b/net/sched/act_connmark.c @@ -31,8 +31,8 @@ static unsigned int connmark_net_id; static struct tc_action_ops act_connmark_ops; -static int tcf_connmark(struct sk_buff *skb, const struct tc_action *a, - struct tcf_result *res) +static int tcf_connmark_act(struct sk_buff *skb, const struct tc_action *a, + struct tcf_result *res) { const struct nf_conntrack_tuple_hash *thash; struct nf_conntrack_tuple tuple; @@ -209,7 +209,7 @@ static struct tc_action_ops act_connmark_ops = { .kind = "connmark", .type = TCA_ACT_CONNMARK, .owner = THIS_MODULE, - .act= tcf_connmark, + .act= tcf_connmark_act, .dump = tcf_connmark_dump, .init = tcf_connmark_init, .walk = tcf_connmark_walker, -- 2.11.0
[PATCH net-next 13/13] net: sched: act_mirred method rename for grep-ability and consistency
From: Jamal Hadi Salim Signed-off-by: Jamal Hadi Salim --- net/sched/act_mirred.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c index 327be257033d..8ec216001077 100644 --- a/net/sched/act_mirred.c +++ b/net/sched/act_mirred.c @@ -190,8 +190,8 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla, return ret; } -static int tcf_mirred(struct sk_buff *skb, const struct tc_action *a, - struct tcf_result *res) +static int tcf_mirred_act(struct sk_buff *skb, const struct tc_action *a, + struct tcf_result *res) { struct tcf_mirred *m = to_mirred(a); struct sk_buff *skb2 = skb; @@ -406,7 +406,7 @@ static struct tc_action_ops act_mirred_ops = { .kind = "mirred", .type = TCA_ACT_MIRRED, .owner = THIS_MODULE, - .act= tcf_mirred, + .act= tcf_mirred_act, .stats_update = tcf_stats_update, .dump = tcf_mirred_dump, .cleanup= tcf_mirred_release, -- 2.11.0
[PATCH net-next 06/13] net: sched: act_nat method rename for grep-ability and consistency
From: Jamal Hadi Salim Signed-off-by: Jamal Hadi Salim --- net/sched/act_nat.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/net/sched/act_nat.c b/net/sched/act_nat.c index 4dd9188a72fd..822e903bfc25 100644 --- a/net/sched/act_nat.c +++ b/net/sched/act_nat.c @@ -93,8 +93,8 @@ static int tcf_nat_init(struct net *net, struct nlattr *nla, struct nlattr *est, return ret; } -static int tcf_nat(struct sk_buff *skb, const struct tc_action *a, - struct tcf_result *res) +static int tcf_nat_act(struct sk_buff *skb, const struct tc_action *a, + struct tcf_result *res) { struct tcf_nat *p = to_tcf_nat(a); struct iphdr *iph; @@ -311,7 +311,7 @@ static struct tc_action_ops act_nat_ops = { .kind = "nat", .type = TCA_ACT_NAT, .owner = THIS_MODULE, - .act= tcf_nat, + .act= tcf_nat_act, .dump = tcf_nat_dump, .init = tcf_nat_init, .walk = tcf_nat_walker, -- 2.11.0
[PATCH net-next 12/13] net: sched: act_vlan method rename for grep-ability and consistency
From: Jamal Hadi Salim Signed-off-by: Jamal Hadi Salim --- net/sched/act_vlan.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/net/sched/act_vlan.c b/net/sched/act_vlan.c index 5bde17fe3608..d1f5028384c9 100644 --- a/net/sched/act_vlan.c +++ b/net/sched/act_vlan.c @@ -22,8 +22,8 @@ static unsigned int vlan_net_id; static struct tc_action_ops act_vlan_ops; -static int tcf_vlan(struct sk_buff *skb, const struct tc_action *a, - struct tcf_result *res) +static int tcf_vlan_act(struct sk_buff *skb, const struct tc_action *a, + struct tcf_result *res) { struct tcf_vlan *v = to_vlan(a); struct tcf_vlan_params *p; @@ -307,7 +307,7 @@ static struct tc_action_ops act_vlan_ops = { .kind = "vlan", .type = TCA_ACT_VLAN, .owner = THIS_MODULE, - .act= tcf_vlan, + .act= tcf_vlan_act, .dump = tcf_vlan_dump, .init = tcf_vlan_init, .cleanup= tcf_vlan_cleanup, -- 2.11.0
[PATCH net-next 09/13] net: sched: act_simple method rename for grep-ability and consistency
From: Jamal Hadi Salim Signed-off-by: Jamal Hadi Salim --- net/sched/act_simple.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/net/sched/act_simple.c b/net/sched/act_simple.c index 18e4452574cd..e616523ba3c1 100644 --- a/net/sched/act_simple.c +++ b/net/sched/act_simple.c @@ -28,8 +28,8 @@ static unsigned int simp_net_id; static struct tc_action_ops act_simp_ops; #define SIMP_MAX_DATA 32 -static int tcf_simp(struct sk_buff *skb, const struct tc_action *a, - struct tcf_result *res) +static int tcf_simp_act(struct sk_buff *skb, const struct tc_action *a, + struct tcf_result *res) { struct tcf_defact *d = to_defact(a); @@ -207,7 +207,7 @@ static struct tc_action_ops act_simp_ops = { .kind = "simple", .type = TCA_ACT_SIMP, .owner = THIS_MODULE, - .act= tcf_simp, + .act= tcf_simp_act, .dump = tcf_simp_dump, .cleanup= tcf_simp_release, .init = tcf_simp_init, -- 2.11.0
[PATCH net-next 02/13] net: sched: act_bpf method rename for grep-ability and consistency
From: Jamal Hadi Salim Signed-off-by: Jamal Hadi Salim --- net/sched/act_bpf.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c index 9e8a33f9fee3..9b30e62805c7 100644 --- a/net/sched/act_bpf.c +++ b/net/sched/act_bpf.c @@ -34,8 +34,8 @@ struct tcf_bpf_cfg { static unsigned int bpf_net_id; static struct tc_action_ops act_bpf_ops; -static int tcf_bpf(struct sk_buff *skb, const struct tc_action *act, - struct tcf_result *res) +static int tcf_bpf_act(struct sk_buff *skb, const struct tc_action *act, + struct tcf_result *res) { bool at_ingress = skb_at_tc_ingress(skb); struct tcf_bpf *prog = to_bpf(act); @@ -406,7 +406,7 @@ static struct tc_action_ops act_bpf_ops __read_mostly = { .kind = "bpf", .type = TCA_ACT_BPF, .owner = THIS_MODULE, - .act= tcf_bpf, + .act= tcf_bpf_act, .dump = tcf_bpf_dump, .cleanup= tcf_bpf_cleanup, .init = tcf_bpf_init, -- 2.11.0
[PATCH net-next 08/13] net: sched: act_police method rename for grep-ability and consistency
From: Jamal Hadi Salim Signed-off-by: Jamal Hadi Salim --- net/sched/act_police.c | 16 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/net/sched/act_police.c b/net/sched/act_police.c index 88c16d80c1cf..06f0742db593 100644 --- a/net/sched/act_police.c +++ b/net/sched/act_police.c @@ -56,7 +56,7 @@ struct tc_police_compat { static unsigned int police_net_id; static struct tc_action_ops act_police_ops; -static int tcf_act_police_walker(struct net *net, struct sk_buff *skb, +static int tcf_police_walker(struct net *net, struct sk_buff *skb, struct netlink_callback *cb, int type, const struct tc_action_ops *ops, struct netlink_ext_ack *extack) @@ -73,7 +73,7 @@ static const struct nla_policy police_policy[TCA_POLICE_MAX + 1] = { [TCA_POLICE_RESULT] = { .type = NLA_U32 }, }; -static int tcf_act_police_init(struct net *net, struct nlattr *nla, +static int tcf_police_init(struct net *net, struct nlattr *nla, struct nlattr *est, struct tc_action **a, int ovr, int bind, bool rtnl_held, struct netlink_ext_ack *extack) @@ -203,7 +203,7 @@ static int tcf_act_police_init(struct net *net, struct nlattr *nla, return err; } -static int tcf_act_police(struct sk_buff *skb, const struct tc_action *a, +static int tcf_police_act(struct sk_buff *skb, const struct tc_action *a, struct tcf_result *res) { struct tcf_police *police = to_police(a); @@ -267,7 +267,7 @@ static int tcf_act_police(struct sk_buff *skb, const struct tc_action *a, return police->tcf_action; } -static int tcf_act_police_dump(struct sk_buff *skb, struct tc_action *a, +static int tcf_police_dump(struct sk_buff *skb, struct tc_action *a, int bind, int ref) { unsigned char *b = skb_tail_pointer(skb); @@ -335,10 +335,10 @@ static struct tc_action_ops act_police_ops = { .kind = "police", .type = TCA_ID_POLICE, .owner = THIS_MODULE, - .act= tcf_act_police, - .dump = tcf_act_police_dump, - .init = tcf_act_police_init, - .walk = tcf_act_police_walker, + .act= tcf_police_act, + .dump = tcf_police_dump, + .init = tcf_police_init, + .walk = tcf_police_walker, .lookup = tcf_police_search, .delete = tcf_police_delete, .size = sizeof(struct tcf_police), -- 2.11.0
[PATCH net-next 05/13] net: sched: act_ipt method rename for grep-ability and consistency
From: Jamal Hadi Salim Signed-off-by: Jamal Hadi Salim --- net/sched/act_ipt.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/net/sched/act_ipt.c b/net/sched/act_ipt.c index e149f0e66cb6..51f235bbeb5b 100644 --- a/net/sched/act_ipt.c +++ b/net/sched/act_ipt.c @@ -222,8 +222,8 @@ static int tcf_xt_init(struct net *net, struct nlattr *nla, bind); } -static int tcf_ipt(struct sk_buff *skb, const struct tc_action *a, - struct tcf_result *res) +static int tcf_ipt_act(struct sk_buff *skb, const struct tc_action *a, + struct tcf_result *res) { int ret = 0, result = 0; struct tcf_ipt *ipt = to_ipt(a); @@ -348,7 +348,7 @@ static struct tc_action_ops act_ipt_ops = { .kind = "ipt", .type = TCA_ACT_IPT, .owner = THIS_MODULE, - .act= tcf_ipt, + .act= tcf_ipt_act, .dump = tcf_ipt_dump, .cleanup= tcf_ipt_release, .init = tcf_ipt_init, @@ -406,7 +406,7 @@ static struct tc_action_ops act_xt_ops = { .kind = "xt", .type = TCA_ACT_XT, .owner = THIS_MODULE, - .act= tcf_ipt, + .act= tcf_ipt_act, .dump = tcf_ipt_dump, .cleanup= tcf_ipt_release, .init = tcf_xt_init, -- 2.11.0
Re: [PATCH net-next v5 1/4] net/sched: user-space can't set unknown tcfa_action values
On 31/07/18 10:40 AM, Paolo Abeni wrote: If we choose to reject unknown opcodes, such user-space configuration will fail. I think that is a good thing. The kernel should not be accepting things it doesnt understand. This is a good opportunity to enforce that. What would happen before this patch is that configurations using such TC_ACT_ value would be successful. This is why I proposed to keep the fixup. Note: Such behavior can only occur if tc(user space) allows you to pass illegitimate values which today can only happen when you have a new user space but older kernel (with "old" starting with your current changes). iow, fixing a policy in a kernel which has no support for TC_ACT_ to translate intent to be TC_ACT_OK/PIPE is problematic (as i was showing earlier). I initially thought the kernel behavior in the above scenario would match exactly TC_ACT_UNSPEC processing, but as you noted with the example in your previous email, TC_ACT_UNSPEC processing is actually a bit different. I worry: I dont think we can get a good default for most use cases. No point in maintaining faulty expectations (because IMO: the user will - eventually - fix their scripts if they dont see expected behavior). cheers, jamal
Re: [PATCH net-next v5 1/4] net/sched: user-space can't set unknown tcfa_action values
On 31/07/18 05:41 AM, Paolo Abeni wrote: Before this patch, the kernel exposed the same behaviour for negative value of 'bar', while, for positive 'bar' values, the overall behaviour was more complex (some classifier always stops with unknown positive action value, others go to lower prio). > Overall, the kernel behavior should be more well-defined now, but yes, there is a change of behavior under some circumstances. What about instead mapping undefined/unknown actions value to TC_ACT_OK (still at initialization time)? > this is what is already done by tcf_action_exec() for faulty opcodes/graphs and by tcf_ipt() and would handle the above example more conistently. I think PIPE maybe more reasonable. You still continue the action graph even on such a mis-configuration. But let me see if i can make a convincing arguement for rejecting at init time: I would be _very suprised_ if someone is depending on a misconfiguration such as this in a deployment because they would get different results than what they are expecting and sooner or later fix it after a lot of debugging and cursing. Your patch helps them notice it sooner. And a rejection even much much sooner. With a rejection you dont get to execute a "fixup" the kernel assumes for you. BTW, I asked this earlier and Jiri said it was addressed in patch 2. I just looked again and i may be missing something basic: Lets say tomorrow in a new kernel we add new TC_ACT_XXX that then gets exposed to uapi - so user space tc is updated. You then use the new tc specifying TC_ACT_XXX policy on kernel with your changes. If i read correctly because TC_ACT_XXX is out of bounds for current kernel(which has your changes) you will fix it to be UNSPEC, no? cheers, jamal
Re: [PATCH net-next v5 1/4] net/sched: user-space can't set unknown tcfa_action values
On 30/07/18 12:41 PM, Paolo Abeni wrote: On Mon, 2018-07-30 at 10:03 -0400, Jamal Hadi Salim wrote: On 30/07/18 08:30 AM, Paolo Abeni wrote: } + if (!tcf_action_valid(a->tcfa_action)) { + NL_SET_ERR_MSG(extack, "invalid action value, using TC_ACT_UNSPEC instead"); + a->tcfa_action = TC_ACT_UNSPEC; + } + return a; I think it would make a lot more sense to just reject the entry than changing it underneath the user to a default value. Least element of suprise. I fear that would break existing (bad) users ?!? This way, such users are notified they are doing something uncorrect, but still continue to work. By "bad users" I think you mean someone setting a policy expecting one behavior but getting a different one? If yes, that policy was already wrong/buggy. As an example, if i configured: match xxx action foo action goo action bar action gah where action goo has a bad opcode If you "fix it" with TC_ACT_UNSPEC then basically the above policy is now equivalent to: match xxx action foo action goo Infact if there was a lower prio rule in the chain then lookup will continue there and produce even stranger results. cheers, jamal The patch can be changed to reject bad actions, if there is agreement, but it would not look as the safest way to me.
Re: [PATCH net-next v5 1/4] net/sched: user-space can't set unknown tcfa_action values
On 30/07/18 08:30 AM, Paolo Abeni wrote: } + if (!tcf_action_valid(a->tcfa_action)) { + NL_SET_ERR_MSG(extack, "invalid action value, using TC_ACT_UNSPEC instead"); + a->tcfa_action = TC_ACT_UNSPEC; + } + return a; I think it would make a lot more sense to just reject the entry than changing it underneath the user to a default value. Least element of suprise. cheers, jamal
Re: [PATCH net-next v3 4/5] net/tc: introduce TC_ACT_REINJECT.
On 25/07/18 01:09 PM, Marcelo Ricardo Leitner wrote: On Wed, Jul 25, 2018 at 09:48:16AM -0700, Cong Wang wrote: On Wed, Jul 25, 2018 at 5:27 AM Jamal Hadi Salim wrote: Those changes were there from the beginning (above patch did not introduce them). IIRC, the reason was to distinguish between policy intended drops and drops because of errors. There must be a limit for "overlimit" to make sense. There is no limit in mirred action's context, probably there is only such a limit in act_police. So, all rest should not touch overlimit. +1 I agree we should at least record drop count(unrelated patch though). we should keep overlimit (for no other reason other than this has been around for at least 15 years). On why "overlimit"? It is just a name for a counter that is useless for most actions (but was still being transfered to user space). It is the closest counter to counting "this failed because of runtime errors" as opposed to "user asked us to drop this". Probably a good alternative is to make a very small stats v3 structure (we have migrated stats structures before) and extend for each action/classifier/qdisc to add its extra counters using XSTATS. Note: If you are _mirroring_ packets - incrementing the drop counter is misleading because the packet is not dropped by the system. i.e the qdisc will not record it as dropped; it should for redirect policy. It is useful to be able to tell them apart when you are collecting analytics just for actions. (if youve worked on a massive amount of machines you'll appreciate being able to debug by looking at counters that reduce ambiguity). cheers, jamal
Re: [PATCH net-next v3 4/5] net/tc: introduce TC_ACT_REINJECT.
On 25/07/18 10:24 AM, Paolo Abeni wrote: On Wed, 2018-07-25 at 08:27 -0400, Jamal Hadi Salim wrote: Those changes were there from the beginning (above patch did not introduce them). IIRC, the reason was to distinguish between policy intended drops and drops because of errors. Double-checking to avoid misinterepration on my side: you are ok with keeping the 'overlimits' increment, right? Yes. cheers, jamal
Re: [PATCH net-next v3 4/5] net/tc: introduce TC_ACT_REINJECT.
On 25/07/18 04:29 AM, Paolo Abeni wrote: On Tue, 2018-07-24 at 13:50 -0700, Cong Wang wrote: [..] I fail to understand why overlimit is increased in your case here. I guess you want to increase 'drops' instead. Hmm, actually the current mirred code increases overlimit too. But I still don't think it makes sense. Yep, I chose to increment 'overlimits' to preserve the current mirred semantic. AFAICS, that was first introduced with: commit 8919bc13e8d92c5b082c5c0321567383a071f5bc Author: Jamal Hadi Salim Date: Mon Aug 15 05:25:40 2011 + net_sched: fix port mirror/redirect stats reporting Likely increasing 'drops' would be "better", but I'm unsure we can change this established behavior without affecting some user. Those changes were there from the beginning (above patch did not introduce them). IIRC, the reason was to distinguish between policy intended drops and drops because of errors. cheers, jamal
Re: [PATCH net-next v3 4/5] net/tc: introduce TC_ACT_REINJECT.
+Cc Shmulik Paolo - please also run the tdc tests (and add anymore if you feel they dont do coverage to your changes) On 24/07/18 04:06 PM, Paolo Abeni wrote: This is similar TC_ACT_REDIRECT, but with a slightly different semantic: - on ingress the mirred skbs are passed to the target device network stack without any additional check not scrubbing. - the rcu-protected stats provided via the tcf_result struct are updated on error conditions. This new tcfa_action value is not exposed to the user-space and can be used only internally by clsact. v1 -> v2: do not touch TC_ACT_REDIRECT code path, introduce a new action type instead v2 -> v3: - rename the new action value TC_ACT_REINJECT, update the helper accordingly - take care of uncloned reinjected packets in XDP generic hook Signed-off-by: Paolo Abeni --- include/net/pkt_cls.h | 3 +++ include/net/sch_generic.h | 19 +++ net/core/dev.c| 6 +- 3 files changed, 27 insertions(+), 1 deletion(-) diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h index 2081e4219f81..36ccfe2a303a 100644 --- a/include/net/pkt_cls.h +++ b/include/net/pkt_cls.h @@ -7,6 +7,9 @@ #include #include +/* TC action not accessible from user space */ +#define TC_ACT_REINJECT(TC_ACT_VALUE_MAX + 1) Lets say in the future we add a new opcode. Will old kernel, new iproute2 (new value) work? Maybe use a negative number below -1. I am honestly still unclear about introducing this new code. Could you not use the result code to carry sufficient info to indicate the intent? cheers, jamal + /* Basic packet classifier frontend definitions. */ struct tcf_walker { diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h index 056dc1083aa3..95e81a70f549 100644 --- a/include/net/sch_generic.h +++ b/include/net/sch_generic.h @@ -235,6 +235,12 @@ struct tcf_result { u32 classid; }; const struct tcf_proto *goto_tp; + + /* used by the TC_ACT_REINJECT action */ + struct { + boolingress; + struct gnet_stats_queue *qstats; + }; }; }; @@ -1091,4 +1097,17 @@ void mini_qdisc_pair_swap(struct mini_Qdisc_pair *miniqp, void mini_qdisc_pair_init(struct mini_Qdisc_pair *miniqp, struct Qdisc *qdisc, struct mini_Qdisc __rcu **p_miniq); +static inline void skb_tc_reinject(struct sk_buff *skb, struct tcf_result *res) +{ + struct gnet_stats_queue *stats = res->qstats; + int ret; + + if (res->ingress) + ret = netif_receive_skb(skb); + else + ret = dev_queue_xmit(skb); + if (ret && stats) + qstats_overlimit_inc(res->qstats); +} + #endif diff --git a/net/core/dev.c b/net/core/dev.c index 14a748ee8cc9..826ec74fe1d9 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -4252,7 +4252,7 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb, /* Reinjected packets coming from act_mirred or similar should * not get XDP generic processing. */ - if (skb_cloned(skb)) + if (skb_cloned(skb) || skb->tc_redirected) return XDP_PASS; /* XDP packets must be linear and must have sufficient headroom @@ -4602,6 +4602,10 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret, __skb_push(skb, skb->mac_len); skb_do_redirect(skb); return NULL; + case TC_ACT_REINJECT: + /* this does not scrub the packet, and updates stats on error */ + skb_tc_reinject(skb, _res); + return NULL; default: break; }
Re: [PATCH net-next v3 3/5] tc/act: remove unneeded RCU lock in action callback
On 24/07/18 04:06 PM, Paolo Abeni wrote: Each lockless action currently does its own RCU locking in ->act(). This is allows using plain RCU accessor, even if the context is really RCU BH. This change drops the per action RCU lock, replace the accessors with _bh variant, cleans up a bit the surronding code and documents the RCU status in the relevant header. No functional nor performance change is intended. The goal of this patch is clarifying that the RCU critical section used by the tc actions extends up to the classifier's caller. This and 2/5 seems to stand on their own merit. cheers, jamal
Re: [PATCH net-next v3 1/5] tc/act: user space can't use TC_ACT_REDIRECT directly
On 24/07/18 04:06 PM, Paolo Abeni wrote: --- a/net/sched/act_api.c +++ b/net/sched/act_api.c @@ -895,6 +895,11 @@ struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp, } } + if (a->tcfa_action == TC_ACT_REDIRECT) { + net_warn_ratelimited("TC_ACT_REDIRECT can't be used directly"); + a->tcfa_action = TC_ACT_UNSPEC; + } + Why not just reject the rule instead of changing the code underneath the user? cheers, jamal
Re: [PATCH net-next v3 5/5] act_mirred: use TC_ACT_REINJECT when possible
On 24/07/18 05:15 PM, Cong Wang wrote: On Tue, Jul 24, 2018 at 1:07 PM Paolo Abeni wrote: + + /* let's the caller reinject the packet, if possible */ + if (skb_at_tc_ingress(skb)) { + res->ingress = want_ingress; + res->qstats = this_cpu_ptr(m->common.cpu_qstats); + return TC_ACT_REINJECT; + } Looks good to me, but here we no longer return user-specified return value here, I am sure it is safe for TC_ACT_STOLEN, but I am not sure if it is safe for other values, like TC_ACT_RECLASSIFY. Jamal, is there any use case of returning !TC_ACT_STOLEN for ingress redirections? I cant think of one off top of my head. There maybe a future use case where it is not so - maybe just allow to return the user programmed action? that value will always be TC_ACT_STOLEN if the rule was specified via iproute2/tc. cheers, jamal
Re: [patch net-next v2 0/9] net: sched: introduce chain templates support with offloading to mlxsw
On 29/06/18 04:39 AM, Jiri Pirko wrote: Fri, Jun 29, 2018 at 12:25:53AM CEST, xiyou.wangc...@gmail.com wrote: On Thu, Jun 28, 2018 at 6:10 AM Jiri Pirko wrote: Add a template of type flower allowing to insert rules matching on last 2 bytes of destination mac address: # tc chaintemplate add dev dummy0 ingress proto ip flower dst_mac 00:00:00:00:00:00/00:00:00:00:FF:FF The template is now showed in the list: # tc chaintemplate show dev dummy0 ingress chaintemplate flower chain 0 dst_mac 00:00:00:00:00:00/00:00:00:00:ff:ff eth_type ipv4 Add another template, this time for chain number 22: # tc chaintemplate add dev dummy0 ingress proto ip chain 22 flower dst_ip 0.0.0.0/16 # tc chaintemplate show dev dummy0 ingress chaintemplate flower chain 0 dst_mac 00:00:00:00:00:00/00:00:00:00:ff:ff eth_type ipv4 chaintemplate flower chain 22 eth_type ipv4 dst_ip 0.0.0.0/16 So, if I want to check the template of a chain, I have to use 'tc chaintemplate... chain X'. If I want to check the filters in a chain, I have to use 'tc filter show chain X'. If you introduce 'tc chain', it would just need one command: `tc chain show ... X` which could list its template first and followed by filters in this chain, something like: # tc chain show dev eth0 chain X template: # could be none filter1 ... filter2 ... Isn't it more elegant? Well, that is just another iproute2 command. It would use the same kernel uapi. Filters+templates. Sure, why not. Can be easily introduced. Let's do it in a follow-up iproute2 patch. Half a dozen or 6 - take your pick, really. I would call the template an attribute as opposed to a stand alone object i.e A chain of filters may have a template. If you have to introduce a new object then Sridhar's suggested syntax seems appealing. cheers, jamal
Re: [patch net-next v2 0/9] net: sched: introduce chain templates support with offloading to mlxsw
On 28/06/18 09:22 AM, Jiri Pirko wrote: Thu, Jun 28, 2018 at 03:13:30PM CEST, j...@mojatatu.com wrote: On 26/06/18 03:59 AM, Jiri Pirko wrote: From: Jiri Pirko For the TC clsact offload these days, some of HW drivers need to hold a magic ball. The reason is, with the first inserted rule inside HW they need to guess what fields will be used for the matching. If later on this guess proves to be wrong and user adds a filter with a different field to match, there's a problem. Mlxsw resolves it now with couple of patterns. Those try to cover as many match fields as possible. This aproach is far from optimal, both performance-wise and scale-wise. Also, there is a combination of filters that in certain order won't succeed. Most of the time, when user inserts filters in chain, he knows right away how the filters are going to look like - what type and option will they have. For example, he knows that he will only insert filters of type flower matching destination IP address. He can specify a template that would cover all the filters in the chain. Is this just restricted to hardware offload? Example it will make sense for u32 in s/ware as well (i.e flexible TCAM like TCAM based classification). i.e it is possible that rules the user enters end up being worst case a linked list lookup, yes? And allocating space for a tuple that is not in use is a waste of space. I'm afraid I don't understand clearly what you say. Well - I was trying to understand what you said ;-> I think what you are getting at is two issues: a) space in the tcams - if the user is just going to enter rules which use one tuple (dst ip for example) the hardware would be better off told that this is the case so it doesnt allocate space in anticipation that someone is going to specify src ip later on. b) lookup speed in tcams - without the template hint a selection of rules may end up looking like a linked list which is not optimal for lookup This is not restricted to hw offload. The templates apply to all filters, no matter if they are offloaded or not. Do you save anything with flower(in s/w) if you only added a template with say dst ip/mask? I can see it will make sense for u32 which is more flexible and protocol independent. cheers, jamal
Re: [patch net-next v2 0/9] net: sched: introduce chain templates support with offloading to mlxsw
On 26/06/18 03:59 AM, Jiri Pirko wrote: From: Jiri Pirko For the TC clsact offload these days, some of HW drivers need to hold a magic ball. The reason is, with the first inserted rule inside HW they need to guess what fields will be used for the matching. If later on this guess proves to be wrong and user adds a filter with a different field to match, there's a problem. Mlxsw resolves it now with couple of patterns. Those try to cover as many match fields as possible. This aproach is far from optimal, both performance-wise and scale-wise. Also, there is a combination of filters that in certain order won't succeed. > Most of the time, when user inserts filters in chain, he knows right away how the filters are going to look like - what type and option will they have. For example, he knows that he will only insert filters of type flower matching destination IP address. He can specify a template that would cover all the filters in the chain. Is this just restricted to hardware offload? Example it will make sense for u32 in s/ware as well (i.e flexible TCAM like TCAM based classification). i.e it is possible that rules the user enters end up being worst case a linked list lookup, yes? And allocating space for a tuple that is not in use is a waste of space. If yes, then I would reword the above as something like: For very flexible classifiers such as TCAM based ones, one could add arbitrary tuple rules which tend to be inefficient both from a space and lookup performance. One approach, taken by Mlxsw, is to assume a multi filter tuple arrangement which is inefficient from a space perspective when the user-specified rules dont make use of pre-provisioned tuple space. Typically users already know what tuples are of interest to them: for example for ipv4 route lookup purposes they may just want to lookup destination IP with a specified mask etc. This feature allows user to provide good hints to the classifier to optimize. This patchset is providing the possibility to user to provide such template to kernel and propagate it all the way down to device drivers. See the examples below. Create dummy device with clsact first: # ip link add type dummy # tc qdisc add dev dummy0 clsact There is no template assigned by default: # tc filter template show dev dummy0 ingress Add a template of type flower allowing to insert rules matching on last 2 bytes of destination mac address: # tc filter template add dev dummy0 ingress proto ip flower dst_mac 00:00:00:00:00:00/00:00:00:00:FF:FF The template is now showed in the list: # tc filter template show dev dummy0 ingress filter flower chain 0 dst_mac 00:00:00:00:00:00/00:00:00:00:ff:ff eth_type ipv4 Add another template, this time for chain number 22: # tc filter template add dev dummy0 ingress proto ip chain 22 flower dst_ip 0.0.0.0/16 # tc filter template show dev dummy0 ingress filter flower chain 0 dst_mac 00:00:00:00:00:00/00:00:00:00:ff:ff eth_type ipv4 filter flower chain 22 eth_type ipv4 dst_ip 0.0.0.0/16 Add a filter that fits the template: # tc filter add dev dummy0 ingress proto ip flower dst_mac aa:bb:cc:dd:ee:ff/00:00:00:00:00:0F action drop Addition of filters that does not fit the template would fail: # tc filter add dev dummy0 ingress proto ip flower dst_mac aa:11:22:33:44:55/00:00:00:FF:00:00 action drop Error: Mask does not fit the template. We have an error talking to the kernel, -1 # tc filter add dev dummy0 ingress proto ip flower dst_ip 10.0.0.1 action drop Error: Mask does not fit the template. We have an error talking to the kernel, -1 Additions of filters to chain 22: # tc filter add dev dummy0 ingress proto ip chain 22 flower dst_ip 10.0.0.1/8 action drop # tc filter add dev dummy0 ingress proto ip chain 22 flower dst_ip 10.0.0.1 action drop Error: Mask does not fit the template. We have an error talking to the kernel, -1 # tc filter add dev dummy0 ingress proto ip chain 22 flower dst_ip 10.0.0.1/24 action drop Error: Mask does not fit the template. We have an error talking to the kernel, -1 Removal of a template from non-empty chain would fail: # tc filter template del dev dummy0 ingress Error: The chain is not empty, unable to delete template. We have an error talking to the kernel, -1 Once the chain is flushed, the template could be removed: # tc filter del dev dummy0 ingress # tc filter template del dev dummy0 ingress BTW: unlike the other comments on this - I think the syntax above is fine ;-> Chain are already either explicitly or are implicitly (case of chain 0) specified. Assuming that one cant add a new template to a chain if it already has at least one filter (even if no template has been added). I like it - it may help making u32 more friendly to humans in some cases. cheers, jamal
Re: [PATCH v2 net-next] net/sched: add skbprio scheduler
On 23/06/18 04:47 PM, Nishanth Devarajan wrote: [..] + /* Drop the packet at the tail of the lowest priority qdisc. */ + lp_qdisc = >qdiscs[lp]; + to_drop = __skb_dequeue_tail(lp_qdisc); + BUG_ON(!to_drop); + qdisc_qstats_backlog_dec(sch, to_drop); + qdisc_drop(to_drop, sch, to_free); + Maybe also increase overlimit stat here? It will keep track of low prio things dropped because you were congested. Such a stat helps when debugging or collecting analytics. Per Alex's comment, how about: --- Skbprio (SKB Priority Queue) is a queueing discipline that prioritizes packets according to their skb->priority field. Under congestion, already-enqueued lower priority packets will be dropped to make space available for higher priority packets. Skbprio was conceived as a solution for denial-of-service defenses that need to route packets with different priorities as a means to overcome DoS attacks as described in paper ... cheers, jamal
Re: [PATCH v4 net-next] net:sched: add action inheritdsfield to skbedit
On 19/06/18 08:39 AM, Michel Machado wrote: Notice that, different from skbmod, there's no field parm->flags in skbedit. Skbedit infers the flags in d->flags from the presence of the parameters of each of its actions. But SKBEDIT_F_INHERITDSFIELD has no parameter and adding field parm->flags breaks backward compatibility with user space as pointed out by Marcelo Ricardo Leitner. Our solution was to add TCA_SKBEDIT_FLAGS, so SKBEDIT_F_INHERITDSFIELD and future flag-only actions can be added. Ok, that makes sense - thanks. I am not so sure about using 64 bits (32 bits would have been fine to match the size of the kernel flags), but other than that LGTM. Acked-by: Jamal Hadi Salim cheers, jamal
Re: [PATCH v4 net-next] net:sched: add action inheritdsfield to skbedit
Hi Qiaobin, Per my previous comments, why do we need the TCA_SKBEDIT_FLAGS TLV? Isnt SKBEDIT_F_INHERITDSFIELD sufficient? i.e in tcf_skbedit_init() check for d->flags_F_INHERITDSFIELD then set skb->priority and flags|=SKBEDIT_F_INHERITDSFIELD Side note: Infact the whole flags setting in the init function seems to be a redundant given all the TLVs have d->flags also set by user space. But thats a different patch. cheers, jamal On 12/06/18 11:42 AM, Fu, Qiaobin wrote: The new action inheritdsfield copies the field DS of IPv4 and IPv6 packets into skb->priority. This enables later classification of packets based on the DS field. v4: *Not allow setting flags other than the expected ones. *Allow dumping the pure flags. Original idea by Jamal Hadi Salim Signed-off-by: Qiaobin Fu Reviewed-by: Michel Machado --- Note that the motivation for this patch is found in the following discussion: https://www.spinics.net/lists/netdev/msg501061.html --- diff --git a/include/uapi/linux/tc_act/tc_skbedit.h b/include/uapi/linux/tc_act/tc_skbedit.h index fbcfe27a4e6c..6de6071ebed6 100644 --- a/include/uapi/linux/tc_act/tc_skbedit.h +++ b/include/uapi/linux/tc_act/tc_skbedit.h @@ -30,6 +30,7 @@ #define SKBEDIT_F_MARK0x4 #define SKBEDIT_F_PTYPE 0x8 #define SKBEDIT_F_MASK0x10 +#define SKBEDIT_F_INHERITDSFIELD 0x20 struct tc_skbedit { tc_gen; @@ -45,6 +46,7 @@ enum { TCA_SKBEDIT_PAD, TCA_SKBEDIT_PTYPE, TCA_SKBEDIT_MASK, + TCA_SKBEDIT_FLAGS, __TCA_SKBEDIT_MAX }; #define TCA_SKBEDIT_MAX (__TCA_SKBEDIT_MAX - 1) diff --git a/net/sched/act_skbedit.c b/net/sched/act_skbedit.c index 6138d1d71900..9adbcfa3f5fe 100644 --- a/net/sched/act_skbedit.c +++ b/net/sched/act_skbedit.c @@ -23,6 +23,9 @@ #include #include #include +#include +#include +#include #include #include @@ -41,6 +44,25 @@ static int tcf_skbedit(struct sk_buff *skb, const struct tc_action *a, if (d->flags & SKBEDIT_F_PRIORITY) skb->priority = d->priority; + if (d->flags & SKBEDIT_F_INHERITDSFIELD) { + int wlen = skb_network_offset(skb); + + switch (tc_skb_protocol(skb)) { + case htons(ETH_P_IP): + wlen += sizeof(struct iphdr); + if (!pskb_may_pull(skb, wlen)) + goto err; + skb->priority = ipv4_get_dsfield(ip_hdr(skb)) >> 2; + break; + + case htons(ETH_P_IPV6): + wlen += sizeof(struct ipv6hdr); + if (!pskb_may_pull(skb, wlen)) + goto err; + skb->priority = ipv6_get_dsfield(ipv6_hdr(skb)) >> 2; + break; + } + } if (d->flags & SKBEDIT_F_QUEUE_MAPPING && skb->dev->real_num_tx_queues > d->queue_mapping) skb_set_queue_mapping(skb, d->queue_mapping); @@ -53,6 +75,10 @@ static int tcf_skbedit(struct sk_buff *skb, const struct tc_action *a, spin_unlock(>tcf_lock); return d->tcf_action; + +err: + spin_unlock(>tcf_lock); + return TC_ACT_SHOT; } static const struct nla_policy skbedit_policy[TCA_SKBEDIT_MAX + 1] = { @@ -62,6 +88,7 @@ static const struct nla_policy skbedit_policy[TCA_SKBEDIT_MAX + 1] = { [TCA_SKBEDIT_MARK] = { .len = sizeof(u32) }, [TCA_SKBEDIT_PTYPE] = { .len = sizeof(u16) }, [TCA_SKBEDIT_MASK] = { .len = sizeof(u32) }, + [TCA_SKBEDIT_FLAGS] = { .len = sizeof(u64) }, }; static int tcf_skbedit_init(struct net *net, struct nlattr *nla, @@ -73,6 +100,7 @@ static int tcf_skbedit_init(struct net *net, struct nlattr *nla, struct tc_skbedit *parm; struct tcf_skbedit *d; u32 flags = 0, *priority = NULL, *mark = NULL, *mask = NULL; + u64 *pure_flags = NULL; u16 *queue_mapping = NULL, *ptype = NULL; bool exists = false; int ret = 0, err; @@ -114,6 +142,12 @@ static int tcf_skbedit_init(struct net *net, struct nlattr *nla, mask = nla_data(tb[TCA_SKBEDIT_MASK]); } + if (tb[TCA_SKBEDIT_FLAGS] != NULL) { + pure_flags = nla_data(tb[TCA_SKBEDIT_FLAGS]); + if (*pure_flags & SKBEDIT_F_INHERITDSFIELD) + flags |= SKBEDIT_F_INHERITDSFIELD; + } + parm = nla_data(tb[TCA_SKBEDIT_PARMS]); exists = tcf_idr_check(tn, parm->index, a, bind); @@ -178,6 +212,7 @@ static int tcf_skbedit_dump(struct sk_buff *skb, struct tc_action *a, .action = d->tcf_action, }; struct tcf_t t; + u64 pure_flags = 0; if (nla_put(skb, TCA_SKBEDIT_PARMS, size
Re: [PATCH net-next v4 00/11] Modify action API for implementing lockless actions
On 31/05/18 08:38 AM, Vlad Buslov wrote: Hi Jamal, On current net-next I still have action with single reference after last step: ~$ sudo $TC -s actions ls action skbedit total acts 1 action order 0: skbedit mark 1 pipe index 1 ref 2 bind 1 installed 47 sec used 47 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 ~$ sudo $TC filter del dev lo parent : protocol ip prio 1 \ u32 match ip dst 127.0.0.1/32 flowid 1:1 ~$ sudo $TC -s actions ls action skbedit total acts 1 action order 0: skbedit mark 1 pipe index 1 ref 1 bind 0 installed 80 sec used 80 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 Which branch are you testing on? You are correct - this is how it works now (I was thinking of a very old version before Cong made some changes a while back). Just vet this continues to work as above. cheers, jamal
Re: [PATCH net-next v4 00/11] Modify action API for implementing lockless actions
Hi Vlad, Can you try one simple test below with these patches? #create an action sudo $TC actions add action skbedit mark 1 pipe # sudo $TC qdisc del dev lo parent : sudo $TC qdisc add dev lo ingress # bind action to filter sudo $TC filter add dev lo parent : protocol ip prio 1 \ u32 match ip dst 127.0.0.1/32 flowid 1:1 action skbedit index 1 #now delete that action multiple times while it is still bound sudo $TC actions del action skbedit index 1 sudo $TC actions del action skbedit index 1 sudo $TC actions del action skbedit index 1 #check the refcount and bindcount sudo $TC -s actions ls action skbedit #delete the filter (which should remove the bindcnt) sudo $TC filter del dev lo parent : protocol ip prio 1 \ u32 match ip dst 127.0.0.1/32 flowid 1:1 #check the refcount and bindcount sudo $TC -s actions ls action skbedit Current behavior: i believe the action is gone in this last step. Your patches may change behavior so that the action action is still around. I dont think this is a big deal, but just wanted to be sure it is not something more unexpected. cheers, jamal
Re: [PATCH] net: sched: split tc_ctl_tfilter into three handlers
On 28/05/18 12:02 PM, Jamal Hadi Salim wrote: On 27/05/18 03:55 PM, Vlad Buslov wrote: tc_ctl_tfilter handles three netlink message types: RTM_NEWTFILTER, RTM_DELTFILTER, RTM_GETTFILTER. However, implementation of this function involves a lot of branching on specific message type because most of the code is message-specific. This significantly complicates adding new functionality and doesn't provide much benefit of code reuse. Split tc_ctl_tfilter to three standalone functions that handle filter new, delete and get requests. The only truly protocol independent part of tc_ctl_tfilter is code that looks up queue, class, and block. Refactor this code to standalone tcf_block_find function that is used by all three new handlers. Signed-off-by: Vlad Buslov <vla...@mellanox.com> FWIW, I like this separation - makes things more maintainable and readable (we should do it in act_api as well). Sorry - meant to point out that this belongs to net-next not net. cheers, jamal