Re: [PATCH bpf-next 4/7] nfp: bpf: copy range info for all operands of all ALU operations
On Sun, Jun 24, 2018 at 8:54 PM, Jakub Kicinski wrote: > From: Jiong Wang > > NFP verifier hook is coping range information of the shift amount for > indirect shift operation so optimized shift sequences could be generated. > > We want to use range info to do more things. For example, to decide whether > multiplication and divide are supported on the given range. > > This patch simply let NFP verifier hook to copy range info for all operands > of all ALU operands. > > Signed-off-by: Jiong Wang > Reviewed-by: Jakub Kicinski Acked-by: Song Liu > --- > drivers/net/ethernet/netronome/nfp/bpf/main.h | 33 +++ > .../net/ethernet/netronome/nfp/bpf/offload.c | 4 ++- > .../net/ethernet/netronome/nfp/bpf/verifier.c | 6 +++- > 3 files changed, 20 insertions(+), 23 deletions(-) > > diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.h > b/drivers/net/ethernet/netronome/nfp/bpf/main.h > index 5975a19c28cb..c985d0ac61a3 100644 > --- a/drivers/net/ethernet/netronome/nfp/bpf/main.h > +++ b/drivers/net/ethernet/netronome/nfp/bpf/main.h > @@ -265,6 +265,8 @@ struct nfp_bpf_reg_state { > * @arg2: arg2 for call instructions > * @umin_src: copy of core verifier umin_value for src opearnd. > * @umax_src: copy of core verifier umax_value for src operand. > + * @umin_dst: copy of core verifier umin_value for dst opearnd. > + * @umax_dst: copy of core verifier umax_value for dst operand. > * @off: index of first generated machine instruction (in nfp_prog.prog) > * @n: eBPF instruction number > * @flags: eBPF instruction extra optimization flags > @@ -300,12 +302,15 @@ struct nfp_insn_meta { > struct bpf_reg_state arg1; > struct nfp_bpf_reg_state arg2; > }; > - /* We are interested in range info for some operands, > -* for example, the shift amount which is kept in src operand. > + /* We are interested in range info for operands of ALU > +* operations. For example, shift amount, multiplicand and > +* multiplier etc. 
> */ > struct { > u64 umin_src; > u64 umax_src; > + u64 umin_dst; > + u64 umax_dst; > }; > }; > unsigned int off; > @@ -339,6 +344,11 @@ static inline u8 mbpf_mode(const struct nfp_insn_meta > *meta) > return BPF_MODE(meta->insn.code); > } > > +static inline bool is_mbpf_alu(const struct nfp_insn_meta *meta) > +{ > + return mbpf_class(meta) == BPF_ALU64 || mbpf_class(meta) == BPF_ALU; > +} > + > static inline bool is_mbpf_load(const struct nfp_insn_meta *meta) > { > return (meta->insn.code & ~BPF_SIZE_MASK) == (BPF_LDX | BPF_MEM); > @@ -384,25 +394,6 @@ static inline bool is_mbpf_xadd(const struct > nfp_insn_meta *meta) > return (meta->insn.code & ~BPF_SIZE_MASK) == (BPF_STX | BPF_XADD); > } > > -static inline bool is_mbpf_indir_shift(const struct nfp_insn_meta *meta) > -{ > - u8 code = meta->insn.code; > - bool is_alu, is_shift; > - u8 opclass, opcode; > - > - opclass = BPF_CLASS(code); > - is_alu = opclass == BPF_ALU64 || opclass == BPF_ALU; > - if (!is_alu) > - return false; > - > - opcode = BPF_OP(code); > - is_shift = opcode == BPF_LSH || opcode == BPF_RSH || opcode == > BPF_ARSH; > - if (!is_shift) > - return false; > - > - return BPF_SRC(code) == BPF_X; > -} > - > /** > * struct nfp_prog - nfp BPF program > * @bpf: backpointer to the bpf app priv structure > diff --git a/drivers/net/ethernet/netronome/nfp/bpf/offload.c > b/drivers/net/ethernet/netronome/nfp/bpf/offload.c > index 856a0003bb75..78f44c4d95b4 100644 > --- a/drivers/net/ethernet/netronome/nfp/bpf/offload.c > +++ b/drivers/net/ethernet/netronome/nfp/bpf/offload.c > @@ -190,8 +190,10 @@ nfp_prog_prepare(struct nfp_prog *nfp_prog, const struct > bpf_insn *prog, > > meta->insn = prog[i]; > meta->n = i; > - if (is_mbpf_indir_shift(meta)) > + if (is_mbpf_alu(meta)) { > meta->umin_src = U64_MAX; > + meta->umin_dst = U64_MAX; > + } > > list_add_tail(&meta->l, &nfp_prog->insns); > } > diff --git a/drivers/net/ethernet/netronome/nfp/bpf/verifier.c > b/drivers/net/ethernet/netronome/nfp/bpf/verifier.c > 
index e862b739441f..7bd9666bd8ff 100644 > --- a/drivers/net/ethernet/netronome/nfp/bpf/verifier.c > +++ b/drivers/net/ethernet/netronome/nfp/bpf/verifier.c > @@ -551,12 +551,16 @@ nfp_verify_insn(struct bpf_verifier_env *env, int > insn_idx, int prev_insn_idx) > if (is_mbpf_xadd(meta)) > return nfp_bpf_check_xadd(nfp_prog, meta, env); > > - if (is_mbpf_indir_shift(meta)) { > + if (is_mbpf_alu(meta)) { >
Re: [patch net-next 0/9] net: sched: introduce chain templates support with offloading to mlxsw
Tue, Jun 26, 2018 at 06:58:50AM CEST, jakub.kicin...@netronome.com wrote:
>On Mon, 25 Jun 2018 23:01:39 +0200, Jiri Pirko wrote:
>> From: Jiri Pirko
>>
>> For the TC clsact offload these days, some HW drivers need to hold a
>> magic ball. The reason is, with the first rule inserted into the HW
>> they need to guess what fields will be used for the matching. If
>> later on this guess proves to be wrong and the user adds a filter
>> with a different field to match, there's a problem. Mlxsw resolves it
>> now with a couple of patterns. Those try to cover as many match
>> fields as possible. This approach is far from optimal, both
>> performance-wise and scale-wise. Also, there is a combination of
>> filters that in a certain order won't succeed.
>>
>> Most of the time, when the user inserts filters into a chain, he
>> knows right away what the filters are going to look like - what type
>> and options they will have. For example, he knows that he will only
>> insert filters of type flower matching destination IP address. He can
>> specify a template that would cover all the filters in the chain.
>
>Perhaps it's lack of sleep, but this paragraph threw me a little off
>the track. IIUC the goal of this set is to provide a way to inform the
>HW about expected matches before any rule is programmed into the HW.
>Not before any rule is added to a particular chain. One can just use
>the first rule in the chain to make a guess about the chain, but thanks
>to this set the user can configure *all* chains before any rules are
>added.

The template is per-chain. The user can use a template for chain x and
not use it for chain y. It's up to him.

>
>And that's needed because once any rule is added the tcam config can no
>longer be easily modified?

Yes.
Re: [patch net-next 6/9] net: sched: cls_flower: propagate chain template creation and destruction to drivers
Tue, Jun 26, 2018 at 07:00:50AM CEST, jakub.kicin...@netronome.com wrote: >On Mon, 25 Jun 2018 23:01:45 +0200, Jiri Pirko wrote: >> From: Jiri Pirko >> >> Introduce a couple of flower offload commands in order to propagate >> template creation/destruction events down to device drivers. >> Drivers may use this information to prepare HW in an optimal way >> for future filter insertions. >> >> Signed-off-by: Jiri Pirko > >> diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c >> index d64d43843a3a..276ba25a09c3 100644 >> --- a/net/sched/cls_flower.c >> +++ b/net/sched/cls_flower.c >> @@ -1120,6 +1120,43 @@ static void fl_walk(struct tcf_proto *tp, struct >> tcf_walker *arg) >> } >> } >> >> +static void fl_hw_create_tmplt(struct tcf_chain *chain, >> + struct fl_flow_tmplt *tmplt, >> + struct netlink_ext_ack *extack) >> +{ >> +struct tc_cls_flower_offload cls_flower = {}; >> +struct tcf_block *block = chain->block; >> +struct tcf_exts dummy_exts = { 0, }; >> + >> +cls_flower.common.chain_index = chain->index; > >Did you skip extack on purpose? Oh, the extack is leftover. I will remove it in v2. > >> +cls_flower.command = TC_CLSFLOWER_TMPLT_CREATE; >> +cls_flower.cookie = (unsigned long) tmplt; >> +cls_flower.dissector = &tmplt->dissector; >> +cls_flower.mask = &tmplt->mask; >> +cls_flower.key = &tmplt->dummy_key; >> +cls_flower.exts = &dummy_exts; >> + >> +/* We don't care if driver (any of them) fails to handle this >> + * call. It serves just as a hint for it. >> + */ >> +tc_setup_cb_call(block, NULL, TC_SETUP_CLSFLOWER, >> + &cls_flower, false); >> +}
Re: [PATCH bpf-next 2/7] lib: reciprocal_div: implement the improved algorithm on the paper mentioned
On Sun, Jun 24, 2018 at 8:54 PM, Jakub Kicinski wrote: > From: Jiong Wang > > The new added "reciprocal_value_adv" implements the advanced version of the > algorithm described in Figure 4.2 of the paper except when dividend has MSB > set which would require u128 divide on host and actually could be easily > handled before calling the new "reciprocal_value_adv". > > The advanced version requires more complex calculation to get the > reciprocal multiplier and other control variables, but then could reduce > the required emulation operations. > > It makes no sense to use this advanced version for host divide emulation, > those extra complexities for calculating multiplier etc could completely > waive our saving on emulation operations. > > However, it makes sense to use it for JIT divide code generation (for > example eBPF JIT backends) for which we are willing to trade performance of > JITed code with that of host. As shown by the following pseudo code, the > required emulation operations could go down from 6 (the basic version) to 3 > or 4. > > To use the result of "reciprocal_value_adv", suppose we want to calculate > n/d, the C-style pseudo code will be the following, it could be easily > changed to real code generation for other JIT targets. > > struct reciprocal_value_adv rvalue; > u8 pre_shift, exp; > > if (d >= (1u << 31)) { > result = n >= d; > return; > } > rvalue = reciprocal_value_adv(d, 32) > exp = rvalue.exp; > if (rvalue.is_wide_m && !(d & 1)) { > pre_shift = fls(d & -d) - 1; > rvalue = reciprocal_value_adv(d >> pre_shift, 32 - pre_shift); > } else { > pre_shift = 0; > } > > // code generation starts. > if (imm == 1 << exp) { > result = n >> exp; > } else if (rvalue.is_wide_m) { > // pre_shift must be zero when reached here. 
> t = (n * rvalue.m) >> 32; > result = n - t; > result >>= 1; > result += t; > result >>= rvalue.sh - 1; > } else { > if (pre_shift) > result = n >> pre_shift; > result = ((u64)result * rvalue.m) >> 32; > result >>= rvalue.sh; > } > > Signed-off-by: Jiong Wang > Reviewed-by: Jakub Kicinski > --- > include/linux/reciprocal_div.h | 65 ++ > lib/reciprocal_div.c | 37 +++ > 2 files changed, 102 insertions(+) > > diff --git a/include/linux/reciprocal_div.h b/include/linux/reciprocal_div.h > index e031e9f2f9d8..5a695e4697d3 100644 > --- a/include/linux/reciprocal_div.h > +++ b/include/linux/reciprocal_div.h > @@ -25,6 +25,9 @@ struct reciprocal_value { > u8 sh1, sh2; > }; > > +/* "reciprocal_value" and "reciprocal_divide" together implement the basic > + * version of the algorithm described in Figure 4.1 of the paper. > + */ > struct reciprocal_value reciprocal_value(u32 d); > > static inline u32 reciprocal_divide(u32 a, struct reciprocal_value R) > @@ -33,4 +36,66 @@ static inline u32 reciprocal_divide(u32 a, struct > reciprocal_value R) > return (t + ((a - t) >> R.sh1)) >> R.sh2; > } > > +struct reciprocal_value_adv { > + u32 m; > + u8 sh, exp; > + bool is_wide_m; > +}; > + > +/* "reciprocal_value_adv" implements the advanced version of the algorithm > + * described in Figure 4.2 of the paper except when dividend has MSB set > which > + * would require u128 divide on host and actually could be easily handled > before > + * calling "reciprocal_value_adv". > + * > + * The advanced version requires more complex calculation to get the > reciprocal > + * multiplier and other control variables, but then could reduce the required > + * emulation operations. > + * > + * It makes no sense to use this advanced version for host divide emulation, > + * those extra complexities for calculating multiplier etc could completely > + * waive our saving on emulation operations. 
> + * > + * However, it makes sense to use it for JIT divide code generation for which > + * we are willing to trade performance of JITed code with that of host. As > shown > + * by the following pseudo code, the required emulation operations could go > down > + * from 6 (the basic version) to 3 or 4. > + * > + * To use the result of "reciprocal_value_adv", suppose we want to calculate > + * n/d: > + * > + * struct reciprocal_value_adv rvalue; > + * u8 pre_shift, exp; > + * > + * if (d >= (1u << 31)) { > + * result = n >= d; > + * return; > + * } > + * rvalue = reciprocal_value_adv(d, 32) > + * exp = rvalue.exp; > + * if (rvalue.is_wide_m && !(d & 1)) { > + * pre_shift = fls(d & -d) - 1; > + * rvalue = reciprocal_value_adv(d >> pre_shift, 32 - pre_shift); > + * } else { > + * pre_shift = 0; > + * } > + * > + * // code generation starts. > + * if (imm == 1 << exp) { > + * result = n >> exp; > + * } else if (rvalue.is_wide_m) { > + * // pre_shift must be zero when reached here. > + * t = (n * rvalue.m) >> 32; > + * result = n - t; > + * result >>= 1; > + * res
Re: [PATCH bpf-next 3/7] nfp: bpf: rename umin/umax to umin_src/umax_src
On Sun, Jun 24, 2018 at 8:54 PM, Jakub Kicinski wrote: > From: Jiong Wang > > The two fields are a copy of umin and umax info of bpf_insn->src_reg > generated by verifier. > > Rename to make their meaning clear. > > Signed-off-by: Jiong Wang > Reviewed-by: Jakub Kicinski Acked-by: Song Liu > --- > drivers/net/ethernet/netronome/nfp/bpf/jit.c | 12 ++-- > drivers/net/ethernet/netronome/nfp/bpf/main.h | 10 +- > drivers/net/ethernet/netronome/nfp/bpf/offload.c | 2 +- > drivers/net/ethernet/netronome/nfp/bpf/verifier.c | 4 ++-- > 4 files changed, 14 insertions(+), 14 deletions(-) > > diff --git a/drivers/net/ethernet/netronome/nfp/bpf/jit.c > b/drivers/net/ethernet/netronome/nfp/bpf/jit.c > index 33111739b210..4a629e9b5c0f 100644 > --- a/drivers/net/ethernet/netronome/nfp/bpf/jit.c > +++ b/drivers/net/ethernet/netronome/nfp/bpf/jit.c > @@ -1772,8 +1772,8 @@ static int shl_reg64(struct nfp_prog *nfp_prog, struct > nfp_insn_meta *meta) > u8 dst, src; > > dst = insn->dst_reg * 2; > - umin = meta->umin; > - umax = meta->umax; > + umin = meta->umin_src; > + umax = meta->umax_src; > if (umin == umax) > return __shl_imm64(nfp_prog, dst, umin); > > @@ -1881,8 +1881,8 @@ static int shr_reg64(struct nfp_prog *nfp_prog, struct > nfp_insn_meta *meta) > u8 dst, src; > > dst = insn->dst_reg * 2; > - umin = meta->umin; > - umax = meta->umax; > + umin = meta->umin_src; > + umax = meta->umax_src; > if (umin == umax) > return __shr_imm64(nfp_prog, dst, umin); > > @@ -1995,8 +1995,8 @@ static int ashr_reg64(struct nfp_prog *nfp_prog, struct > nfp_insn_meta *meta) > u8 dst, src; > > dst = insn->dst_reg * 2; > - umin = meta->umin; > - umax = meta->umax; > + umin = meta->umin_src; > + umax = meta->umax_src; > if (umin == umax) > return __ashr_imm64(nfp_prog, dst, umin); > > diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.h > b/drivers/net/ethernet/netronome/nfp/bpf/main.h > index 654fe7823e5e..5975a19c28cb 100644 > --- a/drivers/net/ethernet/netronome/nfp/bpf/main.h > +++ 
b/drivers/net/ethernet/netronome/nfp/bpf/main.h > @@ -263,8 +263,8 @@ struct nfp_bpf_reg_state { > * @func_id: function id for call instructions > * @arg1: arg1 for call instructions > * @arg2: arg2 for call instructions > - * @umin: copy of core verifier umin_value. > - * @umax: copy of core verifier umax_value. > + * @umin_src: copy of core verifier umin_value for src opearnd. > + * @umax_src: copy of core verifier umax_value for src operand. > * @off: index of first generated machine instruction (in nfp_prog.prog) > * @n: eBPF instruction number > * @flags: eBPF instruction extra optimization flags > @@ -301,11 +301,11 @@ struct nfp_insn_meta { > struct nfp_bpf_reg_state arg2; > }; > /* We are interested in range info for some operands, > -* for example, the shift amount. > +* for example, the shift amount which is kept in src operand. > */ > struct { > - u64 umin; > - u64 umax; > + u64 umin_src; > + u64 umax_src; > }; > }; > unsigned int off; > diff --git a/drivers/net/ethernet/netronome/nfp/bpf/offload.c > b/drivers/net/ethernet/netronome/nfp/bpf/offload.c > index 7eae4c0266f8..856a0003bb75 100644 > --- a/drivers/net/ethernet/netronome/nfp/bpf/offload.c > +++ b/drivers/net/ethernet/netronome/nfp/bpf/offload.c > @@ -191,7 +191,7 @@ nfp_prog_prepare(struct nfp_prog *nfp_prog, const struct > bpf_insn *prog, > meta->insn = prog[i]; > meta->n = i; > if (is_mbpf_indir_shift(meta)) > - meta->umin = U64_MAX; > + meta->umin_src = U64_MAX; > > list_add_tail(&meta->l, &nfp_prog->insns); > } > diff --git a/drivers/net/ethernet/netronome/nfp/bpf/verifier.c > b/drivers/net/ethernet/netronome/nfp/bpf/verifier.c > index 4bfeba7b21b2..e862b739441f 100644 > --- a/drivers/net/ethernet/netronome/nfp/bpf/verifier.c > +++ b/drivers/net/ethernet/netronome/nfp/bpf/verifier.c > @@ -555,8 +555,8 @@ nfp_verify_insn(struct bpf_verifier_env *env, int > insn_idx, int prev_insn_idx) > const struct bpf_reg_state *sreg = > cur_regs(env) + meta->insn.src_reg; > > - meta->umin = min(meta->umin, 
sreg->umin_value); > - meta->umax = max(meta->umax, sreg->umax_value); > + meta->umin_src = min(meta->umin_src, sreg->umin_value); > + meta->umax_src = max(meta->umax_src, sreg->umax_value); > } > > return 0; > -- > 2.17.1 >
Re: [PATCH bpf-next 1/7] nfp: bpf: allow source ptr type be map ptr in memcpy optimization
On Sun, Jun 24, 2018 at 8:54 PM, Jakub Kicinski wrote: > From: Jiong Wang > > Map read has been supported on NFP, this patch enables optimization for > memcpy from map to packet. > > This patch also fixed one latent bug which will cause copying from > unexpected address once memcpy for map pointer enabled. > > Reported-by: Mary Pham > Reported-by: David Beckett > Signed-off-by: Jiong Wang > Reviewed-by: Jakub Kicinski > --- > drivers/net/ethernet/netronome/nfp/bpf/jit.c | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > > diff --git a/drivers/net/ethernet/netronome/nfp/bpf/jit.c > b/drivers/net/ethernet/netronome/nfp/bpf/jit.c > index 8a92088df0d7..33111739b210 100644 > --- a/drivers/net/ethernet/netronome/nfp/bpf/jit.c > +++ b/drivers/net/ethernet/netronome/nfp/bpf/jit.c > @@ -670,7 +670,7 @@ static int nfp_cpp_memcpy(struct nfp_prog *nfp_prog, > struct nfp_insn_meta *meta) > xfer_num = round_up(len, 4) / 4; > > if (src_40bit_addr) > - addr40_offset(nfp_prog, meta->insn.src_reg, off, &src_base, > + addr40_offset(nfp_prog, meta->insn.src_reg * 2, off, > &src_base, > &off); Did this break other cases before this patch? I am sorry if this is a dumb question. I don't think I fully understand addr40_offset(). Song > > /* Setup PREV_ALU fields to override memory read length. */ > @@ -3299,7 +3299,8 @@ curr_pair_is_memcpy(struct nfp_insn_meta *ld_meta, > if (!is_mbpf_load(ld_meta) || !is_mbpf_store(st_meta)) > return false; > > - if (ld_meta->ptr.type != PTR_TO_PACKET) > + if (ld_meta->ptr.type != PTR_TO_PACKET && > + ld_meta->ptr.type != PTR_TO_MAP_VALUE) > return false; > > if (st_meta->ptr.type != PTR_TO_PACKET) > -- > 2.17.1 >
Re: [PATCH net-next 2/3] rds: Enable RDS IPv6 support
On 06/26/2018 01:50 AM, Sowmini Varadhan wrote:
>> If a socket is bound, I guess the scope_id should be used. So if a
>> socket is not bound to a link local address and the socket is used to
>> send to a link local peer, it should fail.
>
> PF_RDS sockets *MUST* always be bound. See Documentation/networking/rds.txt:
>
> "Sockets must be bound before you can send or receive data. This is
> needed because binding also selects a transport and attaches it to the
> socket. Once bound, the transport assignment does not change."
>
> Also, rds_sendmsg checks this (from net-next, your version has the
> equivalent ipv6_addr_any etc check):
>
> if (daddr == 0 || rs->rs_bound_addr == 0) {
> release_sock(sk);
> ret = -ENOTCONN; /* XXX not a great errno */
> goto out;
> }

I think you misunderstood what I wrote. The above is in response to your
original question:

--
> And this is even more confusing because the fastpath in rds_sendmsg
> does not take the bound_scope_id into consideration at all:
> 1213 if (rs->rs_conn && ipv6_addr_equal(&rs->rs_conn->c_faddr, &daddr))
> 1214 conn = rs->rs_conn;
> 1215 else {
> 1216 conn = rds_conn_create_outgoing( /* .. */, scope_id)
> so if I erroneously passed a msg_name on a connected rds socket, what
> would happen? (see also question about rds_connect() itself, below)
--

My answer to this is that if a socket is not bound to a link local
address (meaning it is bound to a non-link local address) and it is used
to send to a link local peer, I think it should fail. This is consistent
with the scope_id check I mentioned in the previous mail. If the socket
is not bound to a link local address, the bound_scope_id is 0. So if the
socket is used to send to a link local address (which has a non-zero
scope_id), the check will catch it and fail the call. A new conn should
not be created in this case.

--
K. Poon
ka-cheong.p...@oracle.com
Re: [patch net-next 6/9] net: sched: cls_flower: propagate chain template creation and destruction to drivers
On Mon, 25 Jun 2018 23:01:45 +0200, Jiri Pirko wrote: > From: Jiri Pirko > > Introduce a couple of flower offload commands in order to propagate > template creation/destruction events down to device drivers. > Drivers may use this information to prepare HW in an optimal way > for future filter insertions. > > Signed-off-by: Jiri Pirko > diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c > index d64d43843a3a..276ba25a09c3 100644 > --- a/net/sched/cls_flower.c > +++ b/net/sched/cls_flower.c > @@ -1120,6 +1120,43 @@ static void fl_walk(struct tcf_proto *tp, struct > tcf_walker *arg) > } > } > > +static void fl_hw_create_tmplt(struct tcf_chain *chain, > +struct fl_flow_tmplt *tmplt, > +struct netlink_ext_ack *extack) > +{ > + struct tc_cls_flower_offload cls_flower = {}; > + struct tcf_block *block = chain->block; > + struct tcf_exts dummy_exts = { 0, }; > + > + cls_flower.common.chain_index = chain->index; Did you skip extack on purpose? > + cls_flower.command = TC_CLSFLOWER_TMPLT_CREATE; > + cls_flower.cookie = (unsigned long) tmplt; > + cls_flower.dissector = &tmplt->dissector; > + cls_flower.mask = &tmplt->mask; > + cls_flower.key = &tmplt->dummy_key; > + cls_flower.exts = &dummy_exts; > + > + /* We don't care if driver (any of them) fails to handle this > + * call. It serves just as a hint for it. > + */ > + tc_setup_cb_call(block, NULL, TC_SETUP_CLSFLOWER, > + &cls_flower, false); > +}
Re: [patch net-next 0/9] net: sched: introduce chain templates support with offloading to mlxsw
On Mon, 25 Jun 2018 23:01:39 +0200, Jiri Pirko wrote:
> From: Jiri Pirko
>
> For the TC clsact offload these days, some HW drivers need to hold a
> magic ball. The reason is, with the first rule inserted into the HW
> they need to guess what fields will be used for the matching. If later
> on this guess proves to be wrong and the user adds a filter with a
> different field to match, there's a problem. Mlxsw resolves it now with
> a couple of patterns. Those try to cover as many match fields as
> possible. This approach is far from optimal, both performance-wise and
> scale-wise. Also, there is a combination of filters that in a certain
> order won't succeed.
>
> Most of the time, when the user inserts filters into a chain, he knows
> right away what the filters are going to look like - what type and
> options they will have. For example, he knows that he will only insert
> filters of type flower matching destination IP address. He can specify
> a template that would cover all the filters in the chain.

Perhaps it's lack of sleep, but this paragraph threw me a little off
the track. IIUC the goal of this set is to provide a way to inform the
HW about expected matches before any rule is programmed into the HW.
Not before any rule is added to a particular chain. One can just use
the first rule in the chain to make a guess about the chain, but thanks
to this set the user can configure *all* chains before any rules are
added.

And that's needed because once any rule is added the tcam config can no
longer be easily modified?
Re: [PATCHv2 net-next] sctp: add support for SCTP_REUSE_PORT sockopt
On Mon, Jun 25, 2018 at 9:30 PM, Marcelo Ricardo Leitner wrote: > On Mon, Jun 25, 2018 at 10:06:46AM +0800, Xin Long wrote: >> This feature is actually already supported by sk->sk_reuse which can be >> set by socket level opt SO_REUSEADDR. But it's not working exactly as >> RFC6458 demands in section 8.1.27, like: >> >> - This option only supports one-to-one style SCTP sockets >> - This socket option must not be used after calling bind() >> or sctp_bindx(). >> >> Besides, SCTP_REUSE_PORT sockopt should be provided for user's programs. >> Otherwise, the programs with SCTP_REUSE_PORT from other systems will not >> work in linux. >> >> To separate it from the socket level version, this patch adds 'reuse' in >> sctp_sock and it works pretty much as sk->sk_reuse, but with some extra >> setup limitations that are needed when it is being enabled. >> >> "It should be noted that the behavior of the socket-level socket option >> to reuse ports and/or addresses for SCTP sockets is unspecified", so it >> leaves SO_REUSEADDR as is for the compatibility. >> >> Note that the name SCTP_REUSE_PORT is kind of confusing, it is identical >> to SO_REUSEADDR with some extra restriction, so here it uses 'reuse' in >> sctp_sock instead of 'reuseport'. As for sk->sk_reuseport support for >> SCTP, it will be added in another patch. > > To help changelog readers later, please update to something like: > > """\ > Note that the name SCTP_REUSE_PORT is somewhat confusing, as its > functionality is nearly identical to SO_REUSEADDR, but with some > extra restrictions. Here it uses 'reuse' in sctp_sock instead of > 'reuseport'. As for sk->sk_reuseport support for SCTP, it will be > added in another patch. > """ > > Makes sense, can you note the difference? Sure, will post v3. thanks.
Re: [PATCH net-next 1/5] ipv4: add __ip_queue_xmit() that supports tos param
On Mon, Jun 25, 2018 at 7:13 PM, Neil Horman wrote: > On Mon, Jun 25, 2018 at 04:26:54PM +0900, David Miller wrote: >> From: Xin Long >> Date: Mon, 25 Jun 2018 10:14:33 +0800 >> >> > +EXPORT_SYMBOL(__ip_queue_xmit); >> > + >> > +int ip_queue_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl) >> > +{ >> > + return __ip_queue_xmit(sk, skb, fl, inet_sk(sk)->tos); >> > +} >> > EXPORT_SYMBOL(ip_queue_xmit); >> >> Maybe better to only export __ip_queue_xmit() and make ip_queue_xmit() an >> inline function in net/ip.h? >> > I concur. No need to export both here > will do that.
Re: [PATCH net-next 3/5] sctp: add spp_ipv6_flowlabel and spp_dscp for sctp_paddrparams
On Tue, Jun 26, 2018 at 12:31 AM, Marcelo Ricardo Leitner wrote: > Hi, > > On Tue, Jun 26, 2018 at 01:12:00AM +0900, 吉藤英明 wrote: >> Hi, >> >> 2018-06-25 22:03 GMT+09:00 Marcelo Ricardo Leitner >> : >> > On Mon, Jun 25, 2018 at 07:28:47AM -0400, Neil Horman wrote: >> >> On Mon, Jun 25, 2018 at 04:31:26PM +0900, David Miller wrote: >> >> > From: Xin Long >> >> > Date: Mon, 25 Jun 2018 10:14:35 +0800 >> >> > >> >> > > struct sctp_paddrparams { >> >> > > @@ -773,6 +775,8 @@ struct sctp_paddrparams { >> >> > > __u32 spp_pathmtu; >> >> > > __u32 spp_sackdelay; >> >> > > __u32 spp_flags; >> >> > > + __u32 spp_ipv6_flowlabel; >> >> > > + __u8spp_dscp; >> >> > > } __attribute__((packed, aligned(4))); >> >> > >> >> > I don't think you can change the size of this structure like this. >> >> > >> >> > This check in sctp_setsockopt_peer_addr_params(): >> >> > >> >> > if (optlen != sizeof(struct sctp_paddrparams)) >> >> > return -EINVAL; >> >> > >> >> > is going to trigger in old kernels when executing programs >> >> > built against the new struct definition. >> > >> > That will happen, yes, but do we really care about being future-proof >> > here? I mean: if we also update such check(s) to support dealing with >> > smaller-than-supported structs, newer kernels will be able to run >> > programs built against the old struct, and the new one; while building >> > using newer headers and running on older kernel may fool the >> > application in other ways too (like enabling support for something >> > that is available on newer kernel and that is not present in the older >> > one). >> >> We should not break existing apps. >> We still accept apps of pre-2.4 era without sin6_scope_id >> (e.g., net/ipv6/af_inet6.c:inet6_bind()). > > Yes. That's what I tried to say. That is supporting an old app built > with old kernel headers and running on a newer kernel, and not the > other way around (an app built with fresh headers and running on an > old kernel). 
To make it, I will update the check like: diff --git a/net/sctp/socket.c b/net/sctp/socket.c index 1df5d07..c949d8c 100644 --- a/net/sctp/socket.c +++ b/net/sctp/socket.c @@ -2715,13 +2715,18 @@ static int sctp_setsockopt_peer_addr_params(struct sock *sk, struct sctp_sock*sp = sctp_sk(sk); int error; int hb_change, pmtud_change, sackdelay_change; + int plen = sizeof(params); + int old_plen = plen - sizeof(u32) * 2; - if (optlen != sizeof(struct sctp_paddrparams)) + if (optlen != plen && optlen != old_plen) return -EINVAL; if (copy_from_user(¶ms, optval, optlen)) return -EFAULT; + if (optlen == old_plen) + params.spp_flags &= ~(SPP_DSCP | SPP_IPV6_FLOWLABEL); + /* Validate flags and value parameters. */ hb_change= params.spp_flags & SPP_HB; pmtud_change = params.spp_flags & SPP_PMTUD; @@ -5591,10 +5596,13 @@ static int sctp_getsockopt_peer_addr_params(struct sock *sk, int len, struct sctp_transport *trans = NULL; struct sctp_association *asoc = NULL; struct sctp_sock*sp = sctp_sk(sk); + int plen = sizeof(params); + int old_plen = plen - sizeof(u32) * 2; - if (len < sizeof(struct sctp_paddrparams)) + if (len < old_plen) return -EINVAL; - len = sizeof(struct sctp_paddrparams); + + len = len >= plen ? plen : old_plen; if (copy_from_user(¶ms, optval, len)) return -EFAULT; does it look ok to you?
Re: [PATCH rdma-next 00/12] RDMA fixes 2018-06-24
On Mon, Jun 25, 2018 at 03:34:38PM -0600, Jason Gunthorpe wrote:
> On Sun, Jun 24, 2018 at 11:23:41AM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky
> >
> > Hi,
> >
> > This is a bunch of patches triggered by running syzkaller internally.
> >
> > I'm sending them based on rdma-next mainly for two reasons:
> > 1. Most of the patches fix old issues and it doesn't matter when
> > they will hit Linus's tree: now or later in a couple of weeks
> > during the merge window.
> > 2. They interleave with code cleanup, mlx5-next patches and Michael's
> > feedback on the flow counters series.
> >
> > Thanks
> >
> > Leon Romanovsky (12):
> >   RDMA/uverbs: Protect from attempts to create flows on unsupported QP
> >   RDMA/uverbs: Fix slab-out-of-bounds in ib_uverbs_ex_create_flow
>
> I applied these two to for-rc
>
> >   RDMA/uverbs: Check existence of create_flow callback
> >   RDMA/verbs: Drop kernel variant of create_flow
> >   RDMA/verbs: Drop kernel variant of destroy_flow
> >   net/mlx5: Rate limit errors in command interface
> >   RDMA/uverbs: Don't overwrite NULL pointer with ZERO_SIZE_PTR
> >   RDMA/umem: Don't check for negative return value of dma_map_sg_attrs()
> >   RDMA/uverbs: Remove redundant check
>
> These go to for-next

Jason,

We would like to see patch "[PATCH mlx5-next 05/12] net/mlx5: Rate limit
errors in command interface" in our mlx5-next. Is it possible at this
point to drop it from for-next, so I'll be able to take it into
mlx5-next?

Thanks

> >   overflow.h: Add arithmetic shift helper
> >   RDMA/mlx5: Fix shift overflow in mlx5_ib_create_wq
> >   RDMA/mlx5: Reuse existed shift_overlow helper
>
> And these will have to be respun.
>
> Thanks,
> Jason
[PATCH net v2 0/2] nfp: MPLS and shared blocks TC offload fixes
Hi! This series brings two fixes to TC filter/action offload code. Pieter fixes matching MPLS packets when the match is purely on the MPLS ethertype and none of the MPLS fields are used. John provides a fix for offload of shared blocks. Unfortunately, with shared blocks there is currently no guarantee that filters which were added by the core will be removed before block unbind. Our simple fix is to not support offload of rules on shared blocks at all, a revert of this fix will be send for -next once the reoffload infrastructure lands. The shared blocks became important as we are trying to use them for bonding offload (managed from user space) and lack of remove calls leads to resource leaks. v2: - fix build error reported by kbuild bot due to missing tcf_block_shared() helper. John Hurley (1): nfp: reject binding to shared blocks Pieter Jansen van Vuuren (1): nfp: flower: fix mpls ether type detection drivers/net/ethernet/netronome/nfp/bpf/main.c | 3 +++ drivers/net/ethernet/netronome/nfp/flower/match.c | 14 ++ .../net/ethernet/netronome/nfp/flower/offload.c| 11 +++ include/net/pkt_cls.h | 5 + 4 files changed, 33 insertions(+) -- 2.17.1
[PATCH net v2 2/2] nfp: reject binding to shared blocks
From: John Hurley TC shared blocks allow multiple qdiscs to be grouped together and filters shared between them. Currently the chains of filters attached to a block are only flushed when the block is removed. If a qdisc is removed from a block but the block still exists, flow del messages are not passed to the callback registered for that qdisc. For the NFP, this presents the possibility of rules still existing in hw when they should be removed. Prevent binding to shared blocks until the kernel can send per qdisc del messages when block unbinds occur. tcf_block_shared() was not used outside of the core until now, so also add an empty implementation for builds with CONFIG_NET_CLS=n. Fixes: 4861738775d7 ("net: sched: introduce shared filter blocks infrastructure") Signed-off-by: John Hurley Signed-off-by: Jakub Kicinski Reviewed-by: Simon Horman --- v2: - add a tcf_block_shared() for !CONFIG_NET_CLS --- drivers/net/ethernet/netronome/nfp/bpf/main.c | 3 +++ drivers/net/ethernet/netronome/nfp/flower/offload.c | 3 +++ include/net/pkt_cls.h | 5 + 3 files changed, 11 insertions(+) diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.c b/drivers/net/ethernet/netronome/nfp/bpf/main.c index fcdfb8e7fdea..6b15e3b11956 100644 --- a/drivers/net/ethernet/netronome/nfp/bpf/main.c +++ b/drivers/net/ethernet/netronome/nfp/bpf/main.c @@ -202,6 +202,9 @@ static int nfp_bpf_setup_tc_block(struct net_device *netdev, if (f->binder_type != TCF_BLOCK_BINDER_TYPE_CLSACT_INGRESS) return -EOPNOTSUPP; + if (tcf_block_shared(f->block)) + return -EOPNOTSUPP; + switch (f->command) { case TC_BLOCK_BIND: return tcf_block_cb_register(f->block, diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c b/drivers/net/ethernet/netronome/nfp/flower/offload.c index 477f584f6d28..525057bee0ed 100644 --- a/drivers/net/ethernet/netronome/nfp/flower/offload.c +++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c @@ -631,6 +631,9 @@ static int nfp_flower_setup_tc_block(struct net_device 
*netdev, if (f->binder_type != TCF_BLOCK_BINDER_TYPE_CLSACT_INGRESS) return -EOPNOTSUPP; + if (tcf_block_shared(f->block)) + return -EOPNOTSUPP; + switch (f->command) { case TC_BLOCK_BIND: return tcf_block_cb_register(f->block, diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h index a3c1a2c47cd4..20b059574e60 100644 --- a/include/net/pkt_cls.h +++ b/include/net/pkt_cls.h @@ -111,6 +111,11 @@ void tcf_block_put_ext(struct tcf_block *block, struct Qdisc *q, { } +static inline bool tcf_block_shared(struct tcf_block *block) +{ + return false; +} + static inline struct Qdisc *tcf_block_q(struct tcf_block *block) { return NULL; -- 2.17.1
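The `!CONFIG_NET_CLS` fallback added in the hunk above follows a common kernel pattern: when a subsystem is compiled out, a `static inline` stub with a safe default keeps callers compiling and linking. A minimal userspace sketch of that pattern (the `CONFIG_NET_CLS` macro and the opaque `struct tcf_block` stand in for the real kernel definitions):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct tcf_block;                     /* opaque; layout irrelevant here */

#ifdef CONFIG_NET_CLS
bool tcf_block_shared(struct tcf_block *block);   /* real implementation */
#else
/* Subsystem compiled out: a block can never be shared, so callers such
 * as nfp_bpf_setup_tc_block() see "not shared" and proceed as before. */
static inline bool tcf_block_shared(struct tcf_block *block)
{
	return false;
}
#endif
```

With `CONFIG_NET_CLS` undefined, the stub is picked and always reports "not shared", which is exactly what the kbuild-bot build failure further down this thread needed.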
[PATCH net v2 1/2] nfp: flower: fix mpls ether type detection
From: Pieter Jansen van Vuuren Previously it was not possible to distinguish between mpls ether types and other ether types. This leads to incorrect classification of offloaded filters that match on mpls ether type. For example the following two filters overlap: # tc filter add dev eth0 parent : \ protocol 0x8847 flower \ action mirred egress redirect dev eth1 # tc filter add dev eth0 parent : \ protocol 0x0800 flower \ action mirred egress redirect dev eth2 The driver now correctly includes the mac_mpls layer where HW stores mpls fields, when it detects an mpls ether type. It also sets the MPLS_Q bit to indicate that the filter should match mpls packets. Fixes: bb055c198d9b ("nfp: add mpls match offloading support") Signed-off-by: Pieter Jansen van Vuuren Reviewed-by: Simon Horman Reviewed-by: Jakub Kicinski --- drivers/net/ethernet/netronome/nfp/flower/match.c | 14 ++ .../net/ethernet/netronome/nfp/flower/offload.c| 8 2 files changed, 22 insertions(+) diff --git a/drivers/net/ethernet/netronome/nfp/flower/match.c b/drivers/net/ethernet/netronome/nfp/flower/match.c index 91935405f586..84f7a5dbea9d 100644 --- a/drivers/net/ethernet/netronome/nfp/flower/match.c +++ b/drivers/net/ethernet/netronome/nfp/flower/match.c @@ -123,6 +123,20 @@ nfp_flower_compile_mac(struct nfp_flower_mac_mpls *frame, NFP_FLOWER_MASK_MPLS_Q; frame->mpls_lse = cpu_to_be32(t_mpls); + } else if (dissector_uses_key(flow->dissector, + FLOW_DISSECTOR_KEY_BASIC)) { + /* Check for mpls ether type and set NFP_FLOWER_MASK_MPLS_Q +* bit, which indicates an mpls ether type but without any +* mpls fields. 
+*/ + struct flow_dissector_key_basic *key_basic; + + key_basic = skb_flow_dissector_target(flow->dissector, + FLOW_DISSECTOR_KEY_BASIC, + flow->key); + if (key_basic->n_proto == cpu_to_be16(ETH_P_MPLS_UC) || + key_basic->n_proto == cpu_to_be16(ETH_P_MPLS_MC)) + frame->mpls_lse = cpu_to_be32(NFP_FLOWER_MASK_MPLS_Q); } } diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c b/drivers/net/ethernet/netronome/nfp/flower/offload.c index c42e64f32333..477f584f6d28 100644 --- a/drivers/net/ethernet/netronome/nfp/flower/offload.c +++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c @@ -264,6 +264,14 @@ nfp_flower_calculate_key_layers(struct nfp_app *app, case cpu_to_be16(ETH_P_ARP): return -EOPNOTSUPP; + case cpu_to_be16(ETH_P_MPLS_UC): + case cpu_to_be16(ETH_P_MPLS_MC): + if (!(key_layer & NFP_FLOWER_LAYER_MAC)) { + key_layer |= NFP_FLOWER_LAYER_MAC; + key_size += sizeof(struct nfp_flower_mac_mpls); + } + break; + /* Will be included in layer 2. */ case cpu_to_be16(ETH_P_8021Q): break; -- 2.17.1
[PATCH net-next] neighbour: force neigh_invalidate when NUD_FAILED update is from admin
From: Roopa Prabhu In systems where neigh gc thresholds are set to high values, admin deleted neigh entries (eg ip neigh flush or ip neigh del) can linger around in NUD_FAILED state for a long time until periodic gc kicks in. This patch forces neigh_invalidate when NUD_FAILED neigh_update is from an admin. Signed-off-by: Roopa Prabhu --- My testing has not shown any problems with this patch. But I am not sure why historically neigh admin was not considered here: I am assuming that it is because the problem is not very obvious in default low gc threshold deployments. net/core/neighbour.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/core/neighbour.c b/net/core/neighbour.c index 8e3fda9..cbe85d8 100644 --- a/net/core/neighbour.c +++ b/net/core/neighbour.c @@ -1148,7 +1148,8 @@ int neigh_update(struct neighbour *neigh, const u8 *lladdr, u8 new, neigh->nud_state = new; err = 0; notify = old & NUD_VALID; - if ((old & (NUD_INCOMPLETE | NUD_PROBE)) && + if (((old & (NUD_INCOMPLETE | NUD_PROBE)) || +(flags & NEIGH_UPDATE_F_ADMIN)) && (new & NUD_FAILED)) { neigh_invalidate(neigh); notify = 1; -- 2.1.4
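For illustration, the patched condition can be lifted into a pure predicate. This is a userspace sketch only: the NUD_* and NEIGH_UPDATE_F_ADMIN values are copied from the kernel's uapi/net headers, and should_invalidate() is a name invented here, not a kernel function:

```c
#include <assert.h>

#define NUD_INCOMPLETE        0x01
#define NUD_PROBE             0x10
#define NUD_FAILED            0x20
#define NEIGH_UPDATE_F_ADMIN  0x80000000u

/* Predicate from the patched neigh_update(): invalidate when the entry
 * moves to NUD_FAILED and either the old state was transient
 * (INCOMPLETE/PROBE) or, new with this patch, the update came from an
 * admin (ip neigh del / ip neigh flush). */
static int should_invalidate(unsigned int old_state, unsigned int new_state,
			     unsigned int flags)
{
	return ((old_state & (NUD_INCOMPLETE | NUD_PROBE)) ||
		(flags & NEIGH_UPDATE_F_ADMIN)) &&
	       (new_state & NUD_FAILED);
}
```

The new `flags` test is what lets an admin delete of a previously reachable entry trigger the invalidation immediately instead of waiting for gc.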
Re: [net-next PATCH v4 3/7] net: sock: Change tx_queue_mapping in sock_common to unsigned short
On Mon, Jun 25, 2018 at 6:34 PM, Tom Herbert wrote: > > > On Mon, Jun 25, 2018 at 11:04 AM, Amritha Nambiar > wrote: >> >> Change 'skc_tx_queue_mapping' field in sock_common structure from >> 'int' to 'unsigned short' type with 0 indicating unset and >> a positive queue value being set. This way it is consistent with >> the queue_mapping field in the sk_buff. This will also accommodate >> adding a new 'unsigned short' field in sock_common in the next >> patch for rx_queue_mapping. >> >> Signed-off-by: Amritha Nambiar >> --- >> include/net/sock.h | 10 ++ >> 1 file changed, 6 insertions(+), 4 deletions(-) >> >> diff --git a/include/net/sock.h b/include/net/sock.h >> index b3b7541..009fd30 100644 >> --- a/include/net/sock.h >> +++ b/include/net/sock.h >> @@ -214,7 +214,7 @@ struct sock_common { >> struct hlist_node skc_node; >> struct hlist_nulls_node skc_nulls_node; >> }; >> - int skc_tx_queue_mapping; >> + unsigned short skc_tx_queue_mapping; >> union { >> int skc_incoming_cpu; >> u32 skc_rcv_wnd; >> @@ -1681,17 +1681,19 @@ static inline int sk_receive_skb(struct sock *sk, >> struct sk_buff *skb, >> >> static inline void sk_tx_queue_set(struct sock *sk, int tx_queue) >> { >> - sk->sk_tx_queue_mapping = tx_queue; >> + /* sk_tx_queue_mapping accept only upto a 16-bit value */ >> + WARN_ON((unsigned short)tx_queue > USHRT_MAX); > > > Shouldn't this be USHRT_MAX - 1 ? Actually just a ">=" would probably do as well. > >> + sk->sk_tx_queue_mapping = tx_queue + 1; >> } >> >> static inline void sk_tx_queue_clear(struct sock *sk) >> { >> - sk->sk_tx_queue_mapping = -1; >> >> + sk->sk_tx_queue_mapping = 0; > > > I think it's slightly better to define a new constant like NO_QUEUE_MAPPING > to be USHRT_MAX. That avoids needing to do the arithmetic every time the > value is accessed. >> >> } >> >> static inline int sk_tx_queue_get(const struct sock *sk) >> { >> - return sk ? sk->sk_tx_queue_mapping : -1; >> + return sk ? 
sk->sk_tx_queue_mapping - 1 : -1; > > > Doesn't the comparison in __netdev_pick_tx need to be simultaneously changed > for this? This doesn't change the result. It was still -1 if the queue mapping is not set. It was just initialized to 0 instead of to -1 so we have to perform the operation to get there. Also in regards to the comment above about needing an extra operation I am not sure it makes much difference. In the case of us starting with 0 as a reserved value I think the instruction count should be about the same. We move the unsigned short into an unsigned int, then decrement, and if the value is non-negative we can assume it is valid. Although maybe I should double check the code to make certain it is doing what I thought it was supposed to be doing. > >> >> >> >> } >> >> static inline void sk_set_socket(struct sock *sk, struct socket *sock) >> >
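The off-by-one encoding being debated above can be reproduced in a few lines of userspace C. This is a sketch of the scheme from the patch (0 means unset, the stored value is queue + 1), not the kernel code itself:

```c
#include <assert.h>
#include <limits.h>

static unsigned short tx_queue_mapping;   /* 0 = no queue recorded */

static void tx_queue_set(int tx_queue)
{
	/* USHRT_MAX - 1 is the largest queue that still fits after the +1
	 * shift, which is why the review suggests ">=" in the WARN_ON(). */
	tx_queue_mapping = tx_queue + 1;
}

static void tx_queue_clear(void)
{
	tx_queue_mapping = 0;
}

static int tx_queue_get(void)
{
	/* unsigned short promotes to int before the subtraction, so the
	 * "unset" value 0 comes back as -1, like the old int field did. */
	return tx_queue_mapping - 1;
}
```

Callers that compare the getter's result against -1 (or check for a negative value) therefore keep working unchanged, which is the point Alexander makes above.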
Re: [PATCH net-next] tcp: add SNMP counter for zero-window drops
From: Yafang Shao Date: Sun, 24 Jun 2018 10:02:54 -0400 > It will be helpful if we could display the drops due to zero window or no > enough window space. > So a new SNMP MIB entry is added to track this behavior. > This entry is named LINUX_MIB_TCPZEROWINDOWDROP and published in > /proc/net/netstat in TcpExt line as TCPZeroWindowDrop. > > Signed-off-by: Yafang Shao Applied.
Re: [PATCH v2 net] nfp: cast sizeof() to int when comparing with error code
On 06/26/2018 10:29 AM, Joe Perches wrote: On Tue, 2018-06-26 at 09:16 +0800, Chengguang Xu wrote: sizeof() will return unsigned value so in the error check negative error code will be always larger than sizeof(). [] diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nffw.c b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nffw.c [] @@ -232,7 +232,7 @@ struct nfp_nffw_info *nfp_nffw_info_open(struct nfp_cpp *cpp) err = nfp_cpp_read(cpp, nfp_resource_cpp_id(state->res), nfp_resource_address(state->res), fwinf, sizeof(*fwinf)); - if (err < sizeof(*fwinf)) + if (err < (int)sizeof(*fwinf)) goto err_release; if (!nffw_res_flg_init_get(fwinf)) The way this is done in several places in the kernel is to test first for < 0 and then test for < sizeof if (err < 0 || err < sizeof(etc...) see net/ceph/ceph_common.c etc... If we need to distinguish the cases <0 and >0, the && approach is better. If not, I think the cast to int will be enough. Thanks, Chengguang.
Re: [PATCH v2 net] nfp: cast sizeof() to int when comparing with error code
On Tue, 2018-06-26 at 09:16 +0800, Chengguang Xu wrote: > sizeof() will return unsigned value so in the error check > negative error code will be always larger than sizeof(). [] > diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nffw.c > b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nffw.c [] > @@ -232,7 +232,7 @@ struct nfp_nffw_info *nfp_nffw_info_open(struct nfp_cpp > *cpp) > err = nfp_cpp_read(cpp, nfp_resource_cpp_id(state->res), > nfp_resource_address(state->res), > fwinf, sizeof(*fwinf)); > - if (err < sizeof(*fwinf)) > + if (err < (int)sizeof(*fwinf)) > goto err_release; > > if (!nffw_res_flg_init_get(fwinf)) The way this is done in several places in the kernel is to test first for < 0 and then test for < sizeof if (err < 0 || err < sizeof(etc...) see net/ceph/ceph_common.c etc...
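The pitfall both replies describe is the usual arithmetic conversion: comparing a signed int against a size_t converts the int to the unsigned type, so a negative error code silently passes the check. A standalone sketch of the broken check and both proposed fixes (the function names are invented for illustration):

```c
#include <assert.h>
#include <stddef.h>

static int check_broken(int err, size_t want)
{
	return err < want;   /* err converts to size_t; -22 becomes huge */
}

static int check_cast(int err, size_t want)
{
	return err < (int)want;           /* the fix taken in the patch */
}

static int check_two_step(int err, size_t want)
{
	/* style Joe points at in net/ceph/ceph_common.c */
	return err < 0 || (size_t)err < want;
}
```

As Chengguang notes, the two-step form additionally lets the caller tell "negative error" apart from "short read"; when that distinction is not needed, the cast is sufficient.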
[PATCH net-next] liquidio: fix kernel panic when NIC firmware is older than 1.7.2
From: Rick Farrington Pre-1.7.2 NIC firmware does not support (and does not respond to) the "get speed" command which is sent by the 1.7.2 driver during modprobe. Due to a bug in older firmware (with respect to unknown commands), this unsupported command causes a cascade of errors that ends in a kernel panic. Fix it by making the sending of the "get speed" command conditional on the firmware version. Signed-off-by: Rick Farrington Acked-by: Derek Chickles Signed-off-by: Felix Manlunas --- Note: To avoid checkpatch.pl "WARNING: line over 80 characters", the comma that separates the arguments in the call to strcmp() was placed one line below the usual spot. drivers/net/ethernet/cavium/liquidio/lio_main.c | 11 ++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c b/drivers/net/ethernet/cavium/liquidio/lio_main.c index 7cb4e75..f83f884 100644 --- a/drivers/net/ethernet/cavium/liquidio/lio_main.c +++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c @@ -3671,7 +3671,16 @@ static int setup_nic_devices(struct octeon_device *octeon_dev) OCTEON_CN2350_25GB_SUBSYS_ID || octeon_dev->subsystem_id == OCTEON_CN2360_25GB_SUBSYS_ID) { - liquidio_get_speed(lio); + /* speed control unsupported in f/w older than 1.7.2 */ + if (strcmp(octeon_dev->fw_info.liquidio_firmware_version + , "1.7.2") < 0) { + dev_info(&octeon_dev->pci_dev->dev, +"speed setting not supported by f/w."); + octeon_dev->speed_setting = 25; + octeon_dev->no_speed_setting = 1; + } else { + liquidio_get_speed(lio); + } if (octeon_dev->speed_setting == 0) { octeon_dev->speed_setting = 25;
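The version gate above relies on strcmp()'s byte-wise ordering. That ordering matches numeric version ordering only while the compared components have the same number of digits (e.g. "1.7.1" < "1.7.2"), which holds for the firmware versions in question. A sketch of the comparison as the patch uses it (fw_older_than() is an illustrative name, not a driver function):

```c
#include <assert.h>
#include <string.h>

/* Lexicographic "is fw older than cutoff", as the driver does with
 * strcmp(...) < 0 against "1.7.2". */
static int fw_older_than(const char *fw, const char *cutoff)
{
	return strcmp(fw, cutoff) < 0;
}
```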
Re: [net-next PATCH v4 5/7] net: Enable Tx queue selection based on Rx queues
Hi Amritha, Thank you for the patch! Yet something to improve: [auto build test ERROR on net-next/master] url: https://github.com/0day-ci/linux/commits/Amritha-Nambiar/Symmetric-queue-selection-using-XPS-for-Rx-queues/20180626-070944 config: i386-randconfig-x078-201825 (attached as .config) compiler: gcc-7 (Debian 7.3.0-16) 7.3.0 reproduce: # save the attached .config to linux build tree make ARCH=i386 All errors (new ones prefixed by >>): In file included from include/net/cls_cgroup.h:19:0, from net/socket.c:98: include/net/sock.h: In function 'sk_rx_queue_get': >> include/net/sock.h:1715:16: error: 'const struct sock' has no member named >> 'sk_rx_queue_mapping' return sk ? sk->sk_rx_queue_mapping - 1 : -1; ^~ vim +1715 include/net/sock.h 1712 1713 static inline int sk_rx_queue_get(const struct sock *sk) 1714 { > 1715 return sk ? sk->sk_rx_queue_mapping - 1 : -1; 1716 } 1717 static inline void sk_set_socket(struct sock *sk, struct socket *sock) 1718 { 1719 sk_tx_queue_clear(sk); 1720 sk->sk_socket = sock; 1721 } 1722 --- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation
Re: [PATCH net 2/2] nfp: reject binding to shared blocks
Hi John, Thank you for the patch! Yet something to improve: [auto build test ERROR on net/master] url: https://github.com/0day-ci/linux/commits/Jakub-Kicinski/nfp-MPLS-and-shared-blocks-TC-offload-fixes/20180626-042358 config: x86_64-randconfig-u0-06260533 (attached as .config) compiler: gcc-5 (Debian 5.5.0-3) 5.4.1 20171010 reproduce: # save the attached .config to linux build tree make ARCH=x86_64 All errors (new ones prefixed by >>): drivers/net/ethernet/netronome/nfp/bpf/main.c: In function 'nfp_bpf_setup_tc_block': >> drivers/net/ethernet/netronome/nfp/bpf/main.c:205:6: error: implicit >> declaration of function 'tcf_block_shared' >> [-Werror=implicit-function-declaration] if (tcf_block_shared(f->block)) ^ cc1: some warnings being treated as errors vim +/tcf_block_shared +205 drivers/net/ethernet/netronome/nfp/bpf/main.c 196 197 static int nfp_bpf_setup_tc_block(struct net_device *netdev, 198struct tc_block_offload *f) 199 { 200 struct nfp_net *nn = netdev_priv(netdev); 201 202 if (f->binder_type != TCF_BLOCK_BINDER_TYPE_CLSACT_INGRESS) 203 return -EOPNOTSUPP; 204 > 205 if (tcf_block_shared(f->block)) 206 return -EOPNOTSUPP; 207 208 switch (f->command) { 209 case TC_BLOCK_BIND: 210 return tcf_block_cb_register(f->block, 211 nfp_bpf_setup_tc_block_cb, 212 nn, nn); 213 case TC_BLOCK_UNBIND: 214 tcf_block_cb_unregister(f->block, 215 nfp_bpf_setup_tc_block_cb, 216 nn); 217 return 0; 218 default: 219 return -EOPNOTSUPP; 220 } 221 } 222 --- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation
[PATCH v2 net] nfp: cast sizeof() to int when comparing with error code
sizeof() will return unsigned value so in the error check negative error code will be always larger than sizeof(). Fixes: a0d8e02c35ff ("nfp: add support for reading nffw info") Signed-off-by: Chengguang Xu Acked-by: Jakub Kicinski --- v2: - Add more information to patch subject and commit log. drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nffw.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nffw.c b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nffw.c index cd34097b79f1..37a6d7822a38 100644 --- a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nffw.c +++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nffw.c @@ -232,7 +232,7 @@ struct nfp_nffw_info *nfp_nffw_info_open(struct nfp_cpp *cpp) err = nfp_cpp_read(cpp, nfp_resource_cpp_id(state->res), nfp_resource_address(state->res), fwinf, sizeof(*fwinf)); - if (err < sizeof(*fwinf)) + if (err < (int)sizeof(*fwinf)) goto err_release; if (!nffw_res_flg_init_get(fwinf)) -- 2.17.1
Re: [PATCH v2 net-next 3/4] netdevsim: add ipsec offload testing
On Mon, 25 Jun 2018 16:41:35 -0700, Shannon Nelson wrote: > Implement the IPsec/XFRM offload API for testing. > > Signed-off-by: Shannon Nelson > --- > V2 - addressed formatting comments from Jakub Kicinski Thanks! One more comment below, otherwise: Reviewed-by: Jakub Kicinski > +static void nsim_ipsec_del_sa(struct xfrm_state *xs) > +{ > + struct net_device *dev = xs->xso.dev; > + struct netdevsim *ns = netdev_priv(dev); > + struct nsim_ipsec *ipsec = &ns->ipsec; > + u16 sa_idx; I didn't point this out, but above also breaks the reverse xmas tree rule. Sorry for not mentioning it, the "please fix everywhere" was implicit :( > + sa_idx = xs->xso.offload_handle & ~NSIM_IPSEC_VALID; > + if (!ipsec->sa[sa_idx].used) { > + netdev_err(dev, "Invalid SA for delete sa_idx=%d\n", sa_idx); > + return; > + } > + > + memset(&ipsec->sa[sa_idx], 0, sizeof(struct nsim_sa)); > + ipsec->count--; > +} > + > +static bool nsim_ipsec_offload_ok(struct sk_buff *skb, struct xfrm_state *xs) > +{ > + struct net_device *dev = xs->xso.dev; > + struct netdevsim *ns = netdev_priv(dev); And here, but you don't use the dev here, so you can just inline the xs->xso.dev into netdev_priv(). > + struct nsim_ipsec *ipsec = &ns->ipsec; > + > + ipsec->ok++; > + > + return true; > +}
Re: [PATCH v2 net-next] net/sched: add skbprio scheduler
On Sun, Jun 24, 2018 at 11:43:07AM -0400, Jamal Hadi Salim wrote: > On 23/06/18 04:47 PM, Nishanth Devarajan wrote: > [..] > > >+/* Drop the packet at the tail of the lowest priority qdisc. */ > >+lp_qdisc = &q->qdiscs[lp]; > >+to_drop = __skb_dequeue_tail(lp_qdisc); > >+BUG_ON(!to_drop); > >+qdisc_qstats_backlog_dec(sch, to_drop); > >+qdisc_drop(to_drop, sch, to_free); > >+ > > Maybe also increase overlimit stat here? It will keep track > of low prio things dropped because you were congested. > Such a stat helps when debugging or collecting analytics. > > Per Alex's comment, how about: > > --- > Skbprio (SKB Priority Queue) is a queueing discipline that > prioritizes packets according to their skb->priority field. > Under congestion, already-enqueued lower priority packets > will be dropped to make space available for higher priority > packets. Skbprio was conceived as a solution for > denial-of-service defenses that need to route packets with > different priorities as a means to overcome DoS attacks > as described in paper ... > > > cheers, > jamal Sounds good, will make some changes in v3. Thanks, Nishanth
Re: [RFC v2 PATCH 0/4] eBPF and struct scatterlist
On 06/19/2018 11:00 AM, Tushar Dave wrote: This follows up on https://patchwork.ozlabs.org/cover/927050/ where the review feedback was to use bpf_skb_load_bytes() to deal with linear and non-linear skbs. While that feedback is valid and correct, the motivation for this work is to allow eBPF based firewalling for kernel modules that do not always get their packet as an sk_buff from their downlink drivers. One such instance of this use-case is RDS, which can be run both over IB (driver RDMA's a scatterlist to the RDS module) or over TCP (TCP passes an sk_buff to the RDS module) This RFC (call it v2) uses exiting socket filter infrastructure and extend it with new eBPF program type that deals with struct scatterlist. For RDS, the integrated approach treats the scatterlist as the common denominator, and allows the application to write a filter for processing a scatterlist. Details: Patch 1 adds new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER which uses the existing socket filter infrastructure for bpf program attach and load. eBPF program of type BPF_PROG_TYPE_SOCKET_SG_FILTER receives struct scatterlist as bpf context contrast to BPF_PROG_TYPE_SOCKET_FILTER which deals with struct skb. This new eBPF program type allow socket filter to run on packet data that is in form form of struct scatterlist. Patch 2 adds functionality to run BPF_PROG_TYPE_SOCKET_SG_FILTER socket filter program. A bpf helpers bpf_sg_next() is also added so users can retrieve sg elements from scatterlist. Patch 3 adds socket filter eBPF sample program that uses patch 1 and patch 2. The sample program opens an rds socket, attach ebpf program (socksg i.e. BPF_PROG_TYPE_SOCKET_SG_FILTER) to rds socket and uses bpf_sg_next helper to look into sg. For a test, current ebpf program only prints first few bytes from each elements of sg list. Finally, patch 4 allows rds_recv_incoming to invoke socket filter program which deals with scatterlist. 
It would be really helpful to have your thoughts/comments on my direction. I am planning to put together some complex example for filtering RDS traffic using eBPF, and so it would be good to have feedback before I go too far ahead. Thanks. -Tushar Thanks. -Tushar Tushar Dave (4): eBPF: Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER ebpf: Add sg_filter_run and sg helper ebpf: Add sample ebpf program for SOCKET_SG_FILTER rds: invoke socket sg filter attached to rds socket include/linux/bpf_types.h | 1 + include/linux/filter.h| 10 + include/uapi/linux/bpf.h | 17 +- kernel/bpf/syscall.c | 1 + kernel/bpf/verifier.c | 1 + net/core/filter.c | 149 - net/rds/ib.c | 1 + net/rds/ib.h | 1 + net/rds/ib_recv.c | 12 ++ net/rds/rds.h | 2 + net/rds/recv.c| 16 ++ net/rds/tcp.c | 2 + net/rds/tcp.h | 2 + net/rds/tcp_recv.c| 38 samples/bpf/Makefile | 3 + samples/bpf/bpf_load.c| 11 +- samples/bpf/rds_filter_kern.c | 78 +++ samples/bpf/rds_filter_user.c | 339 ++ tools/bpf/bpftool/prog.c | 1 + tools/include/uapi/linux/bpf.h| 17 +- tools/lib/bpf/libbpf.c| 3 + tools/lib/bpf/libbpf.h| 2 + tools/testing/selftests/bpf/bpf_helpers.h | 3 + 23 files changed, 703 insertions(+), 7 deletions(-) create mode 100644 samples/bpf/rds_filter_kern.c create mode 100644 samples/bpf/rds_filter_user.c
Re: [PATCH v2 net-next] net/sched: add skbprio scheduler
On Sat, Jun 23, 2018 at 03:10:32PM -0700, Alexander Duyck wrote: > On Sat, Jun 23, 2018 at 1:47 PM, Nishanth Devarajan > wrote: > > net/sched: add skbprio scheduler > > > > Skbprio (SKB Priority Queue) is a queueing discipline that prioritizes > > packets > > according to their skb->priority field. Although Skbprio can be employed in > > any > > scenario in which a higher skb->priority field means a higher priority > > packet, > > Skbprio was concieved as a solution for denial-of-service defenses that > > need to > > route packets with different priorities. > > Really this description is not very good. Reading it I was thinking to > myself "why do we need this, prio already does this". It wasn't until > I read through the code that I figured out that you are basically > adding dropping of lower priority frames. > OK, I'll take Jamal's suggestion on this and write up a new description. > > > > v2 > > *Use skb->priority field rather than DS field. Rename queueing discipline as > > SKB Priority Queue (previously Gatekeeper Priority Queue). > > > > *Queueing discipline is made classful to expose Skbprio's internal priority > > queues. > > > > Signed-off-by: Nishanth Devarajan > > Reviewed-by: Sachin Paryani > > Reviewed-by: Cody Doucette > > Reviewed-by: Michel Machado > > --- > > include/uapi/linux/pkt_sched.h | 15 ++ > > net/sched/Kconfig | 13 ++ > > net/sched/Makefile | 1 + > > net/sched/sch_skbprio.c| 347 > > + > > 4 files changed, 376 insertions(+) > > create mode 100644 net/sched/sch_skbprio.c > > > > diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h > > index 37b5096..6fd07e8 100644 > > --- a/include/uapi/linux/pkt_sched.h > > +++ b/include/uapi/linux/pkt_sched.h > > @@ -124,6 +124,21 @@ struct tc_fifo_qopt { > > __u32 limit; /* Queue length: bytes for bfifo, packets for pfifo > > */ > > }; > > > > +/* SKBPRIO section */ > > + > > +/* > > + * Priorities go from zero to (SKBPRIO_MAX_PRIORITY - 1). 
> > + * SKBPRIO_MAX_PRIORITY should be at least 64 in order for skbprio to be > > able > > + * to map one to one the DS field of IPV4 and IPV6 headers. > > + * Memory allocation grows linearly with SKBPRIO_MAX_PRIORITY. > > + */ > > + > > +#define SKBPRIO_MAX_PRIORITY 64 > > + > > +struct tc_skbprio_qopt { > > + __u32 limit; /* Queue length in packets. */ > > +}; > > + > > /* PRIO section */ > > > > #define TCQ_PRIO_BANDS 16 > > diff --git a/net/sched/Kconfig b/net/sched/Kconfig > > index a01169f..9ac4b53 100644 > > --- a/net/sched/Kconfig > > +++ b/net/sched/Kconfig > > @@ -240,6 +240,19 @@ config NET_SCH_MQPRIO > > > > If unsure, say N. > > > > +config NET_SCH_SKBPRIO > > + tristate "SKB priority queue scheduler (SKBPRIO)" > > + help > > + Say Y here if you want to use the SKB priority queue > > + scheduler. This schedules packets according to skb->priority, > > + which is useful for request packets in DoS mitigation systems such > > + as Gatekeeper. > > + > > + To compile this driver as a module, choose M here: the module will > > + be called sch_skbprio. > > + > > + If unsure, say N. > > + > > config NET_SCH_CHOKE > > tristate "CHOose and Keep responsive flow scheduler (CHOKE)" > > help > > diff --git a/net/sched/Makefile b/net/sched/Makefile > > index 8811d38..a4d8893 100644 > > --- a/net/sched/Makefile > > +++ b/net/sched/Makefile > > @@ -46,6 +46,7 @@ obj-$(CONFIG_NET_SCH_NETEM) += sch_netem.o > > obj-$(CONFIG_NET_SCH_DRR) += sch_drr.o > > obj-$(CONFIG_NET_SCH_PLUG) += sch_plug.o > > obj-$(CONFIG_NET_SCH_MQPRIO) += sch_mqprio.o > > +obj-$(CONFIG_NET_SCH_SKBPRIO) += sch_skbprio.o > > obj-$(CONFIG_NET_SCH_CHOKE)+= sch_choke.o > > obj-$(CONFIG_NET_SCH_QFQ) += sch_qfq.o > > obj-$(CONFIG_NET_SCH_CODEL)+= sch_codel.o > > diff --git a/net/sched/sch_skbprio.c b/net/sched/sch_skbprio.c > > new file mode 100644 > > index 000..5e89446 > > --- /dev/null > > +++ b/net/sched/sch_skbprio.c > > @@ -0,0 +1,347 @@ > > +/* > > + * net/sched/sch_skbprio.c SKB Priority Queue. 
> > + * > > + * This program is free software; you can redistribute it > > and/or > > + * modify it under the terms of the GNU General Public License > > + * as published by the Free Software Foundation; either version > > + * 2 of the License, or (at your option) any later version. > > + * > > + * Authors:Nishanth Devarajan, > > + * Cody Doucette, > > + * original idea by Michel Machado, Cody Doucette, and Qiaobin > > Fu > > + */ > > + > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > + > > + > > +/* SKB Priority Queue > > + * = > > + * > > + *
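The drop policy the patch above describes (once the qdisc is full, a higher-priority arrival evicts the tail of the lowest-priority queue; anything else is dropped on arrival) can be modelled with per-priority queue lengths alone. A toy sketch of that policy, not the sch_skbprio.c code:

```c
#include <assert.h>
#include <stdbool.h>

#define SKBPRIO_MAX_PRIORITY 64

struct skbprio_model {
	unsigned int qlen[SKBPRIO_MAX_PRIORITY];
	unsigned int total;
	unsigned int limit;
};

static int lowest_nonempty(const struct skbprio_model *q)
{
	for (unsigned int p = 0; p < SKBPRIO_MAX_PRIORITY; p++)
		if (q->qlen[p])
			return (int)p;
	return -1;
}

/* Returns true if the arriving packet was enqueued, false if dropped.
 * When full, a strictly higher-priority arrival evicts the tail of the
 * lowest-priority queue, making room for itself. */
static bool skbprio_model_enqueue(struct skbprio_model *q, unsigned int prio)
{
	if (prio >= SKBPRIO_MAX_PRIORITY)
		prio = SKBPRIO_MAX_PRIORITY - 1;

	if (q->total < q->limit) {
		q->qlen[prio]++;
		q->total++;
		return true;
	}

	int lp = lowest_nonempty(q);
	if (lp < 0 || (unsigned int)lp >= prio)
		return false;           /* arrival is not higher: drop it */

	q->qlen[lp]--;                  /* drop tail of lowest priority */
	q->qlen[prio]++;
	return true;
}
```

Counting the `false` returns and the evictions here is exactly where Jamal's suggested overlimit statistic would hook in.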
[PATCH net-next 6/6] selftests: forwarding: README: Require diagrams
ASCII art diagrams are well suited for presenting the topology that a test uses while being easy to embed directly in the test file itself. They make the information very easy to grasp even for simple topologies, and for more complex ones they are almost essential, as figuring out the interconnects from the script itself proves to be difficult. Therefore state the requirement for topology ASCII art in README. Signed-off-by: Petr Machata Reviewed-by: Ido Schimmel --- tools/testing/selftests/net/forwarding/README | 2 ++ 1 file changed, 2 insertions(+) diff --git a/tools/testing/selftests/net/forwarding/README b/tools/testing/selftests/net/forwarding/README index 4a0964c42860..b8a2af8fcfb7 100644 --- a/tools/testing/selftests/net/forwarding/README +++ b/tools/testing/selftests/net/forwarding/README @@ -46,6 +46,8 @@ Guidelines for Writing Tests o Where possible, reuse an existing topology for different tests instead of recreating the same topology. +o Tests that use anything but the most trivial topologies should include + an ASCII art showing the topology. o Where possible, IPv6 and IPv4 addresses shall conform to RFC 3849 and RFC 5737, respectively. o Where possible, tests shall be written so that they can be reused by -- 2.4.11
[PATCH net-next 3/6] selftests: forwarding: tc_rule_stats_get: Parameterize direction
The GRE multipath tests need stats on an egress counter. Change tc_rule_stats_get() to take direction as an optional argument, with default of ingress. Take the opportunity to change the line continuation character from | to \, moving the | to the start of the next, indented line. Signed-off-by: Petr Machata Reviewed-by: Ido Schimmel --- tools/testing/selftests/net/forwarding/lib.sh | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh index 911d753c4ff0..f94ea4bafa13 100644 --- a/tools/testing/selftests/net/forwarding/lib.sh +++ b/tools/testing/selftests/net/forwarding/lib.sh @@ -383,9 +383,10 @@ tc_rule_stats_get() { local dev=$1; shift local pref=$1; shift + local dir=$1; shift - tc -j -s filter show dev $dev ingress pref $pref | - jq '.[1].options.actions[].stats.packets' + tc -j -s filter show dev $dev ${dir:-ingress} pref $pref \ + | jq '.[1].options.actions[].stats.packets' } mac_get() -- 2.4.11
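The ${dir:-ingress} in the hunk above is the POSIX ${parameter:-word} expansion: if $dir is unset or empty, the word after :- is substituted instead, which is what makes the new argument optional. A quick standalone illustration (show_cmd is a name invented for the example, not a lib.sh helper):

```shell
show_cmd()
{
	local dev=$1
	local pref=$2
	local dir=$3            # optional third argument

	echo "tc -j -s filter show dev $dev ${dir:-ingress} pref $pref"
}

show_cmd swp1 111           # dir omitted -> "ingress" substituted
show_cmd swp1 111 egress    # dir given   -> used as-is
```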
[PATCH net-next 5/6] selftests: forwarding: Test multipath tunneling
Add a GRE-tunneling test such that there are two tunnels involved, with a multipath route listing both as next hops. Similarly to router_multipath.sh, test that the distribution of traffic to the tunnels honors the configured weights. Signed-off-by: Petr Machata Reviewed-by: Ido Schimmel --- .../selftests/net/forwarding/gre_multipath.sh | 354 + 1 file changed, 354 insertions(+) create mode 100755 tools/testing/selftests/net/forwarding/gre_multipath.sh diff --git a/tools/testing/selftests/net/forwarding/gre_multipath.sh b/tools/testing/selftests/net/forwarding/gre_multipath.sh new file mode 100755 index ..982cc8c23200 --- /dev/null +++ b/tools/testing/selftests/net/forwarding/gre_multipath.sh @@ -0,0 +1,354 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 + +# Test traffic distribution when a wECMP route forwards traffic to two GRE +# tunnels. +# +# +-+ +# | H1 | +# | $h1 + | +# | 192.0.2.1/28 | | +# | 2001:db8:1::1/64 | | +# +---|-+ +# | +# +---|+ +# | SW1 || +# | $ol1 +| +# | 192.0.2.2/28 | +# | 2001:db8:1::2/64 | +# || +# | + g1a (gre) + g1b (gre) | +# |loc=192.0.2.65 loc=192.0.2.81 | +# |rem=192.0.2.66 --. rem=192.0.2.82 --. | +# |tos=inherit | tos=inherit | | +# | .--'| | +# | |.--' | +# | vv| +# | + $ul1.111 (vlan)+ $ul1.222 (vlan)| +# | | 192.0.2.129/28 | 192.0.2.145/28 | +# | \ / | +# |\/ | +# || | +# |+ $ul1 | +# +|---+ +# | +# +|---+ +# | SW2+ $ul2 | +# | ___| | +# |/\ | +# | / \ | +# | + $ul2.111 (vlan)+ $ul2.222 (vlan)| +# | ^ 192.0.2.130/28 ^ 192.0.2.146/28 | +# | ||| +# | |'--. 
| +# | '--.| | +# | + g2a (gre)| + g2b (gre)| | +# |loc=192.0.2.66 | loc=192.0.2.82 | | +# |rem=192.0.2.65 --' rem=192.0.2.81 --' | +# |tos=inherit tos=inherit| +# || +# | $ol2 +| +# | 192.0.2.17/28 || +# | 2001:db8:2::1/64 || +# +---|+ +# | +# +---|-+ +# | H2| | +# | $h2 + | +# | 192.0.2.18/28 | +# | 2001:db8:2::2/64 | +# +-+ + +ALL_TESTS=" + ping_ipv4 + ping_ipv6 + multipath_ipv4 + multipath_ipv6 + multipath_ipv6_l4 +" + +NUM_NETIFS=6 +source lib.sh + +h1_create() +{ + simple_if_init $h1 192.0.2.1/28 2001:db8:1::1/64 + ip route add vrf v$h1 192.0.2.16/28 via 192.0.2.2 + ip route add vrf v$h1 2001:db8:2::/64 via 2001:db8:1::2 +} + +h1_destroy() +{ + ip route del vrf v$h1 2001:db8:2::/64 via 2001:db8:1::2 + ip route del vrf v$h1 192.0.2.16/28 via 192.0.2.2 + simple_if_fini $h1 192.0.2.1/28 +} + +sw1_create() +{ + simple_if_init $ol1 192.0.2.2/28 2001:db8:1::2/64 + __simple_if_init $ul1 v$ol1 + vlan_create $ul1 111 v$ol1 192.0.2.129/28 + vlan_create $ul1 222 v$ol1 192.0.2.145/28 + + tunnel_create g1a gre 192.0.2.65 192.0.2.66 tos inherit dev v$ol1 + __simple_if_init g1a v$ol1 192.0.2.65/32 + ip route add vrf v$ol1 192.0.2.66/32 via 192.0.2.130 + + tunnel_create g1b gre 192.0.2.81 192.0.2.82 tos inherit dev v$ol1 + __simple_if_init g1b v$ol1 192.0.2.81/32 + ip route add vrf v$ol1 192.0.2.82/32 via 192.0.2.146 + + ip route add vrf v$ol1 192.0.2.16/28 \ + nexthop dev g1a \ + nexthop dev g1b + ip route add vrf v$ol1 2001:db8:2::/64 \ + nexthop dev g1a \ + nexthop dev g1b + + tc qdisc add dev $ul1 clsact + tc filter add dev $ul1 egress pref 111 prot 802.1q \ + flower vlan_id 111 action pass + tc filter add dev $ul1 egress pref 222 prot 802.1q \ + flower vlan_id 222 action pass +} + +sw1_destroy() +{ + tc qdisc del dev $ul1 clsact + + ip route del vrf v$ol1 2001:db8:2::/64 + ip ro
[PATCH net-next 4/6] selftests: forwarding: lib: Extract interface-init functions
The function simple_if_init() does two things: it creates a VRF, then moves an interface into this VRF and configures addresses. The latter comes in handy when adding more interfaces into a VRF later on. The situation is similar for simple_if_fini(). Therefore split the interface remastering and address de/initialization logic to a new pair of helpers __simple_if_init() / __simple_if_fini(), and defer to these helpers from simple_if_init() and simple_if_fini(). Signed-off-by: Petr Machata Reviewed-by: Ido Schimmel --- tools/testing/selftests/net/forwarding/lib.sh | 32 +-- 1 file changed, 25 insertions(+), 7 deletions(-) diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh index f94ea4bafa13..1dfdf14894e2 100644 --- a/tools/testing/selftests/net/forwarding/lib.sh +++ b/tools/testing/selftests/net/forwarding/lib.sh @@ -287,6 +287,29 @@ __addr_add_del() done } +__simple_if_init() +{ + local if_name=$1; shift + local vrf_name=$1; shift + local addrs=("${@}") + + ip link set dev $if_name master $vrf_name + ip link set dev $if_name up + + __addr_add_del $if_name add "${addrs[@]}" +} + +__simple_if_fini() +{ + local if_name=$1; shift + local addrs=("${@}") + + __addr_add_del $if_name del "${addrs[@]}" + + ip link set dev $if_name down + ip link set dev $if_name nomaster +} + simple_if_init() { local if_name=$1 @@ -298,11 +321,8 @@ simple_if_init() array=("${@}") vrf_create $vrf_name - ip link set dev $if_name master $vrf_name ip link set dev $vrf_name up - ip link set dev $if_name up - - __addr_add_del $if_name add "${array[@]}" + __simple_if_init $if_name $vrf_name "${array[@]}" } simple_if_fini() @@ -315,9 +335,7 @@ simple_if_fini() vrf_name=v$if_name array=("${@}") - __addr_add_del $if_name del "${array[@]}" - - ip link set dev $if_name down + __simple_if_fini $if_name "${array[@]}" vrf_destroy $vrf_name } -- 2.4.11
[PATCH net-next 1/6] selftests: forwarding: Move multipath_eval() to lib.sh
This function will be useful for the GRE multipath test that is coming later. Signed-off-by: Petr Machata Reviewed-by: Ido Schimmel --- tools/testing/selftests/net/forwarding/lib.sh | 39 ++ .../selftests/net/forwarding/router_multipath.sh | 39 -- 2 files changed, 39 insertions(+), 39 deletions(-) diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh index 7b18a53aa556..7fae805147ae 100644 --- a/tools/testing/selftests/net/forwarding/lib.sh +++ b/tools/testing/selftests/net/forwarding/lib.sh @@ -557,6 +557,45 @@ tests_run() done } +multipath_eval() +{ + local desc="$1" + local weight_rp12=$2 + local weight_rp13=$3 + local packets_rp12=$4 + local packets_rp13=$5 + local weights_ratio packets_ratio diff + + RET=0 + + if [[ "$packets_rp12" -eq "0" || "$packets_rp13" -eq "0" ]]; then + check_err 1 "Packet difference is 0" + log_test "Multipath" + log_info "Expected ratio $weights_ratio" + return + fi + + if [[ "$weight_rp12" -gt "$weight_rp13" ]]; then + weights_ratio=$(echo "scale=2; $weight_rp12 / $weight_rp13" \ + | bc -l) + packets_ratio=$(echo "scale=2; $packets_rp12 / $packets_rp13" \ + | bc -l) + else + weights_ratio=$(echo "scale=2; $weight_rp13 / $weight_rp12" | \ + bc -l) + packets_ratio=$(echo "scale=2; $packets_rp13 / $packets_rp12" | \ + bc -l) + fi + + diff=$(echo $weights_ratio - $packets_ratio | bc -l) + diff=${diff#-} + + test "$(echo "$diff / $weights_ratio > 0.15" | bc -l)" -eq 0 + check_err $? 
"Too large discrepancy between expected and measured ratios" + log_test "$desc" + log_info "Expected ratio $weights_ratio Measured ratio $packets_ratio" +} + ## # Tests diff --git a/tools/testing/selftests/net/forwarding/router_multipath.sh b/tools/testing/selftests/net/forwarding/router_multipath.sh index 8b6d0fb6d604..79a209927962 100755 --- a/tools/testing/selftests/net/forwarding/router_multipath.sh +++ b/tools/testing/selftests/net/forwarding/router_multipath.sh @@ -159,45 +159,6 @@ router2_destroy() vrf_destroy "vrf-r2" } -multipath_eval() -{ - local desc="$1" - local weight_rp12=$2 - local weight_rp13=$3 - local packets_rp12=$4 - local packets_rp13=$5 - local weights_ratio packets_ratio diff - - RET=0 - - if [[ "$packets_rp12" -eq "0" || "$packets_rp13" -eq "0" ]]; then - check_err 1 "Packet difference is 0" - log_test "Multipath" - log_info "Expected ratio $weights_ratio" - return - fi - - if [[ "$weight_rp12" -gt "$weight_rp13" ]]; then - weights_ratio=$(echo "scale=2; $weight_rp12 / $weight_rp13" \ - | bc -l) - packets_ratio=$(echo "scale=2; $packets_rp12 / $packets_rp13" \ - | bc -l) - else - weights_ratio=$(echo "scale=2; $weight_rp13 / $weight_rp12" | \ - bc -l) - packets_ratio=$(echo "scale=2; $packets_rp13 / $packets_rp12" | \ - bc -l) - fi - - diff=$(echo $weights_ratio - $packets_ratio | bc -l) - diff=${diff#-} - - test "$(echo "$diff / $weights_ratio > 0.15" | bc -l)" -eq 0 - check_err $? "Too large discrepancy between expected and measured ratios" - log_test "$desc" - log_info "Expected ratio $weights_ratio Measured ratio $packets_ratio" -} - multipath4_test() { local desc="$1" -- 2.4.11
[PATCH net-next 2/6] selftests: forwarding: multipath_eval(): Improve style
- Change the indentation of the function body from 7 spaces to one tab. - Move initialization of weights_ratio up so that it can be referenced from the error message about packet difference being zero. - Move the |'s consistently to the beginning of the continuation line, and reindent. Signed-off-by: Petr Machata Reviewed-by: Ido Schimmel --- tools/testing/selftests/net/forwarding/lib.sh | 74 ++- 1 file changed, 39 insertions(+), 35 deletions(-) diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh index 7fae805147ae..911d753c4ff0 100644 --- a/tools/testing/selftests/net/forwarding/lib.sh +++ b/tools/testing/selftests/net/forwarding/lib.sh @@ -559,41 +559,45 @@ tests_run() multipath_eval() { - local desc="$1" - local weight_rp12=$2 - local weight_rp13=$3 - local packets_rp12=$4 - local packets_rp13=$5 - local weights_ratio packets_ratio diff - - RET=0 - - if [[ "$packets_rp12" -eq "0" || "$packets_rp13" -eq "0" ]]; then - check_err 1 "Packet difference is 0" - log_test "Multipath" - log_info "Expected ratio $weights_ratio" - return - fi - - if [[ "$weight_rp12" -gt "$weight_rp13" ]]; then - weights_ratio=$(echo "scale=2; $weight_rp12 / $weight_rp13" \ - | bc -l) - packets_ratio=$(echo "scale=2; $packets_rp12 / $packets_rp13" \ - | bc -l) - else - weights_ratio=$(echo "scale=2; $weight_rp13 / $weight_rp12" | \ - bc -l) - packets_ratio=$(echo "scale=2; $packets_rp13 / $packets_rp12" | \ - bc -l) - fi - - diff=$(echo $weights_ratio - $packets_ratio | bc -l) - diff=${diff#-} - - test "$(echo "$diff / $weights_ratio > 0.15" | bc -l)" -eq 0 - check_err $? 
"Too large discrepancy between expected and measured ratios" - log_test "$desc" - log_info "Expected ratio $weights_ratio Measured ratio $packets_ratio" + local desc="$1" + local weight_rp12=$2 + local weight_rp13=$3 + local packets_rp12=$4 + local packets_rp13=$5 + local weights_ratio packets_ratio diff + + RET=0 + + if [[ "$weight_rp12" -gt "$weight_rp13" ]]; then + weights_ratio=$(echo "scale=2; $weight_rp12 / $weight_rp13" \ + | bc -l) + else + weights_ratio=$(echo "scale=2; $weight_rp13 / $weight_rp12" \ + | bc -l) + fi + + if [[ "$packets_rp12" -eq "0" || "$packets_rp13" -eq "0" ]]; then + check_err 1 "Packet difference is 0" + log_test "Multipath" + log_info "Expected ratio $weights_ratio" + return + fi + + if [[ "$weight_rp12" -gt "$weight_rp13" ]]; then + packets_ratio=$(echo "scale=2; $packets_rp12 / $packets_rp13" \ + | bc -l) + else + packets_ratio=$(echo "scale=2; $packets_rp13 / $packets_rp12" \ + | bc -l) + fi + + diff=$(echo $weights_ratio - $packets_ratio | bc -l) + diff=${diff#-} + + test "$(echo "$diff / $weights_ratio > 0.15" | bc -l)" -eq 0 + check_err $? "Too large discrepancy between expected and measured ratios" + log_test "$desc" + log_info "Expected ratio $weights_ratio Measured ratio $packets_ratio" } ## -- 2.4.11
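The tolerance check at the heart of multipath_eval() can be exercised on its own. The sketch below mirrors the bc arithmetic from the function; the helper name is illustrative, and it assumes bc is available, as lib.sh itself already does:

```shell
# Accept the measured ratio when it is within 15% of the expected one,
# the same criterion multipath_eval() applies via bc.
ratio_within_tolerance()
{
	local weights_ratio=$1; shift
	local packets_ratio=$1; shift
	local diff

	diff=$(echo "$weights_ratio - $packets_ratio" | bc -l)
	diff=${diff#-}	# bc has no abs(); strip the sign instead

	test "$(echo "$diff / $weights_ratio > 0.15" | bc -l)" -eq 0
}

ratio_within_tolerance 2.00 2.10 && echo "within tolerance"
ratio_within_tolerance 2.00 3.00 || echo "too large a discrepancy"
```

Note that bc relational expressions evaluate to 0 or 1, so the outer test compares against 0 to flip that into a shell exit status.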
[PATCH net-next 0/6] Multipath tests for tunnel devices
This patchset adds a test for ECMP and weighted ECMP between two GRE tunnels. In patches #1 and #2, the function multipath_eval() is first moved from router_multipath.sh to lib.sh for ease of reuse, and then fixed up. In patch #3, the function tc_rule_stats_get() is parameterized to be useful for egress rules as well. In patch #4, a new function __simple_if_init() is extracted from simple_if_init(). This covers the logic that needs to be done for the usual interface: VRF migration, upping and installation of IP addresses. Patch #5 then adds the test itself. Additionally in patch #6, a requirement to add diagrams to selftests is documented. Petr Machata (6): selftests: forwarding: Move multipath_eval() to lib.sh selftests: forwarding: multipath_eval(): Improve style selftests: forwarding: tc_rule_stats_get: Parameterize direction selftests: forwarding: lib: Extract interface-init functions selftests: forwarding: Test multipath tunneling selftests: forwarding: README: Require diagrams tools/testing/selftests/net/forwarding/README | 2 + .../selftests/net/forwarding/gre_multipath.sh | 354 + tools/testing/selftests/net/forwarding/lib.sh | 80 - .../selftests/net/forwarding/router_multipath.sh | 39 --- 4 files changed, 427 insertions(+), 48 deletions(-) create mode 100755 tools/testing/selftests/net/forwarding/gre_multipath.sh -- 2.4.11
[PATCH v2 net-next 2/4] selftests: rtnetlink: use dummydev as a test device
We really shouldn't mess with local system settings, so let's use the already created dummy device instead for ipsec testing. Oh, and let's put the temp file into a proper directory. Signed-off-by: Shannon Nelson --- tools/testing/selftests/net/rtnetlink.sh | 15 +++ 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/tools/testing/selftests/net/rtnetlink.sh b/tools/testing/selftests/net/rtnetlink.sh index 261a981..15948cf 100755 --- a/tools/testing/selftests/net/rtnetlink.sh +++ b/tools/testing/selftests/net/rtnetlink.sh @@ -523,21 +523,19 @@ kci_test_macsec() kci_test_ipsec() { ret=0 - - # find an ip address on this machine and make up a destination - srcip=`ip -o addr | awk '/inet / { print $4; }' | grep -v "^127" | head -1 | cut -f1 -d/` - net=`echo $srcip | cut -f1-3 -d.` - base=`echo $srcip | cut -f4 -d.` - dstip="$net."`expr $base + 1` - algo="aead rfc4106(gcm(aes)) 0x3132333435363738393031323334353664636261 128" + srcip=192.168.123.1 + dstip=192.168.123.2 + spi=7 + + ip addr add $srcip dev $devdummy # flush to be sure there's nothing configured ip x s flush ; ip x p flush check_err $? # start the monitor in the background - tmpfile=`mktemp ipsectestXXX` + tmpfile=`mktemp /var/run/ipsectestXXX` mpid=`(ip x m > $tmpfile & echo $!) 2>/dev/null` sleep 0.2 @@ -601,6 +599,7 @@ kci_test_ipsec() check_err $? ip x p flush check_err $? + ip addr del $srcip/32 dev $devdummy if [ $ret -ne 0 ]; then echo "FAIL: ipsec" -- 2.7.4
[PATCH v2 net-next 4/4] selftests: rtnetlink: add ipsec offload API test
Using the netdevsim as a device for testing, try out the XFRM commands for setting up IPsec hardware offloads. Signed-off-by: Shannon Nelson --- tools/testing/selftests/net/rtnetlink.sh | 114 +++ 1 file changed, 114 insertions(+) diff --git a/tools/testing/selftests/net/rtnetlink.sh b/tools/testing/selftests/net/rtnetlink.sh index 15948cf..9e1a82e 100755 --- a/tools/testing/selftests/net/rtnetlink.sh +++ b/tools/testing/selftests/net/rtnetlink.sh @@ -608,6 +608,119 @@ kci_test_ipsec() echo "PASS: ipsec" } +#--- +# Example commands +# ip x s add proto esp src 14.0.0.52 dst 14.0.0.70 \ +#spi 0x07 mode transport reqid 0x07 replay-window 32 \ +#aead 'rfc4106(gcm(aes))' 1234567890123456dcba 128 \ +#sel src 14.0.0.52/24 dst 14.0.0.70/24 +#offload dev sim1 dir out +# ip x p add dir out src 14.0.0.52/24 dst 14.0.0.70/24 \ +#tmpl proto esp src 14.0.0.52 dst 14.0.0.70 \ +#spi 0x07 mode transport reqid 0x07 +# +#--- +kci_test_ipsec_offload() +{ + ret=0 + algo="aead rfc4106(gcm(aes)) 0x3132333435363738393031323334353664636261 128" + srcip=192.168.123.3 + dstip=192.168.123.4 + dev=simx1 + sysfsd=/sys/kernel/debug/netdevsim/$dev + sysfsf=$sysfsd/ipsec + + # setup netdevsim since dummydev doesn't have offload support + modprobe netdevsim + check_err $? + if [ $ret -ne 0 ]; then + echo "FAIL: ipsec_offload can't load netdevsim" + return 1 + fi + + ip link add $dev type netdevsim + ip addr add $srcip dev $dev + ip link set $dev up + if [ ! -d $sysfsd ] ; then + echo "FAIL: ipsec_offload can't create device $dev" + return 1 + fi + if [ ! -f $sysfsf ] ; then + echo "FAIL: ipsec_offload netdevsim doesn't support IPsec offload" + return 1 + fi + + # flush to be sure there's nothing configured + ip x s flush ; ip x p flush + + # create offloaded SAs, both in and out + ip x p add dir out src $srcip/24 dst $dstip/24 \ + tmpl proto esp src $srcip dst $dstip spi 9 \ + mode transport reqid 42 + check_err $? 
+ ip x p add dir out src $dstip/24 dst $srcip/24 \ + tmpl proto esp src $dstip dst $srcip spi 9 \ + mode transport reqid 42 + check_err $? + + ip x s add proto esp src $srcip dst $dstip spi 9 \ + mode transport reqid 42 $algo sel src $srcip/24 dst $dstip/24 \ + offload dev $dev dir out + check_err $? + ip x s add proto esp src $dstip dst $srcip spi 9 \ + mode transport reqid 42 $algo sel src $dstip/24 dst $srcip/24 \ + offload dev $dev dir in + check_err $? + if [ $ret -ne 0 ]; then + echo "FAIL: ipsec_offload can't create SA" + return 1 + fi + + # does offload show up in ip output + lines=`ip x s list | grep -c "crypto offload parameters: dev $dev dir"` + if [ $lines -ne 2 ] ; then + echo "FAIL: ipsec_offload SA offload missing from list output" + check_err 1 + fi + + # use ping to exercise the Tx path + ping -I $dev -c 3 -W 1 -i 0 $dstip >/dev/null + + # does driver have correct offload info + diff $sysfsf - << EOF +SA count=2 tx=3 +sa[0] tx ipaddr=0x +sa[0]spi=0x0009 proto=0x32 salt=0x61626364 crypt=1 +sa[0]key=0x34333231 38373635 32313039 36353433 +sa[1] rx ipaddr=0x 037ba8c0 +sa[1]spi=0x0009 proto=0x32 salt=0x61626364 crypt=1 +sa[1]key=0x34333231 38373635 32313039 36353433 +EOF + if [ $? -ne 0 ] ; then + echo "FAIL: ipsec_offload incorrect driver data" + check_err 1 + fi + + # does offload get removed from driver + ip x s flush + ip x p flush + lines=`grep -c "SA count=0" $sysfsf` + if [ $lines -ne 1 ] ; then + echo "FAIL: ipsec_offload SA not removed from driver" + check_err 1 + fi + + # clean up any leftovers + ip link del $dev + rmmod netdevsim + + if [ $ret -ne 0 ]; then + echo "FAIL: ipsec_offload" + return 1 + fi + echo "PASS: ipsec_offload" +} + kci_test_gretap() { testns="testns" @@ -862,6 +975,7 @@ kci_test_rtnl() kci_test_encap kci_test_macsec kci_test_ipsec + kci_test_ipsec_offload kci_del_dummy } -- 2.7.4
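The driver-state comparison in the test above leans on a small shell idiom worth calling out: diff can read the expected text from a here-document via the - (stdin) operand. A self-contained demonstration with made-up file contents:

```shell
# Compare a file against an inline expected blob; diff exits 0 on match,
# which is what the ipsec_offload test relies on.
state=$(mktemp)
printf 'SA count=2 tx=3\n' > "$state"

if diff "$state" - << EOF
SA count=2 tx=3
EOF
then
	echo "driver data matches"
else
	echo "driver data mismatch"
fi

rm -f "$state"
```

This keeps the expected driver output inline with the test logic rather than in a separate fixture file.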
[PATCH v2 net-next 3/4] netdevsim: add ipsec offload testing
Implement the IPsec/XFRM offload API for testing. Signed-off-by: Shannon Nelson --- V2 - addressed formatting comments from Jakub Kicinski drivers/net/netdevsim/Makefile| 4 + drivers/net/netdevsim/ipsec.c | 298 ++ drivers/net/netdevsim/netdev.c| 7 + drivers/net/netdevsim/netdevsim.h | 41 ++ 4 files changed, 350 insertions(+) create mode 100644 drivers/net/netdevsim/ipsec.c diff --git a/drivers/net/netdevsim/Makefile b/drivers/net/netdevsim/Makefile index 449b2a1..0fee1d0 100644 --- a/drivers/net/netdevsim/Makefile +++ b/drivers/net/netdevsim/Makefile @@ -13,3 +13,7 @@ endif ifneq ($(CONFIG_NET_DEVLINK),) netdevsim-objs += devlink.o fib.o endif + +ifneq ($(CONFIG_XFRM_OFFLOAD),) +netdevsim-objs += ipsec.o +endif diff --git a/drivers/net/netdevsim/ipsec.c b/drivers/net/netdevsim/ipsec.c new file mode 100644 index 000..1a23426 --- /dev/null +++ b/drivers/net/netdevsim/ipsec.c @@ -0,0 +1,298 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright(c) 2018 Oracle and/or its affiliates. All rights reserved. */ + +#include +#include +#include + +#include "netdevsim.h" + +#define NSIM_IPSEC_AUTH_BITS 128 + +static ssize_t nsim_dbg_netdev_ops_read(struct file *filp, + char __user *buffer, + size_t count, loff_t *ppos) +{ + struct netdevsim *ns = filp->private_data; + struct nsim_ipsec *ipsec = &ns->ipsec; + size_t bufsize; + char *buf, *p; + int len; + int i; + + /* the buffer needed is +* (num SAs * 3 lines each * ~60 bytes per line) + one more line +*/ + bufsize = (ipsec->count * 4 * 60) + 60; + buf = kzalloc(bufsize, GFP_KERNEL); + if (!buf) + return -ENOMEM; + + p = buf; + p += snprintf(p, bufsize - (p - buf), + "SA count=%u tx=%u\n", + ipsec->count, ipsec->tx); + + for (i = 0; i < NSIM_IPSEC_MAX_SA_COUNT; i++) { + struct nsim_sa *sap = &ipsec->sa[i]; + + if (!sap->used) + continue; + + p += snprintf(p, bufsize - (p - buf), + "sa[%i] %cx ipaddr=0x%08x %08x %08x %08x\n", + i, (sap->rx ? 
'r' : 't'), sap->ipaddr[0], + sap->ipaddr[1], sap->ipaddr[2], sap->ipaddr[3]); + p += snprintf(p, bufsize - (p - buf), + "sa[%i]spi=0x%08x proto=0x%x salt=0x%08x crypt=%d\n", + i, be32_to_cpu(sap->xs->id.spi), + sap->xs->id.proto, sap->salt, sap->crypt); + p += snprintf(p, bufsize - (p - buf), + "sa[%i]key=0x%08x %08x %08x %08x\n", + i, sap->key[0], sap->key[1], + sap->key[2], sap->key[3]); + } + + len = simple_read_from_buffer(buffer, count, ppos, buf, p - buf); + + kfree(buf); + return len; +} + +static const struct file_operations ipsec_dbg_fops = { + .owner = THIS_MODULE, + .open = simple_open, + .read = nsim_dbg_netdev_ops_read, +}; + +static int nsim_ipsec_find_empty_idx(struct nsim_ipsec *ipsec) +{ + u32 i; + + if (ipsec->count == NSIM_IPSEC_MAX_SA_COUNT) + return -ENOSPC; + + /* search sa table */ + for (i = 0; i < NSIM_IPSEC_MAX_SA_COUNT; i++) { + if (!ipsec->sa[i].used) + return i; + } + + return -ENOSPC; +} + +static int nsim_ipsec_parse_proto_keys(struct xfrm_state *xs, + u32 *mykey, u32 *mysalt) +{ + const char aes_gcm_name[] = "rfc4106(gcm(aes))"; + struct net_device *dev = xs->xso.dev; + unsigned char *key_data; + char *alg_name = NULL; + int key_len; + + if (!xs->aead) { + netdev_err(dev, "Unsupported IPsec algorithm\n"); + return -EINVAL; + } + + if (xs->aead->alg_icv_len != NSIM_IPSEC_AUTH_BITS) { + netdev_err(dev, "IPsec offload requires %d bit authentication\n", + NSIM_IPSEC_AUTH_BITS); + return -EINVAL; + } + + key_data = &xs->aead->alg_key[0]; + key_len = xs->aead->alg_key_len; + alg_name = xs->aead->alg_name; + + if (strcmp(alg_name, aes_gcm_name)) { + netdev_err(dev, "Unsupported IPsec algorithm - please use %s\n", + aes_gcm_name); + return -EINVAL; + } + + /* 160 accounts for 16 byte key and 4 byte salt */ + if (key_len > NSIM_IPSEC_AUTH_BITS) { + *mysalt = ((u32 *)key_data)[4]; + } else if (key_len == NSIM_IPSEC_AUTH_BITS) { + *mysalt = 0; + } else { + netdev_err(dev, "IPsec hw offload only supports 128 bit keys with opti
[PATCH v2 net-next 0/4] Updates for ipsec selftests
Fix up the existing ipsec selftest and add tests for the ipsec offload driver API. v2: addressed formatting nits in netdevsim from Jakub Kicinski Shannon Nelson (4): selftests: rtnetlink: clear the return code at start of ipsec test selftests: rtnetlink: use dummydev as a test device netdevsim: add ipsec offload testing selftests: rtnetlink: add ipsec offload API test drivers/net/netdevsim/Makefile | 4 + drivers/net/netdevsim/ipsec.c| 345 +++ drivers/net/netdevsim/netdev.c | 7 + drivers/net/netdevsim/netdevsim.h| 37 tools/testing/selftests/net/rtnetlink.sh | 132 +++- 5 files changed, 518 insertions(+), 7 deletions(-) create mode 100644 drivers/net/netdevsim/ipsec.c -- 2.7.4
[PATCH v2 net-next 1/4] selftests: rtnetlink: clear the return code at start of ipsec test
Following the custom of the other functions, clear the global ret code before starting the test so as to not have previously failed tests cause us to think this test has failed. Reported-by: Anders Roxell Signed-off-by: Shannon Nelson --- tools/testing/selftests/net/rtnetlink.sh | 2 ++ 1 file changed, 2 insertions(+) diff --git a/tools/testing/selftests/net/rtnetlink.sh b/tools/testing/selftests/net/rtnetlink.sh index b33a371..261a981 100755 --- a/tools/testing/selftests/net/rtnetlink.sh +++ b/tools/testing/selftests/net/rtnetlink.sh @@ -522,6 +522,8 @@ kci_test_macsec() #--- kci_test_ipsec() { + ret=0 + # find an ip address on this machine and make up a destination srcip=`ip -o addr | awk '/inet / { print $4; }' | grep -v "^127" | head -1 | cut -f1 -d/` net=`echo $srcip | cut -f1-3 -d.` -- 2.7.4
[PATCH v2] fib_rules: match rules based on suppress_* properties too
Two rules with different values of suppress_prefix or suppress_ifgroup are not the same. This fixes an -EEXIST when running: $ ip -4 rule add table main suppress_prefixlength 0 Signed-off-by: Jason A. Donenfeld Fixes: f9d4b0c1e969 ("fib_rules: move common handling of newrule delrule msgs into fib_nl2rule") --- This adds the new condition you mentioned. I'm not sure what you make of DaveM's remark about this not being in the original code, but here is nonetheless the requested change. net/core/fib_rules.c | 8 1 file changed, 8 insertions(+) diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c index 126ffc5bc630..bc8425d81022 100644 --- a/net/core/fib_rules.c +++ b/net/core/fib_rules.c @@ -416,6 +416,14 @@ static struct fib_rule *rule_find(struct fib_rules_ops *ops, if (rule->mark && r->mark != rule->mark) continue; + if (rule->suppress_ifgroup != -1 && + r->suppress_ifgroup != rule->suppress_ifgroup) + continue; + + if (rule->suppress_prefixlen != -1 && + r->suppress_prefixlen != rule->suppress_prefixlen) + continue; + if (rule->mark_mask && r->mark_mask != rule->mark_mask) continue; -- 2.17.1
Re: [PATCH net 2/2] nfp: reject binding to shared blocks
Hi John, Thank you for the patch! Yet something to improve: [auto build test ERROR on net/master] url: https://github.com/0day-ci/linux/commits/Jakub-Kicinski/nfp-MPLS-and-shared-blocks-TC-offload-fixes/20180626-042358 config: i386-randconfig-x001-201825 (attached as .config) compiler: gcc-7 (Debian 7.3.0-16) 7.3.0 reproduce: # save the attached .config to linux build tree make ARCH=i386 All errors (new ones prefixed by >>): drivers/net/ethernet/netronome/nfp/flower/offload.c: In function 'nfp_flower_setup_tc_block': >> drivers/net/ethernet/netronome/nfp/flower/offload.c:634:6: error: implicit >> declaration of function 'tcf_block_shared'; did you mean 'tcf_block_dev'? >> [-Werror=implicit-function-declaration] if (tcf_block_shared(f->block)) ^~~~ tcf_block_dev cc1: some warnings being treated as errors vim +634 drivers/net/ethernet/netronome/nfp/flower/offload.c 625 626 static int nfp_flower_setup_tc_block(struct net_device *netdev, 627 struct tc_block_offload *f) 628 { 629 struct nfp_repr *repr = netdev_priv(netdev); 630 631 if (f->binder_type != TCF_BLOCK_BINDER_TYPE_CLSACT_INGRESS) 632 return -EOPNOTSUPP; 633 > 634 if (tcf_block_shared(f->block)) 635 return -EOPNOTSUPP; 636 637 switch (f->command) { 638 case TC_BLOCK_BIND: 639 return tcf_block_cb_register(f->block, 640 nfp_flower_setup_tc_block_cb, 641 repr, repr); 642 case TC_BLOCK_UNBIND: 643 tcf_block_cb_unregister(f->block, 644 nfp_flower_setup_tc_block_cb, 645 repr); 646 return 0; 647 default: 648 return -EOPNOTSUPP; 649 } 650 } 651 --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: application/gzip
[PATCH 3/4] net: lan78xx: Add support for VLAN tag stripping.
The chip supports stripping the VLAN tag and reporting it in metadata. Complete the support for this. Signed-off-by: Dave Stevenson --- drivers/net/usb/lan78xx.c | 22 ++ 1 file changed, 22 insertions(+) diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c index afe7fa3..f72a8f5 100644 --- a/drivers/net/usb/lan78xx.c +++ b/drivers/net/usb/lan78xx.c @@ -64,6 +64,7 @@ #define DEFAULT_RX_CSUM_ENABLE (true) #define DEFAULT_TSO_CSUM_ENABLE(true) #define DEFAULT_VLAN_FILTER_ENABLE (true) +#define DEFAULT_VLAN_RX_OFFLOAD(true) #define TX_OVERHEAD(8) #define RXW_PADDING2 @@ -2363,6 +2364,11 @@ static int lan78xx_set_features(struct net_device *netdev, pdata->rfe_ctl &= ~(RFE_CTL_ICMP_COE_ | RFE_CTL_IGMP_COE_); } + if (features & NETIF_F_HW_VLAN_CTAG_RX) + pdata->rfe_ctl |= RFE_CTL_VLAN_STRIP_; + else + pdata->rfe_ctl &= ~RFE_CTL_VLAN_STRIP_; + if (features & NETIF_F_HW_VLAN_CTAG_FILTER) pdata->rfe_ctl |= RFE_CTL_VLAN_FILTER_; else @@ -2976,6 +2982,9 @@ static int lan78xx_bind(struct lan78xx_net *dev, struct usb_interface *intf) if (DEFAULT_TSO_CSUM_ENABLE) dev->net->features |= NETIF_F_TSO | NETIF_F_TSO6 | NETIF_F_SG; + if (DEFAULT_VLAN_RX_OFFLOAD) + dev->net->features |= NETIF_F_HW_VLAN_CTAG_RX; + if (DEFAULT_VLAN_FILTER_ENABLE) dev->net->features |= NETIF_F_HW_VLAN_CTAG_FILTER; @@ -3052,6 +3061,16 @@ static void lan78xx_rx_csum_offload(struct lan78xx_net *dev, } } +static void lan78xx_rx_vlan_offload(struct lan78xx_net *dev, + struct sk_buff *skb, + u32 rx_cmd_a, u32 rx_cmd_b) +{ + if ((dev->net->features & NETIF_F_HW_VLAN_CTAG_RX) && + (rx_cmd_a & RX_CMD_A_FVTG_)) + __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), + (rx_cmd_b & 0x)); +} + static void lan78xx_skb_return(struct lan78xx_net *dev, struct sk_buff *skb) { int status; @@ -3116,6 +3135,8 @@ static int lan78xx_rx(struct lan78xx_net *dev, struct sk_buff *skb) if (skb->len == size) { lan78xx_rx_csum_offload(dev, skb, rx_cmd_a, rx_cmd_b); + lan78xx_rx_vlan_offload(dev, skb, + rx_cmd_a, rx_cmd_b); 
skb_trim(skb, skb->len - 4); /* remove fcs */ skb->truesize = size + sizeof(struct sk_buff); @@ -3134,6 +3155,7 @@ static int lan78xx_rx(struct lan78xx_net *dev, struct sk_buff *skb) skb_set_tail_pointer(skb2, size); lan78xx_rx_csum_offload(dev, skb2, rx_cmd_a, rx_cmd_b); + lan78xx_rx_vlan_offload(dev, skb2, rx_cmd_a, rx_cmd_b); skb_trim(skb2, skb2->len - 4); /* remove fcs */ skb2->truesize = size + sizeof(struct sk_buff); -- 2.7.4
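As an aside, the tag the hardware reports lives in the low bits of the RX metadata word rx_cmd_b. A toy shell sketch of pulling the TCI out with a mask — the mask constant in the posted diff appears truncated, so the 16-bit width used here is an assumption based on the 802.1Q TCI size:

```shell
# Extract the presumed 16-bit VLAN TCI from the low bits of an RX
# metadata word (the width is an assumption; see the lead-in).
extract_vlan_tci()
{
	local rx_cmd_b=$1

	printf '%d\n' $(( rx_cmd_b & 0xffff ))
}

extract_vlan_tci $(( 0x00120022 ))
```

In the driver itself this value is handed to __vlan_hwaccel_put_tag() together with the 802.1Q ethertype, so the stack sees the tag in skb metadata rather than in the payload.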
Re: [PATCH v2 net-next] net/sched: add skbprio scheduler
On Sat, Jun 23, 2018 at 02:43:16PM -0700, Cong Wang wrote: > On Sat, Jun 23, 2018 at 1:47 PM, Nishanth Devarajan > wrote: > > diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h > > index 37b5096..6fd07e8 100644 > > --- a/include/uapi/linux/pkt_sched.h > > +++ b/include/uapi/linux/pkt_sched.h > ... > > +#define SKBPRIO_MAX_PRIORITY 64 > > + > > +struct tc_skbprio_qopt { > > + __u32 limit; /* Queue length in packets. */ > > +}; > > > Since this is just an integer, you can just make it NLA_U32 instead > of a struct? > > Making it NLA_U32, wouldn't that be incurring a nla_policy struct in the code? I also feel uneasy that we'd be straying from the convention of having a tc qopt struct to pass in essential parameters from userspace. > > +static int skbprio_change(struct Qdisc *sch, struct nlattr *opt, > > + struct netlink_ext_ack *extack) > > +{ > > + struct skbprio_sched_data *q = qdisc_priv(sch); > > + struct tc_skbprio_qopt *ctl = nla_data(opt); > > + const unsigned int min_limit = 1; > > + > > + if (ctl->limit == (typeof(ctl->limit))-1) > > + q->max_limit = max(qdisc_dev(sch)->tx_queue_len, min_limit); > > + else if (ctl->limit < min_limit || > > + ctl->limit > qdisc_dev(sch)->tx_queue_len) > > + return -EINVAL; > > + else > > + q->max_limit = ctl->limit; > > + > > + return 0; > > +} > > Isn't q->max_limit the same as sch->limit? > q->max_limit was intended to represent the maximum limit that Skbprio could accommodate, i.e. the tx queue len of the device attached to the qdisc, used to check the limit parameter passed from userspace. I'll correct this in v3. > Also, please avoid dev->tx_queue_len here, it may change > independently of your qdisc change, unless you want to implement > ops->change_tx_queue_len(). OK, will make this change.
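The limit handling under discussion reduces to a small piece of arithmetic. A shell sketch of the intended validation — names are illustrative, and note the kernel code compares against a u32 where -1 wraps to the maximum value, which the sketch models with a literal -1:

```shell
# -1 means "fall back to tx_queue_len (but at least 1)"; any other
# value must lie within [1, tx_queue_len] or be rejected.
validate_limit()
{
	local limit=$1; shift
	local tx_queue_len=$1; shift
	local min_limit=1

	if [ "$limit" -eq -1 ]; then
		echo $(( tx_queue_len > min_limit ? tx_queue_len : min_limit ))
	elif [ "$limit" -lt "$min_limit" ] || [ "$limit" -gt "$tx_queue_len" ]; then
		return 1
	else
		echo "$limit"
	fi
}

validate_limit -1 1000	# falls back to the device queue length
validate_limit 64 1000	# in range, accepted as-is
```

Cong's objection is precisely that tx_queue_len is read once here but can change later, which is why v3 is expected to drop the dependency or implement ops->change_tx_queue_len().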
[PATCH net-next] selftests: forwarding: mirror_gre_vlan_bridge_1q: Unset rp_filter
The IP addresses of tunnel endpoint at H3 are set at the VLAN device $h3.555. Therefore when test_gretap_untagged_egress() sets vlan 555 to egress untagged at $swp3, $h3's rp_filter rejects these packets. The test then spuriously fails. Therefore turn off net.ipv4.conf.{all, $h3}.rp_filter. Fixes: 9c7c8a82442c ("selftests: forwarding: mirror_gre_vlan_bridge_1q: Add more tests") Signed-off-by: Petr Machata Reviewed-by: Ido Schimmel --- .../selftests/net/forwarding/mirror_gre_vlan_bridge_1q.sh| 9 + 1 file changed, 9 insertions(+) diff --git a/tools/testing/selftests/net/forwarding/mirror_gre_vlan_bridge_1q.sh b/tools/testing/selftests/net/forwarding/mirror_gre_vlan_bridge_1q.sh index 5dbc7a08f4bd..1ac5038ae256 100755 --- a/tools/testing/selftests/net/forwarding/mirror_gre_vlan_bridge_1q.sh +++ b/tools/testing/selftests/net/forwarding/mirror_gre_vlan_bridge_1q.sh @@ -39,6 +39,12 @@ setup_prepare() swp3=${NETIFS[p5]} h3=${NETIFS[p6]} + # gt4's remote address is at $h3.555, not $h3. Thus the packets arriving + # directly to $h3 for test_gretap_untagged_egress() are rejected by + # rp_filter and the test spuriously fails. + sysctl_set net.ipv4.conf.all.rp_filter 0 + sysctl_set net.ipv4.conf.$h3.rp_filter 0 + vrf_prepare mirror_gre_topo_create @@ -65,6 +71,9 @@ cleanup() mirror_gre_topo_destroy vrf_cleanup + + sysctl_restore net.ipv4.conf.$h3.rp_filter + sysctl_restore net.ipv4.conf.all.rp_filter } test_vlan_match() -- 2.4.11
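The sysctl_set()/sysctl_restore() pairing used above follows a simple save-then-restore discipline. A simplified sketch operating on a plain file instead of a real key under /proc/sys — the actual lib.sh helpers additionally stack saved values so that nested sets unwind correctly:

```shell
# Save the current value before overwriting, so cleanup() can put the
# original back. A single saved slot; the real helpers keep a stack.
saved_value=""

value_set()
{
	local path=$1; shift
	local new=$1; shift

	saved_value=$(cat "$path")
	echo "$new" > "$path"
}

value_restore()
{
	local path=$1; shift

	echo "$saved_value" > "$path"
}

knob=$(mktemp)
echo 1 > "$knob"	# pretend rp_filter starts enabled
value_set "$knob" 0
value_restore "$knob"
rm -f "$knob"
```

Restoring in cleanup() in reverse order of the sets, as the patch does, is what keeps the host's rp_filter configuration intact after the test run.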
Re: [PATCH net-next 3/4] netdevsim: add ipsec offload testing
On Mon, 25 Jun 2018 15:37:10 -0700, Shannon Nelson wrote: > On 6/22/2018 9:07 PM, Jakub Kicinski wrote: > > On Fri, 22 Jun 2018 17:31:37 -0700, Shannon Nelson wrote: > >> Implement the IPsec/XFRM offload API for testing. > >> > >> Signed-off-by: Shannon Nelson > >> +#define NSIM_IPSEC_AUTH_BITS 128 > >> + > >> +/** > >> + * nsim_ipsec_dbg_read - read for ipsec data > >> + * @filp: the opened file > >> + * @buffer: where to write the data for the user to read > >> + * @count: the size of the user's buffer > >> + * @ppos: file position offset > >> + **/ > >> +static ssize_t nsim_dbg_netdev_ops_read(struct file *filp, > > > > Doesn't match the kdoc. Please run > > > > ./scripts/kernel-doc -none $file > > > > if you want kdoc. Although IMHO you may as well drop the kdoc, your > > code is quite self explanatory and local. > > By adding -v to that I got a couple of warnings that I didn't include > the Return information - is that what you were commenting on? The rest > seems acceptable to the script I'm using from the net-next tree. Hm, strange. Two things: first kdoc requires () after function name, second the function is called nsim_dbg_netdev_ops_read() while the doc refers to nsim_ipsec_dbg_read(). Perhaps the combination of the two makes the script miss the problem. 
> >> + char __user *buffer, > >> + size_t count, loff_t *ppos) > >> +{ > >> + struct netdevsim *ns = filp->private_data; > >> + struct nsim_ipsec *ipsec = &ns->ipsec; > >> + size_t bufsize; > >> + char *buf, *p; > >> + int len; > >> + int i; > >> + > >> + /* don't allow partial reads */ > >> + if (*ppos != 0) > >> + return 0; > >> + > >> + /* the buffer needed is > >> + * (num SAs * 3 lines each * ~60 bytes per line) + one more line > >> + */ > >> + bufsize = (ipsec->count * 4 * 60) + 60; > >> + buf = kzalloc(bufsize, GFP_KERNEL); > >> + if (!buf) > >> + return -ENOMEM; > >> + > >> + p = buf; > >> + p += snprintf(p, bufsize - (p - buf), > >> +"SA count=%u tx=%u\n", > >> +ipsec->count, ipsec->tx); > >> + > >> + for (i = 0; i < NSIM_IPSEC_MAX_SA_COUNT; i++) { > >> + struct nsim_sa *sap = &ipsec->sa[i]; > >> + > >> + if (!sap->used) > >> + continue; > >> + > >> + p += snprintf(p, bufsize - (p - buf), > >> +"sa[%i] %cx ipaddr=0x%08x %08x %08x %08x\n", > >> +i, (sap->rx ? 'r' : 't'), sap->ipaddr[0], > >> +sap->ipaddr[1], sap->ipaddr[2], sap->ipaddr[3]); > >> + p += snprintf(p, bufsize - (p - buf), > >> +"sa[%i]spi=0x%08x proto=0x%x salt=0x%08x > >> crypt=%d\n", > >> +i, be32_to_cpu(sap->xs->id.spi), > >> +sap->xs->id.proto, sap->salt, sap->crypt); > >> + p += snprintf(p, bufsize - (p - buf), > >> +"sa[%i]key=0x%08x %08x %08x %08x\n", > >> +i, sap->key[0], sap->key[1], > >> +sap->key[2], sap->key[3]); > >> + } > >> + > >> + len = simple_read_from_buffer(buffer, count, ppos, buf, p - buf); > > > > Why not seq_file for this? > > Why bother with more interface code? This is useful enough to support > the API testing needed. No objection on this, seq_file is less error prone, but I don't mind. FWIW you can drop the *ppos == 0 requirement, simple_read_from_buffer() will handle other cases just fine. 
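Jakub's point that the `*ppos == 0` check can be dropped follows from how simple_read_from_buffer() behaves: it copies at most `count` bytes starting at the current offset, advances the offset, and returns 0 at EOF. A hedged Python model of those semantics (not the kernel implementation; user-copy and negative-offset error handling are omitted):

```python
def simple_read_from_buffer(count, ppos, available):
    """Model of simple_read_from_buffer(): ppos is a one-element list
    standing in for the loff_t pointer. Returns (chunk, length)."""
    pos = ppos[0]
    if pos >= len(available):
        return b"", 0              # EOF: nothing left to read
    chunk = available[pos:pos + count]
    ppos[0] = pos + len(chunk)     # advance the file position
    return chunk, len(chunk)

ppos = [0]
data = b"SA count=2 tx=5\n"
first, n1 = simple_read_from_buffer(8, ppos, data)    # partial read
second, n2 = simple_read_from_buffer(100, ppos, data) # rest of buffer
third, n3 = simple_read_from_buffer(100, ppos, data)  # EOF
```

So even with a non-zero `*ppos`, the caller gets a well-formed partial read rather than garbage, which is why the early-return guard is unnecessary.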
> >> + kfree(buf); > >> + return len; > >> +} > >> + > >> +static const struct file_operations ipsec_dbg_fops = { > >> + .owner = THIS_MODULE, > >> + .open = simple_open, > >> + .read = nsim_dbg_netdev_ops_read, > >> +}; > >> + > >> +/** > >> + * nsim_ipsec_find_empty_idx - find the first unused security parameter > >> index > >> + * @ipsec: pointer to ipsec struct > >> + **/ > >> +static int nsim_ipsec_find_empty_idx(struct nsim_ipsec *ipsec) > >> +{ > >> + u32 i; > >> + > >> + if (ipsec->count == NSIM_IPSEC_MAX_SA_COUNT) > >> + return -ENOSPC; > >> + > >> + /* search sa table */ > >> + for (i = 0; i < NSIM_IPSEC_MAX_SA_COUNT; i++) { > >> + if (!ipsec->sa[i].used) > >> + return i; > >> + } > >> + > >> + return -ENOSPC; > > > > FWIW I personally find bitmaps and find_first_zero_bit() etc. nice and > > concise for a small ID allocator, but no objection to open coding. > > Sure, we could add a parallel bitmap data structure to track usage of > our array elements, and probably would for a much larger array so as to > lessen the impact of a serial search. But, since this is a short array > for simple testing purposes, the search time is minimal so I think the > simple code is fine. Ack, no objection. > > > >> + } else if (key_len == 128) { > >> + *mysalt = 0; > >> + } else { > >> + netdev_err(dev, "IPsec hw offload only supports 1
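The two ID-allocation styles discussed above (the open-coded scan in the patch versus the `find_first_zero_bit()` alternative Jakub mentions) can be sketched in a few lines. This is an illustrative userspace model; the names and the `max_count` parameter are ours:

```python
def find_empty_idx(used, max_count):
    """Open-coded linear scan, mirroring nsim_ipsec_find_empty_idx():
    returns the first unused slot, or -1 (standing in for -ENOSPC)."""
    if sum(used) == max_count:     # mirrors the early count check
        return -1
    for i in range(max_count):
        if not used[i]:
            return i
    return -1

def find_first_zero_bit(bitmap, nbits):
    """Sketch of the bitmap alternative: first clear bit in an
    integer bitmap, or nbits if the bitmap is full."""
    for i in range(nbits):
        if not (bitmap >> i) & 1:
            return i
    return nbits
```

For a handful of entries the two are equivalent in practice, which supports keeping the simpler open-coded version here.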
[net-next PATCH v4 7/7] Documentation: Add explanation for XPS using Rx-queue(s) map
Signed-off-by: Amritha Nambiar --- Documentation/ABI/testing/sysfs-class-net-queues | 11 Documentation/networking/scaling.txt | 57 ++ 2 files changed, 58 insertions(+), 10 deletions(-) diff --git a/Documentation/ABI/testing/sysfs-class-net-queues b/Documentation/ABI/testing/sysfs-class-net-queues index 0c0df91..978b763 100644 --- a/Documentation/ABI/testing/sysfs-class-net-queues +++ b/Documentation/ABI/testing/sysfs-class-net-queues @@ -42,6 +42,17 @@ Description: network device transmit queue. Possible vaules depend on the number of available CPU(s) in the system. +What: /sys/class//queues/tx-/xps_rxqs +Date: June 2018 +KernelVersion: 4.18.0 +Contact: netdev@vger.kernel.org +Description: + Mask of the receive queue(s) currently enabled to participate + into the Transmit Packet Steering packet processing flow for this + network device transmit queue. Possible values depend on the + number of available receive queue(s) in the network device. + Default is disabled. + What: /sys/class//queues/tx-/byte_queue_limits/hold_time Date: November 2011 KernelVersion: 3.3 diff --git a/Documentation/networking/scaling.txt b/Documentation/networking/scaling.txt index f55639d..8336116 100644 --- a/Documentation/networking/scaling.txt +++ b/Documentation/networking/scaling.txt @@ -366,8 +366,13 @@ XPS: Transmit Packet Steering Transmit Packet Steering is a mechanism for intelligently selecting which transmit queue to use when transmitting a packet on a multi-queue -device. To accomplish this, a mapping from CPU to hardware queue(s) is -recorded. The goal of this mapping is usually to assign queues +device. This can be accomplished by recording two kinds of maps, either +a mapping of CPU to hardware queue(s) or a mapping of receive queue(s) +to hardware transmit queue(s). + +1. XPS using CPUs map + +The goal of this mapping is usually to assign queues exclusively to a subset of CPUs, where the transmit completions for these queues are processed on a CPU within this set. 
This choice provides two benefits. First, contention on the device queue lock is @@ -377,12 +382,35 @@ transmit queue). Secondly, cache miss rate on transmit completion is reduced, in particular for data cache lines that hold the sk_buff structures. -XPS is configured per transmit queue by setting a bitmap of CPUs that -may use that queue to transmit. The reverse mapping, from CPUs to -transmit queues, is computed and maintained for each network device. -When transmitting the first packet in a flow, the function -get_xps_queue() is called to select a queue. This function uses the ID -of the running CPU as a key into the CPU-to-queue lookup table. If the +2. XPS using receive queues map + +This mapping is used to pick transmit queue based on the receive +queue(s) map configuration set by the administrator. A set of receive +queues can be mapped to a set of transmit queues (many:many), although +the common use case is a 1:1 mapping. This will enable sending packets +on the same queue associations for transmit and receive. This is useful for +busy polling multi-threaded workloads where there are challenges in +associating a given CPU to a given application thread. The application +threads are not pinned to CPUs and each thread handles packets +received on a single queue. The receive queue number is cached in the +socket for the connection. In this model, sending the packets on the same +transmit queue corresponding to the associated receive queue has benefits +in keeping the CPU overhead low. Transmit completion work is locked into +the same queue-association that a given application is polling on. This +avoids the overhead of triggering an interrupt on another CPU. When the +application cleans up the packets during the busy poll, transmit completion +may be processed along with it in the same thread context and so result in +reduced latency. + +XPS is configured per transmit queue by setting a bitmap of +CPUs/receive-queues that may use that queue to transmit. 
The reverse +mapping, from CPUs to transmit queues or from receive-queues to transmit +queues, is computed and maintained for each network device. When +transmitting the first packet in a flow, the function get_xps_queue() is +called to select a queue. This function uses the ID of the receive queue +for the socket connection for a match in the receive queue-to-transmit queue +lookup table. Alternatively, this function can also use the ID of the +running CPU as a key into the CPU-to-queue lookup table. If the ID matches a single queue, that is used for transmission. If multiple queues match, one is selected by using the flow hash to compute an index into the set. @@ -404,11 +432,15 @@ acknowledged. XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by default for SMP). The
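As the documentation above says, XPS is configured per transmit queue by writing a bitmap of CPUs or receive queues. A hedged sketch of computing such a mask (format assumed to match the xps_cpus-style sysfs bitmaps; masks wider than 32 bits are printed in comma-separated 32-bit groups, which this sketch does not handle):

```python
def rxqs_mask(queues):
    """Hex bitmask with one bit set per receive-queue index,
    suitable in spirit for writing to tx-<n>/xps_rxqs."""
    mask = 0
    for q in queues:
        mask |= 1 << q
    return format(mask, "x")

one_to_one = rxqs_mask([0])      # rx queue 0 -> this tx queue
many = rxqs_mask([0, 3])         # rx queues 0 and 3 -> this tx queue
```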
[net-next PATCH v4 6/7] net-sysfs: Add interface for Rx queue(s) map per Tx queue
Extend transmit queue sysfs attribute to configure Rx queue(s) map per Tx queue. By default no receive queues are configured for the Tx queue. - /sys/class/net/eth0/queues/tx-*/xps_rxqs Signed-off-by: Amritha Nambiar --- net/core/net-sysfs.c | 81 ++ 1 file changed, 81 insertions(+) diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c index b39987c..5d2ed02 100644 --- a/net/core/net-sysfs.c +++ b/net/core/net-sysfs.c @@ -1283,6 +1283,86 @@ static ssize_t xps_cpus_store(struct netdev_queue *queue, static struct netdev_queue_attribute xps_cpus_attribute __ro_after_init = __ATTR_RW(xps_cpus); + +static ssize_t xps_rxqs_show(struct netdev_queue *queue, char *buf) +{ + struct net_device *dev = queue->dev; + struct xps_dev_maps *dev_maps; + unsigned long *mask, index; + int j, len, num_tc = 1, tc = 0; + + mask = kcalloc(BITS_TO_LONGS(dev->num_rx_queues), sizeof(long), + GFP_KERNEL); + if (!mask) + return -ENOMEM; + + index = get_netdev_queue_index(queue); + + if (dev->num_tc) { + num_tc = dev->num_tc; + tc = netdev_txq_to_tc(dev, index); + if (tc < 0) + return -EINVAL; + } + + rcu_read_lock(); + dev_maps = rcu_dereference(dev->xps_rxqs_map); + if (dev_maps) { + for (j = -1; j = attrmask_next(j, NULL, dev->num_rx_queues), +j < dev->num_rx_queues;) { + int i, tci = j * num_tc + tc; + struct xps_map *map; + + map = rcu_dereference(dev_maps->attr_map[tci]); + if (!map) + continue; + + for (i = map->len; i--;) { + if (map->queues[i] == index) { + set_bit(j, mask); + break; + } + } + } + } + + len = bitmap_print_to_pagebuf(false, buf, mask, dev->num_rx_queues); + rcu_read_unlock(); + kfree(mask); + + return len < PAGE_SIZE ? 
len : -EINVAL; +} + +static ssize_t xps_rxqs_store(struct netdev_queue *queue, const char *buf, + size_t len) +{ + struct net_device *dev = queue->dev; + unsigned long *mask, index; + int err; + + if (!capable(CAP_NET_ADMIN)) + return -EPERM; + + mask = kcalloc(BITS_TO_LONGS(dev->num_rx_queues), sizeof(long), + GFP_KERNEL); + if (!mask) + return -ENOMEM; + + index = get_netdev_queue_index(queue); + + err = bitmap_parse(buf, len, mask, dev->num_rx_queues); + if (err) { + kfree(mask); + return err; + } + + err = __netif_set_xps_queue(dev, mask, index, true); + kfree(mask); + return err ? : len; +} + +static struct netdev_queue_attribute xps_rxqs_attribute __ro_after_init + = __ATTR_RW(xps_rxqs); #endif /* CONFIG_XPS */ static struct attribute *netdev_queue_default_attrs[] __ro_after_init = { @@ -1290,6 +1370,7 @@ static struct attribute *netdev_queue_default_attrs[] __ro_after_init = { &queue_traffic_class.attr, #ifdef CONFIG_XPS &xps_cpus_attribute.attr, + &xps_rxqs_attribute.attr, &queue_tx_maxrate.attr, #endif NULL
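The show path above walks every receive queue's per-tc map and sets a bit for each receive queue whose map contains this transmit queue. A simplified Python sketch of that scan (single traffic class, list-of-lists in place of the RCU-protected `attr_map`; names are ours):

```python
def xps_rxqs_show_mask(attr_map, tx_index):
    """attr_map[j] is the list of tx queues mapped to rx queue j,
    or None if rx queue j has no map. Returns the bitmap of rx
    queues whose map contains tx_index, as xps_rxqs_show() does."""
    mask = 0
    for j, m in enumerate(attr_map):
        if m and tx_index in m:
            mask |= 1 << j
    return mask

# rx queues 0 and 2 map to tx queue 0; rx queues 2 and 3 to tx queue 1
attr_map = [[0], None, [0, 1], [1]]
mask_tx0 = xps_rxqs_show_mask(attr_map, 0)
mask_tx1 = xps_rxqs_show_mask(attr_map, 1)
```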
[net-next PATCH v4 4/7] net: Record receive queue number for a connection
This patch adds a new field to sock_common 'skc_rx_queue_mapping' which holds the receive queue number for the connection. The Rx queue is marked in tcp_finish_connect() to allow a client app to do SO_INCOMING_NAPI_ID after a connect() call to get the right queue association for a socket. Rx queue is also marked in tcp_conn_request() to allow syn-ack to go on the right tx-queue associated with the queue on which syn is received. Signed-off-by: Amritha Nambiar Signed-off-by: Sridhar Samudrala --- include/net/busy_poll.h |1 + include/net/sock.h | 14 ++ net/core/sock.c |4 net/ipv4/tcp_input.c|3 +++ 4 files changed, 22 insertions(+) diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h index c518743..9e36fda6 100644 --- a/include/net/busy_poll.h +++ b/include/net/busy_poll.h @@ -151,6 +151,7 @@ static inline void sk_mark_napi_id(struct sock *sk, const struct sk_buff *skb) #ifdef CONFIG_NET_RX_BUSY_POLL sk->sk_napi_id = skb->napi_id; #endif + sk_rx_queue_set(sk, skb); } /* variant used for unconnected sockets */ diff --git a/include/net/sock.h b/include/net/sock.h index 009fd30..0ff4416 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -139,6 +139,7 @@ typedef __u64 __bitwise __addrpair; * @skc_node: main hash linkage for various protocol lookup tables * @skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol * @skc_tx_queue_mapping: tx queue number for this connection + * @skc_rx_queue_mapping: rx queue number for this connection * @skc_flags: place holder for sk_flags * %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE, * %SO_OOBINLINE settings, %SO_TIMESTAMPING settings @@ -215,6 +216,9 @@ struct sock_common { struct hlist_nulls_node skc_nulls_node; }; unsigned short skc_tx_queue_mapping; +#ifdef CONFIG_XPS + unsigned short skc_rx_queue_mapping; +#endif union { int skc_incoming_cpu; u32 skc_rcv_wnd; @@ -326,6 +330,9 @@ struct sock { #define sk_nulls_node __sk_common.skc_nulls_node #define sk_refcnt __sk_common.skc_refcnt #define 
sk_tx_queue_mapping__sk_common.skc_tx_queue_mapping +#ifdef CONFIG_XPS +#define sk_rx_queue_mapping__sk_common.skc_rx_queue_mapping +#endif #define sk_dontcopy_begin __sk_common.skc_dontcopy_begin #define sk_dontcopy_end__sk_common.skc_dontcopy_end @@ -1696,6 +1703,13 @@ static inline int sk_tx_queue_get(const struct sock *sk) return sk ? sk->sk_tx_queue_mapping - 1 : -1; } +static inline void sk_rx_queue_set(struct sock *sk, const struct sk_buff *skb) +{ +#ifdef CONFIG_XPS + sk->sk_rx_queue_mapping = skb_get_rx_queue(skb) + 1; +#endif +} + static inline void sk_set_socket(struct sock *sk, struct socket *sock) { sk_tx_queue_clear(sk); diff --git a/net/core/sock.c b/net/core/sock.c index bcc4182..5e4715b 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -2818,6 +2818,10 @@ void sock_init_data(struct socket *sock, struct sock *sk) sk->sk_pacing_rate = ~0U; sk->sk_pacing_shift = 10; sk->sk_incoming_cpu = -1; + +#ifdef CONFIG_XPS + sk->sk_rx_queue_mapping = ~0; +#endif /* * Before updating sk_refcnt, we must commit prior changes to memory * (Documentation/RCU/rculist_nulls.txt for details) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 76ca88f..c404c53 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -78,6 +78,7 @@ #include #include #include +#include int sysctl_tcp_max_orphans __read_mostly = NR_FILE; @@ -5584,6 +5585,7 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff *skb) if (skb) { icsk->icsk_af_ops->sk_rx_dst_set(sk, skb); security_inet_conn_established(sk, skb); + sk_mark_napi_id(sk, skb); } tcp_init_transfer(sk, BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB); @@ -6412,6 +6414,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops, tcp_rsk(req)->snt_isn = isn; tcp_rsk(req)->txhash = net_tx_rndhash(); tcp_openreq_init_rwin(req, sk, dst); + sk_rx_queue_set(req_to_sk(req), skb); if (!want_cookie) { tcp_reqsk_record_syn(sk, req, skb); fastopen_sk = tcp_try_fastopen(sk, skb, req, &foc, dst);
[net-next PATCH v4 5/7] net: Enable Tx queue selection based on Rx queues
This patch adds support to pick Tx queue based on the Rx queue(s) map configuration set by the admin through the sysfs attribute for each Tx queue. If the user configuration for receive queue(s) map does not apply, then the Tx queue selection falls back to CPU(s) map based selection and finally to hashing. Signed-off-by: Amritha Nambiar --- include/net/sock.h |4 +++ net/core/dev.c | 62 ++-- 2 files changed, 49 insertions(+), 17 deletions(-) diff --git a/include/net/sock.h b/include/net/sock.h index 0ff4416..cb18139 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -1710,6 +1710,10 @@ static inline void sk_rx_queue_set(struct sock *sk, const struct sk_buff *skb) #endif } +static inline int sk_rx_queue_get(const struct sock *sk) +{ + return sk ? sk->sk_rx_queue_mapping - 1 : -1; +} static inline void sk_set_socket(struct sock *sk, struct socket *sock) { sk_tx_queue_clear(sk); diff --git a/net/core/dev.c b/net/core/dev.c index df2a78d..2450c5e 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -3454,35 +3454,63 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev) } #endif /* CONFIG_NET_EGRESS */ -static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb) +#ifdef CONFIG_XPS +static int __get_xps_queue_idx(struct net_device *dev, struct sk_buff *skb, + struct xps_dev_maps *dev_maps, unsigned int tci) +{ + struct xps_map *map; + int queue_index = -1; + + if (dev->num_tc) { + tci *= dev->num_tc; + tci += netdev_get_prio_tc_map(dev, skb->priority); + } + + map = rcu_dereference(dev_maps->attr_map[tci]); + if (map) { + if (map->len == 1) + queue_index = map->queues[0]; + else + queue_index = map->queues[reciprocal_scale( + skb_get_hash(skb), map->len)]; + if (unlikely(queue_index >= dev->real_num_tx_queues)) + queue_index = -1; + } + return queue_index; +} +#endif + +static int get_xps_queue(struct net_device *dev, struct sk_buff *skb) { #ifdef CONFIG_XPS struct xps_dev_maps *dev_maps; - struct xps_map *map; + struct sock 
*sk = skb->sk; int queue_index = -1; if (!static_key_false(&xps_needed)) return -1; rcu_read_lock(); - dev_maps = rcu_dereference(dev->xps_cpus_map); + if (!static_key_false(&xps_rxqs_needed)) + goto get_cpus_map; + + dev_maps = rcu_dereference(dev->xps_rxqs_map); if (dev_maps) { - unsigned int tci = skb->sender_cpu - 1; + int tci = sk_rx_queue_get(sk); - if (dev->num_tc) { - tci *= dev->num_tc; - tci += netdev_get_prio_tc_map(dev, skb->priority); - } + if (tci >= 0 && tci < dev->num_rx_queues) + queue_index = __get_xps_queue_idx(dev, skb, dev_maps, + tci); + } - map = rcu_dereference(dev_maps->attr_map[tci]); - if (map) { - if (map->len == 1) - queue_index = map->queues[0]; - else - queue_index = map->queues[reciprocal_scale(skb_get_hash(skb), - map->len)]; - if (unlikely(queue_index >= dev->real_num_tx_queues)) - queue_index = -1; +get_cpus_map: + if (queue_index < 0) { + dev_maps = rcu_dereference(dev->xps_cpus_map); + if (dev_maps) { + unsigned int tci = skb->sender_cpu - 1; + + queue_index = __get_xps_queue_idx(dev, skb, dev_maps, + tci); } } rcu_read_unlock();
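The selection order implemented above, rx-queues map first, CPUs map as fallback, then hashing, can be modeled in a short sketch. This is a simplified illustration, not the kernel code: traffic-class handling is dropped and `reciprocal_scale()` is replaced by a modulo:

```python
def get_xps_queue(rxqs_map, cpus_map, sk_rx_queue, cpu, flow_hash,
                  num_rx_queues, real_num_tx_queues):
    """rxqs_map / cpus_map are dicts from rx-queue / CPU index to a
    list of candidate tx queues. Returns -1 when neither map
    resolves, meaning the caller falls through to flow hashing."""
    def pick(queues):
        if not queues:
            return -1
        if len(queues) == 1:
            q = queues[0]
        else:
            q = queues[flow_hash % len(queues)]
        return q if q < real_num_tx_queues else -1

    queue_index = -1
    # First try the rx-queues map, keyed by the queue recorded in the
    # socket; the bounds check mirrors the one in the patch.
    if rxqs_map is not None and 0 <= sk_rx_queue < num_rx_queues:
        queue_index = pick(rxqs_map.get(sk_rx_queue))
    # Fall back to the CPUs map, keyed by the running CPU.
    if queue_index < 0 and cpus_map is not None:
        queue_index = pick(cpus_map.get(cpu))
    return queue_index
```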
[net-next PATCH v4 0/7] Symmetric queue selection using XPS for Rx queues
This patch series implements support for Tx queue selection based on Rx queue(s) map. This is done by configuring Rx queue(s) map per Tx-queue using sysfs attribute. If the user configuration for Rx queues does not apply, then the Tx queue selection falls back to XPS using CPUs and finally to hashing.

XPS is refactored to support Tx queue selection based on either the CPUs map or the Rx-queues map. The config option CONFIG_XPS needs to be enabled. By default no receive queues are configured for the Tx queue.

- /sys/class/net//queues/tx-*/xps_rxqs

A set of receive queues can be mapped to a set of transmit queues (many:many), although the common use case is a 1:1 mapping. This will enable sending packets on the same Tx-Rx queue association as this is useful for busy polling multi-threaded workloads where it is not possible to pin the threads to a CPU.

This is a rework of Sridhar's patch for symmetric queueing via socket option:
https://www.spinics.net/lists/netdev/msg453106.html

Testing Hints:

Kernel: Linux 4.17.0-rc7+
Interface:
  driver: ixgbe
  version: 5.1.0-k
  firmware-version: 0x00015e0b

Configuration:
  ethtool -L $iface combined 16
  ethtool -C $iface rx-usecs 1000
  sysctl net.core.busy_poll=1000
  ATR disabled: ethtool -K $iface ntuple on

Workload:
  Modified memcached that changes the thread selection policy to be
  based on the incoming rx-queue of a connection using the
  SO_INCOMING_NAPI_ID socket option. The default is round-robin.
Default: No rxqs_map configured
Symmetric queues: Enable rxqs_map for all queues 1:1 mapped to Tx queue

System:
  Architecture: x86_64
  CPU(s):       72
  Model name:   Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz

16 threads, 400K requests/sec
=========================================================================
                       Default        Symmetric queues
-------------------------------------------------------------------------
RTT min/avg/max        4/51/2215      2/30/5163      (usec)
intr/sec               26655          18606
contextswitch/sec      5145           4044
insn per cycle         0.43           0.72
cache-misses           6.919          4.310          (% of all cache refs)
L1-dcache-load-misses  4.49           3.29           (% of all L1-dcache hits)
LLC-load-misses        13.26          8.96           (% of all LL-cache hits)
-------------------------------------------------------------------------

32 threads, 400K requests/sec
=========================================================================
                       Default        Symmetric queues
-------------------------------------------------------------------------
RTT min/avg/max        10/112/5562    9/46/4637      (usec)
intr/sec               30456          27666
contextswitch/sec      7552           5133
insn per cycle         0.41           0.49
cache-misses           9.357          2.769          (% of all cache refs)
L1-dcache-load-misses  4.09           3.98           (% of all L1-dcache hits)
LLC-load-misses        12.96          3.96           (% of all LL-cache hits)
-------------------------------------------------------------------------

16 threads, 800K requests/sec
=========================================================================
                       Default        Symmetric queues
-------------------------------------------------------------------------
RTT min/avg/max        5/151/4989     9/69/2611      (usec)
intr/sec               35686          22907
contextswitch/sec      25522          12281
insn per cycle         0.67           0.74
cache-misses           8.652          6.38           (% of all cache refs)
L1-dcache-load-misses  3.19           2.86           (% of all L1-dcache hits)
LLC-load-misses        16.53          11.99          (% of all LL-cache hits)
-------------------------------------------------------------------------

32 threads, 800K requests/sec
=========================================================================
                       Default        Symmetric queues
-------------------------------------------------------------------------
RTT min/avg/max        6/163/6152     8/88/4209      (usec)
intr/sec               47079          26548
contextswitch/sec      42190          39168
insn per cycle         0.45           0.54
cache-misses           8.798
[net-next PATCH v4 3/7] net: sock: Change tx_queue_mapping in sock_common to unsigned short
Change 'skc_tx_queue_mapping' field in sock_common structure from 'int' to 'unsigned short' type with 0 indicating unset and a positive queue value being set. This way it is consistent with the queue_mapping field in the sk_buff. This will also accommodate adding a new 'unsigned short' field in sock_common in the next patch for rx_queue_mapping. Signed-off-by: Amritha Nambiar --- include/net/sock.h | 10 ++ 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/include/net/sock.h b/include/net/sock.h index b3b7541..009fd30 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -214,7 +214,7 @@ struct sock_common { struct hlist_node skc_node; struct hlist_nulls_node skc_nulls_node; }; - int skc_tx_queue_mapping; + unsigned short skc_tx_queue_mapping; union { int skc_incoming_cpu; u32 skc_rcv_wnd; @@ -1681,17 +1681,19 @@ static inline int sk_receive_skb(struct sock *sk, struct sk_buff *skb, static inline void sk_tx_queue_set(struct sock *sk, int tx_queue) { - sk->sk_tx_queue_mapping = tx_queue; + /* sk_tx_queue_mapping accept only upto a 16-bit value */ + WARN_ON((unsigned short)tx_queue > USHRT_MAX); + sk->sk_tx_queue_mapping = tx_queue + 1; } static inline void sk_tx_queue_clear(struct sock *sk) { - sk->sk_tx_queue_mapping = -1; + sk->sk_tx_queue_mapping = 0; } static inline int sk_tx_queue_get(const struct sock *sk) { - return sk ? sk->sk_tx_queue_mapping : -1; + return sk ? sk->sk_tx_queue_mapping - 1 : -1; } static inline void sk_set_socket(struct sock *sk, struct socket *sock)
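The "0 means unset" encoding this patch introduces shifts every stored value by one: set stores `tx_queue + 1`, clear stores 0, and get subtracts 1 so an unset mapping reads back as -1. A minimal Python model of that convention (class and attribute names mirror the patch; the model is ours):

```python
USHRT_MAX = 0xffff

class SockCommon:
    """Sketch of the unsigned-short 'queue + 1, 0 = unset' encoding."""
    def __init__(self):
        self.skc_tx_queue_mapping = 0          # sk_tx_queue_clear()

    def sk_tx_queue_set(self, tx_queue):
        assert 0 <= tx_queue < USHRT_MAX       # must fit after the +1 shift
        self.skc_tx_queue_mapping = tx_queue + 1

    def sk_tx_queue_clear(self):
        self.skc_tx_queue_mapping = 0

    def sk_tx_queue_get(self):
        return self.skc_tx_queue_mapping - 1   # -1 when unset
```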
[net-next PATCH v4 1/7] net: Refactor XPS for CPUs and Rx queues
Refactor XPS code to support Tx queue selection based on CPU(s) map or Rx queue(s) map. Signed-off-by: Amritha Nambiar --- include/linux/cpumask.h | 11 ++ include/linux/netdevice.h | 100 + net/core/dev.c| 211 ++--- net/core/net-sysfs.c |4 - 4 files changed, 246 insertions(+), 80 deletions(-) diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h index bf53d89..57f20a0 100644 --- a/include/linux/cpumask.h +++ b/include/linux/cpumask.h @@ -115,12 +115,17 @@ extern struct cpumask __cpu_active_mask; #define cpu_active(cpu)((cpu) == 0) #endif -/* verify cpu argument to cpumask_* operators */ -static inline unsigned int cpumask_check(unsigned int cpu) +static inline void cpu_max_bits_warn(unsigned int cpu, unsigned int bits) { #ifdef CONFIG_DEBUG_PER_CPU_MAPS - WARN_ON_ONCE(cpu >= nr_cpumask_bits); + WARN_ON_ONCE(cpu >= bits); #endif /* CONFIG_DEBUG_PER_CPU_MAPS */ +} + +/* verify cpu argument to cpumask_* operators */ +static inline unsigned int cpumask_check(unsigned int cpu) +{ + cpu_max_bits_warn(cpu, nr_cpumask_bits); return cpu; } diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 3ec9850..c534f03 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -730,10 +730,15 @@ struct xps_map { */ struct xps_dev_maps { struct rcu_head rcu; - struct xps_map __rcu *cpu_map[0]; + struct xps_map __rcu *attr_map[0]; /* Either CPUs map or RXQs map */ }; -#define XPS_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) + \ + +#define XPS_CPU_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) + \ (nr_cpu_ids * (_tcs) * sizeof(struct xps_map *))) + +#define XPS_RXQ_DEV_MAPS_SIZE(_tcs, _rxqs) (sizeof(struct xps_dev_maps) +\ + (_rxqs * (_tcs) * sizeof(struct xps_map *))) + #endif /* CONFIG_XPS */ #define TC_MAX_QUEUE 16 @@ -1909,7 +1914,8 @@ struct net_device { int watchdog_timeo; #ifdef CONFIG_XPS - struct xps_dev_maps __rcu *xps_maps; + struct xps_dev_maps __rcu *xps_cpus_map; + struct xps_dev_maps __rcu *xps_rxqs_map; #endif #ifdef 
CONFIG_NET_CLS_ACT struct mini_Qdisc __rcu *miniq_egress; @@ -3258,6 +3264,94 @@ static inline void netif_wake_subqueue(struct net_device *dev, u16 queue_index) #ifdef CONFIG_XPS int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask, u16 index); +int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask, + u16 index, bool is_rxqs_map); + +/** + * attr_test_mask - Test a CPU or Rx queue set in a cpumask/rx queues mask + * @j: CPU/Rx queue index + * @mask: bitmask of all cpus/rx queues + * @nr_bits: number of bits in the bitmask + * + * Test if a CPU or Rx queue index is set in a mask of all CPU/Rx queues. + */ +static inline bool attr_test_mask(unsigned long j, const unsigned long *mask, + unsigned int nr_bits) +{ + cpu_max_bits_warn(j, nr_bits); + return test_bit(j, mask); +} + +/** + * attr_test_online - Test for online CPU/Rx queue + * @j: CPU/Rx queue index + * @online_mask: bitmask for CPUs/Rx queues that are online + * @nr_bits: number of bits in the bitmask + * + * Returns true if a CPU/Rx queue is online. + */ +static inline bool attr_test_online(unsigned long j, + const unsigned long *online_mask, + unsigned int nr_bits) +{ + cpu_max_bits_warn(j, nr_bits); + + if (online_mask) + return test_bit(j, online_mask); + + if (j >= 0 && j < nr_bits) + return true; + + return false; +} + +/** + * attrmask_next - get the next CPU/Rx queue in a cpumask/Rx queues mask + * @n: CPU/Rx queue index + * @srcp: the cpumask/Rx queue mask pointer + * @nr_bits: number of bits in the bitmask + * + * Returns >= nr_bits if no further CPUs/Rx queues set. + */ +static inline unsigned int attrmask_next(int n, const unsigned long *srcp, +unsigned int nr_bits) +{ + /* -1 is a legal arg here. 
*/ + if (n != -1) + cpu_max_bits_warn(n, nr_bits); + + if (srcp) + return find_next_bit(srcp, nr_bits, n + 1); + + return n + 1; +} + +/** + * attrmask_next_and - get the next CPU/Rx queue in *src1p & *src2p + * @n: CPU/Rx queue index + * @src1p: the first CPUs/Rx queues mask pointer + * @src2p: the second CPUs/Rx queues mask pointer + * @nr_bits: number of bits in the bitmask + * + * Returns >= nr_bits if no further CPUs/Rx queues set in both. + */ +static inline int attrmask_next_and(int n, const unsigned long *src1p, + const unsigned long *src2p, + unsi
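The `attrmask_next()` helper generalizes the cpumask iterators: with a mask it behaves like `find_next_bit()`, and with a NULL mask every index counts as set, so it simply returns `n + 1`. A Python sketch of those semantics (integer bitmap in place of the `unsigned long *`):

```python
def attrmask_next(n, srcp, nr_bits):
    """Next set bit strictly after n; n == -1 is legal and starts
    the walk at index 0. Returns >= nr_bits when no further bits
    are set. srcp is None for the 'all indices set' case."""
    if srcp is None:
        return n + 1
    for j in range(n + 1, nr_bits):
        if (srcp >> j) & 1:
            return j
    return nr_bits
```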
[net-next PATCH v4 2/7] net: Use static_key for XPS maps
Use static_key for XPS maps to reduce the cost of extra map checks, similar to how it is used for RPS and RFS. This includes static_key 'xps_needed' for XPS and another for 'xps_rxqs_needed' for XPS using Rx queues map. Signed-off-by: Amritha Nambiar --- net/core/dev.c | 26 -- 1 file changed, 20 insertions(+), 6 deletions(-) diff --git a/net/core/dev.c b/net/core/dev.c index 2552556..df2a78d 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -2081,6 +2081,10 @@ int netdev_txq_to_tc(struct net_device *dev, unsigned int txq) EXPORT_SYMBOL(netdev_txq_to_tc); #ifdef CONFIG_XPS +struct static_key xps_needed __read_mostly; +EXPORT_SYMBOL(xps_needed); +struct static_key xps_rxqs_needed __read_mostly; +EXPORT_SYMBOL(xps_rxqs_needed); static DEFINE_MUTEX(xps_map_mutex); #define xmap_dereference(P)\ rcu_dereference_protected((P), lockdep_is_held(&xps_map_mutex)) @@ -2170,12 +2174,14 @@ static void netif_reset_xps_queues(struct net_device *dev, u16 offset, mutex_lock(&xps_map_mutex); - dev_maps = xmap_dereference(dev->xps_rxqs_map); - if (dev_maps) { - nr_ids = dev->num_rx_queues; - clean_xps_maps(dev, possible_mask, dev_maps, nr_ids, offset, - count, true); - + if (static_key_false(&xps_rxqs_needed)) { + dev_maps = xmap_dereference(dev->xps_rxqs_map); + if (dev_maps) { + nr_ids = dev->num_rx_queues; + clean_xps_maps(dev, possible_mask, dev_maps, nr_ids, + offset, count, true); + } + static_key_slow_dec(&xps_rxqs_needed); } dev_maps = xmap_dereference(dev->xps_cpus_map); @@ -2189,6 +2195,7 @@ static void netif_reset_xps_queues(struct net_device *dev, u16 offset, false); out_no_maps: + static_key_slow_dec(&xps_needed); mutex_unlock(&xps_map_mutex); } @@ -2297,6 +2304,10 @@ int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask, if (!new_dev_maps) goto out_no_new_maps; + static_key_slow_inc(&xps_needed); + if (is_rxqs_map) + static_key_slow_inc(&xps_rxqs_needed); + for (j = -1; j = attrmask_next(j, possible_mask, nr_ids), j < nr_ids;) { /* copy maps 
belonging to foreign traffic classes */ @@ -3450,6 +3461,9 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb) struct xps_map *map; int queue_index = -1; + if (!static_key_false(&xps_needed)) + return -1; + rcu_read_lock(); dev_maps = rcu_dereference(dev->xps_cpus_map); if (dev_maps) {
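The static_key pattern used here is a refcount behind a cheap fast-path check: each configured map increments the key, each teardown decrements it, and the hot path bails out early while the count is zero. A userspace analogy that models only the counting (the real mechanism patches the branch in the instruction stream, which this sketch does not attempt):

```python
class StaticKey:
    """Refcounted enable flag modeling static_key_slow_inc/dec and
    the static_key_false() fast-path check."""
    def __init__(self):
        self._count = 0

    def slow_inc(self):
        self._count += 1

    def slow_dec(self):
        self._count -= 1

    def enabled(self):          # stands in for static_key_false()
        return self._count > 0

xps_needed = StaticKey()
xps_needed.slow_inc()           # first map configured
xps_needed.slow_inc()           # second map configured
xps_needed.slow_dec()           # one map torn down
still_on = xps_needed.enabled() # still enabled: one user remains
```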
Re: [PATCH net-next 3/4] netdevsim: add ipsec offload testing
On 6/22/2018 9:07 PM, Jakub Kicinski wrote: On Fri, 22 Jun 2018 17:31:37 -0700, Shannon Nelson wrote: Implement the IPsec/XFRM offload API for testing. Signed-off-by: Shannon Nelson Thanks for the patch! Just a number of stylistic nit picks. diff --git a/drivers/net/netdevsim/ipsec.c b/drivers/net/netdevsim/ipsec.c new file mode 100644 index 000..ad64266 --- /dev/null +++ b/drivers/net/netdevsim/ipsec.c @@ -0,0 +1,345 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright(c) 2018 Oracle and/or its affiliates. All rights reserved. */ + +#include +#include +#include +#include "netdevsim.h" Other files in the driver sort headers alphabetically and put an empty line between global and local headers. Sure. +#define NSIM_IPSEC_AUTH_BITS 128 + +/** + * nsim_ipsec_dbg_read - read for ipsec data + * @filp: the opened file + * @buffer: where to write the data for the user to read + * @count: the size of the user's buffer + * @ppos: file position offset + **/ +static ssize_t nsim_dbg_netdev_ops_read(struct file *filp, Doesn't match the kdoc. Please run ./scripts/kernel-doc -none $file if you want kdoc. Although IMHO you may as well drop the kdoc, your code is quite self explanatory and local. By adding -v to that I got a couple of warnings that I didn't include the Return information - is that what you were commenting on? The rest seems acceptable to the script I'm using from the net-next tree. 
+ char __user *buffer, + size_t count, loff_t *ppos) +{ + struct netdevsim *ns = filp->private_data; + struct nsim_ipsec *ipsec = &ns->ipsec; + size_t bufsize; + char *buf, *p; + int len; + int i; + + /* don't allow partial reads */ + if (*ppos != 0) + return 0; + + /* the buffer needed is +* (num SAs * 3 lines each * ~60 bytes per line) + one more line +*/ + bufsize = (ipsec->count * 4 * 60) + 60; + buf = kzalloc(bufsize, GFP_KERNEL); + if (!buf) + return -ENOMEM; + + p = buf; + p += snprintf(p, bufsize - (p - buf), + "SA count=%u tx=%u\n", + ipsec->count, ipsec->tx); + + for (i = 0; i < NSIM_IPSEC_MAX_SA_COUNT; i++) { + struct nsim_sa *sap = &ipsec->sa[i]; + + if (!sap->used) + continue; + + p += snprintf(p, bufsize - (p - buf), + "sa[%i] %cx ipaddr=0x%08x %08x %08x %08x\n", + i, (sap->rx ? 'r' : 't'), sap->ipaddr[0], + sap->ipaddr[1], sap->ipaddr[2], sap->ipaddr[3]); + p += snprintf(p, bufsize - (p - buf), + "sa[%i]spi=0x%08x proto=0x%x salt=0x%08x crypt=%d\n", + i, be32_to_cpu(sap->xs->id.spi), + sap->xs->id.proto, sap->salt, sap->crypt); + p += snprintf(p, bufsize - (p - buf), + "sa[%i]key=0x%08x %08x %08x %08x\n", + i, sap->key[0], sap->key[1], + sap->key[2], sap->key[3]); + } + + len = simple_read_from_buffer(buffer, count, ppos, buf, p - buf); Why not seq_file for this? Why bother with more interface code? This is useful enough to support the API testing needed. 
+ kfree(buf); + return len; +} + +static const struct file_operations ipsec_dbg_fops = { + .owner = THIS_MODULE, + .open = simple_open, + .read = nsim_dbg_netdev_ops_read, +}; + +/** + * nsim_ipsec_find_empty_idx - find the first unused security parameter index + * @ipsec: pointer to ipsec struct + **/ +static int nsim_ipsec_find_empty_idx(struct nsim_ipsec *ipsec) +{ + u32 i; + + if (ipsec->count == NSIM_IPSEC_MAX_SA_COUNT) + return -ENOSPC; + + /* search sa table */ + for (i = 0; i < NSIM_IPSEC_MAX_SA_COUNT; i++) { + if (!ipsec->sa[i].used) + return i; + } + + return -ENOSPC; FWIW I personally find bitmaps and find_first_zero_bit() etc. nice and concise for a small ID allocator, but no objection to open coding. Sure, we could add a parallel bitmap data structure to track usage of our array elements, and probably would for a much larger array so as to lessen the impact of a serial search. But, since this is a short array for simple testing purposes, the search time is minimal so I think the simple code is fine. +} + +/** + * nsim_ipsec_parse_proto_keys - find the key and salt based on the protocol + * @xs: pointer to xfrm_state struct + * @mykey: pointer to key array to populate + * @mysalt: pointer to salt value to populate + * + * This copies the protocol keys and salt to our own data tables. The + * 82599 family only supports the one algorithm. 82599 is a fine chip, it's
Re: [PATCH net-next v2 7/7] net: sched: call reoffload op on block callback reg
Mon, Jun 25, 2018 at 11:30:10PM CEST, jakub.kicin...@netronome.com wrote: >From: John Hurley > >Call the reoffload tcf_proto_op on all tcf_proto nodes in all chains of a >block when a callback tries to register to a block that already has >offloaded rules. If all existing rules cannot be offloaded then the >registration is rejected. This replaces the previous policy of rejecting >such callback registration outright. > >On unregistration of a callback, the rules are flushed for that given cb. >The implementation of block sharing in the NFP driver, for example, >duplicates shared rules to all devs bound to a block. This meant that >rules could still exist in hw even after a device is unbound from a block >(assuming the block still remains active). > >Signed-off-by: John Hurley >Signed-off-by: Jakub Kicinski Acked-by: Jiri Pirko
Re: [PATCH net-next v2 4/7] net: sched: cls_matchall: implement offload tcf_proto_op
Mon, Jun 25, 2018 at 11:30:07PM CEST, jakub.kicin...@netronome.com wrote: >From: John Hurley > >Add the reoffload tcf_proto_op in matchall to generate an offload message >for each filter in the given tcf_proto. Call the specified callback with >this new offload message. The function only returns an error if the >callback rejects adding a 'hardware only' rule. > >Ensure matchall flags correctly report if the rule is in hw by keeping a >reference counter for the number of instances of the rule offloaded. Only >update the flag when this counter changes from or to 0. > >Signed-off-by: John Hurley >Signed-off-by: Jakub Kicinski Acked-by: Jiri Pirko
Re: [PATCH net-next v2 3/7] net: sched: cls_flower: implement offload tcf_proto_op
Mon, Jun 25, 2018 at 11:30:06PM CEST, jakub.kicin...@netronome.com wrote: >From: John Hurley > >Add the reoffload tcf_proto_op in flower to generate an offload message >for each filter in the given tcf_proto. Call the specified callback with >this new offload message. The function only returns an error if the >callback rejects adding a 'hardware only' rule. > >A filter contains a flag to indicate if it is in hardware or not. To >ensure the reoffload function properly maintains this flag, keep a >reference counter for the number of instances of the filter that are in >hardware. Only update the flag when this counter changes from or to 0. Add >a generic helper function to implement this behaviour. > >Signed-off-by: John Hurley >Signed-off-by: Jakub Kicinski Acked-by: Jiri Pirko
Re: [PATCH net-next v2 2/7] net: sched: add tcf_proto_op to offload a rule
Mon, Jun 25, 2018 at 11:30:05PM CEST, jakub.kicin...@netronome.com wrote: >From: John Hurley > >Create a new tcf_proto_op called 'reoffload' that generates a new offload >message for each node in a tcf_proto. Pointers to the tcf_proto and >whether the offload request is to add or delete the node are included. >Also included is a callback function to send the offload message to and >the option of priv data to go with the cb. > >Signed-off-by: John Hurley >Signed-off-by: Jakub Kicinski Acked-by: Jiri Pirko
Re: [PATCH net-next v2 1/7] net: sched: pass extack pointer to block binds and cb registration
Mon, Jun 25, 2018 at 11:30:04PM CEST, jakub.kicin...@netronome.com wrote: >From: John Hurley > >Pass the extack struct from a tc qdisc add to the block bind function and, >in turn, to the setup_tc ndo of the binding device via the tc_block_offload >struct. Pass this back to any block callback registrations to allow >netlink logging of fails in the bind process. > >Signed-off-by: John Hurley >Signed-off-by: Jakub Kicinski Acked-by: Jiri Pirko
Re: [PATCH rdma-next 00/12] RDMA fixes 2018-06-24
On Sun, Jun 24, 2018 at 11:23:41AM +0300, Leon Romanovsky wrote: > From: Leon Romanovsky > > Hi, > > This is a bunch of patches triggered by running syzkaller internally. > > I'm sending them based on rdma-next mainly for two reasons: > 1. Most of the patches fix old issues and it doesn't matter when > they will hit Linus's tree: now or later in a couple of weeks > during the merge window. > 2. They interleave with code cleanup, mlx5-next patches and Michael's > feedback on flow counters series. > > Thanks > > Leon Romanovsky (12): > RDMA/uverbs: Protect from attempts to create flows on unsupported QP > RDMA/uverbs: Fix slab-out-of-bounds in ib_uverbs_ex_create_flow I applied these two to for-rc > RDMA/uverbs: Check existence of create_flow callback > RDMA/verbs: Drop kernel variant of create_flow > RDMA/verbs: Drop kernel variant of destroy_flow > net/mlx5: Rate limit errors in command interface > RDMA/uverbs: Don't overwrite NULL pointer with ZERO_SIZE_PTR > RDMA/umem: Don't check for negative return value of dma_map_sg_attrs() > RDMA/uverbs: Remove redundant check These to for-next > overflow.h: Add arithmetic shift helper > RDMA/mlx5: Fix shift overflow in mlx5_ib_create_wq > RDMA/mlx5: Reuse existed shift_overflow helper And these will have to be respun. Thanks, Jason
[PATCH net-next v2 1/7] net: sched: pass extack pointer to block binds and cb registration
From: John Hurley Pass the extack struct from a tc qdisc add to the block bind function and, in turn, to the setup_tc ndo of the binding device via the tc_block_offload struct. Pass this back to any block callback registrations to allow netlink logging of fails in the bind process. Signed-off-by: John Hurley Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/broadcom/bnxt/bnxt.c | 2 +- drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c | 2 +- .../net/ethernet/chelsio/cxgb4/cxgb4_main.c | 2 +- drivers/net/ethernet/intel/i40e/i40e_main.c | 2 +- .../net/ethernet/intel/i40evf/i40evf_main.c | 2 +- drivers/net/ethernet/intel/igb/igb_main.c | 2 +- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 2 +- .../net/ethernet/mellanox/mlx5/core/en_main.c | 2 +- .../net/ethernet/mellanox/mlx5/core/en_rep.c | 2 +- .../net/ethernet/mellanox/mlxsw/spectrum.c| 10 +--- drivers/net/ethernet/netronome/nfp/bpf/main.c | 2 +- .../ethernet/netronome/nfp/flower/offload.c | 2 +- .../net/ethernet/stmicro/stmmac/stmmac_main.c | 2 +- drivers/net/netdevsim/netdev.c| 2 +- include/net/pkt_cls.h | 11 +--- net/dsa/slave.c | 2 +- net/sched/cls_api.c | 25 --- 17 files changed, 43 insertions(+), 31 deletions(-) diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c index 176fc9f4d7de..b5fc6414a951 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c @@ -7984,7 +7984,7 @@ static int bnxt_setup_tc_block(struct net_device *dev, switch (f->command) { case TC_BLOCK_BIND: return tcf_block_cb_register(f->block, bnxt_setup_tc_block_cb, -bp, bp); +bp, bp, f->extack); case TC_BLOCK_UNBIND: tcf_block_cb_unregister(f->block, bnxt_setup_tc_block_cb, bp); return 0; diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c index 05d405905906..0745f2dfc80c 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c @@ -173,7 +173,7 @@
static int bnxt_vf_rep_setup_tc_block(struct net_device *dev, case TC_BLOCK_BIND: return tcf_block_cb_register(f->block, bnxt_vf_rep_setup_tc_block_cb, -vf_rep, vf_rep); +vf_rep, vf_rep, f->extack); case TC_BLOCK_UNBIND: tcf_block_cb_unregister(f->block, bnxt_vf_rep_setup_tc_block_cb, vf_rep); diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c index dd04a2f89ce6..84eca1d45ad1 100644 --- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c +++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c @@ -3016,7 +3016,7 @@ static int cxgb_setup_tc_block(struct net_device *dev, switch (f->command) { case TC_BLOCK_BIND: return tcf_block_cb_register(f->block, cxgb_setup_tc_block_cb, -pi, dev); +pi, dev, f->extack); case TC_BLOCK_UNBIND: tcf_block_cb_unregister(f->block, cxgb_setup_tc_block_cb, pi); return 0; diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c index 7ad2b1b0b125..426b0ccb1fc6 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_main.c +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c @@ -7554,7 +7554,7 @@ static int i40e_setup_tc_block(struct net_device *dev, switch (f->command) { case TC_BLOCK_BIND: return tcf_block_cb_register(f->block, i40e_setup_tc_block_cb, -np, np); +np, np, f->extack); case TC_BLOCK_UNBIND: tcf_block_cb_unregister(f->block, i40e_setup_tc_block_cb, np); return 0; diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_main.c b/drivers/net/ethernet/intel/i40evf/i40evf_main.c index dc56a8667495..5906c1c1d19d 100644 --- a/drivers/net/ethernet/intel/i40evf/i40evf_main.c +++ b/drivers/net/ethernet/intel/i40evf/i40evf_main.c @@ -2926,7 +2926,7 @@ static int i40evf_setup_tc_block(struct net_device *dev, switch (f->command) { case TC_BLOCK_BIND: return tcf_block_cb_register(f->block, i40evf_setup_tc_block_cb, -adapter, adapter); +adapter, adapter, f->extack); case TC_BLOCK_UNBIND: tcf_block_cb_unregister(f->block, i40e
[PATCH net-next v2 2/7] net: sched: add tcf_proto_op to offload a rule
From: John Hurley Create a new tcf_proto_op called 'reoffload' that generates a new offload message for each node in a tcf_proto. Pointers to the tcf_proto and whether the offload request is to add or delete the node are included. Also included is a callback function to send the offload message to and the option of priv data to go with the cb. Signed-off-by: John Hurley Signed-off-by: Jakub Kicinski --- v2: - name parameters in header file. --- include/net/act_api.h | 3 --- include/net/sch_generic.h | 6 ++ 2 files changed, 6 insertions(+), 3 deletions(-) diff --git a/include/net/act_api.h b/include/net/act_api.h index 9e59ebfded62..5ff11adbe2a6 100644 --- a/include/net/act_api.h +++ b/include/net/act_api.h @@ -190,9 +190,6 @@ static inline void tcf_action_stats_update(struct tc_action *a, u64 bytes, #endif } -typedef int tc_setup_cb_t(enum tc_setup_type type, - void *type_data, void *cb_priv); - #ifdef CONFIG_NET_CLS_ACT int tc_setup_cb_egdev_register(const struct net_device *dev, tc_setup_cb_t *cb, void *cb_priv); diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h index 6488daa32f82..18adc9142b18 100644 --- a/include/net/sch_generic.h +++ b/include/net/sch_generic.h @@ -20,6 +20,9 @@ struct qdisc_walker; struct tcf_walker; struct module; +typedef int tc_setup_cb_t(enum tc_setup_type type, + void *type_data, void *cb_priv); + struct qdisc_rate_table { struct tc_ratespec rate; u32 data[256]; @@ -256,6 +259,9 @@ struct tcf_proto_ops { bool *last, struct netlink_ext_ack *); void(*walk)(struct tcf_proto*, struct tcf_walker *arg); + int (*reoffload)(struct tcf_proto *tp, bool add, +tc_setup_cb_t *cb, void *cb_priv, +struct netlink_ext_ack *extack); void(*bind_class)(void *, u32, unsigned long); /* rtnetlink specific */ -- 2.17.1
[PATCH net-next v2 5/7] net: sched: cls_u32: implement offload tcf_proto_op
From: John Hurley Add the offload tcf_proto_op in cls_u32 to generate an offload message for each filter and the hashtable in the given tcf_proto. Call the specified callback with this new offload message. The function only returns an error if the callback rejects adding a 'hardware only' rule. A filter contains a flag to indicate if it is in hardware or not. To ensure the offload function properly maintains this flag, keep a reference counter for the number of instances of the filter that are in hardware. Only update the flag when this counter changes from or to 0. Signed-off-by: John Hurley Signed-off-by: Jakub Kicinski --- net/sched/cls_u32.c | 111 1 file changed, 111 insertions(+) diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c index fb861f90fde6..d5d2a6dc3921 100644 --- a/net/sched/cls_u32.c +++ b/net/sched/cls_u32.c @@ -62,6 +62,7 @@ struct tc_u_knode { struct tc_u32_pcnt __percpu *pf; #endif u32 flags; + unsigned intin_hw_count; #ifdef CONFIG_CLS_U32_MARK u32 val; u32 mask; @@ -571,6 +572,7 @@ static int u32_replace_hw_knode(struct tcf_proto *tp, struct tc_u_knode *n, u32_remove_hw_knode(tp, n, NULL); return err; } else if (err > 0) { + n->in_hw_count = err; tcf_block_offload_inc(block, &n->flags); } @@ -1199,6 +1201,114 @@ static void u32_walk(struct tcf_proto *tp, struct tcf_walker *arg) } } +static int u32_reoffload_hnode(struct tcf_proto *tp, struct tc_u_hnode *ht, + bool add, tc_setup_cb_t *cb, void *cb_priv, + struct netlink_ext_ack *extack) +{ + struct tc_cls_u32_offload cls_u32 = {}; + int err; + + tc_cls_common_offload_init(&cls_u32.common, tp, ht->flags, extack); + cls_u32.command = add ? 
TC_CLSU32_NEW_HNODE : TC_CLSU32_DELETE_HNODE; + cls_u32.hnode.divisor = ht->divisor; + cls_u32.hnode.handle = ht->handle; + cls_u32.hnode.prio = ht->prio; + + err = cb(TC_SETUP_CLSU32, &cls_u32, cb_priv); + if (err && add && tc_skip_sw(ht->flags)) + return err; + + return 0; +} + +static int u32_reoffload_knode(struct tcf_proto *tp, struct tc_u_knode *n, + bool add, tc_setup_cb_t *cb, void *cb_priv, + struct netlink_ext_ack *extack) +{ + struct tc_u_hnode *ht = rtnl_dereference(n->ht_down); + struct tcf_block *block = tp->chain->block; + struct tc_cls_u32_offload cls_u32 = {}; + int err; + + tc_cls_common_offload_init(&cls_u32.common, tp, n->flags, extack); + cls_u32.command = add ? + TC_CLSU32_REPLACE_KNODE : TC_CLSU32_DELETE_KNODE; + cls_u32.knode.handle = n->handle; + + if (add) { + cls_u32.knode.fshift = n->fshift; +#ifdef CONFIG_CLS_U32_MARK + cls_u32.knode.val = n->val; + cls_u32.knode.mask = n->mask; +#else + cls_u32.knode.val = 0; + cls_u32.knode.mask = 0; +#endif + cls_u32.knode.sel = &n->sel; + cls_u32.knode.exts = &n->exts; + if (n->ht_down) + cls_u32.knode.link_handle = ht->handle; + } + + err = cb(TC_SETUP_CLSU32, &cls_u32, cb_priv); + if (err) { + if (add && tc_skip_sw(n->flags)) + return err; + return 0; + } + + tc_cls_offload_cnt_update(block, &n->in_hw_count, &n->flags, add); + + return 0; +} + +static int u32_reoffload(struct tcf_proto *tp, bool add, tc_setup_cb_t *cb, +void *cb_priv, struct netlink_ext_ack *extack) +{ + struct tc_u_common *tp_c = tp->data; + struct tc_u_hnode *ht; + struct tc_u_knode *n; + unsigned int h; + int err; + + for (ht = rtnl_dereference(tp_c->hlist); +ht; +ht = rtnl_dereference(ht->next)) { + if (ht->prio != tp->prio) + continue; + + /* When adding filters to a new dev, try to offload the +* hashtable first. When removing, do the filters before the +* hashtable. 
+*/ + if (add && !tc_skip_hw(ht->flags)) { + err = u32_reoffload_hnode(tp, ht, add, cb, cb_priv, + extack); + if (err) + return err; + } + + for (h = 0; h <= ht->divisor; h++) { + for (n = rtnl_dereference(ht->ht[h]); +n; +n = rtnl_dereference(n->next)) { + if (tc_skip_hw(n->flags)) + continue; +
[PATCH net-next v2 4/7] net: sched: cls_matchall: implement offload tcf_proto_op
From: John Hurley Add the reoffload tcf_proto_op in matchall to generate an offload message for each filter in the given tcf_proto. Call the specified callback with this new offload message. The function only returns an error if the callback rejects adding a 'hardware only' rule. Ensure matchall flags correctly report if the rule is in hw by keeping a reference counter for the number of instances of the rule offloaded. Only update the flag when this counter changes from or to 0. Signed-off-by: John Hurley Signed-off-by: Jakub Kicinski --- net/sched/cls_matchall.c | 32 1 file changed, 32 insertions(+) diff --git a/net/sched/cls_matchall.c b/net/sched/cls_matchall.c index 47b207ef7762..af16f36ed578 100644 --- a/net/sched/cls_matchall.c +++ b/net/sched/cls_matchall.c @@ -21,6 +21,7 @@ struct cls_mall_head { struct tcf_result res; u32 handle; u32 flags; + unsigned int in_hw_count; struct rcu_work rwork; }; @@ -95,6 +96,7 @@ static int mall_replace_hw_filter(struct tcf_proto *tp, mall_destroy_hw_filter(tp, head, cookie, NULL); return err; } else if (err > 0) { + head->in_hw_count = err; tcf_block_offload_inc(block, &head->flags); } @@ -235,6 +237,35 @@ static void mall_walk(struct tcf_proto *tp, struct tcf_walker *arg) arg->count++; } +static int mall_reoffload(struct tcf_proto *tp, bool add, tc_setup_cb_t *cb, + void *cb_priv, struct netlink_ext_ack *extack) +{ + struct cls_mall_head *head = rtnl_dereference(tp->root); + struct tc_cls_matchall_offload cls_mall = {}; + struct tcf_block *block = tp->chain->block; + int err; + + if (tc_skip_hw(head->flags)) + return 0; + + tc_cls_common_offload_init(&cls_mall.common, tp, head->flags, extack); + cls_mall.command = add ? 
+ TC_CLSMATCHALL_REPLACE : TC_CLSMATCHALL_DESTROY; + cls_mall.exts = &head->exts; + cls_mall.cookie = (unsigned long)head; + + err = cb(TC_SETUP_CLSMATCHALL, &cls_mall, cb_priv); + if (err) { + if (add && tc_skip_sw(head->flags)) + return err; + return 0; + } + + tc_cls_offload_cnt_update(block, &head->in_hw_count, &head->flags, add); + + return 0; +} + static int mall_dump(struct net *net, struct tcf_proto *tp, void *fh, struct sk_buff *skb, struct tcmsg *t) { @@ -289,6 +320,7 @@ static struct tcf_proto_ops cls_mall_ops __read_mostly = { .change = mall_change, .delete = mall_delete, .walk = mall_walk, + .reoffload = mall_reoffload, .dump = mall_dump, .bind_class = mall_bind_class, .owner = THIS_MODULE, -- 2.17.1
[PATCH net-next v2 7/7] net: sched: call reoffload op on block callback reg
From: John Hurley Call the reoffload tcf_proto_op on all tcf_proto nodes in all chains of a block when a callback tries to register to a block that already has offloaded rules. If all existing rules cannot be offloaded then the registration is rejected. This replaces the previous policy of rejecting such callback registration outright. On unregistration of a callback, the rules are flushed for that given cb. The implementation of block sharing in the NFP driver, for example, duplicates shared rules to all devs bound to a block. This meant that rules could still exist in hw even after a device is unbound from a block (assuming the block still remains active). Signed-off-by: John Hurley Signed-off-by: Jakub Kicinski --- v2: - improve extack message. --- .../net/ethernet/mellanox/mlxsw/spectrum.c| 4 +- include/net/pkt_cls.h | 6 ++- net/sched/cls_api.c | 54 --- 3 files changed, 52 insertions(+), 12 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c index d2bc335dda11..52437363766a 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c @@ -1542,7 +1542,7 @@ mlxsw_sp_setup_tc_block_flower_bind(struct mlxsw_sp_port *mlxsw_sp_port, err_block_bind: if (!tcf_block_cb_decref(block_cb)) { - __tcf_block_cb_unregister(block_cb); + __tcf_block_cb_unregister(block, block_cb); err_cb_register: mlxsw_sp_acl_block_destroy(acl_block); } @@ -1572,7 +1572,7 @@ mlxsw_sp_setup_tc_block_flower_unbind(struct mlxsw_sp_port *mlxsw_sp_port, err = mlxsw_sp_acl_block_unbind(mlxsw_sp, acl_block, mlxsw_sp_port, ingress); if (!err && !tcf_block_cb_decref(block_cb)) { - __tcf_block_cb_unregister(block_cb); + __tcf_block_cb_unregister(block, block_cb); mlxsw_sp_acl_block_destroy(acl_block); } } diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h index a2c6d35ba057..4070b8eb6d14 100644 --- a/include/net/pkt_cls.h +++ b/include/net/pkt_cls.h @@ -78,7 +78,8 @@ struct 
tcf_block_cb *__tcf_block_cb_register(struct tcf_block *block, int tcf_block_cb_register(struct tcf_block *block, tc_setup_cb_t *cb, void *cb_ident, void *cb_priv, struct netlink_ext_ack *extack); -void __tcf_block_cb_unregister(struct tcf_block_cb *block_cb); +void __tcf_block_cb_unregister(struct tcf_block *block, + struct tcf_block_cb *block_cb); void tcf_block_cb_unregister(struct tcf_block *block, tc_setup_cb_t *cb, void *cb_ident); @@ -177,7 +178,8 @@ int tcf_block_cb_register(struct tcf_block *block, } static inline -void __tcf_block_cb_unregister(struct tcf_block_cb *block_cb) +void __tcf_block_cb_unregister(struct tcf_block *block, + struct tcf_block_cb *block_cb) { } diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c index 8c9fb4b827a1..bbf8dda96b0e 100644 --- a/net/sched/cls_api.c +++ b/net/sched/cls_api.c @@ -751,19 +751,53 @@ unsigned int tcf_block_cb_decref(struct tcf_block_cb *block_cb) } EXPORT_SYMBOL(tcf_block_cb_decref); +static int +tcf_block_playback_offloads(struct tcf_block *block, tc_setup_cb_t *cb, + void *cb_priv, bool add, bool offload_in_use, + struct netlink_ext_ack *extack) +{ + struct tcf_chain *chain; + struct tcf_proto *tp; + int err; + + list_for_each_entry(chain, &block->chain_list, list) { + for (tp = rtnl_dereference(chain->filter_chain); tp; +tp = rtnl_dereference(tp->next)) { + if (tp->ops->reoffload) { + err = tp->ops->reoffload(tp, add, cb, cb_priv, +extack); + if (err && add) + goto err_playback_remove; + } else if (add && offload_in_use) { + err = -EOPNOTSUPP; + NL_SET_ERR_MSG(extack, "Filter HW offload failed - classifier without re-offloading support"); + goto err_playback_remove; + } + } + } + + return 0; + +err_playback_remove: + tcf_block_playback_offloads(block, cb, cb_priv, false, offload_in_use, + extack); + return err; +} + struct tcf_block_cb *__tcf_block_cb_register(struct tcf_block *block, tc_setup_cb_t *cb, void *cb_ident, void *cb_priv, struct netlink_ext_ack *ext
[PATCH net-next v2 6/7] net: sched: cls_bpf: implement offload tcf_proto_op
From: John Hurley Add the offload tcf_proto_op in cls_bpf to generate an offload message for each bpf prog in the given tcf_proto. Call the specified callback with this new offload message. The function only returns an error if the callback rejects adding a 'hardware only' prog. A prog contains a flag to indicate if it is in hardware or not. To ensure the offload function properly maintains this flag, keep a reference counter for the number of instances of the prog that are in hardware. Only update the flag when this counter changes from or to 0. Signed-off-by: John Hurley Signed-off-by: Jakub Kicinski --- net/sched/cls_bpf.c | 39 +++ 1 file changed, 39 insertions(+) diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c index 1aa7f6511065..66e0ac9811f9 100644 --- a/net/sched/cls_bpf.c +++ b/net/sched/cls_bpf.c @@ -43,6 +43,7 @@ struct cls_bpf_prog { struct tcf_result res; bool exts_integrated; u32 gen_flags; + unsigned int in_hw_count; struct tcf_exts exts; u32 handle; u16 bpf_num_ops; @@ -174,6 +175,7 @@ static int cls_bpf_offload_cmd(struct tcf_proto *tp, struct cls_bpf_prog *prog, cls_bpf_offload_cmd(tp, oldprog, prog, extack); return err; } else if (err > 0) { + prog->in_hw_count = err; tcf_block_offload_inc(block, &prog->gen_flags); } } @@ -652,6 +654,42 @@ static void cls_bpf_walk(struct tcf_proto *tp, struct tcf_walker *arg) } } +static int cls_bpf_reoffload(struct tcf_proto *tp, bool add, tc_setup_cb_t *cb, +void *cb_priv, struct netlink_ext_ack *extack) +{ + struct cls_bpf_head *head = rtnl_dereference(tp->root); + struct tcf_block *block = tp->chain->block; + struct tc_cls_bpf_offload cls_bpf = {}; + struct cls_bpf_prog *prog; + int err; + + list_for_each_entry(prog, &head->plist, link) { + if (tc_skip_hw(prog->gen_flags)) + continue; + + tc_cls_common_offload_init(&cls_bpf.common, tp, prog->gen_flags, + extack); + cls_bpf.command = TC_CLSBPF_OFFLOAD; + cls_bpf.exts = &prog->exts; + cls_bpf.prog = add ? prog->filter : NULL; + cls_bpf.oldprog = add ? 
NULL : prog->filter; + cls_bpf.name = prog->bpf_name; + cls_bpf.exts_integrated = prog->exts_integrated; + + err = cb(TC_SETUP_CLSBPF, &cls_bpf, cb_priv); + if (err) { + if (add && tc_skip_sw(prog->gen_flags)) + return err; + continue; + } + + tc_cls_offload_cnt_update(block, &prog->in_hw_count, + &prog->gen_flags, add); + } + + return 0; +} + static struct tcf_proto_ops cls_bpf_ops __read_mostly = { .kind = "bpf", .owner = THIS_MODULE, @@ -662,6 +700,7 @@ static struct tcf_proto_ops cls_bpf_ops __read_mostly = { .change = cls_bpf_change, .delete = cls_bpf_delete, .walk = cls_bpf_walk, + .reoffload = cls_bpf_reoffload, .dump = cls_bpf_dump, .bind_class = cls_bpf_bind_class, }; -- 2.17.1
[PATCH net-next v2 0/7] net: sched: support replay of filter offload when binding to block
Hi! This series from John adds the ability to replay filter offload requests when a new offload callback is being registered on a TC block. This is most likely to take place for shared blocks today, when a block which already has rules is bound to another interface. Prior to this patch set, if any of the rules were offloaded the block bind would fail. A new tcf_proto_op is added to generate a filter-specific offload request. The new 'reoffload' op supports extack from day 0, hence we need to propagate extack to .ndo_setup_tc TC_BLOCK_BIND/TC_BLOCK_UNBIND and through tcf_block_cb_register() to tcf_block_playback_offloads(). The immediate use of this patch set is to simplify the life of drivers which require duplicating rules when sharing blocks. Switch drivers (mlxsw) can bind ports to rule lists dynamically; NIC drivers generally don't have that ability and need the rules to be duplicated for each ingress they match on. In code terms this means that switch drivers don't register multiple callbacks for each port. NIC drivers do, and get a separate request and hence rule per port, as if the block was not shared. The registration fails today, however, if some rules were already present. As John notes in the description of patch 7, drivers which register multiple callbacks to shared blocks will likely need to flush the rules on block unbind. This set makes the core not only replay the offload add requests but also offload remove requests when a callback is unregistered. v2: - name parameters in patch 2; - use unsigned int instead of u32 for in_hw_count; - improve extack message in patch 7. 
John Hurley (7): net: sched: pass extack pointer to block binds and cb registration net: sched: add tcf_proto_op to offload a rule net: sched: cls_flower: implement offload tcf_proto_op net: sched: cls_matchall: implement offload tcf_proto_op net: sched: cls_u32: implement offload tcf_proto_op net: sched: cls_bpf: implement offload tcf_proto_op net: sched: call reoffload op on block callback reg drivers/net/ethernet/broadcom/bnxt/bnxt.c | 2 +- drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c | 2 +- .../net/ethernet/chelsio/cxgb4/cxgb4_main.c | 2 +- drivers/net/ethernet/intel/i40e/i40e_main.c | 2 +- .../net/ethernet/intel/i40evf/i40evf_main.c | 2 +- drivers/net/ethernet/intel/igb/igb_main.c | 2 +- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 2 +- .../net/ethernet/mellanox/mlx5/core/en_main.c | 2 +- .../net/ethernet/mellanox/mlx5/core/en_rep.c | 2 +- .../net/ethernet/mellanox/mlxsw/spectrum.c| 14 ++- drivers/net/ethernet/netronome/nfp/bpf/main.c | 2 +- .../ethernet/netronome/nfp/flower/offload.c | 2 +- .../net/ethernet/stmicro/stmmac/stmmac_main.c | 2 +- drivers/net/netdevsim/netdev.c| 2 +- include/net/act_api.h | 3 - include/net/pkt_cls.h | 17 ++- include/net/sch_generic.h | 21 net/dsa/slave.c | 2 +- net/sched/cls_api.c | 79 ++--- net/sched/cls_bpf.c | 39 ++ net/sched/cls_flower.c| 44 +++ net/sched/cls_matchall.c | 32 + net/sched/cls_u32.c | 111 ++ 23 files changed, 342 insertions(+), 46 deletions(-) -- 2.17.1
[PATCH net-next v2 3/7] net: sched: cls_flower: implement offload tcf_proto_op
From: John Hurley Add the reoffload tcf_proto_op in flower to generate an offload message for each filter in the given tcf_proto. Call the specified callback with this new offload message. The function only returns an error if the callback rejects adding a 'hardware only' rule. A filter contains a flag to indicate if it is in hardware or not. To ensure the reoffload function properly maintains this flag, keep a reference counter for the number of instances of the filter that are in hardware. Only update the flag when this counter changes from or to 0. Add a generic helper function to implement this behaviour. Signed-off-by: John Hurley Signed-off-by: Jakub Kicinski -- v2: - use unsigned int for counting how many times filter was offloaded. --- include/net/sch_generic.h | 15 + net/sched/cls_flower.c| 44 +++ 2 files changed, 59 insertions(+) diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h index 18adc9142b18..7432100027b7 100644 --- a/include/net/sch_generic.h +++ b/include/net/sch_generic.h @@ -336,6 +336,21 @@ static inline void tcf_block_offload_dec(struct tcf_block *block, u32 *flags) block->offloadcnt--; } +static inline void +tc_cls_offload_cnt_update(struct tcf_block *block, unsigned int *cnt, + u32 *flags, bool add) +{ + if (add) { + if (!*cnt) + tcf_block_offload_inc(block, flags); + (*cnt)++; + } else { + (*cnt)--; + if (!*cnt) + tcf_block_offload_dec(block, flags); + } +} + static inline void qdisc_cb_private_validate(const struct sk_buff *skb, int sz) { struct qdisc_skb_cb *qcb; diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c index 9e8b26a80fb3..352876bb901b 100644 --- a/net/sched/cls_flower.c +++ b/net/sched/cls_flower.c @@ -87,6 +87,7 @@ struct cls_fl_filter { struct list_head list; u32 handle; u32 flags; + unsigned int in_hw_count; struct rcu_work rwork; struct net_device *hw_dev; }; @@ -289,6 +290,7 @@ static int fl_hw_replace_filter(struct tcf_proto *tp, fl_hw_destroy_filter(tp, f, NULL); return err; } else if (err > 0) 
{ + f->in_hw_count = err; tcf_block_offload_inc(block, &f->flags); } @@ -1087,6 +1089,47 @@ static void fl_walk(struct tcf_proto *tp, struct tcf_walker *arg) } } +static int fl_reoffload(struct tcf_proto *tp, bool add, tc_setup_cb_t *cb, + void *cb_priv, struct netlink_ext_ack *extack) +{ + struct cls_fl_head *head = rtnl_dereference(tp->root); + struct tc_cls_flower_offload cls_flower = {}; + struct tcf_block *block = tp->chain->block; + struct fl_flow_mask *mask; + struct cls_fl_filter *f; + int err; + + list_for_each_entry(mask, &head->masks, list) { + list_for_each_entry(f, &mask->filters, list) { + if (tc_skip_hw(f->flags)) + continue; + + tc_cls_common_offload_init(&cls_flower.common, tp, + f->flags, extack); + cls_flower.command = add ? + TC_CLSFLOWER_REPLACE : TC_CLSFLOWER_DESTROY; + cls_flower.cookie = (unsigned long)f; + cls_flower.dissector = &mask->dissector; + cls_flower.mask = &f->mkey; + cls_flower.key = &f->key; + cls_flower.exts = &f->exts; + cls_flower.classid = f->res.classid; + + err = cb(TC_SETUP_CLSFLOWER, &cls_flower, cb_priv); + if (err) { + if (add && tc_skip_sw(f->flags)) + return err; + continue; + } + + tc_cls_offload_cnt_update(block, &f->in_hw_count, + &f->flags, add); + } + } + + return 0; +} + static int fl_dump_key_val(struct sk_buff *skb, void *val, int val_type, void *mask, int mask_type, int len) @@ -1438,6 +1481,7 @@ static struct tcf_proto_ops cls_fl_ops __read_mostly = { .change = fl_change, .delete = fl_delete, .walk = fl_walk, + .reoffload = fl_reoffload, .dump = fl_dump, .bind_class = fl_bind_class, .owner = THIS_MODULE, -- 2.17.1
Re: [PATCH net-next 7/7] net: sched: call reoffload op on block callback reg
Mon, Jun 25, 2018 at 11:10:14PM CEST, jakub.kicin...@netronome.com wrote: >On Mon, 25 Jun 2018 22:58:32 +0200, Jiri Pirko wrote: >> Mon, Jun 25, 2018 at 06:34:31AM CEST, jakub.kicin...@netronome.com wrote: >> >From: John Hurley >> >> [...] >> >> >+static int >> >+tcf_block_playback_offloads(struct tcf_block *block, tc_setup_cb_t *cb, >> >+ void *cb_priv, bool add, bool offload_in_use, >> >+ struct netlink_ext_ack *extack) >> >+{ >> >+ struct tcf_chain *chain; >> >+ struct tcf_proto *tp; >> >+ int err; >> >+ >> >+ list_for_each_entry(chain, &block->chain_list, list) { >> >+ for (tp = rtnl_dereference(chain->filter_chain); tp; >> >+tp = rtnl_dereference(tp->next)) { >> >+ if (tp->ops->reoffload) { >> >+ err = tp->ops->reoffload(tp, add, cb, cb_priv, >> >+extack); >> >+ if (err && add) >> >+ goto err_playback_remove; >> >+ } else if (add && offload_in_use) { >> >+ err = -EOPNOTSUPP; >> >+ NL_SET_ERR_MSG(extack, "Filter replay failed - >> >a filters doesn't support re-offloading"); >> >> This msg sounds weird. Please fix it. > >Indeed.. How about: > >"Filter HW offload failed - classifier without re-offloading support" Sounds good. > >> Otherwise this looks very good to me! Thanks! > >Cool, thanks for the comments! I will respin shortly.
Re: [PATCH rdma-next 01/12] RDMA/uverbs: Protect from attempts to create flows on unsupported QP
On Sun, Jun 24, 2018 at 11:23:42AM +0300, Leon Romanovsky wrote: > From: Leon Romanovsky > > Flows can be created on UD and RAW_PACKET QP types. Attempts to provide > other QP types as an input cause various unpredictable failures. > > The reason for this is that in order to support all the various types (e.g. XRC), > we are supposed to use the real_qp handle and not the qp handle, and let the > driver/FW fail such (XRC) flows. While that would be a valuable solution, the simpler > and safer variant is to ban all QP types except UD and RAW_PACKET, > instead of relying on the driver/FW. > > Cc: # 3.11 > Fixes: 436f2ad05a0b ("IB/core: Export ib_create/destroy_flow through uverbs") > Cc: syzkaller > Reported-by: Noa Osherovich > Signed-off-by: Leon Romanovsky > --- > drivers/infiniband/core/uverbs_cmd.c | 10 -- > 1 file changed, 8 insertions(+), 2 deletions(-) > > diff --git a/drivers/infiniband/core/uverbs_cmd.c > b/drivers/infiniband/core/uverbs_cmd.c > index 779892b63729..c842a9423fbf 100644 > --- a/drivers/infiniband/core/uverbs_cmd.c > +++ b/drivers/infiniband/core/uverbs_cmd.c > @@ -3553,14 +3553,20 @@ int ib_uverbs_ex_create_flow(struct ib_uverbs_file > *file, > goto err_free_attr; > } > > - qp = uobj_get_obj_read(qp, UVERBS_OBJECT_QP, cmd.qp_handle, > file->ucontext); > + qp = uobj_get_obj_read(qp, UVERBS_OBJECT_QP, cmd.qp_handle, > +file->ucontext); This hunk is just a whitespace change > if (!qp) { > err = -EINVAL; > goto err_uobj; > } > > + if (qp->qp_type != IB_QPT_UD && qp->qp_type != IB_QPT_RAW_PACKET) { > + err = -EINVAL; > + goto err_put; > + } > + > flow_attr = kzalloc(struct_size(flow_attr, flows, > - cmd.flow_attr.num_of_specs), GFP_KERNEL); > + cmd.flow_attr.num_of_specs), > GFP_KERNEL); Same here. I dropped the two hunks and applied this to for-rc since it has stable tags. Jason
Re: [PATCH net-next 7/7] net: sched: call reoffload op on block callback reg
On Mon, 25 Jun 2018 22:58:32 +0200, Jiri Pirko wrote: > Mon, Jun 25, 2018 at 06:34:31AM CEST, jakub.kicin...@netronome.com wrote: > >From: John Hurley > > [...] > > >+static int > >+tcf_block_playback_offloads(struct tcf_block *block, tc_setup_cb_t *cb, > >+void *cb_priv, bool add, bool offload_in_use, > >+struct netlink_ext_ack *extack) > >+{ > >+struct tcf_chain *chain; > >+struct tcf_proto *tp; > >+int err; > >+ > >+list_for_each_entry(chain, &block->chain_list, list) { > >+for (tp = rtnl_dereference(chain->filter_chain); tp; > >+ tp = rtnl_dereference(tp->next)) { > >+if (tp->ops->reoffload) { > >+err = tp->ops->reoffload(tp, add, cb, cb_priv, > >+ extack); > >+if (err && add) > >+goto err_playback_remove; > >+} else if (add && offload_in_use) { > >+err = -EOPNOTSUPP; > >+NL_SET_ERR_MSG(extack, "Filter replay failed - > >a filters doesn't support re-offloading"); > > This msg sounds weird. Please fix it. Indeed.. How about: "Filter HW offload failed - classifier without re-offloading support" > Otherwise this looks very good to me! Thanks! Cool, thanks for the comments! I will respin shortly.
[patch iproute2/net-next] tc: introduce support for chain templates
From: Jiri Pirko Signed-off-by: Jiri Pirko --- include/uapi/linux/rtnetlink.h | 7 man/man8/tc.8 | 26 tc/tc_filter.c | 95 ++ tc/tc_monitor.c| 5 ++- 4 files changed, 105 insertions(+), 28 deletions(-) diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h index c3a7d8ecc7b9..dddb05e5cca8 100644 --- a/include/uapi/linux/rtnetlink.h +++ b/include/uapi/linux/rtnetlink.h @@ -150,6 +150,13 @@ enum { RTM_NEWCACHEREPORT = 96, #define RTM_NEWCACHEREPORT RTM_NEWCACHEREPORT + RTM_NEWCHAINTMPLT = 100, +#define RTM_NEWCHAINTMPLT RTM_NEWCHAINTMPLT + RTM_DELCHAINTMPLT, +#define RTM_DELCHAINTMPLT RTM_DELCHAINTMPLT + RTM_GETCHAINTMPLT, +#define RTM_GETCHAINTMPLT RTM_GETCHAINTMPLT + __RTM_MAX, #define RTM_MAX(((__RTM_MAX + 3) & ~3) - 1) }; diff --git a/man/man8/tc.8 b/man/man8/tc.8 index 840880fbdba6..291eeb2ce8ca 100644 --- a/man/man8/tc.8 +++ b/man/man8/tc.8 @@ -58,6 +58,22 @@ tc \- show / manipulate traffic control settings .B flowid \fIflow-id\fR +.B tc +.RI "[ " OPTIONS " ]" +.B filter template [ add | delete | get ] dev +\fIDEV\fR +.B [ parent +\fIqdisc-id\fR +.B | root ]\fR filtertype +[ filtertype specific parameters ] + +.B tc +.RI "[ " OPTIONS " ]" +.B filter template [ add | delete | get ] block +\fIBLOCK_INDEX\fR filtertype +[ filtertype specific parameters ] + + .B tc .RI "[ " OPTIONS " ]" .RI "[ " FORMAT " ]" @@ -80,6 +96,16 @@ tc \- show / manipulate traffic control settings .RI "[ " OPTIONS " ]" .B filter show block \fIBLOCK_INDEX\fR +.P +.B tc +.RI "[ " OPTIONS " ]" +.B filter template show dev +\fIDEV\fR +.P +.B tc +.RI "[ " OPTIONS " ]" +.B filter template show block +\fIBLOCK_INDEX\fR .P .B tc diff --git a/tc/tc_filter.c b/tc/tc_filter.c index c5bb0bffe19b..94d7692fe2fa 100644 --- a/tc/tc_filter.c +++ b/tc/tc_filter.c @@ -39,6 +39,8 @@ static void usage(void) "\n" " tc filter show [ dev STRING ] [ root | ingress | egress | parent CLASSID ]\n" " tc filter show [ block BLOCK_INDEX ]\n" + " tc filter template [ add | del | get | show ] [ dev 
STRING ]\n" + " tc filter template [ add | del | get | show ] [ block BLOCK_INDEX ] ]\n" "Where:\n" "FILTER_TYPE := { rsvp | u32 | bpf | fw | route | etc. }\n" "FILTERID := ... format depends on classifier, see there\n" @@ -85,7 +87,8 @@ static int tc_filter_modify(int cmd, unsigned int flags, int argc, char **argv, req->n.nlmsg_type = cmd; req->t.tcm_family = AF_UNSPEC; - if (cmd == RTM_NEWTFILTER && flags & NLM_F_CREATE) + if ((cmd == RTM_NEWTFILTER || cmd == RTM_NEWCHAINTMPLT) && + flags & NLM_F_CREATE) protocol = htons(ETH_P_ALL); while (argc > 0) { @@ -261,7 +264,10 @@ int print_filter(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg) if (n->nlmsg_type != RTM_NEWTFILTER && n->nlmsg_type != RTM_GETTFILTER && - n->nlmsg_type != RTM_DELTFILTER) { + n->nlmsg_type != RTM_DELTFILTER && + n->nlmsg_type != RTM_NEWCHAINTMPLT && + n->nlmsg_type != RTM_GETCHAINTMPLT && + n->nlmsg_type != RTM_DELCHAINTMPLT) { fprintf(stderr, "Not a filter(cmd %d)\n", n->nlmsg_type); return 0; } @@ -280,20 +286,26 @@ int print_filter(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg) open_json_object(NULL); - if (n->nlmsg_type == RTM_DELTFILTER) + if (n->nlmsg_type == RTM_DELTFILTER || n->nlmsg_type == RTM_DELCHAINTMPLT) print_bool(PRINT_ANY, "deleted", "deleted ", true); - if (n->nlmsg_type == RTM_NEWTFILTER && + if ((n->nlmsg_type == RTM_NEWTFILTER || +n->nlmsg_type == RTM_NEWCHAINTMPLT) && (n->nlmsg_flags & NLM_F_CREATE) && !(n->nlmsg_flags & NLM_F_EXCL)) print_bool(PRINT_ANY, "replaced", "replaced ", true); - if (n->nlmsg_type == RTM_NEWTFILTER && + if ((n->nlmsg_type == RTM_NEWTFILTER || +n->nlmsg_type == RTM_NEWCHAINTMPLT) && (n->nlmsg_flags & NLM_F_CREATE) && (n->nlmsg_flags & NLM_F_EXCL)) print_bool(PRINT_ANY, "added", "added ", true); - print_string(PRINT_FP, NULL, "filter ", NULL); + if (n->nlmsg_type == RTM_NEWTFILTER || + n->nlmsg_type == RTM_DELTFILTER) + print_string(PRINT_FP, NULL, "filter ", NULL); + else + print_string(PRINT_FP, NULL, "filter 
template ", NULL); if (t->tcm_ifindex == TCM_IFINDEX_MAGIC_BLOCK) { if (!filter_block_index || filter_block_index != t->tcm_block_index) @@ -317,7 +329,9 @@ int print_fi
[patch net-next 4/9] net: sched: cls_flower: change fl_init_dissector to accept mask and dissector
From: Jiri Pirko This function is going to be used for templates as well, so we need to pass the pointer separately. Signed-off-by: Jiri Pirko --- net/sched/cls_flower.c | 39 --- 1 file changed, 20 insertions(+), 19 deletions(-) diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c index 76c5516357d5..9ce4375b3252 100644 --- a/net/sched/cls_flower.c +++ b/net/sched/cls_flower.c @@ -793,47 +793,48 @@ static int fl_init_mask_hashtable(struct fl_flow_mask *mask) FL_KEY_SET(keys, cnt, id, member); \ } while(0); -static void fl_init_dissector(struct fl_flow_mask *mask) +static void fl_init_dissector(struct flow_dissector *dissector, + struct fl_flow_key *mask) { struct flow_dissector_key keys[FLOW_DISSECTOR_KEY_MAX]; size_t cnt = 0; FL_KEY_SET(keys, cnt, FLOW_DISSECTOR_KEY_CONTROL, control); FL_KEY_SET(keys, cnt, FLOW_DISSECTOR_KEY_BASIC, basic); - FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt, + FL_KEY_SET_IF_MASKED(mask, keys, cnt, FLOW_DISSECTOR_KEY_ETH_ADDRS, eth); - FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt, + FL_KEY_SET_IF_MASKED(mask, keys, cnt, FLOW_DISSECTOR_KEY_IPV4_ADDRS, ipv4); - FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt, + FL_KEY_SET_IF_MASKED(mask, keys, cnt, FLOW_DISSECTOR_KEY_IPV6_ADDRS, ipv6); - FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt, + FL_KEY_SET_IF_MASKED(mask, keys, cnt, FLOW_DISSECTOR_KEY_PORTS, tp); - FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt, + FL_KEY_SET_IF_MASKED(mask, keys, cnt, FLOW_DISSECTOR_KEY_IP, ip); - FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt, + FL_KEY_SET_IF_MASKED(mask, keys, cnt, FLOW_DISSECTOR_KEY_TCP, tcp); - FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt, + FL_KEY_SET_IF_MASKED(mask, keys, cnt, FLOW_DISSECTOR_KEY_ICMP, icmp); - FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt, + FL_KEY_SET_IF_MASKED(mask, keys, cnt, FLOW_DISSECTOR_KEY_ARP, arp); - FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt, + FL_KEY_SET_IF_MASKED(mask, keys, cnt, FLOW_DISSECTOR_KEY_MPLS, mpls); - FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt, + 
FL_KEY_SET_IF_MASKED(mask, keys, cnt, FLOW_DISSECTOR_KEY_VLAN, vlan); - FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt, + FL_KEY_SET_IF_MASKED(mask, keys, cnt, FLOW_DISSECTOR_KEY_ENC_KEYID, enc_key_id); - FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt, + FL_KEY_SET_IF_MASKED(mask, keys, cnt, FLOW_DISSECTOR_KEY_ENC_IPV4_ADDRS, enc_ipv4); - FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt, + FL_KEY_SET_IF_MASKED(mask, keys, cnt, FLOW_DISSECTOR_KEY_ENC_IPV6_ADDRS, enc_ipv6); - if (FL_KEY_IS_MASKED(&mask->key, enc_ipv4) || - FL_KEY_IS_MASKED(&mask->key, enc_ipv6)) + if (FL_KEY_IS_MASKED(mask, enc_ipv4) || + FL_KEY_IS_MASKED(mask, enc_ipv6)) FL_KEY_SET(keys, cnt, FLOW_DISSECTOR_KEY_ENC_CONTROL, enc_control); - FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt, + FL_KEY_SET_IF_MASKED(mask, keys, cnt, FLOW_DISSECTOR_KEY_ENC_PORTS, enc_tp); - skb_flow_dissector_init(&mask->dissector, keys, cnt); + skb_flow_dissector_init(dissector, keys, cnt); } static struct fl_flow_mask *fl_create_new_mask(struct cls_fl_head *head, @@ -852,7 +853,7 @@ static struct fl_flow_mask *fl_create_new_mask(struct cls_fl_head *head, if (err) goto errout_free; - fl_init_dissector(newmask); + fl_init_dissector(&newmask->dissector, &newmask->key); INIT_LIST_HEAD_RCU(&newmask->filters); -- 2.14.4
[patch net-next 5/9] net: sched: cls_flower: implement chain templates
From: Jiri Pirko Use the previously introduced template extension and implement callback to create, destroy and dump chain template. The existing parsing and dumping functions are re-used. Also, check if newly added filters fit the template if it is set. Signed-off-by: Jiri Pirko --- net/sched/cls_flower.c | 107 - 1 file changed, 106 insertions(+), 1 deletion(-) diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c index 9ce4375b3252..d64d43843a3a 100644 --- a/net/sched/cls_flower.c +++ b/net/sched/cls_flower.c @@ -70,6 +70,13 @@ struct fl_flow_mask { struct list_head list; }; +struct fl_flow_tmplt { + struct fl_flow_key dummy_key; + struct fl_flow_key mask; + struct flow_dissector dissector; + struct tcf_chain *chain; +}; + struct cls_fl_head { struct rhashtable ht; struct list_head masks; @@ -144,6 +151,23 @@ static void fl_set_masked_key(struct fl_flow_key *mkey, struct fl_flow_key *key, *lmkey++ = *lkey++ & *lmask++; } +static bool fl_mask_fits_tmplt(struct fl_flow_tmplt *tmplt, + struct fl_flow_mask *mask) +{ + const long *lmask = fl_key_get_start(&mask->key, mask); + const long *ltmplt; + int i; + + if (!tmplt) + return true; + ltmplt = fl_key_get_start(&tmplt->mask, mask); + for (i = 0; i < fl_mask_range(mask); i += sizeof(long)) { + if (~*ltmplt++ & *lmask++) + return false; + } + return true; +} + static void fl_clear_masked_range(struct fl_flow_key *key, struct fl_flow_mask *mask) { @@ -902,6 +926,7 @@ static int fl_set_parms(struct net *net, struct tcf_proto *tp, struct cls_fl_filter *f, struct fl_flow_mask *mask, unsigned long base, struct nlattr **tb, struct nlattr *est, bool ovr, + struct fl_flow_tmplt *tmplt, struct netlink_ext_ack *extack) { int err; @@ -922,6 +947,11 @@ static int fl_set_parms(struct net *net, struct tcf_proto *tp, fl_mask_update_range(mask); fl_set_masked_key(&f->mkey, &f->key, mask); + if (!fl_mask_fits_tmplt(tmplt, mask)) { + NL_SET_ERR_MSG_MOD(extack, "Mask does not fit the template"); + return -EINVAL; + } + return 0; } 
@@ -932,6 +962,7 @@ static int fl_change(struct net *net, struct sk_buff *in_skb, struct netlink_ext_ack *extack) { struct cls_fl_head *head = rtnl_dereference(tp->root); + struct fl_flow_tmplt *tmplt = tmplt_priv; struct cls_fl_filter *fold = *arg; struct cls_fl_filter *fnew; struct nlattr **tb; @@ -988,7 +1019,7 @@ static int fl_change(struct net *net, struct sk_buff *in_skb, } err = fl_set_parms(net, tp, fnew, &mask, base, tb, tca[TCA_RATE], ovr, - extack); + tmplt, extack); if (err) goto errout_idr; @@ -1089,6 +1120,52 @@ static void fl_walk(struct tcf_proto *tp, struct tcf_walker *arg) } } +static void *fl_tmplt_create(struct net *net, struct tcf_chain *chain, +struct nlattr **tca, +struct netlink_ext_ack *extack) +{ + struct fl_flow_tmplt *tmplt; + struct nlattr **tb; + int err; + + if (!tca[TCA_OPTIONS]) + return ERR_PTR(-EINVAL); + + tb = kcalloc(TCA_FLOWER_MAX + 1, sizeof(struct nlattr *), GFP_KERNEL); + if (!tb) + return ERR_PTR(-ENOBUFS); + err = nla_parse_nested(tb, TCA_FLOWER_MAX, tca[TCA_OPTIONS], + fl_policy, NULL); + if (err) + goto errout_tb; + + tmplt = kzalloc(sizeof(*tmplt), GFP_KERNEL); + if (!tmplt) + goto errout_tb; + tmplt->chain = chain; + err = fl_set_key(net, tb, &tmplt->dummy_key, &tmplt->mask, extack); + if (err) + goto errout_tmplt; + kfree(tb); + + fl_init_dissector(&tmplt->dissector, &tmplt->mask); + + return tmplt; + +errout_tmplt: + kfree(tmplt); +errout_tb: + kfree(tb); + return ERR_PTR(err); +} + +static void fl_tmplt_destroy(void *tmplt_priv) +{ + struct fl_flow_tmplt *tmplt = tmplt_priv; + + kfree(tmplt); +} + static int fl_dump_key_val(struct sk_buff *skb, void *val, int val_type, void *mask, int mask_type, int len) @@ -1435,6 +1512,31 @@ static int fl_dump(struct net *net, struct tcf_proto *tp, void *fh, return -1; } +static int fl_tmplt_dump(struct sk_buff *skb, struct net *net, void *tmplt_priv) +{ + struct fl_flow_tmplt *tmplt = tmplt_priv; + struct fl_flow_key *key, *mask; + struct nlattr *nest; + +
[patch net-next 6/9] net: sched: cls_flower: propagate chain template creation and destruction to drivers
From: Jiri Pirko Introduce a couple of flower offload commands in order to propagate template creation/destruction events down to device drivers. Drivers may use this information to prepare HW in an optimal way for future filter insertions. Signed-off-by: Jiri Pirko --- include/net/pkt_cls.h | 2 ++ net/sched/cls_flower.c | 40 2 files changed, 42 insertions(+) diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h index a3c1a2c47cd4..e83968cf9a70 100644 --- a/include/net/pkt_cls.h +++ b/include/net/pkt_cls.h @@ -715,6 +715,8 @@ enum tc_fl_command { TC_CLSFLOWER_REPLACE, TC_CLSFLOWER_DESTROY, TC_CLSFLOWER_STATS, + TC_CLSFLOWER_TMPLT_CREATE, + TC_CLSFLOWER_TMPLT_DESTROY, }; struct tc_cls_flower_offload { diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c index d64d43843a3a..276ba25a09c3 100644 --- a/net/sched/cls_flower.c +++ b/net/sched/cls_flower.c @@ -1120,6 +1120,43 @@ static void fl_walk(struct tcf_proto *tp, struct tcf_walker *arg) } } +static void fl_hw_create_tmplt(struct tcf_chain *chain, + struct fl_flow_tmplt *tmplt, + struct netlink_ext_ack *extack) +{ + struct tc_cls_flower_offload cls_flower = {}; + struct tcf_block *block = chain->block; + struct tcf_exts dummy_exts = { 0, }; + + cls_flower.common.chain_index = chain->index; + cls_flower.command = TC_CLSFLOWER_TMPLT_CREATE; + cls_flower.cookie = (unsigned long) tmplt; + cls_flower.dissector = &tmplt->dissector; + cls_flower.mask = &tmplt->mask; + cls_flower.key = &tmplt->dummy_key; + cls_flower.exts = &dummy_exts; + + /* We don't care if driver (any of them) fails to handle this +* call. It serves just as a hint for it. 
+*/ + tc_setup_cb_call(block, NULL, TC_SETUP_CLSFLOWER, +&cls_flower, false); +} + +static void fl_hw_destroy_tmplt(struct tcf_chain *chain, + struct fl_flow_tmplt *tmplt) +{ + struct tc_cls_flower_offload cls_flower = {}; + struct tcf_block *block = chain->block; + + cls_flower.common.chain_index = chain->index; + cls_flower.command = TC_CLSFLOWER_TMPLT_DESTROY; + cls_flower.cookie = (unsigned long) tmplt; + + tc_setup_cb_call(block, NULL, TC_SETUP_CLSFLOWER, +&cls_flower, false); +} + static void *fl_tmplt_create(struct net *net, struct tcf_chain *chain, struct nlattr **tca, struct netlink_ext_ack *extack) @@ -1150,6 +1187,8 @@ static void *fl_tmplt_create(struct net *net, struct tcf_chain *chain, fl_init_dissector(&tmplt->dissector, &tmplt->mask); + fl_hw_create_tmplt(chain, tmplt, extack); + return tmplt; errout_tmplt: @@ -1163,6 +1202,7 @@ static void fl_tmplt_destroy(void *tmplt_priv) { struct fl_flow_tmplt *tmplt = tmplt_priv; + fl_hw_destroy_tmplt(tmplt->chain, tmplt); kfree(tmplt); } -- 2.14.4
[patch net-next 7/9] mlxsw: spectrum: Implement chain template hinting
From: Jiri Pirko Since cls_flower provides information about the filter template for a specific chain, use this information in order to prepare a region. Use the template to find out what elements are going to be used and pass that down to mlxsw_sp_acl_tcam_group_add(). Later on, when the first filter is inserted, the mlxsw_sp_acl_tcam_group_use_patterns() function would use this element usage information instead of looking up a pattern. Signed-off-by: Jiri Pirko --- drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 5 +++ drivers/net/ethernet/mellanox/mlxsw/spectrum.h | 12 +- drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c | 12 -- .../ethernet/mellanox/mlxsw/spectrum_acl_tcam.c| 25 ++-- .../net/ethernet/mellanox/mlxsw/spectrum_flower.c | 44 -- 5 files changed, 86 insertions(+), 12 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c index 968b88af2ef5..da19fa343d0b 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c @@ -1441,6 +1441,11 @@ mlxsw_sp_setup_tc_cls_flower(struct mlxsw_sp_acl_block *acl_block, return 0; case TC_CLSFLOWER_STATS: return mlxsw_sp_flower_stats(mlxsw_sp, acl_block, f); + case TC_CLSFLOWER_TMPLT_CREATE: + return mlxsw_sp_flower_tmplt_create(mlxsw_sp, acl_block, f); + case TC_CLSFLOWER_TMPLT_DESTROY: + mlxsw_sp_flower_tmplt_destroy(mlxsw_sp, acl_block, f); + return 0; default: return -EOPNOTSUPP; } diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h index 4a519d8edec8..b0a8e611e730 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h @@ -459,7 +459,8 @@ enum mlxsw_sp_acl_profile { struct mlxsw_sp_acl_profile_ops { size_t ruleset_priv_size; int (*ruleset_add)(struct mlxsw_sp *mlxsw_sp, - void *priv, void *ruleset_priv); + void *priv, void *ruleset_priv, + struct mlxsw_afk_element_usage *tmplt_elusage);
void (*ruleset_del)(struct mlxsw_sp *mlxsw_sp, void *ruleset_priv); int (*ruleset_bind)(struct mlxsw_sp *mlxsw_sp, void *ruleset_priv, struct mlxsw_sp_port *mlxsw_sp_port, @@ -514,7 +515,8 @@ mlxsw_sp_acl_ruleset_lookup(struct mlxsw_sp *mlxsw_sp, struct mlxsw_sp_acl_ruleset * mlxsw_sp_acl_ruleset_get(struct mlxsw_sp *mlxsw_sp, struct mlxsw_sp_acl_block *block, u32 chain_index, -enum mlxsw_sp_acl_profile profile); +enum mlxsw_sp_acl_profile profile, +struct mlxsw_afk_element_usage *tmplt_elusage); void mlxsw_sp_acl_ruleset_put(struct mlxsw_sp *mlxsw_sp, struct mlxsw_sp_acl_ruleset *ruleset); u16 mlxsw_sp_acl_ruleset_group_id(struct mlxsw_sp_acl_ruleset *ruleset); @@ -594,6 +596,12 @@ void mlxsw_sp_flower_destroy(struct mlxsw_sp *mlxsw_sp, int mlxsw_sp_flower_stats(struct mlxsw_sp *mlxsw_sp, struct mlxsw_sp_acl_block *block, struct tc_cls_flower_offload *f); +int mlxsw_sp_flower_tmplt_create(struct mlxsw_sp *mlxsw_sp, +struct mlxsw_sp_acl_block *block, +struct tc_cls_flower_offload *f); +void mlxsw_sp_flower_tmplt_destroy(struct mlxsw_sp *mlxsw_sp, + struct mlxsw_sp_acl_block *block, + struct tc_cls_flower_offload *f); /* spectrum_qdisc.c */ int mlxsw_sp_tc_qdisc_init(struct mlxsw_sp_port *mlxsw_sp_port); diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c index 79b1fa27a9a4..ea42605c451d 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c @@ -319,7 +319,8 @@ int mlxsw_sp_acl_block_unbind(struct mlxsw_sp *mlxsw_sp, static struct mlxsw_sp_acl_ruleset * mlxsw_sp_acl_ruleset_create(struct mlxsw_sp *mlxsw_sp, struct mlxsw_sp_acl_block *block, u32 chain_index, - const struct mlxsw_sp_acl_profile_ops *ops) + const struct mlxsw_sp_acl_profile_ops *ops, + struct mlxsw_afk_element_usage *tmplt_elusage) { struct mlxsw_sp_acl *acl = mlxsw_sp->acl; struct mlxsw_sp_acl_ruleset *ruleset; @@ -339,7 +340,8 @@ mlxsw_sp_acl_ruleset_create(struct 
mlxsw_sp *mlxsw_sp, if (err) goto err_rhashtable_init; - err = ops->ruleset_add(mlxsw_sp, acl->priv, ruleset->priv); + err = ops->ruleset_add(mlxsw_sp, acl->priv, ruleset->priv, +
[patch net-next 3/9] net: sched: cls_flower: move key/mask dumping into a separate function
From: Jiri Pirko Push key/mask dumping from fl_dump() into a separate function fl_dump_key(), that will be reused for template dumping. Signed-off-by: Jiri Pirko --- net/sched/cls_flower.c | 62 ++ 1 file changed, 37 insertions(+), 25 deletions(-) diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c index 09d6c6e67f9d..76c5516357d5 100644 --- a/net/sched/cls_flower.c +++ b/net/sched/cls_flower.c @@ -1217,29 +1217,9 @@ static int fl_dump_key_flags(struct sk_buff *skb, u32 flags_key, u32 flags_mask) return nla_put(skb, TCA_FLOWER_KEY_FLAGS_MASK, 4, &_mask); } -static int fl_dump(struct net *net, struct tcf_proto *tp, void *fh, - struct sk_buff *skb, struct tcmsg *t) +static int fl_dump_key(struct sk_buff *skb, struct net *net, + struct fl_flow_key *key, struct fl_flow_key *mask) { - struct cls_fl_filter *f = fh; - struct nlattr *nest; - struct fl_flow_key *key, *mask; - - if (!f) - return skb->len; - - t->tcm_handle = f->handle; - - nest = nla_nest_start(skb, TCA_OPTIONS); - if (!nest) - goto nla_put_failure; - - if (f->res.classid && - nla_put_u32(skb, TCA_FLOWER_CLASSID, f->res.classid)) - goto nla_put_failure; - - key = &f->key; - mask = &f->mask->key; - if (mask->indev_ifindex) { struct net_device *dev; @@ -1248,9 +1228,6 @@ static int fl_dump(struct net *net, struct tcf_proto *tp, void *fh, goto nla_put_failure; } - if (!tc_skip_hw(f->flags)) - fl_hw_update_stats(tp, f); - if (fl_dump_key_val(skb, key->eth.dst, TCA_FLOWER_KEY_ETH_DST, mask->eth.dst, TCA_FLOWER_KEY_ETH_DST_MASK, sizeof(key->eth.dst)) || @@ -1404,6 +1381,41 @@ static int fl_dump(struct net *net, struct tcf_proto *tp, void *fh, if (fl_dump_key_flags(skb, key->control.flags, mask->control.flags)) goto nla_put_failure; + return 0; + +nla_put_failure: + return -EMSGSIZE; +} + +static int fl_dump(struct net *net, struct tcf_proto *tp, void *fh, + struct sk_buff *skb, struct tcmsg *t) +{ + struct cls_fl_filter *f = fh; + struct nlattr *nest; + struct fl_flow_key *key, *mask; + + if (!f) + return 
skb->len; + + t->tcm_handle = f->handle; + + nest = nla_nest_start(skb, TCA_OPTIONS); + if (!nest) + goto nla_put_failure; + + if (f->res.classid && + nla_put_u32(skb, TCA_FLOWER_CLASSID, f->res.classid)) + goto nla_put_failure; + + key = &f->key; + mask = &f->mask->key; + + if (fl_dump_key(skb, net, key, mask)) + goto nla_put_failure; + + if (!tc_skip_hw(f->flags)) + fl_hw_update_stats(tp, f); + if (f->flags && nla_put_u32(skb, TCA_FLOWER_FLAGS, f->flags)) goto nla_put_failure; -- 2.14.4
[patch net-next 2/9] net: sched: introduce chain templates
From: Jiri Pirko Introduce a group of new tc-rtnl commands to allow user to set per-chain template. Templates lock down individual chains for particular classifier type/options combinations. The classifier needs to support templates, otherwise kernel would reply with error. For example, to lock chain 22 to allow only filters of type flower with destination mac address, user needs to do: chain 22 flower dst_mac 00:00:00:00:00:00/FF:FF:FF:FF:FF:FF In case the chain already contains some filters it is not possible to add or remove template. That is permitted only for empty chains. Alongside with add/del commands, introduce also get/dump and notifications. Signed-off-by: Jiri Pirko --- include/net/sch_generic.h | 14 +- include/uapi/linux/rtnetlink.h | 7 + net/sched/cls_api.c| 371 - net/sched/cls_basic.c | 2 +- net/sched/cls_bpf.c| 3 +- net/sched/cls_cgroup.c | 2 +- net/sched/cls_flow.c | 3 +- net/sched/cls_flower.c | 3 +- net/sched/cls_fw.c | 3 +- net/sched/cls_matchall.c | 3 +- net/sched/cls_route.c | 2 +- net/sched/cls_rsvp.h | 3 +- net/sched/cls_tcindex.c| 2 +- net/sched/cls_u32.c| 2 +- security/selinux/nlmsgtab.c| 2 +- 15 files changed, 405 insertions(+), 17 deletions(-) diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h index 6488daa32f82..f2a27d41fed5 100644 --- a/include/net/sch_generic.h +++ b/include/net/sch_generic.h @@ -235,6 +235,8 @@ struct tcf_result { }; }; +struct tcf_chain; + struct tcf_proto_ops { struct list_headhead; charkind[IFNAMSIZ]; @@ -250,17 +252,25 @@ struct tcf_proto_ops { int (*change)(struct net *net, struct sk_buff *, struct tcf_proto*, unsigned long, u32 handle, struct nlattr **, - void **, bool, + void **, bool, void *tmplt_priv, struct netlink_ext_ack *); int (*delete)(struct tcf_proto *tp, void *arg, bool *last, struct netlink_ext_ack *); void(*walk)(struct tcf_proto*, struct tcf_walker *arg); void(*bind_class)(void *, u32, unsigned long); + void * (*tmplt_create)(struct net *net, + struct tcf_chain *chain, + struct 
nlattr **tca, + struct netlink_ext_ack *extack); + void(*tmplt_destroy)(void *tmplt_priv); /* rtnetlink specific */ int (*dump)(struct net*, struct tcf_proto*, void *, struct sk_buff *skb, struct tcmsg*); + int (*tmplt_dump)(struct sk_buff *skb, + struct net *net, + void *tmplt_priv); struct module *owner; }; @@ -299,6 +309,8 @@ struct tcf_chain { struct tcf_block *block; u32 index; /* chain index */ unsigned int refcnt; + const struct tcf_proto_ops *tmplt_ops; + void *tmplt_priv; }; struct tcf_block { diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h index 7d8502313c99..45fd8cc1fdb2 100644 --- a/include/uapi/linux/rtnetlink.h +++ b/include/uapi/linux/rtnetlink.h @@ -150,6 +150,13 @@ enum { RTM_NEWCACHEREPORT = 96, #define RTM_NEWCACHEREPORT RTM_NEWCACHEREPORT + RTM_NEWCHAINTMPLT = 100, +#define RTM_NEWCHAINTMPLT RTM_NEWCHAINTMPLT + RTM_DELCHAINTMPLT, +#define RTM_DELCHAINTMPLT RTM_DELCHAINTMPLT + RTM_GETCHAINTMPLT, +#define RTM_GETCHAINTMPLT RTM_GETCHAINTMPLT + __RTM_MAX, #define RTM_MAX(((__RTM_MAX + 3) & ~3) - 1) }; diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c index db45931bbada..0c88520f80f2 100644 --- a/net/sched/cls_api.c +++ b/net/sched/cls_api.c @@ -227,7 +227,7 @@ static void tcf_chain_head_change(struct tcf_chain *chain, tcf_chain_head_change_item(item, tp_head); } -static void tcf_chain_flush(struct tcf_chain *chain) +static void tcf_chain_flush(struct tcf_chain *chain, bool destroy_template) { struct tcf_proto *tp = rtnl_dereference(chain->filter_chain); @@ -238,6 +238,11 @@ static void tcf_chain_flush(struct tcf_chain *chain) tp = rtnl_dereference(chain->filter_chain); tcf_chain_put(chain); } + if (destroy_template && chain->tmplt_ops) { + chain->tmplt_ops->tmplt_destroy(
[patch net-next 1/9] net: sched: push ops lookup bits into tcf_proto_lookup_ops()
From: Jiri Pirko Push all bits that take care of ops lookup, including module loading outside tcf_proto_create() function, into tcf_proto_lookup_ops() Signed-off-by: Jiri Pirko --- net/sched/cls_api.c | 53 +++-- 1 file changed, 31 insertions(+), 22 deletions(-) diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c index cdc3c87c53e6..db45931bbada 100644 --- a/net/sched/cls_api.c +++ b/net/sched/cls_api.c @@ -39,7 +39,7 @@ static DEFINE_RWLOCK(cls_mod_lock); /* Find classifier type by string name */ -static const struct tcf_proto_ops *tcf_proto_lookup_ops(const char *kind) +static const struct tcf_proto_ops *__tcf_proto_lookup_ops(const char *kind) { const struct tcf_proto_ops *t, *res = NULL; @@ -57,6 +57,33 @@ static const struct tcf_proto_ops *tcf_proto_lookup_ops(const char *kind) return res; } +static const struct tcf_proto_ops * +tcf_proto_lookup_ops(const char *kind, struct netlink_ext_ack *extack) +{ + const struct tcf_proto_ops *ops; + + ops = __tcf_proto_lookup_ops(kind); + if (ops) + return ops; +#ifdef CONFIG_MODULES + rtnl_unlock(); + request_module("cls_%s", kind); + rtnl_lock(); + ops = __tcf_proto_lookup_ops(kind); + /* We dropped the RTNL semaphore in order to perform +* the module load. So, even if we succeeded in loading +* the module we have to replay the request. We indicate +* this using -EAGAIN. 
+*/ + if (ops) { + module_put(ops->owner); + return ERR_PTR(-EAGAIN); + } +#endif + NL_SET_ERR_MSG(extack, "TC classifier not found"); + return ERR_PTR(-ENOENT); +} + /* Register(unregister) new classifier type */ int register_tcf_proto_ops(struct tcf_proto_ops *ops) @@ -133,27 +160,9 @@ static struct tcf_proto *tcf_proto_create(const char *kind, u32 protocol, if (!tp) return ERR_PTR(-ENOBUFS); - err = -ENOENT; - tp->ops = tcf_proto_lookup_ops(kind); - if (!tp->ops) { -#ifdef CONFIG_MODULES - rtnl_unlock(); - request_module("cls_%s", kind); - rtnl_lock(); - tp->ops = tcf_proto_lookup_ops(kind); - /* We dropped the RTNL semaphore in order to perform -* the module load. So, even if we succeeded in loading -* the module we have to replay the request. We indicate -* this using -EAGAIN. -*/ - if (tp->ops) { - module_put(tp->ops->owner); - err = -EAGAIN; - } else { - NL_SET_ERR_MSG(extack, "TC classifier not found"); - err = -ENOENT; - } -#endif + tp->ops = tcf_proto_lookup_ops(kind, extack); + if (IS_ERR(tp->ops)) { + err = PTR_ERR(tp->ops); goto errout; } tp->classify = tp->ops->classify; -- 2.14.4
[patch net-next 0/9] net: sched: introduce chain templates support with offloading to mlxsw
From: Jiri Pirko

For the TC clsact offload these days, some HW drivers need to hold a magic ball. The reason is that with the first rule inserted into HW they need to guess which fields will be used for matching. If this guess later proves wrong and the user adds a filter matching on a different field, there's a problem. Mlxsw currently resolves this with a couple of patterns that try to cover as many match fields as possible. This approach is far from optimal, both performance-wise and scale-wise. Also, there are combinations of filters that, inserted in a certain order, won't succeed.

Most of the time, when a user inserts filters into a chain, he knows right away what the filters are going to look like - what type and options they will have. For example, he knows that he will only insert filters of type flower matching on destination IP address. He can specify a template that covers all the filters in the chain. This patchset gives the user the possibility to provide such a template to the kernel and propagates it all the way down to device drivers. See the examples below.
Create a dummy device with clsact first:

# ip link add type dummy
# tc qdisc add dev dummy0 clsact

There is no template assigned by default:

# tc filter template show dev dummy0 ingress

Add a template of type flower that allows inserting rules matching on the last 2 bytes of the destination MAC address:

# tc filter template add dev dummy0 ingress proto ip flower dst_mac 00:00:00:00:00:00/00:00:00:00:FF:FF

The template is now shown in the list:

# tc filter template show dev dummy0 ingress
filter flower chain 0 dst_mac 00:00:00:00:00:00/00:00:00:00:ff:ff eth_type ipv4

Add another template, this time for chain number 22:

# tc filter template add dev dummy0 ingress proto ip chain 22 flower dst_ip 0.0.0.0/16
# tc filter template show dev dummy0 ingress
filter flower chain 0 dst_mac 00:00:00:00:00:00/00:00:00:00:ff:ff eth_type ipv4
filter flower chain 22 eth_type ipv4 dst_ip 0.0.0.0/16

Add a filter that fits the template:

# tc filter add dev dummy0 ingress proto ip flower dst_mac aa:bb:cc:dd:ee:ff/00:00:00:00:00:0F action drop

Adding filters that do not fit the template fails:

# tc filter add dev dummy0 ingress proto ip flower dst_mac aa:11:22:33:44:55/00:00:00:FF:00:00 action drop
Error: Mask does not fit the template.
We have an error talking to the kernel, -1
# tc filter add dev dummy0 ingress proto ip flower dst_ip 10.0.0.1 action drop
Error: Mask does not fit the template.
We have an error talking to the kernel, -1

Adding filters to chain 22:

# tc filter add dev dummy0 ingress proto ip chain 22 flower dst_ip 10.0.0.1/8 action drop
# tc filter add dev dummy0 ingress proto ip chain 22 flower dst_ip 10.0.0.1 action drop
Error: Mask does not fit the template.
We have an error talking to the kernel, -1
# tc filter add dev dummy0 ingress proto ip chain 22 flower dst_ip 10.0.0.1/24 action drop
Error: Mask does not fit the template.
We have an error talking to the kernel, -1

Removing a template from a non-empty chain fails:

# tc filter template del dev dummy0 ingress
Error: The chain is not empty, unable to delete template.
We have an error talking to the kernel, -1

Once the chain is flushed, the template can be removed:

# tc filter del dev dummy0 ingress
# tc filter template del dev dummy0 ingress

Jiri Pirko (9):
  net: sched: push ops lookup bits into tcf_proto_lookup_ops()
  net: sched: introduce chain templates
  net: sched: cls_flower: move key/mask dumping into a separate function
  net: sched: cls_flower: change fl_init_dissector to accept mask and dissector
  net: sched: cls_flower: implement chain templates
  net: sched: cls_flower: propagate chain template creation and destruction to drivers
  mlxsw: spectrum: Implement chain template hinting
  selftests: forwarding: move shblock tc support check to a separate helper
  selftests: forwarding: add tests for TC chain templates

 drivers/net/ethernet/mellanox/mlxsw/spectrum.c     |   5 +
 drivers/net/ethernet/mellanox/mlxsw/spectrum.h     |  12 +-
 drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c |  12 +-
 .../ethernet/mellanox/mlxsw/spectrum_acl_tcam.c    |  25 +-
 .../net/ethernet/mellanox/mlxsw/spectrum_flower.c  |  44 ++-
 include/net/pkt_cls.h                              |   2 +
 include/net/sch_generic.h                          |  14 +-
 include/uapi/linux/rtnetlink.h                     |   7 +
 net/sched/cls_api.c                                | 424 +++--
 net/sched/cls_basic.c                              |   2 +-
 net/sched/cls_bpf.c                                |   3 +-
 net/sched/cls_cgroup.c                             |   2 +-
 net/sched/cls_flow.c                               |   3 +-
 net/sched/cls_flower.c                             | 251 +---
 net/sched/cls_fw.c                                 |   3 +-
 net/sched/cls_matchall.c                           |   3 +-
 net/sched/cls_route.c
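The "fits the template" rule in the examples above boils down to a bitwise check: a filter's mask may not set any bit that the template's mask leaves out. A minimal sketch of that check (the helper name and the 32-bit values are illustrative, not taken from the kernel sources):

```shell
# mask_fits_template: a filter fits iff its mask sets no bit outside
# the template mask, i.e. (filter_mask & ~template_mask) == 0
mask_fits_template()
{
	local template_mask=$1 filter_mask=$2
	(( (filter_mask & ~template_mask) == 0 ))
}

# template covering only the last 2 bytes (cf. 00:00:00:00:FF:FF above),
# using the low 32 bits for brevity
template=0x0000FFFF
mask_fits_template $template 0x0000000F && echo "fits"          # prints: fits
mask_fits_template $template 0x00FF0000 || echo "does not fit"  # prints: does not fit
```

The same rule explains the dst_ip cases above: a /8 mask is a subset of the /16 template mask, while /24 and /32 masks set bits the template does not cover.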
[patch net-next 9/9] selftests: forwarding: add tests for TC chain templates
From: Jiri Pirko

Add basic sanity tests for TC chain templates.

Signed-off-by: Jiri Pirko
---
 tools/testing/selftests/net/forwarding/lib.sh      |   9 ++
 .../selftests/net/forwarding/tc_chaintemplates.sh  | 160 +
 2 files changed, 169 insertions(+)
 create mode 100755 tools/testing/selftests/net/forwarding/tc_chaintemplates.sh

diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
index a736d1d7ecdb..128a5b5a8ea9 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -39,6 +39,15 @@ check_tc_shblock_support()
 	fi
 }
 
+check_tc_chaintemplate_support()
+{
+	tc filter help 2>&1 | grep template &> /dev/null
+	if [[ $? -ne 0 ]]; then
+		echo "SKIP: iproute2 too old; tc is missing chain template support"
+		exit 1
+	fi
+}
+
 if [[ "$(id -u)" -ne 0 ]]; then
 	echo "SKIP: need root privileges"
 	exit 0
diff --git a/tools/testing/selftests/net/forwarding/tc_chaintemplates.sh b/tools/testing/selftests/net/forwarding/tc_chaintemplates.sh
new file mode 100755
index ..21f2c18e973a
--- /dev/null
+++ b/tools/testing/selftests/net/forwarding/tc_chaintemplates.sh
@@ -0,0 +1,160 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+ALL_TESTS="template_create_destroy template_filter_fits \
+	   template_create_nonempty template_destroy_nonempty"
+NUM_NETIFS=2
+source tc_common.sh
+source lib.sh
+
+h1_create()
+{
+	simple_if_init $h1 192.0.2.1/24
+}
+
+h1_destroy()
+{
+	simple_if_fini $h1 192.0.2.1/24
+}
+
+h2_create()
+{
+	simple_if_init $h2 192.0.2.2/24
+	tc qdisc add dev $h2 clsact
+}
+
+h2_destroy()
+{
+	tc qdisc del dev $h2 clsact
+	simple_if_fini $h2 192.0.2.2/24
+}
+
+template_create_destroy()
+{
+	RET=0
+
+	tc filter template add dev $h2 ingress protocol ip \
+		flower dst_mac 00:00:00:00:00:00/FF:FF:FF:FF:FF:FF
+	check_err $? "Failed to create template for default chain"
+
+	tc filter template add dev $h2 ingress chain 1 protocol ip \
+		flower dst_mac 00:00:00:00:00:00/FF:FF:FF:FF:FF:FF
+	check_err $? "Failed to create template for chain 1"
+
+	tc filter template del dev $h2 ingress
+	check_err $? "Failed to destroy template for default chain"
+
+	tc filter template del dev $h2 ingress chain 1
+	check_err $? "Failed to destroy template for chain 1"
+
+	log_test "template create destroy"
+}
+
+template_filter_fits()
+{
+	RET=0
+
+	tc filter template add dev $h2 ingress protocol ip \
+		flower dst_mac 00:00:00:00:00:00/FF:FF:FF:FF:FF:FF &> /dev/null
+	tc filter template add dev $h2 ingress chain 1 protocol ip \
+		flower src_mac 00:00:00:00:00:00/FF:FF:FF:FF:FF:FF &> /dev/null
+
+	tc filter add dev $h2 ingress protocol ip pref 1 handle 1101 \
+		flower dst_mac $h2mac action drop
+	check_err $? "Failed to insert filter which fits template"
+
+	tc filter add dev $h2 ingress protocol ip pref 1 handle 1102 \
+		flower src_mac $h2mac action drop &> /dev/null
+	check_fail $? "Incorrectly succeeded to insert filter which does not fit template"
+
+	tc filter add dev $h2 ingress chain 1 protocol ip pref 1 handle 1101 \
+		flower src_mac $h2mac action drop
+	check_err $? "Failed to insert filter which fits template"
+
+	tc filter add dev $h2 ingress chain 1 protocol ip pref 1 handle 1102 \
+		flower dst_mac $h2mac action drop &> /dev/null
+	check_fail $? "Incorrectly succeeded to insert filter which does not fit template"
+
+	tc filter del dev $h2 ingress chain 1 protocol ip pref 1 handle 1102 \
+		flower &> /dev/null
+	tc filter del dev $h2 ingress chain 1 protocol ip pref 1 handle 1101 \
+		flower &> /dev/null
+
+	tc filter del dev $h2 ingress protocol ip pref 1 handle 1102 \
+		flower &> /dev/null
+	tc filter del dev $h2 ingress protocol ip pref 1 handle 1101 \
+		flower &> /dev/null
+
+	tc filter template del dev $h2 ingress chain 1
+	tc filter template del dev $h2 ingress
+
+	log_test "template filter fits"
+}
+
+template_create_nonempty()
+{
+	RET=0
+
+	tc filter add dev $h2 ingress protocol ip pref 1 handle 1101 \
+		flower dst_mac $h2mac action drop
+	tc filter template add dev $h2 ingress protocol ip \
+		flower dst_mac 00:00:00:00:00:00/FF:FF:FF:FF:FF:FF &> /dev/null
+	check_fail $? "Incorrectly succeeded to create template for non-empty chain"
+
+	tc filter template del dev $h2 ingress &> /dev/null
+	tc filter del dev $h2 ingress protocol ip pref 1 handle 1101 flower
+
+	log_test "template create non-empty"
+}
+
+template_destroy_nonempty()
+{
+	RET=0
+
+	tc filter template ad
[patch net-next 8/9] selftests: forwarding: move shblock tc support check to a separate helper
From: Jiri Pirko

The shared block support is only needed for tc_shblock.sh. No need to require it for the other tests.

Signed-off-by: Jiri Pirko
---
 tools/testing/selftests/net/forwarding/lib.sh         | 3 +++
 tools/testing/selftests/net/forwarding/tc_shblocks.sh | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
index 7b18a53aa556..a736d1d7ecdb 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -28,7 +28,10 @@ check_tc_version()
 		echo "SKIP: iproute2 too old; tc is missing JSON support"
 		exit 1
 	fi
+}
 
+check_tc_shblock_support()
+{
 	tc filter help 2>&1 | grep block &> /dev/null
 	if [[ $? -ne 0 ]]; then
 		echo "SKIP: iproute2 too old; tc is missing shared block support"
diff --git a/tools/testing/selftests/net/forwarding/tc_shblocks.sh b/tools/testing/selftests/net/forwarding/tc_shblocks.sh
index b5b917203815..9826a446e2c0 100755
--- a/tools/testing/selftests/net/forwarding/tc_shblocks.sh
+++ b/tools/testing/selftests/net/forwarding/tc_shblocks.sh
@@ -105,6 +105,8 @@ cleanup()
 	ip link set $swp2 address $swp2origmac
 }
 
+check_tc_shblock_support
+
 trap cleanup EXIT
 
 setup_prepare
-- 
2.14.4
Re: [PATCH net-next 7/7] net: sched: call reoffload op on block callback reg
Mon, Jun 25, 2018 at 06:34:31AM CEST, jakub.kicin...@netronome.com wrote: >From: John Hurley [...] >+static int >+tcf_block_playback_offloads(struct tcf_block *block, tc_setup_cb_t *cb, >+ void *cb_priv, bool add, bool offload_in_use, >+ struct netlink_ext_ack *extack) >+{ >+ struct tcf_chain *chain; >+ struct tcf_proto *tp; >+ int err; >+ >+ list_for_each_entry(chain, &block->chain_list, list) { >+ for (tp = rtnl_dereference(chain->filter_chain); tp; >+ tp = rtnl_dereference(tp->next)) { >+ if (tp->ops->reoffload) { >+ err = tp->ops->reoffload(tp, add, cb, cb_priv, >+ extack); >+ if (err && add) >+ goto err_playback_remove; >+ } else if (add && offload_in_use) { >+ err = -EOPNOTSUPP; >+ NL_SET_ERR_MSG(extack, "Filter replay failed - >a filters doesn't support re-offloading"); This msg sounds weird. Please fix it. Otherwise this looks very good to me! Thanks!
Re: [PATCH net 2/2] nfp: reject binding to shared blocks
Mon, Jun 25, 2018 at 10:50:14PM CEST, jakub.kicin...@netronome.com wrote: >On Mon, 25 Jun 2018 22:40:21 +0200, Jiri Pirko wrote: >> Mon, Jun 25, 2018 at 10:22:46PM CEST, jakub.kicin...@netronome.com wrote: >> >From: John Hurley >> > >> >TC shared blocks allow multiple qdiscs to be grouped together and filters >> >shared between them. Currently the chains of filters attached to a block >> >are only flushed when the block is removed. If a qdisc is removed from a >> >block but the block still exists, flow del messages are not passed to the >> >callback registered for that qdisc. For the NFP, this presents the >> >possibility of rules still existing in hw when they should be removed. >> > >> >Prevent binding to shared blocks until the kernel can send per qdisc del >> >messages when block unbinds occur. >> >> This is not nfp-specific problem. Should be handled differently. The >> driver has information about offloaded filters. On unbind, it have >> enough info to do the flush, doesn't it? > >Certainly. But this fix is simpler and sufficient. We need to >backport it back to 4.16. If we have to go through driver tables >and flush filters we may as well merge the reoffload series to net. Oh, I missed this is for net. Sorry. Acked-by: Jiri Pirko
Re: [PATCH net 2/2] nfp: reject binding to shared blocks
On Mon, 25 Jun 2018 22:40:21 +0200, Jiri Pirko wrote: > Mon, Jun 25, 2018 at 10:22:46PM CEST, jakub.kicin...@netronome.com wrote: > >From: John Hurley > > > >TC shared blocks allow multiple qdiscs to be grouped together and filters > >shared between them. Currently the chains of filters attached to a block > >are only flushed when the block is removed. If a qdisc is removed from a > >block but the block still exists, flow del messages are not passed to the > >callback registered for that qdisc. For the NFP, this presents the > >possibility of rules still existing in hw when they should be removed. > > > >Prevent binding to shared blocks until the kernel can send per qdisc del > >messages when block unbinds occur. > > This is not nfp-specific problem. Should be handled differently. The > driver has information about offloaded filters. On unbind, it have > enough info to do the flush, doesn't it? Certainly. But this fix is simpler and sufficient. We need to backport it back to 4.16. If we have to go through driver tables and flush filters we may as well merge the reoffload series to net.
Re: [PATCH net-next 3/7] net: sched: cls_flower: implement offload tcf_proto_op
Mon, Jun 25, 2018 at 06:34:27AM CEST, jakub.kicin...@netronome.com wrote: >From: John Hurley > >Add the reoffload tcf_proto_op in flower to generate an offload message >for each filter in the given tcf_proto. Call the specified callback with >this new offload message. The function only returns an error if the >callback rejects adding a 'hardware only' rule. > >A filter contains a flag to indicate if it is in hardware or not. To >ensure the reoffload function properly maintains this flag, keep a >reference counter for the number of instances of the filter that are in >hardware. Only update the flag when this counter changes from or to 0. Add >a generic helper function to implement this behaviour. > >Signed-off-by: John Hurley >Signed-off-by: Jakub Kicinski >--- > include/net/sch_generic.h | 15 + > net/sched/cls_flower.c| 44 +++ > 2 files changed, 59 insertions(+) > >diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h >index 88ed64f60056..c0bd11a928ed 100644 >--- a/include/net/sch_generic.h >+++ b/include/net/sch_generic.h >@@ -336,6 +336,21 @@ static inline void tcf_block_offload_dec(struct tcf_block >*block, u32 *flags) > block->offloadcnt--; > } > >+static inline void >+tc_cls_offload_cnt_update(struct tcf_block *block, u32 *cnt, u32 *flags, Why u32? This is not uapi or hw facing. Just use "unsigned int". [...]
Re: [PATCH v3,net-next] vlan: implement vlan id and protocol changes
On 6/25/18 4:30 AM, Chas Williams wrote: > vlan_changelink silently ignores attempts to change the vlan id > or protocol id of an existing vlan interface. Implement by adding > the new vlan id and protocol to the interface's vlan group and then > removing the old vlan id and protocol from the vlan group. > > Signed-off-by: Chas Williams <3ch...@gmail.com> > --- > include/linux/netdevice.h | 1 + > net/8021q/vlan.c | 4 ++-- > net/8021q/vlan.h | 2 ++ > net/8021q/vlan_netlink.c | 38 ++ > net/core/dev.c| 1 + > 5 files changed, 44 insertions(+), 2 deletions(-) > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h > index 3ec9850c7936..a95ae238addf 100644 > --- a/include/linux/netdevice.h > +++ b/include/linux/netdevice.h > @@ -2409,6 +2409,7 @@ enum netdev_cmd { > NETDEV_CVLAN_FILTER_DROP_INFO, > NETDEV_SVLAN_FILTER_PUSH_INFO, > NETDEV_SVLAN_FILTER_DROP_INFO, > + NETDEV_CHANGEVLAN, > }; > const char *netdev_cmd_to_name(enum netdev_cmd cmd); > you add the new notifier, but do not add any hooks to catch and process it. Personally, I think it is a bit sketchy to change the vlan id on an existing device and I suspect it will cause latent errors. What's your use case for trying to implement the change versus causing it to generate an unsupported error? If this patch does get accepted, I believe the mlxsw switchdev driver will be impacted. > diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c > index 73a65789271b..b5e0ad1a581a 100644 > --- a/net/8021q/vlan.c > +++ b/net/8021q/vlan.c > @@ -51,8 +51,8 @@ const char vlan_version[] = DRV_VERSION; > > /* End of global variables definitions. 
*/ > > -static int vlan_group_prealloc_vid(struct vlan_group *vg, > -__be16 vlan_proto, u16 vlan_id) > +int vlan_group_prealloc_vid(struct vlan_group *vg, > + __be16 vlan_proto, u16 vlan_id) > { > struct net_device **array; > unsigned int pidx, vidx; > diff --git a/net/8021q/vlan.h b/net/8021q/vlan.h > index 44df1c3df02d..c734dd21d70d 100644 > --- a/net/8021q/vlan.h > +++ b/net/8021q/vlan.h > @@ -116,6 +116,8 @@ int register_vlan_dev(struct net_device *dev, struct > netlink_ext_ack *extack); > void unregister_vlan_dev(struct net_device *dev, struct list_head *head); > bool vlan_dev_inherit_address(struct net_device *dev, > struct net_device *real_dev); > +int vlan_group_prealloc_vid(struct vlan_group *vg, > + __be16 vlan_proto, u16 vlan_id); > > static inline u32 vlan_get_ingress_priority(struct net_device *dev, > u16 vlan_tci) > diff --git a/net/8021q/vlan_netlink.c b/net/8021q/vlan_netlink.c > index 9b60c1e399e2..ee27781939e0 100644 > --- a/net/8021q/vlan_netlink.c > +++ b/net/8021q/vlan_netlink.c > @@ -107,10 +107,48 @@ static int vlan_changelink(struct net_device *dev, > struct nlattr *tb[], > struct nlattr *data[], > struct netlink_ext_ack *extack) > { > + struct vlan_dev_priv *vlan = vlan_dev_priv(dev); > struct ifla_vlan_flags *flags; > struct ifla_vlan_qos_mapping *m; > struct nlattr *attr; > int rem; > + __be16 vlan_proto = vlan->vlan_proto; > + u16 vlan_id = vlan->vlan_id; > + > + if (data[IFLA_VLAN_ID]) > + vlan_id = nla_get_u16(data[IFLA_VLAN_ID]); > + > + if (data[IFLA_VLAN_PROTOCOL]) > + vlan_proto = nla_get_be16(data[IFLA_VLAN_PROTOCOL]); > + > + if (vlan->vlan_id != vlan_id || vlan->vlan_proto != vlan_proto) { > + struct net_device *real_dev = vlan->real_dev; > + struct vlan_info *vlan_info; > + struct vlan_group *grp; > + __be16 old_vlan_proto = vlan->vlan_proto; > + u16 old_vlan_id = vlan->vlan_id; > + int err; > + > + err = vlan_vid_add(real_dev, vlan_proto, vlan_id); > + if (err) > + return err; > + vlan_info = 
rtnl_dereference(real_dev->vlan_info); > + grp = &vlan_info->grp; > + err = vlan_group_prealloc_vid(grp, vlan_proto, vlan_id); > + if (err < 0) { > + vlan_vid_del(real_dev, vlan_proto, vlan_id); > + return err; > + } > + vlan_group_set_device(grp, vlan_proto, vlan_id, dev); > + vlan->vlan_proto = vlan_proto; > + vlan->vlan_id = vlan_id; > + > + vlan_group_set_device(grp, old_vlan_proto, old_vlan_id, NULL); > + vlan_vid_del(real_dev, old_vlan_proto, old_vlan_id); > + > + err = call_netdevice_notifiers(NETDEV_CHANGEVLAN, dev); > + notifier_to_errno(err); > + } > > if (data[IFLA_VLAN_FLAGS]) { > flags = nla_data(data[IFLA_VLAN_FLAGS]); > diff --git a/net/core/dev.c b/net/core/d
Re: [PATCH net-next 2/7] net: sched: add tcf_proto_op to offload a rule
Mon, Jun 25, 2018 at 06:34:26AM CEST, jakub.kicin...@netronome.com wrote: >From: John Hurley > >Create a new tcf_proto_op called 'reoffload' that generates a new offload >message for each node in a tcf_proto. Pointers to the tcf_proto and >whether the offload request is to add or delete the node are included. >Also included is a callback function to send the offload message to and >the option of priv data to go with the cb. > >Signed-off-by: John Hurley >Signed-off-by: Jakub Kicinski >--- > include/net/act_api.h | 3 --- > include/net/sch_generic.h | 6 ++ > 2 files changed, 6 insertions(+), 3 deletions(-) > >diff --git a/include/net/act_api.h b/include/net/act_api.h >index 9e59ebfded62..5ff11adbe2a6 100644 >--- a/include/net/act_api.h >+++ b/include/net/act_api.h >@@ -190,9 +190,6 @@ static inline void tcf_action_stats_update(struct >tc_action *a, u64 bytes, > #endif > } > >-typedef int tc_setup_cb_t(enum tc_setup_type type, >-void *type_data, void *cb_priv); >- > #ifdef CONFIG_NET_CLS_ACT > int tc_setup_cb_egdev_register(const struct net_device *dev, > tc_setup_cb_t *cb, void *cb_priv); >diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h >index 6488daa32f82..88ed64f60056 100644 >--- a/include/net/sch_generic.h >+++ b/include/net/sch_generic.h >@@ -20,6 +20,9 @@ struct qdisc_walker; > struct tcf_walker; > struct module; > >+typedef int tc_setup_cb_t(enum tc_setup_type type, >+void *type_data, void *cb_priv); >+ > struct qdisc_rate_table { > struct tc_ratespec rate; > u32 data[256]; >@@ -256,6 +259,9 @@ struct tcf_proto_ops { > bool *last, > struct netlink_ext_ack *); > void(*walk)(struct tcf_proto*, struct tcf_walker > *arg); >+ int (*reoffload)(struct tcf_proto *, bool, >+ tc_setup_cb_t *, void *, >+ struct netlink_ext_ack *); Please, name the args. Unnamed args are annoying... > void(*bind_class)(void *, u32, unsigned long); > > /* rtnetlink specific */ >-- >2.17.1 >
Re: [PATCH net 2/2] nfp: reject binding to shared blocks
Mon, Jun 25, 2018 at 10:22:46PM CEST, jakub.kicin...@netronome.com wrote: >From: John Hurley > >TC shared blocks allow multiple qdiscs to be grouped together and filters >shared between them. Currently the chains of filters attached to a block >are only flushed when the block is removed. If a qdisc is removed from a >block but the block still exists, flow del messages are not passed to the >callback registered for that qdisc. For the NFP, this presents the >possibility of rules still existing in hw when they should be removed. > >Prevent binding to shared blocks until the kernel can send per qdisc del >messages when block unbinds occur. This is not nfp-specific problem. Should be handled differently. The driver has information about offloaded filters. On unbind, it have enough info to do the flush, doesn't it? > >Fixes: 4861738775d7 ("net: sched: introduce shared filter blocks >infrastructure") >Signed-off-by: John Hurley >Signed-off-by: Jakub Kicinski >Reviewed-by: Simon Horman >--- > drivers/net/ethernet/netronome/nfp/bpf/main.c | 3 +++ > drivers/net/ethernet/netronome/nfp/flower/offload.c | 3 +++ > 2 files changed, 6 insertions(+) > >diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.c >b/drivers/net/ethernet/netronome/nfp/bpf/main.c >index fcdfb8e7fdea..6b15e3b11956 100644 >--- a/drivers/net/ethernet/netronome/nfp/bpf/main.c >+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.c >@@ -202,6 +202,9 @@ static int nfp_bpf_setup_tc_block(struct net_device >*netdev, > if (f->binder_type != TCF_BLOCK_BINDER_TYPE_CLSACT_INGRESS) > return -EOPNOTSUPP; > >+ if (tcf_block_shared(f->block)) >+ return -EOPNOTSUPP; >+ > switch (f->command) { > case TC_BLOCK_BIND: > return tcf_block_cb_register(f->block, >diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c >b/drivers/net/ethernet/netronome/nfp/flower/offload.c >index 477f584f6d28..525057bee0ed 100644 >--- a/drivers/net/ethernet/netronome/nfp/flower/offload.c >+++ 
b/drivers/net/ethernet/netronome/nfp/flower/offload.c >@@ -631,6 +631,9 @@ static int nfp_flower_setup_tc_block(struct net_device >*netdev, > if (f->binder_type != TCF_BLOCK_BINDER_TYPE_CLSACT_INGRESS) > return -EOPNOTSUPP; > >+ if (tcf_block_shared(f->block)) >+ return -EOPNOTSUPP; >+ > switch (f->command) { > case TC_BLOCK_BIND: > return tcf_block_cb_register(f->block, >-- >2.17.1 >
[PATCH net 2/2] nfp: reject binding to shared blocks
From: John Hurley TC shared blocks allow multiple qdiscs to be grouped together and filters shared between them. Currently the chains of filters attached to a block are only flushed when the block is removed. If a qdisc is removed from a block but the block still exists, flow del messages are not passed to the callback registered for that qdisc. For the NFP, this presents the possibility of rules still existing in hw when they should be removed. Prevent binding to shared blocks until the kernel can send per qdisc del messages when block unbinds occur. Fixes: 4861738775d7 ("net: sched: introduce shared filter blocks infrastructure") Signed-off-by: John Hurley Signed-off-by: Jakub Kicinski Reviewed-by: Simon Horman --- drivers/net/ethernet/netronome/nfp/bpf/main.c | 3 +++ drivers/net/ethernet/netronome/nfp/flower/offload.c | 3 +++ 2 files changed, 6 insertions(+) diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.c b/drivers/net/ethernet/netronome/nfp/bpf/main.c index fcdfb8e7fdea..6b15e3b11956 100644 --- a/drivers/net/ethernet/netronome/nfp/bpf/main.c +++ b/drivers/net/ethernet/netronome/nfp/bpf/main.c @@ -202,6 +202,9 @@ static int nfp_bpf_setup_tc_block(struct net_device *netdev, if (f->binder_type != TCF_BLOCK_BINDER_TYPE_CLSACT_INGRESS) return -EOPNOTSUPP; + if (tcf_block_shared(f->block)) + return -EOPNOTSUPP; + switch (f->command) { case TC_BLOCK_BIND: return tcf_block_cb_register(f->block, diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c b/drivers/net/ethernet/netronome/nfp/flower/offload.c index 477f584f6d28..525057bee0ed 100644 --- a/drivers/net/ethernet/netronome/nfp/flower/offload.c +++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c @@ -631,6 +631,9 @@ static int nfp_flower_setup_tc_block(struct net_device *netdev, if (f->binder_type != TCF_BLOCK_BINDER_TYPE_CLSACT_INGRESS) return -EOPNOTSUPP; + if (tcf_block_shared(f->block)) + return -EOPNOTSUPP; + switch (f->command) { case TC_BLOCK_BIND: return 
tcf_block_cb_register(f->block, -- 2.17.1
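For context, the shared-block scenario this patch rejects can be set up from user space roughly as follows. This is an illustrative sketch — the device names are made up, and it assumes a kernel and iproute2 with shared filter block support:

```shell
# bind two qdiscs to the same ingress block
tc qdisc add dev eth0 ingress_block 22 clsact
tc qdisc add dev eth1 ingress_block 22 clsact   # block 22 is now shared

# filters are attached to the block, not to either qdisc
tc filter add block 22 protocol ip flower dst_ip 10.0.0.1 action drop

# block 22 still exists after this, so no flow del message reaches
# the callback that was registered for eth0's qdisc
tc qdisc del dev eth0 clsact
```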
[PATCH net 0/2] nfp: MPLS and shared blocks TC offload fixes
Hi!

This series brings two fixes to TC filter/action offload code. Pieter fixes matching MPLS packets when the match is purely on the MPLS ethertype and none of the MPLS fields are used. John provides a fix for offload of shared blocks. Unfortunately, with shared blocks there is currently no guarantee that filters which were added by the core will be removed before block unbind. Our simple fix is to not support offload of rules on shared blocks at all; a revert of this fix will be sent for -next once the reoffload infrastructure lands. The shared blocks became important as we are trying to use them for bonding offload (managed from user space), and the lack of remove calls leads to resource leaks.

John Hurley (1):
  nfp: reject binding to shared blocks

Pieter Jansen van Vuuren (1):
  nfp: flower: fix mpls ether type detection

 drivers/net/ethernet/netronome/nfp/bpf/main.c     |  3 +++
 drivers/net/ethernet/netronome/nfp/flower/match.c | 14 ++
 .../net/ethernet/netronome/nfp/flower/offload.c   | 12 
 3 files changed, 29 insertions(+)

-- 
2.17.1
[PATCH net 1/2] nfp: flower: fix mpls ether type detection
From: Pieter Jansen van Vuuren Previously it was not possible to distinguish between mpls ether types and other ether types. This leads to incorrect classification of offloaded filters that match on mpls ether type. For example the following two filters overlap: # tc filter add dev eth0 parent : \ protocol 0x8847 flower \ action mirred egress redirect dev eth1 # tc filter add dev eth0 parent : \ protocol 0x0800 flower \ action mirred egress redirect dev eth2 The driver now correctly includes the mac_mpls layer where HW stores mpls fields, when it detects an mpls ether type. It also sets the MPLS_Q bit to indicate that the filter should match mpls packets. Fixes: bb055c198d9b ("nfp: add mpls match offloading support") Signed-off-by: Pieter Jansen van Vuuren Reviewed-by: Simon Horman Reviewed-by: Jakub Kicinski --- drivers/net/ethernet/netronome/nfp/flower/match.c | 14 ++ .../net/ethernet/netronome/nfp/flower/offload.c| 8 2 files changed, 22 insertions(+) diff --git a/drivers/net/ethernet/netronome/nfp/flower/match.c b/drivers/net/ethernet/netronome/nfp/flower/match.c index 91935405f586..84f7a5dbea9d 100644 --- a/drivers/net/ethernet/netronome/nfp/flower/match.c +++ b/drivers/net/ethernet/netronome/nfp/flower/match.c @@ -123,6 +123,20 @@ nfp_flower_compile_mac(struct nfp_flower_mac_mpls *frame, NFP_FLOWER_MASK_MPLS_Q; frame->mpls_lse = cpu_to_be32(t_mpls); + } else if (dissector_uses_key(flow->dissector, + FLOW_DISSECTOR_KEY_BASIC)) { + /* Check for mpls ether type and set NFP_FLOWER_MASK_MPLS_Q +* bit, which indicates an mpls ether type but without any +* mpls fields. 
+*/ + struct flow_dissector_key_basic *key_basic; + + key_basic = skb_flow_dissector_target(flow->dissector, + FLOW_DISSECTOR_KEY_BASIC, + flow->key); + if (key_basic->n_proto == cpu_to_be16(ETH_P_MPLS_UC) || + key_basic->n_proto == cpu_to_be16(ETH_P_MPLS_MC)) + frame->mpls_lse = cpu_to_be32(NFP_FLOWER_MASK_MPLS_Q); } } diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c b/drivers/net/ethernet/netronome/nfp/flower/offload.c index c42e64f32333..477f584f6d28 100644 --- a/drivers/net/ethernet/netronome/nfp/flower/offload.c +++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c @@ -264,6 +264,14 @@ nfp_flower_calculate_key_layers(struct nfp_app *app, case cpu_to_be16(ETH_P_ARP): return -EOPNOTSUPP; + case cpu_to_be16(ETH_P_MPLS_UC): + case cpu_to_be16(ETH_P_MPLS_MC): + if (!(key_layer & NFP_FLOWER_LAYER_MAC)) { + key_layer |= NFP_FLOWER_LAYER_MAC; + key_size += sizeof(struct nfp_flower_mac_mpls); + } + break; + /* Will be included in layer 2. */ case cpu_to_be16(ETH_P_8021Q): break; -- 2.17.1
Re: Route fallback issue
On 06/25/2018 12:50 PM, Julian Anastasov wrote: Hello, Hi Julian, Yes, ARP state for unreachable GWs may be updated slowly, there is in-time feedback only for reachable state. Fair. Most of the installations where I needed D.G.D. to work would be okay with a < 5 minute timeout. Obviously they would like faster, but automation is a LOT better than waiting on manual intervention. IMHO < 30 seconds is great. < 90 seconds is acceptable. < 300 seconds leaves some room for improvement. You can create the two routes, of course. But only the default routes are alternative. Are you saying that the functionality I'm describing only works for default gateways or that the term "alternative route" only applies to default gateways? The testing that I did indicated that alternative routes worked for specific prefixes too. I tested multiple NetNSs with only directly attached routes and appended routes to a destination prefix, no default gateway / route of last resort. The behavior seemed to be different when ignore_routes_with_linkdown was set verses unset. Specifically, ignore_routes_with_linkdown seemed to help considerably. Hence why I question the requirement for the "default" route verses a route to a specific prefix. Can you explain why I saw the behavior difference with ignore_routes_with_linkdown if it only applies to the default route? The alternative routes work in this way: - on lookup, routes are walked in order - as listed in table - as long as route contains reachable gateway (ARP state), only this route is used - if some gateway becomes unreachable (ARP state), next alternative routes are tried - if ARP entry is expired (missing), this gateway can be probed if the route is before the currently used route. This is what happens initially when no ARP state is present for the GWs. It is bad luck if the probed GW is actually unreachable. - active probing by user space (ping GWs) can only help to keep the ARP state present for the used gateways. 
This way, if the ARP entry for a GW is missing, the kernel will not risk selecting an unavailable route just to probe the GW.

This all makes sense. Please confirm whether "gateway" in this context is the "/default/ gateway" or not. I ask because arguably "gateway" can be used as a term to describe the next hop for a route, or gateway, to a prefix. Further, the "/default/ (gateway,router)" is the gateway or route of last resort. Which to me means that "gateway" can be any route in this context.

nexthop is the GW in the route

Thank you for confirming.

Yes, the kernel avoids alternative routes with unreachable GWs

Fair enough.

The multipath route uses all its alive nexthops at the same time... But you may need in the same way active probing by user space, otherwise an unavailable GW can be selected.

I assume that a dead ECMP nexthop is also subject to similar timeouts as alternative routes. Correct?

Yes, if you prefer, you may run PING every second to avoid such delays...

Agreed. I'm trying to make sure I understand basic functionality before I do things to modify it.

-- 
Grant. . . . unix || die
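The lookup behaviour discussed in this thread can be exercised with `ip route append`, which adds alternative (fallback) routes instead of replacing the existing one. A minimal sketch — addresses and device names are made up, and the commands require root:

```shell
# two alternative default routes; on lookup they are walked in order,
# and a route is used as long as its GW is reachable per ARP state
ip route append default via 192.0.2.1 dev eth0
ip route append default via 198.51.100.1 dev eth1

# additionally skip routes whose nexthop device lost carrier
sysctl -w net.ipv4.conf.all.ignore_routes_with_linkdown=1

ip route show   # both default routes remain listed
```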
Re: Route fallback issue
Hello,

On Sun, 24 Jun 2018, Erik Auerswald wrote:

> Hello Julian,
>
> > http://ja.ssi.bg/dgd-usage.txt
>
> Thanks for that info!
>
> Can you tell us what parts from the above text are actually implemented
> in the upstream Linux kernel, and starting with which version(s)
> (approximately)? The text describes ideas and patches from nearly two
> decades ago; is more recent documentation available somewhere?

Nothing is included in kernel. The idea is that user space has more control. It is best done with CONNMARK: stick a NATed connection to some path (via alive ISP), use route lookup just to select an alive path for the first packet in the connection. So, what we balance are connections, not packets (which does not work with different ISPs). Probe GWs to keep only alive routes in the table.

Regards

--
Julian Anastasov
Re: Route fallback issue
Hello,

On Thu, 21 Jun 2018, Grant Taylor wrote:

> On 06/21/2018 01:57 PM, Julian Anastasov wrote:
> > Hello,
> >
> > http://ja.ssi.bg/dgd-usage.txt
>
> "DGD" or "Dead Gateway Detection" sounds very familiar. I referenced it
> in an earlier reply.
>
> I distinctly remember DGD not behaving satisfactorily years ago. Where
> unsatisfactorily was something like 90 seconds (or more) to recover.
> Which actually matches what I was getting without the
> ignore_routes_with_linkdown=1 setting that David A. mentioned.

Yes, ARP state for unreachable GWs may be updated slowly, there is in-time feedback only for reachable state.

> With ignore_routes_with_linkdown=1 things behaved much better.
>
> > Not true. net/ipv4/fib_semantics.c:fib_select_path() calls
> > fib_select_default() only when prefixlen = 0 (default route).
>
> Okay. My testing last night disagrees with you. Specifically, I was
> able to add alternate routes to the same prefix, 192.0.2.128/26. There
> was not any default gateway configured on any of the NetNSs. So
> everything was using routes for locally attached prefixes or the two
> added via "ip route append".
>
> What am I misinterpreting? Or where are we otherwise talking past each
> other?

You can create the two routes, of course. But only the default routes are alternative.

> > Otherwise, only the first route will be considered.
>
> "only the first route" almost sounds like something akin to Equal Cost
> Multi Path.
>
> I was not expecting "alternative routes" to use more than one route at
> a time, equally or otherwise. I was wanting the kernel to fall back to
> an alternate route / gateway / path in the event that the one that was
> being used became unusable / unreachable.
>
> So what should "Alternative Routes" do? How does this compare /
> contrast to E.C.M.P. or D.G.D.?

The alternative routes work in this way:

- on lookup, routes are walked in order - as listed in table
- as long as route contains reachable gateway (ARP state), only this route is used
- if some gateway becomes unreachable (ARP state), next alternative routes are tried
- if ARP entry is expired (missing), this gateway can be probed if the route is before the currently used route. This is what happens initially when no ARP state is present for the GWs. It is bad luck if the probed GW is actually unreachable.
- active probing by user space (ping GWs) can only help to keep the ARP state present for the used gateways.

By this way, if ARP entry for GW is missing, the kernel will not risk to select unavailable route with the goal to probe the GW.

> > fib_select_default() is the function that decides which nexthop is
> > reachable and whether to contact it. It uses the ARP state via
> > fib_detect_death(). That is all code that is behind this feature
> > called "alternative routes": the kernel selects one based on
> > nexthop's ARP state.
>
> Please confirm that you aren't entering / referring to E.C.M.P.
> territory when you say "nexthop". I think that you are not, but I want
> to ask and be sure, particularly seeing as how things are very closely
> related.

nexthop is the GW in the route

> It sounds like you're referring to literally the router that is the
> next hop in the path. I.e. the device on the other end of the wire.

Yes, the kernel avoids alternative routes with unreachable GWs

> I want to do some testing to see if fib_multipath_use_neigh alters this
> behavior at all. I'm hoping that it will invalidate an alternate route
> if the MAC is not resolvable even if the physical link stays up.

The multipath route uses all its alive nexthops at the same time... But you may need in the same way active probing by user space, otherwise an unavailable GW can be selected.

> Sure, the ARP cache may have a 30 ~ 120 second timeout before
> triggering this behavior. But having that timeout and starting to use
> an alternative route is considerably better than not using an
> alternative route.

Yes, if you prefer, you may run PING every second to avoid such delays...

Regards

--
Julian Anastasov
Re: [PATCH net-next 2/3] rds: Enable RDS IPv6 support
On 6/25/2018 10:50 AM, Sowmini Varadhan wrote:
> On (06/26/18 01:43), Ka-Cheong Poon wrote:
> > Yes, I think if the socket is bound, it should check the scope_id in
> > msg_name (if not NULL) to make sure that they match.
>
> A bound RDS socket can send to multiple peers. But if the bound local
> address is link local, it should only be allowed to send to peers on
> the same link. agree.

Yep. It's in line with RDS bind behavior.

> > If a socket is bound, I guess the scope_id should be used. So if a
> > socket is not bound to a link local address and the socket is used
> > to send to a link local peer, it should fail.
>
> PF_RDS sockets *MUST* always be bound. See
> Documentation/networking/rds.txt:
>
>   "Sockets must be bound before you can send or receive data. This is
>   needed because binding also selects a transport and attaches it to
>   the socket. Once bound, the transport assignment does not change."

In any case, link local or not, the socket needs to be bound before any data can be sent, as documented. The send path already enforces it.

> > > Also, why is there no IPv6 support in rds_connect?
> >
> > Oops, I missed this when I ported the internal version to the
> > net-next version. Will add it back.

So the net-next version wasn't tested? IPv6 connections themselves wouldn't be formed with this missing. As mentioned already, please test v2 before posting on the list.

Regards,
Santosh
Re: [net regression] "fib_rules: move common handling of newrule delrule msgs into fib_nl2rule" breaks suppress_prefixlength
On Mon, Jun 25, 2018 at 8:23 AM, Roopa Prabhu wrote:
> On Sat, Jun 23, 2018 at 8:46 AM, Jason A. Donenfeld wrote:
>> Hey Roopa,
>>
>> On a kernel with a minimal networking config,
>> CONFIG_IP_MULTIPLE_TABLES appears to be broken for certain rules after
>> f9d4b0c1e9695e3de7af3768205bacc27312320c.
>>
>> Try, for example, running:
>>
>> $ ip -4 rule add table main suppress_prefixlength 0
>>
>> It returns with EEXIST.
>>
>> Perhaps the reason is that the new rule_find function does not match
>> on suppress_prefixlength? However, rule_exist from before didn't do
>> that either. I'll keep playing and see if I can track it down myself,
>> but thought I should let you know first.
>
> I am surprised at that also. I cannot find prior rule_exist looking at
> suppress_prefixlength.
> I will dig deeper also today. But your patch LGTM with a small change
> I commented on it.

So the previous rule_exists code did not check for attribute matches correctly. It would skip a rule at the first mismatch of an attribute that was not set in the request; e.g. in your case, it would move on at the pref mismatch.

$ ip -4 rule add table main suppress_prefixlength 0
$ ip -4 rule add table main suppress_prefixlength 0
$ ip -4 rule add table main suppress_prefixlength 0
$ ip rule show
0:      from all lookup local
32763:  from all lookup main suppress_prefixlength 0
32764:  from all lookup main suppress_prefixlength 0
32765:  from all lookup main suppress_prefixlength 0
32766:  from all lookup main
32767:  from all lookup default

With your patch (with my proposed fixes), you should get a proper EEXIST check:

$ git diff HEAD
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 126ffc5..de4c171 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -428,6 +428,14 @@ static struct fib_rule *rule_find(struct fib_rules_ops *ops,
 		if (rule->l3mdev && r->l3mdev != rule->l3mdev)
 			continue;
 
+		if (rule->suppress_ifgroup != -1 &&
+		    (r->suppress_ifgroup != rule->suppress_ifgroup))
+			continue;
+
+		if (rule->suppress_prefixlen != -1 &&
+		    (r->suppress_prefixlen != rule->suppress_prefixlen))
+			continue;
+
 		if (uid_range_set(&rule->uid_range) &&
 		    (!uid_eq(r->uid_range.start, rule->uid_range.start) ||
 		     !uid_eq(r->uid_range.end, rule->uid_range.end)))

$ ip -4 rule add table main suppress_prefixlength 0
$ ip -4 rule add table main suppress_prefixlength 0
RTNETLINK answers: File exists
[PATCH net-next] r8169: reject unsupported WoL options
So far, unsupported WoL options are silently ignored. Change this and reject attempts to set unsupported options. This prevents situations where a user tries to set an unsupported WoL option and is under the impression it was successful because ethtool doesn't complain.

Signed-off-by: Heiner Kallweit
---
 drivers/net/ethernet/realtek/r8169.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 1d33672c..70c13cc2 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -1712,11 +1712,14 @@ static int rtl8169_set_wol(struct net_device *dev, struct ethtool_wolinfo *wol)
 	struct rtl8169_private *tp = netdev_priv(dev);
 	struct device *d = tp_to_dev(tp);
 
+	if (wol->wolopts & ~WAKE_ANY)
+		return -EINVAL;
+
 	pm_runtime_get_noresume(d);
 
 	rtl_lock_work(tp);
 
-	tp->saved_wolopts = wol->wolopts & WAKE_ANY;
+	tp->saved_wolopts = wol->wolopts;
 
 	if (pm_runtime_active(d))
 		__rtl8169_set_wol(tp, tp->saved_wolopts);
-- 
2.18.0