[PATCH net] ipv6: sr: properly initialize flowi6 prior passing to ip6_route_output
In 'seg6_output', stack variable 'struct flowi6 fl6' was missing
initialization.

Fixes: 6c8702c60b88 ("ipv6: sr: add support for SRH encapsulation and injection with lwtunnels")
Signed-off-by: Shmulik Ladkani
---
Suggesting this fix, spotted during code review while experimenting with
SRv6, although I haven't encountered a specific issue during the
experiments. Was there any genuine intention to actually keep 'fl6'
uninitialized?
---
 net/ipv6/seg6_iptunnel.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv6/seg6_iptunnel.c b/net/ipv6/seg6_iptunnel.c
index a8854dd3e9c5..8181ee7e1e27 100644
--- a/net/ipv6/seg6_iptunnel.c
+++ b/net/ipv6/seg6_iptunnel.c
@@ -347,6 +347,7 @@ static int seg6_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 	struct ipv6hdr *hdr = ipv6_hdr(skb);
 	struct flowi6 fl6;
 
+	memset(&fl6, 0, sizeof(fl6));
 	fl6.daddr = hdr->daddr;
 	fl6.saddr = hdr->saddr;
 	fl6.flowlabel = ip6_flowinfo(hdr);
-- 
2.19.1
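The hazard the patch addresses can be sketched in plain userspace C. The struct and field names below are illustrative stand-ins, not the kernel's actual flowi6 layout: a lookup key where only some fields are assigned leaves the rest as stack garbage, which then silently participates in the route lookup.

```c
#include <string.h>

/* Hypothetical stand-in for the kernel's struct flowi6: a lookup key
 * where every field participates in route selection. */
struct flow_key {
	unsigned int daddr;
	unsigned int saddr;
	unsigned int flowlabel;
	unsigned int mark;	/* never set by the caller -- stack garbage */
};

/* Mimics the buggy pattern: only some fields assigned. */
static void fill_key_partial(struct flow_key *k)
{
	k->daddr = 1;
	k->saddr = 2;
	k->flowlabel = 3;
	/* k->mark left as whatever was on the stack */
}

/* Mimics the fixed pattern: zero the whole struct first. */
static void fill_key_fixed(struct flow_key *k)
{
	memset(k, 0, sizeof(*k));
	k->daddr = 1;
	k->saddr = 2;
	k->flowlabel = 3;
}
```

The memset is cheap relative to a route lookup, which is why zero-initializing the whole key is the idiomatic fix rather than assigning each unused field.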
Re: [PATCH 0/3]: net: dsa: mt7530: support MT7530 in the MT7621 SoC
Hi John,

On 4/12/18 12:02 am, John Crispin wrote:
> On 03/12/2018 15:00, René van Dorst wrote:
>> Quoting Bjørn Mork:
>>> Greg Ungerer writes:
>>>> The following change helped a lot, but I still get some problems
>>>> under sustained load and some types of port setups. Still trying to
>>>> figure out what exactly is going on.
>>>>
>>>> --- a/linux/drivers/net/ethernet/mediatek/mtk_eth_soc.c
>>>> +++ b/linux/drivers/net/ethernet/mediatek/mtk_eth_soc.c
>>>> @@ -1750,8 +1750,8 @@ static irqreturn_t mtk_handle_irq_rx(int irq, void *_eth)
>>>>
>>>>  	if (likely(napi_schedule_prep(&eth->rx_napi))) {
>>>>  		__napi_schedule(&eth->rx_napi);
>>>> -		mtk_rx_irq_disable(eth, MTK_RX_DONE_INT);
>>>>  	}
>>>> +	mtk_rx_irq_disable(eth, MTK_RX_DONE_INT);
>>>>
>>>>  	return IRQ_HANDLED;
>>>>  }
>>>> @@ -1762,11 +1762,53 @@ static irqreturn_t mtk_handle_irq_tx(int irq, void *_eth)
>>>>
>>>>  	if (likely(napi_schedule_prep(&eth->tx_napi))) {
>>>>  		__napi_schedule(&eth->tx_napi);
>>>> -		mtk_tx_irq_disable(eth, MTK_TX_DONE_INT);
>>>>  	}
>>>> +	mtk_tx_irq_disable(eth, MTK_TX_DONE_INT);
>>>>
>>>>  	return IRQ_HANDLED;
>>>>  }
>>>
>>> Yes, sorry I didn't point to that as well. Just to be clear: I have no
>>> clue how this thing is actually wired up, or if you could use three
>>> interrupts on the MT7621 too. I just messed with it until I got
>>> something to work, based on René's original idea and code.
>>
>> My idea is just a copy of mtk_handle_irq_{rx,tx}, see [1].
>> You probably want to look at the staging driver or Ubiquiti source with
>> a 3.10.x kernel [2] or padavan with a 3.4.x kernel [3].
>> AFAIK mt7621 only has 1 IRQ for the ethernet part.
>
> correct, there is only 1 single IRQ on mt7621

One of the main differences I see between the mainline mtk_eth_soc.c and
the older mediatek/openwrt driver is that the older driver uses the PDMA
module for TX transmission, while the mainline uses the QDMA module. I
have no documentation on what the differences are between the two (or
why there are even two DMA engines in there?). Can you shed any light on
that?

I did a quick and dirty recode of the QDMA transmission parts of the
mainline driver code to use the PDMA engine instead.
The most immediate result is that it suffers the same IP header checksum
problem on TX packets :-(

But it also still suffers from the same occasional TX watchdog timeout I
see with only the mainline driver and basic support of MT7621. What I
see with the TX watchdog timeouts is that there are valid TX
descriptors, but the frame engine is just not processing them. It seems
to be just sitting there idle. The CTX and DTX registers look valid and
consistent with the local last_free/next_free pointers.

Regards
Greg
Re: [PATCH AUTOSEL 4.19 040/123] bpf: allocate local storage buffers using GFP_ATOMIC
On Fri, 7 Dec 2018 at 12:10, Naresh Kamboju wrote:
>
> On Wed, 5 Dec 2018 at 15:08, Sasha Levin wrote:
> >
> > From: Roman Gushchin
> >
> > [ Upstream commit 569a933b03f3c48b392fe67c0086b3a6b9306b5a ]
> >
> > Naresh reported an issue with the non-atomic memory allocation of
> > cgroup local storage buffers:
> >
> > [   73.047526] BUG: sleeping function called from invalid context at
> > /srv/oe/build/tmp-rpb-glibc/work-shared/intel-corei7-64/kernel-source/mm/slab.h:421
> > [   73.060915] in_atomic(): 1, irqs_disabled(): 0, pid: 3157, name: test_cgroup_sto
> > [   73.068342] INFO: lockdep is turned off.
> > [   73.072293] CPU: 2 PID: 3157 Comm: test_cgroup_sto Not tainted 4.20.0-rc2-next-20181113 #1
> > [   73.080548] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS 2.0b 07/27/2017
> > [   73.088018] Call Trace:
> > [   73.090463]  dump_stack+0x70/0xa5
> > [   73.093783]  ___might_sleep+0x152/0x240
> > [   73.097619]  __might_sleep+0x4a/0x80
> > [   73.101191]  __kmalloc_node+0x1cf/0x2f0
> > [   73.105031]  ? cgroup_storage_update_elem+0x46/0x90
> > [   73.109909]  cgroup_storage_update_elem+0x46/0x90
> >
> > cgroup_storage_update_elem() (as well as other map update callbacks)
> > is called with disabled preemption, so GFP_ATOMIC allocation should
> > be used: e.g. alloc_htab_elem() in hashtab.c.
> >
> > Reported-by: Naresh Kamboju
> > Tested-by: Naresh Kamboju
> > Signed-off-by: Roman Gushchin
> > Cc: Alexei Starovoitov
> > Cc: Daniel Borkmann
> > Signed-off-by: Alexei Starovoitov
> > Signed-off-by: Sasha Levin
>
> I have reported the above issue on 4.20.0-rc2-next-20181113.
> Now this BUG is re-occurring on 4.19.8-rc1 on x86_64 and arm64 devices.

This BUG was seen on 4.19.1-rc1 also on x86_64 and arm64 devices.

- Naresh
Re: [PATCH AUTOSEL 4.19 040/123] bpf: allocate local storage buffers using GFP_ATOMIC
On Wed, 5 Dec 2018 at 15:08, Sasha Levin wrote:
>
> From: Roman Gushchin
>
> [ Upstream commit 569a933b03f3c48b392fe67c0086b3a6b9306b5a ]
>
> Naresh reported an issue with the non-atomic memory allocation of
> cgroup local storage buffers:
>
> [   73.047526] BUG: sleeping function called from invalid context at
> /srv/oe/build/tmp-rpb-glibc/work-shared/intel-corei7-64/kernel-source/mm/slab.h:421
> [   73.060915] in_atomic(): 1, irqs_disabled(): 0, pid: 3157, name: test_cgroup_sto
> [   73.068342] INFO: lockdep is turned off.
> [   73.072293] CPU: 2 PID: 3157 Comm: test_cgroup_sto Not tainted 4.20.0-rc2-next-20181113 #1
> [   73.080548] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS 2.0b 07/27/2017
> [   73.088018] Call Trace:
> [   73.090463]  dump_stack+0x70/0xa5
> [   73.093783]  ___might_sleep+0x152/0x240
> [   73.097619]  __might_sleep+0x4a/0x80
> [   73.101191]  __kmalloc_node+0x1cf/0x2f0
> [   73.105031]  ? cgroup_storage_update_elem+0x46/0x90
> [   73.109909]  cgroup_storage_update_elem+0x46/0x90
>
> cgroup_storage_update_elem() (as well as other map update callbacks)
> is called with disabled preemption, so GFP_ATOMIC allocation should
> be used: e.g. alloc_htab_elem() in hashtab.c.
>
> Reported-by: Naresh Kamboju
> Tested-by: Naresh Kamboju
> Signed-off-by: Roman Gushchin
> Cc: Alexei Starovoitov
> Cc: Daniel Borkmann
> Signed-off-by: Alexei Starovoitov
> Signed-off-by: Sasha Levin

I have reported the above issue on 4.20.0-rc2-next-20181113.
Now this BUG is re-occurring on 4.19.8-rc1 on x86_64 and arm64 devices.

[   70.288592] BUG: sleeping function called from invalid context at /srv/oe/build/tmp-rpb-glibc/work-shared/intel-corei7-64/kernel-source/mm/slab.h:421
[   70.301992] in_atomic(): 1, irqs_disabled(): 0, pid: 3001, name: test_cgroup_sto
[   70.309424] INFO: lockdep is turned off.
[   70.313416] CPU: 0 PID: 3001 Comm: test_cgroup_sto Not tainted 4.19.8-rc1 #1
[   70.320483] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS 2.0b 07/27/2017
[   70.327953] Call Trace:
[   70.330402]  dump_stack+0x70/0xa5
[   70.333765]  ___might_sleep+0x152/0x240
[   70.337599]  __might_sleep+0x4a/0x80
[   70.341169]  __kmalloc_node+0x1d1/0x300
[   70.345003]  ? cgroup_storage_update_elem+0x46/0x90
[   70.349881]  cgroup_storage_update_elem+0x46/0x90
[   70.354585]  map_update_elem+0x1fd/0x450
[   70.358504]  __x64_sys_bpf+0x129/0x270
[   70.362258]  do_syscall_64+0x55/0x190
[   70.365923]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   70.370974] RIP: 0033:0x7f42e0ebb969
[   70.374544] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d df e4 2b 00 f7 d8 64 89 01 48
[   70.393281] RSP: 002b:00007ffde61a0a08 EFLAGS: 00000202 ORIG_RAX: 0000000000000141
[   70.400845] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f42e0ebb969
[   70.407971] RDX: 0000000000000048 RSI: 00007ffde61a0a50 RDI: 0000000000000002
[   70.415094] RBP: 00007ffde61a0a20 R08: 00007ffde61a0a50 R09: 00007ffde61a0a50
[   70.422216] R10: 00007ffde61a0a50 R11: 0000000000000202 R12: 0000000000000005
[   70.429342] R13: 00007ffde61a0c10 R14: 0000000000000000 R15: 0000000000000000

selftests: bpf: test_cgroup_storage

Full test log links,
arm64 Juno
https://lkft.validation.linaro.org/scheduler/job/537820#L2971
x86_64 Supermicro SYS-5019S-ML/X11SSH-F
https://lkft.validation.linaro.org/scheduler/job/537772#L2724

Best regards
Naresh Kamboju

> ---
>  kernel/bpf/local_storage.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c
> index 830d7f095748..fc1605aee5ea 100644
> --- a/kernel/bpf/local_storage.c
> +++ b/kernel/bpf/local_storage.c
> @@ -138,7 +138,8 @@ static int cgroup_storage_update_elem(struct bpf_map *map, void *_key,
>  		return -ENOENT;
>
>  	new = kmalloc_node(sizeof(struct bpf_storage_buffer) +
> -			   map->value_size, __GFP_ZERO | GFP_USER,
> +			   map->value_size,
> +			   __GFP_ZERO | GFP_ATOMIC | __GFP_NOWARN,
>  			   map->numa_node);
>  	if (!new)
>  		return -ENOMEM;
> --
> 2.17.1
>
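The rule the fix enforces can be modeled in a few lines of userspace C. This is an illustrative model, not kernel code: flag values, the `preempt_count` variable, and `sketch_kmalloc` are all made up for the sketch, but they mirror the kernel's `might_sleep()` debug check that produced the splat above.

```c
#include <assert.h>
#include <stdlib.h>

#define GFP_USER   0x1	/* may sleep to reclaim memory */
#define GFP_ATOMIC 0x2	/* must never sleep */

static int preempt_count;	/* >0 means atomic context, as in the kernel */

static void *sketch_kmalloc(size_t size, int flags)
{
	/* Mirrors the kernel's might_sleep() check: a sleeping
	 * allocation from atomic context is a bug. The real kernel
	 * splats "BUG: sleeping function called from invalid context"
	 * here; the sketch just refuses the allocation. */
	if (!(flags & GFP_ATOMIC) && preempt_count > 0)
		return NULL;
	return malloc(size);
}
```

Map update callbacks run with preemption disabled, so only the `GFP_ATOMIC` path is legal there, which is exactly the change the patch makes to `cgroup_storage_update_elem()`.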
[PATCHv2 net 3/3] sctp: fa_resize sctp stream instead of redo fa_alloc
Now when doing a 4-way handshake or adding new streams, sctp has to
allocate new memory for asoc->stream and copy the old stream's
information from the old asoc->stream to the new one. It also causes
the stream pointers to change, which once even caused a panic due to
stream->out_curr's change.

To fix this, flex_array_resize() is used in sctp_stream_alloc_out/in()
when asoc->stream has already been allocated. Besides, with this
asoc->stream will only be allocated once, and grow or shrink
dynamically later. Note that flex_array_prealloc() is needed before
growing, as fa_alloc does, while flex_array_clear() and
flex_array_shrink() are called to free the unused memory before
shrinking.

Fixes: 5e32a431 ("sctp: introduce stream scheduler foundations")
Reported-by: Ying Xu
Reported-by: syzbot+e33a3a138267ca119...@syzkaller.appspotmail.com
Suggested-by: Neil Horman
Signed-off-by: Xin Long
Acked-by: Neil Horman
---
 net/sctp/stream.c | 87 +--
 1 file changed, 40 insertions(+), 47 deletions(-)

diff --git a/net/sctp/stream.c b/net/sctp/stream.c
index 3892e76..aff30b2 100644
--- a/net/sctp/stream.c
+++ b/net/sctp/stream.c
@@ -37,6 +37,17 @@
 #include
 #include
 
+static void fa_zero(struct flex_array *fa, size_t index, size_t count)
+{
+	void *elem;
+
+	while (count--) {
+		elem = flex_array_get(fa, index);
+		memset(elem, 0, fa->element_size);
+		index++;
+	}
+}
+
 static struct flex_array *fa_alloc(size_t elem_size, size_t elem_count,
 				   gfp_t gfp)
 {
@@ -48,8 +59,9 @@ static struct flex_array *fa_alloc(size_t elem_size, size_t elem_count,
 		err = flex_array_prealloc(result, 0, elem_count, gfp);
 		if (err) {
 			flex_array_free(result);
-			result = NULL;
+			return NULL;
 		}
+		fa_zero(result, 0, elem_count);
 	}
 
 	return result;
@@ -61,27 +73,28 @@ static void fa_free(struct flex_array *fa)
 	flex_array_free(fa);
 }
 
-static void fa_copy(struct flex_array *fa, struct flex_array *from,
-		    size_t index, size_t count)
+static int fa_resize(struct flex_array *fa, size_t count, gfp_t gfp)
 {
-	void *elem;
+	int nr = fa->total_nr_elements, n;
 
-	while (count--) {
-		elem = flex_array_get(from, index);
-		flex_array_put(fa, index, elem, 0);
-		index++;
+	if (count > nr) {
+		if (flex_array_resize(fa, count, gfp))
+			return -ENOMEM;
+		if (flex_array_prealloc(fa, nr, count - nr, gfp))
+			return -ENOMEM;
+		fa_zero(fa, nr, count - nr);
+
+		return 0;
 	}
-}
 
-static void fa_zero(struct flex_array *fa, size_t index, size_t count)
-{
-	void *elem;
+	/* Shrink the unused memory,
+	 * FLEX_ARRAY_FREE check is safe for sctp stream.
+	 */
+	for (n = count; n < nr; n++)
+		flex_array_clear(fa, n);
+	flex_array_shrink(fa);
 
-	while (count--) {
-		elem = flex_array_get(fa, index);
-		memset(elem, 0, fa->element_size);
-		index++;
-	}
+	return flex_array_resize(fa, count, gfp);
 }
 
 /* Migrates chunks from stream queues to new stream queues if needed,
@@ -138,47 +151,27 @@ static void sctp_stream_outq_migrate(struct sctp_stream *stream,
 static int sctp_stream_alloc_out(struct sctp_stream *stream, __u16 outcnt,
 				 gfp_t gfp)
 {
-	struct flex_array *out;
-	size_t elem_size = sizeof(struct sctp_stream_out);
-
-	out = fa_alloc(elem_size, outcnt, gfp);
-	if (!out)
-		return -ENOMEM;
+	if (!stream->out) {
+		stream->out = fa_alloc(sizeof(struct sctp_stream_out),
+				       outcnt, gfp);
 
-	if (stream->out) {
-		fa_copy(out, stream->out, 0, min(outcnt, stream->outcnt));
-		fa_free(stream->out);
+		return stream->out ? 0 : -ENOMEM;
 	}
 
-	if (outcnt > stream->outcnt)
-		fa_zero(out, stream->outcnt, (outcnt - stream->outcnt));
-
-	stream->out = out;
-
-	return 0;
+	return fa_resize(stream->out, outcnt, gfp);
 }
 
 static int sctp_stream_alloc_in(struct sctp_stream *stream, __u16 incnt,
 				gfp_t gfp)
 {
-	struct flex_array *in;
-	size_t elem_size = sizeof(struct sctp_stream_in);
+	if (!stream->in) {
+		stream->in = fa_alloc(sizeof(struct sctp_stream_in),
+				      incnt, gfp);
 
-	in = fa_alloc(elem_size, incnt, gfp);
-	if (!in)
-		return -ENOMEM;
-
-	if (stream->in) {
-		fa_copy(in, stream->in, 0, min(incnt, stream->incnt));
-		fa_
[PATCHv2 net 1/3] flex_array: make FLEX_ARRAY_BASE_SIZE the same value of FLEX_ARRAY_PART_SIZE
This patch is to separate the base data memory from struct flex_array
and save it into a page. With this change, total_nr_elements of a
flex_array can grow or shrink without having the old elements' memory
changed when the new size of the flex_array crosses
FLEX_ARRAY_BASE_SIZE, which will be added in the next patch.

Suggested-by: Neil Horman
Signed-off-by: Xin Long
Acked-by: Neil Horman
---
 include/linux/flex_array.h | 29 +
 lib/flex_array.c           | 15 ---
 2 files changed, 21 insertions(+), 23 deletions(-)

diff --git a/include/linux/flex_array.h b/include/linux/flex_array.h
index b94fa61..29ad65f 100644
--- a/include/linux/flex_array.h
+++ b/include/linux/flex_array.h
@@ -7,9 +7,10 @@
 #include
 
 #define FLEX_ARRAY_PART_SIZE PAGE_SIZE
-#define FLEX_ARRAY_BASE_SIZE PAGE_SIZE
+#define FLEX_ARRAY_BASE_SIZE FLEX_ARRAY_PART_SIZE
 
 struct flex_array_part;
+struct flex_array_part_p;
 
 /*
  * This is meant to replace cases where an array-like
@@ -19,29 +20,17 @@ struct flex_array_part;
  */
 
 struct flex_array {
-	union {
-		struct {
-			int element_size;
-			int total_nr_elements;
-			int elems_per_part;
-			struct reciprocal_value reciprocal_elems;
-			struct flex_array_part *parts[];
-		};
-		/*
-		 * This little trick makes sure that
-		 * sizeof(flex_array) == PAGE_SIZE
-		 */
-		char padding[FLEX_ARRAY_BASE_SIZE];
-	};
+	int element_size;
+	int total_nr_elements;
+	int elems_per_part;
+	struct reciprocal_value reciprocal_elems;
+	struct flex_array_part_p *part_p;
+#define parts part_p->p_part
 };
 
-/* Number of bytes left in base struct flex_array, excluding metadata */
-#define FLEX_ARRAY_BASE_BYTES_LEFT \
-	(FLEX_ARRAY_BASE_SIZE - offsetof(struct flex_array, parts))
-
 /* Number of pointers in base to struct flex_array_part pages */
 #define FLEX_ARRAY_NR_BASE_PTRS \
-	(FLEX_ARRAY_BASE_BYTES_LEFT / sizeof(struct flex_array_part *))
+	(FLEX_ARRAY_BASE_SIZE / sizeof(struct flex_array_part *))
 
 /* Number of elements of size that fit in struct flex_array_part */
 #define FLEX_ARRAY_ELEMENTS_PER_PART(size) \
diff --git a/lib/flex_array.c b/lib/flex_array.c
index 2eed22f..8c0b9b6 100644
--- a/lib/flex_array.c
+++ b/lib/flex_array.c
@@ -30,6 +30,10 @@ struct flex_array_part {
 	char elements[FLEX_ARRAY_PART_SIZE];
 };
 
+struct flex_array_part_p {
+	struct flex_array_part *p_part[FLEX_ARRAY_NR_BASE_PTRS];
+};
+
 /*
  * If a user requests an allocation which is small
  * enough, we may simply use the space in the
@@ -39,7 +43,7 @@ struct flex_array_part {
 static inline int elements_fit_in_base(struct flex_array *fa)
 {
 	int data_size = fa->element_size * fa->total_nr_elements;
-	if (data_size <= FLEX_ARRAY_BASE_BYTES_LEFT)
+	if (data_size <= FLEX_ARRAY_BASE_SIZE)
 		return 1;
 	return 0;
 }
@@ -105,13 +109,17 @@ struct flex_array *flex_array_alloc(int element_size, unsigned int total,
 	ret = kzalloc(sizeof(struct flex_array), flags);
 	if (!ret)
 		return NULL;
+	ret->part_p = kzalloc(sizeof(struct flex_array_part_p), flags);
+	if (!ret->part_p) {
+		kfree(ret);
+		return NULL;
+	}
 	ret->element_size = element_size;
 	ret->total_nr_elements = total;
 	ret->elems_per_part = elems_per_part;
 	ret->reciprocal_elems = reciprocal_elems;
 	if (elements_fit_in_base(ret) && !(flags & __GFP_ZERO))
-		memset(&ret->parts[0], FLEX_ARRAY_FREE,
-		       FLEX_ARRAY_BASE_BYTES_LEFT);
+		memset(&ret->parts[0], FLEX_ARRAY_FREE, FLEX_ARRAY_BASE_SIZE);
 	return ret;
 }
 EXPORT_SYMBOL(flex_array_alloc);
@@ -148,6 +156,7 @@ EXPORT_SYMBOL(flex_array_free_parts);
 void flex_array_free(struct flex_array *fa)
 {
 	flex_array_free_parts(fa);
+	kfree(fa->part_p);
 	kfree(fa);
 }
 EXPORT_SYMBOL(flex_array_free);
-- 
2.1.0
[PATCHv2 net 0/3] net: add support for flex_array_resize in flex_array
Without support for growing or shrinking total_nr_elements dynamically,
flex_array is not that 'flexible'. When users want to change the size,
they have to redo flex_array_alloc and copy all the elements from the
old array to the new one. Worse, every element's memory address changes.

To implement flex_array_resize based on the current code, the difficult
part is handling the size border of FLEX_ARRAY_BASE_BYTES_LEFT, where
the base data memory may change to an array of pointers to the 2nd-level
data memory when growing, and likewise for shrinking. To make this part
easier, we separate out the base data memory and define
FLEX_ARRAY_BASE_SIZE to the same value as FLEX_ARRAY_PART_SIZE, as Neil
suggested.

When the new size crosses the border, the base memory is allocated as
the pointer array for the 2nd-level data memory and its part[0] is
pointed at the old base memory; the opposite is done for shrinking. But
flex_array_resize doesn't do any memory allocation or shrinking for the
elements themselves, as that should be done by flex_array_prealloc or
flex_array_shrink called by users. No memory leaks are caused by that.

SCTP has benefited a lot from flex_array_resize() for managing its
stream memory so far.

v1->v2:
  Cc LKML and more developers.

Xin Long (3):
  flex_array: make FLEX_ARRAY_BASE_SIZE the same value of
    FLEX_ARRAY_PART_SIZE
  flex_array: support flex_array_resize
  sctp: fa_resize sctp stream instead of redo fa_alloc

 include/linux/flex_array.h | 40 ++---
 lib/flex_array.c           | 73 --
 net/sctp/stream.c          | 87 +-
 3 files changed, 130 insertions(+), 70 deletions(-)
-- 
2.1.0
[PATCHv2 net 2/3] flex_array: support flex_array_resize
This function can dynamically change total_nr_elements of a flex_array,
and keep the old elements in the same memory. Returns 0 if it succeeds.
Note that it won't do any memory allocation or shrinking for elements,
which should only be done by flex_array_prealloc and flex_array_shrink.

Suggested-by: Neil Horman
Signed-off-by: Xin Long
Acked-by: Neil Horman
---
 include/linux/flex_array.h | 11 +
 lib/flex_array.c           | 58 ++
 2 files changed, 69 insertions(+)

diff --git a/include/linux/flex_array.h b/include/linux/flex_array.h
index 29ad65f..19ff58d 100644
--- a/include/linux/flex_array.h
+++ b/include/linux/flex_array.h
@@ -130,6 +130,17 @@ void *flex_array_get(struct flex_array *fa, unsigned int element_nr);
  */
 int flex_array_shrink(struct flex_array *fa);
 
+/**
+ * flex_array_resize() - Resize without the old elements memory changed
+ * @fa:		array to resize
+ * @total:	total number of elements that this would change to
+ * @flags:	page allocation flags to use for base array
+ *
+ * Return: Returns 0 if it succeeds.
+ *
+ */
+int flex_array_resize(struct flex_array *fa, unsigned int total, gfp_t flags);
+
 #define flex_array_put_ptr(fa, nr, src, gfp) \
 	flex_array_put(fa, nr, (void *)&(src), gfp)
 
diff --git a/lib/flex_array.c b/lib/flex_array.c
index 8c0b9b6..2f913e7 100644
--- a/lib/flex_array.c
+++ b/lib/flex_array.c
@@ -405,3 +405,61 @@ int flex_array_shrink(struct flex_array *fa)
 	return ret;
 }
 EXPORT_SYMBOL(flex_array_shrink);
+
+/**
+ * flex_array_resize - resize without the old elements memory changed
+ * @fa:		the flex array to resize
+ * @total:	total number of elements that this would change to
+ * @flags:	page allocation flags to use for base array
+ *
+ * This function can dynamically change total_nr_elements of a flex_array,
+ * and keep the old elements of the same memory. Returns 0 if it succeeds.
+ * Note that it won't do any memory allocation or shrinking for elements,
+ * which should be only done by flex_array_prealloc and flex_array_shrink.
+ *
+ * Locking must be provided by the caller.
+ */
+int flex_array_resize(struct flex_array *fa, unsigned int total, gfp_t flags)
+{
+	int nr;
+
+	if (total > FLEX_ARRAY_NR_BASE_PTRS * fa->elems_per_part)
+		return -EINVAL;
+
+	if (elements_fit_in_base(fa)) {
+		struct flex_array_part_p *part_p;
+
+		nr = fa->total_nr_elements;
+		fa->total_nr_elements = total;
+		if (elements_fit_in_base(fa))
+			return 0;
+
+		part_p = kzalloc(sizeof(*part_p), flags);
+		if (!part_p) {
+			fa->total_nr_elements = nr;
+			return -ENOMEM;
+		}
+
+		part_p->p_part[0] = (struct flex_array_part *)&fa->parts[0];
+		fa->part_p = part_p;
+	} else {
+		struct flex_array_part *part;
+
+		fa->total_nr_elements = total;
+		if (!elements_fit_in_base(fa))
+			return 0;
+
+		for (nr = 1; nr < FLEX_ARRAY_NR_BASE_PTRS; nr++) {
+			part = fa->parts[nr];
+			if (part) {
+				fa->parts[nr] = NULL;
+				kfree(part);
+			}
+		}
+
+		fa->part_p = (struct flex_array_part_p *)fa->parts[0];
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL(flex_array_resize);
-- 
2.1.0
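The property flex_array_resize provides (growing without moving existing elements) comes from the two-level part layout, and can be sketched in a few lines of userspace C. Everything below is illustrative: `sketch_array`, `PART_ELEMS`, and `MAX_PARTS` are invented for the sketch and much smaller than the kernel's page-sized parts.

```c
#include <stdlib.h>

#define PART_ELEMS 4	/* elements per part (kernel: per page) */
#define MAX_PARTS  8	/* pointer slots in the base (kernel: per page) */

struct part { int elems[PART_ELEMS]; };

struct sketch_array {
	int total;			/* total_nr_elements */
	struct part *parts[MAX_PARTS];	/* 2nd-level data pages */
};

static int *sketch_get(struct sketch_array *a, int nr)
{
	return &a->parts[nr / PART_ELEMS]->elems[nr % PART_ELEMS];
}

/* Grow (or logically shrink) to 'total' elements. Existing elements
 * stay in their parts, so their addresses never change -- mirroring
 * the guarantee flex_array_resize gives sctp's stream pointers. */
static int sketch_resize(struct sketch_array *a, int total)
{
	int nparts = (total + PART_ELEMS - 1) / PART_ELEMS;

	if (nparts > MAX_PARTS)
		return -1;
	for (int i = 0; i < nparts; i++) {
		if (!a->parts[i]) {
			a->parts[i] = calloc(1, sizeof(struct part));
			if (!a->parts[i])
				return -1;
		}
	}
	a->total = total;
	return 0;
}
```

This address stability is exactly what the sctp patch in this series relies on: `stream->out_curr` keeps pointing at valid memory across a resize, which the old alloc-and-copy scheme could not guarantee.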
Re: [PATCH net-next v2 0/4] net: mitigate retpoline overhead
From: David Miller
Date: Thu, 06 Dec 2018 22:24:09 -0800 (PST)

> Series applied, thanks!

Erm... actually reverted. Please fix these build failures:

ld: net/ipv6/ip6_offload.o: in function `ipv6_gro_receive':
ip6_offload.c:(.text+0xda2): undefined reference to `udp6_gro_receive'
ld: ip6_offload.c:(.text+0xdb6): undefined reference to `udp6_gro_receive'
ld: net/ipv6/ip6_offload.o: in function `ipv6_gro_complete':
ip6_offload.c:(.text+0x1953): undefined reference to `udp6_gro_complete'
ld: ip6_offload.c:(.text+0x1966): undefined reference to `udp6_gro_complete'
make: *** [Makefile:1036: vmlinux] Error 1

Thanks.
Re: [PATCH net-next v2 0/4] net: mitigate retpoline overhead
From: Paolo Abeni
Date: Wed, 5 Dec 2018 19:13:38 +0100

> The spectre v2 counter-measures, aka retpolines, are a source of
> measurable overhead [1]. We can partially address that when the
> function pointer refers to a builtin symbol, resorting to a list of
> tests vs well-known builtin functions and direct calls.
>
> Experimental results show that replacing a single indirect call via
> retpoline with several branches and a direct call gives performance
> gains even when multiple branches are added - 5 or more, as reported
> in [2].
>
> This may lead to some uglification around the indirect calls. At
> netconf 2018 Eric Dumazet described a technique to hide the most
> relevant part of the needed boilerplate with some macro help.
>
> This series is a [re-]implementation of such idea, exposing the
> introduced helpers in a new header file. They are later leveraged to
> avoid the indirect call overhead in the GRO path, when possible.
>
> Overall this gives > 10% performance improvement for the UDP GRO
> benchmark and a smaller but measurable one for TCP syn flood.
>
> The added infra can be used in follow-up patches to cope with
> retpoline overhead in other points of the networking stack (e.g. at
> the qdisc layer) and possibly even in other subsystems.
 ...

Series applied, thanks!
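The technique the cover letter describes can be sketched compactly. `INDIRECT_CALL_1` below is a simplified stand-in for the helper the series introduces, not its exact kernel definition: compare the function pointer against the likely builtin target and call it directly (a predictable branch), falling back to the retpolined indirect call otherwise.

```c
/* Simplified model of the indirect-call wrapper: one builtin
 * candidate. Under retpolines the direct-call branch is much cheaper
 * than the speculation-hardened indirect call. */
#define INDIRECT_CALL_1(f, builtin, ...) \
	((f) == (builtin) ? (builtin)(__VA_ARGS__) : (f)(__VA_ARGS__))

static int builtin_handler(int x) { return x + 1; }	/* the common case */
static int other_handler(int x)   { return x * 2; }	/* e.g. a module hook */

static int dispatch(int (*cb)(int), int x)
{
	return INDIRECT_CALL_1(cb, builtin_handler, x);
}
```

The semantics are unchanged for any callback; only the cost model differs, which is why the series can apply this in the GRO path without behavioral risk.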
Re: [PATCH rdma-next 0/3] Packet based credit mode
On Thu, Dec 06, 2018 at 08:27:06PM -0700, Jason Gunthorpe wrote:
> On Fri, Nov 30, 2018 at 01:22:03PM +0200, Leon Romanovsky wrote:
> > From: Leon Romanovsky
> >
> > From Danit,
> >
> > Packet based credit mode is an alternative end-to-end credit mode for
> > QPs, set during their creation. Credits are transported from the
> > responder to the requester to optimize the use of its receive
> > resources. In packet-based credit mode, credits are issued on a
> > per-packet basis.
> >
> > The advantage of this feature comes while sending large RDMA messages
> > through switches that are short on memory.
> >
> > The first commit exposes the QP creation flag and the HCA capability.
> > The second commit adds support for a new DV QP creation flag. The
> > last commit reports the packet based credit mode capability via the
> > MLX5DV device capabilities.
> >
> > Thanks
> >
> > Danit Goldberg (3):
> >   net/mlx5: Expose packet based credit mode
> >   IB/mlx5: Add packet based credit mode support
> >   IB/mlx5: Report packet based credit mode device capability
>
> This looks fine to me, can you update the shared branch please

Done, thanks

3fd3c80acc17 net/mlx5: Expose packet based credit mode

> Thanks,
> Jason
[PATCH v3] ptp: fix an IS_ERR() vs NULL check
We recently modified pps_register_source() to return error pointers
instead of NULL, but it seems like there was a merge issue and part of
the commit was lost. Anyway, the ptp_clock_register() function needs to
be updated to check for IS_ERR() as well.

Fixes: 3b1ad360acad ("pps: using ERR_PTR instead of NULL while pps_register_source fails")
Signed-off-by: Dan Carpenter
---
v2: Use the correct Fixes tag.
v3: Add Greg to the CC list for real this time.

 drivers/ptp/ptp_clock.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/ptp/ptp_clock.c b/drivers/ptp/ptp_clock.c
index 8a81eecc0ecd..48f3594a7458 100644
--- a/drivers/ptp/ptp_clock.c
+++ b/drivers/ptp/ptp_clock.c
@@ -265,8 +265,8 @@ struct ptp_clock *ptp_clock_register(struct ptp_clock_info *info,
 	pps.mode = PTP_PPS_MODE;
 	pps.owner = info->owner;
 	ptp->pps_source = pps_register_source(&pps, PTP_PPS_DEFAULTS);
-	if (!ptp->pps_source) {
-		err = -EINVAL;
+	if (IS_ERR(ptp->pps_source)) {
+		err = PTR_ERR(ptp->pps_source);
 		pr_err("failed to register pps source\n");
 		goto no_pps;
 	}
-- 
2.11.0
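The bug class here is worth spelling out: once a function returns `ERR_PTR(-err)` instead of NULL, a NULL check silently passes on failure. The definitions below are a simplified userspace rendering of the kernel's `include/linux/err.h` convention (errors encoded as pointer values in the top 4095 bytes of the address space), for illustration only.

```c
#include <errno.h>

#define MAX_ERRNO 4095

/* Encode a negative errno as a pointer value. */
static inline void *ERR_PTR(long error)
{
	return (void *)error;
}

/* Recover the errno from an error pointer. */
static inline long PTR_ERR(const void *ptr)
{
	return (long)ptr;
}

/* An error pointer lands in the top MAX_ERRNO bytes of the address
 * space, which no valid allocation can occupy. */
static inline int IS_ERR(const void *ptr)
{
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}
```

In this scheme a failing `pps_register_source()` returns e.g. `ERR_PTR(-EINVAL)`, which is non-NULL; hence the original `if (!ptp->pps_source)` never fired, and the patch switches it to `IS_ERR()`/`PTR_ERR()`.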
Re: [PATCH net-next] rhashtable: further improve stability of rhashtable_walk
On Wed, Dec 05, 2018 at 02:51:02PM +1100, NeilBrown wrote:
>
> If the sequence:
>    obj = rhashtable_walk_next(iter);
>    rhashtable_walk_stop(iter);
>    rhashtable_remove_fast(ht, &obj->head, params);
>    rhashtable_walk_start(iter);
>
> races with another thread inserting or removing an object on the same
> hash chain, a subsequent rhashtable_walk_next() is not guaranteed to
> get the "next" object. It is possible that an object could be
> repeated, or missed.
>
> This can be made more reliable by keeping the objects in a hash chain
> sorted by memory address. A subsequent rhashtable_walk_next() call can
> reliably find the correct position in the list, and thus find the
> 'next' object.
>
> It is not possible to take this approach with an rhltable, as keeping
> the hash chain in order is not so easy. When the first object with a
> given key is removed, it is replaced in the chain with the next object
> with the same key, and the address of that object may not be correctly
> ordered. I have not yet found any way to achieve the same stability
> with rhltables that doesn't have a major impact on lookup or insert.
> No code currently in Linux would benefit from such extra stability.
>
> With this patch:
> - a new object is always inserted after the last object with a smaller
>   address, or at the start.
> - when rhashtable_walk_start() is called, it records that 'p' is not
>   'safe', meaning that it cannot be dereferenced. The revalidation
>   that was previously done here is moved to rhashtable_walk_next().
> - when rhashtable_walk_next() is called while p is not NULL and not
>   safe, it walks the chain looking for the first object with an
>   address greater than p and returns that. If there is none, it moves
>   to the next hash chain.
>
> Signed-off-by: NeilBrown
> ---
>
> This is a resend of a patch that I sent back in July. It couldn't be
> applied then because it assumed another rhashtable patch which hadn't
> landed yet - it now has.
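The address-ordered chain idea can be illustrated with a minimal single-threaded model (not the rhashtable code itself; names and the `struct node` layout are invented for the sketch). Insertion keeps the chain sorted by pointer value, so a walker that only remembers the address of the last object it returned can always resume at "first node with a greater address", even if that exact object has since been removed.

```c
#include <stddef.h>
#include <stdint.h>

struct node { struct node *next; };

/* Insert after the last node with a smaller address, or at the start. */
static void insert_ordered(struct node **head, struct node *n)
{
	struct node **pp = head;

	while (*pp && (uintptr_t)*pp < (uintptr_t)n)
		pp = &(*pp)->next;
	n->next = *pp;
	*pp = n;
}

/* Resume a walk: return the first node whose address is greater than
 * 'last' (NULL 'last' means start from the head of the chain). */
static struct node *walk_resume(struct node *head, struct node *last)
{
	struct node *p = head;

	while (p && (uintptr_t)p <= (uintptr_t)last)
		p = p->next;
	return p;
}
```

Because resumption keys on the address rather than on chain position, a concurrent insert or remove elsewhere in the chain can no longer cause an object to be repeated or skipped, which is the guarantee the patch adds.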
I thought we had agreed to drop this because nobody needs it currently
and it doesn't handle rhlist?

Cheers,
-- 
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
linux-next: manual merge of the rcu tree with the net-next tree
Hi all,

Today's linux-next merge of the rcu tree got conflicts in:

  net/bridge/br_mdb.c
  net/bridge/br_multicast.c

between commits:

  19e3a9c90c53 ("net: bridge: convert multicast to generic rhashtable")
  4329596cb10d ("net: bridge: multicast: use non-bh rcu flavor")

from the net-next tree and commit:

  1a56f7d53b5c ("net/bridge: Replace call_rcu_bh() and rcu_barrier_bh()")

from the rcu tree.

I fixed it up (I used the versions from net-next) and can carry the fix
as necessary. This is now fixed as far as linux-next is concerned, but
any non trivial conflicts should be mentioned to your upstream
maintainer when your tree is submitted for merging. You may also want
to consider cooperating with the maintainer of the conflicting tree to
minimise any particularly complex conflicts.

-- 
Cheers,
Stephen Rothwell
Re: [PATCH bpf] selftests/bpf: add missing pointer dereference for map stacktrace fixup
On 12/7/2018 1:14 PM, Stanislav Fomichev wrote:
> I get a segfault without it; the other fixups always dereference the
> pointer, and without the dereference I don't understand how it could
> ever work.
>
> Fixes: 7c85c448e7d74 ("selftests/bpf: test_verifier, check bpf_map_lookup_elem access in bpf prog")
> Signed-off-by: Stanislav Fomichev
> ---
>  tools/testing/selftests/bpf/test_verifier.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
> index df6f751cc1e8..d23929a1985d 100644
> --- a/tools/testing/selftests/bpf/test_verifier.c
> +++ b/tools/testing/selftests/bpf/test_verifier.c
> @@ -14166,7 +14166,7 @@ static void do_test_fixup(struct bpf_test *test, enum bpf_map_type prog_type,
>  		do {
>  			prog[*fixup_map_stacktrace].imm = map_fds[12];
>  			fixup_map_stacktrace++;
> -		} while (fixup_map_stacktrace);
> +		} while (*fixup_map_stacktrace);
>  	}
>  }

It was my mistake. Thanks for the fix!

-Prashant
[PATCH bpf] selftests/bpf: add missing pointer dereference for map stacktrace fixup
I get a segfault without it; the other fixups always dereference the
pointer, and without the dereference I don't understand how it could
ever work.

Fixes: 7c85c448e7d74 ("selftests/bpf: test_verifier, check bpf_map_lookup_elem access in bpf prog")
Signed-off-by: Stanislav Fomichev
---
 tools/testing/selftests/bpf/test_verifier.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index df6f751cc1e8..d23929a1985d 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -14166,7 +14166,7 @@ static void do_test_fixup(struct bpf_test *test, enum bpf_map_type prog_type,
 		do {
 			prog[*fixup_map_stacktrace].imm = map_fds[12];
 			fixup_map_stacktrace++;
-		} while (fixup_map_stacktrace);
+		} while (*fixup_map_stacktrace);
 	}
 }
-- 
2.20.0.rc2.403.gdbc3b29805-goog
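The bug in miniature, extracted into a self-contained sketch (the `count_fixups` helper and its sentinel convention are invented for illustration): the loop condition must test the *value at* the pointer, not the pointer itself, because the incremented pointer stays non-NULL forever and the loop runs past the 0-terminated list.

```c
/* Walk a 0-terminated list of fixup indices. */
static int count_fixups(const int *fixup)
{
	int n = 0;

	/* Correct form: dereference, stop at the 0 sentinel. */
	while (*fixup) {
		n++;
		fixup++;
	}
	/* The buggy form was 'while (fixup)': the pointer itself never
	 * becomes NULL, so the loop reads past the terminator -- the
	 * segfault reported above. */
	return n;
}
```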
Re: [RFC PATCH 4/6] dt-bindings: update mvneta binding document
Hi Russell,

On 05/12/18 9:00 PM, Rob Herring wrote:
> On Wed, Dec 5, 2018 at 5:00 AM Russell King - ARM Linux wrote:
>>
>> On Mon, Dec 03, 2018 at 05:54:55PM -0600, Rob Herring wrote:
>>> On Mon, Nov 12, 2018 at 12:31:02PM +0000, Russell King wrote:
>>>> Signed-off-by: Russell King
>>>
>>> Needs a better subject and a commit msg.
>>
>> Hmm, not sure why it didn't contain:
>>
>> "dt-bindings: net: mvneta: add phys property
>>
>> Add an optional phys property to the mvneta binding documentation for
>> the common phy.
>>
>> Signed-off-by: Russell King"
>>
>> as the commit message. With the correct commit message, are you happy
>> with it?
>
> Yes.
>
> Reviewed-by: Rob Herring

Are you planning to resend this series?

Thanks
Kishon
Re: [PATCH] Revert "net/ibm/emac: wrong bit is used for STA control"
On Fri, 2018-12-07 at 14:20 +1100, Benjamin Herrenschmidt wrote: > Apologies for the empty email, not too sure what happened, I did a resend and the second one worked. Cheers Ben.
[PATCH] Revert "net/ibm/emac: wrong bit is used for STA control"
This reverts commit 624ca9c33c8a853a4a589836e310d776620f4ab9. This commit is completely bogus. The STACR register has two formats, old and new, depending on the version of the IP block used. There's a pair of device-tree properties that can be used to specify the format used: has-inverted-stacr-oc has-new-stacr-staopc What this commit did was to change the bit definition used with the old parts to match the new parts. This of course breaks the driver on all the old ones. Instead, the author should have set the appropriate properties in the device-tree for the variant used on his board. Signed-off-by: Benjamin Herrenschmidt --- Found while setting up some old ppc440 boxes for test/CI drivers/net/ethernet/ibm/emac/emac.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/ibm/emac/emac.h b/drivers/net/ethernet/ibm/emac/emac.h index e2f80cc..0d2de6f 100644 --- a/drivers/net/ethernet/ibm/emac/emac.h +++ b/drivers/net/ethernet/ibm/emac/emac.h @@ -231,7 +231,7 @@ struct emac_regs { #define EMAC_STACR_PHYE 0x4000 #define EMAC_STACR_STAC_MASK 0x3000 #define EMAC_STACR_STAC_READ 0x1000 -#define EMAC_STACR_STAC_WRITE 0x0800 +#define EMAC_STACR_STAC_WRITE 0x2000 #define EMAC_STACR_OPBC_MASK 0x0C00 #define EMAC_STACR_OPBC_50 0x #define EMAC_STACR_OPBC_66 0x0400 -- 2.7.4
Re: [PATCH] Revert "net/ibm/emac: wrong bit is used for STA control"
Looks like your posting was empty?
Re: [PATCH net-next] neighbour: Improve garbage collection
From: David Ahern Date: Thu, 6 Dec 2018 14:38:44 -0800 > The existing garbage collection algorithm has a number of problems: Thanks for working on this! I totally agree with what you are doing, especially the separate gc_list. But why do you need the on_gc_list boolean state? That's equivalent to "!list_empty(&n->gc_list)" and seems redundant.
Re: [PATCH net-next 3/4] net: use indirect call wrapper at GRO transport layer
Hi Paolo, Thank you for the patch! Yet something to improve: [auto build test ERROR on net-next/master] url: https://github.com/0day-ci/linux/commits/Paolo-Abeni/net-mitigate-retpoline-overhead/20181206-183400 config: i386-allmodconfig (attached as .config) compiler: gcc-7 (Debian 7.3.0-1) 7.3.0 reproduce: # save the attached .config to linux build tree make ARCH=i386 All errors (new ones prefixed by >>): net/ipv6/ip6_offload.o: In function `ipv6_gro_receive': >> ip6_offload.c:(.text+0x58e): undefined reference to `transport6_gro_receive1' ip6_offload.c:(.text+0x595): undefined reference to `transport6_gro_receive1' net/ipv6/ip6_offload.o: In function `ipv6_gro_complete': >> ip6_offload.c:(.text+0xb26): undefined reference to >> `transport6_gro_complete1' ip6_offload.c:(.text+0xb30): undefined reference to `transport6_gro_complete1' --- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation
[PATCH bpf] selftests/bpf: skip sockmap tests on kernels without support
Include "autoconf.h" header in the test_maps.c and selectively disable test_sockmap if CONFIG_BPF_STREAM_PARSER is not specified in the kernel config. When building out of tree/without autoconf.h, fall back to 'enabled'. Signed-off-by: Stanislav Fomichev --- tools/testing/selftests/bpf/test_maps.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c index 4db2116e52be..3548de8a78ac 100644 --- a/tools/testing/selftests/bpf/test_maps.c +++ b/tools/testing/selftests/bpf/test_maps.c @@ -32,6 +32,13 @@ #define ENOTSUPP 524 #endif +#ifdef HAVE_GENHDR +# include "autoconf.h" +#else +/* fallback to all features enabled */ +# define CONFIG_BPF_STREAM_PARSER 1 +#endif + static int map_flags; #define CHECK(condition, tag, format...) ({\ @@ -588,6 +595,7 @@ static void test_stackmap(int task, void *data) close(fd); } +#ifdef CONFIG_BPF_STREAM_PARSER #include #include #include @@ -1079,6 +1087,7 @@ static void test_sockmap(int tasks, void *data) close(fd); exit(1); } +#endif #define MAP_SIZE (32 * 1024) @@ -1541,7 +1550,9 @@ static void run_all_tests(void) test_arraymap_percpu_many_keys(); test_devmap(0, NULL); +#ifdef CONFIG_BPF_STREAM_PARSER test_sockmap(0, NULL); +#endif test_map_large(); test_map_parallel(); -- 2.20.0.rc2.403.gdbc3b29805-goog
Re: [PATCH rdma-next 0/3] Packet based credit mode
On Fri, Nov 30, 2018 at 01:22:03PM +0200, Leon Romanovsky wrote: > From: Leon Romanovsky > > From Danit, > > Packet based credit mode is an alternative end-to-end credit mode for QPs > set during their creation. Credits are transported from the responder > to the requester to optimize the use of its receive resources. > In packet-based credit mode, credits are issued on a per packet basis. > > The advantage of this feature comes while sending large RDMA messages > through switches that are short in memory. > > The first commit exposes QP creation flag and the HCA capability. The > second commit adds support for a new DV QP creation flag. The last > commit reports packet based credit mode capability via the MLX5DV device > capabilities. > > Thanks > > Danit Goldberg (3): > net/mlx5: Expose packet based credit mode > IB/mlx5: Add packet based credit mode support > IB/mlx5: Report packet based credit mode device capability This looks fine to me, can you update the shared branch please Thanks, Jason
[PATCH] Revert "net/ibm/emac: wrong bit is used for STA control"
Re: [PATCH RFC 5/6] net: dsa: microchip: Update tag_ksz.c to access switch driver
On Thu, Dec 06, 2018 at 08:00:26PM +, tristram...@microchip.com wrote: > A customer has already inquired about implementing 1588 PTP in the DSA > driver. I hope > this mechanism is approved so that I can start doing that. If you need changes to the PTP core, you had better discuss this with the PTP maintainer. Thanks, Richard
Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
On Thu, 2018-12-06 at 15:08 -0800, Nadav Amit wrote: > > On Dec 6, 2018, at 12:17 PM, Andy Lutomirski wrote: > > > > On Thu, Dec 6, 2018 at 11:39 AM Nadav Amit wrote: > > > > On Dec 6, 2018, at 11:19 AM, Andy Lutomirski wrote: > > > > > > > > On Thu, Dec 6, 2018 at 11:01 AM Tycho Andersen wrote: > > > > > On Thu, Dec 06, 2018 at 10:53:50AM -0800, Andy Lutomirski wrote: > > > > > > > If we are going to unmap the linear alias, why not do it at > > > > > > > vmalloc() > > > > > > > time rather than vfree() time? > > > > > > > > > > > > That’s not totally nuts. Do we ever have code that expects __va() to > > > > > > work on module data? Perhaps crypto code trying to encrypt static > > > > > > data because our APIs don’t understand virtual addresses. I guess > > > > > > if > > > > > > highmem is ever used for modules, then we should be fine. > > > > > > > > > > > > RO instead of not present might be safer. But I do like the idea of > > > > > > renaming Rick's flag to something like VM_XPFO or VM_NO_DIRECT_MAP > > > > > > and > > > > > > making it do all of this. > > > > > > > > > > Yeah, doing it for everything automatically seemed like it was/is > > > > > going to be a lot of work to debug all the corner cases where things > > > > > expect memory to be mapped but don't explicitly say it. And in > > > > > particular, the XPFO series only does it for user memory, whereas an > > > > > additional flag like this would work for extra paranoid allocations > > > > > of kernel memory too. > > > > > > > > I just read the code, and I looks like vmalloc() is already using > > > > highmem (__GFP_HIGH) if available, so, on big x86_32 systems, for > > > > example, we already don't have modules in the direct map. > > > > > > > > So I say we go for it. This should be quite simple to implement -- > > > > the pageattr code already has almost all the needed logic on x86. 
The > > > > only arch support we should need is a pair of functions to remove a > > > > vmalloc address range from the address map (if it was present in the > > > > first place) and a function to put it back. On x86, this should only > > > > be a few lines of code. > > > > > > > > What do you all think? This should solve most of the problems we have. > > > > > > > > If we really wanted to optimize this, we'd make it so that > > > > module_alloc() allocates memory the normal way, then, later on, we > > > > call some function that, all at once, removes the memory from the > > > > direct map and applies the right permissions to the vmalloc alias (or > > > > just makes the vmalloc alias not-present so we can add permissions > > > > later without flushing), and flushes the TLB. And we arrange for > > > > vunmap to zap the vmalloc range, then put the memory back into the > > > > direct map, then free the pages back to the page allocator, with the > > > > flush in the appropriate place. > > > > > > > > I don't see why the page allocator needs to know about any of this. > > > > It's already okay with the permissions being changed out from under it > > > > on x86, and it seems fine. Rick, do you want to give some variant of > > > > this a try? > > > > > > Setting it as read-only may work (and already happens for the read-only > > > module data). I am not sure about setting it as non-present. > > > > > > At some point, a discussion about a threat-model, as Rick indicated, would > > > be required. I presume ROP attacks can easily call > > > set_all_modules_text_rw() > > > and override all the protections. > > > > I am far from an expert on exploit techniques, but here's a > > potentially useful model: let's assume there's an attacker who can > > write controlled data to a controlled kernel address but cannot > > directly modify control flow. It would be nice for such an attacker > > to have a very difficult time of modifying kernel text or of > > compromising control flow. 
So we're assuming a feature like kernel > > CET or that the attacker finds it very difficult to do something like > > modifying some thread's IRET frame. > > > > Admittedly, for the kernel, this is an odd threat model, since an > > attacker can presumably quite easily learn the kernel stack address of > > one of their tasks, do some syscall, and then modify their kernel > > thread's stack such that it will IRET right back to a fully controlled > > register state with RSP pointing at an attacker-supplied kernel stack. > > So this threat model gives very strong ROP powers. unless we have > > either CET or some software technique to harden all the RET > > instructions in the kernel. > > > > I wonder if there's a better model to use. Maybe with stack-protector > > we get some degree of protection? Or is all of this is rather weak > > until we have CET or a RAP-like feature. > > I believe that seeing the end-goal would make reasoning about patches > easier, otherwise the complaint “but anyhow it’s all insecure” keeps popping > up. > > I’m not sure CET or other CFI would be eno
linux-next: manual merge of the net-next tree with the net tree
Hi all, Today's linux-next merge of the net-next tree got a conflict in: net/dsa/master.c between commit: a3d7e01da060 ("net: dsa: Fix tagging attribute location") from the net tree and commit: dc0fe7d47f9f ("net: dsa: Set the master device's MTU to account for DSA overheads") from the net-next tree. I fixed it up (see below) and can carry the fix as necessary. This is now fixed as far as linux-next is concerned, but any non trivial conflicts should be mentioned to your upstream maintainer when your tree is submitted for merging. You may also want to consider cooperating with the maintainer of the conflicting tree to minimise any particularly complex conflicts. -- Cheers, Stephen Rothwell diff --cc net/dsa/master.c index 5e8c9bef78bd,42f525bc68e2.. --- a/net/dsa/master.c +++ b/net/dsa/master.c @@@ -158,31 -158,24 +158,47 @@@ static void dsa_master_ethtool_teardown cpu_dp->orig_ethtool_ops = NULL; } +static ssize_t tagging_show(struct device *d, struct device_attribute *attr, + char *buf) +{ + struct net_device *dev = to_net_dev(d); + struct dsa_port *cpu_dp = dev->dsa_ptr; + + return sprintf(buf, "%s\n", + dsa_tag_protocol_to_str(cpu_dp->tag_ops)); +} +static DEVICE_ATTR_RO(tagging); + +static struct attribute *dsa_slave_attrs[] = { + &dev_attr_tagging.attr, + NULL +}; + +static const struct attribute_group dsa_group = { + .name = "dsa", + .attrs = dsa_slave_attrs, +}; + + void dsa_master_set_mtu(struct net_device *dev, struct dsa_port *cpu_dp) + { + unsigned int mtu = ETH_DATA_LEN + cpu_dp->tag_ops->overhead; + int err; + + rtnl_lock(); + if (mtu <= dev->max_mtu) { + err = dev_set_mtu(dev, mtu); + if (err) + netdev_dbg(dev, "Unable to set MTU to include for DSA overheads\n"); + } + rtnl_unlock(); + } + int dsa_master_setup(struct net_device *dev, struct dsa_port *cpu_dp) { + int ret; + + dsa_master_set_mtu(dev, cpu_dp); + /* If we use a tagging format that doesn't have an ethertype * field, make sure that all packets from this point on get * sent to the tag 
format's receive function.
Re: [PATCH v2 perf,bpf 1/3] perf, bpf: Introduce PERF_RECORD_BPF_EVENT
Hi Song, Thank you for the patch! Yet something to improve: [auto build test ERROR on tip/perf/core] [also build test ERROR on v4.20-rc5 next-20181206] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Song-Liu/perf-bpf-Introduce-PERF_RECORD_BPF_EVENT/20181207-083615 config: i386-randconfig-x070-201848 (attached as .config) compiler: gcc-7 (Debian 7.3.0-1) 7.3.0 reproduce: # save the attached .config to linux build tree make ARCH=i386 All error/warnings (new ones prefixed by >>): kernel/events/core.c: In function 'perf_event_bpf_output': >> kernel/events/core.c:7694:2: error: implicit declaration of function >> 'bpf_get_prog_name'; did you mean 'bpf_prog_free'? >> [-Werror=implicit-function-declaration] bpf_get_prog_name(bpf_event->prog, name); ^ bpf_prog_free kernel/events/core.c: In function 'perf_event_bpf_event_subprog': >> kernel/events/core.c:7737:12: warning: cast from pointer to integer of >> different size [-Wpointer-to-int-cast] .addr = (u64)prog->bpf_func, ^ cc1: some warnings being treated as errors vim +7694 kernel/events/core.c 7679 7680 static void perf_event_bpf_output(struct perf_event *event, 7681 void *data) 7682 { 7683 struct perf_bpf_event *bpf_event = data; 7684 struct perf_output_handle handle; 7685 struct perf_sample_data sample; 7686 char name[KSYM_NAME_LEN]; 7687 int name_len; 7688 int ret; 7689 7690 if (!perf_event_bpf_match(event)) 7691 return; 7692 7693 /* get prog name and round up to 64 bit aligned */ > 7694 bpf_get_prog_name(bpf_event->prog, name); 7695 name_len = strlen(name) + 1; 7696 while (!IS_ALIGNED(name_len, sizeof(u64))) 7697 name[name_len++] = '\0'; 7698 bpf_event->event_id.len += name_len; 7699 7700 perf_event_header__init_id(&bpf_event->event_id.header, &sample, event); 7701 ret = perf_output_begin(&handle, event, 7702 bpf_event->event_id.header.size); 7703 if (ret) 7704 return; 7705 7706 perf_output_put(&handle, 
bpf_event->event_id); 7707 7708 __output_copy(&handle, name, name_len); 7709 7710 perf_event__output_id_sample(event, &handle, &sample); 7711 7712 perf_output_end(&handle); 7713 } 7714 7715 static void perf_event_bpf(struct perf_bpf_event *bpf_event) 7716 { 7717 perf_iterate_sb(perf_event_bpf_output, 7718 bpf_event, 7719 NULL); 7720 } 7721 7722 static void perf_event_bpf_event_subprog( 7723 enum perf_bpf_event_type type, 7724 struct bpf_prog *prog, u32 id, u32 sub_id) 7725 { 7726 struct perf_bpf_event bpf_event = (struct perf_bpf_event){ 7727 .prog = prog, 7728 .event_id = { 7729 .header = { 7730 .type = PERF_RECORD_BPF_EVENT, 7731 .size = sizeof(bpf_event.event_id), 7732 }, 7733 .type = type, 7734 /* .flags = 0 */ 7735 .id = id, 7736 .sub_id = sub_id, > 7737 .addr = (u64)prog->bpf_func, 7738 .len = prog->jited_len, 7739 }, 7740 }; 7741 7742 memcpy(bpf_event.event_id.tag, prog->tag, BPF_TAG_SIZE); 7743 perf_event_bpf(&bpf_event); 7744 } 7745 --- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation
[PATCH] crypto: user - support incremental algorithm dumps
From: Eric Biggers CRYPTO_MSG_GETALG in NLM_F_DUMP mode sometimes doesn't return all registered crypto algorithms, because it doesn't support incremental dumps. crypto_dump_report() only permits itself to be called once, yet the netlink subsystem allocates at most ~64 KiB for the skb being dumped to. Thus only the first recvmsg() returns data, and it may only include a subset of the crypto algorithms even if the user buffer passed to recvmsg() is large enough to hold all of them. Fix this by using one of the arguments in the netlink_callback structure to keep track of the current position in the algorithm list. Then userspace can do multiple recvmsg() on the socket after sending the dump request. This is the way netlink dumps work elsewhere in the kernel; it's unclear why this was different (probably just an oversight). Also fix an integer overflow when calculating the dump buffer size hint. Fixes: a38f7907b926 ("crypto: Add userspace configuration API") Signed-off-by: Eric Biggers --- crypto/crypto_user_base.c | 37 - 1 file changed, 20 insertions(+), 17 deletions(-) diff --git a/crypto/crypto_user_base.c b/crypto/crypto_user_base.c index 7021efbb35a1e..5311fd7fae34b 100644 --- a/crypto/crypto_user_base.c +++ b/crypto/crypto_user_base.c @@ -231,30 +231,33 @@ static int crypto_report(struct sk_buff *in_skb, struct nlmsghdr *in_nlh, static int crypto_dump_report(struct sk_buff *skb, struct netlink_callback *cb) { - struct crypto_alg *alg; + const size_t start_pos = cb->args[0]; + size_t pos = 0; struct crypto_dump_info info; - int err; - - if (cb->args[0]) - goto out; - - cb->args[0] = 1; + struct crypto_alg *alg; + int res; info.in_skb = cb->skb; info.out_skb = skb; info.nlmsg_seq = cb->nlh->nlmsg_seq; info.nlmsg_flags = NLM_F_MULTI; + down_read(&crypto_alg_sem); list_for_each_entry(alg, &crypto_alg_list, cra_list) { - err = crypto_report_alg(alg, &info); - if (err) - goto out_err; + if (pos >= start_pos) { + res = crypto_report_alg(alg, &info); + if (res == 
-EMSGSIZE) + break; + if (res) + goto out; + } + pos++; } - + cb->args[0] = pos; + res = skb->len; out: - return skb->len; -out_err: - return err; + up_read(&crypto_alg_sem); + return res; } static int crypto_dump_report_done(struct netlink_callback *cb) @@ -442,7 +445,7 @@ static int crypto_user_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh, if ((type == (CRYPTO_MSG_GETALG - CRYPTO_MSG_BASE) && (nlh->nlmsg_flags & NLM_F_DUMP))) { struct crypto_alg *alg; - u16 dump_alloc = 0; + unsigned long dump_alloc = 0; if (link->dump == NULL) return -EINVAL; @@ -450,16 +453,16 @@ static int crypto_user_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh, down_read(&crypto_alg_sem); list_for_each_entry(alg, &crypto_alg_list, cra_list) dump_alloc += CRYPTO_REPORT_MAXSIZE; + up_read(&crypto_alg_sem); { struct netlink_dump_control c = { .dump = link->dump, .done = link->done, - .min_dump_alloc = dump_alloc, + .min_dump_alloc = min(dump_alloc, 65535UL), }; err = netlink_dump_start(crypto_nlsk, skb, nlh, &c); } - up_read(&crypto_alg_sem); return err; } -- 2.20.0.rc2.403.gdbc3b29805-goog
[net-next PATCH RFC 8/8] veth: xdp_frames redirected into veth need to transfer xdp_mem_info
XDP frames redirected into a veth device, that choose XDP_PASS end-up creating an SKB from the xdp_frame. The xdp_frame mem info need to be transferred into the SKB. Signed-off-by: Jesper Dangaard Brouer Signed-off-by: Ilias Apalodimas --- drivers/net/veth.c |1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/veth.c b/drivers/net/veth.c index f412ea1cef18..925d300402ca 100644 --- a/drivers/net/veth.c +++ b/drivers/net/veth.c @@ -555,6 +555,7 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_rq *rq, goto err; } + skb->mem_info = frame->mem; xdp_scrub_frame(frame); skb->protocol = eth_type_trans(skb, rq->dev); err:
[net-next PATCH RFC 2/8] net: mvneta: use page pool API for sw buffer manager
From: Ilias Apalodimas Use the page_pool api for allocations and DMA handling instead of __dev_alloc_page()/dma_map_page() and free_page()/dma_unmap_page(). The page_pool API offers buffer recycling capabilities for XDP but allocates one page per packet, unless the driver splits and manages the allocated page. Although XDP is not a part of the driver yet, the current implementation is allocating one page per packet, thus there's no performance penalty from using the API. For now pages are unmapped via page_pool_unmap_page() before packets travel into the network stack, as it doesn't have a return hook yet. Given this call cleared the page_pool state, it is safe to let the page be returned to the normal page allocator. Signed-off-by: Ilias Apalodimas Signed-off-by: Jesper Dangaard Brouer --- drivers/net/ethernet/marvell/Kconfig |1 + drivers/net/ethernet/marvell/mvneta.c | 56 - 2 files changed, 41 insertions(+), 16 deletions(-) diff --git a/drivers/net/ethernet/marvell/Kconfig b/drivers/net/ethernet/marvell/Kconfig index 3238aa7f5dac..3325abe67465 100644 --- a/drivers/net/ethernet/marvell/Kconfig +++ b/drivers/net/ethernet/marvell/Kconfig @@ -60,6 +60,7 @@ config MVNETA depends on ARCH_MVEBU || COMPILE_TEST select MVMDIO select PHYLINK + select PAGE_POOL ---help--- This driver supports the network interface units in the Marvell ARMADA XP, ARMADA 370, ARMADA 38x and diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c index 5bfd349bf41a..2354421fe96f 100644 --- a/drivers/net/ethernet/marvell/mvneta.c +++ b/drivers/net/ethernet/marvell/mvneta.c @@ -33,6 +33,7 @@ #include #include #include "mvneta_bm.h" +#include #include #include #include @@ -624,6 +625,9 @@ struct mvneta_rx_queue { struct sk_buff *skb; int left_size; + /* page pool */ + struct page_pool *page_pool; + /* error counters */ u32 skb_alloc_err; u32 refill_err; @@ -1813,17 +1817,11 @@ static int mvneta_rx_refill(struct mvneta_port *pp, dma_addr_t phys_addr; struct page 
*page; - page = __dev_alloc_page(gfp_mask); + page = page_pool_dev_alloc_pages(rxq->page_pool); if (!page) return -ENOMEM; - /* map page for use */ - phys_addr = dma_map_page(pp->dev->dev.parent, page, 0, PAGE_SIZE, -DMA_FROM_DEVICE); - if (unlikely(dma_mapping_error(pp->dev->dev.parent, phys_addr))) { - __free_page(page); - return -ENOMEM; - } + phys_addr = page_pool_get_dma_addr(page); phys_addr += pp->rx_offset_correction; mvneta_rx_desc_fill(rx_desc, phys_addr, page, rxq); @@ -1892,10 +1890,11 @@ static void mvneta_rxq_drop_pkts(struct mvneta_port *pp, if (!data || !(rx_desc->buf_phys_addr)) continue; - dma_unmap_page(pp->dev->dev.parent, rx_desc->buf_phys_addr, - PAGE_SIZE, DMA_FROM_DEVICE); - __free_page(data); + page_pool_put_page(rxq->page_pool, data, false); } + + if (rxq->page_pool) + page_pool_destroy(rxq->page_pool); } static inline @@ -2010,8 +2009,7 @@ static int mvneta_rx_swbm(struct napi_struct *napi, skb_add_rx_frag(rxq->skb, frag_num, page, frag_offset, frag_size, PAGE_SIZE); - dma_unmap_page(dev->dev.parent, phys_addr, - PAGE_SIZE, DMA_FROM_DEVICE); + page_pool_unmap_page(rxq->page_pool, page); rxq->left_size -= frag_size; } } else { @@ -2041,8 +2039,7 @@ static int mvneta_rx_swbm(struct napi_struct *napi, frag_offset, frag_size, PAGE_SIZE); - dma_unmap_page(dev->dev.parent, phys_addr, - PAGE_SIZE, DMA_FROM_DEVICE); + page_pool_unmap_page(rxq->page_pool, page); rxq->left_size -= frag_size; } @@ -2828,11 +2825,37 @@ static int mvneta_poll(struct napi_struct *napi, int budget) return rx_done; } +static int mvneta_create_page_pool(struct mvneta_port *pp, + struct mvneta_rx_queue *rxq, int num) +{ + struct page_pool_params pp_params = { 0 }; + int err = 0; + + pp_params.order = 0; + /* internal DMA mapping in page_pool */ + pp_params.flags = PP_FLAG_DMA_MAP; + pp_params.pool_size = num; + pp_params.nid = NUMA_NO_NO
[net-next PATCH RFC 1/8] page_pool: add helper functions for DMA
From: Ilias Apalodimas Add helper functions for retrieving dma_addr_t stored in page_private and unmapping dma addresses, mapped via the page_pool API. Signed-off-by: Ilias Apalodimas Signed-off-by: Jesper Dangaard Brouer --- include/net/page_pool.h |6 ++ net/core/page_pool.c|7 +++ 2 files changed, 13 insertions(+) diff --git a/include/net/page_pool.h b/include/net/page_pool.h index 694d055e01ef..439f9183d4cd 100644 --- a/include/net/page_pool.h +++ b/include/net/page_pool.h @@ -111,6 +111,8 @@ struct page_pool *page_pool_create(const struct page_pool_params *params); void page_pool_destroy(struct page_pool *pool); +void page_pool_unmap_page(struct page_pool *pool, struct page *page); + /* Never call this directly, use helpers below */ void __page_pool_put_page(struct page_pool *pool, struct page *page, bool allow_direct); @@ -141,4 +143,8 @@ static inline bool is_page_pool_compiled_in(void) #endif } +static inline dma_addr_t page_pool_get_dma_addr(struct page *page) +{ + return page_private(page); +} #endif /* _NET_PAGE_POOL_H */ diff --git a/net/core/page_pool.c b/net/core/page_pool.c index 43a932cb609b..26e14a17a67c 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -184,6 +184,13 @@ static void __page_pool_clean_page(struct page_pool *pool, set_page_private(page, 0); } +/* unmap the page and clean our state */ +void page_pool_unmap_page(struct page_pool *pool, struct page *page) +{ + __page_pool_clean_page(pool, page); +} +EXPORT_SYMBOL(page_pool_unmap_page); + /* Return a page to the page allocator, cleaning up our state */ static void __page_pool_return_page(struct page_pool *pool, struct page *page) {
[net-next PATCH RFC 6/8] mvneta: activate page recycling via skb using page_pool
Previous mvneta patches have already started to use page_pool, but this was primarily for RX page alloc-side and for doing DMA map/unmap handling. Pages traveling through the netstack were unmapped and returned through the normal page allocator. It is now time to activate recycling of these pages. This involves registering the page_pool with the XDP rxq memory model API, even though the driver doesn't support XDP itself yet, and simply updating the SKB->mem_info field with info from the xdp_rxq_info. Signed-off-by: Jesper Dangaard Brouer Signed-off-by: Ilias Apalodimas --- drivers/net/ethernet/marvell/mvneta.c | 29 + 1 file changed, 25 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c index 78f1fcdc1f00..449c19829d67 100644 --- a/drivers/net/ethernet/marvell/mvneta.c +++ b/drivers/net/ethernet/marvell/mvneta.c @@ -628,6 +628,9 @@ struct mvneta_rx_queue { /* page pool */ struct page_pool *page_pool; + /* XDP */ + struct xdp_rxq_info xdp_rxq; + /* error counters */ u32 skb_alloc_err; u32 refill_err; @@ -1892,6 +1895,9 @@ static void mvneta_rxq_drop_pkts(struct mvneta_port *pp, page_pool_put_page(rxq->page_pool, data, false); } + if (xdp_rxq_info_is_reg(&rxq->xdp_rxq)) + xdp_rxq_info_unreg(&rxq->xdp_rxq); + if (rxq->page_pool) page_pool_destroy(rxq->page_pool); } @@ -1978,11 +1984,11 @@ static int mvneta_rx_swbm(struct napi_struct *napi, rx_desc->buf_phys_addr = 0; frag_num = 0; + rxq->skb->mem_info = rxq->xdp_rxq.mem; skb_reserve(rxq->skb, MVNETA_MH_SIZE + NET_SKB_PAD); skb_put(rxq->skb, rx_bytes < PAGE_SIZE ? rx_bytes : PAGE_SIZE); mvneta_rx_csum(pp, rx_status, rxq->skb); - page_pool_unmap_page(rxq->page_pool, page); rxq->left_size = rx_bytes < PAGE_SIZE ?
0 : rx_bytes - PAGE_SIZE; } else { @@ -2001,7 +2007,7 @@ static int mvneta_rx_swbm(struct napi_struct *napi, skb_add_rx_frag(rxq->skb, frag_num, page, 0, frag_size, PAGE_SIZE); - + /* skb frags[] are not recycled, unmap now */ page_pool_unmap_page(rxq->page_pool, page); rxq->left_size -= frag_size; @@ -2815,10 +2821,25 @@ static int mvneta_create_page_pool(struct mvneta_port *pp, static int mvneta_rxq_fill(struct mvneta_port *pp, struct mvneta_rx_queue *rxq, int num) { - int i = 0; + int err, i = 0; + + err = mvneta_create_page_pool(pp, rxq, num); + if (err) + goto out; - if (mvneta_create_page_pool(pp, rxq, num)) + err = xdp_rxq_info_reg(&rxq->xdp_rxq, pp->dev, rxq->id); + if (err) { + page_pool_destroy(rxq->page_pool); + goto out; + } + + err = xdp_rxq_info_reg_mem_model(&rxq->xdp_rxq, MEM_TYPE_PAGE_POOL, +rxq->page_pool); + if (err) { + xdp_rxq_info_unreg(&rxq->xdp_rxq); + page_pool_destroy(rxq->page_pool); goto out; + } for (i = 0; i < num; i++) { memset(rxq->descs + i, 0, sizeof(struct mvneta_rx_desc));
[net-next PATCH RFC 5/8] net: mvneta: remove copybreak, prefetch and use build_skb
From: Ilias Apalodimas The driver's memcpy for packets < 256b and its recycle tricks are not needed anymore, as the previous patch introduces buffer recycling using the page_pool API (although recycling doesn't get fully activated in this patch). After this, switch to using build_skb(). This patch implicitly fixes a driver bug where the memory was copied prior to its syncing for the CPU, in the < 256b case (as this case is removed). We also remove the data prefetch completely. The original driver had the prefetch misplaced before any dma_sync operations took place. Based on Jesper's analysis, even if the prefetch is placed after any DMA sync ops it ends up hurting performance. Signed-off-by: Ilias Apalodimas Signed-off-by: Jesper Dangaard Brouer --- drivers/net/ethernet/marvell/mvneta.c | 81 + 1 file changed, 22 insertions(+), 59 deletions(-) diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c index 2354421fe96f..78f1fcdc1f00 100644 --- a/drivers/net/ethernet/marvell/mvneta.c +++ b/drivers/net/ethernet/marvell/mvneta.c @@ -643,7 +643,6 @@ static int txq_number = 8; static int rxq_def; static int rx_copybreak __read_mostly = 256; -static int rx_header_size __read_mostly = 128; /* HW BM need that each port be identify by a unique ID */ static int global_port_id; @@ -1823,7 +1822,7 @@ static int mvneta_rx_refill(struct mvneta_port *pp, phys_addr = page_pool_get_dma_addr(page); - phys_addr += pp->rx_offset_correction; + phys_addr += pp->rx_offset_correction + NET_SKB_PAD; mvneta_rx_desc_fill(rx_desc, phys_addr, page, rxq); return 0; } @@ -1944,14 +1943,12 @@ static int mvneta_rx_swbm(struct napi_struct *napi, struct page *page; dma_addr_t phys_addr; u32 rx_status, index; - int rx_bytes, skb_size, copy_size; - int frag_num, frag_size, frag_offset; + int frag_num, frag_size; + int rx_bytes; index = rx_desc - rxq->descs; page = (struct page *)rxq->buf_virt_addr[index]; data = page_address(page); - /* Prefetch header */ - prefetch(data);
phys_addr = rx_desc->buf_phys_addr; rx_status = rx_desc->status; @@ -1969,49 +1966,25 @@ static int mvneta_rx_swbm(struct napi_struct *napi, rx_bytes = rx_desc->data_size - (ETH_FCS_LEN + MVNETA_MH_SIZE); - /* Allocate small skb for each new packet */ - skb_size = max(rx_copybreak, rx_header_size); - rxq->skb = netdev_alloc_skb_ip_align(dev, skb_size); - if (unlikely(!rxq->skb)) { - netdev_err(dev, - "Can't allocate skb on queue %d\n", - rxq->id); - dev->stats.rx_dropped++; - rxq->skb_alloc_err++; - continue; - } - copy_size = min(skb_size, rx_bytes); - - /* Copy data from buffer to SKB, skip Marvell header */ - memcpy(rxq->skb->data, data + MVNETA_MH_SIZE, - copy_size); - skb_put(rxq->skb, copy_size); - rxq->left_size = rx_bytes - copy_size; - mvneta_rx_csum(pp, rx_status, rxq->skb); - if (rxq->left_size == 0) { - int size = copy_size + MVNETA_MH_SIZE; - - dma_sync_single_range_for_cpu(dev->dev.parent, - phys_addr, 0, - size, - DMA_FROM_DEVICE); + dma_sync_single_range_for_cpu(dev->dev.parent, + phys_addr, 0, + rx_bytes, + DMA_FROM_DEVICE); - /* leave the descriptor and buffer untouched */ - } else { - /* refill descriptor with new buffer later */ - rx_desc->buf_phys_addr = 0; + rxq->skb = build_skb(data, PAGE_SIZE); + if (!rxq->skb) + break; - frag_num = 0; - frag_offset = copy_size + MVNETA_MH_SIZE; - frag_size = min(rxq->left_size, - (int)(PAGE_SIZE - frag_offset))
[net-next PATCH RFC 4/8] net: core: add recycle capabilities on skbs via page_pool API
From: Ilias Apalodimas This patch is changing struct sk_buff, and is thus by definition controversial. Place a new member 'mem_info' of type struct xdp_mem_info just after the members (flags) head_frag and pfmemalloc, and not in between headers_start/end, to ensure skb_copy() and pskb_copy() work as-is. Copying mem_info during skb_clone() is required. This makes sure that pages are correctly freed or recycled during the altered skb_free_head() invocation. The 'mem_info' name is chosen as this is not strictly tied to XDP, although the XDP return infrastructure is used. As a future plan, we could introduce a __u8 flags member to xdp_mem_info and move the flags head_frag and pfmemalloc into this area. Signed-off-by: Ilias Apalodimas Signed-off-by: Jesper Dangaard Brouer --- include/linux/skbuff.h |6 +- include/net/xdp.h |1 + net/core/skbuff.c |7 +++ net/core/xdp.c |6 ++ 4 files changed, 19 insertions(+), 1 deletion(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 7dcfb5591dc3..95dac0ba6947 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -40,6 +40,7 @@ #include #include #include +#include /* The interface for checksum offload between the stack and networking drivers * is as follows... @@ -744,6 +745,10 @@ struct sk_buff { head_frag:1, xmit_more:1, pfmemalloc:1; + /* TODO: Future idea, extend mem_info with __u8 flags, and +* move bits head_frag and pfmemalloc there.
+*/ + struct xdp_mem_info mem_info; /* fields enclosed in headers_start/headers_end are copied * using a single memcpy() in __copy_skb_header() @@ -827,7 +832,6 @@ struct sk_buff { #ifdef CONFIG_NETWORK_SECMARK __u32 secmark; #endif - union { __u32 mark; __u32 reserved_tailroom; diff --git a/include/net/xdp.h b/include/net/xdp.h index 5c33b9e0efab..4a0ca7a3d5e5 100644 --- a/include/net/xdp.h +++ b/include/net/xdp.h @@ -128,6 +128,7 @@ struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp) void xdp_return_frame(struct xdp_frame *xdpf); void xdp_return_frame_rx_napi(struct xdp_frame *xdpf); void xdp_return_buff(struct xdp_buff *xdp); +void xdp_return_skb_page(void *data, struct xdp_mem_info *mem_info); int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq, struct net_device *dev, u32 queue_index); diff --git a/net/core/skbuff.c b/net/core/skbuff.c index b4ee5c8b928f..71aca186e44c 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -70,6 +70,7 @@ #include #include #include +#include #include #include @@ -544,6 +545,11 @@ static void skb_free_head(struct sk_buff *skb) { unsigned char *head = skb->head; + if (skb->mem_info.type == MEM_TYPE_PAGE_POOL) { + xdp_return_skb_page(head, &skb->mem_info); + return; + } + if (skb->head_frag) skb_free_frag(head); else @@ -859,6 +865,7 @@ static struct sk_buff *__skb_clone(struct sk_buff *n, struct sk_buff *skb) n->nohdr = 0; n->peeked = 0; C(pfmemalloc); + C(mem_info); n->destructor = NULL; C(tail); C(end); diff --git a/net/core/xdp.c b/net/core/xdp.c index e79526314864..1703be4c2611 100644 --- a/net/core/xdp.c +++ b/net/core/xdp.c @@ -379,6 +379,12 @@ void xdp_return_buff(struct xdp_buff *xdp) } EXPORT_SYMBOL_GPL(xdp_return_buff); +void xdp_return_skb_page(void *data, struct xdp_mem_info *mem_info) +{ + __xdp_return(data, mem_info, false, 0); +} +EXPORT_SYMBOL(xdp_return_skb_page); + int xdp_attachment_query(struct xdp_attachment_info *info, struct netdev_bpf *bpf) {
[net-next PATCH RFC 7/8] xdp: bpf: cpumap redirect must update skb->mem_info
XDP-redirect to CPUMAP is about creating the SKB outside the driver (and on another CPU) via xdp_frame info. Transfer the xdp_frame mem info to the new SKB mem_info field. Signed-off-by: Jesper Dangaard Brouer Signed-off-by: Ilias Apalodimas --- kernel/bpf/cpumap.c |2 ++ 1 file changed, 2 insertions(+) diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c index 24aac0d0f412..e3e05b6ccc42 100644 --- a/kernel/bpf/cpumap.c +++ b/kernel/bpf/cpumap.c @@ -199,6 +199,8 @@ static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu, /* Essential SKB info: protocol and skb->dev */ skb->protocol = eth_type_trans(skb, xdpf->dev_rx); + skb->mem_info = xdpf->mem; + /* Optional SKB info, currently missing: * - HW checksum info (skb->ip_summed) * - HW RX hash (skb_set_hash)
[net-next PATCH RFC 0/8] page_pool DMA handling and allow recycling frames via SKB
This RFC patchset shows the plans for allowing page_pool to handle and maintain the DMA map/unmap of the pages it serves to the driver. For this to work, a return hook in the network core is introduced, which also involves extending sk_buff with the necessary information. The overall purpose is to simplify drivers by providing a page allocation API that does recycling, such that each driver doesn't have to reinvent its own recycling scheme. Using page_pool in a driver does not require implementing XDP support, but it makes it trivially easy to do so. The recycling code leverages the XDP recycle APIs. The Marvell mvneta driver is used in this patchset to demonstrate how to use the API, tested on the EspressoBIN board. We also have patches enabling XDP for this driver, but they are not part of this patchset, as we want review of the general idea of the page_pool return SKB hook. A driver using page_pool and XDP redirect into CPUMAP or veth will also take advantage of the new SKB return hook; currently only mlx5 is such a driver. The longer-term plan involves allowing other types of allocators to use this return hook, particularly allowing zero-copy AF_XDP frames to travel further into the netstack, if the userspace pages have been restricted to read-only.
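The core of the cover letter's idea — on free, a buffer that came from a recycling pool goes back to that pool instead of to the general allocator — can be sketched outside the kernel. The following is an illustrative Python model only; Pool, POOLS and free_buffer are invented names for this sketch, while the real mechanism lives in the kernel's skb_free_head()/page_pool code:

```python
# Conceptual sketch of the page_pool "return hook" described above.
# All names here (Pool, POOLS, free_buffer) are illustrative, not kernel APIs.

MEM_TYPE_PAGE_ORDER0 = 1
MEM_TYPE_PAGE_POOL = 3  # values chosen to mirror enum xdp_mem_type, for flavor

class Pool:
    """A recycling allocator: freed buffers go back on a free list."""
    def __init__(self, size, prefill=4):
        self.size = size
        self.free_list = [bytearray(size) for _ in range(prefill)]
        self.fresh_allocations = 0  # how often we had to allocate fresh

    def get(self):
        if self.free_list:
            return self.free_list.pop()
        self.fresh_allocations += 1
        return bytearray(self.size)

    def put(self, buf):
        self.free_list.append(buf)

POOLS = {}  # mem.id -> Pool, loosely mirroring the mem_id_ht rhashtable

def free_buffer(buf, mem_type, mem_id):
    """Mirrors the altered skb_free_head(): pool-backed heads are
    recycled into their originating pool; everything else takes the
    normal free path."""
    if mem_type == MEM_TYPE_PAGE_POOL:
        POOLS[mem_id].put(buf)
        return "recycled"
    return "freed"  # ordinary allocator path

pool = Pool(4096)
POOLS[1] = pool
buf = pool.get()
print(free_buffer(buf, MEM_TYPE_PAGE_POOL, 1))  # -> recycled
print(len(pool.free_list))  # -> 4 (the buffer went back on the free list)
```

The point of the hook is exactly this dispatch on mem type at free time: drivers no longer need private recycle caches, because the core free path knows how to hand pool-backed memory back.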
--- Ilias Apalodimas (4): page_pool: add helper functions for DMA net: mvneta: use page pool API for sw buffer manager net: core: add recycle capabilities on skbs via page_pool API net: mvneta: remove copybreak, prefetch and use build_skb Jesper Dangaard Brouer (4): xdp: reduce size of struct xdp_mem_info mvneta: activate page recycling via skb using page_pool xdp: bpf: cpumap redirect must update skb->mem_info veth: xdp_frames redirected into veth need to transfer xdp_mem_info drivers/net/ethernet/marvell/Kconfig |1 drivers/net/ethernet/marvell/mvneta.c | 158 + drivers/net/veth.c|1 include/linux/skbuff.h|6 + include/net/page_pool.h |6 + include/net/xdp.h |5 + kernel/bpf/cpumap.c |2 net/core/page_pool.c |7 + net/core/skbuff.c |7 + net/core/xdp.c| 14 ++- 10 files changed, 125 insertions(+), 82 deletions(-) --
[net-next PATCH RFC 3/8] xdp: reduce size of struct xdp_mem_info
It is possible to compress/reduce the size of struct xdp_mem_info. This change reduces struct xdp_mem_info from 8 bytes to 4 bytes. The member xdp_mem_info.id can be reduced to u16, as the mem_id_ht rhashtable in net/core/xdp.c is already limited by MEM_ID_MAX=0xFFFE, which safely fits in a u16. The member xdp_mem_info.type could be reduced to something smaller than u16, as it stores the enum xdp_mem_type, but due to alignment it is only reduced to u16. Signed-off-by: Jesper Dangaard Brouer --- include/net/xdp.h |4 ++-- net/core/xdp.c|8 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/include/net/xdp.h b/include/net/xdp.h index 0f25b3675c5c..5c33b9e0efab 100644 --- a/include/net/xdp.h +++ b/include/net/xdp.h @@ -46,8 +46,8 @@ enum xdp_mem_type { #define XDP_XMIT_FLAGS_MASKXDP_XMIT_FLUSH struct xdp_mem_info { - u32 type; /* enum xdp_mem_type, but known size type */ - u32 id; + u16 type; /* enum xdp_mem_type, but known size type */ + u16 id; }; struct page_pool; diff --git a/net/core/xdp.c b/net/core/xdp.c index 4b2b194f4f1f..e79526314864 100644 --- a/net/core/xdp.c +++ b/net/core/xdp.c @@ -42,11 +42,11 @@ struct xdp_mem_allocator { static u32 xdp_mem_id_hashfn(const void *data, u32 len, u32 seed) { - const u32 *k = data; - const u32 key = *k; + const u16 *k = data; + const u16 key = *k; BUILD_BUG_ON(FIELD_SIZEOF(struct xdp_mem_allocator, mem.id) -!= sizeof(u32)); +!= sizeof(u16)); /* Use cyclic increasing ID as direct hash key */ return key; @@ -56,7 +56,7 @@ static int xdp_mem_id_cmp(struct rhashtable_compare_arg *arg, const void *ptr) { const struct xdp_mem_allocator *xa = ptr; - u32 mem_id = *(u32 *)arg->key; + u16 mem_id = *(u16 *)arg->key; return xa->mem.id != mem_id; }
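The 8-byte-to-4-byte claim can be sanity-checked from userspace with Python's struct module, since the layouts involved are simply two u32 members versus two u16 members under native C alignment:

```python
import struct

# struct xdp_mem_info before the patch: two u32 members (type, id)
before = struct.calcsize('II')   # native size/alignment, like the C compiler
# struct xdp_mem_info after the patch: two u16 members
after = struct.calcsize('HH')

print(before, after)  # -> 8 4

# The id member is bounded by MEM_ID_MAX = 0xFFFE, so it fits a u16.
MEM_ID_MAX = 0xFFFE
assert MEM_ID_MAX <= 0xFFFF
```

This also illustrates the alignment remark in the commit message: packing a sub-u16 type next to a u16 id would still round the struct up, which is why type is only shrunk to u16.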
Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
> On Dec 6, 2018, at 12:17 PM, Andy Lutomirski wrote: > > On Thu, Dec 6, 2018 at 11:39 AM Nadav Amit wrote: >>> On Dec 6, 2018, at 11:19 AM, Andy Lutomirski wrote: >>> >>> On Thu, Dec 6, 2018 at 11:01 AM Tycho Andersen wrote: On Thu, Dec 06, 2018 at 10:53:50AM -0800, Andy Lutomirski wrote: >> If we are going to unmap the linear alias, why not do it at vmalloc() >> time rather than vfree() time? > > That’s not totally nuts. Do we ever have code that expects __va() to > work on module data? Perhaps crypto code trying to encrypt static > data because our APIs don’t understand virtual addresses. I guess if > highmem is ever used for modules, then we should be fine. > > RO instead of not present might be safer. But I do like the idea of > renaming Rick's flag to something like VM_XPFO or VM_NO_DIRECT_MAP and > making it do all of this. Yeah, doing it for everything automatically seemed like it was/is going to be a lot of work to debug all the corner cases where things expect memory to be mapped but don't explicitly say it. And in particular, the XPFO series only does it for user memory, whereas an additional flag like this would work for extra paranoid allocations of kernel memory too. >>> >>> I just read the code, and I looks like vmalloc() is already using >>> highmem (__GFP_HIGH) if available, so, on big x86_32 systems, for >>> example, we already don't have modules in the direct map. >>> >>> So I say we go for it. This should be quite simple to implement -- >>> the pageattr code already has almost all the needed logic on x86. The >>> only arch support we should need is a pair of functions to remove a >>> vmalloc address range from the address map (if it was present in the >>> first place) and a function to put it back. On x86, this should only >>> be a few lines of code. >>> >>> What do you all think? This should solve most of the problems we have. 
>>> >>> If we really wanted to optimize this, we'd make it so that >>> module_alloc() allocates memory the normal way, then, later on, we >>> call some function that, all at once, removes the memory from the >>> direct map and applies the right permissions to the vmalloc alias (or >>> just makes the vmalloc alias not-present so we can add permissions >>> later without flushing), and flushes the TLB. And we arrange for >>> vunmap to zap the vmalloc range, then put the memory back into the >>> direct map, then free the pages back to the page allocator, with the >>> flush in the appropriate place. >>> >>> I don't see why the page allocator needs to know about any of this. >>> It's already okay with the permissions being changed out from under it >>> on x86, and it seems fine. Rick, do you want to give some variant of >>> this a try? >> >> Setting it as read-only may work (and already happens for the read-only >> module data). I am not sure about setting it as non-present. >> >> At some point, a discussion about a threat-model, as Rick indicated, would >> be required. I presume ROP attacks can easily call set_all_modules_text_rw() >> and override all the protections. > > I am far from an expert on exploit techniques, but here's a > potentially useful model: let's assume there's an attacker who can > write controlled data to a controlled kernel address but cannot > directly modify control flow. It would be nice for such an attacker > to have a very difficult time of modifying kernel text or of > compromising control flow. So we're assuming a feature like kernel > CET or that the attacker finds it very difficult to do something like > modifying some thread's IRET frame. 
> > Admittedly, for the kernel, this is an odd threat model, since an > attacker can presumably quite easily learn the kernel stack address of > one of their tasks, do some syscall, and then modify their kernel > thread's stack such that it will IRET right back to a fully controlled > register state with RSP pointing at an attacker-supplied kernel stack. > So this threat model gives very strong ROP powers. unless we have > either CET or some software technique to harden all the RET > instructions in the kernel. > > I wonder if there's a better model to use. Maybe with stack-protector > we get some degree of protection? Or is all of this is rather weak > until we have CET or a RAP-like feature. I believe that seeing the end-goal would make reasoning about patches easier, otherwise the complaint “but anyhow it’s all insecure” keeps popping up. I’m not sure CET or other CFI would be enough even with this threat-model. The page-tables (the very least) need to be write-protected, as otherwise controlled data writes may just modify them. There are various possible solutions I presume: write_rare for page-tables, hypervisor-assisted security to obtain physical level NX/RO (a-la Microsoft VBS) or some sort of hardware enclave. What do you think?
RE: [Intel-wired-lan] [PATCH] ixgbe: Fix race when the VF driver does a reset
> -Original Message- > From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On > Behalf Of Ross Lagerwall > Sent: Wednesday, December 5, 2018 5:54 AM > To: netdev@vger.kernel.org; intel-wired-...@lists.osuosl.org > Cc: Ross Lagerwall ; David S. Miller > > Subject: [Intel-wired-lan] [PATCH] ixgbe: Fix race when the VF driver does a > reset > > When the VF driver does a reset, it (at least the Linux one) writes to the > VFCTRL register to issue a reset and then immediately sends a reset message > using the mailbox API. This is racy because when the PF driver detects that > the VFCTRL register reset pin has been asserted, it clears the mailbox > memory. Depending on ordering, the reset message sent by the VF could be > cleared by the PF driver. It then responds to the cleared message with a > NACK which causes the VF driver to malfunction. > Fix this by deferring clearing the mailbox memory until the reset message is > received. > > Fixes: 939b701ad633 ("ixgbe: fix driver behaviour after issuing VFLR") > Signed-off-by: Ross Lagerwall > --- > drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 11 ++- > 1 file changed, 10 insertions(+), 1 deletion(-) Tested-by: Andrew Bowers
RE: [Intel-wired-lan] [PATCH v3 2/2] i40e: DRY rx_ptype handling code
> -Original Message- > From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On > Behalf Of Michal Miroslaw > Sent: Tuesday, December 4, 2018 9:31 AM > To: netdev@vger.kernel.org; intel-wired-...@lists.osuosl.org > Subject: [Intel-wired-lan] [PATCH v3 2/2] i40e: DRY rx_ptype handling code > > Move rx_ptype extracting to i40e_process_skb_fields() to avoid duplicating > the code. > > Signed-off-by: Michał Mirosław > Signed-off-by: Michał Mirosław > --- > v3: > * no changes > v2: > * fix prototype in i40e_txrx_common.h > --- > drivers/net/ethernet/intel/i40e/i40e_txrx.c| 12 > drivers/net/ethernet/intel/i40e/i40e_txrx_common.h | 3 +-- > drivers/net/ethernet/intel/i40e/i40e_xsk.c | 6 +- > 3 files changed, 6 insertions(+), 15 deletions(-) Tested-by: Andrew Bowers
RE: [Intel-wired-lan] [PATCH v3 1/2] i40e: fix VLAN.TCI == 0 RX HW offload
> -Original Message- > From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On > Behalf Of Michal Miroslaw > Sent: Tuesday, December 4, 2018 9:31 AM > To: netdev@vger.kernel.org; intel-wired-...@lists.osuosl.org > Subject: [Intel-wired-lan] [PATCH v3 1/2] i40e: fix VLAN.TCI == 0 RX HW > offload > > This fixes two bugs in hardware VLAN offload: > 1. VLAN.TCI == 0 was being dropped > 2. there was a race between disabling of VLAN RX feature in hardware > and processing RX queue, where packets processed in this window > could have their VLAN information dropped > > Fix moves the VLAN handling into i40e_process_skb_fields() to save on > duplicated code. i40e_receive_skb() becomes trivial and so is removed. > > Signed-off-by: Michał Mirosław > Signed-off-by: Michał Mirosław > --- > v3: > * fix whitespace for checkpatch > v2: > * no changes > --- > drivers/net/ethernet/intel/i40e/i40e_txrx.c | 31 +-- > .../ethernet/intel/i40e/i40e_txrx_common.h| 2 -- > drivers/net/ethernet/intel/i40e/i40e_xsk.c| 6 +--- > 3 files changed, 9 insertions(+), 30 deletions(-) Tested-by: Andrew Bowers
[PATCH net-next 1/4] tc-testing: Add command timeout feature to tdc
Using an attribute set in the tdc_config.py file, limit the amount of time tdc will wait for an executed command to complete and prevent the script from hanging entirely. This timeout will be applied to all executed commands. Signed-off-by: Lucas Bates --- tools/testing/selftests/tc-testing/tdc.py| 16 +++- tools/testing/selftests/tc-testing/tdc_config.py | 2 ++ 2 files changed, 13 insertions(+), 5 deletions(-) diff --git a/tools/testing/selftests/tc-testing/tdc.py b/tools/testing/selftests/tc-testing/tdc.py index 7607ba3..b862ee5 100755 --- a/tools/testing/selftests/tc-testing/tdc.py +++ b/tools/testing/selftests/tc-testing/tdc.py @@ -131,12 +131,16 @@ def exec_cmd(args, pm, stage, command): stdout=subprocess.PIPE, stderr=subprocess.PIPE, env=ENVIR) -(rawout, serr) = proc.communicate() -if proc.returncode != 0 and len(serr) > 0: -foutput = serr.decode("utf-8", errors="ignore") -else: -foutput = rawout.decode("utf-8", errors="ignore") +try: +(rawout, serr) = proc.communicate(timeout=NAMES['TIMEOUT']) +if proc.returncode != 0 and len(serr) > 0: +foutput = serr.decode("utf-8", errors="ignore") +else: +foutput = rawout.decode("utf-8", errors="ignore") +except subprocess.TimeoutExpired: +foutput = "Command \"{}\" timed out\n".format(command) +proc.returncode = 255 proc.stdout.close() proc.stderr.close() @@ -438,6 +442,8 @@ def check_default_settings(args, remaining, pm): NAMES['TC'] = args.path if args.device != None: NAMES['DEV2'] = args.device +if 'TIMEOUT' not in NAMES: +NAMES['TIMEOUT'] = None if not os.path.isfile(NAMES['TC']): print("The specified tc path " + NAMES['TC'] + " does not exist.") exit(1) diff --git a/tools/testing/selftests/tc-testing/tdc_config.py b/tools/testing/selftests/tc-testing/tdc_config.py index d651bc1..6d91e48 100644 --- a/tools/testing/selftests/tc-testing/tdc_config.py +++ b/tools/testing/selftests/tc-testing/tdc_config.py @@ -15,6 +15,8 @@ NAMES = { 'DEV1': 'v0p1', 'DEV2': '', 'BATCH_FILE': './batch.txt', + # Length of time in seconds to 
wait before terminating a command + 'TIMEOUT': 12, # Name of the namespace to use 'NS': 'tcut', # Directory containing eBPF test programs -- 2.7.4
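The patch's timeout handling can be exercised stand-alone. The sketch below mirrors tdc's new exec_cmd() logic (communicate(timeout=...), a 255 return code on expiry); unlike the patch, it also reaps the timed-out child with kill() + communicate(), which the subprocess documentation recommends after TimeoutExpired:

```python
import subprocess
import sys

def exec_cmd(command, timeout=None):
    """Minimal stand-alone version of tdc's timeout handling.

    timeout=None waits forever, matching the TIMEOUT=None default
    that tdc falls back to when the config does not set one.
    Returns (returncode, output).
    """
    proc = subprocess.Popen(command, shell=True,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    try:
        (rawout, serr) = proc.communicate(timeout=timeout)
        if proc.returncode != 0 and len(serr) > 0:
            foutput = serr.decode("utf-8", errors="ignore")
        else:
            foutput = rawout.decode("utf-8", errors="ignore")
        rc = proc.returncode
    except subprocess.TimeoutExpired:
        # Extra step not in the patch: reap the child so it does not
        # linger after the timeout fires.
        proc.kill()
        proc.communicate()
        foutput = 'Command "{}" timed out\n'.format(command)
        rc = 255
    proc.stdout.close()
    proc.stderr.close()
    return rc, foutput

rc, out = exec_cmd('echo ok', timeout=5)
print(rc)  # -> 0
rc, out = exec_cmd('{} -c "import time; time.sleep(30)"'.format(sys.executable),
                   timeout=1)
print(rc)  # -> 255
```

The design choice of a sentinel return code (255) rather than raising lets the rest of tdc treat a hung command like any other failed command.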
[PATCH net-next 3/4] tc-testing: Implement the TdcResults module in tdc
In tdc and the valgrind plugin, begin using the TdcResults module to track executed test cases. Signed-off-by: Lucas Bates --- tools/testing/selftests/tc-testing/TdcPlugin.py| 3 +- .../tc-testing/plugin-lib/valgrindPlugin.py| 22 +++- tools/testing/selftests/tc-testing/tdc.py | 117 - 3 files changed, 91 insertions(+), 51 deletions(-) diff --git a/tools/testing/selftests/tc-testing/TdcPlugin.py b/tools/testing/selftests/tc-testing/TdcPlugin.py index 3ee9a6d..1d9e279 100644 --- a/tools/testing/selftests/tc-testing/TdcPlugin.py +++ b/tools/testing/selftests/tc-testing/TdcPlugin.py @@ -18,11 +18,12 @@ class TdcPlugin: if self.args.verbose > 1: print(' -- {}.post_suite'.format(self.sub_class)) -def pre_case(self, test_ordinal, testid): +def pre_case(self, test_ordinal, testid, test_name): '''run commands before test_runner does one test''' if self.args.verbose > 1: print(' -- {}.pre_case'.format(self.sub_class)) self.args.testid = testid +self.args.test_name = test_name self.args.test_ordinal = test_ordinal def post_case(self): diff --git a/tools/testing/selftests/tc-testing/plugin-lib/valgrindPlugin.py b/tools/testing/selftests/tc-testing/plugin-lib/valgrindPlugin.py index 477a7bd..e00c798 100644 --- a/tools/testing/selftests/tc-testing/plugin-lib/valgrindPlugin.py +++ b/tools/testing/selftests/tc-testing/plugin-lib/valgrindPlugin.py @@ -11,6 +11,7 @@ from string import Template import subprocess import time from TdcPlugin import TdcPlugin +from TdcResults import * from tdc_config import * @@ -21,6 +22,7 @@ class SubPlugin(TdcPlugin): def __init__(self): self.sub_class = 'valgrind/SubPlugin' self.tap = '' +self._tsr = TestSuiteReport() super().__init__() def pre_suite(self, testcount, testidlist): @@ -34,10 +36,14 @@ class SubPlugin(TdcPlugin): def post_suite(self, index): '''run commands after test_runner goes into a test loop''' super().post_suite(index) -self._add_to_tap('\n|---\n') if self.args.verbose > 1: print('{}.post_suite'.format(self.sub_class)) 
-print('{}'.format(self.tap)) +#print('{}'.format(self.tap)) +for xx in range(index - 1, self.testcount): +res = TestResult('{}-mem'.format(self.testidlist[xx]), 'Test skipped') +res.set_result(ResultState.skip) +res.set_errormsg('Skipped because of prior setup/teardown failure') +self._add_results(res) if self.args.verbose < 4: subprocess.check_output('rm -f vgnd-*.log', shell=True) @@ -128,8 +134,17 @@ class SubPlugin(TdcPlugin): nle_num = int(nle_mo.group(1)) mem_results = '' +res = TestResult('{}-mem'.format(self.args.testid), + '{} memory leak check'.format(self.args.test_name)) if (def_num > 0) or (ind_num > 0) or (pos_num > 0) or (nle_num > 0): mem_results += 'not ' +res.set_result(ResultState.fail) +res.set_failmsg('Memory leak detected') +res.append_failmsg(content) +else: +res.set_result(ResultState.success) + +self._add_results(res) mem_results += 'ok {} - {}-mem # {}\n'.format( self.args.test_ordinal, self.args.testid, 'memory leak check') @@ -138,5 +153,8 @@ class SubPlugin(TdcPlugin): print('{}'.format(content)) self._add_to_tap(content) +def _add_results(self, res): +self._tsr.add_resultdata(res) + def _add_to_tap(self, more_tap_output): self.tap += more_tap_output diff --git a/tools/testing/selftests/tc-testing/tdc.py b/tools/testing/selftests/tc-testing/tdc.py index b862ee5..e6e4ce8 100755 --- a/tools/testing/selftests/tc-testing/tdc.py +++ b/tools/testing/selftests/tc-testing/tdc.py @@ -23,6 +23,7 @@ from tdc_config import * from tdc_helper import * import TdcPlugin +from TdcResults import * class PluginMgrTestFail(Exception): @@ -60,10 +61,10 @@ class PluginMgr: for pgn_inst in reversed(self.plugin_instances): pgn_inst.post_suite(index) -def call_pre_case(self, test_ordinal, testid): +def call_pre_case(self, test_ordinal, testid, test_name): for pgn_inst in self.plugin_instances: try: -pgn_inst.pre_case(test_ordinal, testid) +pgn_inst.pre_case(test_ordinal, testid, test_name) except Exception as ee: print('exception {} in call to pre_case for {} 
plugin'. format(ee, pgn_inst.__class__)) @@ -102,7 +103,6 @@ class PluginMgr: self.argparser = argparse.ArgumentParser( description='Linux TC unit tests') - def replace_keywords(cmd): """ For a given executable command, substitute any known @@ -187,6 +187,7 @@ de
[PATCH net-next 2/4] tc-testing: Add new TdcResults module
This module includes new classes for tdc to use in keeping track of test case results, instead of generating and tracking a lengthy string. The new module can be extended to support multiple formal test result formats to be friendlier to automation. Signed-off-by: Lucas Bates --- tools/testing/selftests/tc-testing/TdcResults.py | 133 +++ 1 file changed, 133 insertions(+) create mode 100644 tools/testing/selftests/tc-testing/TdcResults.py diff --git a/tools/testing/selftests/tc-testing/TdcResults.py b/tools/testing/selftests/tc-testing/TdcResults.py new file mode 100644 index 000..250fbb2 --- /dev/null +++ b/tools/testing/selftests/tc-testing/TdcResults.py @@ -0,0 +1,133 @@ +#!/usr/bin/env python3 + +from enum import Enum + +class ResultState(Enum): +noresult = -1 +skip = 0 +success = 1 +fail = 2 + +class TestResult: +def __init__(self, test_id="", test_name=""): + self.test_id = test_id + self.test_name = test_name + self.result = ResultState.noresult + self.failmsg = "" + self.errormsg = "" + self.steps = [] + +def set_result(self, result): +if (isinstance(result, ResultState)): +self.result = result +return True +else: +raise TypeError('Unknown result type, must be type ResultState') + +def get_result(self): +return self.result + +def set_errormsg(self, errormsg): +self.errormsg = errormsg +return True + +def append_errormsg(self, errormsg): +self.errormsg = '{}\n{}'.format(self.errormsg, errormsg) + +def get_errormsg(self): +return self.errormsg + +def set_failmsg(self, failmsg): +self.failmsg = failmsg +return True + +def append_failmsg(self, failmsg): +self.failmsg = '{}\n{}'.format(self.failmsg, failmsg) + +def get_failmsg(self): +return self.failmsg + +def add_steps(self, newstep): +if type(newstep) == list: +self.steps.extend(newstep) +elif type(newstep) == str: +self.steps.append(step) +else: +raise TypeError('TdcResults.add_steps() requires a list or str') + +def get_executed_steps(self): +return self.steps + +class TestSuiteReport(): +_testsuite = [] + 
+def add_resultdata(self, result_data): +if isinstance(result_data, TestResult): +self._testsuite.append(result_data) +return True +def count_tests(self): +return len(self._testsuite) +def count_failures(self): +return sum(1 for t in self._testsuite if t.result == ResultState.fail) +def count_skips(self): +return sum(1 for t in self._testsuite if t.result == ResultState.skip) +def find_result(self, test_id): +return next((tr for tr in self._testsuite if tr.test_id == test_id), None) +def update_result(self, result_data): +orig = self.find_result(result_data.test_id) +if orig != None: +idx = self._testsuite.index(orig) +self._testsuite[idx] = result_data +else: +self.add_resultdata(result_data) +def format_tap(self): +ftap = "" +ftap += '1..{}\n'.format(self.count_tests()) +index = 1 +for t in self._testsuite: +if t.result == ResultState.fail: +ftap += 'not ' +ftap += 'ok {} {} - {}'.format(str(index), t.test_id, t.test_name) +if t.result == ResultState.skip or t.result == ResultState.noresult: +ftap += ' # skipped - {}\n'.format(t.errormsg) +elif t.result == ResultState.fail: +if len(t.steps) > 0: +ftap += '\tCommands executed in this test case:' +for step in t.steps: +ftap += '\n\t\t{}'.format(step) +ftap += '\n\t{}'.format(t.failmsg) +ftap += '\n' +index += 1 +return ftap +def format_xunit(self): +from xml.sax.saxutils import escape +xunit = "<testsuites>\n" +xunit += '\t<testsuite tests="{}" skips="{}">\n'.format(self.count_tests(), self.count_skips()) +for t in self._testsuite: +xunit += '\t\t<testcase classname="{}" '.format(escape(t.test_id)) +xunit += 'name="{}" status="{}">\n'.format(escape(t.test_name), t.result) +if t.result == ResultState.fail: +xunit += '\t\t\t<failure>\n' +if len(t.steps) > 0: +xunit += 'Commands executed in this test case:\n' +for step in t.steps: +xunit += '\t{}\n'.format(escape(step)) +xunit += 'FAILURE: {}\n'.format(escape(t.failmsg)) +xunit += '\t\t\t</failure>\n' +if t.errormsg: +xunit += '\t\t\t<error>\n{}\n'.format(escape(t.errormsg)) +xunit += '\t\t\t</error>\n' +if t.result == ResultState.skip: +xunit += '\t\t\t<skipped/>\n' +xunit += '\t\t</testcase>\n' +xunit += '\t</testsuite>\n' +xunit += '</testsuites>\n' +return xunit + -- 2.7.4
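To show what this reporting module is driving at, here is a condensed, self-contained rendition of the classes from the patch and the TAP text they are meant to emit. It is not the patch's exact code — the loop is simplified, and note one design difference called out in a comment: the patch declares `_testsuite = []` as a class attribute, meaning all TestSuiteReport instances share one list, whereas the sketch uses an instance attribute:

```python
from enum import Enum

# Condensed rendition of the TdcResults classes added by this patch,
# just enough to show the TAP output format_tap() produces.

class ResultState(Enum):
    noresult = -1
    skip = 0
    success = 1
    fail = 2

class TestResult:
    def __init__(self, test_id="", test_name=""):
        self.test_id = test_id
        self.test_name = test_name
        self.result = ResultState.noresult
        self.failmsg = ""
        self.errormsg = ""
        self.steps = []

class TestSuiteReport:
    def __init__(self):
        # Instance attribute; the patch uses a class attribute, which
        # is shared state across all instances.
        self._testsuite = []

    def add_resultdata(self, r):
        self._testsuite.append(r)

    def format_tap(self):
        out = '1..{}\n'.format(len(self._testsuite))
        for index, t in enumerate(self._testsuite, 1):
            line = 'ok {} {} - {}'.format(index, t.test_id, t.test_name)
            if t.result == ResultState.fail:
                line = 'not ' + line + '\n\t' + t.failmsg
            elif t.result in (ResultState.skip, ResultState.noresult):
                line += ' # skipped - {}'.format(t.errormsg)
            out += line + '\n'
        return out

tsr = TestSuiteReport()
ok = TestResult('a1b2', 'Add filter')      # hypothetical test ids/names
ok.result = ResultState.success
bad = TestResult('c3d4', 'Delete filter')
bad.result = ResultState.fail
bad.failmsg = 'exit code 2'
for r in (ok, bad):
    tsr.add_resultdata(r)
print(tsr.format_tap())
```

Because results are kept as structured objects rather than an accumulated string, the same suite can be rendered as TAP or xunit XML, which is the extensibility the commit message mentions.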
[PATCH net-next 0/4] tc-testing: implement command timeouts and better results tracking
Patch 1 adds a timeout feature for any command tdc launches in a subshell. This prevents tdc from hanging indefinitely. Patches 2-4 introduce a new method for tracking and generating test case results, and implements it across the core script and all applicable plugins. Lucas Bates (4): tc-testing: Add command timeout feature to tdc tc-testing: Add new TdcResults module tc-testing: Implement the TdcResults module in tdc tc-testing: gitignore, ignore generated test results tools/testing/selftests/tc-testing/.gitignore | 3 + tools/testing/selftests/tc-testing/TdcPlugin.py| 3 +- tools/testing/selftests/tc-testing/TdcResults.py | 133 + .../tc-testing/plugin-lib/valgrindPlugin.py| 22 +++- tools/testing/selftests/tc-testing/tdc.py | 133 + tools/testing/selftests/tc-testing/tdc_config.py | 2 + 6 files changed, 240 insertions(+), 56 deletions(-) create mode 100644 tools/testing/selftests/tc-testing/TdcResults.py -- 2.7.4
[PATCH net-next 4/4] tc-testing: gitignore, ignore generated test results
Ignore any .tap or .xml test result files generated by tdc. Additionally, ignore plugin symlinks. Signed-off-by: Lucas Bates --- tools/testing/selftests/tc-testing/.gitignore | 3 +++ 1 file changed, 3 insertions(+) diff --git a/tools/testing/selftests/tc-testing/.gitignore b/tools/testing/selftests/tc-testing/.gitignore index 7a60b85..c5cc160 100644 --- a/tools/testing/selftests/tc-testing/.gitignore +++ b/tools/testing/selftests/tc-testing/.gitignore @@ -1,2 +1,5 @@ __pycache__/ *.pyc +plugins/ +*.xml +*.tap -- 2.7.4
[PATCH net-next,v5 11/12] qede: place ethtool_rx_flow_spec after code after TC flower codebase
This is a preparation patch to reuse the existing TC flower codebase from ethtool_rx_flow_spec. This patch is merely moving the core ethtool_rx_flow_spec parser after tc flower offload driver code so we can skip a few forward function declarations in the follow up patch. Signed-off-by: Pablo Neira Ayuso --- v5: rebase on top of net-next head. drivers/net/ethernet/qlogic/qede/qede_filter.c | 264 - 1 file changed, 132 insertions(+), 132 deletions(-) diff --git a/drivers/net/ethernet/qlogic/qede/qede_filter.c b/drivers/net/ethernet/qlogic/qede/qede_filter.c index 833c9ec58a6e..ed77950f6cf9 100644 --- a/drivers/net/ethernet/qlogic/qede/qede_filter.c +++ b/drivers/net/ethernet/qlogic/qede/qede_filter.c @@ -1791,72 +1791,6 @@ static int qede_flow_spec_to_tuple_udpv6(struct qede_dev *edev, return 0; } -static int qede_flow_spec_to_tuple(struct qede_dev *edev, - struct qede_arfs_tuple *t, - struct ethtool_rx_flow_spec *fs) -{ - memset(t, 0, sizeof(*t)); - - if (qede_flow_spec_validate_unused(edev, fs)) - return -EOPNOTSUPP; - - switch ((fs->flow_type & ~FLOW_EXT)) { - case TCP_V4_FLOW: - return qede_flow_spec_to_tuple_tcpv4(edev, t, fs); - case UDP_V4_FLOW: - return qede_flow_spec_to_tuple_udpv4(edev, t, fs); - case TCP_V6_FLOW: - return qede_flow_spec_to_tuple_tcpv6(edev, t, fs); - case UDP_V6_FLOW: - return qede_flow_spec_to_tuple_udpv6(edev, t, fs); - default: - DP_VERBOSE(edev, NETIF_MSG_IFUP, - "Can't support flow of type %08x\n", fs->flow_type); - return -EOPNOTSUPP; - } - - return 0; -} - -static int qede_flow_spec_validate(struct qede_dev *edev, - struct ethtool_rx_flow_spec *fs, - struct qede_arfs_tuple *t) -{ - if (fs->location >= QEDE_RFS_MAX_FLTR) { - DP_INFO(edev, "Location out-of-bounds\n"); - return -EINVAL; - } - - /* Check location isn't already in use */ - if (test_bit(fs->location, edev->arfs->arfs_fltr_bmap)) { - DP_INFO(edev, "Location already in use\n"); - return -EINVAL; - } - - /* Check if the filtering-mode could support the filter */ - if 
(edev->arfs->filter_count && - edev->arfs->mode != t->mode) { - DP_INFO(edev, - "flow_spec would require filtering mode %08x, but %08x is configured\n", - t->mode, edev->arfs->filter_count); - return -EINVAL; - } - - /* If drop requested then no need to validate other data */ - if (fs->ring_cookie == RX_CLS_FLOW_DISC) - return 0; - - if (ethtool_get_flow_spec_ring_vf(fs->ring_cookie)) - return 0; - - if (fs->ring_cookie >= QEDE_RSS_COUNT(edev)) { - DP_INFO(edev, "Queue out-of-bounds\n"); - return -EINVAL; - } - - return 0; -} - /* Must be called while qede lock is held */ static struct qede_arfs_fltr_node * qede_flow_find_fltr(struct qede_dev *edev, struct qede_arfs_tuple *t) @@ -1896,72 +1830,6 @@ static void qede_flow_set_destination(struct qede_dev *edev, "Configuring N-tuple for VF 0x%02x\n", n->vfid - 1); } -int qede_add_cls_rule(struct qede_dev *edev, struct ethtool_rxnfc *info) -{ - struct ethtool_rx_flow_spec *fsp = &info->fs; - struct qede_arfs_fltr_node *n; - struct qede_arfs_tuple t; - int min_hlen, rc; - - __qede_lock(edev); - - if (!edev->arfs) { - rc = -EPERM; - goto unlock; - } - - /* Translate the flow specification into something fittign our DB */ - rc = qede_flow_spec_to_tuple(edev, &t, fsp); - if (rc) - goto unlock; - - /* Make sure location is valid and filter isn't already set */ - rc = qede_flow_spec_validate(edev, fsp, &t); - if (rc) - goto unlock; - - if (qede_flow_find_fltr(edev, &t)) { - rc = -EINVAL; - goto unlock; - } - - n = kzalloc(sizeof(*n), GFP_KERNEL); - if (!n) { - rc = -ENOMEM; - goto unlock; - } - - min_hlen = qede_flow_get_min_header_size(&t); - n->data = kzalloc(min_hlen, GFP_KERNEL); - if (!n->data) { - kfree(n); - rc = -ENOMEM; - goto unlock; - } - - n->sw_id = fsp->location; - set_bit(n->sw_id, edev->arfs->arfs_fltr_bmap); - n->buf_len = min_hlen; - - memcpy(&n->tuple, &t, sizeof(n->tuple)); - - qede_flow_set_destination(edev, n, fsp); - - /* Build a minimal header according to the flow */ - n->tuple.build_hdr(&n->tuple, 
n->data); -
[PATCH net-next,v5 01/12] flow_offload: add flow_rule and flow_match structures and use them
This patch wraps the dissector key and mask - that flower uses to represent the matching side - around the flow_match structure. To avoid a follow-up patch that would edit the same LoCs in the drivers, this patch also wraps this new flow match structure around the flow rule object. This new structure will also contain the flow actions in follow-up patches. This introduces two new interfaces: bool flow_rule_match_key(rule, dissector_id) that returns true if a given matching key is set, and: flow_rule_match_XYZ(rule, &match); to fetch the matching side XYZ into the match container structure, retrieving the key and the mask with one single call. Signed-off-by: Pablo Neira Ayuso --- v5: fix double kfree in cls_flower error path, reported by kbuild robot via Julia Lawall. drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c | 174 - .../net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c | 194 -- drivers/net/ethernet/intel/i40e/i40e_main.c| 178 - drivers/net/ethernet/intel/iavf/iavf_main.c| 195 -- drivers/net/ethernet/intel/igb/igb_main.c | 64 ++-- drivers/net/ethernet/mellanox/mlx5/core/en_tc.c| 420 + .../net/ethernet/mellanox/mlxsw/spectrum_flower.c | 202 +- drivers/net/ethernet/netronome/nfp/flower/action.c | 11 +- drivers/net/ethernet/netronome/nfp/flower/match.c | 417 ++-- .../net/ethernet/netronome/nfp/flower/offload.c| 145 +++ drivers/net/ethernet/qlogic/qede/qede_filter.c | 85 ++--- include/net/flow_offload.h | 115 ++ include/net/pkt_cls.h | 11 +- net/core/Makefile | 2 +- net/core/flow_offload.c| 143 +++ net/sched/cls_flower.c | 47 ++- 16 files changed, 1196 insertions(+), 1207 deletions(-) create mode 100644 include/net/flow_offload.h create mode 100644 net/core/flow_offload.c diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c index 749f63beddd8..b82143d6cdde 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c @@ -177,18 +177,12 @@ static int
bnxt_tc_parse_actions(struct bnxt *bp, return 0; } -#define GET_KEY(flow_cmd, key_type)\ - skb_flow_dissector_target((flow_cmd)->dissector, key_type,\ - (flow_cmd)->key) -#define GET_MASK(flow_cmd, key_type) \ - skb_flow_dissector_target((flow_cmd)->dissector, key_type,\ - (flow_cmd)->mask) - static int bnxt_tc_parse_flow(struct bnxt *bp, struct tc_cls_flower_offload *tc_flow_cmd, struct bnxt_tc_flow *flow) { - struct flow_dissector *dissector = tc_flow_cmd->dissector; + struct flow_rule *rule = tc_cls_flower_offload_flow_rule(tc_flow_cmd); + struct flow_dissector *dissector = rule->match.dissector; /* KEY_CONTROL and KEY_BASIC are needed for forming a meaningful key */ if ((dissector->used_keys & BIT(FLOW_DISSECTOR_KEY_CONTROL)) == 0 || @@ -198,140 +192,120 @@ static int bnxt_tc_parse_flow(struct bnxt *bp, return -EOPNOTSUPP; } - if (dissector_uses_key(dissector, FLOW_DISSECTOR_KEY_BASIC)) { - struct flow_dissector_key_basic *key = - GET_KEY(tc_flow_cmd, FLOW_DISSECTOR_KEY_BASIC); - struct flow_dissector_key_basic *mask = - GET_MASK(tc_flow_cmd, FLOW_DISSECTOR_KEY_BASIC); + if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_BASIC)) { + struct flow_match_basic match; - flow->l2_key.ether_type = key->n_proto; - flow->l2_mask.ether_type = mask->n_proto; + flow_rule_match_basic(rule, &match); + flow->l2_key.ether_type = match.key->n_proto; + flow->l2_mask.ether_type = match.mask->n_proto; - if (key->n_proto == htons(ETH_P_IP) || - key->n_proto == htons(ETH_P_IPV6)) { - flow->l4_key.ip_proto = key->ip_proto; - flow->l4_mask.ip_proto = mask->ip_proto; + if (match.key->n_proto == htons(ETH_P_IP) || + match.key->n_proto == htons(ETH_P_IPV6)) { + flow->l4_key.ip_proto = match.key->ip_proto; + flow->l4_mask.ip_proto = match.mask->ip_proto; } } - if (dissector_uses_key(dissector, FLOW_DISSECTOR_KEY_ETH_ADDRS)) { - struct flow_dissector_key_eth_addrs *key = - GET_KEY(tc_flow_cmd, FLOW_DISSECTOR_KEY_ETH_ADDRS); - struct flow_dissector_key_eth_addrs
[PATCH net-next,v5 10/12] dsa: bcm_sf2: use flow_rule infrastructure
Update this driver to use the flow_rule infrastructure, hence we can use the same code to populate hardware IR from ethtool_rx_flow and the cls_flower interfaces. Signed-off-by: Pablo Neira Ayuso --- v5: rebase on top of net-next head. drivers/net/dsa/bcm_sf2_cfp.c | 102 +++--- 1 file changed, 67 insertions(+), 35 deletions(-) diff --git a/drivers/net/dsa/bcm_sf2_cfp.c b/drivers/net/dsa/bcm_sf2_cfp.c index e14663ab6dbc..6d8059dc77b7 100644 --- a/drivers/net/dsa/bcm_sf2_cfp.c +++ b/drivers/net/dsa/bcm_sf2_cfp.c @@ -16,6 +16,7 @@ #include #include #include +#include #include "bcm_sf2.h" #include "bcm_sf2_regs.h" @@ -257,7 +258,8 @@ static int bcm_sf2_cfp_act_pol_set(struct bcm_sf2_priv *priv, } static void bcm_sf2_cfp_slice_ipv4(struct bcm_sf2_priv *priv, - struct ethtool_tcpip4_spec *v4_spec, + struct flow_dissector_key_ipv4_addrs *addrs, + struct flow_dissector_key_ports *ports, unsigned int slice_num, bool mask) { @@ -278,7 +280,7 @@ static void bcm_sf2_cfp_slice_ipv4(struct bcm_sf2_priv *priv, * UDF_n_A6 [23:8] * UDF_n_A5 [7:0] */ - reg = be16_to_cpu(v4_spec->pdst) >> 8; + reg = be16_to_cpu(ports->dst) >> 8; if (mask) offset = CORE_CFP_MASK_PORT(3); else @@ -289,9 +291,9 @@ static void bcm_sf2_cfp_slice_ipv4(struct bcm_sf2_priv *priv, * UDF_n_A4 [23:8] * UDF_n_A3 [7:0] */ - reg = (be16_to_cpu(v4_spec->pdst) & 0xff) << 24 | - (u32)be16_to_cpu(v4_spec->psrc) << 8 | - (be32_to_cpu(v4_spec->ip4dst) & 0xff00) >> 8; + reg = (be16_to_cpu(ports->dst) & 0xff) << 24 | + (u32)be16_to_cpu(ports->src) << 8 | + (be32_to_cpu(addrs->dst) & 0xff00) >> 8; if (mask) offset = CORE_CFP_MASK_PORT(2); else @@ -302,9 +304,9 @@ static void bcm_sf2_cfp_slice_ipv4(struct bcm_sf2_priv *priv, * UDF_n_A2 [23:8] * UDF_n_A1 [7:0] */ - reg = (u32)(be32_to_cpu(v4_spec->ip4dst) & 0xff) << 24 | - (u32)(be32_to_cpu(v4_spec->ip4dst) >> 16) << 8 | - (be32_to_cpu(v4_spec->ip4src) & 0xff00) >> 8; + reg = (u32)(be32_to_cpu(addrs->dst) & 0xff) << 24 | + (u32)(be32_to_cpu(addrs->dst) >> 16) << 8 | + 
(be32_to_cpu(addrs->src) & 0xff00) >> 8; if (mask) offset = CORE_CFP_MASK_PORT(1); else @@ -317,8 +319,8 @@ static void bcm_sf2_cfp_slice_ipv4(struct bcm_sf2_priv *priv, * Slice ID [3:2] * Slice valid [1:0] */ - reg = (u32)(be32_to_cpu(v4_spec->ip4src) & 0xff) << 24 | - (u32)(be32_to_cpu(v4_spec->ip4src) >> 16) << 8 | + reg = (u32)(be32_to_cpu(addrs->src) & 0xff) << 24 | + (u32)(be32_to_cpu(addrs->src) >> 16) << 8 | SLICE_NUM(slice_num) | SLICE_VALID; if (mask) offset = CORE_CFP_MASK_PORT(0); @@ -332,9 +334,13 @@ static int bcm_sf2_cfp_ipv4_rule_set(struct bcm_sf2_priv *priv, int port, unsigned int queue_num, struct ethtool_rx_flow_spec *fs) { - struct ethtool_tcpip4_spec *v4_spec, *v4_m_spec; + struct ethtool_rx_flow_spec_input input = {}; const struct cfp_udf_layout *layout; unsigned int slice_num, rule_index; + struct ethtool_rx_flow_rule *flow; + struct flow_match_ipv4_addrs ipv4; + struct flow_match_ports ports; + struct flow_match_ip ip; u8 ip_proto, ip_frag; u8 num_udf; u32 reg; @@ -343,13 +349,9 @@ static int bcm_sf2_cfp_ipv4_rule_set(struct bcm_sf2_priv *priv, int port, switch (fs->flow_type & ~FLOW_EXT) { case TCP_V4_FLOW: ip_proto = IPPROTO_TCP; - v4_spec = &fs->h_u.tcp_ip4_spec; - v4_m_spec = &fs->m_u.tcp_ip4_spec; break; case UDP_V4_FLOW: ip_proto = IPPROTO_UDP; - v4_spec = &fs->h_u.udp_ip4_spec; - v4_m_spec = &fs->m_u.udp_ip4_spec; break; default: return -EINVAL; @@ -367,11 +369,22 @@ static int bcm_sf2_cfp_ipv4_rule_set(struct bcm_sf2_priv *priv, int port, if (rule_index > bcm_sf2_cfp_rule_size(priv)) return -ENOSPC; + input.fs = fs; + flow = ethtool_rx_flow_rule_create(&input); + if (IS_ERR(flow)) + return PTR_ERR(flow); + + flow_rule_match_ipv4_addrs(flow->rule, &ipv4); + flow_rule_match_ports(flow->rule, &ports); + flow_rule_match_ip(flow->rule, &ip); + layout = &udf_tcpip4_layout; /* We only
[PATCH net-next,v5 04/12] cls_api: add translator to flow_action representation
This patch implements a new function to translate from native TC action to the new flow_action representation. Moreover, this patch also updates cls_flower to use this new function. Signed-off-by: Pablo Neira Ayuso --- v5: rebase on top of net-next head. include/net/pkt_cls.h | 3 ++ net/sched/cls_api.c| 99 ++ net/sched/cls_flower.c | 14 +++ 3 files changed, 116 insertions(+) diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h index 9ceac97e5eff..abb035f84321 100644 --- a/include/net/pkt_cls.h +++ b/include/net/pkt_cls.h @@ -622,6 +622,9 @@ tcf_match_indev(struct sk_buff *skb, int ifindex) unsigned int tcf_exts_num_actions(struct tcf_exts *exts); +int tc_setup_flow_action(struct flow_action *flow_action, +const struct tcf_exts *exts); + int tc_setup_cb_call(struct tcf_block *block, struct tcf_exts *exts, enum tc_setup_type type, void *type_data, bool err_stop); diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c index 3a4d36072fd5..00b7b639f713 100644 --- a/net/sched/cls_api.c +++ b/net/sched/cls_api.c @@ -32,6 +32,13 @@ #include #include #include +#include +#include +#include +#include +#include +#include +#include extern const struct nla_policy rtm_tca_policy[TCA_MAX + 1]; @@ -2568,6 +2575,98 @@ int tc_setup_cb_call(struct tcf_block *block, struct tcf_exts *exts, } EXPORT_SYMBOL(tc_setup_cb_call); +int tc_setup_flow_action(struct flow_action *flow_action, +const struct tcf_exts *exts) +{ + const struct tc_action *act; + int i, j, k; + + if (!exts) + return 0; + + j = 0; + tcf_exts_for_each_action(i, act, exts) { + struct flow_action_entry *entry; + + entry = &flow_action->entries[j]; + if (is_tcf_gact_ok(act)) { + entry->id = FLOW_ACTION_ACCEPT; + } else if (is_tcf_gact_shot(act)) { + entry->id = FLOW_ACTION_DROP; + } else if (is_tcf_gact_trap(act)) { + entry->id = FLOW_ACTION_TRAP; + } else if (is_tcf_gact_goto_chain(act)) { + entry->id = FLOW_ACTION_GOTO; + entry->chain_index = tcf_gact_goto_chain_index(act); + } else if 
(is_tcf_mirred_egress_redirect(act)) { + entry->id = FLOW_ACTION_REDIRECT; + entry->dev = tcf_mirred_dev(act); + } else if (is_tcf_mirred_egress_mirror(act)) { + entry->id = FLOW_ACTION_MIRRED; + entry->dev = tcf_mirred_dev(act); + } else if (is_tcf_vlan(act)) { + switch (tcf_vlan_action(act)) { + case TCA_VLAN_ACT_PUSH: + entry->id = FLOW_ACTION_VLAN_PUSH; + entry->vlan.vid = tcf_vlan_push_vid(act); + entry->vlan.proto = tcf_vlan_push_proto(act); + entry->vlan.prio = tcf_vlan_push_prio(act); + break; + case TCA_VLAN_ACT_POP: + entry->id = FLOW_ACTION_VLAN_POP; + break; + case TCA_VLAN_ACT_MODIFY: + entry->id = FLOW_ACTION_VLAN_MANGLE; + entry->vlan.vid = tcf_vlan_push_vid(act); + entry->vlan.proto = tcf_vlan_push_proto(act); + entry->vlan.prio = tcf_vlan_push_prio(act); + break; + default: + goto err_out; + } + } else if (is_tcf_tunnel_set(act)) { + entry->id = FLOW_ACTION_TUNNEL_ENCAP; + entry->tunnel = tcf_tunnel_info(act); + } else if (is_tcf_tunnel_release(act)) { + entry->id = FLOW_ACTION_TUNNEL_DECAP; + entry->tunnel = tcf_tunnel_info(act); + } else if (is_tcf_pedit(act)) { + for (k = 0; k < tcf_pedit_nkeys(act); k++) { + switch (tcf_pedit_cmd(act, k)) { + case TCA_PEDIT_KEY_EX_CMD_SET: + entry->id = FLOW_ACTION_MANGLE; + break; + case TCA_PEDIT_KEY_EX_CMD_ADD: + entry->id = FLOW_ACTION_ADD; + break; + default: + goto err_out; + } + entry->mangle.htype = tcf_pedit_htype(act, k); + entry->mangl
[PATCH net-next,v5 07/12] cls_flower: don't expose TC actions to drivers anymore
Now that drivers have been converted to use the flow action infrastructure, remove this field from the tc_cls_flower_offload structure. Signed-off-by: Pablo Neira Ayuso --- v5: rebase on top of net-next head. include/net/pkt_cls.h | 1 - net/sched/cls_flower.c | 5 - 2 files changed, 6 deletions(-) diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h index a08c06e383db..9bd724bfa860 100644 --- a/include/net/pkt_cls.h +++ b/include/net/pkt_cls.h @@ -768,7 +768,6 @@ struct tc_cls_flower_offload { unsigned long cookie; struct flow_rule *rule; struct flow_stats stats; - struct tcf_exts *exts; u32 classid; }; diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c index 806392598ae2..bb4d39689404 100644 --- a/net/sched/cls_flower.c +++ b/net/sched/cls_flower.c @@ -392,7 +392,6 @@ static int fl_hw_replace_filter(struct tcf_proto *tp, cls_flower.rule->match.dissector = &f->mask->dissector; cls_flower.rule->match.mask = &f->mask->key; cls_flower.rule->match.key = &f->mkey; - cls_flower.exts = &f->exts; cls_flower.classid = f->res.classid; err = tc_setup_flow_action(&cls_flower.rule->action, &f->exts); @@ -427,7 +426,6 @@ static void fl_hw_update_stats(struct tcf_proto *tp, struct cls_fl_filter *f) tc_cls_common_offload_init(&cls_flower.common, tp, f->flags, NULL); cls_flower.command = TC_CLSFLOWER_STATS; cls_flower.cookie = (unsigned long) f; - cls_flower.exts = &f->exts; cls_flower.classid = f->res.classid; tc_setup_cb_call(block, &f->exts, TC_SETUP_CLSFLOWER, @@ -1490,7 +1488,6 @@ static int fl_reoffload(struct tcf_proto *tp, bool add, tc_setup_cb_t *cb, cls_flower.rule->match.dissector = &mask->dissector; cls_flower.rule->match.mask = &mask->key; cls_flower.rule->match.key = &f->mkey; - cls_flower.exts = &f->exts; err = tc_setup_flow_action(&cls_flower.rule->action, &f->exts); @@ -1523,7 +1520,6 @@ static int fl_hw_create_tmplt(struct tcf_chain *chain, { struct tc_cls_flower_offload cls_flower = {}; struct tcf_block *block = chain->block; - struct tcf_exts 
dummy_exts = { 0, }; cls_flower.rule = flow_rule_alloc(0); if (!cls_flower.rule) @@ -1535,7 +1531,6 @@ static int fl_hw_create_tmplt(struct tcf_chain *chain, cls_flower.rule->match.dissector = &tmplt->dissector; cls_flower.rule->match.mask = &tmplt->mask; cls_flower.rule->match.key = &tmplt->dummy_key; - cls_flower.exts = &dummy_exts; /* We don't care if driver (any of them) fails to handle this * call. It serves just as a hint for it. -- 2.11.0
[PATCH net-next,v5 05/12] flow_offload: add statistics retrieval infrastructure and use it
This patch provides the flow_stats structure that acts as a container in tc_cls_flower_offload, which we can then use to restore the statistics on the existing TC actions. Hence, tcf_exts_stats_update() is not used from drivers anymore. Signed-off-by: Pablo Neira Ayuso --- v5: Fix bytes and packets parameter swap in flow_stats_update() call, reported by Venkat Duvvuru. drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c | 4 ++-- drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c | 6 +++--- drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 2 +- drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c | 2 +- drivers/net/ethernet/netronome/nfp/flower/offload.c | 5 ++--- include/net/flow_offload.h| 14 ++ include/net/pkt_cls.h | 1 + net/sched/cls_flower.c| 4 8 files changed, 28 insertions(+), 10 deletions(-) diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c index b82143d6cdde..09cd75f54eba 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c @@ -1366,8 +1366,8 @@ static int bnxt_tc_get_flow_stats(struct bnxt *bp, lastused = flow->lastused; spin_unlock(&flow->stats_lock); - tcf_exts_stats_update(tc_flow_cmd->exts, stats.bytes, stats.packets, - lastused); + flow_stats_update(&tc_flow_cmd->stats, stats.bytes, stats.packets, + lastused); return 0; } diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c index 39c5af5dad3d..8a2d66ee1d7b 100644 --- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c +++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c @@ -807,9 +807,9 @@ int cxgb4_tc_flower_stats(struct net_device *dev, if (ofld_stats->packet_count != packets) { if (ofld_stats->prev_packet_count != packets) ofld_stats->last_used = jiffies; - tcf_exts_stats_update(cls->exts, bytes - ofld_stats->byte_count, - packets - ofld_stats->packet_count, - ofld_stats->last_used); + flow_stats_update(&cls->stats, bytes -
ofld_stats->byte_count, + packets - ofld_stats->packet_count, + ofld_stats->last_used); ofld_stats->packet_count = packets; ofld_stats->byte_count = bytes; diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c index 1daaab91280f..2e1eaf6f5139 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c @@ -3227,7 +3227,7 @@ int mlx5e_stats_flower(struct mlx5e_priv *priv, mlx5_fc_query_cached(counter, &bytes, &packets, &lastuse); - tcf_exts_stats_update(f->exts, bytes, packets, lastuse); + flow_stats_update(&f->stats, bytes, packets, lastuse); return 0; } diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c index e6c4c672b1ca..60900e53243b 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c @@ -460,7 +460,7 @@ int mlxsw_sp_flower_stats(struct mlxsw_sp *mlxsw_sp, if (err) goto err_rule_get_stats; - tcf_exts_stats_update(f->exts, bytes, packets, lastuse); + flow_stats_update(&f->stats, bytes, packets, lastuse); mlxsw_sp_acl_ruleset_put(mlxsw_sp, ruleset); return 0; diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c b/drivers/net/ethernet/netronome/nfp/flower/offload.c index 708331234908..524b9ae1a639 100644 --- a/drivers/net/ethernet/netronome/nfp/flower/offload.c +++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c @@ -532,9 +532,8 @@ nfp_flower_get_stats(struct nfp_app *app, struct net_device *netdev, ctx_id = be32_to_cpu(nfp_flow->meta.host_ctx_id); spin_lock_bh(&priv->stats_lock); - tcf_exts_stats_update(flow->exts, priv->stats[ctx_id].bytes, - priv->stats[ctx_id].pkts, - priv->stats[ctx_id].used); + flow_stats_update(&flow->stats, priv->stats[ctx_id].bytes, + priv->stats[ctx_id].pkts, priv->stats[ctx_id].used); priv->stats[ctx_id].pkts = 0; priv->stats[ctx_id].bytes = 0; diff --git 
a/include/net/flow_offload.h b/include/net/flow_offload.h index dabc819b6cc9..f9ce39992dbd 100644 --- a/include/net/flow_offload.h +++ b/include/net/flow_offload.h @@ -179,4 +179,18 @@ static inline bool flow_rule_match_key(const struct flow_rule *rule, return dissector_uses_key(rule->ma
[PATCH net-next,v5 12/12] qede: use ethtool_rx_flow_rule() to remove duplicated parser code
The qede driver supports both ethtool_rx_flow_spec and flower; the two codebases look very similar. This patch uses the ethtool_rx_flow_rule() infrastructure to remove the duplicated ethtool_rx_flow_spec parser and consolidate ACL offload support around the flow_rule infrastructure. Furthermore, more code can be consolidated by merging qede_add_cls_rule() and qede_add_tc_flower_fltr(), since these two functions also look very similar. This driver currently provides simple ACL support, such as 5-tuple matching, drop policy and queue to CPU. Drivers that support more features can benefit from this infrastructure to shed even more redundant code. Signed-off-by: Pablo Neira Ayuso --- v5: rebase on top of net-next head. drivers/net/ethernet/qlogic/qede/qede_filter.c | 279 +++-- 1 file changed, 76 insertions(+), 203 deletions(-) diff --git a/drivers/net/ethernet/qlogic/qede/qede_filter.c b/drivers/net/ethernet/qlogic/qede/qede_filter.c index ed77950f6cf9..37c0651184ce 100644 --- a/drivers/net/ethernet/qlogic/qede/qede_filter.c +++ b/drivers/net/ethernet/qlogic/qede/qede_filter.c @@ -1665,132 +1665,6 @@ static int qede_set_v6_tuple_to_profile(struct qede_dev *edev, return 0; } -static int qede_flow_spec_to_tuple_ipv4_common(struct qede_dev *edev, - struct qede_arfs_tuple *t, - struct ethtool_rx_flow_spec *fs) -{ - if ((fs->h_u.tcp_ip4_spec.ip4src & -fs->m_u.tcp_ip4_spec.ip4src) != fs->h_u.tcp_ip4_spec.ip4src) { - DP_INFO(edev, "Don't support IP-masks\n"); - return -EOPNOTSUPP; - } - - if ((fs->h_u.tcp_ip4_spec.ip4dst & -fs->m_u.tcp_ip4_spec.ip4dst) != fs->h_u.tcp_ip4_spec.ip4dst) { - DP_INFO(edev, "Don't support IP-masks\n"); - return -EOPNOTSUPP; - } - - if ((fs->h_u.tcp_ip4_spec.psrc & -fs->m_u.tcp_ip4_spec.psrc) != fs->h_u.tcp_ip4_spec.psrc) { - DP_INFO(edev, "Don't support port-masks\n"); - return -EOPNOTSUPP; - } - - if ((fs->h_u.tcp_ip4_spec.pdst & -fs->m_u.tcp_ip4_spec.pdst) != fs->h_u.tcp_ip4_spec.pdst) { - DP_INFO(edev, "Don't support port-masks\n"); - return
-EOPNOTSUPP; - } - - if (fs->h_u.tcp_ip4_spec.tos) { - DP_INFO(edev, "Don't support tos\n"); - return -EOPNOTSUPP; - } - - t->eth_proto = htons(ETH_P_IP); - t->src_ipv4 = fs->h_u.tcp_ip4_spec.ip4src; - t->dst_ipv4 = fs->h_u.tcp_ip4_spec.ip4dst; - t->src_port = fs->h_u.tcp_ip4_spec.psrc; - t->dst_port = fs->h_u.tcp_ip4_spec.pdst; - - return qede_set_v4_tuple_to_profile(edev, t); -} - -static int qede_flow_spec_to_tuple_tcpv4(struct qede_dev *edev, -struct qede_arfs_tuple *t, -struct ethtool_rx_flow_spec *fs) -{ - t->ip_proto = IPPROTO_TCP; - - if (qede_flow_spec_to_tuple_ipv4_common(edev, t, fs)) - return -EINVAL; - - return 0; -} - -static int qede_flow_spec_to_tuple_udpv4(struct qede_dev *edev, -struct qede_arfs_tuple *t, -struct ethtool_rx_flow_spec *fs) -{ - t->ip_proto = IPPROTO_UDP; - - if (qede_flow_spec_to_tuple_ipv4_common(edev, t, fs)) - return -EINVAL; - - return 0; -} - -static int qede_flow_spec_to_tuple_ipv6_common(struct qede_dev *edev, - struct qede_arfs_tuple *t, - struct ethtool_rx_flow_spec *fs) -{ - struct in6_addr zero_addr; - - memset(&zero_addr, 0, sizeof(zero_addr)); - - if ((fs->h_u.tcp_ip6_spec.psrc & -fs->m_u.tcp_ip6_spec.psrc) != fs->h_u.tcp_ip6_spec.psrc) { - DP_INFO(edev, "Don't support port-masks\n"); - return -EOPNOTSUPP; - } - - if ((fs->h_u.tcp_ip6_spec.pdst & -fs->m_u.tcp_ip6_spec.pdst) != fs->h_u.tcp_ip6_spec.pdst) { - DP_INFO(edev, "Don't support port-masks\n"); - return -EOPNOTSUPP; - } - - if (fs->h_u.tcp_ip6_spec.tclass) { - DP_INFO(edev, "Don't support tclass\n"); - return -EOPNOTSUPP; - } - - t->eth_proto = htons(ETH_P_IPV6); - memcpy(&t->src_ipv6, &fs->h_u.tcp_ip6_spec.ip6src, - sizeof(struct in6_addr)); - memcpy(&t->dst_ipv6, &fs->h_u.tcp_ip6_spec.ip6dst, - sizeof(struct in6_addr)); - t->src_port = fs->h_u.tcp_ip6_spec.psrc; - t->dst_port = fs->h_u.tcp_ip6_spec.pdst; - - return qede_set_v6_tuple_to_profile(edev, t, &zero_addr); -} - -static int qede_flow_spec_to_tuple_tcpv6(struct qede_dev *edev, -struct qede_arfs_tuple 
*t, -
[PATCH net-next,v5 09/12] ethtool: add ethtool_rx_flow_spec to flow_rule structure translator
This patch adds a function to translate the ethtool_rx_flow_spec structure to the flow_rule representation. This allows us to reuse code from the driver side given that both flower and ethtool_rx_flow interfaces use the same representation. This patch also includes support for the flow type flags FLOW_EXT, FLOW_MAC_EXT and FLOW_RSS. The ethtool_rx_flow_spec_input wrapper structure is used to convey the ethtool_rx_flow_spec structure together with the rss_context field, which lives outside of the ethtool_rx_flow_spec structure. Signed-off-by: Pablo Neira Ayuso --- v5: support for FLOW_RSS flowtype flag and set rss context in queue action. Suggested by Michal Kubecek. include/linux/ethtool.h | 15 +++ net/core/ethtool.c | 240 2 files changed, 255 insertions(+) diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h index afd9596ce636..19a8de5326fb 100644 --- a/include/linux/ethtool.h +++ b/include/linux/ethtool.h @@ -400,4 +400,19 @@ struct ethtool_ops { void(*get_ethtool_phy_stats)(struct net_device *, struct ethtool_stats *, u64 *); }; + +struct ethtool_rx_flow_rule { + struct flow_rule*rule; + unsigned long priv[0]; +}; + +struct ethtool_rx_flow_spec_input { + const struct ethtool_rx_flow_spec *fs; + u32 rss_ctx; +}; + +struct ethtool_rx_flow_rule * +ethtool_rx_flow_rule_create(const struct ethtool_rx_flow_spec_input *input); +void ethtool_rx_flow_rule_destroy(struct ethtool_rx_flow_rule *rule); + #endif /* _LINUX_ETHTOOL_H */ diff --git a/net/core/ethtool.c b/net/core/ethtool.c index d05402868575..2711d0737d3f 100644 --- a/net/core/ethtool.c +++ b/net/core/ethtool.c @@ -28,6 +28,7 @@ #include #include #include +#include /* * Some useful ethtool_ops methods that're device independent.
@@ -2808,3 +2809,242 @@ int dev_ethtool(struct net *net, struct ifreq *ifr) return rc; } + +struct ethtool_rx_flow_key { + struct flow_dissector_key_basic basic; + union { + struct flow_dissector_key_ipv4_addrsipv4; + struct flow_dissector_key_ipv6_addrsipv6; + }; + struct flow_dissector_key_ports tp; + struct flow_dissector_key_ipip; + struct flow_dissector_key_vlan vlan; + struct flow_dissector_key_eth_addrs eth_addrs; +} __aligned(BITS_PER_LONG / 8); /* Ensure that we can do comparisons as longs. */ + +struct ethtool_rx_flow_match { + struct flow_dissector dissector; + struct ethtool_rx_flow_key key; + struct ethtool_rx_flow_key mask; +}; + +struct ethtool_rx_flow_rule * +ethtool_rx_flow_rule_create(const struct ethtool_rx_flow_spec_input *input) +{ + const struct ethtool_rx_flow_spec *fs = input->fs; + static struct in6_addr zero_addr = {}; + struct ethtool_rx_flow_match *match; + struct ethtool_rx_flow_rule *flow; + struct flow_action_entry *act; + + flow = kzalloc(sizeof(struct ethtool_rx_flow_rule) + + sizeof(struct ethtool_rx_flow_match), GFP_KERNEL); + if (!flow) + return ERR_PTR(-ENOMEM); + + /* ethtool_rx supports only one single action per rule. 
*/ + flow->rule = flow_rule_alloc(1); + if (!flow->rule) { + kfree(flow); + return ERR_PTR(-ENOMEM); + } + + match = (struct ethtool_rx_flow_match *)flow->priv; + flow->rule->match.dissector = &match->dissector; + flow->rule->match.mask = &match->mask; + flow->rule->match.key = &match->key; + + match->mask.basic.n_proto = htons(0x); + + switch (fs->flow_type & ~(FLOW_EXT | FLOW_MAC_EXT | FLOW_RSS)) { + case TCP_V4_FLOW: + case UDP_V4_FLOW: { + const struct ethtool_tcpip4_spec *v4_spec, *v4_m_spec; + + match->key.basic.n_proto = htons(ETH_P_IP); + + v4_spec = &fs->h_u.tcp_ip4_spec; + v4_m_spec = &fs->m_u.tcp_ip4_spec; + + if (v4_m_spec->ip4src) { + match->key.ipv4.src = v4_spec->ip4src; + match->mask.ipv4.src = v4_m_spec->ip4src; + } + if (v4_m_spec->ip4dst) { + match->key.ipv4.dst = v4_spec->ip4dst; + match->mask.ipv4.dst = v4_m_spec->ip4dst; + } + if (v4_m_spec->ip4src || + v4_m_spec->ip4dst) { + match->dissector.used_keys |= + BIT(FLOW_DISSECTOR_KEY_IPV4_ADDRS); + match->dissector.offset[FLOW_DISSECTOR_KEY_IPV4_ADDRS] = + offsetof(struct ethtool_rx_flow_key, ipv4); + } + if (v4_m_spec->psrc) { +
[PATCH net-next,v5 06/12] drivers: net: use flow action infrastructure
This patch updates drivers to use the new flow action infrastructure. Signed-off-by: Pablo Neira Ayuso --- v5: rebase on top of net-next head. drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c | 74 +++--- .../net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c | 250 +-- drivers/net/ethernet/mellanox/mlx5/core/en_tc.c| 266 ++--- drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c | 2 +- .../net/ethernet/mellanox/mlxsw/spectrum_flower.c | 54 +++-- drivers/net/ethernet/netronome/nfp/flower/action.c | 187 --- drivers/net/ethernet/qlogic/qede/qede_filter.c | 12 +- 7 files changed, 418 insertions(+), 427 deletions(-) diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c index 09cd75f54eba..b7bd27edd80e 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c @@ -61,9 +61,9 @@ static u16 bnxt_flow_get_dst_fid(struct bnxt *pf_bp, struct net_device *dev) static int bnxt_tc_parse_redir(struct bnxt *bp, struct bnxt_tc_actions *actions, - const struct tc_action *tc_act) + const struct flow_action_entry *act) { - struct net_device *dev = tcf_mirred_dev(tc_act); + struct net_device *dev = act->dev; if (!dev) { netdev_info(bp->dev, "no dev in mirred action"); @@ -77,16 +77,16 @@ static int bnxt_tc_parse_redir(struct bnxt *bp, static int bnxt_tc_parse_vlan(struct bnxt *bp, struct bnxt_tc_actions *actions, - const struct tc_action *tc_act) + const struct flow_action_entry *act) { - switch (tcf_vlan_action(tc_act)) { - case TCA_VLAN_ACT_POP: + switch (act->id) { + case FLOW_ACTION_VLAN_POP: actions->flags |= BNXT_TC_ACTION_FLAG_POP_VLAN; break; - case TCA_VLAN_ACT_PUSH: + case FLOW_ACTION_VLAN_PUSH: actions->flags |= BNXT_TC_ACTION_FLAG_PUSH_VLAN; - actions->push_vlan_tci = htons(tcf_vlan_push_vid(tc_act)); - actions->push_vlan_tpid = tcf_vlan_push_proto(tc_act); + actions->push_vlan_tci = htons(act->vlan.vid); + actions->push_vlan_tpid = act->vlan.proto; break; default: return -EOPNOTSUPP; @@ 
-96,10 +96,10 @@ static int bnxt_tc_parse_vlan(struct bnxt *bp, static int bnxt_tc_parse_tunnel_set(struct bnxt *bp, struct bnxt_tc_actions *actions, - const struct tc_action *tc_act) + const struct flow_action_entry *act) { - struct ip_tunnel_info *tun_info = tcf_tunnel_info(tc_act); - struct ip_tunnel_key *tun_key = &tun_info->key; + const struct ip_tunnel_info *tun_info = act->tunnel; + const struct ip_tunnel_key *tun_key = &tun_info->key; if (ip_tunnel_info_af(tun_info) != AF_INET) { netdev_info(bp->dev, "only IPv4 tunnel-encap is supported"); @@ -113,51 +113,43 @@ static int bnxt_tc_parse_tunnel_set(struct bnxt *bp, static int bnxt_tc_parse_actions(struct bnxt *bp, struct bnxt_tc_actions *actions, -struct tcf_exts *tc_exts) +struct flow_action *flow_action) { - const struct tc_action *tc_act; + struct flow_action_entry *act; int i, rc; - if (!tcf_exts_has_actions(tc_exts)) { + if (!flow_action_has_entries(flow_action)) { netdev_info(bp->dev, "no actions"); return -EINVAL; } - tcf_exts_for_each_action(i, tc_act, tc_exts) { - /* Drop action */ - if (is_tcf_gact_shot(tc_act)) { + flow_action_for_each(i, act, flow_action) { + switch (act->id) { + case FLOW_ACTION_DROP: actions->flags |= BNXT_TC_ACTION_FLAG_DROP; return 0; /* don't bother with other actions */ - } - - /* Redirect action */ - if (is_tcf_mirred_egress_redirect(tc_act)) { - rc = bnxt_tc_parse_redir(bp, actions, tc_act); + case FLOW_ACTION_REDIRECT: + rc = bnxt_tc_parse_redir(bp, actions, act); if (rc) return rc; - continue; - } - - /* Push/pop VLAN */ - if (is_tcf_vlan(tc_act)) { - rc = bnxt_tc_parse_vlan(bp, actions, tc_act); + break; + case FLOW_ACTION_VLAN_POP: + case FLOW_ACTION_VLAN_PUSH: + case FLOW_ACTION_VLAN_MANGLE: + rc = bnxt_tc_parse_v
[PATCH net-next,v5 08/12] flow_offload: add wake-up-on-lan and queue to flow_action
These actions need to be added to support the ethtool_rx_flow interface. The queue action includes a field to specify the RSS context, that is set via FLOW_RSS flow type flag and the rss_context field in struct ethtool_rxnfc, plus the corresponding queue index. FLOW_RSS implies that rss_context is non-zero, therefore, queue.ctx == 0 means that FLOW_RSS was not set. Signed-off-by: Pablo Neira Ayuso --- v5: add queue structure and context field, per Michal Kubecek. include/net/flow_offload.h | 6 ++ 1 file changed, 6 insertions(+) diff --git a/include/net/flow_offload.h b/include/net/flow_offload.h index f9ce39992dbd..6489fb9eb394 100644 --- a/include/net/flow_offload.h +++ b/include/net/flow_offload.h @@ -116,6 +116,8 @@ enum flow_action_id { FLOW_ACTION_ADD, FLOW_ACTION_CSUM, FLOW_ACTION_MARK, + FLOW_ACTION_WAKE, + FLOW_ACTION_QUEUE, }; /* This is mirroring enum pedit_header_type definition for easy mapping between @@ -150,6 +152,10 @@ struct flow_action_entry { const struct ip_tunnel_info *tunnel;/* FLOW_ACTION_TUNNEL_ENCAP */ u32 csum_flags; /* FLOW_ACTION_CSUM */ u32 mark; /* FLOW_ACTION_MARK */ + struct {/* FLOW_ACTION_QUEUE */ + u32 ctx; + u32 index; + } queue; }; }; -- 2.11.0
[PATCH net-next,v5 00/12] add flow_rule infrastructure
Hi, This is another iteration of the in-kernel intermediate representation (IR) that allows expressing ACL hardware offloads using one unified representation from the driver side for the ethtool and the tc frontends [1] [2] [3]. In the words of Michal Kubecek: "... the ethtool interface can apply four types of action to matching packets: - put into a specific queue - discard - distribute across the queues according to an RSS context - use the rule as a wake-on-lan filter" This new round now supports these four types, which can be mapped to the flow_rule representation. Changes from previous version: * Michal Kubecek: - Add support for FLOW_RSS flag to the ethtool_rx_flow_spec to flow_rule translator. * Venkat Duvvuru: - Fix accidental swapping of flow_stats_update() bytes and packets parameters. * kbuild robot: - Fix double kfree in error path from cls_flower, via Julia Lawall. - Fix enum type mismatch in nfp driver reported by sparse checks. Please apply, thanks. Pablo Neira Ayuso (12): flow_offload: add flow_rule and flow_match structures and use them net/mlx5e: support for two independent packet edit actions flow_offload: add flow action infrastructure cls_api: add translator to flow_action representation flow_offload: add statistics retrieval infrastructure and use it drivers: net: use flow action infrastructure cls_flower: don't expose TC actions to drivers anymore flow_offload: add wake-up-on-lan and queue to flow_action ethtool: add ethtool_rx_flow_spec to flow_rule structure translator dsa: bcm_sf2: use flow_rule infrastructure qede: place ethtool_rx_flow_spec code after TC flower codebase qede: use ethtool_rx_flow_rule() to remove duplicated parser code drivers/net/dsa/bcm_sf2_cfp.c | 102 ++- drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c | 252 +++ .../net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c | 450 ++--- drivers/net/ethernet/intel/i40e/i40e_main.c| 178 ++--- drivers/net/ethernet/intel/iavf/iavf_main.c| 195 +++---
drivers/net/ethernet/intel/igb/igb_main.c | 64 +- drivers/net/ethernet/mellanox/mlx5/core/en_tc.c| 743 ++--- drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c | 2 +- .../net/ethernet/mellanox/mlxsw/spectrum_flower.c | 258 --- drivers/net/ethernet/netronome/nfp/flower/action.c | 198 +++--- drivers/net/ethernet/netronome/nfp/flower/match.c | 417 ++-- .../net/ethernet/netronome/nfp/flower/offload.c| 150 ++--- drivers/net/ethernet/qlogic/qede/qede_filter.c | 572 ++-- include/linux/ethtool.h| 15 + include/net/flow_offload.h | 202 ++ include/net/pkt_cls.h | 18 +- net/core/Makefile | 2 +- net/core/ethtool.c | 240 +++ net/core/flow_offload.c| 153 + net/sched/cls_api.c| 116 net/sched/cls_flower.c | 71 +- 21 files changed, 2403 insertions(+), 1995 deletions(-) create mode 100644 include/net/flow_offload.h create mode 100644 net/core/flow_offload.c -- 2.11.0
[PATCH net-next,v5 02/12] net/mlx5e: support for two independent packet edit actions
This patch adds a pedit_headers_action structure to store the result of
parsing tc pedit actions. Then, it calls alloc_tc_pedit_action() to
populate the mlx5e hardware intermediate representation once all actions
have been parsed.

This patch comes in preparation for the new flow_action infrastructure,
where each packet mangling comes as a separate action, i.e. not packed as
in tc pedit.

Signed-off-by: Pablo Neira Ayuso
---
v5: rebase on top of net-next head.

 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 81 ++---
 1 file changed, 59 insertions(+), 22 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index d2e6c6578b9c..1daaab91280f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -1751,6 +1751,12 @@ struct pedit_headers {
 	struct udphdr udp;
 };

+struct pedit_headers_action {
+	struct pedit_headers	vals;
+	struct pedit_headers	masks;
+	u32			pedits;
+};
+
 static int pedit_header_offsets[] = {
 	[TCA_PEDIT_KEY_EX_HDR_TYPE_ETH] = offsetof(struct pedit_headers, eth),
 	[TCA_PEDIT_KEY_EX_HDR_TYPE_IP4] = offsetof(struct pedit_headers, ip4),
@@ -1762,16 +1768,15 @@ static int pedit_header_offsets[] = {

 #define pedit_header(_ph, _htype) ((void *)(_ph) + pedit_header_offsets[_htype])

 static int set_pedit_val(u8 hdr_type, u32 mask, u32 val, u32 offset,
-			 struct pedit_headers *masks,
-			 struct pedit_headers *vals)
+			 struct pedit_headers_action *hdrs)
 {
 	u32 *curr_pmask, *curr_pval;

 	if (hdr_type >= __PEDIT_HDR_TYPE_MAX)
 		goto out_err;

-	curr_pmask = (u32 *)(pedit_header(masks, hdr_type) + offset);
-	curr_pval = (u32 *)(pedit_header(vals, hdr_type) + offset);
+	curr_pmask = (u32 *)(pedit_header(&hdrs->masks, hdr_type) + offset);
+	curr_pval = (u32 *)(pedit_header(&hdrs->vals, hdr_type) + offset);

 	if (*curr_pmask & mask)  /* disallow acting twice on the same location */
 		goto out_err;
@@ -1827,8 +1832,7 @@ static struct mlx5_fields fields[] = {
  * max from the SW pedit action. On success, it says how many HW actions were
  * actually parsed.
  */
-static int offload_pedit_fields(struct pedit_headers *masks,
-				struct pedit_headers *vals,
+static int offload_pedit_fields(struct pedit_headers_action *hdrs,
 				struct mlx5e_tc_flow_parse_attr *parse_attr,
 				struct netlink_ext_ack *extack)
 {
@@ -1843,10 +1847,10 @@ static int offload_pedit_fields(struct pedit_headers *masks,
 	__be16 mask_be16;
 	void *action;

-	set_masks = &masks[TCA_PEDIT_KEY_EX_CMD_SET];
-	add_masks = &masks[TCA_PEDIT_KEY_EX_CMD_ADD];
-	set_vals = &vals[TCA_PEDIT_KEY_EX_CMD_SET];
-	add_vals = &vals[TCA_PEDIT_KEY_EX_CMD_ADD];
+	set_masks = &hdrs[TCA_PEDIT_KEY_EX_CMD_SET].masks;
+	add_masks = &hdrs[TCA_PEDIT_KEY_EX_CMD_ADD].masks;
+	set_vals = &hdrs[TCA_PEDIT_KEY_EX_CMD_SET].vals;
+	add_vals = &hdrs[TCA_PEDIT_KEY_EX_CMD_ADD].vals;

 	action_size = MLX5_UN_SZ_BYTES(set_action_in_add_action_in_auto);
 	action = parse_attr->mod_hdr_actions;
@@ -1942,12 +1946,14 @@ static int offload_pedit_fields(struct pedit_headers *masks,
 }

 static int alloc_mod_hdr_actions(struct mlx5e_priv *priv,
-				 const struct tc_action *a, int namespace,
+				 struct pedit_headers_action *hdrs,
+				 int namespace,
 				 struct mlx5e_tc_flow_parse_attr *parse_attr)
 {
 	int nkeys, action_size, max_actions;

-	nkeys = tcf_pedit_nkeys(a);
+	nkeys = hdrs[TCA_PEDIT_KEY_EX_CMD_SET].pedits +
+		hdrs[TCA_PEDIT_KEY_EX_CMD_ADD].pedits;
 	action_size = MLX5_UN_SZ_BYTES(set_action_in_add_action_in_auto);

 	if (namespace == MLX5_FLOW_NAMESPACE_FDB) /* FDB offloading */
@@ -1971,18 +1977,15 @@ static const struct pedit_headers zero_masks = {};
 static int parse_tc_pedit_action(struct mlx5e_priv *priv,
 				 const struct tc_action *a, int namespace,
 				 struct mlx5e_tc_flow_parse_attr *parse_attr,
+				 struct pedit_headers_action *hdrs,
 				 struct netlink_ext_ack *extack)
 {
-	struct pedit_headers masks[__PEDIT_CMD_MAX], vals[__PEDIT_CMD_MAX], *cmd_masks;
 	int nkeys, i, err = -EOPNOTSUPP;
 	u32 mask, val, offset;
 	u8 cmd, htype;

 	nkeys = tcf_pedit_nkeys(a);

-	memset(masks, 0, sizeof(struct pedit_headers) * __PEDIT_CMD_MAX);
-	memset(vals, 0, sizeof(struct pedit_headers) * __PEDIT_CMD_MAX);
-
 	for (i = 0; i < nkeys; i++) {
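For readers following the refactor, the core invariant that set_pedit_val() keeps across this change — two pedit keys may never claim the same bits of the same header word — can be modelled standalone. This is a simplified userspace sketch with hypothetical `model_*` names and a fixed four-word header, not the mlx5e code itself:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Userspace model of the mask-collision check in set_pedit_val():
 * each 32-bit word of the header carries a cumulative mask of bits
 * already claimed by earlier pedit keys; a new key must not overlap. */
struct pedit_model {
	uint32_t masks[4];	/* cumulative per-word masks */
	uint32_t vals[4];	/* accumulated values, mask-filtered */
};

static int model_set_pedit_val(struct pedit_model *m, unsigned int word,
			       uint32_t mask, uint32_t val)
{
	if (word >= 4)
		return -1;	/* out of range, like __PEDIT_HDR_TYPE_MAX */
	if (m->masks[word] & mask)
		return -1;	/* disallow acting twice on the same location */
	m->masks[word] |= mask;
	m->vals[word] |= val & mask;
	return 0;
}
```

An overlapping second write on the same word fails, while a disjoint one succeeds — the same behaviour the kernel function enforces via `goto out_err`.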
[PATCH net-next,v5 03/12] flow_offload: add flow action infrastructure
This new infrastructure defines the nic actions that you can perform from
existing network drivers. This infrastructure allows us to avoid a direct
dependency with the native software TC action representation.

Signed-off-by: Pablo Neira Ayuso
---
v5: rebase on top of net-next head.

 include/net/flow_offload.h | 69 +-
 include/net/pkt_cls.h      |  2 ++
 net/core/flow_offload.c    | 14 --
 net/sched/cls_api.c        | 17
 net/sched/cls_flower.c     |  7 +++--
 5 files changed, 103 insertions(+), 6 deletions(-)

diff --git a/include/net/flow_offload.h b/include/net/flow_offload.h
index 461c66595763..dabc819b6cc9 100644
--- a/include/net/flow_offload.h
+++ b/include/net/flow_offload.h
@@ -100,11 +100,78 @@ void flow_rule_match_enc_keyid(const struct flow_rule *rule,
 void flow_rule_match_enc_opts(const struct flow_rule *rule,
 			      struct flow_match_enc_opts *out);

+enum flow_action_id {
+	FLOW_ACTION_ACCEPT = 0,
+	FLOW_ACTION_DROP,
+	FLOW_ACTION_TRAP,
+	FLOW_ACTION_GOTO,
+	FLOW_ACTION_REDIRECT,
+	FLOW_ACTION_MIRRED,
+	FLOW_ACTION_VLAN_PUSH,
+	FLOW_ACTION_VLAN_POP,
+	FLOW_ACTION_VLAN_MANGLE,
+	FLOW_ACTION_TUNNEL_ENCAP,
+	FLOW_ACTION_TUNNEL_DECAP,
+	FLOW_ACTION_MANGLE,
+	FLOW_ACTION_ADD,
+	FLOW_ACTION_CSUM,
+	FLOW_ACTION_MARK,
+};
+
+/* This is mirroring enum pedit_header_type definition for easy mapping between
+ * tc pedit action. Legacy TCA_PEDIT_KEY_EX_HDR_TYPE_NETWORK is mapped to
+ * FLOW_ACT_MANGLE_UNSPEC, which is supported by no driver.
+ */
+enum flow_action_mangle_base {
+	FLOW_ACT_MANGLE_UNSPEC = 0,
+	FLOW_ACT_MANGLE_HDR_TYPE_ETH,
+	FLOW_ACT_MANGLE_HDR_TYPE_IP4,
+	FLOW_ACT_MANGLE_HDR_TYPE_IP6,
+	FLOW_ACT_MANGLE_HDR_TYPE_TCP,
+	FLOW_ACT_MANGLE_HDR_TYPE_UDP,
+};
+
+struct flow_action_entry {
+	enum flow_action_id	id;
+	union {
+		u32		chain_index;	/* FLOW_ACTION_GOTO */
+		struct net_device *dev;		/* FLOW_ACTION_REDIRECT */
+		struct {			/* FLOW_ACTION_VLAN */
+			u16	vid;
+			__be16	proto;
+			u8	prio;
+		} vlan;
+		struct {			/* FLOW_ACTION_PACKET_EDIT */
+			enum flow_action_mangle_base htype;
+			u32	offset;
+			u32	mask;
+			u32	val;
+		} mangle;
+		const struct ip_tunnel_info *tunnel;	/* FLOW_ACTION_TUNNEL_ENCAP */
+		u32		csum_flags;	/* FLOW_ACTION_CSUM */
+		u32		mark;		/* FLOW_ACTION_MARK */
+	};
+};
+
+struct flow_action {
+	unsigned int		num_entries;
+	struct flow_action_entry entries[0];
+};
+
+static inline bool flow_action_has_entries(const struct flow_action *action)
+{
+	return action->num_entries;
+}
+
+#define flow_action_for_each(__i, __act, __actions)			\
+	for (__i = 0, __act = &(__actions)->entries[0];			\
+	     __i < (__actions)->num_entries;				\
+	     __act = &(__actions)->entries[__i++])
+
 struct flow_rule {
 	struct flow_match	match;
+	struct flow_action	action;
 };

-struct flow_rule *flow_rule_alloc(void);
+struct flow_rule *flow_rule_alloc(unsigned int num_actions);

 static inline bool flow_rule_match_key(const struct flow_rule *rule,
 				       enum flow_dissector_key_id key)
diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index 359876ee32be..9ceac97e5eff 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -620,6 +620,8 @@ tcf_match_indev(struct sk_buff *skb, int ifindex)
 }
 #endif /* CONFIG_NET_CLS_IND */

+unsigned int tcf_exts_num_actions(struct tcf_exts *exts);
+
 int tc_setup_cb_call(struct tcf_block *block, struct tcf_exts *exts,
 		     enum tc_setup_type type, void *type_data, bool err_stop);

diff --git a/net/core/flow_offload.c b/net/core/flow_offload.c
index 2fbf6903d2f6..c3a00eac4804 100644
--- a/net/core/flow_offload.c
+++ b/net/core/flow_offload.c
@@ -3,9 +3,19 @@
 #include
 #include

-struct flow_rule *flow_rule_alloc(void)
+struct flow_rule *flow_rule_alloc(unsigned int num_actions)
 {
-	return kzalloc(sizeof(struct flow_rule), GFP_KERNEL);
+	struct flow_rule *rule;
+
+	rule = kzalloc(sizeof(struct flow_rule) +
+		       sizeof(struct flow_action_entry) * num_actions,
+		       GFP_KERNEL);
+	if (!rule)
+		return NULL;
+
+	rule->action.num_entries = num_actions;
+
+	return rule;
 }
 EXPORT_SYMBOL(flo
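As a standalone illustration of the container this patch introduces — a counted array of action entries allocated in one block, plus an iteration macro — here is a compilable userspace model. All `model_*` names are hypothetical; a C99 flexible array member stands in for the kernel's `entries[0]`, and the model's loop step uses a pre-increment so each entry is visited exactly once:

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace model of struct flow_action + flow_action_for_each():
 * the count and the entry array live in one allocation, and the
 * iteration macro walks the entries in order. */
enum model_action_id { MODEL_ACTION_ACCEPT, MODEL_ACTION_DROP, MODEL_ACTION_MARK };

struct model_action_entry {
	enum model_action_id id;
	unsigned int mark;	/* MODEL_ACTION_MARK */
};

struct model_action {
	unsigned int num_entries;
	struct model_action_entry entries[];	/* C99 flexible array */
};

#define model_action_for_each(__i, __act, __actions)		\
	for (__i = 0, __act = &(__actions)->entries[0];		\
	     __i < (__actions)->num_entries;			\
	     __act = &(__actions)->entries[++__i])

/* Mirrors the reworked flow_rule_alloc(): one zeroed allocation sized
 * for the requested number of actions, count filled in up front. */
static struct model_action *model_rule_alloc(unsigned int num_actions)
{
	struct model_action *a;

	a = calloc(1, sizeof(*a) + sizeof(a->entries[0]) * num_actions);
	if (!a)
		return NULL;
	a->num_entries = num_actions;
	return a;
}
```

A driver-side consumer would fill the entries after allocation and then iterate them with the macro, switching on `id` per entry.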
[PATCH net-next] neighbour: Improve garbage collection
From: David Ahern

The existing garbage collection algorithm has a number of problems:

1. The gc algorithm will not evict PERMANENT entries as those entries are
   managed by userspace, yet the existing algorithm walks the entire hash
   table which means it always considers PERMANENT entries when looking
   for entries to evict. In some use cases (e.g., EVPN) there can be tens
   of thousands of PERMANENT entries leading to wasted CPU cycles when gc
   kicks in. As an example, with 32k permanent entries, neigh_alloc has
   been observed taking more than 4 msec per invocation.

2. Currently, when the number of neighbor entries hits gc_thresh2 and the
   last flush for the table was more than 5 seconds ago, gc kicks in and
   walks the entire hash table evicting *all* entries not in PERMANENT or
   REACHABLE state and not marked as externally learned. There is no
   discriminator on when the neigh entry was created or if it just moved
   from REACHABLE to another NUD_VALID state (e.g., NUD_STALE).

   It is possible for entries to be created or for established neighbor
   entries to be moved to STALE (e.g., an external node sends an ARP
   request) right before the 5 second window lapses:

        -----|---x-----|--------|----
            t-5        t       t+5

   If that happens those entries are evicted during gc causing unnecessary
   thrashing on neighbor entries and userspace caches trying to track
   them.

   Further, this contradicts the description of gc_thresh2 which says
   "Entries older than 5 seconds will be cleared". One workaround is to
   make gc_thresh2 == gc_thresh3 but that negates the whole point of
   having separate thresholds.

3. Clearing *all* non-PERMANENT/REACHABLE/externally-learned neigh entries
   when gc_thresh2 is exceeded is overkill and contributes to thrashing,
   especially during startup.

This patch addresses these problems as follows:

1. Use a separate list_head to track entries that can be garbage
   collected, along with a separate counter. PERMANENT entries are not
   added to this list.

   The gc_thresh parameters are only compared to the new counter, not the
   total entries in the table. The forced_gc function is updated to only
   walk this new gc_list looking for entries to evict.

2. Entries are added to the list at the tail and removed from the front.

3. Entries are only evicted if they were last updated more than 5 seconds
   ago, adhering to the original intent of gc_thresh2.

4. Forced gc is stopped once the number of gc_entries drops below
   gc_thresh2.

5. Since gc checks do not apply to PERMANENT entries, gc levels are
   skipped when allocating a new neighbor for a PERMANENT entry. By
   extension this means there are no explicit limits on the number of
   PERMANENT entries that can be created, but this is no different than
   FIB entries or FDB entries.

Signed-off-by: David Ahern
---
 Documentation/networking/ip-sysctl.txt |   4 +-
 include/net/neighbour.h                |   4 ++
 net/core/neighbour.c                   | 122 +++--
 3 files changed, 93 insertions(+), 37 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index af2a69439b93..acdfb5d2bcaa 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -108,8 +108,8 @@ neigh/default/gc_thresh2 - INTEGER
 	Default: 512

 neigh/default/gc_thresh3 - INTEGER
-	Maximum number of neighbor entries allowed. Increase this
-	when using large numbers of interfaces and when communicating
+	Maximum number of non-PERMANENT neighbor entries allowed. Increase
+	this when using large numbers of interfaces and when communicating
 	with large numbers of directly-connected peers.
 	Default: 1024

diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index f58b384aa6c9..846ad8da91eb 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -154,6 +154,8 @@ struct neighbour {
 	struct hh_cache		hh;
 	int			(*output)(struct neighbour *, struct sk_buff *);
 	const struct neigh_ops	*ops;
+	struct list_head	gc_list;
+	bool			on_gc_list;
 	struct rcu_head		rcu;
 	struct net_device	*dev;
 	u8			primary_key[0];
@@ -214,6 +216,8 @@ struct neigh_table {
 	struct timer_list	proxy_timer;
 	struct sk_buff_head	proxy_queue;
 	atomic_t		entries;
+	atomic_t		gc_entries;
+	struct list_head	gc_list;
 	rwlock_t		lock;
 	unsigned long		last_rand;
 	struct neigh_statistics	__percpu *stats;
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 6d479b5562be..ab11e94ec44d 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -118,6 +118,36 @@ unsigned long neigh_rand_r
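The eviction policy in points 2–4 above (scan the FIFO gc list from the front, only evict entries last updated more than 5 seconds ago, stop once the candidate count drops below gc_thresh2) can be sketched as a few lines of userspace C. All names here are hypothetical `model_*` stand-ins, with plain integers in place of jiffies:

```c
#include <assert.h>

/* Userspace model of the reworked forced_gc: candidates are stored in
 * insertion (FIFO) order, an entry is evicted only if it is older than
 * the age threshold, and the walk stops once the remaining candidate
 * count falls below thresh2. */
#define MODEL_GC_AGE 5

struct model_neigh {
	long updated;	/* last-update time, oldest entries come first */
	int evicted;
};

static int model_forced_gc(struct model_neigh *list, int n,
			   long now, int thresh2)
{
	int remaining = n, i;

	for (i = 0; i < n && remaining >= thresh2; i++) {
		if (now - list[i].updated > MODEL_GC_AGE) {
			list[i].evicted = 1;
			remaining--;
		}
	}
	return remaining;	/* candidates left after gc */
}
```

Note how a freshly updated entry (within the 5-second window) survives even while gc is running, which is exactly the thrashing scenario the patch description complains about in the old algorithm.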
Re: OMAP4430 SDP with KS8851: very slow networking
* Russell King - ARM Linux [181206 18:08]:
> reverted, the problem is still there. Revert:
>
> ec0daae685b2 ("gpio: omap: Add level wakeup handling for omap4 based SoCs")
>
> on top, and networking returns to normal. So it appears to be this
> last commit causing the issue.
>
> With that and b764a5863fd8 applied, it still misbehaves. Then, poking
> at the OMAP4_GPIO_IRQWAKEN0 register, changing it from 0 to 4 with
> devmem2 restores normal behaviour - ping times are normal and NFS is
> happy.
>
> # devmem2 0x48055044 w 4

OK thanks.

> Given that this GPIO device is not runtime suspended, and is
> permanently active (which is what I think we expect, given that it
> has an IRQ claimed against it) does the hardware still attempt to
> idle the GPIO block - if so, could that be why we need to program
> the wakeup register, so the GPIO block signals that it's active?

Yes we now idle non-irq GPIOs only from CPU_CLUSTER_PM_ENTER as the
selected cpuidle state triggers the domain transitions with WFI. And
that's why runtime_suspended_time does not increase for a GPIO instance
with IRQs.

I can reproduce the long ping latencies on duovero smsc connected to
gpio_44, I'll try to debug it more.

Regards,

Tony
Re: [PATCH net-next 2/2] net: dsa: Set the master device's MTU to account for DSA overheads
On Thu, 6 Dec 2018 11:36:05 +0100 Andrew Lunn wrote:
> +void dsa_master_set_mtu(struct net_device *dev, struct dsa_port *cpu_dp)
> +{
> +	unsigned int mtu = ETH_DATA_LEN + cpu_dp->tag_ops->overhead;
> +	int err;
> +
> +	rtnl_lock();
> +	if (mtu <= dev->max_mtu) {
> +		err = dev_set_mtu(dev, mtu);
> +		if (err)
> +			netdev_dbg(dev, "Unable to set MTU to include for DSA overheads\n");
> +	}
> +	rtnl_unlock();
> +}
> +

You don't need the debug message. Use extack instead? Debug messages
are usually disabled in most distributions.
Re: [PATCH net-next 2/2] net: dsa: Set the master device's MTU to account for DSA overheads
From: Andrew Lunn
Date: Thu, 6 Dec 2018 21:48:46 +0100

> David has already accepted the patchset, so i will add a followup
> patch.

Yeah sorry for jumping the gun, the changes looked pretty
straightforward to me. :-/
Re: [PATCH net-next v2 0/8] Pass extack to NETDEV_PRE_UP
From: Petr Machata
Date: Thu, 6 Dec 2018 17:05:35 +

> Drivers may need to validate configuration of a device that's about to
> be upped. An example is mlxsw, which needs to check the configuration
> of a VXLAN device attached to an offloaded bridge. Should the
> validation fail, there's currently no way to communicate details of the
> failure to the user, beyond an error number.
>
> Therefore this patch set extends the NETDEV_PRE_UP event to include
> extack, if available.
 ...

Series applied, thank you.
Re: [PATCH net 0/4] mlxsw: Various fixes
From: Ido Schimmel
Date: Thu, 6 Dec 2018 17:44:48 +

> Patches #1 and #2 fix two VxLAN related issues. The first patch removes
> warnings that can currently be triggered from user space. Second patch
> avoids leaking a FID in an error path.
>
> Patch #3 fixes a too strict check that causes certain host routes not
> to be promoted to perform GRE decapsulation in hardware.
>
> Last patch avoids a use-after-free when deleting a VLAN device via an
> ioctl when it is enslaved to a bridge. I have a patchset for net-next
> that reworks this code and makes the driver more robust.

Series applied.
[PATCH] net: wireless: FTM: fix kernel-doc "cannot understand" warnings
From: Randy Dunlap

Fix kernel-doc warnings in FTM due to missing "struct" keyword.

Fixes 109 warnings from :
../include/net/cfg80211.h:2838: warning: cannot understand function prototype: 'struct cfg80211_ftm_responder_stats '

and fixes 88 warnings from :
../include/net/mac80211.h:477: warning: cannot understand function prototype: 'struct ieee80211_ftm_responder_params '

Fixes: 81e54d08d9d8 ("cfg80211: support FTM responder configuration/statistics")
Fixes: bc847970f432 ("mac80211: support FTM responder configuration/statistics")
Signed-off-by: Randy Dunlap
Cc: Pradeep Kumar Chitrapu
Cc: Johannes Berg
Cc: David Spinadel
---
 include/net/cfg80211.h |    2 +-
 include/net/mac80211.h |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

--- lnx-420-rc5.orig/include/net/cfg80211.h
+++ lnx-420-rc5/include/net/cfg80211.h
@@ -2815,7 +2815,7 @@ struct cfg80211_external_auth_params {
 };

 /**
- * cfg80211_ftm_responder_stats - FTM responder statistics
+ * struct cfg80211_ftm_responder_stats - FTM responder statistics
  *
  * @filled: bitflag of flags using the bits of &enum nl80211_ftm_stats to
  *	indicate the relevant values in this struct for them
--- lnx-420-rc5.orig/include/net/mac80211.h
+++ lnx-420-rc5/include/net/mac80211.h
@@ -467,7 +467,7 @@ struct ieee80211_mu_group_data {
 };

 /**
- * ieee80211_ftm_responder_params - FTM responder parameters
+ * struct ieee80211_ftm_responder_params - FTM responder parameters
  *
  * @lci: LCI subelement content
  * @civicloc: CIVIC location subelement content
Re: [PATCH net-next 2/2] net: dsa: Set the master device's MTU to account for DSA overheads
On Thu, Dec 06, 2018 at 12:21:31PM -0800, Florian Fainelli wrote:
> On 12/6/18 2:36 AM, Andrew Lunn wrote:
> > DSA tagging of frames sent over the master interface to the switch
> > increases the size of the frame. Such frames can then be bigger than
> > the normal MTU of the master interface, and it may drop them. Use the
> > overhead information from the tagger to set the MTU of the master
> > device to include this overhead.
> >
> > Signed-off-by: Andrew Lunn
> > ---
> >  net/dsa/master.c | 16
> >  1 file changed, 16 insertions(+)
> >
> > diff --git a/net/dsa/master.c b/net/dsa/master.c
> > index c90ee3227dea..42f525bc68e2 100644
> > --- a/net/dsa/master.c
> > +++ b/net/dsa/master.c
> > @@ -158,8 +158,24 @@ static void dsa_master_ethtool_teardown(struct net_device *dev)
> >  	cpu_dp->orig_ethtool_ops = NULL;
> >  }
> >
> > +void dsa_master_set_mtu(struct net_device *dev, struct dsa_port *cpu_dp)
> > +{
> > +	unsigned int mtu = ETH_DATA_LEN + cpu_dp->tag_ops->overhead;
> > +	int err;
> > +
> > +	rtnl_lock();
> > +	if (mtu <= dev->max_mtu) {
> > +		err = dev_set_mtu(dev, mtu);
> > +		if (err)
> > +			netdev_dbg(dev, "Unable to set MTU to include for DSA overheads\n");
> > +	}
>
> Would it make sense to warn the user that there might be
> transmit/receive issues with the DSA tagging protocol if either
> dev_set_mtu() fails or mtu > dev->max_mtu?

I thought about that. But we have setups which work today with the
standard MTU. The master might not implement the set_mtu op, or might
impose the standard MTU, but be quite happy to deal with our DSA
packets. So I wanted to make this a hint to the master device, not a
strong requirement.

> Not that I think it matters too much to people because unbinding the
> switch driver and expecting the CPU port to continue operating is
> wishful thinking, but we should probably unwind that operation in
> dsa_master_teardown(), right?

That would make sense. David has already accepted the patchset, so I
will add a followup patch.

	Andrew
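The "hint, not requirement" semantics discussed in this subthread reduce to a small calculation: desired MTU is the standard Ethernet payload plus the tagger's per-frame overhead, applied only when the master's max_mtu permits it. A userspace model, with hypothetical names and illustrative tag sizes (DSA taggers typically add a few bytes, e.g. 4 or 8):

```c
#include <assert.h>

#define MODEL_ETH_DATA_LEN 1500	/* standard Ethernet payload size */

/* Model of dsa_master_set_mtu()'s decision: return the MTU the master
 * would end up with. If the enlarged MTU fits under max_mtu, the hint
 * is applied (dev_set_mtu() would be attempted); otherwise the current
 * MTU is kept and the setup may still work with standard-MTU frames. */
static int model_master_mtu(int overhead, int max_mtu, int cur_mtu)
{
	int mtu = MODEL_ETH_DATA_LEN + overhead;

	if (mtu <= max_mtu)
		return mtu;
	return cur_mtu;
}
```

With an 8-byte tag and a jumbo-capable master the MTU grows to 1508; with a master capped at 1500 the hint is silently skipped, matching the behaviour Andrew describes.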
Re: [PATCH RFC 2/6] net: dsa: microchip: Add MIB counter reading support
> I wonder what is the official way to clear the counters.

I don't think there is one, other than unloading the driver and loading
it again.

	Andrew
Re: [PATCH RFC 5/6] net: dsa: microchip: Update tag_ksz.c to access switch driver
> I did try to implement this way. But the other switches do not have
> the same format even though the length is the same. Then I need to
> change the following files for any new KSZ switch: include/linux/dsa.h,
> net/dsa/dsa.c, net/dsa/dsa_priv.h, and finally net/dsa/tag_ksz.c.

You can always add two different tag drivers. They don't have to share
code if it does not make sense.

> Even then it will not work if Microchip wants to add 1588 PTP
> capability to the switches.
>
> For KSZ9477 the length of the tail tag changes when the PTP function
> is enabled. Typically this function is either enabled or disabled
> all the time, but if users want to change that during normal
> operation to see how the switch behaves, the transmit function
> completely stops working correctly.

We should figure out how to support PTP. I think that is the main issue
here.

> Older driver implementation is to monitor that register change and
> adjust the length dynamically.
>
> Another problem is the tail tag needs to include the timestamp for the
> 1-step Pdelay_Resp to have accurate turnaround time when that message
> is sent out by the switch. This will require access to the main switch
> driver which will keep track of those PTP messages.
>
> PTP handles transmit timestamp in skb_tx_timestamp, which is typically
> called after the frame is sent, so it is too late. DSA calls
> dsa_skb_tx_timestamp before sending, but it only provides a clone to
> the driver that supports port_txstamp and so the switch driver may not
> be able to do anything.

The current design assumes the hardware will insert the PTP timestamp
into the frame using the clock inside the hardware. You then ask it
what timestamp it actually used.

If I understand you correctly, in your case, software is supposed to
provide the timestamp, which then gets inserted into the frame. So you
want to provide this timestamp as late as possible, when the frame
reaches the head of the queue and is about to be sent out the master
interface?

> In dsa_switch_rcv() the CPU receive function is called first before
> dsa_skb_defer_rx_timestamp(). That means the receive tail tag
> operation has to be done first to retrieve the receive timestamp so
> that it can be passed later.

What I think you can do is, in your tag rx function, directly add the
timestamp info to the skb. The DSA driver function .port_rxtstamp can
then always return false. Your tag function is going to need access to
some driver state, but you should be able to get at that, following
pointers, and placing some of the structures in global headers.

	Andrew
Re: [PATCH 1/2] net: linkwatch: send change uevent on link changes
On Thu, Dec 6, 2018 at 9:10 PM David Miller wrote:
>
> From: Jouke Witteveen
> Date: Thu, 6 Dec 2018 09:59:20 +0100
>
> > On Thu, Dec 6, 2018 at 1:34 AM David Miller wrote:
> >>
> >> From: Jouke Witteveen
> >> Date: Wed, 5 Dec 2018 23:38:17 +0100
> >>
> >> > Can you elaborate a bit? I may not be aware of the policy you
> >> > have in mind.
> >>
> >> When we have a user facing interface to do something, we don't
> >> create another one unless it is absolutely, positively, unavoidable.
> >
> > Obviously, if I would have known this I would not have gone through
> > the trouble of investigating and proposing this patch. It was an
> > honest attempt at making the kernel better.
> > Where could I have found this policy? I have looked on
> > kernel.org/doc, but couldn't find it.
>
> It is not formally documented but it is a concern we raise every time
> a duplicate piece of user facing functionality is proposed.

Ok, thanks for getting back to me! Now I know.

That said, when looking into udev I became more convinced that the
kernel should send uevents on link state changes anyway. An example of
another kernel interface that has two ways of sending out state changes
is rfkill. It informs userspace of state changes via /dev/rfkill and
via uevents. For the latter it also sets some informative environment
variables, which my patch currently does not do.

What would be needed to get you (or anyone else) to reconsider this
patch (or a revision)? I can definitely see your point and am willing
to accept your rejection. However, I also think there are substantial
ergonomic benefits to a unified and consistent interface for device
state changes and would like to give it one more shot, if possible.

Thanks for your time,
- Jouke
Re: [PATCH] mv88e6060: Warn about errors
On Thu 2018-12-06 12:21:59, David Miller wrote:
>
> Plain "printk" are never appropriate.
>
> Please explicitly use pr_warn() or similar. If there is a device
> context available, either a generic device or a netdev, use one of the
> dev_*() or netdev_*() variants.

Can do. I guess there's agreement that such an error is worth some kind
of output to the logs?

Best regards,

								Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: mv88e6060: Turn e6060 driver into e6065 driver
On Thu 2018-12-06 12:23:33, David Miller wrote:
> From: Pavel Machek
> Date: Thu, 6 Dec 2018 14:03:45 +0100
>
> > @@ -79,7 +82,7 @@ static enum dsa_tag_protocol mv88e6060_get_tag_protocol(struct dsa_switch *ds,
> > {
> >	//return DSA_TAG_PROTO_QCA;
> >	//return DSA_TAG_PROTO_TRAILER;
>
> These C++ style comments are not in any of my tree(s).
>
> Your patch submission really needs to shape up if you want your
> patches to be considered seriously.

This one should have been "RFD". It has way more serious problems than
this, as the changelog tried to explain.

								Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
RE: [PATCH RFC 1/6] net: dsa: microchip: Prepare PHY for proper advertisement
> > +static void ksz9477_phy_setup(struct ksz_device *dev, int port,
> > +			      struct phy_device *phy)
> > +{
> > +	if (port < dev->phy_port_cnt) {
> > +		/* SUPPORTED_Asym_Pause and SUPPORTED_Pause can be removed to
> > +		 * disable flow control when rate limiting is used.
> > +		 */
> > +	}
>
> Hi Tristram
>
> Is this meant to be a TODO comment?
>
> What is supposed to happen here is that all forms of pause are
> disabled by default. The MAC driver needs to enable what it supports
> by calling phy_support_sym_pause() or phy_support_asym_pause().
>
> Ah, is this because there is not a real PHY driver?

The kernel has been changed, so I am not sure about the current
behavior. I would like to turn on flow control by default. Before, I
just assigned "supported" to "advertising." I know linkmode_copy is
being used now for that. But last time I checked, "advertising" is
already the same as "supported."

There will be a situation where flow control should not be turned on,
as the switch uses bandwidth control to limit outgoing traffic.

The issue is actually becoming more complex, as KSZ9477 has a variant
which does not support gigabit speed, although the same PHY device id
is being used. That means the driver has to fake it by returning a
different id and also registering a different PHY driver to handle
that. Marketing also likes to display the correct chip name during
kernel booting so that users do not get confused.
Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
On Thu, Dec 6, 2018 at 12:20 PM Edgecombe, Rick P wrote: > > On Thu, 2018-12-06 at 11:19 -0800, Andy Lutomirski wrote: > > On Thu, Dec 6, 2018 at 11:01 AM Tycho Andersen wrote: > > > > > > On Thu, Dec 06, 2018 at 10:53:50AM -0800, Andy Lutomirski wrote: > > > > > If we are going to unmap the linear alias, why not do it at vmalloc() > > > > > time rather than vfree() time? > > > > > > > > That’s not totally nuts. Do we ever have code that expects __va() to > > > > work on module data? Perhaps crypto code trying to encrypt static > > > > data because our APIs don’t understand virtual addresses. I guess if > > > > highmem is ever used for modules, then we should be fine. > > > > > > > > RO instead of not present might be safer. But I do like the idea of > > > > renaming Rick's flag to something like VM_XPFO or VM_NO_DIRECT_MAP and > > > > making it do all of this. > > > > > > Yeah, doing it for everything automatically seemed like it was/is > > > going to be a lot of work to debug all the corner cases where things > > > expect memory to be mapped but don't explicitly say it. And in > > > particular, the XPFO series only does it for user memory, whereas an > > > additional flag like this would work for extra paranoid allocations > > > of kernel memory too. > > > > > > > I just read the code, and I looks like vmalloc() is already using > > highmem (__GFP_HIGH) if available, so, on big x86_32 systems, for > > example, we already don't have modules in the direct map. > > > > So I say we go for it. This should be quite simple to implement -- > > the pageattr code already has almost all the needed logic on x86. The > > only arch support we should need is a pair of functions to remove a > > vmalloc address range from the address map (if it was present in the > > first place) and a function to put it back. On x86, this should only > > be a few lines of code. > > > > What do you all think? This should solve most of the problems we have. 
> > > > If we really wanted to optimize this, we'd make it so that > > module_alloc() allocates memory the normal way, then, later on, we > > call some function that, all at once, removes the memory from the > > direct map and applies the right permissions to the vmalloc alias (or > > just makes the vmalloc alias not-present so we can add permissions > > later without flushing), and flushes the TLB. And we arrange for > > vunmap to zap the vmalloc range, then put the memory back into the > > direct map, then free the pages back to the page allocator, with the > > flush in the appropriate place. > > > > I don't see why the page allocator needs to know about any of this. > > It's already okay with the permissions being changed out from under it > > on x86, and it seems fine. Rick, do you want to give some variant of > > this a try? > Hi, > > Sorry, I've been having email troubles today. > > I found some cases where vmap with PAGE_KERNEL_RO happens, which would not set > NP/RO in the directmap, so it would be sort of inconsistent whether the > directmap of vmalloc range allocations were readable or not. I couldn't see > any > places where it would cause problems today though. > > I was ready to assume that all TLBs don't cache NP, because I don't know how > usages where a page fault is used to load something could work without lots of > flushes. Or the architecture just fixes up the spurious faults, I suppose. I'm only well-educated on the x86 mmu. > If that's the case, then all archs with directmap permissions could > share a single vmalloc special permission flush implementation that works like > Andy described originally. It could be controlled with an > ARCH_HAS_DIRECT_MAP_PERMS. We would just need something like set_pages_np and > set_pages_rw on any archs with directmap permissions. So seems simpler to me > (and what I have been doing) unless I'm missing the problem. Hmm. 
The only reason I've proposed anything fancier was because I was
thinking of minimizing flushes, but I think I'm being silly. This
sequence ought to work optimally:

 - vmalloc(..., VM_HAS_DIRECT_MAP_PERMS);  /* no flushes */

 - Write some data, via vmalloc's return address.

 - Use some set_memory_whatever() functions to update permissions,
   which will flush, hopefully just once.

 - Run the module code!

 - vunmap -- this will do a single flush that will fix everything.

This does require that set_pages_np() or set_memory_np() or whatever
exists and that it's safe to do that, then flush, and then
set_pages_rw(). So maybe you want set_pages_np_noflush() and
set_pages_rw_noflush() to make it totally clear what's supposed to
happen.

--Andy
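As a rough illustration of the sequence described above (not from the thread itself): `VM_HAS_DIRECT_MAP_PERMS` and `__vmalloc_flags()` are hypothetical names used only for this sketch, while `set_memory_ro()`/`set_memory_x()` and `vfree()` are the existing x86 interfaces; each `set_memory_*()` call batches its own TLB flush.

```c
/*
 * Sketch only: allocate -> write -> protect -> run -> free, with the
 * direct-map alias removed at allocation time and restored at vfree().
 * __vmalloc_flags() and VM_HAS_DIRECT_MAP_PERMS are assumptions, not
 * mainline APIs.
 */
void *buf = __vmalloc_flags(len, VM_HAS_DIRECT_MAP_PERMS); /* no flush yet */

memcpy(buf, image, len);                /* write via the vmalloc alias */

set_memory_ro((unsigned long)buf, npages);  /* permission update + flush */
set_memory_x((unsigned long)buf, npages);

/* ... run the module code ... */

vfree(buf);   /* zap the vmalloc range, restore the direct map, one flush */
```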
Re: mv88e6060: Turn e6060 driver into e6065 driver
From: Pavel Machek
Date: Thu, 6 Dec 2018 14:03:45 +0100

> @@ -79,7 +82,7 @@ static enum dsa_tag_protocol
> mv88e6060_get_tag_protocol(struct dsa_switch *ds,
> {
> 	//return DSA_TAG_PROTO_QCA;
> 	//return DSA_TAG_PROTO_TRAILER;

These C++ style comments are not in any of my tree(s). Your patch
submission really needs to shape up if you want your patches to be
considered seriously.

Thank you.
Re: [PATCH] mv88e6060: Warn about errors
Plain "printk" are never appropriate. Please explicitly use pr_warn() or similar. If there is a device context available, either a generic device or a netdev, use one of the dev_*() or netdev_*() variants.
Re: [PATCH net-next 2/2] net: dsa: Set the master device's MTU to account for DSA overheads
On 12/6/18 2:36 AM, Andrew Lunn wrote:
> DSA tagging of frames sent over the master interface to the switch
> increases the size of the frame. Such frames can then be bigger than
> the normal MTU of the master interface, and it may drop them. Use the
> overhead information from the tagger to set the MTU of the master
> device to include this overhead.
>
> Signed-off-by: Andrew Lunn
> ---
>  net/dsa/master.c | 16 
>  1 file changed, 16 insertions(+)
>
> diff --git a/net/dsa/master.c b/net/dsa/master.c
> index c90ee3227dea..42f525bc68e2 100644
> --- a/net/dsa/master.c
> +++ b/net/dsa/master.c
> @@ -158,8 +158,24 @@ static void dsa_master_ethtool_teardown(struct net_device *dev)
>  	cpu_dp->orig_ethtool_ops = NULL;
>  }
>
> +void dsa_master_set_mtu(struct net_device *dev, struct dsa_port *cpu_dp)
> +{
> +	unsigned int mtu = ETH_DATA_LEN + cpu_dp->tag_ops->overhead;
> +	int err;
> +
> +	rtnl_lock();
> +	if (mtu <= dev->max_mtu) {
> +		err = dev_set_mtu(dev, mtu);
> +		if (err)
> +			netdev_dbg(dev, "Unable to set MTU to include for DSA overheads\n");
> +	}

Would it make sense to warn the user that there might be
transmit/receive issues with the DSA tagging protocol if either
dev_set_mtu() fails or mtu > dev->max_mtu?

> +	rtnl_unlock();
> +}
> +
>  int dsa_master_setup(struct net_device *dev, struct dsa_port *cpu_dp)
>  {
> +	dsa_master_set_mtu(dev, cpu_dp);

Not that I think it matters too much to people because unbinding the
switch driver and expecting the CPU port to continue operating is
wishful thinking, but we should probably unwind that operation in
dsa_master_teardown(), right?

> +
>  	/* If we use a tagging format that doesn't have an ethertype
>  	 * field, make sure that all packets from this point on get
>  	 * sent to the tag format's receive function.

-- 
Florian
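A minimal sketch of the unwind Florian suggests might look like the following. It assumes the master was at the default ETH_DATA_LEN before setup; the helper name and log message are made up for illustration.

```c
/*
 * Hypothetical counterpart to dsa_master_set_mtu() for the teardown
 * path, restoring the master's MTU when the switch is unbound.
 */
static void dsa_master_reset_mtu(struct net_device *dev)
{
	int err;

	rtnl_lock();
	err = dev_set_mtu(dev, ETH_DATA_LEN);
	if (err)
		netdev_dbg(dev,
			   "Unable to reset MTU to exclude DSA overheads\n");
	rtnl_unlock();
}

/* ... called from dsa_master_teardown() before restoring rx_handler */
```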
Re: [PATCH] tcp: fix code style in tcp_recvmsg()
From: Pedro Tammela Date: Thu, 6 Dec 2018 10:45:28 -0200 > 2 goto labels are indented with a tab. remove the tabs and > keep the code style consistent. > > Signed-off-by: Pedro Tammela Applied to net-next.
Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
On Thu, 2018-12-06 at 11:19 -0800, Andy Lutomirski wrote: > On Thu, Dec 6, 2018 at 11:01 AM Tycho Andersen wrote: > > > > On Thu, Dec 06, 2018 at 10:53:50AM -0800, Andy Lutomirski wrote: > > > > If we are going to unmap the linear alias, why not do it at vmalloc() > > > > time rather than vfree() time? > > > > > > That’s not totally nuts. Do we ever have code that expects __va() to > > > work on module data? Perhaps crypto code trying to encrypt static > > > data because our APIs don’t understand virtual addresses. I guess if > > > highmem is ever used for modules, then we should be fine. > > > > > > RO instead of not present might be safer. But I do like the idea of > > > renaming Rick's flag to something like VM_XPFO or VM_NO_DIRECT_MAP and > > > making it do all of this. > > > > Yeah, doing it for everything automatically seemed like it was/is > > going to be a lot of work to debug all the corner cases where things > > expect memory to be mapped but don't explicitly say it. And in > > particular, the XPFO series only does it for user memory, whereas an > > additional flag like this would work for extra paranoid allocations > > of kernel memory too. > > > > I just read the code, and I looks like vmalloc() is already using > highmem (__GFP_HIGH) if available, so, on big x86_32 systems, for > example, we already don't have modules in the direct map. > > So I say we go for it. This should be quite simple to implement -- > the pageattr code already has almost all the needed logic on x86. The > only arch support we should need is a pair of functions to remove a > vmalloc address range from the address map (if it was present in the > first place) and a function to put it back. On x86, this should only > be a few lines of code. > > What do you all think? This should solve most of the problems we have. 
> > If we really wanted to optimize this, we'd make it so that > module_alloc() allocates memory the normal way, then, later on, we > call some function that, all at once, removes the memory from the > direct map and applies the right permissions to the vmalloc alias (or > just makes the vmalloc alias not-present so we can add permissions > later without flushing), and flushes the TLB. And we arrange for > vunmap to zap the vmalloc range, then put the memory back into the > direct map, then free the pages back to the page allocator, with the > flush in the appropriate place. > > I don't see why the page allocator needs to know about any of this. > It's already okay with the permissions being changed out from under it > on x86, and it seems fine. Rick, do you want to give some variant of > this a try? Hi, Sorry, I've been having email troubles today. I found some cases where vmap with PAGE_KERNEL_RO happens, which would not set NP/RO in the directmap, so it would be sort of inconsistent whether the directmap of vmalloc range allocations were readable or not. I couldn't see any places where it would cause problems today though. I was ready to assume that all TLBs don't cache NP, because I don't know how usages where a page fault is used to load something could work without lots of flushes. If that's the case, then all archs with directmap permissions could share a single vmalloc special permission flush implementation that works like Andy described originally. It could be controlled with an ARCH_HAS_DIRECT_MAP_PERMS. We would just need something like set_pages_np and set_pages_rw on any archs with directmap permissions. So seems simpler to me (and what I have been doing) unless I'm missing the problem. If you all think so I can indeed take a shot at it, I just don't see what the problem was with the original solution, that seems less likely to break anything. Thanks, Rick
Re: [PATCH net-next 0/2] Adjust MTU of DSA master interface
From: Andrew Lunn Date: Thu, 6 Dec 2018 11:36:03 +0100 > DSA makes use of additional headers to direct a frame in/out of a > specific port of the switch. When the slave interfaces uses an MTU of > 1500, the master interface can be asked to handle frames with an MTU > of 1504, or 1508 bytes. Some Ethernet interfaces won't > transmit/receive frames which are bigger than their MTU. > > Automate the increasing of the MTU on the master interface, by adding > to each tagging driver how much overhead they need, and then calling > dev_set_mtu() of the master interface to increase its MTU as needed. Series applied, thanks Andrew.
Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
On Thu, Dec 6, 2018 at 11:39 AM Nadav Amit wrote: > > > On Dec 6, 2018, at 11:19 AM, Andy Lutomirski wrote: > > > > On Thu, Dec 6, 2018 at 11:01 AM Tycho Andersen wrote: > >> On Thu, Dec 06, 2018 at 10:53:50AM -0800, Andy Lutomirski wrote: > If we are going to unmap the linear alias, why not do it at vmalloc() > time rather than vfree() time? > >>> > >>> That’s not totally nuts. Do we ever have code that expects __va() to > >>> work on module data? Perhaps crypto code trying to encrypt static > >>> data because our APIs don’t understand virtual addresses. I guess if > >>> highmem is ever used for modules, then we should be fine. > >>> > >>> RO instead of not present might be safer. But I do like the idea of > >>> renaming Rick's flag to something like VM_XPFO or VM_NO_DIRECT_MAP and > >>> making it do all of this. > >> > >> Yeah, doing it for everything automatically seemed like it was/is > >> going to be a lot of work to debug all the corner cases where things > >> expect memory to be mapped but don't explicitly say it. And in > >> particular, the XPFO series only does it for user memory, whereas an > >> additional flag like this would work for extra paranoid allocations > >> of kernel memory too. > > > > I just read the code, and I looks like vmalloc() is already using > > highmem (__GFP_HIGH) if available, so, on big x86_32 systems, for > > example, we already don't have modules in the direct map. > > > > So I say we go for it. This should be quite simple to implement -- > > the pageattr code already has almost all the needed logic on x86. The > > only arch support we should need is a pair of functions to remove a > > vmalloc address range from the address map (if it was present in the > > first place) and a function to put it back. On x86, this should only > > be a few lines of code. > > > > What do you all think? This should solve most of the problems we have. 
> > > > If we really wanted to optimize this, we'd make it so that > > module_alloc() allocates memory the normal way, then, later on, we > > call some function that, all at once, removes the memory from the > > direct map and applies the right permissions to the vmalloc alias (or > > just makes the vmalloc alias not-present so we can add permissions > > later without flushing), and flushes the TLB. And we arrange for > > vunmap to zap the vmalloc range, then put the memory back into the > > direct map, then free the pages back to the page allocator, with the > > flush in the appropriate place. > > > > I don't see why the page allocator needs to know about any of this. > > It's already okay with the permissions being changed out from under it > > on x86, and it seems fine. Rick, do you want to give some variant of > > this a try? > > Setting it as read-only may work (and already happens for the read-only > module data). I am not sure about setting it as non-present. > > At some point, a discussion about a threat-model, as Rick indicated, would > be required. I presume ROP attacks can easily call set_all_modules_text_rw() > and override all the protections. > I am far from an expert on exploit techniques, but here's a potentially useful model: let's assume there's an attacker who can write controlled data to a controlled kernel address but cannot directly modify control flow. It would be nice for such an attacker to have a very difficult time of modifying kernel text or of compromising control flow. So we're assuming a feature like kernel CET or that the attacker finds it very difficult to do something like modifying some thread's IRET frame. 
Admittedly, for the kernel, this is an odd threat model, since an attacker can presumably quite easily learn the kernel stack address of one of their tasks, do some syscall, and then modify their kernel thread's stack such that it will IRET right back to a fully controlled register state with RSP pointing at an attacker-supplied kernel stack. So this threat model gives very strong ROP powers, unless we have either CET or some software technique to harden all the RET instructions in the kernel. I wonder if there's a better model to use. Maybe with stack-protector we get some degree of protection? Or is all of this rather weak until we have CET or a RAP-like feature.
Re: [PATCH][net-next] tun: align write-heavy flow entry members to a cache line
From: Li RongQing Date: Thu, 6 Dec 2018 16:08:17 +0800 > tun flow entry 'updated' fields are written when receive > every packet. Thus if a flow is receiving packets from a > particular flow entry, it'll cause false-sharing with > all the other who has looked it up, so move it in its own > cache line > > and update 'queue_index' and 'update' field only when > they are changed to reduce the cache false-sharing. > > Signed-off-by: Zhang Yu > Signed-off-by: Wang Li > Signed-off-by: Li RongQing Applied.
Re: [PATCH][net-next] tun: remove unnecessary check in tun_flow_update
From: Li RongQing Date: Thu, 6 Dec 2018 16:28:11 +0800 > caller has guaranted that rxhash is not zero > > Signed-off-by: Li RongQing Applied.
RE: [PATCH RFC 2/6] net: dsa: microchip: Add MIB counter reading support
> > +static void ksz9477_r_mib_cnt(struct ksz_device *dev, int port, u16 addr,
> > +			      u64 *cnt)
> > +{
> > +	u32 data;
> > +	int timeout;
> > +	struct ksz_port *p = &dev->ports[port];
> > +
> > +	/* retain the flush/freeze bit */
> > +	data = p->freeze ? MIB_COUNTER_FLUSH_FREEZE : 0;
> > +	data |= MIB_COUNTER_READ;
> > +	data |= (addr << MIB_COUNTER_INDEX_S);
> > +	ksz_pwrite32(dev, port, REG_PORT_MIB_CTRL_STAT__4, data);
> > +
> > +	timeout = 1000;
> > +	do {
> > +		ksz_pread32(dev, port, REG_PORT_MIB_CTRL_STAT__4,
> > +			    &data);
> > +		usleep_range(1, 10);
> > +		if (!(data & MIB_COUNTER_READ))
> > +			break;
> > +	} while (timeout-- > 0);
>
> Could you use readx_poll_timeout() here?
>
> > +void ksz_get_ethtool_stats(struct dsa_switch *ds, int port, uint64_t *buf)
> > +{
> > +	struct ksz_device *dev = ds->priv;
> > +	struct ksz_port_mib *mib;
> > +
> > +	mib = &dev->ports[port].mib;
> > +
> > +	/* freeze MIB counters if supported */
> > +	if (dev->dev_ops->freeze_mib)
> > +		dev->dev_ops->freeze_mib(dev, port, true);
> > +	mutex_lock(&mib->cnt_mutex);
> > +	port_r_cnt(dev, port);
> > +	mutex_unlock(&mib->cnt_mutex);
> > +	if (dev->dev_ops->freeze_mib)
> > +		dev->dev_ops->freeze_mib(dev, port, false);
>
> Should the freeze be protected by the mutex as well?
>
> > +	memcpy(buf, mib->counters, dev->mib_cnt * sizeof(u64));
>
> I wonder if this memcpy should also be protected by the mutex. As soon
> as the mutex is dropped, the scheduled work could start updating
> mib->counters in non-atomic ways?

I will update as suggested.

> > +}
> > +
> >  int ksz_port_bridge_join(struct dsa_switch *ds, int port,
> >  			 struct net_device *br)
> >  {
> > @@ -255,6 +349,7 @@ int ksz_enable_port(struct dsa_switch *ds, int port, struct phy_device *phy)
> >  	/* setup slave port */
> >  	dev->dev_ops->port_setup(dev, port, false);
> >  	dev->dev_ops->phy_setup(dev, port, phy);
> > +	dev->dev_ops->port_init_cnt(dev, port);
>
> This is probably not the correct place to do this. 
> MIB counters should not be cleared by an ifdown/ifup cycle. They
> should only be cleared when the driver is probed.

I wonder what is the official way to clear the counters. For network debugging it is good to clear the counters to start fresh, to see which frame is not being sent or received. Typically the device is reset when it is shut down, as there are hardware problems. I would think it is the job of applications like an SNMP manager to keep track of MIB counters throughout the life of a running system.
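Andrew's readx_poll_timeout() suggestion above could be sketched as follows (untested). readx_poll_timeout() expects a single-argument accessor, so one option is a small context struct bundling the device and port; that struct and the accessor name are assumptions for this sketch.

```c
#include <linux/iopoll.h>

/* Hypothetical context so the accessor fits readx_poll_timeout()'s shape */
struct ksz_mib_poll_ctx {
	struct ksz_device *dev;
	int port;
};

static u32 ksz9477_mib_ctrl_stat(struct ksz_mib_poll_ctx *ctx)
{
	u32 data;

	ksz_pread32(ctx->dev, ctx->port, REG_PORT_MIB_CTRL_STAT__4, &data);
	return data;
}

/* Replacement for the open-coded retry loop in ksz9477_r_mib_cnt(): */
	struct ksz_mib_poll_ctx ctx = { .dev = dev, .port = port };
	u32 val;
	int ret;

	/* poll until the hardware clears MIB_COUNTER_READ, ~10ms budget */
	ret = readx_poll_timeout(ksz9477_mib_ctrl_stat, &ctx, val,
				 !(val & MIB_COUNTER_READ), 10, 10000);
	if (ret < 0)
		return;		/* timed out */
```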
Re: [PATCH 1/2] net: linkwatch: send change uevent on link changes
From: Jouke Witteveen Date: Thu, 6 Dec 2018 09:59:20 +0100 > On Thu, Dec 6, 2018 at 1:34 AM David Miller wrote: >> >> From: Jouke Witteveen >> Date: Wed, 5 Dec 2018 23:38:17 +0100 >> >> > Can you elaborate a bit? I may not be aware of the policy you have in >> > mind. >> >> When we have a user facing interface to do something, we don't create >> another one unless it is absolutely, positively, unavoidable. > > Obviously, if I would have known this I would not have gone through > the trouble of investigating and proposing this patch. It was an > honest attempt at making the kernel better. > Where could I have found this policy? I have looked on kernel.org/doc, > but couldn't find it. It is not formally documented but it is a concern we raise every time a duplicate piece of user facing functionality is proposed.
RE: [PATCH RFC 5/6] net: dsa: microchip: Update tag_ksz.c to access switch driver
> >>> Update tag_ksz.c to access switch driver's tail tagging operations.
> >>
> >> Hi Tristram
> >>
> >> Humm, i'm not sure we want this, the tagging spit into two places. I
> >> need to take a closer look at the previous patch, to see why it cannot
> >> be done here.
> >
> > O.K, i think i get what is going on.
> >
> > I would however implement it differently.
> >
> > One net/dsa/tag_X.c file can export two dsa_device_ops structures,
> > allowing you to share common code for the two taggers. You could call
> > these DSA_TAG_PROTO_KSZ_1_BYTE, and DSA_TAG_PROTO_KSZ_2_BYTE, and the
> > .get_tag_protocol call would then return the correct one for the
> > switch.
>
> Agreed, that is what is done by net/dsa/tag_brcm.c because there are two
> formats for the Broadcom tag:
>
> - TAG_BRCM: the 4-bytes Broadcom tag is between MAC SA and Ethertype
> - TAG_BRCM_PREPEND: the 4-bytes Broadcom tag is before the MAC DA

I did try to implement it this way, but the other switches do not have the same format even though the length is the same. Then I need to change the following files for any new KSZ switch: include/net/dsa.h, net/dsa/dsa.c, net/dsa/dsa_priv.h, and finally net/dsa/tag_ksz.c.

Even then it will not work if Microchip wants to add 1588 PTP capability to the switches. For KSZ9477 the length of the tail tag changes when the PTP function is enabled. Typically this function is either enabled or disabled all the time, but if users want to change that during normal operation to see how the switch behaves, the transmit function completely stops working correctly. The older driver implementation is to monitor that register change and adjust the length dynamically.

Another problem is that the tail tag needs to include the timestamp for the 1-step Pdelay_Resp to have an accurate turnaround time when that message is sent out by the switch. This will require access to the main switch driver, which will keep track of those PTP messages. 
PTP handles the transmit timestamp in skb_tx_timestamp, which is typically called after the frame is sent, so it is too late. DSA calls dsa_skb_tx_timestamp before sending, but it only provides a clone to the driver that supports port_txstamp, and so the switch driver may not be able to do anything.

> And the code to process them is basically using relative offsets from
> the start of the frame to access correct data.
>
> This is done largely for performance reasons because we have 1/2
> Gigabit/secs capable CPU ports and so we want to avoid as little cache
> trashing as possible and immediately get the right rcv() function to
> process the packets.

The SoC I used for this driver development actually has problems sending Gigabit traffic, so I do not see the effect of any slowdown, and the updated MAC driver change for a hardware problem does not help and greatly degrades the transmit performance.

> > It might also be possible to merge in tag_trailer, or at least share
> > some code.

Actually, in the previous old DSA implementation I just hijacked this file to add the tail tag operations without creating a new file like tag_ksz.c.

> > What i don't yet understand is how you are passing PTP information
> > around. The commit messages need to explain that, since it is not
> > obvious, and it is the first time we have needed PTP info in a tag
> > driver.

It seems the official 1588 PTP timestamp API for a PHY driver is implemented in only one PHY driver, drivers/net/phy/dp83640.c, in the whole kernel. DSA uses a similar mechanism to support 1588 PTP. In dsa_switch_rcv() the CPU receive function is called first, before dsa_skb_defer_rx_timestamp(). That means the receive tail tag operation has to be done first to retrieve the receive timestamp so that it can be passed later. It is probably not good to change the socket buffer length inside the port_rxtstamp function, and I do not see any other way to insert that transmit timestamp. 
A customer has already inquired about implementing 1588 PTP in the DSA driver. I hope this mechanism is approved so that I can start doing that.
Re: [PATCH net-next 0/2] platform data controls for mdio-gpio
On Thu, Dec 06, 2018 at 09:22:27AM -0800, Florian Fainelli wrote: > Hi Andrew, > > On 12/6/18 5:58 AM, Andrew Lunn wrote: > > Soon to be mainlined is an x86 platform with a Marvell switch, and a > > bit-banging MDIO bus. In order to make this work, the phy_mask of the > > MDIO bus needs to be set to prevent scanning for PHYs, and the > > phy_ignore_ta_mask needs to be set because the switch has broken > > turnaround. > > This looks good, I would just make one/two changes which is to match the > internal phy_mask and phy_ignore_ta_mask types from the struct mii_bus > and use u32 instead of int. Yes, that makes sense. v2 to follow. Andrew
Re: [PATCH] gianfar: Add gfar_change_carrier()
> I can have a look at using dormant, but what is change_carrier > supposed to do if not this? It is intended for interfaces which are stacked, like the team driver, and for devices which don't have a phy, e.g. tun, and dummy. > I didn't find a tool for DORMANT, I guess i will have to write one > myself(using SIOCGIFFLAGS, SIOCSIFFLAGS)? ip link should be able to set it. Try ip link set mode dormant dev eth0 Andrew
Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
> On Dec 6, 2018, at 11:19 AM, Andy Lutomirski wrote: > > On Thu, Dec 6, 2018 at 11:01 AM Tycho Andersen wrote: >> On Thu, Dec 06, 2018 at 10:53:50AM -0800, Andy Lutomirski wrote: If we are going to unmap the linear alias, why not do it at vmalloc() time rather than vfree() time? >>> >>> That’s not totally nuts. Do we ever have code that expects __va() to >>> work on module data? Perhaps crypto code trying to encrypt static >>> data because our APIs don’t understand virtual addresses. I guess if >>> highmem is ever used for modules, then we should be fine. >>> >>> RO instead of not present might be safer. But I do like the idea of >>> renaming Rick's flag to something like VM_XPFO or VM_NO_DIRECT_MAP and >>> making it do all of this. >> >> Yeah, doing it for everything automatically seemed like it was/is >> going to be a lot of work to debug all the corner cases where things >> expect memory to be mapped but don't explicitly say it. And in >> particular, the XPFO series only does it for user memory, whereas an >> additional flag like this would work for extra paranoid allocations >> of kernel memory too. > > I just read the code, and I looks like vmalloc() is already using > highmem (__GFP_HIGH) if available, so, on big x86_32 systems, for > example, we already don't have modules in the direct map. > > So I say we go for it. This should be quite simple to implement -- > the pageattr code already has almost all the needed logic on x86. The > only arch support we should need is a pair of functions to remove a > vmalloc address range from the address map (if it was present in the > first place) and a function to put it back. On x86, this should only > be a few lines of code. > > What do you all think? This should solve most of the problems we have. 
> > If we really wanted to optimize this, we'd make it so that > module_alloc() allocates memory the normal way, then, later on, we > call some function that, all at once, removes the memory from the > direct map and applies the right permissions to the vmalloc alias (or > just makes the vmalloc alias not-present so we can add permissions > later without flushing), and flushes the TLB. And we arrange for > vunmap to zap the vmalloc range, then put the memory back into the > direct map, then free the pages back to the page allocator, with the > flush in the appropriate place. > > I don't see why the page allocator needs to know about any of this. > It's already okay with the permissions being changed out from under it > on x86, and it seems fine. Rick, do you want to give some variant of > this a try? Setting it as read-only may work (and already happens for the read-only module data). I am not sure about setting it as non-present. At some point, a discussion about a threat-model, as Rick indicated, would be required. I presume ROP attacks can easily call set_all_modules_text_rw() and override all the protections.
Re: [PATCH bpf-next] tools: bpftool: add a command to dump the trace pipe
On Thu, Dec 06, 2018 at 05:20:54PM +, Quentin Monnet wrote: > 2018-12-05 19:18 UTC-0800 ~ Alexei Starovoitov > > > On Wed, Dec 05, 2018 at 06:15:23PM +, Quentin Monnet wrote: > + > +/* Allow room for NULL terminating byte and pipe file name */ > +snprintf(format, sizeof(format), "%%*s %%%zds %%99s %%*s %%*d > %%*d\\n", > + PATH_MAX - strlen(pipe_name) - 1); > >>> > >>> before scanning trace_pipe could you add a check that trace_options are > >>> compatible? > >>> Otherwise there will be a lot of garbage printed. > >>> afaik default is rarely changed, so the patch is ok as-is. > >>> The followup some time in the future would be perfect. > >> > >> Sure. What do you mean exactly by compatible options? I can check that > >> "trace_printk" is set, is there any other option that would be relevant? > > > > See Documentation/trace/ftrace.rst > > a lot of the flags will change the format significantly. > > Like 'bin' will make it binary. > > I'm not suggesting to support all possible output formats. > > Only to check that trace flags match scanf. > > fscanf() is only used to retrieve the name of the sysfs directory where > the pipe is located, when listing all the mount points on the system. It > is not used to dump the content from the pipe (which is done with > getline(), so formatting does not matter much). > > If the "bin" option is set, "bpftool prog tracelog" will dump the same > binary content as "cat /sys/kernel/debug/tracing/trace_pipe", which is > the expected behaviour (at least with the current patch). Let me know if > you would like me to change this somehow. I misread the patch :) thanks for explaining. all good then.
Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
On Thu, 6 Dec 2018 at 20:30, Will Deacon wrote: > > On Thu, Dec 06, 2018 at 08:23:20PM +0100, Ard Biesheuvel wrote: > > On Thu, 6 Dec 2018 at 20:21, Andy Lutomirski wrote: > > > > > > On Thu, Dec 6, 2018 at 11:04 AM Ard Biesheuvel > > > wrote: > > > > > > > > On Thu, 6 Dec 2018 at 19:54, Andy Lutomirski wrote: > > > > > > > > > > > > > That’s not totally nuts. Do we ever have code that expects __va() to > > > > > work on module data? Perhaps crypto code trying to encrypt static > > > > > data because our APIs don’t understand virtual addresses. I guess if > > > > > highmem is ever used for modules, then we should be fine. > > > > > > > > > > > > > The crypto code shouldn't care, but I think it will probably break > > > > hibernate :-( > > > > > > How so? Hibernate works (or at least should work) on x86 PAE, where > > > __va doesn't work on module data, and, on x86, the direct map has some > > > RO parts with where the module is, so hibernate can't be writing to > > > the memory through the direct map with its final permissions. > > > > On arm64 at least, hibernate reads the contents of memory via the > > linear mapping. Not sure about other arches. > > Can we handle this like the DEBUG_PAGEALLOC case, and extract the pfn from > the pte when we see that it's PROT_NONE? > As long as we can easily figure out whether a certain linear address is mapped or not, having a special case like that for these mappings doesn't sound unreasonable.
Re: [PATCH net-next 4/4] net: aquantia: add support of RSS configuration
On Thu, 6 Dec 2018 15:02:52 +, Igor Russkikh wrote:
> From: Dmitry Bogdanov
>
> Add support of configuration of RSS hash key and RSS indirection table.
>
> Signed-off-by: Dmitry Bogdanov
> Signed-off-by: Igor Russkikh
> ---
>  .../ethernet/aquantia/atlantic/aq_ethtool.c | 42 +++
>  1 file changed, 42 insertions(+)
>
> diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_ethtool.c b/drivers/net/ethernet/aquantia/atlantic/aq_ethtool.c
> index a5fd71692c8b..2f2e12c2b632 100644
> --- a/drivers/net/ethernet/aquantia/atlantic/aq_ethtool.c
> +++ b/drivers/net/ethernet/aquantia/atlantic/aq_ethtool.c
> @@ -202,6 +202,47 @@ static int aq_ethtool_get_rss(struct net_device *ndev, u32 *indir, u8 *key,
>  	return 0;
>  }
>
> +static int aq_ethtool_set_rss(struct net_device *netdev, const u32 *indir,
> +			      const u8 *key, const u8 hfunc)
> +{
> +	struct aq_nic_s *aq_nic = netdev_priv(netdev);
> +	struct aq_nic_cfg_s *cfg;
> +	unsigned int i = 0U;
> +	u32 rss_entries;
> +	int err = 0;
> +
> +	cfg = aq_nic_get_cfg(aq_nic);
> +	rss_entries = cfg->aq_rss.indirection_table_size;
> +
> +	/* We do not allow change in unsupported parameters */
> +	if (hfunc != ETH_RSS_HASH_NO_CHANGE && hfunc != ETH_RSS_HASH_TOP)
> +		return -EOPNOTSUPP;
> +	/* Fill out the redirection table */
> +	if (indir) {
> +		/* Verify user input. */
> +		for (i = 0; i < rss_entries; i++)
> +			if (indir[i] >= cfg->num_rss_queues)
> +				return -EINVAL;

nit: you shouldn't have to do this, see ethtool_copy_validate_indir().

> +		for (i = 0; i < rss_entries; i++)
> +			cfg->aq_rss.indirection_table[i] = indir[i];
> +	}
> +
> +	/* Fill out the rss hash key */
> +	if (key) {
> +		memcpy(cfg->aq_rss.hash_secret_key, key,
> +		       sizeof(cfg->aq_rss.hash_secret_key));
> +		err = aq_nic->aq_hw_ops->hw_rss_hash_set(aq_nic->aq_hw,
> +							 &cfg->aq_rss);
> +		if (err)
> +			return err;
> +	}
> +
> +	err = aq_nic->aq_hw_ops->hw_rss_set(aq_nic->aq_hw, &cfg->aq_rss);
> +
> +	return err;
> +}
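On the nit in this review: the ethtool core copies and range-checks the user's indirection table (ethtool_copy_validate_indir() rejects entries beyond the current RX ring count) before the driver's set_rxfh op runs, so the driver-side check could plausibly be dropped. A sketch of the simplified fill, assuming that core-side validation is sufficient here:

```c
	/* Fill out the redirection table; entries were already validated
	 * by the ethtool core (ethtool_copy_validate_indir()) against the
	 * current RX ring count, so no per-entry check is needed (sketch).
	 */
	if (indir) {
		for (i = 0; i < rss_entries; i++)
			cfg->aq_rss.indirection_table[i] = indir[i];
	}
```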
Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
On Thu, Dec 06, 2018 at 08:23:20PM +0100, Ard Biesheuvel wrote: > On Thu, 6 Dec 2018 at 20:21, Andy Lutomirski wrote: > > > > On Thu, Dec 6, 2018 at 11:04 AM Ard Biesheuvel > > wrote: > > > > > > On Thu, 6 Dec 2018 at 19:54, Andy Lutomirski wrote: > > > > > > > > > > That’s not totally nuts. Do we ever have code that expects __va() to > > > > work on module data? Perhaps crypto code trying to encrypt static > > > > data because our APIs don’t understand virtual addresses. I guess if > > > > highmem is ever used for modules, then we should be fine. > > > > > > > > > > The crypto code shouldn't care, but I think it will probably break > > > hibernate :-( > > > > How so? Hibernate works (or at least should work) on x86 PAE, where > > __va doesn't work on module data, and, on x86, the direct map has some > > RO parts with where the module is, so hibernate can't be writing to > > the memory through the direct map with its final permissions. > > On arm64 at least, hibernate reads the contents of memory via the > linear mapping. Not sure about other arches. Can we handle this like the DEBUG_PAGEALLOC case, and extract the pfn from the pte when we see that it's PROT_NONE? Will
Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
On Thu, 6 Dec 2018 at 20:21, Andy Lutomirski wrote: > > On Thu, Dec 6, 2018 at 11:04 AM Ard Biesheuvel > wrote: > > > > On Thu, 6 Dec 2018 at 19:54, Andy Lutomirski wrote: > > > > > > > That’s not totally nuts. Do we ever have code that expects __va() to > > > work on module data? Perhaps crypto code trying to encrypt static > > > data because our APIs don’t understand virtual addresses. I guess if > > > highmem is ever used for modules, then we should be fine. > > > > > > > The crypto code shouldn't care, but I think it will probably break > > hibernate :-( > > How so? Hibernate works (or at least should work) on x86 PAE, where > __va doesn't work on module data, and, on x86, the direct map has some > RO parts with where the module is, so hibernate can't be writing to > the memory through the direct map with its final permissions. On arm64 at least, hibernate reads the contents of memory via the linear mapping. Not sure about other arches.
Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
On Thu, Dec 6, 2018 at 11:04 AM Ard Biesheuvel wrote: > > On Thu, 6 Dec 2018 at 19:54, Andy Lutomirski wrote: > > > > That’s not totally nuts. Do we ever have code that expects __va() to > > work on module data? Perhaps crypto code trying to encrypt static > > data because our APIs don’t understand virtual addresses. I guess if > > highmem is ever used for modules, then we should be fine. > > > > The crypto code shouldn't care, but I think it will probably break hibernate > :-( How so? Hibernate works (or at least should work) on x86 PAE, where __va doesn't work on module data, and, on x86, the direct map has some RO parts with where the module is, so hibernate can't be writing to the memory through the direct map with its final permissions.
Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
On Thu, Dec 6, 2018 at 11:01 AM Tycho Andersen wrote: > > On Thu, Dec 06, 2018 at 10:53:50AM -0800, Andy Lutomirski wrote: > > > If we are going to unmap the linear alias, why not do it at vmalloc() > > > time rather than vfree() time? > > > > That’s not totally nuts. Do we ever have code that expects __va() to > > work on module data? Perhaps crypto code trying to encrypt static > > data because our APIs don’t understand virtual addresses. I guess if > > highmem is ever used for modules, then we should be fine. > > > > RO instead of not present might be safer. But I do like the idea of > > renaming Rick's flag to something like VM_XPFO or VM_NO_DIRECT_MAP and > > making it do all of this. > > Yeah, doing it for everything automatically seemed like it was/is > going to be a lot of work to debug all the corner cases where things > expect memory to be mapped but don't explicitly say it. And in > particular, the XPFO series only does it for user memory, whereas an > additional flag like this would work for extra paranoid allocations > of kernel memory too. > I just read the code, and it looks like vmalloc() is already using highmem (__GFP_HIGHMEM) if available, so, on big x86_32 systems, for example, we already don't have modules in the direct map. So I say we go for it. This should be quite simple to implement -- the pageattr code already has almost all the needed logic on x86. The only arch support we should need is a pair of functions to remove a vmalloc address range from the direct map (if it was present in the first place) and a function to put it back. On x86, this should only be a few lines of code. What do you all think? This should solve most of the problems we have.
If we really wanted to optimize this, we'd make it so that module_alloc() allocates memory the normal way, then, later on, we call some function that, all at once, removes the memory from the direct map and applies the right permissions to the vmalloc alias (or just makes the vmalloc alias not-present so we can add permissions later without flushing), and flushes the TLB. And we arrange for vunmap to zap the vmalloc range, then put the memory back into the direct map, then free the pages back to the page allocator, with the flush in the appropriate place. I don't see why the page allocator needs to know about any of this. It's already okay with the permissions being changed out from under it on x86, and it seems fine. Rick, do you want to give some variant of this a try?
[PATCH] vhost/vsock: fix reset orphans race with close timeout
If a local process has closed a connected socket and hasn't received a RST packet yet, then the socket remains in the table until a timeout expires.

When a vhost_vsock instance is released with the timeout still pending, the socket is never freed because vhost_vsock has already set the SOCK_DONE flag.

Check if the close timer is pending and let it close the socket. This prevents the race which can leak sockets.

Reported-by: Maximilian Riemensberger
Cc: Graham Whaley
Signed-off-by: Stefan Hajnoczi
---
 drivers/vhost/vsock.c | 22 +++++++++++++++-------
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 34bc3ab40c6d..731e2ea2aeca 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -563,13 +563,21 @@ static void vhost_vsock_reset_orphans(struct sock *sk)
 	 * executing.
 	 */
 
-	if (!vhost_vsock_get(vsk->remote_addr.svm_cid)) {
-		sock_set_flag(sk, SOCK_DONE);
-		vsk->peer_shutdown = SHUTDOWN_MASK;
-		sk->sk_state = SS_UNCONNECTED;
-		sk->sk_err = ECONNRESET;
-		sk->sk_error_report(sk);
-	}
+	/* If the peer is still valid, no need to reset connection */
+	if (vhost_vsock_get(vsk->remote_addr.svm_cid))
+		return;
+
+	/* If the close timeout is pending, let it expire. This avoids races
+	 * with the timeout callback.
+	 */
+	if (vsk->close_work_scheduled)
+		return;
+
+	sock_set_flag(sk, SOCK_DONE);
+	vsk->peer_shutdown = SHUTDOWN_MASK;
+	sk->sk_state = SS_UNCONNECTED;
+	sk->sk_err = ECONNRESET;
+	sk->sk_error_report(sk);
 }
 
 static int vhost_vsock_dev_release(struct inode *inode, struct file *file)
-- 
2.19.2
Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
On Thu, 6 Dec 2018 at 19:54, Andy Lutomirski wrote: > > > On Dec 5, 2018, at 11:29 PM, Ard Biesheuvel > > wrote: > > > >> On Thu, 6 Dec 2018 at 00:16, Andy Lutomirski wrote: > >> > >>> On Wed, Dec 5, 2018 at 3:41 AM Will Deacon wrote: > >>> > On Tue, Dec 04, 2018 at 12:09:49PM -0800, Andy Lutomirski wrote: > On Tue, Dec 4, 2018 at 12:02 PM Edgecombe, Rick P > wrote: > > > >> On Tue, 2018-12-04 at 16:03 +, Will Deacon wrote: > >> On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote: > On Nov 27, 2018, at 4:07 PM, Rick Edgecombe > > wrote: > > Since vfree will lazily flush the TLB, but not lazily free the > underlying > pages, > it often leaves stale TLB entries to freed pages that could get > re-used. > This is > undesirable for cases where the memory being freed has special > permissions > such > as executable. > >>> > >>> So I am trying to finish my patch-set for preventing transient W+X > >>> mappings > >>> from taking space, by handling kprobes & ftrace that I missed (thanks > >>> again > >>> for > >>> pointing it out). > >>> > >>> But all of the sudden, I don’t understand why we have the problem > >>> that this > >>> (your) patch-set deals with at all. We already change the mappings to > >>> make > >>> the memory writable before freeing the memory, so why can’t we make it > >>> non-executable at the same time? Actually, why do we make the module > >>> memory, > >>> including its data executable before freeing it??? > >> > >> Yeah, this is really confusing, but I have a suspicion it's a > >> combination > >> of the various different configurations and hysterical raisins. We > >> can't > >> rely on module_alloc() allocating from the vmalloc area (see nios2) nor > >> can we rely on disable_ro_nx() being available at build time. 
> >> > >> If we *could* rely on module allocations always using vmalloc(), then > >> we could pass in Rick's new flag and drop disable_ro_nx() altogether > >> afaict -- who cares about the memory attributes of a mapping that's > >> about > >> to disappear anyway? > >> > >> Is it just nios2 that does something different? > >> > > Yea it is really intertwined. I think for x86, set_memory_nx everywhere > > would > > solve it as well, in fact that was what I first thought the solution > > should be > > until this was suggested. It's interesting that from the other thread > > Masami > > Hiramatsu referenced, set_memory_nx was suggested last year and would > > have > > inadvertently blocked this on x86. But, on the other architectures I > > have since > > learned it is a bit different. > > > > It looks like actually most arch's don't re-define set_memory_*, and so > > all of > > the frob_* functions are actually just noops. In which case allocating > > RWX is > > needed to make it work at all, because that is what the allocation is > > going to > > stay at. So in these archs, set_memory_nx won't solve it because it > > will do > > nothing. > > > > On x86 I think you cannot get rid of disable_ro_nx fully because there > > is the > > changing of the permissions on the directmap as well. You don't want > > some other > > caller getting a page that was left RO when freed and then trying to > > write to > > it, if I understand this. > > > > Exactly. > >>> > >>> Of course, I forgot about the linear mapping. On arm64, we've just queued > >>> support for reflecting changes to read-only permissions in the linear map > >>> [1]. So, whilst the linear map is always non-executable, we will need to > >>> make parts of it writable again when freeing the module. > >>> > After slightly more thought, I suggest renaming VM_IMMEDIATE_UNMAP to > VM_MAY_ADJUST_PERMS or similar. 
It would have the semantics you want, > but it would also call some arch hooks to put back the direct map > permissions before the flush. Does that seem reasonable? It would > need to be hooked up that implement set_memory_ro(), but that should > be quite easy. If nothing else, it could fall back to set_memory_ro() > in the absence of a better implementation. > >>> > >>> You mean set_memory_rw() here, right? Although, eliding the TLB > >>> invalidation > >>> would open up a window where the vmap mapping is executable and the linear > >>> mapping is writable, which is a bit rubbish. > >>> > >> > >> Right, and Rick pointed out the same issue. Instead, we should set > >> the direct map not-present or its ARM equivalent, then do the flush, > >> then make it RW. I assume this also works on arm and arm64, although > >> I don't