[PATCH net] ipv6: sr: properly initialize flowi6 prior passing to ip6_route_output

2018-12-06 Thread Shmulik Ladkani
In 'seg6_output', stack variable 'struct flowi6 fl6' was missing
initialization.

Fixes: 6c8702c60b88 ("ipv6: sr: add support for SRH encapsulation and injection with lwtunnels")
Signed-off-by: Shmulik Ladkani 

---
Suggesting this fix, spotted during code review while experimenting
with SRv6, although I haven't encountered a specific issue during
the experiments.

Was there any genuine intention to actually keep 'fl6' uninitialized?
---

 net/ipv6/seg6_iptunnel.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv6/seg6_iptunnel.c b/net/ipv6/seg6_iptunnel.c
index a8854dd3e9c5..8181ee7e1e27 100644
--- a/net/ipv6/seg6_iptunnel.c
+++ b/net/ipv6/seg6_iptunnel.c
@@ -347,6 +347,7 @@ static int seg6_output(struct net *net, struct sock *sk, struct sk_buff *skb)
struct ipv6hdr *hdr = ipv6_hdr(skb);
struct flowi6 fl6;
 
+   memset(&fl6, 0, sizeof(fl6));
fl6.daddr = hdr->daddr;
fl6.saddr = hdr->saddr;
fl6.flowlabel = ip6_flowinfo(hdr);
-- 
2.19.1
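
For reference, a minimal sketch of what seg6_output() does once the fix is
applied (context paraphrased from the hunk above; the ip6_route_output()
call is an assumption about the surrounding code):

	struct flowi6 fl6;

	memset(&fl6, 0, sizeof(fl6));	/* zero flowi6_oif, flowi6_mark, uli, ... */
	fl6.daddr = hdr->daddr;
	fl6.saddr = hdr->saddr;
	fl6.flowlabel = ip6_flowinfo(hdr);

	/* every other flowi6 field now reads as 0 rather than stack garbage */
	dst = ip6_route_output(net, NULL, &fl6);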


Re: [PATCH 0/3]: net: dsa: mt7530: support MT7530 in the MT7621 SoC

2018-12-06 Thread Greg Ungerer

Hi John,

On 4/12/18 12:02 am, John Crispin wrote:

On 03/12/2018 15:00, René van Dorst wrote:

Quoting Bjørn Mork :

Greg Ungerer  writes:


The following change helped a lot, but I still get some problems under
sustained load and some types of port setups. Still trying to figure
out what exactly is going on.

--- a/linux/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/linux/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -1750,8 +1750,8 @@ static irqreturn_t mtk_handle_irq_rx(int irq, void *_eth)
   if (likely(napi_schedule_prep(&eth->rx_napi))) {
    __napi_schedule(&eth->rx_napi);
-   mtk_rx_irq_disable(eth, MTK_RX_DONE_INT);
    }
+   mtk_rx_irq_disable(eth, MTK_RX_DONE_INT);
   return IRQ_HANDLED;
 }
@@ -1762,11 +1762,53 @@ static irqreturn_t mtk_handle_irq_tx(int irq, void *_eth)
   if (likely(napi_schedule_prep(&eth->tx_napi))) {
    __napi_schedule(&eth->tx_napi);
-   mtk_tx_irq_disable(eth, MTK_TX_DONE_INT);
    }
+   mtk_tx_irq_disable(eth, MTK_TX_DONE_INT);
   return IRQ_HANDLED;
 }
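
A sketch of the RX handler that results from the change above (illustrative
simplification, not the submitted code): the RX-done source is masked
unconditionally rather than only when NAPI could be scheduled, so an
interrupt that fires while a poll is already pending cannot keep re-arming
itself; the NAPI poll routine re-enables the source when it is done.

	static irqreturn_t mtk_handle_irq_rx(int irq, void *_eth)
	{
		struct mtk_eth *eth = _eth;

		if (likely(napi_schedule_prep(&eth->rx_napi)))
			__napi_schedule(&eth->rx_napi);

		mtk_rx_irq_disable(eth, MTK_RX_DONE_INT);

		return IRQ_HANDLED;
	}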


Yes, sorry I didn't point to that as well.  Just to be clear:  I have no
clue how this thing is actually wired up, or if you could use three
interrupts on the MT7621 too. I just messed with it until I got
something to work, based on René's original idea and code.


My idea is just a copy of mtk_handle_irq_{rx,tx}, see [1].
You probably want to look at the staging driver or the Ubiquiti source with a
3.10.x kernel [2] or padavan with a 3.4.x kernel [3].
AFAIK mt7621 only has 1 IRQ for the ethernet part.


Correct, there is only a single IRQ on mt7621.


One of the main differences I see between the mainline mtk_eth_soc.c
and the older mediatek/openwrt driver is that the older driver uses
the PDMA module for TX transmission, while the mainline uses the
QDMA module. I have no documentation on what the differences
are between the two (or why there are even two DMA engines in there?).
Can you shed any light on that?

I did a quick and dirty recode of the QDMA transmission parts of
the mainline driver code to use the PDMA engine instead. The most
immediate result is that it suffers the same IP header checksum
problem on TX packets :-(  But it also still suffers from the
same occasional TX watchdog timeout I see with only the mainline
driver and basic support of MT7621.

What I see with the TX watchdog timeouts is that there are valid
TX descriptors, but the frame engine is just not processing them.
It seems to be just sitting there idle. The CTX and DTX registers
look valid and consistent with the local last_free/next_free
pointers.

Regards
Greg



Re: [PATCH AUTOSEL 4.19 040/123] bpf: allocate local storage buffers using GFP_ATOMIC

2018-12-06 Thread Naresh Kamboju
On Fri, 7 Dec 2018 at 12:10, Naresh Kamboju  wrote:
>
> On Wed, 5 Dec 2018 at 15:08, Sasha Levin  wrote:
> >
> > From: Roman Gushchin 
> >
> > [ Upstream commit 569a933b03f3c48b392fe67c0086b3a6b9306b5a ]
> >
> > Naresh reported an issue with the non-atomic memory allocation of
> > cgroup local storage buffers:
> >
> > [   73.047526] BUG: sleeping function called from invalid context at
> > /srv/oe/build/tmp-rpb-glibc/work-shared/intel-corei7-64/kernel-source/mm/slab.h:421
> > [   73.060915] in_atomic(): 1, irqs_disabled(): 0, pid: 3157, name: 
> > test_cgroup_sto
> > [   73.068342] INFO: lockdep is turned off.
> > [   73.072293] CPU: 2 PID: 3157 Comm: test_cgroup_sto Not tainted
> > 4.20.0-rc2-next-20181113 #1
> > [   73.080548] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> > 2.0b 07/27/2017
> > [   73.088018] Call Trace:
> > [   73.090463]  dump_stack+0x70/0xa5
> > [   73.093783]  ___might_sleep+0x152/0x240
> > [   73.097619]  __might_sleep+0x4a/0x80
> > [   73.101191]  __kmalloc_node+0x1cf/0x2f0
> > [   73.105031]  ? cgroup_storage_update_elem+0x46/0x90
> > [   73.109909]  cgroup_storage_update_elem+0x46/0x90
> >
> > cgroup_storage_update_elem() (as well as other map update
> > callbacks) is called with disabled preemption, so GFP_ATOMIC
> > allocation should be used: e.g. alloc_htab_elem() in hashtab.c.
> >
> > Reported-by: Naresh Kamboju 
> > Tested-by: Naresh Kamboju 
> > Signed-off-by: Roman Gushchin 
> > Cc: Alexei Starovoitov 
> > Cc: Daniel Borkmann 
> > Signed-off-by: Alexei Starovoitov 
> > Signed-off-by: Sasha Levin 
>
> I reported the above issue on 4.20.0-rc2-next-20181113.
> Now this BUG is re-occurring on 4.19.8-rc1 on x86_64 and arm64 devices.

This BUG was also seen on 4.19.1-rc1, on x86_64 and arm64 devices.

- Naresh


Re: [PATCH AUTOSEL 4.19 040/123] bpf: allocate local storage buffers using GFP_ATOMIC

2018-12-06 Thread Naresh Kamboju
On Wed, 5 Dec 2018 at 15:08, Sasha Levin  wrote:
>
> From: Roman Gushchin 
>
> [ Upstream commit 569a933b03f3c48b392fe67c0086b3a6b9306b5a ]
>
> Naresh reported an issue with the non-atomic memory allocation of
> cgroup local storage buffers:
>
> [   73.047526] BUG: sleeping function called from invalid context at
> /srv/oe/build/tmp-rpb-glibc/work-shared/intel-corei7-64/kernel-source/mm/slab.h:421
> [   73.060915] in_atomic(): 1, irqs_disabled(): 0, pid: 3157, name: 
> test_cgroup_sto
> [   73.068342] INFO: lockdep is turned off.
> [   73.072293] CPU: 2 PID: 3157 Comm: test_cgroup_sto Not tainted
> 4.20.0-rc2-next-20181113 #1
> [   73.080548] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> 2.0b 07/27/2017
> [   73.088018] Call Trace:
> [   73.090463]  dump_stack+0x70/0xa5
> [   73.093783]  ___might_sleep+0x152/0x240
> [   73.097619]  __might_sleep+0x4a/0x80
> [   73.101191]  __kmalloc_node+0x1cf/0x2f0
> [   73.105031]  ? cgroup_storage_update_elem+0x46/0x90
> [   73.109909]  cgroup_storage_update_elem+0x46/0x90
>
> cgroup_storage_update_elem() (as well as other map update
> callbacks) is called with disabled preemption, so GFP_ATOMIC
> allocation should be used: e.g. alloc_htab_elem() in hashtab.c.
>
> Reported-by: Naresh Kamboju 
> Tested-by: Naresh Kamboju 
> Signed-off-by: Roman Gushchin 
> Cc: Alexei Starovoitov 
> Cc: Daniel Borkmann 
> Signed-off-by: Alexei Starovoitov 
> Signed-off-by: Sasha Levin 

I reported the above issue on 4.20.0-rc2-next-20181113.
Now this BUG is re-occurring on 4.19.8-rc1 on x86_64 and arm64 devices.

[   70.288592] BUG: sleeping function called from invalid context at
/srv/oe/build/tmp-rpb-glibc/work-shared/intel-corei7-64/kernel-source/mm/slab.h:421
[   70.301992] in_atomic(): 1, irqs_disabled(): 0, pid: 3001, name:
test_cgroup_sto
[   70.309424] INFO: lockdep is turned off.
[   70.313416] CPU: 0 PID: 3001 Comm: test_cgroup_sto Not tainted 4.19.8-rc1 #1
[   70.320483] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
2.0b 07/27/2017
[   70.327953] Call Trace:
[   70.330402]  dump_stack+0x70/0xa5
[   70.333765]  ___might_sleep+0x152/0x240
[   70.337599]  __might_sleep+0x4a/0x80
[   70.341169]  __kmalloc_node+0x1d1/0x300
[   70.345003]  ? cgroup_storage_update_elem+0x46/0x90
[   70.349881]  cgroup_storage_update_elem+0x46/0x90
[   70.354585]  map_update_elem+0x1fd/0x450
[   70.358504]  __x64_sys_bpf+0x129/0x270
[   70.362258]  do_syscall_64+0x55/0x190
[   70.365923]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   70.370974] RIP: 0033:0x7f42e0ebb969
[   70.374544] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00
00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24
08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d df e4 2b 00 f7 d8 64 89
01 48
[   70.393281] RSP: 002b:7ffde61a0a08 EFLAGS: 0202 ORIG_RAX:
0141
[   70.400845] RAX: ffda RBX: 0003 RCX: 7f42e0ebb969
[   70.407971] RDX: 0048 RSI: 7ffde61a0a50 RDI: 0002
[   70.415094] RBP: 7ffde61a0a20 R08: 7ffde61a0a50 R09: 7ffde61a0a50
[   70.422216] R10: 7ffde61a0a50 R11: 0202 R12: 0005
[   70.429342] R13: 7ffde61a0c10 R14:  R15: 
selftests: bpf: test_cgroup_storage

Full test log links,
arm64 Juno
https://lkft.validation.linaro.org/scheduler/job/537820#L2971

x86_64 Supermicro SYS-5019S-ML/X11SSH-F
https://lkft.validation.linaro.org/scheduler/job/537772#L2724

Best regards
Naresh Kamboju

> ---
>  kernel/bpf/local_storage.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c
> index 830d7f095748..fc1605aee5ea 100644
> --- a/kernel/bpf/local_storage.c
> +++ b/kernel/bpf/local_storage.c
> @@ -138,7 +138,8 @@ static int cgroup_storage_update_elem(struct bpf_map 
> *map, void *_key,
> return -ENOENT;
>
> new = kmalloc_node(sizeof(struct bpf_storage_buffer) +
> -  map->value_size, __GFP_ZERO | GFP_USER,
> +  map->value_size,
> +  __GFP_ZERO | GFP_ATOMIC | __GFP_NOWARN,
>map->numa_node);
> if (!new)
> return -ENOMEM;
> --
> 2.17.1
>
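
For context, a hedged sketch of the allocation-context rule the fix applies
(the helper is hypothetical; the GFP mask matches the hunk above, and the
commit message points at alloc_htab_elem() in hashtab.c as the precedent):

	/* Map update callbacks run with preemption disabled, so they must not
	 * sleep: GFP_USER may sleep, GFP_ATOMIC never does.
	 */
	static void *alloc_storage_buf(struct bpf_map *map)
	{
		return kmalloc_node(sizeof(struct bpf_storage_buffer) +
				    map->value_size,
				    __GFP_ZERO | GFP_ATOMIC | __GFP_NOWARN,
				    map->numa_node);
	}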


[PATCHv2 net 3/3] sctp: fa_resize sctp stream instead of redo fa_alloc

2018-12-06 Thread Xin Long
Now when doing the 4-way handshake or adding new streams, sctp has to
allocate new memory for asoc->stream and copy the old stream's information
from the old asoc->stream to the new one. It also causes the stream
pointers to change, which even caused a panic due to stream->out_curr's
change.

To fix this, flex_array_resize() is used in sctp_stream_alloc_out/in()
when asoc->stream has already been allocated. Besides, with this change
asoc->stream is only allocated once, and grows or shrinks dynamically
later.

Note that flex_array_prealloc() is needed before growing, as fa_alloc
does, while flex_array_clear() and flex_array_shrink() are called to
free the unused memory before shrinking.

Fixes: 5e32a431 ("sctp: introduce stream scheduler foundations")
Reported-by: Ying Xu 
Reported-by: syzbot+e33a3a138267ca119...@syzkaller.appspotmail.com
Suggested-by: Neil Horman 
Signed-off-by: Xin Long 
Acked-by: Neil Horman 
---
 net/sctp/stream.c | 87 +--
 1 file changed, 40 insertions(+), 47 deletions(-)

diff --git a/net/sctp/stream.c b/net/sctp/stream.c
index 3892e76..aff30b2 100644
--- a/net/sctp/stream.c
+++ b/net/sctp/stream.c
@@ -37,6 +37,17 @@
 #include 
 #include 
 
+static void fa_zero(struct flex_array *fa, size_t index, size_t count)
+{
+   void *elem;
+
+   while (count--) {
+   elem = flex_array_get(fa, index);
+   memset(elem, 0, fa->element_size);
+   index++;
+   }
+}
+
 static struct flex_array *fa_alloc(size_t elem_size, size_t elem_count,
   gfp_t gfp)
 {
@@ -48,8 +59,9 @@ static struct flex_array *fa_alloc(size_t elem_size, size_t elem_count,
err = flex_array_prealloc(result, 0, elem_count, gfp);
if (err) {
flex_array_free(result);
-   result = NULL;
+   return NULL;
}
+   fa_zero(result, 0, elem_count);
}
 
return result;
@@ -61,27 +73,28 @@ static void fa_free(struct flex_array *fa)
flex_array_free(fa);
 }
 
-static void fa_copy(struct flex_array *fa, struct flex_array *from,
-   size_t index, size_t count)
+static int fa_resize(struct flex_array *fa, size_t count, gfp_t gfp)
 {
-   void *elem;
+   int nr = fa->total_nr_elements, n;
 
-   while (count--) {
-   elem = flex_array_get(from, index);
-   flex_array_put(fa, index, elem, 0);
-   index++;
+   if (count > nr) {
+   if (flex_array_resize(fa, count, gfp))
+   return -ENOMEM;
+   if (flex_array_prealloc(fa, nr, count - nr, gfp))
+   return -ENOMEM;
+   fa_zero(fa, nr, count - nr);
+
+   return 0;
}
-}
 
-static void fa_zero(struct flex_array *fa, size_t index, size_t count)
-{
-   void *elem;
+   /* Shrink the unused memory,
+* FLEX_ARRAY_FREE check is safe for sctp stream.
+*/
+   for (n = count; n < nr; n++)
+   flex_array_clear(fa, n);
+   flex_array_shrink(fa);
 
-   while (count--) {
-   elem = flex_array_get(fa, index);
-   memset(elem, 0, fa->element_size);
-   index++;
-   }
+   return flex_array_resize(fa, count, gfp);
 }
 
 /* Migrates chunks from stream queues to new stream queues if needed,
@@ -138,47 +151,27 @@ static void sctp_stream_outq_migrate(struct sctp_stream *stream,
 static int sctp_stream_alloc_out(struct sctp_stream *stream, __u16 outcnt,
 gfp_t gfp)
 {
-   struct flex_array *out;
-   size_t elem_size = sizeof(struct sctp_stream_out);
-
-   out = fa_alloc(elem_size, outcnt, gfp);
-   if (!out)
-   return -ENOMEM;
+   if (!stream->out) {
+   stream->out = fa_alloc(sizeof(struct sctp_stream_out),
+  outcnt, gfp);
 
-   if (stream->out) {
-   fa_copy(out, stream->out, 0, min(outcnt, stream->outcnt));
-   fa_free(stream->out);
+   return stream->out ? 0 : -ENOMEM;
}
 
-   if (outcnt > stream->outcnt)
-   fa_zero(out, stream->outcnt, (outcnt - stream->outcnt));
-
-   stream->out = out;
-
-   return 0;
+   return fa_resize(stream->out, outcnt, gfp);
 }
 
 static int sctp_stream_alloc_in(struct sctp_stream *stream, __u16 incnt,
gfp_t gfp)
 {
-   struct flex_array *in;
-   size_t elem_size = sizeof(struct sctp_stream_in);
+   if (!stream->in) {
+   stream->in = fa_alloc(sizeof(struct sctp_stream_in),
+ incnt, gfp);
 
-   in = fa_alloc(elem_size, incnt, gfp);
-   if (!in)
-   return -ENOMEM;
-
-   if (stream->in) {
-   fa_copy(in, stream->in, 0, min(incnt, stream->incnt));
-   fa_

[PATCHv2 net 1/3] flex_array: make FLEX_ARRAY_BASE_SIZE the same value of FLEX_ARRAY_PART_SIZE

2018-12-06 Thread Xin Long
This patch separates the base data memory from struct flex_array and
saves it into a page. With this change, total_nr_elements of a flex_array
can grow or shrink without the old elements' memory changing when the
new size of the flex_array crosses FLEX_ARRAY_BASE_SIZE; the resize
support itself will be added in the next patch.

Suggested-by: Neil Horman 
Signed-off-by: Xin Long 
Acked-by: Neil Horman 
---
 include/linux/flex_array.h | 29 +
 lib/flex_array.c   | 15 ---
 2 files changed, 21 insertions(+), 23 deletions(-)

diff --git a/include/linux/flex_array.h b/include/linux/flex_array.h
index b94fa61..29ad65f 100644
--- a/include/linux/flex_array.h
+++ b/include/linux/flex_array.h
@@ -7,9 +7,10 @@
 #include 
 
 #define FLEX_ARRAY_PART_SIZE PAGE_SIZE
-#define FLEX_ARRAY_BASE_SIZE PAGE_SIZE
+#define FLEX_ARRAY_BASE_SIZE FLEX_ARRAY_PART_SIZE
 
 struct flex_array_part;
+struct flex_array_part_p;
 
 /*
  * This is meant to replace cases where an array-like
@@ -19,29 +20,17 @@ struct flex_array_part;
  */
 
 struct flex_array {
-   union {
-   struct {
-   int element_size;
-   int total_nr_elements;
-   int elems_per_part;
-   struct reciprocal_value reciprocal_elems;
-   struct flex_array_part *parts[];
-   };
-   /*
-* This little trick makes sure that
-* sizeof(flex_array) == PAGE_SIZE
-*/
-   char padding[FLEX_ARRAY_BASE_SIZE];
-   };
+   int element_size;
+   int total_nr_elements;
+   int elems_per_part;
+   struct reciprocal_value reciprocal_elems;
+   struct flex_array_part_p *part_p;
+#define parts part_p->p_part
 };
 
-/* Number of bytes left in base struct flex_array, excluding metadata */
-#define FLEX_ARRAY_BASE_BYTES_LEFT \
-   (FLEX_ARRAY_BASE_SIZE - offsetof(struct flex_array, parts))
-
 /* Number of pointers in base to struct flex_array_part pages */
 #define FLEX_ARRAY_NR_BASE_PTRS
\
-   (FLEX_ARRAY_BASE_BYTES_LEFT / sizeof(struct flex_array_part *))
+   (FLEX_ARRAY_BASE_SIZE / sizeof(struct flex_array_part *))
 
 /* Number of elements of size that fit in struct flex_array_part */
 #define FLEX_ARRAY_ELEMENTS_PER_PART(size) \
diff --git a/lib/flex_array.c b/lib/flex_array.c
index 2eed22f..8c0b9b6 100644
--- a/lib/flex_array.c
+++ b/lib/flex_array.c
@@ -30,6 +30,10 @@ struct flex_array_part {
char elements[FLEX_ARRAY_PART_SIZE];
 };
 
+struct flex_array_part_p {
+   struct flex_array_part *p_part[FLEX_ARRAY_NR_BASE_PTRS];
+};
+
 /*
  * If a user requests an allocation which is small
  * enough, we may simply use the space in the
@@ -39,7 +43,7 @@ struct flex_array_part {
 static inline int elements_fit_in_base(struct flex_array *fa)
 {
int data_size = fa->element_size * fa->total_nr_elements;
-   if (data_size <= FLEX_ARRAY_BASE_BYTES_LEFT)
+   if (data_size <= FLEX_ARRAY_BASE_SIZE)
return 1;
return 0;
 }
@@ -105,13 +109,17 @@ struct flex_array *flex_array_alloc(int element_size, unsigned int total,
ret = kzalloc(sizeof(struct flex_array), flags);
if (!ret)
return NULL;
+   ret->part_p = kzalloc(sizeof(struct flex_array_part_p), flags);
+   if (!ret->part_p) {
+   kfree(ret);
+   return NULL;
+   }
ret->element_size = element_size;
ret->total_nr_elements = total;
ret->elems_per_part = elems_per_part;
ret->reciprocal_elems = reciprocal_elems;
if (elements_fit_in_base(ret) && !(flags & __GFP_ZERO))
-   memset(&ret->parts[0], FLEX_ARRAY_FREE,
-   FLEX_ARRAY_BASE_BYTES_LEFT);
+   memset(&ret->parts[0], FLEX_ARRAY_FREE, FLEX_ARRAY_BASE_SIZE);
return ret;
 }
 EXPORT_SYMBOL(flex_array_alloc);
@@ -148,6 +156,7 @@ EXPORT_SYMBOL(flex_array_free_parts);
 void flex_array_free(struct flex_array *fa)
 {
flex_array_free_parts(fa);
+   kfree(fa->part_p);
kfree(fa);
 }
 EXPORT_SYMBOL(flex_array_free);
-- 
2.1.0



[PATCHv2 net 0/3] net: add support for flex_array_resize in flex_array

2018-12-06 Thread Xin Long
Without support for growing or shrinking total_nr_elements dynamically,
flex_array is not that 'flexible'. When users want to change the size,
they have to redo flex_array_alloc and copy all the elements from the
old array to the new one.  Worse, every element's memory address gets
changed.

To implement flex_array_resize based on current code, the difficult
thing is to process the size border of FLEX_ARRAY_BASE_BYTES_LEFT,
where the base data memory may change to an array for the 2nd level
data memory for growing, likewise for shrinking.

To make this part easier, we separate the base data memory and define
FLEX_ARRAY_BASE_SIZE to the same value as FLEX_ARRAY_PART_SIZE, as Neil
suggested.  When the new size crosses the border, the base memory is
allocated as the array for the 2nd level data memory, with its part[0]
pointing to the old base memory; the opposite is done for shrinking.

Note that flex_array_resize itself doesn't do any memory allocation or
shrinking for elements; that should be done by flex_array_prealloc or
flex_array_shrink, called by users.  No memory leaks can be caused by
that.

SCTP has benefited a lot from flex_array_resize() for managing its
stream memory so far.
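
Concretely, a caller-side sketch of the discipline the series expects
(names follow the API added in patch 2; error handling abbreviated):

	/* grow from old_nr to new_nr elements */
	if (flex_array_resize(fa, new_nr, GFP_KERNEL))
		return -ENOMEM;
	if (flex_array_prealloc(fa, old_nr, new_nr - old_nr, GFP_KERNEL))
		return -ENOMEM;

	/* shrink from old_nr down to new_nr elements */
	for (n = new_nr; n < old_nr; n++)
		flex_array_clear(fa, n);	/* mark slots FLEX_ARRAY_FREE */
	flex_array_shrink(fa);			/* free fully-unused parts */
	return flex_array_resize(fa, new_nr, GFP_KERNEL);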

v1->v2:
  Cc LKML and more developers.

Xin Long (3):
  flex_array: make FLEX_ARRAY_BASE_SIZE the same value of
FLEX_ARRAY_PART_SIZE
  flex_array: support flex_array_resize
  sctp: fa_resize sctp stream instead of redo fa_alloc

 include/linux/flex_array.h | 40 ++---
 lib/flex_array.c   | 73 --
 net/sctp/stream.c  | 87 +-
 3 files changed, 130 insertions(+), 70 deletions(-)

-- 
2.1.0



[PATCHv2 net 2/3] flex_array: support flex_array_resize

2018-12-06 Thread Xin Long
This function can dynamically change total_nr_elements of a flex_array
while keeping the old elements at the same memory addresses. It returns
0 if it succeeds.

Note that it won't do any memory allocation or shrinking for elements;
that should only be done by flex_array_prealloc and flex_array_shrink.

Suggested-by: Neil Horman 
Signed-off-by: Xin Long 
Acked-by: Neil Horman 
---
 include/linux/flex_array.h | 11 +
 lib/flex_array.c   | 58 ++
 2 files changed, 69 insertions(+)

diff --git a/include/linux/flex_array.h b/include/linux/flex_array.h
index 29ad65f..19ff58d 100644
--- a/include/linux/flex_array.h
+++ b/include/linux/flex_array.h
@@ -130,6 +130,17 @@ void *flex_array_get(struct flex_array *fa, unsigned int element_nr);
  */
 int flex_array_shrink(struct flex_array *fa);
 
+/**
+ * flex_array_resize() - Resize without the old elements memory changed
+ * @fa:array to resize
+ * @total: total number of elements that this would change to
+ * @flags: page allocation flags to use for base array
+ *
+ * Return: Returns 0 if it succeeds.
+ *
+ */
+int flex_array_resize(struct flex_array *fa, unsigned int total, gfp_t flags);
+
 #define flex_array_put_ptr(fa, nr, src, gfp) \
flex_array_put(fa, nr, (void *)&(src), gfp)
 
diff --git a/lib/flex_array.c b/lib/flex_array.c
index 8c0b9b6..2f913e7 100644
--- a/lib/flex_array.c
+++ b/lib/flex_array.c
@@ -405,3 +405,61 @@ int flex_array_shrink(struct flex_array *fa)
return ret;
 }
 EXPORT_SYMBOL(flex_array_shrink);
+
+/**
+ * flex_array_resize - resize without the old elements memory changed
+ * @fa:     the flex array to resize
+ * @total: total number of elements that this would change to
+ * @flags: page allocation flags to use for base array
+ *
+ * This function can dynamically change total_nr_elements of a flex_array,
+ * and keep the old elements of the same memory. Returns 0 if it succeeds.
+ * Note that it won't do any memory allocation or shrinking for elements,
+ * which should be only done by flex_array_prealloc and flex_array_shrink.
+ *
+ * Locking must be provided by the caller.
+ */
+int flex_array_resize(struct flex_array *fa, unsigned int total, gfp_t flags)
+{
+   int nr;
+
+   if (total > FLEX_ARRAY_NR_BASE_PTRS * fa->elems_per_part)
+   return -EINVAL;
+
+   if (elements_fit_in_base(fa)) {
+   struct flex_array_part_p *part_p;
+
+   nr = fa->total_nr_elements;
+   fa->total_nr_elements = total;
+   if (elements_fit_in_base(fa))
+   return 0;
+
+   part_p = kzalloc(sizeof(*part_p), flags);
+   if (!part_p) {
+   fa->total_nr_elements = nr;
+   return -ENOMEM;
+   }
+
+   part_p->p_part[0] = (struct flex_array_part *)&fa->parts[0];
+   fa->part_p = part_p;
+   } else {
+   struct flex_array_part *part;
+
+   fa->total_nr_elements = total;
+   if (!elements_fit_in_base(fa))
+   return 0;
+
+   for (nr = 1; nr < FLEX_ARRAY_NR_BASE_PTRS; nr++) {
+   part = fa->parts[nr];
+   if (part) {
+   fa->parts[nr] = NULL;
+   kfree(part);
+   }
+   }
+
+   fa->part_p = (struct flex_array_part_p *)fa->parts[0];
+   }
+
+   return 0;
+}
+EXPORT_SYMBOL(flex_array_resize);
-- 
2.1.0



Re: [PATCH net-next v2 0/4] net: mitigate retpoline overhead

2018-12-06 Thread David Miller
From: David Miller 
Date: Thu, 06 Dec 2018 22:24:09 -0800 (PST)

> Series applied, thanks!

Erm... actually reverted.  Please fix these build failures:

ld: net/ipv6/ip6_offload.o: in function `ipv6_gro_receive':
ip6_offload.c:(.text+0xda2): undefined reference to `udp6_gro_receive'
ld: ip6_offload.c:(.text+0xdb6): undefined reference to `udp6_gro_receive'
ld: net/ipv6/ip6_offload.o: in function `ipv6_gro_complete':
ip6_offload.c:(.text+0x1953): undefined reference to `udp6_gro_complete'
ld: ip6_offload.c:(.text+0x1966): undefined reference to `udp6_gro_complete'
make: *** [Makefile:1036: vmlinux] Error 1

Thanks.


Re: [PATCH net-next v2 0/4] net: mitigate retpoline overhead

2018-12-06 Thread David Miller
From: Paolo Abeni 
Date: Wed,  5 Dec 2018 19:13:38 +0100

> The spectre v2 counter-measures, aka retpolines, are a source of measurable
> overhead[1]. We can partially address that when the function pointer refers
> to a builtin symbol, by resorting to a list of tests against well-known
> builtin functions and direct calls.
> 
> Experimental results show that replacing a single indirect call via
> retpoline with several branches and a direct call gives performance gains
> even when multiple branches are added - 5 or more, as reported in [2].
> 
> This may lead to some uglification around the indirect calls. In netconf 2018
> Eric Dumazet described a technique to hide the most relevant part of the
> needed boilerplate with some macro help.
> 
> This series is a [re-]implementation of such idea, exposing the introduced
> helpers in a new header file. They are later leveraged to avoid the indirect
> call overhead in the GRO path, when possible.
> 
> Overall this gives > 10% performance improvement for UDP GRO benchmark and
> smaller but measurable for TCP syn flood.
> 
> The added infra can be used in follow-up patches to cope with retpoline
> overhead in other points of the networking stack (e.g. at the qdisc layer)
> and possibly even in other subsystems.
 ...

Series applied, thanks!
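
Schematically, the helper shape the cover letter describes looks like this
(a hedged sketch; the macro name and call site are illustrative, not
necessarily the series' exact API):

	#ifdef CONFIG_RETPOLINE
	/* compare against a well-known builtin and branch to a direct call */
	#define INDIRECT_CALL_1(f, f1, ...) \
		(likely((f) == f1) ? f1(__VA_ARGS__) : (f)(__VA_ARGS__))
	#else
	#define INDIRECT_CALL_1(f, f1, ...) (f)(__VA_ARGS__)
	#endif

	/* e.g. at the GRO layer - the undefined udp6_gro_receive references
	 * quoted earlier in this thread come from a call site of this shape:
	 *	INDIRECT_CALL_1(ops->callbacks.gro_receive,
	 *			udp6_gro_receive, head, skb);
	 */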


Re: [PATCH rdma-next 0/3] Packet based credit mode

2018-12-06 Thread Leon Romanovsky
On Thu, Dec 06, 2018 at 08:27:06PM -0700, Jason Gunthorpe wrote:
> On Fri, Nov 30, 2018 at 01:22:03PM +0200, Leon Romanovsky wrote:
> > From: Leon Romanovsky 
> >
> > From Danit,
> >
> > Packet based credit mode is an alternative end-to-end credit mode for QPs
> > set during their creation. Credits are transported from the responder
> > to the requester to optimize the use of its receive resources.
> > In packet-based credit mode, credits are issued on a per packet basis.
> >
> > The advantage of this feature comes while sending large RDMA messages
> > through switches that are short in memory.
> >
> > The first commit exposes the QP creation flag and the HCA capability. The
> > second commit adds support for a new DV QP creation flag. The last
> > commit reports the packet based credit mode capability via the MLX5DV
> > device capabilities.
> >
> > Thanks
> >
> > Danit Goldberg (3):
> >   net/mlx5: Expose packet based credit mode
> >   IB/mlx5: Add packet based credit mode support
> >   IB/mlx5: Report packet based credit mode device capability
>
> This looks fine to me, can you update the shared branch please

Done, thanks
3fd3c80acc17 net/mlx5: Expose packet based credit mode

>
> Thanks,
> Jason




[PATCH v3] ptp: fix an IS_ERR() vs NULL check

2018-12-06 Thread Dan Carpenter
We recently modified pps_register_source() to return error pointers
instead of NULL but it seems like there was a merge issue and part of
the commit was lost.  Anyway, the ptp_clock_register() function needs to
be updated to check for IS_ERR() as well.

Fixes: 3b1ad360acad ("pps: using ERR_PTR instead of NULL while pps_register_source fails")
Signed-off-by: Dan Carpenter 
---
v2:  Use the correct Fixes tag.
v3:  Add Greg to the CC list for real this time.

 drivers/ptp/ptp_clock.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/ptp/ptp_clock.c b/drivers/ptp/ptp_clock.c
index 8a81eecc0ecd..48f3594a7458 100644
--- a/drivers/ptp/ptp_clock.c
+++ b/drivers/ptp/ptp_clock.c
@@ -265,8 +265,8 @@ struct ptp_clock *ptp_clock_register(struct ptp_clock_info *info,
pps.mode = PTP_PPS_MODE;
pps.owner = info->owner;
ptp->pps_source = pps_register_source(&pps, PTP_PPS_DEFAULTS);
-   if (!ptp->pps_source) {
-   err = -EINVAL;
+   if (IS_ERR(ptp->pps_source)) {
+   err = PTR_ERR(ptp->pps_source);
pr_err("failed to register pps source\n");
goto no_pps;
}
-- 
2.11.0
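
Schematically, the convention involved, shown as a hedged sketch of both
sides (not the exact upstream code):

	/* producer (what 3b1ad360acad made pps_register_source() do):
	 * encode the errno in the returned pointer instead of NULL */
	if (err)
		return ERR_PTR(err);		/* was: return NULL; */

	/* consumer (this patch): test with IS_ERR(), propagate PTR_ERR() */
	ptp->pps_source = pps_register_source(&pps, PTP_PPS_DEFAULTS);
	if (IS_ERR(ptp->pps_source)) {
		err = PTR_ERR(ptp->pps_source);	/* the real errno, not -EINVAL */
		goto no_pps;
	}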


Re: [PATCH net-next] rhashtable: further improve stability of rhashtable_walk

2018-12-06 Thread Herbert Xu
On Wed, Dec 05, 2018 at 02:51:02PM +1100, NeilBrown wrote:
> 
> If the sequence:
>obj = rhashtable_walk_next(iter);
>rhashtable_walk_stop(iter);
>rhashtable_remove_fast(ht, &obj->head, params);
>rhashtable_walk_start(iter);
> 
>  races with another thread inserting or removing
>  an object on the same hash chain, a subsequent
>  rhashtable_walk_next() is not guaranteed to get the "next"
>  object. It is possible that an object could be
>  repeated, or missed.
> 
>  This can be made more reliable by keeping the objects in a hash chain
>  sorted by memory address.  A subsequent rhashtable_walk_next()
>  call can reliably find the correct position in the list, and thus
>  find the 'next' object.
> 
>  It is not possible to take this approach with an rhltable as keeping
>  the hash chain in order is not so easy.  When the first object with a
>  given key is removed, it is replaced in the chain with the next
>  object with the same key, and the address of that object may not be
>  correctly ordered.
>  I have not yet found any way to achieve the same stability
>  with rhltables, that doesn't have a major impact on lookup
>  or insert.  No code currently in Linux would benefit from
>  such extra stability.
> 
>  With this patch:
>  - a new object is always inserted after the last object with a
>smaller address, or at the start.
>  - when rhashtable_walk_start() is called, it records that 'p' is not
>'safe', meaning that it cannot be dereferenced.  The revalidation
>that was previously done here is moved to rhashtable_walk_next()
>  - when rhashtable_walk_next() is called while p is not NULL and not
>safe, it walks the chain looking for the first object with an
>address greater than p and returns that.  If there is none, it moves
>to the next hash chain.
> 
> Signed-off-by: NeilBrown 
> ---
> 
> This is a resend of a patch that I sent back in July.  It couldn't be
> applied then because it assumed another rhashtable patch which hadn't
> landed yet - it now has.
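
A minimal sketch of the ordering invariant the quoted patch relies on, on a
plain singly linked list rather than the rhashtable internals (illustrative
only):

	struct node { struct node *next; };

	/* Insert keeping the chain sorted by object address, so "the first
	 * node with an address greater than p" stays well defined even after
	 * p itself has been removed while the walk was stopped.
	 */
	static void insert_sorted(struct node **head, struct node *obj)
	{
		while (*head && *head < obj)	/* compare memory addresses */
			head = &(*head)->next;
		obj->next = *head;
		*head = obj;
	}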

I thought we had agreed to drop this because nobody needs it
currently and it doesn't handle rhlist?

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


linux-next: manual merge of the rcu tree with the net-next tree

2018-12-06 Thread Stephen Rothwell
Hi all,

Today's linux-next merge of the rcu tree got conflicts in:

  net/bridge/br_mdb.c
  net/bridge/br_multicast.c

between commits:

  19e3a9c90c53 ("net: bridge: convert multicast to generic rhashtable")
  4329596cb10d ("net: bridge: multicast: use non-bh rcu flavor")

from the net-next tree and commit:

  1a56f7d53b5c ("net/bridge: Replace call_rcu_bh() and rcu_barrier_bh()")

from the rcu tree.

I fixed it up (I used the versions from net-next) and can carry the fix
as necessary. This is now fixed as far as linux-next is concerned, but
any non trivial conflicts should be mentioned to your upstream maintainer
when your tree is submitted for merging.  You may also want to consider
cooperating with the maintainer of the conflicting tree to minimise any
particularly complex conflicts.

-- 
Cheers,
Stephen Rothwell




Re: [PATCH bpf] selftests/bpf: add missing pointer dereference for map stacktrace fixup

2018-12-06 Thread Prashant Bhole

On 12/7/2018 1:14 PM, Stanislav Fomichev wrote:

I get a segfault without it; other fixups always dereference, and
without the dereference I don't understand how it could ever work.

Fixes: 7c85c448e7d74 ("selftests/bpf: test_verifier, check bpf_map_lookup_elem access in bpf prog")

Signed-off-by: Stanislav Fomichev 
---
  tools/testing/selftests/bpf/test_verifier.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index df6f751cc1e8..d23929a1985d 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -14166,7 +14166,7 @@ static void do_test_fixup(struct bpf_test *test, enum bpf_map_type prog_type,
do {
prog[*fixup_map_stacktrace].imm = map_fds[12];
fixup_map_stacktrace++;
-   } while (fixup_map_stacktrace);
+   } while (*fixup_map_stacktrace);
}
  }


It was my mistake. Thanks for the fix!

-Prashant



[PATCH bpf] selftests/bpf: add missing pointer dereference for map stacktrace fixup

2018-12-06 Thread Stanislav Fomichev
I get a segfault without it; other fixups always dereference, and
without the dereference I don't understand how it could ever work.

Fixes: 7c85c448e7d74 ("selftests/bpf: test_verifier, check bpf_map_lookup_elem access in bpf prog")

Signed-off-by: Stanislav Fomichev 
---
 tools/testing/selftests/bpf/test_verifier.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index df6f751cc1e8..d23929a1985d 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -14166,7 +14166,7 @@ static void do_test_fixup(struct bpf_test *test, enum bpf_map_type prog_type,
do {
prog[*fixup_map_stacktrace].imm = map_fds[12];
fixup_map_stacktrace++;
-   } while (fixup_map_stacktrace);
+   } while (*fixup_map_stacktrace);
}
 }
 
-- 
2.20.0.rc2.403.gdbc3b29805-goog
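
For context, a hedged sketch of the convention behind the fix (array
contents are hypothetical): each fixup array is a zero-terminated list of
instruction indexes, so the loop condition must dereference the cursor to
see the terminating 0; the bare pointer is always non-NULL once advanced.

	int fixup_map_stacktrace[] = { 3, 7, 0 };	/* hypothetical indexes */
	int *p = fixup_map_stacktrace;

	do {
		prog[*p].imm = map_fds[12];
		p++;
	} while (*p);		/* "while (p)" would never terminate */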



Re: [RFC PATCH 4/6] dt-bindings: update mvneta binding document

2018-12-06 Thread Kishon Vijay Abraham I
Hi Russell,

On 05/12/18 9:00 PM, Rob Herring wrote:
> On Wed, Dec 5, 2018 at 5:00 AM Russell King - ARM Linux
>  wrote:
>>
>> On Mon, Dec 03, 2018 at 05:54:55PM -0600, Rob Herring wrote:
>>> On Mon, Nov 12, 2018 at 12:31:02PM +, Russell King wrote:
 Signed-off-by: Russell King 
>>>
>>> Needs a better subject and a commit msg.
>>
>> Hmm, not sure why it didn't contain:
>>
>> "dt-bindings: net: mvneta: add phys property
>>
>> Add an optional phys property to the mvneta binding documentation for
>> the common phy.
>>
>> Signed-off-by: Russell King "
>>
>> as the commit message.  With the correct commit message, are you happy
>> with it?
> 
> Yes.
> 
> Reviewed-by: Rob Herring 

Are you planning to resend this series?

Thanks
Kishon


Re: [PATCH] Revert "net/ibm/emac: wrong bit is used for STA control"

2018-12-06 Thread Benjamin Herrenschmidt
On Fri, 2018-12-07 at 14:20 +1100, Benjamin Herrenschmidt wrote:
>

Apologies for the empty email, not too sure what happened, I did a
resend and the second one worked.

Cheers
Ben.



[PATCH] Revert "net/ibm/emac: wrong bit is used for STA control"

2018-12-06 Thread Benjamin Herrenschmidt
This reverts commit 624ca9c33c8a853a4a589836e310d776620f4ab9.

This commit is completely bogus. The STACR register has two formats, old
and new, depending on the version of the IP block used. There's a pair of
device-tree properties that can be used to specify the format used:

has-inverted-stacr-oc
has-new-stacr-staopc

What this commit did was to change the bit definition used with the old
parts to match the new parts. This of course breaks the driver on all
the old ones.

Instead, the author should have set the appropriate properties in the
device-tree for the variant used on his board.

Signed-off-by: Benjamin Herrenschmidt 
---

Found while setting up some old ppc440 boxes for test/CI

 drivers/net/ethernet/ibm/emac/emac.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ibm/emac/emac.h b/drivers/net/ethernet/ibm/emac/emac.h
index e2f80cc..0d2de6f 100644
--- a/drivers/net/ethernet/ibm/emac/emac.h
+++ b/drivers/net/ethernet/ibm/emac/emac.h
@@ -231,7 +231,7 @@ struct emac_regs {
 #define EMAC_STACR_PHYE0x4000
 #define EMAC_STACR_STAC_MASK   0x3000
 #define EMAC_STACR_STAC_READ   0x1000
-#define EMAC_STACR_STAC_WRITE  0x0800
+#define EMAC_STACR_STAC_WRITE  0x2000
 #define EMAC_STACR_OPBC_MASK   0x0C00
 #define EMAC_STACR_OPBC_50 0x
 #define EMAC_STACR_OPBC_66 0x0400
-- 
2.7.4
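
For reference, a hedged sketch of how the emac core picks the STACR format
from those properties at probe time (modelled on
drivers/net/ethernet/ibm/emac/core.c; the exact feature-flag names are an
assumption):

	/* illustrative probe-time selection of the STACR register format */
	if (of_get_property(np, "has-inverted-stacr-oc", NULL))
		dev->features |= EMAC_FTR_STACR_OC_INVERT;
	if (of_get_property(np, "has-new-stacr-staopc", NULL))
		dev->features |= EMAC_FTR_HAS_NEW_STACR;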




Re: [PATCH] Revert "net/ibm/emac: wrong bit is used for STA control"

2018-12-06 Thread David Miller


Looks like your posting was empty?


Re: [PATCH net-next] neighbour: Improve garbage collection

2018-12-06 Thread David Miller
From: David Ahern 
Date: Thu,  6 Dec 2018 14:38:44 -0800

> The existing garbage collection algorithm has a number of problems:

Thanks for working on this!

I totally agree with what you are doing, especially the separate
gc_list.

But why do you need the on_gc_list boolean state?  That's equivalent
to "!list_empty(&n->gc_list)" and seems redundant.


Re: [PATCH net-next 3/4] net: use indirect call wrapper at GRO transport layer

2018-12-06 Thread kbuild test robot
Hi Paolo,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Paolo-Abeni/net-mitigate-retpoline-overhead/20181206-183400
config: i386-allmodconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

   net/ipv6/ip6_offload.o: In function `ipv6_gro_receive':
>> ip6_offload.c:(.text+0x58e): undefined reference to `transport6_gro_receive1'
   ip6_offload.c:(.text+0x595): undefined reference to `transport6_gro_receive1'
   net/ipv6/ip6_offload.o: In function `ipv6_gro_complete':
>> ip6_offload.c:(.text+0xb26): undefined reference to 
>> `transport6_gro_complete1'
   ip6_offload.c:(.text+0xb30): undefined reference to 
`transport6_gro_complete1'

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




[PATCH bpf] selftests/bpf: skip sockmap tests on kernels without support

2018-12-06 Thread Stanislav Fomichev
Include "autoconf.h" header in the test_maps.c and selectively
disable test_sockmap if CONFIG_BPF_STREAM_PARSER is not specified
in the kernel config.
When building out of tree/without autoconf.h, fall back to 'enabled'.

Signed-off-by: Stanislav Fomichev 
---
 tools/testing/selftests/bpf/test_maps.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
index 4db2116e52be..3548de8a78ac 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -32,6 +32,13 @@
 #define ENOTSUPP 524
 #endif
 
+#ifdef HAVE_GENHDR
+# include "autoconf.h"
+#else
+/* fallback to all features enabled */
+# define CONFIG_BPF_STREAM_PARSER 1
+#endif
+
 static int map_flags;
 
 #define CHECK(condition, tag, format...) ({\
@@ -588,6 +595,7 @@ static void test_stackmap(int task, void *data)
close(fd);
 }
 
+#ifdef CONFIG_BPF_STREAM_PARSER
 #include 
 #include 
 #include 
@@ -1079,6 +1087,7 @@ static void test_sockmap(int tasks, void *data)
close(fd);
exit(1);
 }
+#endif
 
 #define MAP_SIZE (32 * 1024)
 
@@ -1541,7 +1550,9 @@ static void run_all_tests(void)
test_arraymap_percpu_many_keys();
 
test_devmap(0, NULL);
+#ifdef CONFIG_BPF_STREAM_PARSER
test_sockmap(0, NULL);
+#endif
 
test_map_large();
test_map_parallel();
-- 
2.20.0.rc2.403.gdbc3b29805-goog



Re: [PATCH rdma-next 0/3] Packet based credit mode

2018-12-06 Thread Jason Gunthorpe
On Fri, Nov 30, 2018 at 01:22:03PM +0200, Leon Romanovsky wrote:
> From: Leon Romanovsky 
> 
> From Danit,
> 
> Packet based credit mode is an alternative end-to-end credit mode for QPs
> set during their creation. Credits are transported from the responder
> to the requester to optimize the use of its receive resources.
> In packet-based credit mode, credits are issued on a per packet basis.
> 
> The advantage of this feature comes while sending large RDMA messages
> through switches that are short in memory.
> 
> The first commit exposes the QP creation flag and the HCA capability. The
> second commit adds support for a new DV QP creation flag. The last
> commit reports the packet based credit mode capability via the MLX5DV
> device capabilities.
> 
> Thanks
> 
> Danit Goldberg (3):
>   net/mlx5: Expose packet based credit mode
>   IB/mlx5: Add packet based credit mode support
>   IB/mlx5: Report packet based credit mode device capability

This looks fine to me, can you update the shared branch please

Thanks,
Jason


[PATCH] Revert "net/ibm/emac: wrong bit is used for STA control"

2018-12-06 Thread Benjamin Herrenschmidt



Re: [PATCH RFC 5/6] net: dsa: microchip: Update tag_ksz.c to access switch driver

2018-12-06 Thread Richard Cochran
On Thu, Dec 06, 2018 at 08:00:26PM +, tristram...@microchip.com wrote:
> A customer has already inquired about implementing 1588 PTP in the DSA
> driver.  I hope this mechanism is approved so that I can start doing that.

If you need changes to the PTP core, you had better discuss this with
the PTP maintainer.

Thanks,
Richard


Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages

2018-12-06 Thread Edgecombe, Rick P
On Thu, 2018-12-06 at 15:08 -0800, Nadav Amit wrote:
> > On Dec 6, 2018, at 12:17 PM, Andy Lutomirski  wrote:
> > 
> > On Thu, Dec 6, 2018 at 11:39 AM Nadav Amit  wrote:
> > > > On Dec 6, 2018, at 11:19 AM, Andy Lutomirski  wrote:
> > > > 
> > > > On Thu, Dec 6, 2018 at 11:01 AM Tycho Andersen  wrote:
> > > > > On Thu, Dec 06, 2018 at 10:53:50AM -0800, Andy Lutomirski wrote:
> > > > > > > If we are going to unmap the linear alias, why not do it at
> > > > > > > vmalloc()
> > > > > > > time rather than vfree() time?
> > > > > > 
> > > > > > That’s not totally nuts. Do we ever have code that expects __va() to
> > > > > > work on module data?  Perhaps crypto code trying to encrypt static
> > > > > > data because our APIs don’t understand virtual addresses.  I guess
> > > > > > if
> > > > > > highmem is ever used for modules, then we should be fine.
> > > > > > 
> > > > > > RO instead of not present might be safer.  But I do like the idea of
> > > > > > renaming Rick's flag to something like VM_XPFO or VM_NO_DIRECT_MAP
> > > > > > and
> > > > > > making it do all of this.
> > > > > 
> > > > > Yeah, doing it for everything automatically seemed like it was/is
> > > > > going to be a lot of work to debug all the corner cases where things
> > > > > expect memory to be mapped but don't explicitly say it. And in
> > > > > particular, the XPFO series only does it for user memory, whereas an
> > > > > additional flag like this would work for extra paranoid allocations
> > > > > of kernel memory too.
> > > > 
> > > > I just read the code, and it looks like vmalloc() is already using
> > > > highmem (__GFP_HIGH) if available, so, on big x86_32 systems, for
> > > > example, we already don't have modules in the direct map.
> > > > 
> > > > So I say we go for it.  This should be quite simple to implement --
> > > > the pageattr code already has almost all the needed logic on x86.  The
> > > > only arch support we should need is a pair of functions to remove a
> > > > vmalloc address range from the address map (if it was present in the
> > > > first place) and a function to put it back.  On x86, this should only
> > > > be a few lines of code.
> > > > 
> > > > What do you all think?  This should solve most of the problems we have.
> > > > 
> > > > If we really wanted to optimize this, we'd make it so that
> > > > module_alloc() allocates memory the normal way, then, later on, we
> > > > call some function that, all at once, removes the memory from the
> > > > direct map and applies the right permissions to the vmalloc alias (or
> > > > just makes the vmalloc alias not-present so we can add permissions
> > > > later without flushing), and flushes the TLB.  And we arrange for
> > > > vunmap to zap the vmalloc range, then put the memory back into the
> > > > direct map, then free the pages back to the page allocator, with the
> > > > flush in the appropriate place.
> > > > 
> > > > I don't see why the page allocator needs to know about any of this.
> > > > It's already okay with the permissions being changed out from under it
> > > > on x86, and it seems fine.  Rick, do you want to give some variant of
> > > > this a try?
> > > 
> > > Setting it as read-only may work (and already happens for the read-only
> > > module data). I am not sure about setting it as non-present.
> > > 
> > > At some point, a discussion about a threat-model, as Rick indicated, would
> > > be required. I presume ROP attacks can easily call
> > > set_all_modules_text_rw()
> > > and override all the protections.
> > 
> > I am far from an expert on exploit techniques, but here's a
> > potentially useful model: let's assume there's an attacker who can
> > write controlled data to a controlled kernel address but cannot
> > directly modify control flow.  It would be nice for such an attacker
> > to have a very difficult time of modifying kernel text or of
> > compromising control flow.  So we're assuming a feature like kernel
> > CET or that the attacker finds it very difficult to do something like
> > modifying some thread's IRET frame.
> > 
> > Admittedly, for the kernel, this is an odd threat model, since an
> > attacker can presumably quite easily learn the kernel stack address of
> > one of their tasks, do some syscall, and then modify their kernel
> > thread's stack such that it will IRET right back to a fully controlled
> > register state with RSP pointing at an attacker-supplied kernel stack.
> > So this threat model gives very strong ROP powers. unless we have
> > either CET or some software technique to harden all the RET
> > instructions in the kernel.
> > 
> > I wonder if there's a better model to use.  Maybe with stack-protector
> > we get some degree of protection?  Or is all of this is rather weak
> > until we have CET or a RAP-like feature.
> 
> I believe that seeing the end-goal would make reasoning about patches
> easier, otherwise the complaint “but anyhow it’s all insecure” keeps popping
> up.
> 
> I’m not sure CET or other CFI would be eno

linux-next: manual merge of the net-next tree with the net tree

2018-12-06 Thread Stephen Rothwell
Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  net/dsa/master.c

between commit:

  a3d7e01da060 ("net: dsa: Fix tagging attribute location")

from the net tree and commit:

  dc0fe7d47f9f ("net: dsa: Set the master device's MTU to account for DSA overheads")

from the net-next tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc net/dsa/master.c
index 5e8c9bef78bd,42f525bc68e2..
--- a/net/dsa/master.c
+++ b/net/dsa/master.c
@@@ -158,31 -158,24 +158,47 @@@ static void dsa_master_ethtool_teardown
cpu_dp->orig_ethtool_ops = NULL;
  }
  
 +static ssize_t tagging_show(struct device *d, struct device_attribute *attr,
 +  char *buf)
 +{
 +  struct net_device *dev = to_net_dev(d);
 +  struct dsa_port *cpu_dp = dev->dsa_ptr;
 +
 +  return sprintf(buf, "%s\n",
 + dsa_tag_protocol_to_str(cpu_dp->tag_ops));
 +}
 +static DEVICE_ATTR_RO(tagging);
 +
 +static struct attribute *dsa_slave_attrs[] = {
 +  &dev_attr_tagging.attr,
 +  NULL
 +};
 +
 +static const struct attribute_group dsa_group = {
 +  .name   = "dsa",
 +  .attrs  = dsa_slave_attrs,
 +};
 +
+ void dsa_master_set_mtu(struct net_device *dev, struct dsa_port *cpu_dp)
+ {
+   unsigned int mtu = ETH_DATA_LEN + cpu_dp->tag_ops->overhead;
+   int err;
+ 
+   rtnl_lock();
+   if (mtu <= dev->max_mtu) {
+   err = dev_set_mtu(dev, mtu);
+   if (err)
+   netdev_dbg(dev, "Unable to set MTU to include for DSA overheads\n");
+   }
+   rtnl_unlock();
+ }
+ 
  int dsa_master_setup(struct net_device *dev, struct dsa_port *cpu_dp)
  {
 +  int ret;
 +
+   dsa_master_set_mtu(dev,  cpu_dp);
+ 
/* If we use a tagging format that doesn't have an ethertype
 * field, make sure that all packets from this point on get
 * sent to the tag format's receive function.




Re: [PATCH v2 perf,bpf 1/3] perf, bpf: Introduce PERF_RECORD_BPF_EVENT

2018-12-06 Thread kbuild test robot
Hi Song,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on tip/perf/core]
[also build test ERROR on v4.20-rc5 next-20181206]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Song-Liu/perf-bpf-Introduce-PERF_RECORD_BPF_EVENT/20181207-083615
config: i386-randconfig-x070-201848 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All error/warnings (new ones prefixed by >>):

   kernel/events/core.c: In function 'perf_event_bpf_output':
>> kernel/events/core.c:7694:2: error: implicit declaration of function 
>> 'bpf_get_prog_name'; did you mean 'bpf_prog_free'? 
>> [-Werror=implicit-function-declaration]
 bpf_get_prog_name(bpf_event->prog, name);
 ^
 bpf_prog_free
   kernel/events/core.c: In function 'perf_event_bpf_event_subprog':
>> kernel/events/core.c:7737:12: warning: cast from pointer to integer of 
>> different size [-Wpointer-to-int-cast]
   .addr = (u64)prog->bpf_func,
   ^
   cc1: some warnings being treated as errors

vim +7694 kernel/events/core.c

  7679  
  7680  static void perf_event_bpf_output(struct perf_event *event,
  7681 void *data)
  7682  {
  7683  struct perf_bpf_event *bpf_event = data;
  7684  struct perf_output_handle handle;
  7685  struct perf_sample_data sample;
  7686  char name[KSYM_NAME_LEN];
  7687  int name_len;
  7688  int ret;
  7689  
  7690  if (!perf_event_bpf_match(event))
  7691  return;
  7692  
  7693  /* get prog name and round up to 64 bit aligned */
> 7694  bpf_get_prog_name(bpf_event->prog, name);
  7695  name_len = strlen(name) + 1;
  7696  while (!IS_ALIGNED(name_len, sizeof(u64)))
  7697  name[name_len++] = '\0';
  7698  bpf_event->event_id.len += name_len;
  7699  
  7700  perf_event_header__init_id(&bpf_event->event_id.header, 
&sample, event);
  7701  ret = perf_output_begin(&handle, event,
  7702  bpf_event->event_id.header.size);
  7703  if (ret)
  7704  return;
  7705  
  7706  perf_output_put(&handle, bpf_event->event_id);
  7707  
  7708  __output_copy(&handle, name, name_len);
  7709  
  7710  perf_event__output_id_sample(event, &handle, &sample);
  7711  
  7712  perf_output_end(&handle);
  7713  }
  7714  
  7715  static void perf_event_bpf(struct perf_bpf_event *bpf_event)
  7716  {
  7717  perf_iterate_sb(perf_event_bpf_output,
  7718 bpf_event,
  7719 NULL);
  7720  }
  7721  
  7722  static void perf_event_bpf_event_subprog(
  7723  enum perf_bpf_event_type type,
  7724  struct bpf_prog *prog, u32 id, u32 sub_id)
  7725  {
  7726  struct perf_bpf_event bpf_event = (struct perf_bpf_event){
  7727  .prog = prog,
  7728  .event_id = {
  7729  .header = {
  7730  .type = PERF_RECORD_BPF_EVENT,
  7731  .size = sizeof(bpf_event.event_id),
  7732  },
  7733  .type = type,
  7734  /* .flags = 0 */
  7735  .id = id,
  7736  .sub_id = sub_id,
> 7737  .addr = (u64)prog->bpf_func,
  7738  .len = prog->jited_len,
  7739  },
  7740  };
  7741  
  7742  memcpy(bpf_event.event_id.tag, prog->tag, BPF_TAG_SIZE);
  7743  perf_event_bpf(&bpf_event);
  7744  }
  7745  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




[PATCH] crypto: user - support incremental algorithm dumps

2018-12-06 Thread Eric Biggers
From: Eric Biggers 

CRYPTO_MSG_GETALG in NLM_F_DUMP mode sometimes doesn't return all
registered crypto algorithms, because it doesn't support incremental
dumps.  crypto_dump_report() only permits itself to be called once, yet
the netlink subsystem allocates at most ~64 KiB for the skb being dumped
to.  Thus only the first recvmsg() returns data, and it may only include
a subset of the crypto algorithms even if the user buffer passed to
recvmsg() is large enough to hold all of them.

Fix this by using one of the arguments in the netlink_callback structure
to keep track of the current position in the algorithm list.  Then
userspace can do multiple recvmsg() on the socket after sending the dump
request.  This is the way netlink dumps work elsewhere in the kernel;
it's unclear why this was different (probably just an oversight).

Also fix an integer overflow when calculating the dump buffer size hint.

Fixes: a38f7907b926 ("crypto: Add userspace configuration API")
Signed-off-by: Eric Biggers 
---
 crypto/crypto_user_base.c | 37 -
 1 file changed, 20 insertions(+), 17 deletions(-)

diff --git a/crypto/crypto_user_base.c b/crypto/crypto_user_base.c
index 7021efbb35a1e..5311fd7fae34b 100644
--- a/crypto/crypto_user_base.c
+++ b/crypto/crypto_user_base.c
@@ -231,30 +231,33 @@ static int crypto_report(struct sk_buff *in_skb, struct nlmsghdr *in_nlh,
 
 static int crypto_dump_report(struct sk_buff *skb, struct netlink_callback *cb)
 {
-   struct crypto_alg *alg;
+   const size_t start_pos = cb->args[0];
+   size_t pos = 0;
struct crypto_dump_info info;
-   int err;
-
-   if (cb->args[0])
-   goto out;
-
-   cb->args[0] = 1;
+   struct crypto_alg *alg;
+   int res;
 
info.in_skb = cb->skb;
info.out_skb = skb;
info.nlmsg_seq = cb->nlh->nlmsg_seq;
info.nlmsg_flags = NLM_F_MULTI;
 
+   down_read(&crypto_alg_sem);
list_for_each_entry(alg, &crypto_alg_list, cra_list) {
-   err = crypto_report_alg(alg, &info);
-   if (err)
-   goto out_err;
+   if (pos >= start_pos) {
+   res = crypto_report_alg(alg, &info);
+   if (res == -EMSGSIZE)
+   break;
+   if (res)
+   goto out;
+   }
+   pos++;
}
-
+   cb->args[0] = pos;
+   res = skb->len;
 out:
-   return skb->len;
-out_err:
-   return err;
+   up_read(&crypto_alg_sem);
+   return res;
 }
 
 static int crypto_dump_report_done(struct netlink_callback *cb)
@@ -442,7 +445,7 @@ static int crypto_user_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh,
if ((type == (CRYPTO_MSG_GETALG - CRYPTO_MSG_BASE) &&
(nlh->nlmsg_flags & NLM_F_DUMP))) {
struct crypto_alg *alg;
-   u16 dump_alloc = 0;
+   unsigned long dump_alloc = 0;
 
if (link->dump == NULL)
return -EINVAL;
@@ -450,16 +453,16 @@ static int crypto_user_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh,
down_read(&crypto_alg_sem);
list_for_each_entry(alg, &crypto_alg_list, cra_list)
dump_alloc += CRYPTO_REPORT_MAXSIZE;
+   up_read(&crypto_alg_sem);
 
{
struct netlink_dump_control c = {
.dump = link->dump,
.done = link->done,
-   .min_dump_alloc = dump_alloc,
+   .min_dump_alloc = min(dump_alloc, 65535UL),
};
err = netlink_dump_start(crypto_nlsk, skb, nlh, &c);
}
-   up_read(&crypto_alg_sem);
 
return err;
}
-- 
2.20.0.rc2.403.gdbc3b29805-goog



[net-next PATCH RFC 8/8] veth: xdp_frames redirected into veth need to transfer xdp_mem_info

2018-12-06 Thread Jesper Dangaard Brouer
XDP frames redirected into a veth device that chooses XDP_PASS end up
creating an SKB from the xdp_frame.  The xdp_frame mem info needs to be
transferred into the SKB.

Signed-off-by: Jesper Dangaard Brouer 
Signed-off-by: Ilias Apalodimas 
---
 drivers/net/veth.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index f412ea1cef18..925d300402ca 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -555,6 +555,7 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_rq *rq,
goto err;
}
 
+   skb->mem_info = frame->mem;
xdp_scrub_frame(frame);
skb->protocol = eth_type_trans(skb, rq->dev);
 err:



[net-next PATCH RFC 2/8] net: mvneta: use page pool API for sw buffer manager

2018-12-06 Thread Jesper Dangaard Brouer
From: Ilias Apalodimas 

Use the page_pool api for allocations and DMA handling instead of
__dev_alloc_page()/dma_map_page() and free_page()/dma_unmap_page().

The page_pool API offers buffer recycling capabilities for XDP but
allocates one page per packet, unless the driver splits and manages
the allocated page.

Although XDP is not a part of the driver yet, the current implementation
is allocating one page per packet, thus there's no performance penalty from
using the API.

For now pages are unmapped via page_pool_unmap_page() before packets
travel into the network stack, as it doesn't have a return hook yet.
Given this call clears the page_pool state, it is safe to let the
page be returned to the normal page allocator.
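
Condensed, the RX flow described above looks roughly like this (error paths
elided; only calls that appear in this series are used):

	/* refill: pages leave the pool already DMA-mapped (PP_FLAG_DMA_MAP) */
	page = page_pool_dev_alloc_pages(rxq->page_pool);
	if (!page)
		return -ENOMEM;
	phys_addr = page_pool_get_dma_addr(page);	/* helper from patch 1/8 */

	/* on RX completion there is no stack return hook yet, so clear the
	 * page_pool state before hand-off; the page later frees through the
	 * normal page allocator */
	page_pool_unmap_page(rxq->page_pool, page);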

Signed-off-by: Ilias Apalodimas 
Signed-off-by: Jesper Dangaard Brouer 
---
 drivers/net/ethernet/marvell/Kconfig  |1 +
 drivers/net/ethernet/marvell/mvneta.c |   56 -
 2 files changed, 41 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/marvell/Kconfig b/drivers/net/ethernet/marvell/Kconfig
index 3238aa7f5dac..3325abe67465 100644
--- a/drivers/net/ethernet/marvell/Kconfig
+++ b/drivers/net/ethernet/marvell/Kconfig
@@ -60,6 +60,7 @@ config MVNETA
depends on ARCH_MVEBU || COMPILE_TEST
select MVMDIO
select PHYLINK
+   select PAGE_POOL
---help---
  This driver supports the network interface units in the
  Marvell ARMADA XP, ARMADA 370, ARMADA 38x and
diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 5bfd349bf41a..2354421fe96f 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include "mvneta_bm.h"
+#include 
 #include 
 #include 
 #include 
@@ -624,6 +625,9 @@ struct mvneta_rx_queue {
struct sk_buff *skb;
int left_size;
 
+   /* page pool */
+   struct page_pool *page_pool;
+
/* error counters */
u32 skb_alloc_err;
u32 refill_err;
@@ -1813,17 +1817,11 @@ static int mvneta_rx_refill(struct mvneta_port *pp,
dma_addr_t phys_addr;
struct page *page;
 
-   page = __dev_alloc_page(gfp_mask);
+   page = page_pool_dev_alloc_pages(rxq->page_pool);
if (!page)
return -ENOMEM;
 
-   /* map page for use */
-   phys_addr = dma_map_page(pp->dev->dev.parent, page, 0, PAGE_SIZE,
-DMA_FROM_DEVICE);
-   if (unlikely(dma_mapping_error(pp->dev->dev.parent, phys_addr))) {
-   __free_page(page);
-   return -ENOMEM;
-   }
+   phys_addr = page_pool_get_dma_addr(page);
 
phys_addr += pp->rx_offset_correction;
mvneta_rx_desc_fill(rx_desc, phys_addr, page, rxq);
@@ -1892,10 +1890,11 @@ static void mvneta_rxq_drop_pkts(struct mvneta_port *pp,
if (!data || !(rx_desc->buf_phys_addr))
continue;
 
-   dma_unmap_page(pp->dev->dev.parent, rx_desc->buf_phys_addr,
-  PAGE_SIZE, DMA_FROM_DEVICE);
-   __free_page(data);
+   page_pool_put_page(rxq->page_pool, data, false);
}
+
+   if (rxq->page_pool)
+   page_pool_destroy(rxq->page_pool);
 }
 
 static inline
@@ -2010,8 +2009,7 @@ static int mvneta_rx_swbm(struct napi_struct *napi,
skb_add_rx_frag(rxq->skb, frag_num, page,
frag_offset, frag_size,
PAGE_SIZE);
-   dma_unmap_page(dev->dev.parent, phys_addr,
-  PAGE_SIZE, DMA_FROM_DEVICE);
+   page_pool_unmap_page(rxq->page_pool, page);
rxq->left_size -= frag_size;
}
} else {
@@ -2041,8 +2039,7 @@ static int mvneta_rx_swbm(struct napi_struct *napi,
frag_offset, frag_size,
PAGE_SIZE);
 
-   dma_unmap_page(dev->dev.parent, phys_addr,
-  PAGE_SIZE, DMA_FROM_DEVICE);
+   page_pool_unmap_page(rxq->page_pool, page);
 
rxq->left_size -= frag_size;
}
@@ -2828,11 +2825,37 @@ static int mvneta_poll(struct napi_struct *napi, int 
budget)
return rx_done;
 }
 
+static int mvneta_create_page_pool(struct mvneta_port *pp,
+  struct mvneta_rx_queue *rxq, int num)
+{
+   struct page_pool_params pp_params = { 0 };
+   int err = 0;
+
+   pp_params.order = 0;
+   /* internal DMA mapping in page_pool */
+   pp_params.flags = PP_FLAG_DMA_MAP;
+   pp_params.pool_size = num;
+   pp_params.nid = NUMA_NO_NODE;

[net-next PATCH RFC 1/8] page_pool: add helper functions for DMA

2018-12-06 Thread Jesper Dangaard Brouer
From: Ilias Apalodimas 

Add helper functions for retrieving the dma_addr_t stored in
page_private and for unmapping DMA addresses mapped via the page_pool
API.
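
A minimal sketch of how the two helpers pair up on the driver side
(pool/rx_desc are illustrative names; the pool must have been created
with PP_FLAG_DMA_MAP, so it stashes the DMA address in page_private at
allocation time):

	/* RX refill: reuse the DMA mapping already done by the pool */
	page = page_pool_dev_alloc_pages(pool);
	rx_desc->buf_phys_addr = page_pool_get_dma_addr(page);

	/* Before the page escapes driver control (the netstack has no
	 * return hook yet): unmap and clear the page_pool state, so the
	 * normal page allocator can safely free the page later.
	 */
	page_pool_unmap_page(pool, page);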

Signed-off-by: Ilias Apalodimas 
Signed-off-by: Jesper Dangaard Brouer 
---
 include/net/page_pool.h |6 ++
 net/core/page_pool.c|7 +++
 2 files changed, 13 insertions(+)

diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 694d055e01ef..439f9183d4cd 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -111,6 +111,8 @@ struct page_pool *page_pool_create(const struct 
page_pool_params *params);
 
 void page_pool_destroy(struct page_pool *pool);
 
+void page_pool_unmap_page(struct page_pool *pool, struct page *page);
+
 /* Never call this directly, use helpers below */
 void __page_pool_put_page(struct page_pool *pool,
  struct page *page, bool allow_direct);
@@ -141,4 +143,8 @@ static inline bool is_page_pool_compiled_in(void)
 #endif
 }
 
+static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
+{
+   return page_private(page);
+}
 #endif /* _NET_PAGE_POOL_H */
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 43a932cb609b..26e14a17a67c 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -184,6 +184,13 @@ static void __page_pool_clean_page(struct page_pool *pool,
set_page_private(page, 0);
 }
 
+/* unmap the page and clean our state */
+void page_pool_unmap_page(struct page_pool *pool, struct page *page)
+{
+   __page_pool_clean_page(pool, page);
+}
+EXPORT_SYMBOL(page_pool_unmap_page);
+
 /* Return a page to the page allocator, cleaning up our state */
 static void __page_pool_return_page(struct page_pool *pool, struct page *page)
 {



[net-next PATCH RFC 6/8] mvneta: activate page recycling via skb using page_pool

2018-12-06 Thread Jesper Dangaard Brouer
Previous mvneta patches have already started to use page_pool, but
this was primarily for RX page alloc-side and for doing DMA map/unmap
handling.  Pages traveling through the netstack were unmapped and
returned through the normal page allocator.

It is now time to activate recycling of the pages.  This involves
registering the page_pool with the XDP rxq memory model API, even
though the driver doesn't support XDP itself yet, and simply updating
the SKB->mem_info field with info from the xdp_rxq_info.
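
Condensed from the hunks below, the setup order is (error handling
elided; rxq is the driver's RX queue):

	err = mvneta_create_page_pool(pp, rxq, num);	/* pool first */
	err = xdp_rxq_info_reg(&rxq->xdp_rxq, pp->dev, rxq->id);
	err = xdp_rxq_info_reg_mem_model(&rxq->xdp_rxq, MEM_TYPE_PAGE_POOL,
					 rxq->page_pool);

	/* per packet: tag the SKB so skb_free_head() recycles the page */
	rxq->skb->mem_info = rxq->xdp_rxq.mem;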

Signed-off-by: Jesper Dangaard Brouer 
Signed-off-by: Ilias Apalodimas 
---
 drivers/net/ethernet/marvell/mvneta.c |   29 +
 1 file changed, 25 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c 
b/drivers/net/ethernet/marvell/mvneta.c
index 78f1fcdc1f00..449c19829d67 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -628,6 +628,9 @@ struct mvneta_rx_queue {
/* page pool */
struct page_pool *page_pool;
 
+   /* XDP */
+   struct xdp_rxq_info xdp_rxq;
+
/* error counters */
u32 skb_alloc_err;
u32 refill_err;
@@ -1892,6 +1895,9 @@ static void mvneta_rxq_drop_pkts(struct mvneta_port *pp,
page_pool_put_page(rxq->page_pool, data, false);
}
 
+   if (xdp_rxq_info_is_reg(&rxq->xdp_rxq))
+   xdp_rxq_info_unreg(&rxq->xdp_rxq);
+
if (rxq->page_pool)
page_pool_destroy(rxq->page_pool);
 }
@@ -1978,11 +1984,11 @@ static int mvneta_rx_swbm(struct napi_struct *napi,
 
rx_desc->buf_phys_addr = 0;
frag_num = 0;
+   rxq->skb->mem_info = rxq->xdp_rxq.mem;
skb_reserve(rxq->skb, MVNETA_MH_SIZE + NET_SKB_PAD);
skb_put(rxq->skb, rx_bytes < PAGE_SIZE ? rx_bytes :
PAGE_SIZE);
mvneta_rx_csum(pp, rx_status, rxq->skb);
-   page_pool_unmap_page(rxq->page_pool, page);
rxq->left_size = rx_bytes < PAGE_SIZE ? 0 : rx_bytes -
PAGE_SIZE;
} else {
@@ -2001,7 +2007,7 @@ static int mvneta_rx_swbm(struct napi_struct *napi,
skb_add_rx_frag(rxq->skb, frag_num, page,
0, frag_size,
PAGE_SIZE);
-
+   /* skb frags[] are not recycled, unmap now */
page_pool_unmap_page(rxq->page_pool, page);
 
rxq->left_size -= frag_size;
@@ -2815,10 +2821,25 @@ static int mvneta_create_page_pool(struct mvneta_port 
*pp,
 static int mvneta_rxq_fill(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
   int num)
 {
-   int i = 0;
+   int err, i = 0;
+
+   err = mvneta_create_page_pool(pp, rxq, num);
+   if (err)
+   goto out;
 
-   if (mvneta_create_page_pool(pp, rxq, num))
+   err = xdp_rxq_info_reg(&rxq->xdp_rxq, pp->dev, rxq->id);
+   if (err) {
+   page_pool_destroy(rxq->page_pool);
+   goto out;
+   }
+
+   err = xdp_rxq_info_reg_mem_model(&rxq->xdp_rxq, MEM_TYPE_PAGE_POOL,
+rxq->page_pool);
+   if (err) {
+   xdp_rxq_info_unreg(&rxq->xdp_rxq);
+   page_pool_destroy(rxq->page_pool);
goto out;
+   }
 
for (i = 0; i < num; i++) {
memset(rxq->descs + i, 0, sizeof(struct mvneta_rx_desc));



[net-next PATCH RFC 5/8] net: mvneta: remove copybreak, prefetch and use build_skb

2018-12-06 Thread Jesper Dangaard Brouer
From: Ilias Apalodimas 

The driver's memcpy for packets < 256b and its recycling tricks are
not needed anymore, as the previous patch introduced buffer recycling
using the page_pool API (although recycling doesn't get fully activated
in this patch).  After this, switch to using build_skb().

This patch implicitly fixes a driver bug where the memory was copied
prior to its syncing for the CPU in the < 256b case (as this case is
removed).

We also remove the data prefetch completely.  The original driver had
the prefetch misplaced before any dma_sync operations took place.
Based on Jesper's analysis, even if the prefetch is placed after any
DMA sync ops, it ends up hurting performance.

Signed-off-by: Ilias Apalodimas 
Signed-off-by: Jesper Dangaard Brouer 
---
 drivers/net/ethernet/marvell/mvneta.c |   81 +
 1 file changed, 22 insertions(+), 59 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c 
b/drivers/net/ethernet/marvell/mvneta.c
index 2354421fe96f..78f1fcdc1f00 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -643,7 +643,6 @@ static int txq_number = 8;
 static int rxq_def;
 
 static int rx_copybreak __read_mostly = 256;
-static int rx_header_size __read_mostly = 128;
 
 /* HW BM need that each port be identify by a unique ID */
 static int global_port_id;
@@ -1823,7 +1822,7 @@ static int mvneta_rx_refill(struct mvneta_port *pp,
 
phys_addr = page_pool_get_dma_addr(page);
 
-   phys_addr += pp->rx_offset_correction;
+   phys_addr += pp->rx_offset_correction + NET_SKB_PAD;
mvneta_rx_desc_fill(rx_desc, phys_addr, page, rxq);
return 0;
 }
@@ -1944,14 +1943,12 @@ static int mvneta_rx_swbm(struct napi_struct *napi,
struct page *page;
dma_addr_t phys_addr;
u32 rx_status, index;
-   int rx_bytes, skb_size, copy_size;
-   int frag_num, frag_size, frag_offset;
+   int frag_num, frag_size;
+   int rx_bytes;
 
index = rx_desc - rxq->descs;
page = (struct page *)rxq->buf_virt_addr[index];
data = page_address(page);
-   /* Prefetch header */
-   prefetch(data);
 
phys_addr = rx_desc->buf_phys_addr;
rx_status = rx_desc->status;
@@ -1969,49 +1966,25 @@ static int mvneta_rx_swbm(struct napi_struct *napi,
rx_bytes = rx_desc->data_size -
   (ETH_FCS_LEN + MVNETA_MH_SIZE);
 
-   /* Allocate small skb for each new packet */
-   skb_size = max(rx_copybreak, rx_header_size);
-   rxq->skb = netdev_alloc_skb_ip_align(dev, skb_size);
-   if (unlikely(!rxq->skb)) {
-   netdev_err(dev,
-  "Can't allocate skb on queue %d\n",
-  rxq->id);
-   dev->stats.rx_dropped++;
-   rxq->skb_alloc_err++;
-   continue;
-   }
-   copy_size = min(skb_size, rx_bytes);
-
-   /* Copy data from buffer to SKB, skip Marvell header */
-   memcpy(rxq->skb->data, data + MVNETA_MH_SIZE,
-  copy_size);
-   skb_put(rxq->skb, copy_size);
-   rxq->left_size = rx_bytes - copy_size;
 
-   mvneta_rx_csum(pp, rx_status, rxq->skb);
-   if (rxq->left_size == 0) {
-   int size = copy_size + MVNETA_MH_SIZE;
-
-   dma_sync_single_range_for_cpu(dev->dev.parent,
- phys_addr, 0,
- size,
- DMA_FROM_DEVICE);
+   dma_sync_single_range_for_cpu(dev->dev.parent,
+ phys_addr, 0,
+ rx_bytes,
+ DMA_FROM_DEVICE);
 
-   /* leave the descriptor and buffer untouched */
-   } else {
-   /* refill descriptor with new buffer later */
-   rx_desc->buf_phys_addr = 0;
+   rxq->skb = build_skb(data, PAGE_SIZE);
+   if (!rxq->skb)
+   break;
 
-   frag_num = 0;
-   frag_offset = copy_size + MVNETA_MH_SIZE;
-   frag_size = min(rxq->left_size,
-   (int)(PAGE_SIZE - frag_offset))

[net-next PATCH RFC 4/8] net: core: add recycle capabilities on skbs via page_pool API

2018-12-06 Thread Jesper Dangaard Brouer
From: Ilias Apalodimas 

This patch is changing struct sk_buff, and is thus by definition
controversial.

Place a new member 'mem_info' of type struct xdp_mem_info just after
the members (flags) head_frag and pfmemalloc, and not in between
headers_start/end, to ensure skb_copy() and pskb_copy() work as-is.
Copying mem_info during skb_clone() is required.  This makes sure that
pages are correctly freed or recycled during the altered
skb_free_head() invocation.

The 'mem_info' name is chosen as this is not strictly tied to XDP,
although the XDP return infrastructure is used.  As a future plan, we
could introduce a __u8 flags member to xdp_mem_info and move flags
head_frag and pfmemalloc into this area.

Signed-off-by: Ilias Apalodimas 
Signed-off-by: Jesper Dangaard Brouer 
---
 include/linux/skbuff.h |6 +-
 include/net/xdp.h  |1 +
 net/core/skbuff.c  |7 +++
 net/core/xdp.c |6 ++
 4 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 7dcfb5591dc3..95dac0ba6947 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -40,6 +40,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /* The interface for checksum offload between the stack and networking drivers
  * is as follows...
@@ -744,6 +745,10 @@ struct sk_buff {
head_frag:1,
xmit_more:1,
pfmemalloc:1;
+   /* TODO: Future idea, extend mem_info with __u8 flags, and
+* move bits head_frag and pfmemalloc there.
+*/
+   struct xdp_mem_info mem_info;
 
/* fields enclosed in headers_start/headers_end are copied
 * using a single memcpy() in __copy_skb_header()
@@ -827,7 +832,6 @@ struct sk_buff {
 #ifdef CONFIG_NETWORK_SECMARK
__u32   secmark;
 #endif
-
union {
__u32   mark;
__u32   reserved_tailroom;
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 5c33b9e0efab..4a0ca7a3d5e5 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -128,6 +128,7 @@ struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
 void xdp_return_frame(struct xdp_frame *xdpf);
 void xdp_return_frame_rx_napi(struct xdp_frame *xdpf);
 void xdp_return_buff(struct xdp_buff *xdp);
+void xdp_return_skb_page(void *data, struct xdp_mem_info *mem_info);
 
 int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
 struct net_device *dev, u32 queue_index);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b4ee5c8b928f..71aca186e44c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -70,6 +70,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -544,6 +545,11 @@ static void skb_free_head(struct sk_buff *skb)
 {
unsigned char *head = skb->head;
 
+   if (skb->mem_info.type == MEM_TYPE_PAGE_POOL) {
+   xdp_return_skb_page(head, &skb->mem_info);
+   return;
+   }
+
if (skb->head_frag)
skb_free_frag(head);
else
@@ -859,6 +865,7 @@ static struct sk_buff *__skb_clone(struct sk_buff *n, 
struct sk_buff *skb)
n->nohdr = 0;
n->peeked = 0;
C(pfmemalloc);
+   C(mem_info);
n->destructor = NULL;
C(tail);
C(end);
diff --git a/net/core/xdp.c b/net/core/xdp.c
index e79526314864..1703be4c2611 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -379,6 +379,12 @@ void xdp_return_buff(struct xdp_buff *xdp)
 }
 EXPORT_SYMBOL_GPL(xdp_return_buff);
 
+void xdp_return_skb_page(void *data, struct xdp_mem_info *mem_info)
+{
+   __xdp_return(data, mem_info, false, 0);
+}
+EXPORT_SYMBOL(xdp_return_skb_page);
+
 int xdp_attachment_query(struct xdp_attachment_info *info,
 struct netdev_bpf *bpf)
 {



[net-next PATCH RFC 7/8] xdp: bpf: cpumap redirect must update skb->mem_info

2018-12-06 Thread Jesper Dangaard Brouer
XDP-redirect to CPUMAP is about creating the SKB outside the driver
(and on another CPU) via xdp_frame info. Transfer the xdp_frame mem
info to the new SKB mem_info field.

Signed-off-by: Jesper Dangaard Brouer 
Signed-off-by: Ilias Apalodimas 
---
 kernel/bpf/cpumap.c |2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 24aac0d0f412..e3e05b6ccc42 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -199,6 +199,8 @@ static struct sk_buff *cpu_map_build_skb(struct 
bpf_cpu_map_entry *rcpu,
/* Essential SKB info: protocol and skb->dev */
skb->protocol = eth_type_trans(skb, xdpf->dev_rx);
 
+   skb->mem_info = xdpf->mem;
+
/* Optional SKB info, currently missing:
 * - HW checksum info   (skb->ip_summed)
 * - HW RX hash (skb_set_hash)



[net-next PATCH RFC 0/8] page_pool DMA handling and allow to recycles frames via SKB

2018-12-06 Thread Jesper Dangaard Brouer
This RFC patchset shows the plans for allowing page_pool to handle and
maintain DMA map/unmap of the pages it serves to the driver.  For this
to work a return hook in the network core is introduced, which also
involves extending sk_buff with the necessary information.

The overall purpose is to simplify drivers, by providing a page
allocation API that does recycling, such that each driver doesn't have
to reinvent its own recycling scheme.  Using page_pool in a driver
does not require implementing XDP support, but it makes it trivially
easy to do so.  The recycling code leverages the XDP recycle APIs.

The Marvell mvneta driver was used in this patchset to demonstrate how
to use the API, and tested on the EspressoBIN board.  We also have
patches enabling XDP for this driver, but they are not part of this
patchset as we want review of the general idea of the page_pool return
SKB hook.

A driver using page_pool and XDP redirecting into CPUMAP or veth will
also take advantage of the new SKB return hook; currently this is only mlx5.


The longer term plan involves allowing other types of allocators to
use this return hook, particularly allowing zero-copy AF_XDP frames to
travel further into the netstack if the userspace pages have been
restricted to read-only.

---

Ilias Apalodimas (4):
  page_pool: add helper functions for DMA
  net: mvneta: use page pool API for sw buffer manager
  net: core: add recycle capabilities on skbs via page_pool API
  net: mvneta: remove copybreak, prefetch and use build_skb

Jesper Dangaard Brouer (4):
  xdp: reduce size of struct xdp_mem_info
  mvneta: activate page recycling via skb using page_pool
  xdp: bpf: cpumap redirect must update skb->mem_info
  veth: xdp_frames redirected into veth need to transfer xdp_mem_info


 drivers/net/ethernet/marvell/Kconfig  |1 
 drivers/net/ethernet/marvell/mvneta.c |  158 +
 drivers/net/veth.c|1 
 include/linux/skbuff.h|6 +
 include/net/page_pool.h   |6 +
 include/net/xdp.h |5 +
 kernel/bpf/cpumap.c   |2 
 net/core/page_pool.c  |7 +
 net/core/skbuff.c |7 +
 net/core/xdp.c|   14 ++-
 10 files changed, 125 insertions(+), 82 deletions(-)

--


[net-next PATCH RFC 3/8] xdp: reduce size of struct xdp_mem_info

2018-12-06 Thread Jesper Dangaard Brouer
It is possible to compress/reduce the size of struct xdp_mem_info.
This change reduce struct xdp_mem_info from 8 bytes to 4 bytes.

The member xdp_mem_info.id can be reduced to u16, as the mem_id_ht
rhashtable in net/core/xdp.c is already limited by MEM_ID_MAX=0xFFFE
which can safely fit in u16.

The member xdp_mem_info.type could be reduced to less than u16, as it
stores the enum xdp_mem_type, but due to alignment it is only reduced to u16.
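
The resulting layout, with a compile-time guard one could add (inside a
function) to make the size assumption explicit; a hedged illustration,
not part of the patch:

	struct xdp_mem_info {
		u16 type; /* enum xdp_mem_type, but known size type */
		u16 id;
	};

	/* before: 2 x u32 = 8 bytes; after: 2 x u16 = 4 bytes, which
	 * makes embedding the struct into sk_buff (patch 4/8) cheaper.
	 */
	BUILD_BUG_ON(sizeof(struct xdp_mem_info) != 4);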

Signed-off-by: Jesper Dangaard Brouer 
---
 include/net/xdp.h |4 ++--
 net/core/xdp.c|8 
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index 0f25b3675c5c..5c33b9e0efab 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -46,8 +46,8 @@ enum xdp_mem_type {
 #define XDP_XMIT_FLAGS_MASKXDP_XMIT_FLUSH
 
 struct xdp_mem_info {
-   u32 type; /* enum xdp_mem_type, but known size type */
-   u32 id;
+   u16 type; /* enum xdp_mem_type, but known size type */
+   u16 id;
 };
 
 struct page_pool;
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 4b2b194f4f1f..e79526314864 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -42,11 +42,11 @@ struct xdp_mem_allocator {
 
 static u32 xdp_mem_id_hashfn(const void *data, u32 len, u32 seed)
 {
-   const u32 *k = data;
-   const u32 key = *k;
+   const u16 *k = data;
+   const u16 key = *k;
 
BUILD_BUG_ON(FIELD_SIZEOF(struct xdp_mem_allocator, mem.id)
-!= sizeof(u32));
+!= sizeof(u16));
 
/* Use cyclic increasing ID as direct hash key */
return key;
@@ -56,7 +56,7 @@ static int xdp_mem_id_cmp(struct rhashtable_compare_arg *arg,
  const void *ptr)
 {
const struct xdp_mem_allocator *xa = ptr;
-   u32 mem_id = *(u32 *)arg->key;
+   u16 mem_id = *(u16 *)arg->key;
 
return xa->mem.id != mem_id;
 }



Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages

2018-12-06 Thread Nadav Amit
> On Dec 6, 2018, at 12:17 PM, Andy Lutomirski  wrote:
> 
> On Thu, Dec 6, 2018 at 11:39 AM Nadav Amit  wrote:
>>> On Dec 6, 2018, at 11:19 AM, Andy Lutomirski  wrote:
>>> 
>>> On Thu, Dec 6, 2018 at 11:01 AM Tycho Andersen  wrote:
 On Thu, Dec 06, 2018 at 10:53:50AM -0800, Andy Lutomirski wrote:
>> If we are going to unmap the linear alias, why not do it at vmalloc()
>> time rather than vfree() time?
> 
> That’s not totally nuts. Do we ever have code that expects __va() to
> work on module data?  Perhaps crypto code trying to encrypt static
> data because our APIs don’t understand virtual addresses.  I guess if
> highmem is ever used for modules, then we should be fine.
> 
> RO instead of not present might be safer.  But I do like the idea of
> renaming Rick's flag to something like VM_XPFO or VM_NO_DIRECT_MAP and
> making it do all of this.
 
 Yeah, doing it for everything automatically seemed like it was/is
 going to be a lot of work to debug all the corner cases where things
 expect memory to be mapped but don't explicitly say it. And in
 particular, the XPFO series only does it for user memory, whereas an
 additional flag like this would work for extra paranoid allocations
 of kernel memory too.
>>> 
>>> I just read the code, and it looks like vmalloc() is already using
>>> highmem (__GFP_HIGHMEM) if available, so, on big x86_32 systems, for
>>> example, we already don't have modules in the direct map.
>>> 
>>> So I say we go for it.  This should be quite simple to implement --
>>> the pageattr code already has almost all the needed logic on x86.  The
>>> only arch support we should need is a pair of functions to remove a
>>> vmalloc address range from the address map (if it was present in the
>>> first place) and a function to put it back.  On x86, this should only
>>> be a few lines of code.
>>> 
>>> What do you all think?  This should solve most of the problems we have.
>>> 
>>> If we really wanted to optimize this, we'd make it so that
>>> module_alloc() allocates memory the normal way, then, later on, we
>>> call some function that, all at once, removes the memory from the
>>> direct map and applies the right permissions to the vmalloc alias (or
>>> just makes the vmalloc alias not-present so we can add permissions
>>> later without flushing), and flushes the TLB.  And we arrange for
>>> vunmap to zap the vmalloc range, then put the memory back into the
>>> direct map, then free the pages back to the page allocator, with the
>>> flush in the appropriate place.
>>> 
>>> I don't see why the page allocator needs to know about any of this.
>>> It's already okay with the permissions being changed out from under it
>>> on x86, and it seems fine.  Rick, do you want to give some variant of
>>> this a try?
>> 
>> Setting it as read-only may work (and already happens for the read-only
>> module data). I am not sure about setting it as non-present.
>> 
>> At some point, a discussion about a threat-model, as Rick indicated, would
>> be required. I presume ROP attacks can easily call set_all_modules_text_rw()
>> and override all the protections.
> 
> I am far from an expert on exploit techniques, but here's a
> potentially useful model: let's assume there's an attacker who can
> write controlled data to a controlled kernel address but cannot
> directly modify control flow.  It would be nice for such an attacker
> to have a very difficult time of modifying kernel text or of
> compromising control flow.  So we're assuming a feature like kernel
> CET or that the attacker finds it very difficult to do something like
> modifying some thread's IRET frame.
> 
> Admittedly, for the kernel, this is an odd threat model, since an
> attacker can presumably quite easily learn the kernel stack address of
> one of their tasks, do some syscall, and then modify their kernel
> thread's stack such that it will IRET right back to a fully controlled
> register state with RSP pointing at an attacker-supplied kernel stack.
> So this threat model gives very strong ROP powers, unless we have
> either CET or some software technique to harden all the RET
> instructions in the kernel.
> 
> I wonder if there's a better model to use.  Maybe with stack-protector
> we get some degree of protection?  Or is all of this is rather weak
> until we have CET or a RAP-like feature.

I believe that seeing the end-goal would make reasoning about patches
easier, otherwise the complaint “but anyhow it’s all insecure” keeps popping
up.

I’m not sure CET or other CFI would be enough even with this threat-model.
The page-tables (the very least) need to be write-protected, as otherwise
controlled data writes may just modify them. There are various possible
solutions I presume: write_rare for page-tables, hypervisor-assisted
security to obtain physical level NX/RO (a-la Microsoft VBS) or some sort of
hardware enclave.

What do you think?

RE: [Intel-wired-lan] [PATCH] ixgbe: Fix race when the VF driver does a reset

2018-12-06 Thread Bowers, AndrewX
> -Original Message-
> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On
> Behalf Of Ross Lagerwall
> Sent: Wednesday, December 5, 2018 5:54 AM
> To: netdev@vger.kernel.org; intel-wired-...@lists.osuosl.org
> Cc: Ross Lagerwall ; David S. Miller
> 
> Subject: [Intel-wired-lan] [PATCH] ixgbe: Fix race when the VF driver does a
> reset
> 
> When the VF driver does a reset, it (at least the Linux one) writes to the
> VFCTRL register to issue a reset and then immediately sends a reset message
> using the mailbox API. This is racy because when the PF driver detects that
> the VFCTRL register reset pin has been asserted, it clears the mailbox
> memory. Depending on ordering, the reset message sent by the VF could be
> cleared by the PF driver. It then responds to the cleared message with a
> NACK which causes the VF driver to malfunction.
> Fix this by deferring clearing the mailbox memory until the reset message is
> received.
> 
> Fixes: 939b701ad633 ("ixgbe: fix driver behaviour after issuing VFLR")
> Signed-off-by: Ross Lagerwall 
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 11 ++-
>  1 file changed, 10 insertions(+), 1 deletion(-)

Tested-by: Andrew Bowers 




RE: [Intel-wired-lan] [PATCH v3 2/2] i40e: DRY rx_ptype handling code

2018-12-06 Thread Bowers, AndrewX
> -Original Message-
> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On
> Behalf Of Michal Miroslaw
> Sent: Tuesday, December 4, 2018 9:31 AM
> To: netdev@vger.kernel.org; intel-wired-...@lists.osuosl.org
> Subject: [Intel-wired-lan] [PATCH v3 2/2] i40e: DRY rx_ptype handling code
> 
> Move rx_ptype extracting to i40e_process_skb_fields() to avoid duplicating
> the code.
> 
> Signed-off-by: Michał Mirosław 
> Signed-off-by: Michał Mirosław 
> ---
> v3:
>  * no changes
> v2:
>  * fix prototype in i40e_txrx_common.h
> ---
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c| 12 
>  drivers/net/ethernet/intel/i40e/i40e_txrx_common.h |  3 +--
>  drivers/net/ethernet/intel/i40e/i40e_xsk.c |  6 +-
>  3 files changed, 6 insertions(+), 15 deletions(-)

Tested-by: Andrew Bowers 




RE: [Intel-wired-lan] [PATCH v3 1/2] i40e: fix VLAN.TCI == 0 RX HW offload

2018-12-06 Thread Bowers, AndrewX
> -Original Message-
> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On
> Behalf Of Michal Miroslaw
> Sent: Tuesday, December 4, 2018 9:31 AM
> To: netdev@vger.kernel.org; intel-wired-...@lists.osuosl.org
> Subject: [Intel-wired-lan] [PATCH v3 1/2] i40e: fix VLAN.TCI == 0 RX HW
> offload
> 
> This fixes two bugs in hardware VLAN offload:
>  1. VLAN.TCI == 0 was being dropped
>  2. there was a race between disabling of VLAN RX feature in hardware
> and processing RX queue, where packets processed in this window
> could have their VLAN information dropped
> 
> Fix moves the VLAN handling into i40e_process_skb_fields() to save on
> duplicated code. i40e_receive_skb() becomes trivial and so is removed.
> 
> Signed-off-by: Michał Mirosław 
> Signed-off-by: Michał Mirosław 
> ---
> v3:
>  * fix whitespace for checkpatch
> v2:
>  * no changes
> ---
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 31 +--
>  .../ethernet/intel/i40e/i40e_txrx_common.h|  2 --
>  drivers/net/ethernet/intel/i40e/i40e_xsk.c|  6 +---
>  3 files changed, 9 insertions(+), 30 deletions(-)

Tested-by: Andrew Bowers 




[PATCH net-next 1/4] tc-testing: Add command timeout feature to tdc

2018-12-06 Thread Lucas Bates
Using an attribute set in the tdc_config.py file, limit the
amount of time tdc will wait for an executed command to
complete and prevent the script from hanging entirely.

This timeout will be applied to all executed commands.

Signed-off-by: Lucas Bates 
---
 tools/testing/selftests/tc-testing/tdc.py| 16 +++-
 tools/testing/selftests/tc-testing/tdc_config.py |  2 ++
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/tc-testing/tdc.py 
b/tools/testing/selftests/tc-testing/tdc.py
index 7607ba3..b862ee5 100755
--- a/tools/testing/selftests/tc-testing/tdc.py
+++ b/tools/testing/selftests/tc-testing/tdc.py
@@ -131,12 +131,16 @@ def exec_cmd(args, pm, stage, command):
 stdout=subprocess.PIPE,
 stderr=subprocess.PIPE,
 env=ENVIR)
-    (rawout, serr) = proc.communicate()
 
-    if proc.returncode != 0 and len(serr) > 0:
-        foutput = serr.decode("utf-8", errors="ignore")
-    else:
-        foutput = rawout.decode("utf-8", errors="ignore")
+    try:
+        (rawout, serr) = proc.communicate(timeout=NAMES['TIMEOUT'])
+        if proc.returncode != 0 and len(serr) > 0:
+            foutput = serr.decode("utf-8", errors="ignore")
+        else:
+            foutput = rawout.decode("utf-8", errors="ignore")
+    except subprocess.TimeoutExpired:
+        foutput = "Command \"{}\" timed out\n".format(command)
+        proc.returncode = 255
 
 proc.stdout.close()
 proc.stderr.close()
@@ -438,6 +442,8 @@ def check_default_settings(args, remaining, pm):
 NAMES['TC'] = args.path
 if args.device != None:
 NAMES['DEV2'] = args.device
+    if 'TIMEOUT' not in NAMES:
+        NAMES['TIMEOUT'] = None
 if not os.path.isfile(NAMES['TC']):
 print("The specified tc path " + NAMES['TC'] + " does not exist.")
 exit(1)
diff --git a/tools/testing/selftests/tc-testing/tdc_config.py 
b/tools/testing/selftests/tc-testing/tdc_config.py
index d651bc1..6d91e48 100644
--- a/tools/testing/selftests/tc-testing/tdc_config.py
+++ b/tools/testing/selftests/tc-testing/tdc_config.py
@@ -15,6 +15,8 @@ NAMES = {
   'DEV1': 'v0p1',
   'DEV2': '',
   'BATCH_FILE': './batch.txt',
+  # Length of time in seconds to wait before terminating a command
+  'TIMEOUT': 12,
   # Name of the namespace to use
   'NS': 'tcut',
   # Directory containing eBPF test programs
-- 
2.7.4



[PATCH net-next 3/4] tc-testing: Implement the TdcResults module in tdc

2018-12-06 Thread Lucas Bates
In tdc and the valgrind plugin, begin using the TdcResults module
to track executed test cases.

Signed-off-by: Lucas Bates 
---
 tools/testing/selftests/tc-testing/TdcPlugin.py|   3 +-
 .../tc-testing/plugin-lib/valgrindPlugin.py|  22 +++-
 tools/testing/selftests/tc-testing/tdc.py  | 117 -
 3 files changed, 91 insertions(+), 51 deletions(-)

diff --git a/tools/testing/selftests/tc-testing/TdcPlugin.py 
b/tools/testing/selftests/tc-testing/TdcPlugin.py
index 3ee9a6d..1d9e279 100644
--- a/tools/testing/selftests/tc-testing/TdcPlugin.py
+++ b/tools/testing/selftests/tc-testing/TdcPlugin.py
@@ -18,11 +18,12 @@ class TdcPlugin:
 if self.args.verbose > 1:
 print(' -- {}.post_suite'.format(self.sub_class))
 
-    def pre_case(self, test_ordinal, testid):
+    def pre_case(self, test_ordinal, testid, test_name):
         '''run commands before test_runner does one test'''
         if self.args.verbose > 1:
             print(' -- {}.pre_case'.format(self.sub_class))
         self.args.testid = testid
+        self.args.test_name = test_name
         self.args.test_ordinal = test_ordinal
 
 def post_case(self):
diff --git a/tools/testing/selftests/tc-testing/plugin-lib/valgrindPlugin.py 
b/tools/testing/selftests/tc-testing/plugin-lib/valgrindPlugin.py
index 477a7bd..e00c798 100644
--- a/tools/testing/selftests/tc-testing/plugin-lib/valgrindPlugin.py
+++ b/tools/testing/selftests/tc-testing/plugin-lib/valgrindPlugin.py
@@ -11,6 +11,7 @@ from string import Template
 import subprocess
 import time
 from TdcPlugin import TdcPlugin
+from TdcResults import *
 
 from tdc_config import *
 
@@ -21,6 +22,7 @@ class SubPlugin(TdcPlugin):
     def __init__(self):
         self.sub_class = 'valgrind/SubPlugin'
         self.tap = ''
+        self._tsr = TestSuiteReport()
         super().__init__()
 
 def pre_suite(self, testcount, testidlist):
@@ -34,10 +36,14 @@ class SubPlugin(TdcPlugin):
     def post_suite(self, index):
         '''run commands after test_runner goes into a test loop'''
         super().post_suite(index)
-        self._add_to_tap('\n|---\n')
         if self.args.verbose > 1:
             print('{}.post_suite'.format(self.sub_class))
-            print('{}'.format(self.tap))
+        #print('{}'.format(self.tap))
+        for xx in range(index - 1, self.testcount):
+            res = TestResult('{}-mem'.format(self.testidlist[xx]), 'Test skipped')
+            res.set_result(ResultState.skip)
+            res.set_errormsg('Skipped because of prior setup/teardown failure')
+            self._add_results(res)
         if self.args.verbose < 4:
             subprocess.check_output('rm -f vgnd-*.log', shell=True)
 
@@ -128,8 +134,17 @@ class SubPlugin(TdcPlugin):
             nle_num = int(nle_mo.group(1))
 
         mem_results = ''
+        res = TestResult('{}-mem'.format(self.args.testid),
+                         '{} memory leak check'.format(self.args.test_name))
         if (def_num > 0) or (ind_num > 0) or (pos_num > 0) or (nle_num > 0):
             mem_results += 'not '
+            res.set_result(ResultState.fail)
+            res.set_failmsg('Memory leak detected')
+            res.append_failmsg(content)
+        else:
+            res.set_result(ResultState.success)
+
+        self._add_results(res)
 
         mem_results += 'ok {} - {}-mem # {}\n'.format(
             self.args.test_ordinal, self.args.testid, 'memory leak check')
@@ -138,5 +153,8 @@ class SubPlugin(TdcPlugin):
             print('{}'.format(content))
         self._add_to_tap(content)
 
+    def _add_results(self, res):
+        self._tsr.add_resultdata(res)
+
     def _add_to_tap(self, more_tap_output):
         self.tap += more_tap_output
diff --git a/tools/testing/selftests/tc-testing/tdc.py 
b/tools/testing/selftests/tc-testing/tdc.py
index b862ee5..e6e4ce8 100755
--- a/tools/testing/selftests/tc-testing/tdc.py
+++ b/tools/testing/selftests/tc-testing/tdc.py
@@ -23,6 +23,7 @@ from tdc_config import *
 from tdc_helper import *
 
 import TdcPlugin
+from TdcResults import *
 
 
 class PluginMgrTestFail(Exception):
@@ -60,10 +61,10 @@ class PluginMgr:
         for pgn_inst in reversed(self.plugin_instances):
             pgn_inst.post_suite(index)
 
-    def call_pre_case(self, test_ordinal, testid):
+    def call_pre_case(self, test_ordinal, testid, test_name):
         for pgn_inst in self.plugin_instances:
             try:
-                pgn_inst.pre_case(test_ordinal, testid)
+                pgn_inst.pre_case(test_ordinal, testid, test_name)
             except Exception as ee:
                 print('exception {} in call to pre_case for {} plugin'.
                       format(ee, pgn_inst.__class__))
@@ -102,7 +103,6 @@ class PluginMgr:
         self.argparser = argparse.ArgumentParser(
             description='Linux TC unit tests')
 
-
 def replace_keywords(cmd):
     """
     For a given executable command, substitute any known
@@ -187,6 +187,7 @@ de

[PATCH net-next 2/4] tc-testing: Add new TdcResults module

2018-12-06 Thread Lucas Bates
This module includes new classes for tdc to use in keeping track
of test case results, instead of generating and tracking a lengthy
string.

The new module can be extended to support multiple formal
test result formats to be friendlier to automation.

Signed-off-by: Lucas Bates 
---
 tools/testing/selftests/tc-testing/TdcResults.py | 133 +++
 1 file changed, 133 insertions(+)
 create mode 100644 tools/testing/selftests/tc-testing/TdcResults.py

diff --git a/tools/testing/selftests/tc-testing/TdcResults.py 
b/tools/testing/selftests/tc-testing/TdcResults.py
new file mode 100644
index 000..250fbb2
--- /dev/null
+++ b/tools/testing/selftests/tc-testing/TdcResults.py
@@ -0,0 +1,133 @@
+#!/usr/bin/env python3
+
+from enum import Enum
+
+class ResultState(Enum):
+    noresult = -1
+    skip = 0
+    success = 1
+    fail = 2
+
+class TestResult:
+    def __init__(self, test_id="", test_name=""):
+        self.test_id = test_id
+        self.test_name = test_name
+        self.result = ResultState.noresult
+        self.failmsg = ""
+        self.errormsg = ""
+        self.steps = []
+
+    def set_result(self, result):
+        if (isinstance(result, ResultState)):
+            self.result = result
+            return True
+        else:
+            raise TypeError('Unknown result type, must be type ResultState')
+
+    def get_result(self):
+        return self.result
+
+    def set_errormsg(self, errormsg):
+        self.errormsg = errormsg
+        return True
+
+    def append_errormsg(self, errormsg):
+        self.errormsg = '{}\n{}'.format(self.errormsg, errormsg)
+
+    def get_errormsg(self):
+        return self.errormsg
+
+    def set_failmsg(self, failmsg):
+        self.failmsg = failmsg
+        return True
+
+    def append_failmsg(self, failmsg):
+        self.failmsg = '{}\n{}'.format(self.failmsg, failmsg)
+
+    def get_failmsg(self):
+        return self.failmsg
+
+    def add_steps(self, newstep):
+        if type(newstep) == list:
+            self.steps.extend(newstep)
+        elif type(newstep) == str:
+            self.steps.append(newstep)
+        else:
+            raise TypeError('TdcResults.add_steps() requires a list or str')
+
+    def get_executed_steps(self):
+        return self.steps
+
+class TestSuiteReport():
+    _testsuite = []
+
+    def add_resultdata(self, result_data):
+        if isinstance(result_data, TestResult):
+            self._testsuite.append(result_data)
+            return True
+
+    def count_tests(self):
+        return len(self._testsuite)
+
+    def count_failures(self):
+        return sum(1 for t in self._testsuite if t.result == ResultState.fail)
+
+    def count_skips(self):
+        return sum(1 for t in self._testsuite if t.result == ResultState.skip)
+
+    def find_result(self, test_id):
+        return next((tr for tr in self._testsuite if tr.test_id == test_id), None)
+
+    def update_result(self, result_data):
+        orig = self.find_result(result_data.test_id)
+        if orig != None:
+            idx = self._testsuite.index(orig)
+            self._testsuite[idx] = result_data
+        else:
+            self.add_resultdata(result_data)
+
+    def format_tap(self):
+        ftap = ""
+        ftap += '1..{}\n'.format(self.count_tests())
+        index = 1
+        for t in self._testsuite:
+            if t.result == ResultState.fail:
+                ftap += 'not '
+            ftap += 'ok {} {} - {}'.format(str(index), t.test_id, t.test_name)
+            if t.result == ResultState.skip or t.result == ResultState.noresult:
+                ftap += ' # skipped - {}\n'.format(t.errormsg)
+            elif t.result == ResultState.fail:
+                if len(t.steps) > 0:
+                    ftap += '\tCommands executed in this test case:'
+                    for step in t.steps:
+                        ftap += '\n\t\t{}'.format(step)
+                ftap += '\n\t{}'.format(t.failmsg)
+            ftap += '\n'
+            index += 1
+        return ftap
+
+    def format_xunit(self):
+        from xml.sax.saxutils import escape
+        xunit = "<testsuites>\n"
+        xunit += '\t<testsuite tests="{}" skips="{}">\n'.format(self.count_tests(), self.count_skips())
+        for t in self._testsuite:
+            xunit += '\t\t<testcase classname="{}" name="{}">\n'.format(escape(t.test_id), escape(t.test_name))
+            if t.failmsg:
+                xunit += '\t\t\t<failure>\n'
+                if len(t.steps) > 0:
+                    xunit += 'Commands executed in this test case:\n'
+                    for step in t.steps:
+                        xunit += '\t{}\n'.format(escape(step))
+                xunit += 'FAILURE: {}\n'.format(escape(t.failmsg))
+                xunit += '\t\t\t</failure>\n'
+            if t.errormsg:
+                xunit += '\t\t\t<error>\n{}\n'.format(escape(t.errormsg))
+                xunit += '\t\t\t</error>\n'
+            if t.result == ResultState.skip:
+                xunit += '\t\t\t<skipped/>\n'
+            xunit += '\t\t</testcase>\n'
+        xunit += '\t</testsuite>\n'
+        xunit += '</testsuites>\n'
+        return xunit
+
-- 
2.7.4



[PATCH net-next 0/4] tc-testing: implement command timeouts and better results tracking

2018-12-06 Thread Lucas Bates
Patch 1 adds a timeout feature for any command tdc launches in a subshell.
This prevents tdc from hanging indefinitely.

Patches 2-4 introduce a new method for tracking and generating test case
results, and implements it across the core script and all applicable
plugins.

Lucas Bates (4):
  tc-testing: Add command timeout feature to tdc
  tc-testing: Add new TdcResults module
  tc-testing: Implement the TdcResults module in tdc
  tc-testing: gitignore, ignore generated test results

 tools/testing/selftests/tc-testing/.gitignore  |   3 +
 tools/testing/selftests/tc-testing/TdcPlugin.py|   3 +-
 tools/testing/selftests/tc-testing/TdcResults.py   | 133 +
 .../tc-testing/plugin-lib/valgrindPlugin.py|  22 +++-
 tools/testing/selftests/tc-testing/tdc.py  | 133 +
 tools/testing/selftests/tc-testing/tdc_config.py   |   2 +
 6 files changed, 240 insertions(+), 56 deletions(-)
 create mode 100644 tools/testing/selftests/tc-testing/TdcResults.py

--
2.7.4



[PATCH net-next 4/4] tc-testing: gitignore, ignore generated test results

2018-12-06 Thread Lucas Bates
Ignore any .tap or .xml test result files generated by tdc.

Additionally, ignore plugin symlinks.

Signed-off-by: Lucas Bates 
---
 tools/testing/selftests/tc-testing/.gitignore | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tools/testing/selftests/tc-testing/.gitignore 
b/tools/testing/selftests/tc-testing/.gitignore
index 7a60b85..c5cc160 100644
--- a/tools/testing/selftests/tc-testing/.gitignore
+++ b/tools/testing/selftests/tc-testing/.gitignore
@@ -1,2 +1,5 @@
 __pycache__/
 *.pyc
+plugins/
+*.xml
+*.tap
-- 
2.7.4



[PATCH net-next,v5 11/12] qede: place ethtool_rx_flow_spec code after TC flower codebase

2018-12-06 Thread Pablo Neira Ayuso
This is a preparation patch to reuse the existing TC flower codebase
from ethtool_rx_flow_spec.

This patch merely moves the core ethtool_rx_flow_spec parser after the
TC flower offload driver code, so we can skip a few forward function
declarations in the follow-up patch.

Signed-off-by: Pablo Neira Ayuso 
---
v5: rebase on top of net-next head.

 drivers/net/ethernet/qlogic/qede/qede_filter.c | 264 -
 1 file changed, 132 insertions(+), 132 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qede/qede_filter.c 
b/drivers/net/ethernet/qlogic/qede/qede_filter.c
index 833c9ec58a6e..ed77950f6cf9 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_filter.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_filter.c
@@ -1791,72 +1791,6 @@ static int qede_flow_spec_to_tuple_udpv6(struct qede_dev 
*edev,
return 0;
 }
 
-static int qede_flow_spec_to_tuple(struct qede_dev *edev,
-  struct qede_arfs_tuple *t,
-  struct ethtool_rx_flow_spec *fs)
-{
-   memset(t, 0, sizeof(*t));
-
-   if (qede_flow_spec_validate_unused(edev, fs))
-   return -EOPNOTSUPP;
-
-   switch ((fs->flow_type & ~FLOW_EXT)) {
-   case TCP_V4_FLOW:
-   return qede_flow_spec_to_tuple_tcpv4(edev, t, fs);
-   case UDP_V4_FLOW:
-   return qede_flow_spec_to_tuple_udpv4(edev, t, fs);
-   case TCP_V6_FLOW:
-   return qede_flow_spec_to_tuple_tcpv6(edev, t, fs);
-   case UDP_V6_FLOW:
-   return qede_flow_spec_to_tuple_udpv6(edev, t, fs);
-   default:
-   DP_VERBOSE(edev, NETIF_MSG_IFUP,
-  "Can't support flow of type %08x\n", fs->flow_type);
-   return -EOPNOTSUPP;
-   }
-
-   return 0;
-}
-
-static int qede_flow_spec_validate(struct qede_dev *edev,
-  struct ethtool_rx_flow_spec *fs,
-  struct qede_arfs_tuple *t)
-{
-   if (fs->location >= QEDE_RFS_MAX_FLTR) {
-   DP_INFO(edev, "Location out-of-bounds\n");
-   return -EINVAL;
-   }
-
-   /* Check location isn't already in use */
-   if (test_bit(fs->location, edev->arfs->arfs_fltr_bmap)) {
-   DP_INFO(edev, "Location already in use\n");
-   return -EINVAL;
-   }
-
-   /* Check if the filtering-mode could support the filter */
-   if (edev->arfs->filter_count &&
-   edev->arfs->mode != t->mode) {
-   DP_INFO(edev,
-   "flow_spec would require filtering mode %08x, but %08x 
is configured\n",
-   t->mode, edev->arfs->filter_count);
-   return -EINVAL;
-   }
-
-   /* If drop requested then no need to validate other data */
-   if (fs->ring_cookie == RX_CLS_FLOW_DISC)
-   return 0;
-
-   if (ethtool_get_flow_spec_ring_vf(fs->ring_cookie))
-   return 0;
-
-   if (fs->ring_cookie >= QEDE_RSS_COUNT(edev)) {
-   DP_INFO(edev, "Queue out-of-bounds\n");
-   return -EINVAL;
-   }
-
-   return 0;
-}
-
 /* Must be called while qede lock is held */
 static struct qede_arfs_fltr_node *
 qede_flow_find_fltr(struct qede_dev *edev, struct qede_arfs_tuple *t)
@@ -1896,72 +1830,6 @@ static void qede_flow_set_destination(struct qede_dev 
*edev,
   "Configuring N-tuple for VF 0x%02x\n", n->vfid - 1);
 }
 
-int qede_add_cls_rule(struct qede_dev *edev, struct ethtool_rxnfc *info)
-{
-   struct ethtool_rx_flow_spec *fsp = &info->fs;
-   struct qede_arfs_fltr_node *n;
-   struct qede_arfs_tuple t;
-   int min_hlen, rc;
-
-   __qede_lock(edev);
-
-   if (!edev->arfs) {
-   rc = -EPERM;
-   goto unlock;
-   }
-
-   /* Translate the flow specification into something fittign our DB */
-   rc = qede_flow_spec_to_tuple(edev, &t, fsp);
-   if (rc)
-   goto unlock;
-
-   /* Make sure location is valid and filter isn't already set */
-   rc = qede_flow_spec_validate(edev, fsp, &t);
-   if (rc)
-   goto unlock;
-
-   if (qede_flow_find_fltr(edev, &t)) {
-   rc = -EINVAL;
-   goto unlock;
-   }
-
-   n = kzalloc(sizeof(*n), GFP_KERNEL);
-   if (!n) {
-   rc = -ENOMEM;
-   goto unlock;
-   }
-
-   min_hlen = qede_flow_get_min_header_size(&t);
-   n->data = kzalloc(min_hlen, GFP_KERNEL);
-   if (!n->data) {
-   kfree(n);
-   rc = -ENOMEM;
-   goto unlock;
-   }
-
-   n->sw_id = fsp->location;
-   set_bit(n->sw_id, edev->arfs->arfs_fltr_bmap);
-   n->buf_len = min_hlen;
-
-   memcpy(&n->tuple, &t, sizeof(n->tuple));
-
-   qede_flow_set_destination(edev, n, fsp);
-
-   /* Build a minimal header according to the flow */
-   n->tuple.build_hdr(&n->tuple, n->data);
-

[PATCH net-next,v5 01/12] flow_offload: add flow_rule and flow_match structures and use them

2018-12-06 Thread Pablo Neira Ayuso
This patch wraps the dissector key and mask - that flower uses to
represent the matching side - around the flow_match structure.

To avoid a follow up patch that would edit the same LoCs in the drivers,
this patch also wraps this new flow match structure around the flow rule
object. This new structure will also contain the flow actions in follow
up patches.

This introduces two new interfaces:

bool flow_rule_match_key(rule, dissector_id)

that returns true if a given matching key is set, and:

flow_rule_match_XYZ(rule, &match);

to fetch the matching side XYZ into the match container structure,
retrieving the key and the mask with one single call.
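
A hedged driver-side sketch of the resulting parsing pattern (it
mirrors the bnxt/qede conversions below; handle_v4() is a hypothetical
driver helper, not part of this series):

	struct flow_rule *rule = tc_cls_flower_offload_flow_rule(f);

	if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_BASIC)) {
		struct flow_match_basic match;

		flow_rule_match_basic(rule, &match);
		/* match.key and match.mask now point into the rule */
		if (match.key->n_proto == htons(ETH_P_IP))
			handle_v4(match.key->ip_proto, match.mask->ip_proto);
	}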

Signed-off-by: Pablo Neira Ayuso 
---
v5: fix double kfree in cls_flower error path, reported by kbuild robot
via Julia Lawal.

 drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c   | 174 -
 .../net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c   | 194 --
 drivers/net/ethernet/intel/i40e/i40e_main.c| 178 -
 drivers/net/ethernet/intel/iavf/iavf_main.c| 195 --
 drivers/net/ethernet/intel/igb/igb_main.c  |  64 ++--
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c| 420 +
 .../net/ethernet/mellanox/mlxsw/spectrum_flower.c  | 202 +-
 drivers/net/ethernet/netronome/nfp/flower/action.c |  11 +-
 drivers/net/ethernet/netronome/nfp/flower/match.c  | 417 ++--
 .../net/ethernet/netronome/nfp/flower/offload.c| 145 +++
 drivers/net/ethernet/qlogic/qede/qede_filter.c |  85 ++---
 include/net/flow_offload.h | 115 ++
 include/net/pkt_cls.h  |  11 +-
 net/core/Makefile  |   2 +-
 net/core/flow_offload.c| 143 +++
 net/sched/cls_flower.c |  47 ++-
 16 files changed, 1196 insertions(+), 1207 deletions(-)
 create mode 100644 include/net/flow_offload.h
 create mode 100644 net/core/flow_offload.c

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
index 749f63beddd8..b82143d6cdde 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
@@ -177,18 +177,12 @@ static int bnxt_tc_parse_actions(struct bnxt *bp,
return 0;
 }
 
-#define GET_KEY(flow_cmd, key_type)\
-   skb_flow_dissector_target((flow_cmd)->dissector, key_type,\
- (flow_cmd)->key)
-#define GET_MASK(flow_cmd, key_type)   \
-   skb_flow_dissector_target((flow_cmd)->dissector, key_type,\
- (flow_cmd)->mask)
-
 static int bnxt_tc_parse_flow(struct bnxt *bp,
  struct tc_cls_flower_offload *tc_flow_cmd,
  struct bnxt_tc_flow *flow)
 {
-   struct flow_dissector *dissector = tc_flow_cmd->dissector;
+   struct flow_rule *rule = tc_cls_flower_offload_flow_rule(tc_flow_cmd);
+   struct flow_dissector *dissector = rule->match.dissector;
 
/* KEY_CONTROL and KEY_BASIC are needed for forming a meaningful key */
if ((dissector->used_keys & BIT(FLOW_DISSECTOR_KEY_CONTROL)) == 0 ||
@@ -198,140 +192,120 @@ static int bnxt_tc_parse_flow(struct bnxt *bp,
return -EOPNOTSUPP;
}
 
-   if (dissector_uses_key(dissector, FLOW_DISSECTOR_KEY_BASIC)) {
-   struct flow_dissector_key_basic *key =
-   GET_KEY(tc_flow_cmd, FLOW_DISSECTOR_KEY_BASIC);
-   struct flow_dissector_key_basic *mask =
-   GET_MASK(tc_flow_cmd, FLOW_DISSECTOR_KEY_BASIC);
+   if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_BASIC)) {
+   struct flow_match_basic match;
 
-   flow->l2_key.ether_type = key->n_proto;
-   flow->l2_mask.ether_type = mask->n_proto;
+   flow_rule_match_basic(rule, &match);
+   flow->l2_key.ether_type = match.key->n_proto;
+   flow->l2_mask.ether_type = match.mask->n_proto;
 
-   if (key->n_proto == htons(ETH_P_IP) ||
-   key->n_proto == htons(ETH_P_IPV6)) {
-   flow->l4_key.ip_proto = key->ip_proto;
-   flow->l4_mask.ip_proto = mask->ip_proto;
+   if (match.key->n_proto == htons(ETH_P_IP) ||
+   match.key->n_proto == htons(ETH_P_IPV6)) {
+   flow->l4_key.ip_proto = match.key->ip_proto;
+   flow->l4_mask.ip_proto = match.mask->ip_proto;
}
}
 
-   if (dissector_uses_key(dissector, FLOW_DISSECTOR_KEY_ETH_ADDRS)) {
-   struct flow_dissector_key_eth_addrs *key =
-   GET_KEY(tc_flow_cmd, FLOW_DISSECTOR_KEY_ETH_ADDRS);
-   struct flow_dissector_key_eth_addrs 

[PATCH net-next,v5 10/12] dsa: bcm_sf2: use flow_rule infrastructure

2018-12-06 Thread Pablo Neira Ayuso
Update this driver to use the flow_rule infrastructure, so that the
same code can populate the hardware IR from both the ethtool_rx_flow
and cls_flower interfaces.
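
A hedged sketch of the conversion pattern used below: build a flow_rule
from the ethtool flow spec, match on it with the same helpers
cls_flower offload uses, then free it
(ethtool_rx_flow_rule_create()/_destroy() come from an earlier patch in
this series):

	struct ethtool_rx_flow_spec_input input = { .fs = fs };
	struct ethtool_rx_flow_rule *flow;
	struct flow_match_ports ports;

	flow = ethtool_rx_flow_rule_create(&input);
	if (IS_ERR(flow))
		return PTR_ERR(flow);

	flow_rule_match_ports(flow->rule, &ports);
	/* ... program hardware from ports.key / ports.mask ... */

	ethtool_rx_flow_rule_destroy(flow);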

Signed-off-by: Pablo Neira Ayuso 
---
v5: rebase on top of net-next head.

 drivers/net/dsa/bcm_sf2_cfp.c | 102 +++---
 1 file changed, 67 insertions(+), 35 deletions(-)

diff --git a/drivers/net/dsa/bcm_sf2_cfp.c b/drivers/net/dsa/bcm_sf2_cfp.c
index e14663ab6dbc..6d8059dc77b7 100644
--- a/drivers/net/dsa/bcm_sf2_cfp.c
+++ b/drivers/net/dsa/bcm_sf2_cfp.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "bcm_sf2.h"
 #include "bcm_sf2_regs.h"
@@ -257,7 +258,8 @@ static int bcm_sf2_cfp_act_pol_set(struct bcm_sf2_priv 
*priv,
 }
 
 static void bcm_sf2_cfp_slice_ipv4(struct bcm_sf2_priv *priv,
-  struct ethtool_tcpip4_spec *v4_spec,
+  struct flow_dissector_key_ipv4_addrs *addrs,
+  struct flow_dissector_key_ports *ports,
   unsigned int slice_num,
   bool mask)
 {
@@ -278,7 +280,7 @@ static void bcm_sf2_cfp_slice_ipv4(struct bcm_sf2_priv 
*priv,
 * UDF_n_A6 [23:8]
 * UDF_n_A5 [7:0]
 */
-   reg = be16_to_cpu(v4_spec->pdst) >> 8;
+   reg = be16_to_cpu(ports->dst) >> 8;
if (mask)
offset = CORE_CFP_MASK_PORT(3);
else
@@ -289,9 +291,9 @@ static void bcm_sf2_cfp_slice_ipv4(struct bcm_sf2_priv 
*priv,
 * UDF_n_A4 [23:8]
 * UDF_n_A3 [7:0]
 */
-   reg = (be16_to_cpu(v4_spec->pdst) & 0xff) << 24 |
- (u32)be16_to_cpu(v4_spec->psrc) << 8 |
- (be32_to_cpu(v4_spec->ip4dst) & 0xff00) >> 8;
+   reg = (be16_to_cpu(ports->dst) & 0xff) << 24 |
+ (u32)be16_to_cpu(ports->src) << 8 |
+ (be32_to_cpu(addrs->dst) & 0xff00) >> 8;
if (mask)
offset = CORE_CFP_MASK_PORT(2);
else
@@ -302,9 +304,9 @@ static void bcm_sf2_cfp_slice_ipv4(struct bcm_sf2_priv 
*priv,
 * UDF_n_A2 [23:8]
 * UDF_n_A1 [7:0]
 */
-   reg = (u32)(be32_to_cpu(v4_spec->ip4dst) & 0xff) << 24 |
- (u32)(be32_to_cpu(v4_spec->ip4dst) >> 16) << 8 |
- (be32_to_cpu(v4_spec->ip4src) & 0xff00) >> 8;
+   reg = (u32)(be32_to_cpu(addrs->dst) & 0xff) << 24 |
+ (u32)(be32_to_cpu(addrs->dst) >> 16) << 8 |
+ (be32_to_cpu(addrs->src) & 0xff00) >> 8;
if (mask)
offset = CORE_CFP_MASK_PORT(1);
else
@@ -317,8 +319,8 @@ static void bcm_sf2_cfp_slice_ipv4(struct bcm_sf2_priv 
*priv,
 * Slice ID [3:2]
 * Slice valid  [1:0]
 */
-   reg = (u32)(be32_to_cpu(v4_spec->ip4src) & 0xff) << 24 |
- (u32)(be32_to_cpu(v4_spec->ip4src) >> 16) << 8 |
+   reg = (u32)(be32_to_cpu(addrs->src) & 0xff) << 24 |
+ (u32)(be32_to_cpu(addrs->src) >> 16) << 8 |
  SLICE_NUM(slice_num) | SLICE_VALID;
if (mask)
offset = CORE_CFP_MASK_PORT(0);
@@ -332,9 +334,13 @@ static int bcm_sf2_cfp_ipv4_rule_set(struct bcm_sf2_priv 
*priv, int port,
 unsigned int queue_num,
 struct ethtool_rx_flow_spec *fs)
 {
-   struct ethtool_tcpip4_spec *v4_spec, *v4_m_spec;
+   struct ethtool_rx_flow_spec_input input = {};
const struct cfp_udf_layout *layout;
unsigned int slice_num, rule_index;
+   struct ethtool_rx_flow_rule *flow;
+   struct flow_match_ipv4_addrs ipv4;
+   struct flow_match_ports ports;
+   struct flow_match_ip ip;
u8 ip_proto, ip_frag;
u8 num_udf;
u32 reg;
@@ -343,13 +349,9 @@ static int bcm_sf2_cfp_ipv4_rule_set(struct bcm_sf2_priv 
*priv, int port,
switch (fs->flow_type & ~FLOW_EXT) {
case TCP_V4_FLOW:
ip_proto = IPPROTO_TCP;
-   v4_spec = &fs->h_u.tcp_ip4_spec;
-   v4_m_spec = &fs->m_u.tcp_ip4_spec;
break;
case UDP_V4_FLOW:
ip_proto = IPPROTO_UDP;
-   v4_spec = &fs->h_u.udp_ip4_spec;
-   v4_m_spec = &fs->m_u.udp_ip4_spec;
break;
default:
return -EINVAL;
@@ -367,11 +369,22 @@ static int bcm_sf2_cfp_ipv4_rule_set(struct bcm_sf2_priv 
*priv, int port,
if (rule_index > bcm_sf2_cfp_rule_size(priv))
return -ENOSPC;
 
+   input.fs = fs;
+   flow = ethtool_rx_flow_rule_create(&input);
+   if (IS_ERR(flow))
+   return PTR_ERR(flow);
+
+   flow_rule_match_ipv4_addrs(flow->rule, &ipv4);
+   flow_rule_match_ports(flow->rule, &ports);
+   flow_rule_match_ip(flow->rule, &ip);
+
layout = &udf_tcpip4_layout;
/* We only

[PATCH net-next,v5 04/12] cls_api: add translator to flow_action representation

2018-12-06 Thread Pablo Neira Ayuso
This patch implements a new function to translate from native TC
actions to the new flow_action representation. Moreover, this patch
also updates cls_flower to use this new function.
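
On the driver side, the translated actions can then be walked without
any TC dependency. A minimal sketch, assuming the flow_action_for_each()
iterator introduced earlier in this series:

	struct flow_rule *rule = tc_cls_flower_offload_flow_rule(f);
	const struct flow_action_entry *act;
	int i;

	flow_action_for_each(i, act, &rule->action) {
		switch (act->id) {
		case FLOW_ACTION_DROP:
			/* program a hardware drop rule */
			break;
		case FLOW_ACTION_REDIRECT:
			/* act->dev is the egress netdevice */
			break;
		default:
			return -EOPNOTSUPP;
		}
	}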

Signed-off-by: Pablo Neira Ayuso 
---
v5: rebase on top of net-next head.

 include/net/pkt_cls.h  |  3 ++
 net/sched/cls_api.c| 99 ++
 net/sched/cls_flower.c | 14 +++
 3 files changed, 116 insertions(+)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index 9ceac97e5eff..abb035f84321 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -622,6 +622,9 @@ tcf_match_indev(struct sk_buff *skb, int ifindex)
 
 unsigned int tcf_exts_num_actions(struct tcf_exts *exts);
 
+int tc_setup_flow_action(struct flow_action *flow_action,
+const struct tcf_exts *exts);
+
 int tc_setup_cb_call(struct tcf_block *block, struct tcf_exts *exts,
 enum tc_setup_type type, void *type_data, bool err_stop);
 
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 3a4d36072fd5..00b7b639f713 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -32,6 +32,13 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
 
 extern const struct nla_policy rtm_tca_policy[TCA_MAX + 1];
 
@@ -2568,6 +2575,98 @@ int tc_setup_cb_call(struct tcf_block *block, struct 
tcf_exts *exts,
 }
 EXPORT_SYMBOL(tc_setup_cb_call);
 
+int tc_setup_flow_action(struct flow_action *flow_action,
+const struct tcf_exts *exts)
+{
+   const struct tc_action *act;
+   int i, j, k;
+
+   if (!exts)
+   return 0;
+
+   j = 0;
+   tcf_exts_for_each_action(i, act, exts) {
+   struct flow_action_entry *entry;
+
+   entry = &flow_action->entries[j];
+   if (is_tcf_gact_ok(act)) {
+   entry->id = FLOW_ACTION_ACCEPT;
+   } else if (is_tcf_gact_shot(act)) {
+   entry->id = FLOW_ACTION_DROP;
+   } else if (is_tcf_gact_trap(act)) {
+   entry->id = FLOW_ACTION_TRAP;
+   } else if (is_tcf_gact_goto_chain(act)) {
+   entry->id = FLOW_ACTION_GOTO;
+   entry->chain_index = tcf_gact_goto_chain_index(act);
+   } else if (is_tcf_mirred_egress_redirect(act)) {
+   entry->id = FLOW_ACTION_REDIRECT;
+   entry->dev = tcf_mirred_dev(act);
+   } else if (is_tcf_mirred_egress_mirror(act)) {
+   entry->id = FLOW_ACTION_MIRRED;
+   entry->dev = tcf_mirred_dev(act);
+   } else if (is_tcf_vlan(act)) {
+   switch (tcf_vlan_action(act)) {
+   case TCA_VLAN_ACT_PUSH:
+   entry->id = FLOW_ACTION_VLAN_PUSH;
+   entry->vlan.vid = tcf_vlan_push_vid(act);
+   entry->vlan.proto = tcf_vlan_push_proto(act);
+   entry->vlan.prio = tcf_vlan_push_prio(act);
+   break;
+   case TCA_VLAN_ACT_POP:
+   entry->id = FLOW_ACTION_VLAN_POP;
+   break;
+   case TCA_VLAN_ACT_MODIFY:
+   entry->id = FLOW_ACTION_VLAN_MANGLE;
+   entry->vlan.vid = tcf_vlan_push_vid(act);
+   entry->vlan.proto = tcf_vlan_push_proto(act);
+   entry->vlan.prio = tcf_vlan_push_prio(act);
+   break;
+   default:
+   goto err_out;
+   }
+   } else if (is_tcf_tunnel_set(act)) {
+   entry->id = FLOW_ACTION_TUNNEL_ENCAP;
+   entry->tunnel = tcf_tunnel_info(act);
+   } else if (is_tcf_tunnel_release(act)) {
+   entry->id = FLOW_ACTION_TUNNEL_DECAP;
+   entry->tunnel = tcf_tunnel_info(act);
+   } else if (is_tcf_pedit(act)) {
+   for (k = 0; k < tcf_pedit_nkeys(act); k++) {
+   switch (tcf_pedit_cmd(act, k)) {
+   case TCA_PEDIT_KEY_EX_CMD_SET:
+   entry->id = FLOW_ACTION_MANGLE;
+   break;
+   case TCA_PEDIT_KEY_EX_CMD_ADD:
+   entry->id = FLOW_ACTION_ADD;
+   break;
+   default:
+   goto err_out;
+   }
+   entry->mangle.htype = tcf_pedit_htype(act, k);
+   entry->mangl

[PATCH net-next,v5 07/12] cls_flower: don't expose TC actions to drivers anymore

2018-12-06 Thread Pablo Neira Ayuso
Now that drivers have been converted to use the flow action
infrastructure, remove this field from the tc_cls_flower_offload
structure.

Signed-off-by: Pablo Neira Ayuso 
---
v5: rebase on top of net-next head.

 include/net/pkt_cls.h  | 1 -
 net/sched/cls_flower.c | 5 -
 2 files changed, 6 deletions(-)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index a08c06e383db..9bd724bfa860 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -768,7 +768,6 @@ struct tc_cls_flower_offload {
unsigned long cookie;
struct flow_rule *rule;
struct flow_stats stats;
-   struct tcf_exts *exts;
u32 classid;
 };
 
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 806392598ae2..bb4d39689404 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -392,7 +392,6 @@ static int fl_hw_replace_filter(struct tcf_proto *tp,
cls_flower.rule->match.dissector = &f->mask->dissector;
cls_flower.rule->match.mask = &f->mask->key;
cls_flower.rule->match.key = &f->mkey;
-   cls_flower.exts = &f->exts;
cls_flower.classid = f->res.classid;
 
err = tc_setup_flow_action(&cls_flower.rule->action, &f->exts);
@@ -427,7 +426,6 @@ static void fl_hw_update_stats(struct tcf_proto *tp, struct 
cls_fl_filter *f)
tc_cls_common_offload_init(&cls_flower.common, tp, f->flags, NULL);
cls_flower.command = TC_CLSFLOWER_STATS;
cls_flower.cookie = (unsigned long) f;
-   cls_flower.exts = &f->exts;
cls_flower.classid = f->res.classid;
 
tc_setup_cb_call(block, &f->exts, TC_SETUP_CLSFLOWER,
@@ -1490,7 +1488,6 @@ static int fl_reoffload(struct tcf_proto *tp, bool add, 
tc_setup_cb_t *cb,
cls_flower.rule->match.dissector = &mask->dissector;
cls_flower.rule->match.mask = &mask->key;
cls_flower.rule->match.key = &f->mkey;
-   cls_flower.exts = &f->exts;
 
err = tc_setup_flow_action(&cls_flower.rule->action,
   &f->exts);
@@ -1523,7 +1520,6 @@ static int fl_hw_create_tmplt(struct tcf_chain *chain,
 {
struct tc_cls_flower_offload cls_flower = {};
struct tcf_block *block = chain->block;
-   struct tcf_exts dummy_exts = { 0, };
 
cls_flower.rule = flow_rule_alloc(0);
if (!cls_flower.rule)
@@ -1535,7 +1531,6 @@ static int fl_hw_create_tmplt(struct tcf_chain *chain,
cls_flower.rule->match.dissector = &tmplt->dissector;
cls_flower.rule->match.mask = &tmplt->mask;
cls_flower.rule->match.key = &tmplt->dummy_key;
-   cls_flower.exts = &dummy_exts;
 
/* We don't care if driver (any of them) fails to handle this
 * call. It serves just as a hint for it.
-- 
2.11.0



[PATCH net-next,v5 05/12] flow_offload: add statistics retrieval infrastructure and use it

2018-12-06 Thread Pablo Neira Ayuso
This patch provides the flow_stats structure, which acts as a container
in tc_cls_flower_offload that we can then use to restore the statistics
on the existing TC actions. Hence, tcf_exts_stats_update() is not used
from drivers anymore.

Signed-off-by: Pablo Neira Ayuso 
---
v5: Fix bytes and packet parameter swap in flow_stats_update() call,
reported by Venkat Duvvuru.

 drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c  |  4 ++--
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c  |  6 +++---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c   |  2 +-
 drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c |  2 +-
 drivers/net/ethernet/netronome/nfp/flower/offload.c   |  5 ++---
 include/net/flow_offload.h| 14 ++
 include/net/pkt_cls.h |  1 +
 net/sched/cls_flower.c|  4 
 8 files changed, 28 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
index b82143d6cdde..09cd75f54eba 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
@@ -1366,8 +1366,8 @@ static int bnxt_tc_get_flow_stats(struct bnxt *bp,
lastused = flow->lastused;
spin_unlock(&flow->stats_lock);
 
-   tcf_exts_stats_update(tc_flow_cmd->exts, stats.bytes, stats.packets,
- lastused);
+   flow_stats_update(&tc_flow_cmd->stats, stats.bytes, stats.packets,
+ lastused);
return 0;
 }
 
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c
index 39c5af5dad3d..8a2d66ee1d7b 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c
@@ -807,9 +807,9 @@ int cxgb4_tc_flower_stats(struct net_device *dev,
if (ofld_stats->packet_count != packets) {
if (ofld_stats->prev_packet_count != packets)
ofld_stats->last_used = jiffies;
-   tcf_exts_stats_update(cls->exts, bytes - ofld_stats->byte_count,
- packets - ofld_stats->packet_count,
- ofld_stats->last_used);
+   flow_stats_update(&cls->stats, bytes - ofld_stats->byte_count,
+ packets - ofld_stats->packet_count,
+ ofld_stats->last_used);
 
ofld_stats->packet_count = packets;
ofld_stats->byte_count = bytes;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 1daaab91280f..2e1eaf6f5139 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -3227,7 +3227,7 @@ int mlx5e_stats_flower(struct mlx5e_priv *priv,
 
mlx5_fc_query_cached(counter, &bytes, &packets, &lastuse);
 
-   tcf_exts_stats_update(f->exts, bytes, packets, lastuse);
+   flow_stats_update(&f->stats, bytes, packets, lastuse);
 
return 0;
 }
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
index e6c4c672b1ca..60900e53243b 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
@@ -460,7 +460,7 @@ int mlxsw_sp_flower_stats(struct mlxsw_sp *mlxsw_sp,
if (err)
goto err_rule_get_stats;
 
-   tcf_exts_stats_update(f->exts, bytes, packets, lastuse);
+   flow_stats_update(&f->stats, bytes, packets, lastuse);
 
mlxsw_sp_acl_ruleset_put(mlxsw_sp, ruleset);
return 0;
diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c 
b/drivers/net/ethernet/netronome/nfp/flower/offload.c
index 708331234908..524b9ae1a639 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
@@ -532,9 +532,8 @@ nfp_flower_get_stats(struct nfp_app *app, struct net_device 
*netdev,
ctx_id = be32_to_cpu(nfp_flow->meta.host_ctx_id);
 
spin_lock_bh(&priv->stats_lock);
-   tcf_exts_stats_update(flow->exts, priv->stats[ctx_id].bytes,
- priv->stats[ctx_id].pkts,
- priv->stats[ctx_id].used);
+   flow_stats_update(&flow->stats, priv->stats[ctx_id].bytes,
+ priv->stats[ctx_id].pkts, priv->stats[ctx_id].used);
 
priv->stats[ctx_id].pkts = 0;
priv->stats[ctx_id].bytes = 0;
diff --git a/include/net/flow_offload.h b/include/net/flow_offload.h
index dabc819b6cc9..f9ce39992dbd 100644
--- a/include/net/flow_offload.h
+++ b/include/net/flow_offload.h
@@ -179,4 +179,18 @@ static inline bool flow_rule_match_key(const struct 
flow_rule *rule,
return dissector_uses_key(rule->ma
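
[For reference, the container and helper this patch describes would look
roughly as follows -- a sketch of the flow_offload.h additions, not the
verbatim hunk:]

	struct flow_stats {
		u64	pkts;
		u64	bytes;
		u64	lastused;
	};

	static inline void flow_stats_update(struct flow_stats *flow_stats,
					     u64 bytes, u64 pkts, u64 lastused)
	{
		flow_stats->pkts	+= pkts;
		flow_stats->bytes	+= bytes;
		flow_stats->lastused	= max_t(u64, flow_stats->lastused,
						lastused);
	}

[Drivers call flow_stats_update(&cls->stats, ...) as in the hunks above,
and the core then restores the numbers onto the TC actions.]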

[PATCH net-next,v5 12/12] qede: use ethtool_rx_flow_rule() to remove duplicated parser code

2018-12-06 Thread Pablo Neira Ayuso
The qede driver supports both ethtool_rx_flow_spec and flower; the two
codebases look very similar.

This patch uses the ethtool_rx_flow_rule() infrastructure to remove the
duplicated ethtool_rx_flow_spec parser and consolidate ACL offload
support around the flow_rule infrastructure.

Furthermore, more code can be consolidated by merging
qede_add_cls_rule() and qede_add_tc_flower_fltr(); these two functions
also look very similar.

This driver currently provides simple ACL support, such as 5-tuple
matching, drop policy and queue to CPU.

Drivers that support more features can benefit from this infrastructure
to remove even more redundant code.

Signed-off-by: Pablo Neira Ayuso 
---
v5: rebase on top of net-next head.

 drivers/net/ethernet/qlogic/qede/qede_filter.c | 279 +++--
 1 file changed, 76 insertions(+), 203 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qede/qede_filter.c 
b/drivers/net/ethernet/qlogic/qede/qede_filter.c
index ed77950f6cf9..37c0651184ce 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_filter.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_filter.c
@@ -1665,132 +1665,6 @@ static int qede_set_v6_tuple_to_profile(struct qede_dev 
*edev,
return 0;
 }
 
-static int qede_flow_spec_to_tuple_ipv4_common(struct qede_dev *edev,
-  struct qede_arfs_tuple *t,
-  struct ethtool_rx_flow_spec *fs)
-{
-   if ((fs->h_u.tcp_ip4_spec.ip4src &
-fs->m_u.tcp_ip4_spec.ip4src) != fs->h_u.tcp_ip4_spec.ip4src) {
-   DP_INFO(edev, "Don't support IP-masks\n");
-   return -EOPNOTSUPP;
-   }
-
-   if ((fs->h_u.tcp_ip4_spec.ip4dst &
-fs->m_u.tcp_ip4_spec.ip4dst) != fs->h_u.tcp_ip4_spec.ip4dst) {
-   DP_INFO(edev, "Don't support IP-masks\n");
-   return -EOPNOTSUPP;
-   }
-
-   if ((fs->h_u.tcp_ip4_spec.psrc &
-fs->m_u.tcp_ip4_spec.psrc) != fs->h_u.tcp_ip4_spec.psrc) {
-   DP_INFO(edev, "Don't support port-masks\n");
-   return -EOPNOTSUPP;
-   }
-
-   if ((fs->h_u.tcp_ip4_spec.pdst &
-fs->m_u.tcp_ip4_spec.pdst) != fs->h_u.tcp_ip4_spec.pdst) {
-   DP_INFO(edev, "Don't support port-masks\n");
-   return -EOPNOTSUPP;
-   }
-
-   if (fs->h_u.tcp_ip4_spec.tos) {
-   DP_INFO(edev, "Don't support tos\n");
-   return -EOPNOTSUPP;
-   }
-
-   t->eth_proto = htons(ETH_P_IP);
-   t->src_ipv4 = fs->h_u.tcp_ip4_spec.ip4src;
-   t->dst_ipv4 = fs->h_u.tcp_ip4_spec.ip4dst;
-   t->src_port = fs->h_u.tcp_ip4_spec.psrc;
-   t->dst_port = fs->h_u.tcp_ip4_spec.pdst;
-
-   return qede_set_v4_tuple_to_profile(edev, t);
-}
-
-static int qede_flow_spec_to_tuple_tcpv4(struct qede_dev *edev,
-struct qede_arfs_tuple *t,
-struct ethtool_rx_flow_spec *fs)
-{
-   t->ip_proto = IPPROTO_TCP;
-
-   if (qede_flow_spec_to_tuple_ipv4_common(edev, t, fs))
-   return -EINVAL;
-
-   return 0;
-}
-
-static int qede_flow_spec_to_tuple_udpv4(struct qede_dev *edev,
-struct qede_arfs_tuple *t,
-struct ethtool_rx_flow_spec *fs)
-{
-   t->ip_proto = IPPROTO_UDP;
-
-   if (qede_flow_spec_to_tuple_ipv4_common(edev, t, fs))
-   return -EINVAL;
-
-   return 0;
-}
-
-static int qede_flow_spec_to_tuple_ipv6_common(struct qede_dev *edev,
-  struct qede_arfs_tuple *t,
-  struct ethtool_rx_flow_spec *fs)
-{
-   struct in6_addr zero_addr;
-
-   memset(&zero_addr, 0, sizeof(zero_addr));
-
-   if ((fs->h_u.tcp_ip6_spec.psrc &
-fs->m_u.tcp_ip6_spec.psrc) != fs->h_u.tcp_ip6_spec.psrc) {
-   DP_INFO(edev, "Don't support port-masks\n");
-   return -EOPNOTSUPP;
-   }
-
-   if ((fs->h_u.tcp_ip6_spec.pdst &
-fs->m_u.tcp_ip6_spec.pdst) != fs->h_u.tcp_ip6_spec.pdst) {
-   DP_INFO(edev, "Don't support port-masks\n");
-   return -EOPNOTSUPP;
-   }
-
-   if (fs->h_u.tcp_ip6_spec.tclass) {
-   DP_INFO(edev, "Don't support tclass\n");
-   return -EOPNOTSUPP;
-   }
-
-   t->eth_proto = htons(ETH_P_IPV6);
-   memcpy(&t->src_ipv6, &fs->h_u.tcp_ip6_spec.ip6src,
-  sizeof(struct in6_addr));
-   memcpy(&t->dst_ipv6, &fs->h_u.tcp_ip6_spec.ip6dst,
-  sizeof(struct in6_addr));
-   t->src_port = fs->h_u.tcp_ip6_spec.psrc;
-   t->dst_port = fs->h_u.tcp_ip6_spec.pdst;
-
-   return qede_set_v6_tuple_to_profile(edev, t, &zero_addr);
-}
-
-static int qede_flow_spec_to_tuple_tcpv6(struct qede_dev *edev,
-struct qede_arfs_tuple *t,
-  

[PATCH net-next,v5 09/12] ethtool: add ethtool_rx_flow_spec to flow_rule structure translator

2018-12-06 Thread Pablo Neira Ayuso
This patch adds a function to translate the ethtool_rx_flow_spec
structure to the flow_rule representation.

This allows us to reuse code from the driver side given that both flower
and ethtool_rx_flow interfaces use the same representation.

This patch also includes support for the flow type flags FLOW_EXT,
FLOW_MAC_EXT and FLOW_RSS.

The ethtool_rx_flow_spec_input wrapper structure is used to convey the
rss_context field, which lives outside of the ethtool_rx_flow_spec
structure, together with the ethtool_rx_flow_spec structure itself.

Signed-off-by: Pablo Neira Ayuso 
---
v5: support for FLOW_RSS flowtype flag and set rss context in queue action.
Suggested by Michal Kubecek.

 include/linux/ethtool.h |  15 +++
 net/core/ethtool.c  | 240 
 2 files changed, 255 insertions(+)

diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index afd9596ce636..19a8de5326fb 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -400,4 +400,19 @@ struct ethtool_ops {
void(*get_ethtool_phy_stats)(struct net_device *,
 struct ethtool_stats *, u64 *);
 };
+
+struct ethtool_rx_flow_rule {
+   struct flow_rule*rule;
+   unsigned long   priv[0];
+};
+
+struct ethtool_rx_flow_spec_input {
+   const struct ethtool_rx_flow_spec   *fs;
+   u32 rss_ctx;
+};
+
+struct ethtool_rx_flow_rule *
+ethtool_rx_flow_rule_create(const struct ethtool_rx_flow_spec_input *input);
+void ethtool_rx_flow_rule_destroy(struct ethtool_rx_flow_rule *rule);
+
 #endif /* _LINUX_ETHTOOL_H */
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index d05402868575..2711d0737d3f 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -28,6 +28,7 @@
 #include <linux/sched/signal.h>
 #include <net/devlink.h>
 #include <net/xdp_sock.h>
+#include <net/flow_offload.h>
 
 /*
  * Some useful ethtool_ops methods that're device independent.
@@ -2808,3 +2809,242 @@ int dev_ethtool(struct net *net, struct ifreq *ifr)
 
return rc;
 }
+
+struct ethtool_rx_flow_key {
+   struct flow_dissector_key_basic basic;
+   union {
+   struct flow_dissector_key_ipv4_addrsipv4;
+   struct flow_dissector_key_ipv6_addrsipv6;
+   };
+   struct flow_dissector_key_ports tp;
+   struct flow_dissector_key_ipip;
+   struct flow_dissector_key_vlan  vlan;
+   struct flow_dissector_key_eth_addrs eth_addrs;
+} __aligned(BITS_PER_LONG / 8); /* Ensure that we can do comparisons as longs. 
*/
+
+struct ethtool_rx_flow_match {
+   struct flow_dissector   dissector;
+   struct ethtool_rx_flow_key  key;
+   struct ethtool_rx_flow_key  mask;
+};
+
+struct ethtool_rx_flow_rule *
+ethtool_rx_flow_rule_create(const struct ethtool_rx_flow_spec_input *input)
+{
+   const struct ethtool_rx_flow_spec *fs = input->fs;
+   static struct in6_addr zero_addr = {};
+   struct ethtool_rx_flow_match *match;
+   struct ethtool_rx_flow_rule *flow;
+   struct flow_action_entry *act;
+
+   flow = kzalloc(sizeof(struct ethtool_rx_flow_rule) +
+  sizeof(struct ethtool_rx_flow_match), GFP_KERNEL);
+   if (!flow)
+   return ERR_PTR(-ENOMEM);
+
+   /* ethtool_rx supports only one single action per rule. */
+   flow->rule = flow_rule_alloc(1);
+   if (!flow->rule) {
+   kfree(flow);
+   return ERR_PTR(-ENOMEM);
+   }
+
+   match = (struct ethtool_rx_flow_match *)flow->priv;
+   flow->rule->match.dissector = &match->dissector;
+   flow->rule->match.mask  = &match->mask;
+   flow->rule->match.key   = &match->key;
+
+   match->mask.basic.n_proto = htons(0x);
+
+   switch (fs->flow_type & ~(FLOW_EXT | FLOW_MAC_EXT | FLOW_RSS)) {
+   case TCP_V4_FLOW:
+   case UDP_V4_FLOW: {
+   const struct ethtool_tcpip4_spec *v4_spec, *v4_m_spec;
+
+   match->key.basic.n_proto = htons(ETH_P_IP);
+
+   v4_spec = &fs->h_u.tcp_ip4_spec;
+   v4_m_spec = &fs->m_u.tcp_ip4_spec;
+
+   if (v4_m_spec->ip4src) {
+   match->key.ipv4.src = v4_spec->ip4src;
+   match->mask.ipv4.src = v4_m_spec->ip4src;
+   }
+   if (v4_m_spec->ip4dst) {
+   match->key.ipv4.dst = v4_spec->ip4dst;
+   match->mask.ipv4.dst = v4_m_spec->ip4dst;
+   }
+   if (v4_m_spec->ip4src ||
+   v4_m_spec->ip4dst) {
+   match->dissector.used_keys |=
+   BIT(FLOW_DISSECTOR_KEY_IPV4_ADDRS);
+   match->dissector.offset[FLOW_DISSECTOR_KEY_IPV4_ADDRS] =
+   offsetof(struct ethtool_rx_flow_key, ipv4);
+   }
+   if (v4_m_spec->psrc) {
+ 
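
[A sketch of how a driver consumes this translator; the foo_* names are
hypothetical placeholders, while the ethtool_rx_flow_rule_* calls are the
API added by this patch:]

	static int foo_add_ethtool_rule(struct foo_priv *priv,
					const struct ethtool_rx_flow_spec *fs,
					u32 rss_ctx)
	{
		struct ethtool_rx_flow_spec_input input = {
			.fs	 = fs,
			.rss_ctx = rss_ctx,	/* non-zero only with FLOW_RSS */
		};
		struct ethtool_rx_flow_rule *flow;
		int err;

		flow = ethtool_rx_flow_rule_create(&input);
		if (IS_ERR(flow))
			return PTR_ERR(flow);

		/* reuse the driver's existing flow_rule parser (hypothetical) */
		err = foo_parse_flow_rule(priv, flow->rule);

		ethtool_rx_flow_rule_destroy(flow);
		return err;
	}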

[PATCH net-next,v5 06/12] drivers: net: use flow action infrastructure

2018-12-06 Thread Pablo Neira Ayuso
This patch updates drivers to use the new flow action infrastructure.

Signed-off-by: Pablo Neira Ayuso 
---
v5: rebase on top of net-next head.

 drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c   |  74 +++---
 .../net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c   | 250 +--
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c| 266 ++---
 drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c |   2 +-
 .../net/ethernet/mellanox/mlxsw/spectrum_flower.c  |  54 +++--
 drivers/net/ethernet/netronome/nfp/flower/action.c | 187 ---
 drivers/net/ethernet/qlogic/qede/qede_filter.c |  12 +-
 7 files changed, 418 insertions(+), 427 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
index 09cd75f54eba..b7bd27edd80e 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
@@ -61,9 +61,9 @@ static u16 bnxt_flow_get_dst_fid(struct bnxt *pf_bp, struct 
net_device *dev)
 
 static int bnxt_tc_parse_redir(struct bnxt *bp,
   struct bnxt_tc_actions *actions,
-  const struct tc_action *tc_act)
+  const struct flow_action_entry *act)
 {
-   struct net_device *dev = tcf_mirred_dev(tc_act);
+   struct net_device *dev = act->dev;
 
if (!dev) {
netdev_info(bp->dev, "no dev in mirred action");
@@ -77,16 +77,16 @@ static int bnxt_tc_parse_redir(struct bnxt *bp,
 
 static int bnxt_tc_parse_vlan(struct bnxt *bp,
  struct bnxt_tc_actions *actions,
- const struct tc_action *tc_act)
+ const struct flow_action_entry *act)
 {
-   switch (tcf_vlan_action(tc_act)) {
-   case TCA_VLAN_ACT_POP:
+   switch (act->id) {
+   case FLOW_ACTION_VLAN_POP:
actions->flags |= BNXT_TC_ACTION_FLAG_POP_VLAN;
break;
-   case TCA_VLAN_ACT_PUSH:
+   case FLOW_ACTION_VLAN_PUSH:
actions->flags |= BNXT_TC_ACTION_FLAG_PUSH_VLAN;
-   actions->push_vlan_tci = htons(tcf_vlan_push_vid(tc_act));
-   actions->push_vlan_tpid = tcf_vlan_push_proto(tc_act);
+   actions->push_vlan_tci = htons(act->vlan.vid);
+   actions->push_vlan_tpid = act->vlan.proto;
break;
default:
return -EOPNOTSUPP;
@@ -96,10 +96,10 @@ static int bnxt_tc_parse_vlan(struct bnxt *bp,
 
 static int bnxt_tc_parse_tunnel_set(struct bnxt *bp,
struct bnxt_tc_actions *actions,
-   const struct tc_action *tc_act)
+   const struct flow_action_entry *act)
 {
-   struct ip_tunnel_info *tun_info = tcf_tunnel_info(tc_act);
-   struct ip_tunnel_key *tun_key = &tun_info->key;
+   const struct ip_tunnel_info *tun_info = act->tunnel;
+   const struct ip_tunnel_key *tun_key = &tun_info->key;
 
if (ip_tunnel_info_af(tun_info) != AF_INET) {
netdev_info(bp->dev, "only IPv4 tunnel-encap is supported");
@@ -113,51 +113,43 @@ static int bnxt_tc_parse_tunnel_set(struct bnxt *bp,
 
 static int bnxt_tc_parse_actions(struct bnxt *bp,
 struct bnxt_tc_actions *actions,
-struct tcf_exts *tc_exts)
+struct flow_action *flow_action)
 {
-   const struct tc_action *tc_act;
+   struct flow_action_entry *act;
int i, rc;
 
-   if (!tcf_exts_has_actions(tc_exts)) {
+   if (!flow_action_has_entries(flow_action)) {
netdev_info(bp->dev, "no actions");
return -EINVAL;
}
 
-   tcf_exts_for_each_action(i, tc_act, tc_exts) {
-   /* Drop action */
-   if (is_tcf_gact_shot(tc_act)) {
+   flow_action_for_each(i, act, flow_action) {
+   switch (act->id) {
+   case FLOW_ACTION_DROP:
actions->flags |= BNXT_TC_ACTION_FLAG_DROP;
return 0; /* don't bother with other actions */
-   }
-
-   /* Redirect action */
-   if (is_tcf_mirred_egress_redirect(tc_act)) {
-   rc = bnxt_tc_parse_redir(bp, actions, tc_act);
+   case FLOW_ACTION_REDIRECT:
+   rc = bnxt_tc_parse_redir(bp, actions, act);
if (rc)
return rc;
-   continue;
-   }
-
-   /* Push/pop VLAN */
-   if (is_tcf_vlan(tc_act)) {
-   rc = bnxt_tc_parse_vlan(bp, actions, tc_act);
+   break;
+   case FLOW_ACTION_VLAN_POP:
+   case FLOW_ACTION_VLAN_PUSH:
+   case FLOW_ACTION_VLAN_MANGLE:
+   rc = bnxt_tc_parse_v

[PATCH net-next,v5 08/12] flow_offload: add wake-up-on-lan and queue to flow_action

2018-12-06 Thread Pablo Neira Ayuso
These actions need to be added to support the ethtool_rx_flow interface.
The queue action includes a field to specify the RSS context, which is
set via the FLOW_RSS flow type flag and the rss_context field in struct
ethtool_rxnfc, plus the corresponding queue index. FLOW_RSS implies that
rss_context is non-zero; therefore, queue.ctx == 0 means that FLOW_RSS
was not set.

Signed-off-by: Pablo Neira Ayuso 
---
v5: add queue structure and context field, per Michal Kubecek.

 include/net/flow_offload.h | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/include/net/flow_offload.h b/include/net/flow_offload.h
index f9ce39992dbd..6489fb9eb394 100644
--- a/include/net/flow_offload.h
+++ b/include/net/flow_offload.h
@@ -116,6 +116,8 @@ enum flow_action_id {
FLOW_ACTION_ADD,
FLOW_ACTION_CSUM,
FLOW_ACTION_MARK,
+   FLOW_ACTION_WAKE,
+   FLOW_ACTION_QUEUE,
 };
 
 /* This is mirroring enum pedit_header_type definition for easy mapping between
@@ -150,6 +152,10 @@ struct flow_action_entry {
const struct ip_tunnel_info *tunnel;/* 
FLOW_ACTION_TUNNEL_ENCAP */
u32 csum_flags; /* FLOW_ACTION_CSUM */
u32 mark;   /* FLOW_ACTION_MARK */
+   struct {/* FLOW_ACTION_QUEUE */
+   u32 ctx;
+   u32 index;
+   } queue;
};
 };
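
[A sketch of the driver side; the foo_* helpers are hypothetical, while the
queue fields are the ones added above:]

	static int foo_parse_actions(struct foo_priv *priv,
				     const struct flow_rule *rule)
	{
		const struct flow_action_entry *act;
		int i;

		flow_action_for_each(i, act, &rule->action) {
			switch (act->id) {
			case FLOW_ACTION_QUEUE:
				if (act->queue.ctx)	/* FLOW_RSS was set */
					foo_set_rss_context(priv, act->queue.ctx,
							    act->queue.index);
				else
					foo_steer_to_queue(priv,
							   act->queue.index);
				break;
			case FLOW_ACTION_WAKE:
				foo_arm_wol_filter(priv);
				break;
			default:
				return -EOPNOTSUPP;
			}
		}
		return 0;
	}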
 
-- 
2.11.0



[PATCH net-next,v5 00/12] add flow_rule infrastructure

2018-12-06 Thread Pablo Neira Ayuso
Hi,

This is another iteration of the in-kernel intermediate representation
(IR) that allows ACL hardware offloads to be expressed using one unified
representation on the driver side for both the ethtool and tc
frontends [1] [2] [3].

In the words of Michal Kubecek:

"... the ethtool interface can apply four types of action to matching
packets:

  - put into a specific queue
  - discard
  - distribute across the queues according to an RSS context
  - use the rule as a wake-on-lan filter"

This new round now supports these four types, which can be mapped to
the flow_rule representation.

Changes from previous version:

* Michal Kubecek:
- Add support for FLOW_RSS flag to the ethtool_rx_flow_spec
  to flow_rule translator.

* Venkat Duvvuru:
- Fix accidental swapping of flow_stats_update() bytes and
  packets parameter.

* kbuild robot:
- Fix double kfree in error path from cls_flower, via Julia Lawall.
- Fix enum type mismatch in nfp driver reported by sparse
  checks.

Please apply, thanks.

Pablo Neira Ayuso (12):
  flow_offload: add flow_rule and flow_match structures and use them
  net/mlx5e: support for two independent packet edit actions
  flow_offload: add flow action infrastructure
  cls_api: add translator to flow_action representation
  flow_offload: add statistics retrieval infrastructure and use it
  drivers: net: use flow action infrastructure
  cls_flower: don't expose TC actions to drivers anymore
  flow_offload: add wake-up-on-lan and queue to flow_action
  ethtool: add ethtool_rx_flow_spec to flow_rule structure translator
  dsa: bcm_sf2: use flow_rule infrastructure
  qede: place ethtool_rx_flow_spec code after TC flower codebase
  qede: use ethtool_rx_flow_rule() to remove duplicated parser code

 drivers/net/dsa/bcm_sf2_cfp.c  | 102 ++-
 drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c   | 252 +++
 .../net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c   | 450 ++---
 drivers/net/ethernet/intel/i40e/i40e_main.c| 178 ++---
 drivers/net/ethernet/intel/iavf/iavf_main.c| 195 +++---
 drivers/net/ethernet/intel/igb/igb_main.c  |  64 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c| 743 ++---
 drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c |   2 +-
 .../net/ethernet/mellanox/mlxsw/spectrum_flower.c  | 258 ---
 drivers/net/ethernet/netronome/nfp/flower/action.c | 198 +++---
 drivers/net/ethernet/netronome/nfp/flower/match.c  | 417 ++--
 .../net/ethernet/netronome/nfp/flower/offload.c| 150 ++---
 drivers/net/ethernet/qlogic/qede/qede_filter.c | 572 ++--
 include/linux/ethtool.h|  15 +
 include/net/flow_offload.h | 202 ++
 include/net/pkt_cls.h  |  18 +-
 net/core/Makefile  |   2 +-
 net/core/ethtool.c | 240 +++
 net/core/flow_offload.c| 153 +
 net/sched/cls_api.c| 116 
 net/sched/cls_flower.c |  71 +-
 21 files changed, 2403 insertions(+), 1995 deletions(-)
 create mode 100644 include/net/flow_offload.h
 create mode 100644 net/core/flow_offload.c

-- 
2.11.0



[PATCH net-next,v5 02/12] net/mlx5e: support for two independent packet edit actions

2018-12-06 Thread Pablo Neira Ayuso
This patch adds the pedit_headers_action structure to store the result
of parsing tc pedit actions. Then, it calls alloc_tc_pedit_action() to
populate the mlx5e hardware intermediate representation once all actions
have been parsed.

This patch comes in preparation for the new flow_action infrastructure,
where each packet mangle comes as a separate action, i.e. not packed as
in tc pedit.

Signed-off-by: Pablo Neira Ayuso 
---
v5: rebase on top of net-next head.

 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 81 ++---
 1 file changed, 59 insertions(+), 22 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index d2e6c6578b9c..1daaab91280f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -1751,6 +1751,12 @@ struct pedit_headers {
struct udphdr  udp;
 };
 
+struct pedit_headers_action {
+   struct pedit_headersvals;
+   struct pedit_headersmasks;
+   u32 pedits;
+};
+
 static int pedit_header_offsets[] = {
[TCA_PEDIT_KEY_EX_HDR_TYPE_ETH] = offsetof(struct pedit_headers, eth),
[TCA_PEDIT_KEY_EX_HDR_TYPE_IP4] = offsetof(struct pedit_headers, ip4),
@@ -1762,16 +1768,15 @@ static int pedit_header_offsets[] = {
 #define pedit_header(_ph, _htype) ((void *)(_ph) + 
pedit_header_offsets[_htype])
 
 static int set_pedit_val(u8 hdr_type, u32 mask, u32 val, u32 offset,
-struct pedit_headers *masks,
-struct pedit_headers *vals)
+struct pedit_headers_action *hdrs)
 {
u32 *curr_pmask, *curr_pval;
 
if (hdr_type >= __PEDIT_HDR_TYPE_MAX)
goto out_err;
 
-   curr_pmask = (u32 *)(pedit_header(masks, hdr_type) + offset);
-   curr_pval  = (u32 *)(pedit_header(vals, hdr_type) + offset);
+   curr_pmask = (u32 *)(pedit_header(&hdrs->masks, hdr_type) + offset);
+   curr_pval  = (u32 *)(pedit_header(&hdrs->vals, hdr_type) + offset);
 
if (*curr_pmask & mask)  /* disallow acting twice on the same location 
*/
goto out_err;
@@ -1827,8 +1832,7 @@ static struct mlx5_fields fields[] = {
  * max from the SW pedit action. On success, it says how many HW actions were
  * actually parsed.
  */
-static int offload_pedit_fields(struct pedit_headers *masks,
-   struct pedit_headers *vals,
+static int offload_pedit_fields(struct pedit_headers_action *hdrs,
struct mlx5e_tc_flow_parse_attr *parse_attr,
struct netlink_ext_ack *extack)
 {
@@ -1843,10 +1847,10 @@ static int offload_pedit_fields(struct pedit_headers 
*masks,
__be16 mask_be16;
void *action;
 
-   set_masks = &masks[TCA_PEDIT_KEY_EX_CMD_SET];
-   add_masks = &masks[TCA_PEDIT_KEY_EX_CMD_ADD];
-   set_vals = &vals[TCA_PEDIT_KEY_EX_CMD_SET];
-   add_vals = &vals[TCA_PEDIT_KEY_EX_CMD_ADD];
+   set_masks = &hdrs[TCA_PEDIT_KEY_EX_CMD_SET].masks;
+   add_masks = &hdrs[TCA_PEDIT_KEY_EX_CMD_ADD].masks;
+   set_vals = &hdrs[TCA_PEDIT_KEY_EX_CMD_SET].vals;
+   add_vals = &hdrs[TCA_PEDIT_KEY_EX_CMD_ADD].vals;
 
action_size = MLX5_UN_SZ_BYTES(set_action_in_add_action_in_auto);
action = parse_attr->mod_hdr_actions;
@@ -1942,12 +1946,14 @@ static int offload_pedit_fields(struct pedit_headers 
*masks,
 }
 
 static int alloc_mod_hdr_actions(struct mlx5e_priv *priv,
-const struct tc_action *a, int namespace,
+struct pedit_headers_action *hdrs,
+int namespace,
 struct mlx5e_tc_flow_parse_attr *parse_attr)
 {
int nkeys, action_size, max_actions;
 
-   nkeys = tcf_pedit_nkeys(a);
+   nkeys = hdrs[TCA_PEDIT_KEY_EX_CMD_SET].pedits +
+   hdrs[TCA_PEDIT_KEY_EX_CMD_ADD].pedits;
action_size = MLX5_UN_SZ_BYTES(set_action_in_add_action_in_auto);
 
if (namespace == MLX5_FLOW_NAMESPACE_FDB) /* FDB offloading */
@@ -1971,18 +1977,15 @@ static const struct pedit_headers zero_masks = {};
 static int parse_tc_pedit_action(struct mlx5e_priv *priv,
 const struct tc_action *a, int namespace,
 struct mlx5e_tc_flow_parse_attr *parse_attr,
+struct pedit_headers_action *hdrs,
 struct netlink_ext_ack *extack)
 {
-   struct pedit_headers masks[__PEDIT_CMD_MAX], vals[__PEDIT_CMD_MAX], 
*cmd_masks;
int nkeys, i, err = -EOPNOTSUPP;
u32 mask, val, offset;
u8 cmd, htype;
 
nkeys = tcf_pedit_nkeys(a);
 
-   memset(masks, 0, sizeof(struct pedit_headers) * __PEDIT_CMD_MAX);
-   memset(vals,  0, sizeof(struct pedit_headers) * __PEDIT_CMD_MAX);
-
for (i = 0; i < nkeys; i++) {
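
[With this in place, the caller accumulates all pedit keys into one shared
hdrs[] array and emits the hardware actions once at the end, roughly like
the sketch below; exact signatures live in en_tc.c, this is illustrative:]

	struct pedit_headers_action hdrs[__PEDIT_CMD_MAX] = {};
	const struct tc_action *a;
	int i, err;

	tcf_exts_for_each_action(i, a, exts) {
		if (is_tcf_pedit(a)) {
			err = parse_tc_pedit_action(priv, a, namespace,
						    parse_attr, hdrs, extack);
			if (err)
				return err;
		}
		/* ... other actions ... */
	}

	/* one translation pass for all accumulated pedit keys */
	err = alloc_tc_pedit_action(priv, namespace, parse_attr, hdrs, extack);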

[PATCH net-next,v5 03/12] flow_offload: add flow action infrastructure

2018-12-06 Thread Pablo Neira Ayuso
This new infrastructure defines the NIC actions that you can perform
from existing network drivers. It allows us to avoid a direct
dependency on the native software TC action representation.

Signed-off-by: Pablo Neira Ayuso 
---
v5: rebase on top of net-next head.

 include/net/flow_offload.h | 69 +-
 include/net/pkt_cls.h  |  2 ++
 net/core/flow_offload.c| 14 --
 net/sched/cls_api.c| 17 
 net/sched/cls_flower.c |  7 +++--
 5 files changed, 103 insertions(+), 6 deletions(-)

diff --git a/include/net/flow_offload.h b/include/net/flow_offload.h
index 461c66595763..dabc819b6cc9 100644
--- a/include/net/flow_offload.h
+++ b/include/net/flow_offload.h
@@ -100,11 +100,78 @@ void flow_rule_match_enc_keyid(const struct flow_rule 
*rule,
 void flow_rule_match_enc_opts(const struct flow_rule *rule,
  struct flow_match_enc_opts *out);
 
+enum flow_action_id {
+   FLOW_ACTION_ACCEPT  = 0,
+   FLOW_ACTION_DROP,
+   FLOW_ACTION_TRAP,
+   FLOW_ACTION_GOTO,
+   FLOW_ACTION_REDIRECT,
+   FLOW_ACTION_MIRRED,
+   FLOW_ACTION_VLAN_PUSH,
+   FLOW_ACTION_VLAN_POP,
+   FLOW_ACTION_VLAN_MANGLE,
+   FLOW_ACTION_TUNNEL_ENCAP,
+   FLOW_ACTION_TUNNEL_DECAP,
+   FLOW_ACTION_MANGLE,
+   FLOW_ACTION_ADD,
+   FLOW_ACTION_CSUM,
+   FLOW_ACTION_MARK,
+};
+
+/* This is mirroring enum pedit_header_type definition for easy mapping between
+ * tc pedit action. Legacy TCA_PEDIT_KEY_EX_HDR_TYPE_NETWORK is mapped to
+ * FLOW_ACT_MANGLE_UNSPEC, which is supported by no driver.
+ */
+enum flow_action_mangle_base {
+   FLOW_ACT_MANGLE_UNSPEC  = 0,
+   FLOW_ACT_MANGLE_HDR_TYPE_ETH,
+   FLOW_ACT_MANGLE_HDR_TYPE_IP4,
+   FLOW_ACT_MANGLE_HDR_TYPE_IP6,
+   FLOW_ACT_MANGLE_HDR_TYPE_TCP,
+   FLOW_ACT_MANGLE_HDR_TYPE_UDP,
+};
+
+struct flow_action_entry {
+   enum flow_action_id id;
+   union {
+   u32 chain_index;/* FLOW_ACTION_GOTO */
+   struct net_device   *dev;   /* FLOW_ACTION_REDIRECT 
*/
+   struct {/* FLOW_ACTION_VLAN */
+   u16 vid;
+   __be16  proto;
+   u8  prio;
+   } vlan;
+   struct {/* 
FLOW_ACTION_PACKET_EDIT */
+   enum flow_action_mangle_base htype;
+   u32 offset;
+   u32 mask;
+   u32 val;
+   } mangle;
+   const struct ip_tunnel_info *tunnel;/* 
FLOW_ACTION_TUNNEL_ENCAP */
+   u32 csum_flags; /* FLOW_ACTION_CSUM */
+   u32 mark;   /* FLOW_ACTION_MARK */
+   };
+};
+
+struct flow_action {
+   unsigned intnum_entries;
+   struct flow_action_entryentries[0];
+};
+
+static inline bool flow_action_has_entries(const struct flow_action *action)
+{
+   return action->num_entries;
+}
+
+#define flow_action_for_each(__i, __act, __actions)\
+for (__i = 0, __act = &(__actions)->entries[0]; __i < 
(__actions)->num_entries; __act = &(__actions)->entries[__i++])
+
 struct flow_rule {
struct flow_match   match;
+   struct flow_action  action;
 };
 
-struct flow_rule *flow_rule_alloc(void);
+struct flow_rule *flow_rule_alloc(unsigned int num_actions);
 
 static inline bool flow_rule_match_key(const struct flow_rule *rule,
   enum flow_dissector_key_id key)
diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index 359876ee32be..9ceac97e5eff 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -620,6 +620,8 @@ tcf_match_indev(struct sk_buff *skb, int ifindex)
 }
 #endif /* CONFIG_NET_CLS_IND */
 
+unsigned int tcf_exts_num_actions(struct tcf_exts *exts);
+
 int tc_setup_cb_call(struct tcf_block *block, struct tcf_exts *exts,
 enum tc_setup_type type, void *type_data, bool err_stop);
 
diff --git a/net/core/flow_offload.c b/net/core/flow_offload.c
index 2fbf6903d2f6..c3a00eac4804 100644
--- a/net/core/flow_offload.c
+++ b/net/core/flow_offload.c
@@ -3,9 +3,19 @@
 #include <linux/slab.h>
 #include <net/flow_offload.h>
 
-struct flow_rule *flow_rule_alloc(void)
+struct flow_rule *flow_rule_alloc(unsigned int num_actions)
 {
-   return kzalloc(sizeof(struct flow_rule), GFP_KERNEL);
+   struct flow_rule *rule;
+
+   rule = kzalloc(sizeof(struct flow_rule) +
+  sizeof(struct flow_action_entry) * num_actions,
+  GFP_KERNEL);
+   if (!rule)
+   return NULL;
+
+   rule->action.num_entries = num_actions;
+
+   return rule;
 }
 EXPORT_SYMBOL(flo

[PATCH net-next] neighbour: Improve garbage collection

2018-12-06 Thread David Ahern
From: David Ahern 

The existing garbage collection algorithm has a number of problems:

1. The gc algorithm will not evict PERMANENT entries as those entries
   are managed by userspace, yet the existing algorithm walks the entire
   hash table which means it always considers PERMANENT entries when
   looking for entries to evict. In some use cases (e.g., EVPN) there
   can be tens of thousands of PERMANENT entries leading to wasted
   CPU cycles when gc kicks in. As an example, with 32k permanent
   entries, neigh_alloc has been observed taking more than 4 msec per
   invocation.

2. Currently, when the number of neighbor entries hits gc_thresh2 and
   the last flush for the table was more than 5 seconds ago, gc kicks in
   and walks the entire hash table, evicting *all* entries not in PERMANENT
   or REACHABLE state and not marked as externally learned. There is no
   discriminator on when the neigh entry was created or if it just moved
   from REACHABLE to another NUD_VALID state (e.g., NUD_STALE).

   It is possible for entries to be created or for established neighbor
   entries to be moved to STALE (e.g., an external node sends an ARP
   request) right before the 5 second window lapses:

-----|---------x|--------|----->
    t-5         t       t+5

   If that happens those entries are evicted during gc causing unnecessary
   thrashing on neighbor entries and userspace caches trying to track them.

   Further, this contradicts the description of gc_thresh2 which says
   "Entries older than 5 seconds will be cleared".

   One workaround is to make gc_thresh2 == gc_thresh3 but that negates the
   whole point of having separate thresholds.

3. Clearing *all* neigh non-PERMANENT/REACHABLE/externally learned entries
   when gc_thresh2 is exceeded is overkill and contributes to thrashing,
   especially during startup.

This patch addresses these problems as follows:
1. use of a separate list_head to track entries that can be garbage
   collected along with a separate counter. PERMANENT entries are not
   added to this list.

   The gc_thresh parameters are only compared to the new counter, not the
   total entries in the table. The forced_gc function is updated to only
   walk this new gc_list looking for entries to evict.

2. Entries are added to the list head at the tail and removed from the
   front.

3. Entries are only evicted if they were last updated more than 5 seconds
   ago, adhering to the original intent of gc_thresh2.

4. Forced gc is stopped once the number of gc_entries drops below
   gc_thresh2.

5. Since gc checks do not apply to PERMANENT entries, gc levels are skipped
   when allocating a new neighbor for a PERMANENT entry. By extension this
   means there are no explicit limits on the number of PERMANENT entries
   that can be created, but this is no different than FIB entries or FDB
   entries.

Signed-off-by: David Ahern 
---
 Documentation/networking/ip-sysctl.txt |   4 +-
 include/net/neighbour.h|   4 ++
 net/core/neighbour.c   | 122 +++--
 3 files changed, 93 insertions(+), 37 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt 
b/Documentation/networking/ip-sysctl.txt
index af2a69439b93..acdfb5d2bcaa 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -108,8 +108,8 @@ neigh/default/gc_thresh2 - INTEGER
Default: 512
 
 neigh/default/gc_thresh3 - INTEGER
-   Maximum number of neighbor entries allowed.  Increase this
-   when using large numbers of interfaces and when communicating
+   Maximum number of non-PERMANENT neighbor entries allowed.  Increase
+   this when using large numbers of interfaces and when communicating
with large numbers of directly-connected peers.
Default: 1024
 
diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index f58b384aa6c9..846ad8da91eb 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -154,6 +154,8 @@ struct neighbour {
struct hh_cache hh;
int (*output)(struct neighbour *, struct sk_buff *);
const struct neigh_ops  *ops;
+   struct list_headgc_list;
+   boolon_gc_list;
struct rcu_head rcu;
struct net_device   *dev;
u8  primary_key[0];
@@ -214,6 +216,8 @@ struct neigh_table {
struct timer_list   proxy_timer;
struct sk_buff_head proxy_queue;
atomic_tentries;
+   atomic_tgc_entries;
+   struct list_headgc_list;
rwlock_tlock;
unsigned long   last_rand;
struct neigh_statistics __percpu *stats;
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 6d479b5562be..ab11e94ec44d 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -118,6 +118,36 @@ unsigned long neigh_rand_r
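
[A simplified sketch of the reworked forced gc described above -- not the
verbatim patch; per-neighbour locking and stats counting are elided:]

	static int neigh_forced_gc(struct neigh_table *tbl)
	{
		int max_clean = atomic_read(&tbl->gc_entries) -
				tbl->gc_thresh2;
		unsigned long tref = jiffies - 5 * HZ;
		struct neighbour *n, *tmp;
		int shrunk = 0;

		write_lock_bh(&tbl->lock);

		/* gc_list is kept oldest-first: fresh or refreshed entries
		 * move to the tail, so eviction walks from the head
		 */
		list_for_each_entry_safe(n, tmp, &tbl->gc_list, gc_list) {
			if (refcount_read(&n->refcnt) == 1 &&
			    (n->nud_state == NUD_FAILED ||
			     time_after(tref, n->updated)) &&
			    neigh_remove_one(n, tbl))
				shrunk++;
			if (shrunk >= max_clean)
				break;
		}

		tbl->last_flush = jiffies;
		write_unlock_bh(&tbl->lock);

		return shrunk;
	}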

Re: OMAP4430 SDP with KS8851: very slow networking

2018-12-06 Thread Tony Lindgren
* Russell King - ARM Linux  [181206 18:08]:
> reverted, the problem is still there.  Revert:
> 
> ec0daae685b2 ("gpio: omap: Add level wakeup handling for omap4 based SoCs")
> 
> on top, and networking returns to normal.  So it appears to be this
> last commit causing the issue.
> 
> With that and b764a5863fd8 applied, it still misbehaves.  Then, poking
> at the OMAP4_GPIO_IRQWAKEN0 register, changing it from 0 to 4 with
> devmem2 restores normal behaviour - ping times are normal and NFS is
> happy.
> 
> # devmem2 0x48055044 w 4

OK thanks.

> Given that this GPIO device is not runtime suspended, and is
> permanently active (which is what I think we expect, given that it
> has an IRQ claimed against it) does the hardware still attempt to
> idle the GPIO block - if so, could that be why we need to program
> the wakeup register, so the GPIO block signals that it's active?

Yes we now idle non-irq GPIOs only from CPU_CLUSTER_PM_ENTER
as the selected cpuidle state triggers the domain transitions
with WFI. And that's why runtime_suspended_time does not increase
for a GPIO instance with IRQs.

I can reproduce the long ping latencies on duovero smsc connected
to gpio_44, I'll try to debug it more.

Regards,

Tony


Re: [PATCH net-next 2/2] net: dsa: Set the master device's MTU to account for DSA overheads

2018-12-06 Thread Stephen Hemminger
On Thu,  6 Dec 2018 11:36:05 +0100
Andrew Lunn  wrote:

> +void dsa_master_set_mtu(struct net_device *dev, struct dsa_port *cpu_dp)
> +{
> + unsigned int mtu = ETH_DATA_LEN + cpu_dp->tag_ops->overhead;
> + int err;
> +
> + rtnl_lock();
> + if (mtu <= dev->max_mtu) {
> + err = dev_set_mtu(dev, mtu);
> + if (err)
> + netdev_dbg(dev, "Unable to set MTU to include for DSA 
> overheads\n");
> + }
> + rtnl_unlock();
> +}
> +

You don't need the debug message. Use extack instead?

Debug messages are usually disabled in most distributions.


Re: [PATCH net-next 2/2] net: dsa: Set the master device's MTU to account for DSA overheads

2018-12-06 Thread David Miller
From: Andrew Lunn 
Date: Thu, 6 Dec 2018 21:48:46 +0100

> David has already accepted the patchset, so i will add a followup
> patch.

Yeah sorry for jumping the gun, the changes looked pretty
straightforward to me. :-/


Re: [PATCH net-next v2 0/8] Pass extack to NETDEV_PRE_UP

2018-12-06 Thread David Miller
From: Petr Machata 
Date: Thu, 6 Dec 2018 17:05:35 +

> Drivers may need to validate configuration of a device that's about to
> be upped. An example is mlxsw, which needs to check the configuration of
> a VXLAN device attached to an offloaded bridge. Should the validation
> fail, there's currently no way to communicate details of the failure to
> the user, beyond an error number.
> 
> Therefore this patch set extends the NETDEV_PRE_UP event to include
> extack, if available.
 ...

Series applied, thank you.


Re: [PATCH net 0/4] mlxsw: Various fixes

2018-12-06 Thread David Miller
From: Ido Schimmel 
Date: Thu, 6 Dec 2018 17:44:48 +

> Patches #1 and #2 fix two VxLAN related issues. The first patch removes
> warnings that can currently be triggered from user space. Second patch
> avoids leaking a FID in an error path.
> 
> Patch #3 fixes a too strict check that causes certain host routes not to
> be promoted to perform GRE decapsulation in hardware.
> 
> Last patch avoids a use-after-free when deleting a VLAN device via an
> ioctl when it is enslaved to a bridge. I have a patchset for net-next
> that reworks this code and makes the driver more robust.

Series applied.


[PATCH] net: wireless: FTM: fix kernel-doc "cannot understand" warnings

2018-12-06 Thread Randy Dunlap
From: Randy Dunlap 

Fix kernel-doc warnings in FTM due to missing "struct" keyword.

Fixes 109 warnings from:
../include/net/cfg80211.h:2838: warning: cannot understand function prototype: 
'struct cfg80211_ftm_responder_stats '

and fixes 88 warnings from:
../include/net/mac80211.h:477: warning: cannot understand function prototype: 
'struct ieee80211_ftm_responder_params '

Fixes: 81e54d08d9d8 ("cfg80211: support FTM responder configuration/statistics")
Fixes: bc847970f432 ("mac80211: support FTM responder configuration/statistics")

Signed-off-by: Randy Dunlap 
Cc: Pradeep Kumar Chitrapu 
Cc: Johannes Berg 
Cc: David Spinadel 
---
 include/net/cfg80211.h |2 +-
 include/net/mac80211.h |2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

--- lnx-420-rc5.orig/include/net/cfg80211.h
+++ lnx-420-rc5/include/net/cfg80211.h
@@ -2815,7 +2815,7 @@ struct cfg80211_external_auth_params {
 };
 
 /**
- * cfg80211_ftm_responder_stats - FTM responder statistics
+ * struct cfg80211_ftm_responder_stats - FTM responder statistics
  *
  * @filled: bitflag of flags using the bits of &enum nl80211_ftm_stats to
  * indicate the relevant values in this struct for them
--- lnx-420-rc5.orig/include/net/mac80211.h
+++ lnx-420-rc5/include/net/mac80211.h
@@ -467,7 +467,7 @@ struct ieee80211_mu_group_data {
 };
 
 /**
- * ieee80211_ftm_responder_params - FTM responder parameters
+ * struct ieee80211_ftm_responder_params - FTM responder parameters
  *
  * @lci: LCI subelement content
  * @civicloc: CIVIC location subelement content




Re: [PATCH net-next 2/2] net: dsa: Set the master device's MTU to account for DSA overheads

2018-12-06 Thread Andrew Lunn
On Thu, Dec 06, 2018 at 12:21:31PM -0800, Florian Fainelli wrote:
> On 12/6/18 2:36 AM, Andrew Lunn wrote:
> > DSA tagging of frames sent over the master interface to the switch
> > increases the size of the frame. Such frames can then be bigger than
> > the normal MTU of the master interface, and it may drop them. Use the
> > overhead information from the tagger to set the MTU of the master
> > device to include this overhead.
> > 
> > Signed-off-by: Andrew Lunn 
> > ---
> >  net/dsa/master.c | 16 
> >  1 file changed, 16 insertions(+)
> > 
> > diff --git a/net/dsa/master.c b/net/dsa/master.c
> > index c90ee3227dea..42f525bc68e2 100644
> > --- a/net/dsa/master.c
> > +++ b/net/dsa/master.c
> > @@ -158,8 +158,24 @@ static void dsa_master_ethtool_teardown(struct 
> > net_device *dev)
> > cpu_dp->orig_ethtool_ops = NULL;
> >  }
> >  
> > +void dsa_master_set_mtu(struct net_device *dev, struct dsa_port *cpu_dp)
> > +{
> > +   unsigned int mtu = ETH_DATA_LEN + cpu_dp->tag_ops->overhead;
> > +   int err;
> > +
> > +   rtnl_lock();
> > +   if (mtu <= dev->max_mtu) {
> > +   err = dev_set_mtu(dev, mtu);
> > +   if (err)
> > +   netdev_dbg(dev, "Unable to set MTU to include for DSA 
> > overheads\n");
> > +   }
> 
> Would it make sense to warn the user that there might be
> transmit/receive issues with the DSA tagging protocol if either
> dev_set_mtu() fails or mtu > dev->max_mtu?

I thought about that. But we have setups which work today with the
standard MTU. The master might not implement the set_mtu op, or might
impose the standard MTU, but be quite happy to deal with our DSA
packets. So i wanted to make this a hint to the master device, not a
strong requirement.

> Not that I think it matters too much to people because unbinding the
> switch driver and expecting the CPU port to continue operating is
> wishful thinking, but we should probably unwind that operation in
> dsa_master_teardown(), right?

That would make sense.

David has already accepted the patchset, so i will add a followup
patch.

Andrew
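
[Such a follow-up might look like the sketch below, mirroring
dsa_master_set_mtu() above and called from dsa_master_teardown();
illustrative only:]

	static void dsa_master_reset_mtu(struct net_device *dev)
	{
		int err;

		rtnl_lock();
		err = dev_set_mtu(dev, ETH_DATA_LEN);
		if (err)
			netdev_dbg(dev,
				   "Unable to reset MTU to exclude DSA overheads\n");
		rtnl_unlock();
	}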


Re: [PATCH RFC 2/6] net: dsa: microchip: Add MIB counter reading support

2018-12-06 Thread Andrew Lunn
> I wonder what is the official way to clear the counters.

I don't think there is one, other than unloading the driver and
loading it again.

Andrew


Re: [PATCH RFC 5/6] net: dsa: microchip: Update tag_ksz.c to access switch driver

2018-12-06 Thread Andrew Lunn
> I did try to implement this way.  But the other switches do not have the same
> format even though the length is the same.  Then I need to change the 
> following
> files for any new KSZ switch: include/linux/dsa.h, net/dsa/dsa.c, 
> net/dsa/dsa_priv.h,
> and finally net/dsa/tag_ksz.c.

You can always add two different tag drivers. They don't have to share
code if it does not make sense.

> Even then it will not work if Microchip wants to add 1588 PTP
> capability to the switches.
> 
> For KSZ9477 the length of the tail tag changes when the PTP function
> is enabled.  Typically this function is either enabled or disabled
> all the time, but if users want to change that during normal
> operation to see how the switch behaves, the transmit function
> completely stops working correctly.

We should figure out how to support PTP. I think that is the main
issue here.

> Older driver implementation is to monitor that register change and adjust the 
> length
> dynamically.
> 
> Another problem is the tail tag needs to include the timestamp for the 1-step
> Pdelay_Resp to have accurate turnaround time when that message is sent out by 
> the
> switch.  This will require access to the main switch driver which will keep 
> track of those
> PTP messages.
> 
> PTP handles transmit timestamp in skb_tx_timestamp, which is typically called 
> after the
> frame is sent, so it is too late.  DSA calls dsa_skb_tx_timestamp before 
> sending, but it
> only provides a clone to the driver that supports port_txstamp and so the 
> switch driver
> may not be able to do anything.

The current design assumes the hardware will insert the PTP timestamp
into the frame using the clock inside the hardware. You then ask it
what timestamp it actually used. 

If i understand you correctly, in your case, software was to provide
the timestamp which then gets inserted into the frame. So you want to
provide this timestamp as late as possible, when the frame reaches the
head of the queue and is about to be sent out the master interface?

> In dsa_switch_rcv() the CPU receive function is called first before
> dsa_skb_defer_rx_timestamp().  That means the receive tail tag
> operation has to be done first to retrieve the receive timestamp so
> that it can be passed later.

What i think you can do is, in your tag rx function, directly
add the timestamp info to the skb. The dsa driver function
.port_txtstamp can then always return false.

Your tag function is going to need access to some driver state, but
you should be able to get at that, following pointers, and placing
some of the structures in global headers.

Andrew
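
[For instance, the tagger's rx path could decode the tail-tag timestamp
and stash it on the skb directly; a sketch, with the decode helper as a
hypothetical placeholder:]

	static void ksz_rx_hwtstamp(struct sk_buff *skb, u64 ns)
	{
		struct skb_shared_hwtstamps *hwt = skb_hwtstamps(skb);

		memset(hwt, 0, sizeof(*hwt));
		hwt->hwtstamp = ns_to_ktime(ns);
	}

	/* in the tagger's .rcv, before stripping the tail tag: */
	ksz_rx_hwtstamp(skb, ksz_decode_tail_tag_ts(skb));	/* hypothetical */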


Re: [PATCH 1/2] net: linkwatch: send change uevent on link changes

2018-12-06 Thread Jouke Witteveen
On Thu, Dec 6, 2018 at 9:10 PM David Miller  wrote:
>
> From: Jouke Witteveen 
> Date: Thu, 6 Dec 2018 09:59:20 +0100
>
> > On Thu, Dec 6, 2018 at 1:34 AM David Miller  wrote:
> >>
> >> From: Jouke Witteveen 
> >> Date: Wed, 5 Dec 2018 23:38:17 +0100
> >>
> >> > Can you elaborate a bit? I may not be aware of the policy you have in
> >> > mind.
> >>
> >> When we have a user facing interface to do something, we don't create
> >> another one unless it is absolutely, positively, unavoidable.
> >
> > Obviously, if I would have known this I would not have gone through
> > the trouble of investigating and proposing this patch. It was an
> > honest attempt at making the kernel better.
> > Where could I have found this policy? I have looked on kernel.org/doc,
> > but couldn't find it.
>
> It is not formally documented but it is a concern we raise every time
> a duplicate piece of user facing functionality is proposed.

Ok, thanks for getting back to me! Now I know.

That said, when looking into udev I became more convinced that the
kernel should send uevents on link state changes anyway. An example of
another kernel interface that has two ways of sending out state
changes is rfkill. It informs userspace of state changes via
/dev/rfkill and via uevents. For the latter it also sets some
informative environment variables, which my patch currently does not
do.

What would be needed to get you (or anyone else) to reconsider this
patch (or a revision)? I can definitely see your point and am willing
to accept your rejection. However, I also think there are substantial
ergonomic benefits to a unified and consistent interface for device
state changes and would like to give it one more shot, if possible.

Thanks for your time,
- Jouke


Re: [PATCH] mv88e6060: Warn about errors

2018-12-06 Thread Pavel Machek
On Thu 2018-12-06 12:21:59, David Miller wrote:
> 
> Plain "printk" are never appropriate.
> 
> Please explicitly use pr_warn() or similar.  If there is a device context
> available, either a generic device or a netdev, use one of the dev_*()
> or netdev_*() variants.

Can do. I guess the question is: is there agreement that such an error
is worth some kind of output to the logs?

Best regards,
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


signature.asc
Description: Digital signature


Re: mv88e6060: Turn e6060 driver into e6065 driver

2018-12-06 Thread Pavel Machek
On Thu 2018-12-06 12:23:33, David Miller wrote:
> From: Pavel Machek 
> Date: Thu, 6 Dec 2018 14:03:45 +0100
> 
> > @@ -79,7 +82,7 @@ static enum dsa_tag_protocol 
> > mv88e6060_get_tag_protocol(struct dsa_switch *ds,
> >  {
> >//return DSA_TAG_PROTO_QCA;
> >//return DSA_TAG_PROTO_TRAILER;
> 
> These C++ style comments are not in any of my tree(s).
> 
> Your patch submission really needs to shape up if you want your patches
> to be considered seriously.

This one should have been "RFD". It has way more serious problems than
this, as the changelog tried to explain.

Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


signature.asc
Description: Digital signature


RE: [PATCH RFC 1/6] net: dsa: microchip: Prepare PHY for proper advertisement

2018-12-06 Thread Tristram.Ha
> > +static void ksz9477_phy_setup(struct ksz_device *dev, int port,
> > + struct phy_device *phy)
> > +{
> > +   if (port < dev->phy_port_cnt) {
> > +   /* SUPPORTED_Asym_Pause and SUPPORTED_Pause can be
> removed to
> > +* disable flow control when rate limiting is used.
> > +*/
> > +   }
> 
> Hi Tristram
> 
> Is this meant to be a TODO comment?
> 
> What is supposed to happen here is that all forms of pause are disable
> by default. The MAC driver needs to enable what it supports by calling
> phy_support_sym_pause() or phy_support_asym_pause().
> 
> Ah, is this because there is not a real PHY driver?

The kernel has been changed, so I am not sure about the current behavior.
I would like to turn on flow control by default.  Previously I just assigned
"supported" to "advertising."  I know linkmode_copy is being used now for that.
But last time I checked, "advertising" is already the same as "supported."

There are situations where flow control should not be turned on, e.g. when the
switch uses bandwidth control to limit outgoing traffic.

The issue is actually becoming more complex as KSZ9477 has a variant which
does not support gigabit speed, although the same PHY device id is being used.
That means the driver has to fake it by returning a different id and also 
registering
a different PHY driver to handle that.  Marketing also likes to display the 
correct chip
name during kernel booting so that users do not get confused.
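
[A sketch of what the empty hook above could become once the decision is
wired up; the rate-limiting flag is a hypothetical placeholder:]

	static void ksz9477_phy_setup(struct ksz_device *dev, int port,
				      struct phy_device *phy)
	{
		if (port >= dev->phy_port_cnt)
			return;

		/* advertise flow control unless egress rate limiting is used */
		if (!dev->rate_limiting)	/* hypothetical flag */
			phy_support_asym_pause(phy);	/* pause + asym pause */
	}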



Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages

2018-12-06 Thread Andy Lutomirski
On Thu, Dec 6, 2018 at 12:20 PM Edgecombe, Rick P
 wrote:
>
> On Thu, 2018-12-06 at 11:19 -0800, Andy Lutomirski wrote:
> > On Thu, Dec 6, 2018 at 11:01 AM Tycho Andersen  wrote:
> > >
> > > On Thu, Dec 06, 2018 at 10:53:50AM -0800, Andy Lutomirski wrote:
> > > > > If we are going to unmap the linear alias, why not do it at vmalloc()
> > > > > time rather than vfree() time?
> > > >
> > > > That’s not totally nuts. Do we ever have code that expects __va() to
> > > > work on module data?  Perhaps crypto code trying to encrypt static
> > > > data because our APIs don’t understand virtual addresses.  I guess if
> > > > highmem is ever used for modules, then we should be fine.
> > > >
> > > > RO instead of not present might be safer.  But I do like the idea of
> > > > renaming Rick's flag to something like VM_XPFO or VM_NO_DIRECT_MAP and
> > > > making it do all of this.
> > >
> > > Yeah, doing it for everything automatically seemed like it was/is
> > > going to be a lot of work to debug all the corner cases where things
> > > expect memory to be mapped but don't explicitly say it. And in
> > > particular, the XPFO series only does it for user memory, whereas an
> > > additional flag like this would work for extra paranoid allocations
> > > of kernel memory too.
> > >
> >
> > I just read the code, and it looks like vmalloc() is already using
> > highmem (__GFP_HIGHMEM) if available, so, on big x86_32 systems, for
> > example, we already don't have modules in the direct map.
> >
> > So I say we go for it.  This should be quite simple to implement --
> > the pageattr code already has almost all the needed logic on x86.  The
> > only arch support we should need is a pair of functions to remove a
> > vmalloc address range from the address map (if it was present in the
> > first place) and a function to put it back.  On x86, this should only
> > be a few lines of code.
> >
> > What do you all think?  This should solve most of the problems we have.
> >
> > If we really wanted to optimize this, we'd make it so that
> > module_alloc() allocates memory the normal way, then, later on, we
> > call some function that, all at once, removes the memory from the
> > direct map and applies the right permissions to the vmalloc alias (or
> > just makes the vmalloc alias not-present so we can add permissions
> > later without flushing), and flushes the TLB.  And we arrange for
> > vunmap to zap the vmalloc range, then put the memory back into the
> > direct map, then free the pages back to the page allocator, with the
> > flush in the appropriate place.
> >
> > I don't see why the page allocator needs to know about any of this.
> > It's already okay with the permissions being changed out from under it
> > on x86, and it seems fine.  Rick, do you want to give some variant of
> > this a try?
> Hi,
>
> Sorry, I've been having email troubles today.
>
> I found some cases where vmap with PAGE_KERNEL_RO happens, which would not set
> NP/RO in the directmap, so it would be sort of inconsistent whether the
> directmap of vmalloc range allocations were readable or not. I couldn't see 
> any
> places where it would cause problems today though.
>
> I was ready to assume that all TLBs don't cache NP, because I don't know how
> usages where a page fault is used to load something could work without lots of
> flushes.

Or the architecture just fixes up the spurious faults, I suppose.  I'm
only well-educated on the x86 mmu.

> If that's the case, then all archs with directmap permissions could
> share a single vmalloc special permission flush implementation that works like
> Andy described originally. It could be controlled with an
> ARCH_HAS_DIRECT_MAP_PERMS. We would just need something like set_pages_np and
> set_pages_rw on any archs with directmap permissions. So seems simpler to me
> (and what I have been doing) unless I'm missing the problem.

Hmm.  The only reason I've proposed anything fancier was because I was
thinking of minimizing flushes, but I think I'm being silly.  This
sequence ought to work optimally:

- vmalloc(..., VM_HAS_DIRECT_MAP_PERMS);  /* no flushes */

- Write some data, via vmalloc's return address.

- Use some set_memory_whatever() functions to update permissions,
which will flush, hopefully just once.

- Run the module code!

- vunmap -- this will do a single flush that will fix everything.

This does require that set_pages_np() or set_memory_np() or whatever
exists and that it's safe to do that, then flush, and then
set_pages_rw().  So maybe you want set_pages_np_noflush() and
set_pages_rw_noflush() to make it totally clear what's supposed to
happen.

--Andy
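
[In rough code -- the flag and allocation helper are hypothetical names
from this thread, and set_memory_ro()/set_memory_x() stand in for "some
set_memory_whatever() functions":]

	size_t npages = size >> PAGE_SHIFT;
	void *p = vmalloc_special(size);  /* VM_HAS_DIRECT_MAP_PERMS, no flush */

	memcpy(p, payload, size);		/* write via the vmalloc alias */

	set_memory_ro((unsigned long)p, npages);  /* flush, hopefully just once */
	set_memory_x((unsigned long)p, npages);

	run_code(p);				/* e.g. execute the module */

	vfree(p);  /* zap alias, restore direct map, free pages: one flush */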


Re: mv88e6060: Turn e6060 driver into e6065 driver

2018-12-06 Thread David Miller
From: Pavel Machek 
Date: Thu, 6 Dec 2018 14:03:45 +0100

> @@ -79,7 +82,7 @@ static enum dsa_tag_protocol mv88e6060_get_tag_protocol(struct dsa_switch *ds,
>  {
>//return DSA_TAG_PROTO_QCA;
>//return DSA_TAG_PROTO_TRAILER;

These C++ style comments are not in any of my tree(s).

Your patch submission really needs to shape up if you want your patches
to be considered seriously.

Thank you.


Re: [PATCH] mv88e6060: Warn about errors

2018-12-06 Thread David Miller


Plain "printk" are never appropriate.

Please explicitly use pr_warn() or similar.  If there is a device context
available, either a generic device or a netdev, use one of the dev_*()
or netdev_*() variants.
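
For illustration only (a generic sketch, not code from the patch under
review), the preferred variants look like this:

	/* Tie the message to the network device when one is at hand: */
	netdev_warn(dev, "failed to configure port %d: %d\n", port, err);

	/* With only a generic struct device: */
	dev_warn(&pdev->dev, "register read failed: %d\n", err);

	/* With no device context at all, fall back to pr_warn(): */
	pr_warn("mv88e6060: unexpected switch ID %#x\n", id);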


Re: [PATCH net-next 2/2] net: dsa: Set the master device's MTU to account for DSA overheads

2018-12-06 Thread Florian Fainelli
On 12/6/18 2:36 AM, Andrew Lunn wrote:
> DSA tagging of frames sent over the master interface to the switch
> increases the size of the frame. Such frames can then be bigger than
> the normal MTU of the master interface, and it may drop them. Use the
> overhead information from the tagger to set the MTU of the master
> device to include this overhead.
> 
> Signed-off-by: Andrew Lunn 
> ---
>  net/dsa/master.c | 16 
>  1 file changed, 16 insertions(+)
> 
> diff --git a/net/dsa/master.c b/net/dsa/master.c
> index c90ee3227dea..42f525bc68e2 100644
> --- a/net/dsa/master.c
> +++ b/net/dsa/master.c
> @@ -158,8 +158,24 @@ static void dsa_master_ethtool_teardown(struct net_device *dev)
>   cpu_dp->orig_ethtool_ops = NULL;
>  }
>  
> +void dsa_master_set_mtu(struct net_device *dev, struct dsa_port *cpu_dp)
> +{
> + unsigned int mtu = ETH_DATA_LEN + cpu_dp->tag_ops->overhead;
> + int err;
> +
> + rtnl_lock();
> + if (mtu <= dev->max_mtu) {
> + err = dev_set_mtu(dev, mtu);
> + if (err)
> + netdev_dbg(dev, "Unable to set MTU to include for DSA overheads\n");
> + }

Would it make sense to warn the user that there might be
transmit/receive issues with the DSA tagging protocol if either
dev_set_mtu() fails or mtu > dev->max_mtu?

> + rtnl_unlock();
> +}
> +
>  int dsa_master_setup(struct net_device *dev, struct dsa_port *cpu_dp)
>  {
> + dsa_master_set_mtu(dev,  cpu_dp);

Not that I think it matters too much to people because unbinding the
switch driver and expecting the CPU port to continue operating is
wishful thinking, but we should probably unwind that operation in
dsa_master_teardown(), right?

> +
>   /* If we use a tagging format that doesn't have an ethertype
>* field, make sure that all packets from this point on get
>* sent to the tag format's receive function.
> 


-- 
Florian
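
A minimal sketch of the unwind suggested above, assuming a hypothetical
dsa_master_reset_mtu() helper that dsa_master_teardown() would call; this is
not part of the patch being reviewed:

	static void dsa_master_reset_mtu(struct net_device *dev)
	{
		int err;

		rtnl_lock();
		err = dev_set_mtu(dev, ETH_DATA_LEN);
		if (err)
			netdev_dbg(dev,
				   "Unable to reset MTU to exclude DSA overheads\n");
		rtnl_unlock();
	}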


Re: [PATCH] tcp: fix code style in tcp_recvmsg()

2018-12-06 Thread David Miller
From: Pedro Tammela 
Date: Thu,  6 Dec 2018 10:45:28 -0200

> 2 goto labels are indented with a tab. Remove the tabs and
> keep the code style consistent.
> 
> Signed-off-by: Pedro Tammela 

Applied to net-next.


Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages

2018-12-06 Thread Edgecombe, Rick P
On Thu, 2018-12-06 at 11:19 -0800, Andy Lutomirski wrote:
> On Thu, Dec 6, 2018 at 11:01 AM Tycho Andersen  wrote:
> > 
> > On Thu, Dec 06, 2018 at 10:53:50AM -0800, Andy Lutomirski wrote:
> > > > If we are going to unmap the linear alias, why not do it at vmalloc()
> > > > time rather than vfree() time?
> > > 
> > > That’s not totally nuts. Do we ever have code that expects __va() to
> > > work on module data?  Perhaps crypto code trying to encrypt static
> > > data because our APIs don’t understand virtual addresses.  I guess if
> > > highmem is ever used for modules, then we should be fine.
> > > 
> > > RO instead of not present might be safer.  But I do like the idea of
> > > renaming Rick's flag to something like VM_XPFO or VM_NO_DIRECT_MAP and
> > > making it do all of this.
> > 
> > Yeah, doing it for everything automatically seemed like it was/is
> > going to be a lot of work to debug all the corner cases where things
> > expect memory to be mapped but don't explicitly say it. And in
> > particular, the XPFO series only does it for user memory, whereas an
> > additional flag like this would work for extra paranoid allocations
> > of kernel memory too.
> > 
> 
> I just read the code, and it looks like vmalloc() is already using
> highmem (__GFP_HIGHMEM) if available, so, on big x86_32 systems, for
> example, we already don't have modules in the direct map.
> 
> So I say we go for it.  This should be quite simple to implement --
> the pageattr code already has almost all the needed logic on x86.  The
> only arch support we should need is a pair of functions to remove a
> vmalloc address range from the address map (if it was present in the
> first place) and a function to put it back.  On x86, this should only
> be a few lines of code.
> 
> What do you all think?  This should solve most of the problems we have.
> 
> If we really wanted to optimize this, we'd make it so that
> module_alloc() allocates memory the normal way, then, later on, we
> call some function that, all at once, removes the memory from the
> direct map and applies the right permissions to the vmalloc alias (or
> just makes the vmalloc alias not-present so we can add permissions
> later without flushing), and flushes the TLB.  And we arrange for
> vunmap to zap the vmalloc range, then put the memory back into the
> direct map, then free the pages back to the page allocator, with the
> flush in the appropriate place.
> 
> I don't see why the page allocator needs to know about any of this.
> It's already okay with the permissions being changed out from under it
> on x86, and it seems fine.  Rick, do you want to give some variant of
> this a try?
Hi,

Sorry, I've been having email troubles today.

I found some cases where vmap with PAGE_KERNEL_RO happens, which would not set
NP/RO in the directmap, so it would be sort of inconsistent whether the
directmap of vmalloc range allocations were readable or not. I couldn't see any
places where it would cause problems today though.

I was ready to assume that all TLBs don't cache NP, because I don't know how
usages where a page fault is used to load something could work without lots of
flushes. If that's the case, then all archs with directmap permissions could
share a single vmalloc special permission flush implementation that works like
Andy described originally. It could be controlled with an
ARCH_HAS_DIRECT_MAP_PERMS. We would just need something like set_pages_np and
set_pages_rw on any archs with directmap permissions. So seems simpler to me
(and what I have been doing) unless I'm missing the problem.

If you all think so I can indeed take a shot at it; I just don't see what the
problem was with the original solution, which seems less likely to break
anything.

Thanks,

Rick
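
A rough sketch of the shared vfree-time path described above, gated on the
proposed ARCH_HAS_DIRECT_MAP_PERMS; note that neither the config symbol nor
set_pages_np() exist today (x86 has set_pages_ro()/set_pages_rw()), so all
names below are assumptions from this thread:

	/* Called from __vunmap() after the vmalloc range has been zapped. */
	static void vm_reset_direct_map(struct vm_struct *area)
	{
		int i;

		/* 1. Make the direct-map alias not-present, without flushing. */
		for (i = 0; i < area->nr_pages; i++)
			set_pages_np(area->pages[i], 1);

		/* 2. One flush covers the zapped vmalloc range and the alias. */
		flush_tlb_kernel_range((unsigned long)area->addr,
				       (unsigned long)area->addr + area->size);

		/* 3. Restore RW before the pages go back to the allocator. */
		for (i = 0; i < area->nr_pages; i++)
			set_pages_rw(area->pages[i], 1);
	}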


Re: [PATCH net-next 0/2] Adjust MTU of DSA master interface

2018-12-06 Thread David Miller
From: Andrew Lunn 
Date: Thu,  6 Dec 2018 11:36:03 +0100

> DSA makes use of additional headers to direct a frame in/out of a
> specific port of the switch. When the slave interfaces use an MTU of
> 1500, the master interface can be asked to handle frames with an MTU
> of 1504, or 1508 bytes. Some Ethernet interfaces won't
> transmit/receive frames which are bigger than their MTU.
> 
> Automate the increasing of the MTU on the master interface, by adding
> to each tagging driver how much overhead they need, and then calling
> dev_set_mtu() of the master interface to increase its MTU as needed.

Series applied, thanks Andrew.


Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages

2018-12-06 Thread Andy Lutomirski
On Thu, Dec 6, 2018 at 11:39 AM Nadav Amit  wrote:
>
> > On Dec 6, 2018, at 11:19 AM, Andy Lutomirski  wrote:
> >
> > On Thu, Dec 6, 2018 at 11:01 AM Tycho Andersen  wrote:
> >> On Thu, Dec 06, 2018 at 10:53:50AM -0800, Andy Lutomirski wrote:
>  If we are going to unmap the linear alias, why not do it at vmalloc()
>  time rather than vfree() time?
> >>>
> >>> That’s not totally nuts. Do we ever have code that expects __va() to
> >>> work on module data?  Perhaps crypto code trying to encrypt static
> >>> data because our APIs don’t understand virtual addresses.  I guess if
> >>> highmem is ever used for modules, then we should be fine.
> >>>
> >>> RO instead of not present might be safer.  But I do like the idea of
> >>> renaming Rick's flag to something like VM_XPFO or VM_NO_DIRECT_MAP and
> >>> making it do all of this.
> >>
> >> Yeah, doing it for everything automatically seemed like it was/is
> >> going to be a lot of work to debug all the corner cases where things
> >> expect memory to be mapped but don't explicitly say it. And in
> >> particular, the XPFO series only does it for user memory, whereas an
> >> additional flag like this would work for extra paranoid allocations
> >> of kernel memory too.
> >
> > I just read the code, and it looks like vmalloc() is already using
> > highmem (__GFP_HIGHMEM) if available, so, on big x86_32 systems, for
> > example, we already don't have modules in the direct map.
> >
> > So I say we go for it.  This should be quite simple to implement --
> > the pageattr code already has almost all the needed logic on x86.  The
> > only arch support we should need is a pair of functions to remove a
> > vmalloc address range from the address map (if it was present in the
> > first place) and a function to put it back.  On x86, this should only
> > be a few lines of code.
> >
> > What do you all think?  This should solve most of the problems we have.
> >
> > If we really wanted to optimize this, we'd make it so that
> > module_alloc() allocates memory the normal way, then, later on, we
> > call some function that, all at once, removes the memory from the
> > direct map and applies the right permissions to the vmalloc alias (or
> > just makes the vmalloc alias not-present so we can add permissions
> > later without flushing), and flushes the TLB.  And we arrange for
> > vunmap to zap the vmalloc range, then put the memory back into the
> > direct map, then free the pages back to the page allocator, with the
> > flush in the appropriate place.
> >
> > I don't see why the page allocator needs to know about any of this.
> > It's already okay with the permissions being changed out from under it
> > on x86, and it seems fine.  Rick, do you want to give some variant of
> > this a try?
>
> Setting it as read-only may work (and already happens for the read-only
> module data). I am not sure about setting it as non-present.
>
> At some point, a discussion about a threat-model, as Rick indicated, would
> be required. I presume ROP attacks can easily call set_all_modules_text_rw()
> and override all the protections.
>

I am far from an expert on exploit techniques, but here's a
potentially useful model: let's assume there's an attacker who can
write controlled data to a controlled kernel address but cannot
directly modify control flow.  It would be nice for such an attacker
to have a very difficult time of modifying kernel text or of
compromising control flow.  So we're assuming a feature like kernel
CET or that the attacker finds it very difficult to do something like
modifying some thread's IRET frame.

Admittedly, for the kernel, this is an odd threat model, since an
attacker can presumably quite easily learn the kernel stack address of
one of their tasks, do some syscall, and then modify their kernel
thread's stack such that it will IRET right back to a fully controlled
register state with RSP pointing at an attacker-supplied kernel stack.
So this threat model gives very strong ROP powers, unless we have
either CET or some software technique to harden all the RET
instructions in the kernel.

I wonder if there's a better model to use.  Maybe with stack-protector
we get some degree of protection?  Or is all of this rather weak
until we have CET or a RAP-like feature.


Re: [PATCH][net-next] tun: align write-heavy flow entry members to a cache line

2018-12-06 Thread David Miller
From: Li RongQing 
Date: Thu,  6 Dec 2018 16:08:17 +0800

> The tun flow entry 'updated' field is written on every
> received packet. Thus if a flow is receiving packets from a
> particular flow entry, it'll cause false-sharing with
> all the others who have looked it up, so move it into its own
> cache line,
> 
> and update the 'queue_index' and 'updated' fields only when
> they are changed, to reduce the cache false-sharing.
> 
> Signed-off-by: Zhang Yu 
> Signed-off-by: Wang Li 
> Signed-off-by: Li RongQing 

Applied.


Re: [PATCH][net-next] tun: remove unnecessary check in tun_flow_update

2018-12-06 Thread David Miller
From: Li RongQing 
Date: Thu,  6 Dec 2018 16:28:11 +0800

> caller has guaranteed that rxhash is not zero
> 
> Signed-off-by: Li RongQing 

Applied.


RE: [PATCH RFC 2/6] net: dsa: microchip: Add MIB counter reading support

2018-12-06 Thread Tristram.Ha
> > +static void ksz9477_r_mib_cnt(struct ksz_device *dev, int port, u16 addr,
> > + u64 *cnt)
> > +{
> > +   u32 data;
> > +   int timeout;
> > +   struct ksz_port *p = &dev->ports[port];
> > +
> > +   /* retain the flush/freeze bit */
> > +   data = p->freeze ? MIB_COUNTER_FLUSH_FREEZE : 0;
> > +   data |= MIB_COUNTER_READ;
> > +   data |= (addr << MIB_COUNTER_INDEX_S);
> > +   ksz_pwrite32(dev, port, REG_PORT_MIB_CTRL_STAT__4, data);
> > +
> > +   timeout = 1000;
> > +   do {
> > +   ksz_pread32(dev, port, REG_PORT_MIB_CTRL_STAT__4,
> > +   &data);
> > +   usleep_range(1, 10);
> > +   if (!(data & MIB_COUNTER_READ))
> > +   break;
> > +   } while (timeout-- > 0);
> 
> Could you use readx_poll_timeout() here?
> 
> > +void ksz_get_ethtool_stats(struct dsa_switch *ds, int port, uint64_t *buf)
> > +{
> > +   struct ksz_device *dev = ds->priv;
> > +   struct ksz_port_mib *mib;
> > +
> > +   mib = &dev->ports[port].mib;
> > +
> > +   /* freeze MIB counters if supported */
> > +   if (dev->dev_ops->freeze_mib)
> > +   dev->dev_ops->freeze_mib(dev, port, true);
> > +   mutex_lock(&mib->cnt_mutex);
> > +   port_r_cnt(dev, port);
> > +   mutex_unlock(&mib->cnt_mutex);
> > +   if (dev->dev_ops->freeze_mib)
> > +   dev->dev_ops->freeze_mib(dev, port, false);
> 
> Should the freeze be protected by the mutex as well?
> 
> > +   memcpy(buf, mib->counters, dev->mib_cnt * sizeof(u64));
> 
> I wonder if this memcpy should also be protected by the mutex. As soon
> as the mutex is dropped, the scheduled work could start updating
> mib->counters in non-atomic ways?
>

I will update as suggested.
 
> > +}
> > +
> >  int ksz_port_bridge_join(struct dsa_switch *ds, int port,
> >  struct net_device *br)
> >  {
> > @@ -255,6 +349,7 @@ int ksz_enable_port(struct dsa_switch *ds, int port, struct phy_device *phy)
> > /* setup slave port */
> > dev->dev_ops->port_setup(dev, port, false);
> > dev->dev_ops->phy_setup(dev, port, phy);
> > +   dev->dev_ops->port_init_cnt(dev, port);
> 
> This is probably not the correct place to do this. MIB counters should
> not be cleared by an ifdown/ifup cycle. They should only be cleared
> when the driver is probed.

I wonder what is the official way to clear the counters.  For network debugging
it is good to clear the counters to start fresh to see which frame is not
being sent or received.  Typically the device is reset when it is shut down as
there are hardware problems.  I would think it is the job of applications like
an SNMP Manager to keep track of MIB counters throughout the life of a running
system.
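
As an aside, the readx_poll_timeout() change requested above could look
roughly like this; the context struct and wrapper are invented for the
example, while ksz_pread32() and the register names come from the patch:

	#include <linux/iopoll.h>

	struct ksz_mib_read_ctx {
		struct ksz_device *dev;
		int port;
	};

	static u32 ksz9477_mib_ctrl_read(struct ksz_mib_read_ctx *ctx)
	{
		u32 data;

		ksz_pread32(ctx->dev, ctx->port, REG_PORT_MIB_CTRL_STAT__4,
			    &data);
		return data;
	}

	/* ... in ksz9477_r_mib_cnt(), replacing the open-coded loop: */
	struct ksz_mib_read_ctx ctx = { .dev = dev, .port = port };
	u32 val;
	int ret;

	ret = readx_poll_timeout(ksz9477_mib_ctrl_read, &ctx, val,
				 !(val & MIB_COUNTER_READ), 10, 1000);
	if (ret < 0)
		return;	/* timed out, counter value is not valid */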


Re: [PATCH 1/2] net: linkwatch: send change uevent on link changes

2018-12-06 Thread David Miller
From: Jouke Witteveen 
Date: Thu, 6 Dec 2018 09:59:20 +0100

> On Thu, Dec 6, 2018 at 1:34 AM David Miller  wrote:
>>
>> From: Jouke Witteveen 
>> Date: Wed, 5 Dec 2018 23:38:17 +0100
>>
>> > Can you elaborate a bit? I may not be aware of the policy you have in
>> > mind.
>>
>> When we have a user facing interface to do something, we don't create
>> another one unless it is absolutely, positively, unavoidable.
> 
> Obviously, if I had known this I would not have gone through
> the trouble of investigating and proposing this patch. It was an
> honest attempt at making the kernel better.
> Where could I have found this policy? I have looked on kernel.org/doc,
> but couldn't find it.

It is not formally documented but it is a concern we raise every time
a duplicate piece of user facing functionality is proposed.


RE: [PATCH RFC 5/6] net: dsa: microchip: Update tag_ksz.c to access switch driver

2018-12-06 Thread Tristram.Ha
> >>> Update tag_ksz.c to access switch driver's tail tagging operations.
> >>
> >> Hi Tristram
> >>
> > Humm, i'm not sure we want this, the tagging split into two places.  I
> > need to take a closer look at the previous patch, to see why it cannot
> > be done here.
> >
> > O.K., I think I get what is going on.
> >
> > I would however implement it differently.
> >
> > One net/dsa/tag_X.c file can export two dsa_device_ops structures,
> > allowing you to share common code for the two taggers. You could call
> > these DSA_TAG_PROTO_KSZ_1_BYTE and DSA_TAG_PROTO_KSZ_2_BYTE, and the
> > .get_tag_protocol call would then return the correct one for the
> > switch.
> 
> Agreed, that is what is done by net/dsa/tag_brcm.c because there are two
> formats for the Broadcom tag:
> 
> - TAG_BRCM: the 4-bytes Broadcom tag is between MAC SA and Ethertype
> - TAG_BRCM_PREPEND: the 4-bytes Broadcom tag is before the MAC DA
>

I did try to implement it this way.  But the other switches do not have the same
format even though the length is the same.  Then I need to change the following
files for any new KSZ switch: include/linux/dsa.h, net/dsa/dsa.c,
net/dsa/dsa_priv.h, and finally net/dsa/tag_ksz.c.

Even then it will not work if Microchip wants to add 1588 PTP capability to the
switches.

For KSZ9477 the length of the tail tag changes when the PTP function is enabled.
Typically this function is either enabled or disabled all the time, but if users
want to change that during normal operation to see how the switch behaves, the
transmit function completely stops working correctly.

The older driver implementation monitors that register change and adjusts the
length dynamically.

Another problem is the tail tag needs to include the timestamp for the 1-step
Pdelay_Resp to have accurate turnaround time when that message is sent out by
the switch.  This will require access to the main switch driver which will keep
track of those PTP messages.

PTP handles transmit timestamp in skb_tx_timestamp, which is typically called
after the frame is sent, so it is too late.  DSA calls dsa_skb_tx_timestamp
before sending, but it only provides a clone to the driver that supports
port_txstamp and so the switch driver may not be able to do anything.
 
> And the code to process them is basically using relative offsets from
> the start of the frame to access correct data.
> 
> This is done largely for performance reasons because we have 1/2
> Gigabit/sec capable CPU ports and so we want as little cache
> thrashing as possible and to immediately get the right rcv() function to
> process the packets.
> 

The SoC I used for this driver development actually has a problem sending
Gigabit traffic so I do not see the effect of any slowdown, and the updated
MAC driver change for a hardware problem does not help and greatly
degrades the transmit performance.

> >
> > It might also be possible to merge in tag_trailer, or at least share
> > some code.
> >

Actually, in the previous old DSA implementation I just hijacked this file to
add the tail tag operations without creating a new file like tag_ksz.c.

> > What i don't yet understand is how you are passing PTP information
> > around. The commit messages need to explain that, since it is not
> > obvious, and it is the first time we have needed PTP info in a tag
> > driver.

It seems the official 1588 PTP timestamp API for a PHY driver is implemented in
only one PHY driver, net/phy/dp83640.c, in the whole kernel.  DSA uses a similar
mechanism to support 1588 PTP.  In dsa_switch_rcv() the CPU receive function is
called first, before dsa_skb_defer_rx_timestamp().  That means the receive tail
tag operation has to be done first to retrieve the receive timestamp so that it
can be passed later.

It is probably not good to change the socket buffer length inside the
port_rxtstamp function, and I do not see any other way to insert that transmit
timestamp.

A customer has already inquired about implementing 1588 PTP in the DSA driver.
I hope this mechanism is approved so that I can start doing that.
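
For reference, the two-taggers-in-one-file shape suggested earlier in this
thread could look roughly like this, reusing the .overhead field added by
Andrew's MTU series; the tag layout and helper names are illustrative, not
the actual Microchip format:

	static struct sk_buff *ksz_common_xmit(struct sk_buff *skb,
					       struct net_device *dev, int len)
	{
		struct dsa_port *dp = dsa_slave_to_port(dev);
		u8 *tag;

		/* Real code must also handle cloning and tailroom. */
		if (skb_put_padto(skb, ETH_ZLEN))
			return NULL;

		tag = skb_put(skb, len);	/* tail tag after payload */
		tag[len - 1] = BIT(dp->index);	/* illustrative encoding */
		return skb;
	}

	static struct sk_buff *ksz8795_xmit(struct sk_buff *skb,
					    struct net_device *dev)
	{
		return ksz_common_xmit(skb, dev, 1);	/* 1-byte tail tag */
	}

	static struct sk_buff *ksz9477_xmit(struct sk_buff *skb,
					    struct net_device *dev)
	{
		return ksz_common_xmit(skb, dev, 2);	/* 2-byte tail tag */
	}

	const struct dsa_device_ops ksz8795_netdev_ops = {
		.xmit	  = ksz8795_xmit,
		.rcv	  = ksz8795_rcv,	/* shared rcv, not shown */
		.overhead = 1,
	};

	const struct dsa_device_ops ksz9477_netdev_ops = {
		.xmit	  = ksz9477_xmit,
		.rcv	  = ksz9477_rcv,
		.overhead = 2,
	};

The switch driver's .get_tag_protocol would then return
DSA_TAG_PROTO_KSZ_1_BYTE or DSA_TAG_PROTO_KSZ_2_BYTE to select one of the two.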



Re: [PATCH net-next 0/2] platform data controls for mdio-gpio

2018-12-06 Thread Andrew Lunn
On Thu, Dec 06, 2018 at 09:22:27AM -0800, Florian Fainelli wrote:
> Hi Andrew,
> 
> On 12/6/18 5:58 AM, Andrew Lunn wrote:
> > Soon to be mainlined is an x86 platform with a Marvell switch, and a
> > bit-banging MDIO bus. In order to make this work, the phy_mask of the
> > MDIO bus needs to be set to prevent scanning for PHYs, and the
> > phy_ignore_ta_mask needs to be set because the switch has broken
> > turnaround.
> 
> This looks good, I would just make one/two changes which is to match the
> internal phy_mask and phy_ignore_ta_mask types from the struct mii_bus
> and use u32 instead of int.

Yes, that makes sense.

v2 to follow.

 Andrew


Re: [PATCH] gianfar: Add gfar_change_carrier()

2018-12-06 Thread Andrew Lunn
> I can have a look at using dormant, but what is change_carrier
> supposed to do if not this?

It is intended for interfaces which are stacked, like the team driver,
and for devices which don't have a phy, e.g. tun, and dummy.

> I didn't find a tool for DORMANT, I guess I will have to write one
> myself (using SIOCGIFFLAGS, SIOCSIFFLAGS)?

ip link should be able to set it.

Try ip link set mode dormant dev eth0

Andrew


Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages

2018-12-06 Thread Nadav Amit
> On Dec 6, 2018, at 11:19 AM, Andy Lutomirski  wrote:
> 
> On Thu, Dec 6, 2018 at 11:01 AM Tycho Andersen  wrote:
>> On Thu, Dec 06, 2018 at 10:53:50AM -0800, Andy Lutomirski wrote:
 If we are going to unmap the linear alias, why not do it at vmalloc()
 time rather than vfree() time?
>>> 
>>> That’s not totally nuts. Do we ever have code that expects __va() to
>>> work on module data?  Perhaps crypto code trying to encrypt static
>>> data because our APIs don’t understand virtual addresses.  I guess if
>>> highmem is ever used for modules, then we should be fine.
>>> 
>>> RO instead of not present might be safer.  But I do like the idea of
>>> renaming Rick's flag to something like VM_XPFO or VM_NO_DIRECT_MAP and
>>> making it do all of this.
>> 
>> Yeah, doing it for everything automatically seemed like it was/is
>> going to be a lot of work to debug all the corner cases where things
>> expect memory to be mapped but don't explicitly say it. And in
>> particular, the XPFO series only does it for user memory, whereas an
>> additional flag like this would work for extra paranoid allocations
>> of kernel memory too.
> 
> I just read the code, and it looks like vmalloc() is already using
> highmem (__GFP_HIGHMEM) if available, so, on big x86_32 systems, for
> example, we already don't have modules in the direct map.
> 
> So I say we go for it.  This should be quite simple to implement --
> the pageattr code already has almost all the needed logic on x86.  The
> only arch support we should need is a pair of functions to remove a
> vmalloc address range from the address map (if it was present in the
> first place) and a function to put it back.  On x86, this should only
> be a few lines of code.
> 
> What do you all think?  This should solve most of the problems we have.
> 
> If we really wanted to optimize this, we'd make it so that
> module_alloc() allocates memory the normal way, then, later on, we
> call some function that, all at once, removes the memory from the
> direct map and applies the right permissions to the vmalloc alias (or
> just makes the vmalloc alias not-present so we can add permissions
> later without flushing), and flushes the TLB.  And we arrange for
> vunmap to zap the vmalloc range, then put the memory back into the
> direct map, then free the pages back to the page allocator, with the
> flush in the appropriate place.
> 
> I don't see why the page allocator needs to know about any of this.
> It's already okay with the permissions being changed out from under it
> on x86, and it seems fine.  Rick, do you want to give some variant of
> this a try?

Setting it as read-only may work (and already happens for the read-only
module data). I am not sure about setting it as non-present.

At some point, a discussion about a threat-model, as Rick indicated, would
be required. I presume ROP attacks can easily call set_all_modules_text_rw()
and override all the protections.



Re: [PATCH bpf-next] tools: bpftool: add a command to dump the trace pipe

2018-12-06 Thread Alexei Starovoitov
On Thu, Dec 06, 2018 at 05:20:54PM +, Quentin Monnet wrote:
> 2018-12-05 19:18 UTC-0800 ~ Alexei Starovoitov
> 
> > On Wed, Dec 05, 2018 at 06:15:23PM +, Quentin Monnet wrote:
>  +
>  +/* Allow room for NULL terminating byte and pipe file name */
>  +snprintf(format, sizeof(format), "%%*s %%%zds %%99s %%*s %%*d %%*d\n",
>  + PATH_MAX - strlen(pipe_name) - 1);
> >>>
> >>> before scanning trace_pipe could you add a check that trace_options are compatible?
> >>> Otherwise there will be a lot of garbage printed.
> >>> afaik default is rarely changed, so the patch is ok as-is.
> >>> The followup some time in the future would be perfect.
> >>
> >> Sure. What do you mean exactly by compatible options? I can check that
> >> "trace_printk" is set, is there any other option that would be relevant?
> > 
> > See Documentation/trace/ftrace.rst
> > a lot of the flags will change the format significantly.
> > Like 'bin' will make it binary.
> > I'm not suggesting to support all possible output formats.
> > Only to check that trace flags match scanf.
> 
> fscanf() is only used to retrieve the name of the sysfs directory where
> the pipe is located, when listing all the mount points on the system. It
> is not used to dump the content from the pipe (which is done with
> getline(), so formatting does not matter much).
> 
> If the "bin" option is set, "bpftool prog tracelog" will dump the same
> binary content as "cat /sys/kernel/debug/tracing/trace_pipe", which is
> the expected behaviour (at least with the current patch). Let me know if
> you would like me to change this somehow.

I misread the patch :) thanks for explaining. All good then.
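
For context, the mount-point lookup being discussed amounts to something like
this userspace sketch (an illustration, not the bpftool source):

	#include <stdio.h>
	#include <string.h>

	/* Find the first tracefs mount point; 'path' must hold >= 4096
	 * bytes. Each /proc/mounts line reads:
	 *   <dev> <mountpoint> <fstype> <opts> <dump> <pass>
	 */
	static int find_tracefs(char *path)
	{
		char type[32];
		FILE *fp = fopen("/proc/mounts", "r");
		int found = -1;

		if (!fp)
			return -1;
		while (fscanf(fp, "%*s %4095s %31s %*s %*d %*d\n",
			      path, type) == 2) {
			if (!strcmp(type, "tracefs")) {
				found = 0;
				break;
			}
		}
		fclose(fp);
		return found;
	}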



Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages

2018-12-06 Thread Ard Biesheuvel
On Thu, 6 Dec 2018 at 20:30, Will Deacon  wrote:
>
> On Thu, Dec 06, 2018 at 08:23:20PM +0100, Ard Biesheuvel wrote:
> > On Thu, 6 Dec 2018 at 20:21, Andy Lutomirski  wrote:
> > >
> > > On Thu, Dec 6, 2018 at 11:04 AM Ard Biesheuvel wrote:
> > > >
> > > > On Thu, 6 Dec 2018 at 19:54, Andy Lutomirski  wrote:
> > > > >
> > >
> > > > > That’s not totally nuts. Do we ever have code that expects __va() to
> > > > > work on module data?  Perhaps crypto code trying to encrypt static
> > > > > data because our APIs don’t understand virtual addresses.  I guess if
> > > > > highmem is ever used for modules, then we should be fine.
> > > > >
> > > >
> > > > The crypto code shouldn't care, but I think it will probably break 
> > > > hibernate :-(
> > >
> > > How so?  Hibernate works (or at least should work) on x86 PAE, where
> > > __va doesn't work on module data, and, on x86, the direct map has some
> > > RO parts with where the module is, so hibernate can't be writing to
> > > the memory through the direct map with its final permissions.
> >
> > On arm64 at least, hibernate reads the contents of memory via the
> > linear mapping. Not sure about other arches.
>
> Can we handle this like the DEBUG_PAGEALLOC case, and extract the pfn from
> the pte when we see that it's PROT_NONE?
>

As long as we can easily figure out whether a certain linear address
is mapped or not, having a special case like that for these mappings
doesn't sound unreasonable.


Re: [PATCH net-next 4/4] net: aquantia: add support of RSS configuration

2018-12-06 Thread Jakub Kicinski
On Thu, 6 Dec 2018 15:02:52 +, Igor Russkikh wrote:
> From: Dmitry Bogdanov 
> 
> Add support of configuration of RSS hash key and RSS indirection table.
> 
> Signed-off-by: Dmitry Bogdanov 
> Signed-off-by: Igor Russkikh 
> ---
>  .../ethernet/aquantia/atlantic/aq_ethtool.c   | 42 +++
>  1 file changed, 42 insertions(+)
> 
> diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_ethtool.c b/drivers/net/ethernet/aquantia/atlantic/aq_ethtool.c
> index a5fd71692c8b..2f2e12c2b632 100644
> --- a/drivers/net/ethernet/aquantia/atlantic/aq_ethtool.c
> +++ b/drivers/net/ethernet/aquantia/atlantic/aq_ethtool.c
> @@ -202,6 +202,47 @@ static int aq_ethtool_get_rss(struct net_device *ndev, u32 *indir, u8 *key,
>   return 0;
>  }
>  
> +static int aq_ethtool_set_rss(struct net_device *netdev, const u32 *indir,
> +   const u8 *key, const u8 hfunc)
> +{
> + struct aq_nic_s *aq_nic = netdev_priv(netdev);
> + struct aq_nic_cfg_s *cfg;
> + unsigned int i = 0U;
> + u32 rss_entries;
> + int err = 0;
> +
> + cfg = aq_nic_get_cfg(aq_nic);
> + rss_entries = cfg->aq_rss.indirection_table_size;
> +
> + /* We do not allow change in unsupported parameters */
> + if (hfunc != ETH_RSS_HASH_NO_CHANGE && hfunc != ETH_RSS_HASH_TOP)
> + return -EOPNOTSUPP;
> + /* Fill out the redirection table */
> + if (indir) {
> + /* Verify user input. */
> + for (i = 0; i < rss_entries; i++)
> + if (indir[i] >= cfg->num_rss_queues)
> + return -EINVAL;

nit: you shouldn't have to do this, see ethtool_copy_validate_indir().

> + for (i = 0; i < rss_entries; i++)
> + cfg->aq_rss.indirection_table[i] = indir[i];
> + }
> +
> + /* Fill out the rss hash key */
> + if (key) {
> + memcpy(cfg->aq_rss.hash_secret_key, key,
> +sizeof(cfg->aq_rss.hash_secret_key));
> + err = aq_nic->aq_hw_ops->hw_rss_hash_set(aq_nic->aq_hw,
> + &cfg->aq_rss);
> + if (err)
> + return err;
> + }
> +
> + err = aq_nic->aq_hw_ops->hw_rss_set(aq_nic->aq_hw, &cfg->aq_rss);
> +
> + return err;
> +}
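
Following that nit: ethtool_set_rxfh() validates each indirection table entry
against the RX ring count (via ethtool_copy_validate_indir()) before calling
into the driver, so, assuming the driver reports its RSS queue count through
ETHTOOL_GRXRINGS, the driver-side check could reduce to a plain copy. A
sketch, not a tested change:

	/* Fill out the redirection table; entries were already validated
	 * against the RX ring count by the ethtool core.
	 */
	if (indir) {
		for (i = 0; i < rss_entries; i++)
			cfg->aq_rss.indirection_table[i] = indir[i];
	}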


Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages

2018-12-06 Thread Will Deacon
On Thu, Dec 06, 2018 at 08:23:20PM +0100, Ard Biesheuvel wrote:
> On Thu, 6 Dec 2018 at 20:21, Andy Lutomirski  wrote:
> >
> > On Thu, Dec 6, 2018 at 11:04 AM Ard Biesheuvel wrote:
> > >
> > > On Thu, 6 Dec 2018 at 19:54, Andy Lutomirski  wrote:
> > > >
> >
> > > > That’s not totally nuts. Do we ever have code that expects __va() to
> > > > work on module data?  Perhaps crypto code trying to encrypt static
> > > > data because our APIs don’t understand virtual addresses.  I guess if
> > > > highmem is ever used for modules, then we should be fine.
> > > >
> > >
> > > The crypto code shouldn't care, but I think it will probably break 
> > > hibernate :-(
> >
> > How so?  Hibernate works (or at least should work) on x86 PAE, where
> > __va doesn't work on module data, and, on x86, the direct map has some
> > RO parts with where the module is, so hibernate can't be writing to
> > the memory through the direct map with its final permissions.
> 
> On arm64 at least, hibernate reads the contents of memory via the
> linear mapping. Not sure about other arches.

Can we handle this like the DEBUG_PAGEALLOC case, and extract the pfn from
the pte when we see that it's PROT_NONE?

Will


Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages

2018-12-06 Thread Ard Biesheuvel
On Thu, 6 Dec 2018 at 20:21, Andy Lutomirski  wrote:
>
> On Thu, Dec 6, 2018 at 11:04 AM Ard Biesheuvel wrote:
> >
> > On Thu, 6 Dec 2018 at 19:54, Andy Lutomirski  wrote:
> > >
>
> > > That’s not totally nuts. Do we ever have code that expects __va() to
> > > work on module data?  Perhaps crypto code trying to encrypt static
> > > data because our APIs don’t understand virtual addresses.  I guess if
> > > highmem is ever used for modules, then we should be fine.
> > >
> >
> > The crypto code shouldn't care, but I think it will probably break 
> > hibernate :-(
>
> How so?  Hibernate works (or at least should work) on x86 PAE, where
> __va doesn't work on module data, and, on x86, the direct map has some
> RO parts with where the module is, so hibernate can't be writing to
> the memory through the direct map with its final permissions.

On arm64 at least, hibernate reads the contents of memory via the
linear mapping. Not sure about other arches.


Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages

2018-12-06 Thread Andy Lutomirski
On Thu, Dec 6, 2018 at 11:04 AM Ard Biesheuvel wrote:
>
> On Thu, 6 Dec 2018 at 19:54, Andy Lutomirski  wrote:
> >

> > That’s not totally nuts. Do we ever have code that expects __va() to
> > work on module data?  Perhaps crypto code trying to encrypt static
> > data because our APIs don’t understand virtual addresses.  I guess if
> > highmem is ever used for modules, then we should be fine.
> >
>
> The crypto code shouldn't care, but I think it will probably break hibernate 
> :-(

How so?  Hibernate works (or at least should work) on x86 PAE, where
__va doesn't work on module data, and, on x86, the direct map has some
RO parts with where the module is, so hibernate can't be writing to
the memory through the direct map with its final permissions.


Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages

2018-12-06 Thread Andy Lutomirski
On Thu, Dec 6, 2018 at 11:01 AM Tycho Andersen  wrote:
>
> On Thu, Dec 06, 2018 at 10:53:50AM -0800, Andy Lutomirski wrote:
> > > If we are going to unmap the linear alias, why not do it at vmalloc()
> > > time rather than vfree() time?
> >
> > That’s not totally nuts. Do we ever have code that expects __va() to
> > work on module data?  Perhaps crypto code trying to encrypt static
> > data because our APIs don’t understand virtual addresses.  I guess if
> > highmem is ever used for modules, then we should be fine.
> >
> > RO instead of not present might be safer.  But I do like the idea of
> > renaming Rick's flag to something like VM_XPFO or VM_NO_DIRECT_MAP and
> > making it do all of this.
>
> Yeah, doing it for everything automatically seemed like it was/is
> going to be a lot of work to debug all the corner cases where things
> expect memory to be mapped but don't explicitly say it. And in
> particular, the XPFO series only does it for user memory, whereas an
> additional flag like this would work for extra paranoid allocations
> of kernel memory too.
>

I just read the code, and it looks like vmalloc() is already using
highmem (__GFP_HIGHMEM) if available, so, on big x86_32 systems, for
example, we already don't have modules in the direct map.

So I say we go for it.  This should be quite simple to implement --
the pageattr code already has almost all the needed logic on x86.  The
only arch support we should need is a pair of functions to remove a
vmalloc address range from the address map (if it was present in the
first place) and a function to put it back.  On x86, this should only
be a few lines of code.

What do you all think?  This should solve most of the problems we have.

If we really wanted to optimize this, we'd make it so that
module_alloc() allocates memory the normal way, then, later on, we
call some function that, all at once, removes the memory from the
direct map and applies the right permissions to the vmalloc alias (or
just makes the vmalloc alias not-present so we can add permissions
later without flushing), and flushes the TLB.  And we arrange for
vunmap to zap the vmalloc range, then put the memory back into the
direct map, then free the pages back to the page allocator, with the
flush in the appropriate place.

I don't see why the page allocator needs to know about any of this.
It's already okay with the permissions being changed out from under it
on x86, and it seems fine.  Rick, do you want to give some variant of
this a try?


[PATCH] vhost/vsock: fix reset orphans race with close timeout

2018-12-06 Thread Stefan Hajnoczi
If a local process has closed a connected socket and hasn't received a
RST packet yet, then the socket remains in the table until a timeout
expires.

When a vhost_vsock instance is released with the timeout still pending,
the socket is never freed because vhost_vsock has already set the
SOCK_DONE flag.

Check if the close timer is pending and let it close the socket.  This
prevents the race which can leak sockets.

Reported-by: Maximilian Riemensberger 
Cc: Graham Whaley 
Signed-off-by: Stefan Hajnoczi 
---
 drivers/vhost/vsock.c | 22 +++---
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 34bc3ab40c6d..731e2ea2aeca 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -563,13 +563,21 @@ static void vhost_vsock_reset_orphans(struct sock *sk)
 * executing.
 */
 
-   if (!vhost_vsock_get(vsk->remote_addr.svm_cid)) {
-   sock_set_flag(sk, SOCK_DONE);
-   vsk->peer_shutdown = SHUTDOWN_MASK;
-   sk->sk_state = SS_UNCONNECTED;
-   sk->sk_err = ECONNRESET;
-   sk->sk_error_report(sk);
-   }
+   /* If the peer is still valid, no need to reset connection */
+   if (vhost_vsock_get(vsk->remote_addr.svm_cid))
+   return;
+
+   /* If the close timeout is pending, let it expire.  This avoids races
+* with the timeout callback.
+*/
+   if (vsk->close_work_scheduled)
+   return;
+
+   sock_set_flag(sk, SOCK_DONE);
+   vsk->peer_shutdown = SHUTDOWN_MASK;
+   sk->sk_state = SS_UNCONNECTED;
+   sk->sk_err = ECONNRESET;
+   sk->sk_error_report(sk);
 }
 
 static int vhost_vsock_dev_release(struct inode *inode, struct file *file)
-- 
2.19.2



Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages

2018-12-06 Thread Ard Biesheuvel
On Thu, 6 Dec 2018 at 19:54, Andy Lutomirski  wrote:
>
> > On Dec 5, 2018, at 11:29 PM, Ard Biesheuvel wrote:
> >
> >> On Thu, 6 Dec 2018 at 00:16, Andy Lutomirski  wrote:
> >>
> >>> On Wed, Dec 5, 2018 at 3:41 AM Will Deacon  wrote:
> >>>
>  On Tue, Dec 04, 2018 at 12:09:49PM -0800, Andy Lutomirski wrote:
 On Tue, Dec 4, 2018 at 12:02 PM Edgecombe, Rick P wrote:
> >
> >> On Tue, 2018-12-04 at 16:03 +, Will Deacon wrote:
> >> On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
 On Nov 27, 2018, at 4:07 PM, Rick Edgecombe wrote:
> 
 Since vfree will lazily flush the TLB, but not lazily free the underlying
 pages, it often leaves stale TLB entries to freed pages that could get
 re-used. This is undesirable for cases where the memory being freed has
 special permissions such as executable.
> >>>
> >>> So I am trying to finish my patch-set for preventing transient W+X
> >>> mappings from taking space, by handling kprobes & ftrace that I missed
> >>> (thanks again for pointing it out).
> >>>
> >>> But all of the sudden, I don’t understand why we have the problem that
> >>> this (your) patch-set deals with at all. We already change the mappings
> >>> to make the memory writable before freeing the memory, so why can’t we
> >>> make it non-executable at the same time? Actually, why do we make the
> >>> module memory, including its data, executable before freeing it???
> >>
> >> Yeah, this is really confusing, but I have a suspicion it's a combination
> >> of the various different configurations and hysterical raisins. We can't
> >> rely on module_alloc() allocating from the vmalloc area (see nios2) nor
> >> can we rely on disable_ro_nx() being available at build time.
> >>
> >> If we *could* rely on module allocations always using vmalloc(), then
> >> we could pass in Rick's new flag and drop disable_ro_nx() altogether
> >> afaict -- who cares about the memory attributes of a mapping that's
> >> about to disappear anyway?
> >>
> >> Is it just nios2 that does something different?
> >>
> > Yea it is really intertwined. I think for x86, set_memory_nx everywhere
> > would solve it as well, in fact that was what I first thought the solution
> > should be until this was suggested. It's interesting that from the other
> > thread Masami Hiramatsu referenced, set_memory_nx was suggested last year
> > and would have inadvertently blocked this on x86. But, on the other
> > architectures I have since learned it is a bit different.
> >
> > It looks like actually most arch's don't re-define set_memory_*, and so
> > all of the frob_* functions are actually just noops. In which case
> > allocating RWX is needed to make it work at all, because that is what the
> > allocation is going to stay at. So in these archs, set_memory_nx won't
> > solve it because it will do nothing.
> >
> > On x86 I think you cannot get rid of disable_ro_nx fully because there
> > is the changing of the permissions on the directmap as well. You don't
> > want some other caller getting a page that was left RO when freed and
> > then trying to write to it, if I understand this.
> >
> 
>  Exactly.
> >>>
> >>> Of course, I forgot about the linear mapping. On arm64, we've just queued
> >>> support for reflecting changes to read-only permissions in the linear map
> >>> [1]. So, whilst the linear map is always non-executable, we will need to
> >>> make parts of it writable again when freeing the module.
> >>>
>  After slightly more thought, I suggest renaming VM_IMMEDIATE_UNMAP to
>  VM_MAY_ADJUST_PERMS or similar.  It would have the semantics you want,
>  but it would also call some arch hooks to put back the direct map
 permissions before the flush.  Does that seem reasonable?  It would
 need to be hooked up to the architectures that implement set_memory_ro(),
 but that should be quite easy.  If nothing else, it could fall back to
 set_memory_ro() in the absence of a better implementation.
> >>>
> >>> You mean set_memory_rw() here, right? Although, eliding the TLB
> >>> invalidation would open up a window where the vmap mapping is executable
> >>> and the linear mapping is writable, which is a bit rubbish.
> >>>
> >>
> >> Right, and Rick pointed out the same issue.  Instead, we should set
> >> the direct map not-present or its ARM equivalent, then do the flush,
> >> then make it RW.  I assume this also works on arm and arm64, although
> >> I don't 
