Re: [PATCH 0/3]: net: dsa: mt7530: support MT7530 in the MT7621 SoC

2018-12-03 Thread John Crispin



On 03/12/2018 15:00, René van Dorst wrote:

Quoting Bjørn Mork:

Greg Ungerer writes:


The following change helped a lot, but I still get some problems under
sustained load and with some types of port setups. Still trying to figure
out exactly what is going on.

--- a/linux/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/linux/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -1750,8 +1750,8 @@ static irqreturn_t mtk_handle_irq_rx(int irq, void *_eth)

   if (likely(napi_schedule_prep(&eth->rx_napi))) {
    __napi_schedule(&eth->rx_napi);
-   mtk_rx_irq_disable(eth, MTK_RX_DONE_INT);
    }
+   mtk_rx_irq_disable(eth, MTK_RX_DONE_INT);
   return IRQ_HANDLED;
 }
@@ -1762,11 +1762,53 @@ static irqreturn_t mtk_handle_irq_tx(int irq, void *_eth)

   if (likely(napi_schedule_prep(&eth->tx_napi))) {
    __napi_schedule(&eth->tx_napi);
-   mtk_tx_irq_disable(eth, MTK_TX_DONE_INT);
    }
+   mtk_tx_irq_disable(eth, MTK_TX_DONE_INT);
   return IRQ_HANDLED;
 }


Yes, sorry I didn't point to that as well.  Just to be clear: I have no
clue how this thing is actually wired up, or whether you could use three
interrupts on the MT7621 too. I just messed with it until I got
something to work, based on René's original idea and code.


My idea is just a copy of mtk_handle_irq_{rx,tx}, see [1].
You probably want to look at the staging driver, or the Ubiquiti
sources with a 3.10.x kernel [2], or padavan with a 3.4.x kernel [3].

AFAIK the mt7621 only has one IRQ for the ethernet part.


Correct, there is only a single IRQ on the mt7621.

    John





Greets,

René

[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/net/ethernet/mediatek/mtk_eth_soc.c#n1739
[2] 
https://www.ubnt.com/download/edgemax/edgerouter-x-sfp/default/edgerouter-er-xer-x-sfpep-r6-firmware-v1107
[3] 
https://bitbucket.org/padavan/rt-n56u/src/e6f45337528f668651e251057a1a0fec735f6df1/trunk/linux-3.4.x/drivers/net/raeth/raether.c?at=master=file-view-default#raether.c-658
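
Putting the pieces of this thread together: with a single shared ethernet
IRQ (as on the mt7621), the RX and TX paths above collapse into one
handler. The sketch below is illustrative only; it borrows the mtk_*
helper names from the driver, and MTK_INT_STATUS_REG is a stand-in for
whatever status register the hardware actually exposes.

/* Sketch: one shared IRQ dispatching to separate RX/TX NAPI contexts.
 * Note the IRQ sources are masked unconditionally, per the fix quoted
 * above, so a late interrupt cannot re-fire while NAPI is scheduled.
 */
static irqreturn_t mtk_handle_irq(int irq, void *_eth)
{
	struct mtk_eth *eth = _eth;
	u32 status = mtk_r32(eth, MTK_INT_STATUS_REG);

	if (status & MTK_RX_DONE_INT) {
		if (likely(napi_schedule_prep(&eth->rx_napi)))
			__napi_schedule(&eth->rx_napi);
		mtk_rx_irq_disable(eth, MTK_RX_DONE_INT);
	}

	if (status & MTK_TX_DONE_INT) {
		if (likely(napi_schedule_prep(&eth->tx_napi)))
			__napi_schedule(&eth->tx_napi);
		mtk_tx_irq_disable(eth, MTK_TX_DONE_INT);
	}

	return IRQ_HANDLED;
}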





Re: [PATCH net-next,v4 00/12] add flow_rule infrastructure

2018-11-29 Thread John Fastabend
On 11/28/18 6:22 PM, Pablo Neira Ayuso wrote:
> Hi,
> 
> This patchset is another iteration to introduce an in-kernel intermediate
> representation (IR) to express ACL hardware offloads [1] [2] [3].
> 

Hi,

Also wanted to add: in an earlier thread it was mentioned that this could
be used for other offload rule infrastructures; specifically, u32 was
mentioned. I don't think this is actually possible on the flow_rule
side. This set uses an enum-based key system, where enums such as
FLOW_DISSECTOR_KEY_* identify the field in the packet, so for every
field we want to match a new key is needed. But the u32 classifier
defines fields using offset/mask pairs and a parse graph. They do not
seem compatible to me, so in the end this unifies ethtool and flower only.
Did I get this right?
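
To make the mismatch concrete, here are rough sketches of the two
matching models (simplified; these are not the actual kernel structures):

/* flower/flow_rule model: every matchable field has a named key, so
 * supporting a new field means adding a new FLOW_DISSECTOR_KEY_* value.
 */
struct flow_match_sketch {
	int key;		/* e.g. FLOW_DISSECTOR_KEY_IPV4_ADDRS */
	void *value, *mask;	/* match value/mask for that named field */
};

/* u32 model: fields are anonymous offset/mask selectors chained into a
 * parse graph; nothing in the representation names the field matched.
 */
struct u32_sel_sketch {
	int offset;		/* byte offset into the packet header */
	__be32 mask;		/* bits compared at that offset */
	__be32 value;
};

A new enum value per field on one side versus arbitrary unnamed offsets
on the other is the incompatibility described above.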

So would it be better to simply map ethtool onto flower vs. defining a
new IR? Patch 1 seems to be pretty lightweight, so maybe rather
than calling it a new IR we just need some helper routines for
drivers to work with.

A more detailed cover letter explaining the motivation and any future
work would probably help (me at least) understand the direction.
I see netfilter offload was mentioned at one point, so maybe that is
the motivation that makes it clearer why the flower API today is
insufficient. Mostly curious at this point; I see Jiri and Florian
have both reviewed it already.

Thanks,
John


[PATCH bpf-next v2 2/3] bpf: add msg_pop_data helper to tools

2018-11-26 Thread John Fastabend
Add the necessary header definitions to tools for the new
msg_pop_data helper.

Signed-off-by: John Fastabend 
---
 tools/include/uapi/linux/bpf.h| 16 +++-
 tools/testing/selftests/bpf/bpf_helpers.h |  2 ++
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 23e2031..597afdb 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2268,6 +2268,19 @@ union bpf_attr {
  *
  * Return
  * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_msg_pop_data(struct sk_msg_buff *msg, u32 start, u32 pop, u64 flags)
+ *  Description
+ * Will remove *pop* bytes from a *msg* starting at byte *start*.
+ * This may result in **ENOMEM** errors under certain situations if
+ * an allocation and copy are required due to a full ring buffer.
+ * However, the helper will try to avoid doing the allocation
+ * if possible. Other errors can occur if input parameters are
+ * invalid, either due to the *start* byte not being a valid part of
+ * the msg payload and/or the *pop* value being too large.
+ *
+ * Return
+ * 0 on success, or a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -2360,7 +2373,8 @@ union bpf_attr {
FN(map_push_elem),  \
FN(map_pop_elem),   \
FN(map_peek_elem),  \
-   FN(msg_push_data),
+   FN(msg_push_data),  \
+   FN(msg_pop_data),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index 686e57c..7b69519 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -113,6 +113,8 @@ static int (*bpf_msg_pull_data)(void *ctx, int start, int end, int flags) =
(void *) BPF_FUNC_msg_pull_data;
 static int (*bpf_msg_push_data)(void *ctx, int start, int end, int flags) =
(void *) BPF_FUNC_msg_push_data;
+static int (*bpf_msg_pop_data)(void *ctx, int start, int cut, int flags) =
+   (void *) BPF_FUNC_msg_pop_data;
 static int (*bpf_bind)(void *ctx, void *addr, int addr_len) =
(void *) BPF_FUNC_bind;
 static int (*bpf_xdp_adjust_tail)(void *ctx, int offset) =
-- 
2.7.4
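
As a usage illustration only (not part of the patch), an SK_MSG program
could call the new helper via the declaration added to bpf_helpers.h
above:

#include <linux/bpf.h>
#include "bpf_helpers.h"

SEC("sk_msg")
int msg_pop_example(struct sk_msg_md *msg)
{
	/* Pop 4 bytes of previously pushed metadata at offset 0;
	 * flags must be 0 or the helper returns -EINVAL.
	 */
	if (bpf_msg_pop_data(msg, 0, 4, 0))
		return SK_DROP;
	return SK_PASS;
}

char _license[] SEC("license") = "GPL";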



[PATCH bpf-next v2 3/3] bpf: test_sockmap, add options for msg_pop_data() helper

2018-11-26 Thread John Fastabend
Similar to msg_pull_data and msg_push_data, add a set of options to
exercise msg_pop_data().

Signed-off-by: John Fastabend 
---
 tools/testing/selftests/bpf/test_sockmap.c  | 127 +++-
 tools/testing/selftests/bpf/test_sockmap_kern.h |  70 ++---
 2 files changed, 180 insertions(+), 17 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_sockmap.c b/tools/testing/selftests/bpf/test_sockmap.c
index 622ade0..e85a771 100644
--- a/tools/testing/selftests/bpf/test_sockmap.c
+++ b/tools/testing/selftests/bpf/test_sockmap.c
@@ -79,6 +79,8 @@ int txmsg_start;
 int txmsg_end;
 int txmsg_start_push;
 int txmsg_end_push;
+int txmsg_start_pop;
+int txmsg_pop;
 int txmsg_ingress;
 int txmsg_skb;
 int ktls;
@@ -104,6 +106,8 @@ static const struct option long_options[] = {
{"txmsg_end",   required_argument,  NULL, 'e'},
{"txmsg_start_push", required_argument, NULL, 'p'},
{"txmsg_end_push",   required_argument, NULL, 'q'},
+   {"txmsg_start_pop",  required_argument, NULL, 'w'},
+   {"txmsg_pop",required_argument, NULL, 'x'},
{"txmsg_ingress", no_argument,  _ingress, 1 },
{"txmsg_skb", no_argument,  _skb, 1 },
{"ktls", no_argument,   , 1 },
@@ -473,13 +477,27 @@ static int msg_loop(int fd, int iov_count, int iov_length, int cnt,
clock_gettime(CLOCK_MONOTONIC, &s->end);
} else {
int slct, recvp = 0, recv, max_fd = fd;
+   float total_bytes, txmsg_pop_total;
int fd_flags = O_NONBLOCK;
struct timeval timeout;
-   float total_bytes;
fd_set w;
 
fcntl(fd, fd_flags);
+   /* Account for pop bytes noting each iteration of apply will
+* call msg_pop_data helper so we need to account for this
+* by calculating the number of apply iterations. Note user
+* of the tool can create cases where no data is sent by
+* manipulating pop/push/pull/etc. For example txmsg_apply 1
+* with txmsg_pop 1 will try to apply 1B at a time but each
+* iteration will then pop 1B so no data will ever be sent.
+* This is really only useful for testing edge cases in code
+* paths.
+*/
total_bytes = (float)iov_count * (float)iov_length * (float)cnt;
+   txmsg_pop_total = txmsg_pop;
+   if (txmsg_apply)
+   txmsg_pop_total *= (total_bytes / txmsg_apply);
+   total_bytes -= txmsg_pop_total;
err = clock_gettime(CLOCK_MONOTONIC, &s->start);
if (err < 0)
perror("recv start time: ");
@@ -488,7 +506,7 @@ static int msg_loop(int fd, int iov_count, int iov_length, int cnt,
timeout.tv_sec = 0;
timeout.tv_usec = 30;
} else {
-   timeout.tv_sec = 1;
+   timeout.tv_sec = 3;
timeout.tv_usec = 0;
}
 
@@ -503,7 +521,7 @@ static int msg_loop(int fd, int iov_count, int iov_length, int cnt,
goto out_errno;
} else if (!slct) {
if (opt->verbose)
-   fprintf(stderr, "unexpected timeout\n");
+   fprintf(stderr, "unexpected timeout: 
recved %zu/%f pop_total %f\n", s->bytes_recvd, total_bytes, txmsg_pop_total);
errno = -EIO;
clock_gettime(CLOCK_MONOTONIC, &s->end);
goto out_errno;
@@ -619,7 +637,7 @@ static int sendmsg_test(struct sockmap_options *opt)
iov_count = 1;
err = msg_loop(rx_fd, iov_count, iov_buf,
   cnt, &s, false, opt);
-   if (err && opt->verbose)
+   if (opt->verbose)
fprintf(stderr,
"msg_loop_rx: iov_count %i iov_buf %i cnt %i 
err %i\n",
iov_count, iov_buf, cnt, err);
@@ -931,6 +949,39 @@ static int run_options(struct sockmap_options *options, int cg_fd,  int test)
}
}
 
+   if (txmsg_start_pop) {
+   i = 4;
+   err = bpf_map_update_elem(map_fd[5],
+ &i, &txmsg_start_pop, BPF_ANY);
+   if (err) {
+   fprintf(stderr,
+  
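
A worked example of the receive accounting described in the comment
above, with hypothetical option values (iov_count=1, iov_length=1024,
cnt=1, txmsg_apply=512, txmsg_pop=10):

#include <stdio.h>

int main(void)
{
	/* 1024 bytes sent; apply runs the program once per 512 bytes,
	 * so two iterations each pop 10 bytes.
	 */
	float total_bytes = 1.0f * 1024 * 1;	/* iov_count * iov_length * cnt */
	float pop_total = 10 * (total_bytes / 512);

	total_bytes -= pop_total;
	printf("receiver should expect %.0f bytes\n", total_bytes); /* 1004 */
	return 0;
}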

[PATCH bpf-next v2 1/3] bpf: helper to pop data from messages

2018-11-26 Thread John Fastabend
This adds a BPF SK_MSG program helper so that we can pop data from a
msg. We use this to pop metadata from a previous push data call.

Signed-off-by: John Fastabend 
---
 include/uapi/linux/bpf.h |  16 -
 net/core/filter.c| 171 +++
 net/ipv4/tcp_bpf.c   |  17 -
 net/tls/tls_sw.c |  11 ++-
 4 files changed, 209 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 23e2031..597afdb 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2268,6 +2268,19 @@ union bpf_attr {
  *
  * Return
  * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_msg_pop_data(struct sk_msg_buff *msg, u32 start, u32 pop, u64 flags)
+ *  Description
+ * Will remove *pop* bytes from a *msg* starting at byte *start*.
+ * This may result in **ENOMEM** errors under certain situations if
+ * an allocation and copy are required due to a full ring buffer.
+ * However, the helper will try to avoid doing the allocation
+ * if possible. Other errors can occur if input parameters are
+ * invalid, either due to the *start* byte not being a valid part of
+ * the msg payload and/or the *pop* value being too large.
+ *
+ * Return
+ * 0 on success, or a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -2360,7 +2373,8 @@ union bpf_attr {
FN(map_push_elem),  \
FN(map_pop_elem),   \
FN(map_peek_elem),  \
-   FN(msg_push_data),
+   FN(msg_push_data),  \
+   FN(msg_pop_data),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/net/core/filter.c b/net/core/filter.c
index f50ea97..bd0df75 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2425,6 +2425,174 @@ static const struct bpf_func_proto bpf_msg_push_data_proto = {
.arg4_type  = ARG_ANYTHING,
 };
 
+static void sk_msg_shift_left(struct sk_msg *msg, int i)
+{
+   int prev;
+
+   do {
+   prev = i;
+   sk_msg_iter_var_next(i);
+   msg->sg.data[prev] = msg->sg.data[i];
+   } while (i != msg->sg.end);
+
+   sk_msg_iter_prev(msg, end);
+}
+
+static void sk_msg_shift_right(struct sk_msg *msg, int i)
+{
+   struct scatterlist tmp, sge;
+
+   sk_msg_iter_next(msg, end);
+   sge = sk_msg_elem_cpy(msg, i);
+   sk_msg_iter_var_next(i);
+   tmp = sk_msg_elem_cpy(msg, i);
+
+   while (i != msg->sg.end) {
+   msg->sg.data[i] = sge;
+   sk_msg_iter_var_next(i);
+   sge = tmp;
+   tmp = sk_msg_elem_cpy(msg, i);
+   }
+}
+
+BPF_CALL_4(bpf_msg_pop_data, struct sk_msg *, msg, u32, start,
+  u32, len, u64, flags)
+{
+   u32 i = 0, l, space, offset = 0;
+   u64 last = start + len;
+   int pop;
+
+   if (unlikely(flags))
+   return -EINVAL;
+
+   /* First find the starting scatterlist element */
+   i = msg->sg.start;
+   do {
+   l = sk_msg_elem(msg, i)->length;
+
+   if (start < offset + l)
+   break;
+   offset += l;
+   sk_msg_iter_var_next(i);
+   } while (i != msg->sg.end);
+
+   /* Bounds checks: start and pop must be inside message */
+   if (start >= offset + l || last >= msg->sg.size)
+   return -EINVAL;
+
+   space = MAX_MSG_FRAGS - sk_msg_elem_used(msg);
+
+   pop = len;
+   /* --| offset
+* -| start  | len ---|
+*
+*  |- a | pop ---|- b |
+*  |__| length
+*
+*
+* a:   region at front of scatter element to save
+* b:   region at back of scatter element to save when length > A + pop
+* pop: region to pop from element, same as input 'pop' here will be
+*  decremented below per iteration.
+*
+* Two top-level cases to handle when start != offset, first B is non
+* zero and second B is zero corresponding to when a pop includes more
+* than one element.
+*
+* Then if B is non-zero AND there is no space allocate space and
+* compact A, B regions into page. If there is space shift ring to
+*  the right, freeing the next element in ring to place B, leaving
+* A untouched except to reduce length.
+*/
+   if (start != offset) {
+   struct scatterlist *nsge, *sge = sk_msg_elem(msg, i);
+   int a = start;
+   int b = sge->length - pop - a;
+
+   sk_msg_iter_var_next(i
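
To make the ring-shift helpers above concrete, here is a standalone
sketch of the left-shift over a plain index ring, with a modulo
stand-in for sk_msg_iter_var_next(); the scatterlist payload and
MAX_MSG_FRAGS bookkeeping are abstracted away:

#define RING_SLOTS 8	/* stand-in for the sk_msg ring size */

static void ring_next(int *i)
{
	*i = (*i + 1) % RING_SLOTS;	/* analogue of sk_msg_iter_var_next() */
}

/* Remove slot i by copying each successor down one position, then pull
 * 'end' back a slot - the same walk sk_msg_shift_left() performs.
 */
static void shift_left(int *data, int i, int *end)
{
	int prev;

	do {
		prev = i;
		ring_next(&i);
		data[prev] = data[i];
	} while (i != *end);

	*end = (*end + RING_SLOTS - 1) % RING_SLOTS;
}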

[PATCH bpf-next v2 0/3] bpf: add sk_msg helper sk_msg_pop_data

2018-11-26 Thread John Fastabend
After being able to add metadata to messages with sk_msg_push_data we
have also found it useful to be able to "pop" this metadata off before
sending it to applications in some cases. This series adds a new helper
sk_msg_pop_data() and the associated patches to add tests and tools/lib
support.

Thanks!

v2: Daniel caught that we missed adding sk_msg_pop_data to the set of
helpers that change data, so that the verifier ensures BPF programs
revalidate data pointers after using this helper. Also improved the
documentation, adding a return description and using RST syntax per
Quentin's comment. And the delta calculations for DROP with pop'd data
(albeit a strange set of operations for a program to be doing) had the
potential to be incorrect, possibly confusing user space applications,
so fix that.

John Fastabend (3):
  bpf: helper to pop data from messages
  bpf: add msg_pop_data helper to tools
  bpf: test_sockmap, add options for msg_pop_data() helper usage

 include/uapi/linux/bpf.h|  13 +-
 net/core/filter.c   | 169 
 net/ipv4/tcp_bpf.c  |  14 +-
 tools/include/uapi/linux/bpf.h  |  13 +-
 tools/testing/selftests/bpf/bpf_helpers.h   |   2 +
 tools/testing/selftests/bpf/test_sockmap.c  | 127 +-
 tools/testing/selftests/bpf/test_sockmap_kern.h |  70 --
 7 files changed, 386 insertions(+), 22 deletions(-)

-- 
2.7.4
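
On the first changelog point: like the other data-changing helpers, a
program must reload and re-check its data pointers after a successful
pop or the verifier will reject it. A minimal sketch (illustrative, not
taken from the patches):

#include <linux/bpf.h>
#include "bpf_helpers.h"

SEC("sk_msg")
int pop_then_read(struct sk_msg_md *msg)
{
	void *data, *data_end;

	if (bpf_msg_pop_data(msg, 0, 4, 0))
		return SK_DROP;

	/* The helper invalidates msg->data/msg->data_end; reload and
	 * bounds-check before touching the payload again.
	 */
	data = (void *)(long)msg->data;
	data_end = (void *)(long)msg->data_end;
	if (data + 1 > data_end)
		return SK_DROP;

	return SK_PASS;
}

char _license[] SEC("license") = "GPL";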



Re: [PATCH bpf-next 1/3] bpf: helper to pop data from messages

2018-11-26 Thread John Fastabend
On 11/25/18 5:05 PM, Daniel Borkmann wrote:
> On 11/23/2018 02:38 AM, John Fastabend wrote:
>> This adds a BPF SK_MSG program helper so that we can pop data from a
>> msg. We use this to pop metadata from a previous push data call.
>>
>> Signed-off-by: John Fastabend 
>> ---
>>  include/uapi/linux/bpf.h |  13 +++-
>>  net/core/filter.c| 169 
>> +++
>>  net/ipv4/tcp_bpf.c   |  14 +++-
>>  3 files changed, 192 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index c1554aa..64681f8 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -2268,6 +2268,16 @@ union bpf_attr {
>>   *
>>   *  Return
>>   *  0 on success, or a negative error in case of failure.
>> + *
>> + * int bpf_msg_pop_data(struct sk_msg_buff *msg, u32 start, u32 pop, u64 
>> flags)
>> + *   Description
>> + *  Will remove 'pop' bytes from a msg starting at byte 'start'.
>> + *  This result in ENOMEM errors under certain situations where
>> + *  a allocation and copy are required due to a full ring buffer.
>> + *  However, the helper will try to avoid doing the allocation
>> + *  if possible. Other errors can occur if input parameters are
>> + *  invalid either do to start byte not being valid part of msg
>> + *  payload and/or pop value being to large.
>>   */
>>  #define __BPF_FUNC_MAPPER(FN)   \
>>  FN(unspec), \
>> @@ -2360,7 +2370,8 @@ union bpf_attr {
>>  FN(map_push_elem),  \
>>  FN(map_pop_elem),   \
>>  FN(map_peek_elem),  \
>> -FN(msg_push_data),
>> +FN(msg_push_data),  \
>> +FN(msg_pop_data),
>>  
>>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>>   * function eBPF program intends to call
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index f6ca38a..c6b35b5 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -2428,6 +2428,173 @@ static const struct bpf_func_proto 
>> bpf_msg_push_data_proto = {
>>  .arg4_type  = ARG_ANYTHING,
>>  };
>>  
>> +static void sk_msg_shift_left(struct sk_msg *msg, int i)
>> +{
>> +int prev;
>> +
>> +do {
>> +prev = i;
>> +sk_msg_iter_var_next(i);
>> +msg->sg.data[prev] = msg->sg.data[i];
>> +} while (i != msg->sg.end);
>> +
>> +sk_msg_iter_prev(msg, end);
>> +}
>> +
>> +static void sk_msg_shift_right(struct sk_msg *msg, int i)
>> +{
>> +struct scatterlist tmp, sge;
>> +
>> +sk_msg_iter_next(msg, end);
>> +sge = sk_msg_elem_cpy(msg, i);
>> +sk_msg_iter_var_next(i);
>> +tmp = sk_msg_elem_cpy(msg, i);
>> +
>> +while (i != msg->sg.end) {
>> +msg->sg.data[i] = sge;
>> +sk_msg_iter_var_next(i);
>> +sge = tmp;
>> +tmp = sk_msg_elem_cpy(msg, i);
>> +}
>> +}
>> +
>> +BPF_CALL_4(bpf_msg_pop_data, struct sk_msg *, msg, u32, start,
>> +   u32, len, u64, flags)
>> +{
>> +u32 i = 0, l, space, offset = 0;
>> +u64 last = start + len;
>> +int pop;
>> +
>> +if (unlikely(flags))
>> +return -EINVAL;
>> +
>> +/* First find the starting scatterlist element */
>> +i = msg->sg.start;
>> +do {
>> +l = sk_msg_elem(msg, i)->length;
>> +
>> +if (start < offset + l)
>> +break;
>> +offset += l;
>> +sk_msg_iter_var_next(i);
>> +} while (i != msg->sg.end);
>> +
>> +/* Bounds checks: start and pop must be inside message */
>> +if (start >= offset + l || last >= msg->sg.size)
>> +return -EINVAL;
>> +
>> +space = MAX_MSG_FRAGS - sk_msg_elem_used(msg);
>> +
>> +pop = len;
>> +/* --| offset
>> + * -| start  |--- len --|
>> + *
>> + *  |- a | pop ---|- b |
>> + *  |__| length
>> + *
>> + *
>> + * a:   region at front of scatter element to save
>> + * b:   region at back of scatter element to save when length > A + pop
>> + 

Re: [PATCH bpf-next 1/3] bpf: helper to pop data from messages

2018-11-26 Thread John Fastabend
On 11/26/18 3:16 AM, Quentin Monnet wrote:
> 2018-11-26 02:05 UTC+0100 ~ Daniel Borkmann 
>> On 11/23/2018 02:38 AM, John Fastabend wrote:
>>> This adds a BPF SK_MSG program helper so that we can pop data from a
>>> msg. We use this to pop metadata from a previous push data call.
>>>
>>> Signed-off-by: John Fastabend 
>>> ---
>>>  include/uapi/linux/bpf.h |  13 +++-
>>>  net/core/filter.c| 169 
>>> +++
>>>  net/ipv4/tcp_bpf.c   |  14 +++-
>>>  3 files changed, 192 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>> index c1554aa..64681f8 100644
>>> --- a/include/uapi/linux/bpf.h
>>> +++ b/include/uapi/linux/bpf.h
>>> @@ -2268,6 +2268,16 @@ union bpf_attr {
>>>   *
>>>   * Return
>>>   * 0 on success, or a negative error in case of failure.
>>> + *
>>> + * int bpf_msg_pop_data(struct sk_msg_buff *msg, u32 start, u32 pop, u64 
>>> flags)
>>> + *  Description
>>> + * Will remove 'pop' bytes from a msg starting at byte 'start'.
>>> + * This result in ENOMEM errors under certain situations where
>>> + * a allocation and copy are required due to a full ring buffer.
>>> + * However, the helper will try to avoid doing the allocation
>>> + * if possible. Other errors can occur if input parameters are
>>> + * invalid either do to start byte not being valid part of msg
>>> + * payload and/or pop value being to large.
>>>   */
> 
> Hi John,
> 
> If you respin could you please update the helper documentation to use
> RST syntax for argument and constant names (*pop* instead of 'pop',
> *msg*, *start*, *flags*, **ENOMEM**), and document the return value from
> the helper?
> 

Sure, no problem.

> Thanks a lot,
> Quentin
> 



[PATCH bpf-next 0/3] bpf: add sk_msg helper sk_msg_pop_data

2018-11-22 Thread John Fastabend
After being able to add metadata to messages with sk_msg_push_data we
have also found it useful to be able to "pop" this metadata off before
sending it to applications in some cases. This series adds a new helper
sk_msg_pop_data() and the associated patches to add tests and tools/lib
support.

Thanks!

John Fastabend (3):
  bpf: helper to pop data from messages
  bpf: add msg_pop_data helper to tools
  bpf: test_sockmap, add options for msg_pop_data() helper usage

 include/uapi/linux/bpf.h|  13 +-
 net/core/filter.c   | 169 
 net/ipv4/tcp_bpf.c  |  14 +-
 tools/include/uapi/linux/bpf.h  |  13 +-
 tools/testing/selftests/bpf/bpf_helpers.h   |   2 +
 tools/testing/selftests/bpf/test_sockmap.c  | 127 +-
 tools/testing/selftests/bpf/test_sockmap_kern.h |  70 --
 7 files changed, 386 insertions(+), 22 deletions(-)

-- 
2.7.4



[PATCH bpf-next 2/3] bpf: add msg_pop_data helper to tools

2018-11-22 Thread John Fastabend
Add the necessary header definitions to tools for the new
msg_pop_data helper.

Signed-off-by: John Fastabend 
---
 tools/include/uapi/linux/bpf.h| 13 -
 tools/testing/selftests/bpf/bpf_helpers.h |  2 ++
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index c1554aa..95cf7a5 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2268,6 +2268,16 @@ union bpf_attr {
  *
  * Return
  * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_msg_pop_data(struct sk_msg_buff *msg, u32 start, u32 pop, u64 flags)
+ * Description
+ * Will remove 'pop' bytes from a msg starting at byte 'start'.
+ * This result in ENOMEM errors under certain situations where
+ * a allocation and copy are required due to a full ring buffer.
+ * However, the helper will try to avoid doing the allocation
+ * if possible. Other errors can occur if input parameters are
+ * invalid either do to start byte not being valid part of msg
+ * payload and/or pop value being to large.
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -2360,7 +2370,8 @@ union bpf_attr {
FN(map_push_elem),  \
FN(map_pop_elem),   \
FN(map_peek_elem),  \
-   FN(msg_push_data),
+   FN(msg_push_data),  \
+   FN(msg_pop_data),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index 686e57c..7b69519 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -113,6 +113,8 @@ static int (*bpf_msg_pull_data)(void *ctx, int start, int end, int flags) =
(void *) BPF_FUNC_msg_pull_data;
 static int (*bpf_msg_push_data)(void *ctx, int start, int end, int flags) =
(void *) BPF_FUNC_msg_push_data;
+static int (*bpf_msg_pop_data)(void *ctx, int start, int cut, int flags) =
+   (void *) BPF_FUNC_msg_pop_data;
 static int (*bpf_bind)(void *ctx, void *addr, int addr_len) =
(void *) BPF_FUNC_bind;
 static int (*bpf_xdp_adjust_tail)(void *ctx, int offset) =
-- 
2.7.4



[PATCH bpf-next 1/3] bpf: helper to pop data from messages

2018-11-22 Thread John Fastabend
This adds a BPF SK_MSG program helper so that we can pop data from a
msg. We use this to pop metadata from a previous push data call.

Signed-off-by: John Fastabend 
---
 include/uapi/linux/bpf.h |  13 +++-
 net/core/filter.c| 169 +++
 net/ipv4/tcp_bpf.c   |  14 +++-
 3 files changed, 192 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c1554aa..64681f8 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2268,6 +2268,16 @@ union bpf_attr {
  *
  * Return
  * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_msg_pop_data(struct sk_msg_buff *msg, u32 start, u32 pop, u64 flags)
+ *  Description
+ * Will remove 'pop' bytes from a msg starting at byte 'start'.
+ * This result in ENOMEM errors under certain situations where
+ * a allocation and copy are required due to a full ring buffer.
+ * However, the helper will try to avoid doing the allocation
+ * if possible. Other errors can occur if input parameters are
+ * invalid either do to start byte not being valid part of msg
+ * payload and/or pop value being to large.
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -2360,7 +2370,8 @@ union bpf_attr {
FN(map_push_elem),  \
FN(map_pop_elem),   \
FN(map_peek_elem),  \
-   FN(msg_push_data),
+   FN(msg_push_data),  \
+   FN(msg_pop_data),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/net/core/filter.c b/net/core/filter.c
index f6ca38a..c6b35b5 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2428,6 +2428,173 @@ static const struct bpf_func_proto bpf_msg_push_data_proto = {
.arg4_type  = ARG_ANYTHING,
 };
 
+static void sk_msg_shift_left(struct sk_msg *msg, int i)
+{
+   int prev;
+
+   do {
+   prev = i;
+   sk_msg_iter_var_next(i);
+   msg->sg.data[prev] = msg->sg.data[i];
+   } while (i != msg->sg.end);
+
+   sk_msg_iter_prev(msg, end);
+}
+
+static void sk_msg_shift_right(struct sk_msg *msg, int i)
+{
+   struct scatterlist tmp, sge;
+
+   sk_msg_iter_next(msg, end);
+   sge = sk_msg_elem_cpy(msg, i);
+   sk_msg_iter_var_next(i);
+   tmp = sk_msg_elem_cpy(msg, i);
+
+   while (i != msg->sg.end) {
+   msg->sg.data[i] = sge;
+   sk_msg_iter_var_next(i);
+   sge = tmp;
+   tmp = sk_msg_elem_cpy(msg, i);
+   }
+}
+
+BPF_CALL_4(bpf_msg_pop_data, struct sk_msg *, msg, u32, start,
+  u32, len, u64, flags)
+{
+   u32 i = 0, l, space, offset = 0;
+   u64 last = start + len;
+   int pop;
+
+   if (unlikely(flags))
+   return -EINVAL;
+
+   /* First find the starting scatterlist element */
+   i = msg->sg.start;
+   do {
+   l = sk_msg_elem(msg, i)->length;
+
+   if (start < offset + l)
+   break;
+   offset += l;
+   sk_msg_iter_var_next(i);
+   } while (i != msg->sg.end);
+
+   /* Bounds checks: start and pop must be inside message */
+   if (start >= offset + l || last >= msg->sg.size)
+   return -EINVAL;
+
+   space = MAX_MSG_FRAGS - sk_msg_elem_used(msg);
+
+   pop = len;
+   /* --| offset
+* -| start  |--- len --|
+*
+*  |- a | pop ---|- b |
+*  |__| length
+*
+*
+* a:   region at front of scatter element to save
+* b:   region at back of scatter element to save when length > A + pop
+* pop: region to pop from element, same as input 'pop' here will be
+*  decremented below per iteration.
+*
+* Two top-level cases to handle when start != offset, first B is non
+* zero and second B is zero corresponding to when a pop includes more
+* than one element.
+*
+* Then if B is non-zero AND there is no space allocate space and
+* compact A, B regions into page. If there is space shift ring to
+*  the right, freeing the next element in ring to place B, leaving
+* A untouched except to reduce length.
+*/
+   if (start != offset) {
+   struct scatterlist *nsge, *sge = sk_msg_elem(msg, i);
+   int a = start;
+   int b = sge->length - pop - a;
+
+   sk_msg_iter_var_next(i);
+
+   if (pop < sge->length - a) {
+   if (space) {
+   sge->length = a;
+ 

[PATCH bpf-next 3/3] bpf: test_sockmap, add options for msg_pop_data()

2018-11-22 Thread John Fastabend
Similar to msg_pull_data and msg_push_data, add a set of options to
exercise msg_pop_data().

Signed-off-by: John Fastabend 
---
 tools/testing/selftests/bpf/test_sockmap.c  | 127 +++-
 tools/testing/selftests/bpf/test_sockmap_kern.h |  70 ++---
 2 files changed, 180 insertions(+), 17 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_sockmap.c b/tools/testing/selftests/bpf/test_sockmap.c
index 622ade0..e85a771 100644
--- a/tools/testing/selftests/bpf/test_sockmap.c
+++ b/tools/testing/selftests/bpf/test_sockmap.c
@@ -79,6 +79,8 @@ int txmsg_start;
 int txmsg_end;
 int txmsg_start_push;
 int txmsg_end_push;
+int txmsg_start_pop;
+int txmsg_pop;
 int txmsg_ingress;
 int txmsg_skb;
 int ktls;
@@ -104,6 +106,8 @@ static const struct option long_options[] = {
{"txmsg_end",   required_argument,  NULL, 'e'},
{"txmsg_start_push", required_argument, NULL, 'p'},
{"txmsg_end_push",   required_argument, NULL, 'q'},
+   {"txmsg_start_pop",  required_argument, NULL, 'w'},
+   {"txmsg_pop",required_argument, NULL, 'x'},
{"txmsg_ingress", no_argument,  _ingress, 1 },
{"txmsg_skb", no_argument,  _skb, 1 },
{"ktls", no_argument,   , 1 },
@@ -473,13 +477,27 @@ static int msg_loop(int fd, int iov_count, int iov_length, int cnt,
clock_gettime(CLOCK_MONOTONIC, &s->end);
} else {
int slct, recvp = 0, recv, max_fd = fd;
+   float total_bytes, txmsg_pop_total;
int fd_flags = O_NONBLOCK;
struct timeval timeout;
-   float total_bytes;
fd_set w;
 
fcntl(fd, fd_flags);
+   /* Account for pop bytes noting each iteration of apply will
+* call msg_pop_data helper so we need to account for this
+* by calculating the number of apply iterations. Note user
+* of the tool can create cases where no data is sent by
+* manipulating pop/push/pull/etc. For example txmsg_apply 1
+* with txmsg_pop 1 will try to apply 1B at a time but each
+* iteration will then pop 1B so no data will ever be sent.
+* This is really only useful for testing edge cases in code
+* paths.
+*/
total_bytes = (float)iov_count * (float)iov_length * (float)cnt;
+   txmsg_pop_total = txmsg_pop;
+   if (txmsg_apply)
+   txmsg_pop_total *= (total_bytes / txmsg_apply);
+   total_bytes -= txmsg_pop_total;
err = clock_gettime(CLOCK_MONOTONIC, &s->start);
if (err < 0)
perror("recv start time: ");
@@ -488,7 +506,7 @@ static int msg_loop(int fd, int iov_count, int iov_length, int cnt,
timeout.tv_sec = 0;
timeout.tv_usec = 30;
} else {
-   timeout.tv_sec = 1;
+   timeout.tv_sec = 3;
timeout.tv_usec = 0;
}
 
@@ -503,7 +521,7 @@ static int msg_loop(int fd, int iov_count, int iov_length, int cnt,
goto out_errno;
} else if (!slct) {
if (opt->verbose)
-   fprintf(stderr, "unexpected timeout\n");
+   fprintf(stderr, "unexpected timeout: 
recved %zu/%f pop_total %f\n", s->bytes_recvd, total_bytes, txmsg_pop_total);
errno = -EIO;
clock_gettime(CLOCK_MONOTONIC, &s->end);
goto out_errno;
@@ -619,7 +637,7 @@ static int sendmsg_test(struct sockmap_options *opt)
iov_count = 1;
err = msg_loop(rx_fd, iov_count, iov_buf,
   cnt, &s, false, opt);
-   if (err && opt->verbose)
+   if (opt->verbose)
fprintf(stderr,
"msg_loop_rx: iov_count %i iov_buf %i cnt %i 
err %i\n",
iov_count, iov_buf, cnt, err);
@@ -931,6 +949,39 @@ static int run_options(struct sockmap_options *options, int cg_fd,  int test)
}
}
 
+   if (txmsg_start_pop) {
+   i = 4;
+   err = bpf_map_update_elem(map_fd[5],
+ &i, &txmsg_start_pop, BPF_ANY);
+   if (err) {
+   fprintf(stderr,
+  

[PATCH net-next 3/3] nfp: flower: include geneve as supported offload tunnel type

2018-11-07 Thread John Hurley
Offload of geneve decap rules is supported in NFP. Include geneve in the
check for supported types.

Signed-off-by: John Hurley 
Reviewed-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c b/drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c
index 8e5bec0..170f314 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c
@@ -190,6 +190,8 @@ static bool nfp_tun_is_netdev_to_offload(struct net_device *netdev)
return true;
if (netif_is_vxlan(netdev))
return true;
+   if (netif_is_geneve(netdev))
+   return true;
 
return false;
 }
-- 
2.7.4



[PATCH net-next 2/3] nfp: flower: use geneve and vxlan helpers

2018-11-07 Thread John Hurley
Make use of the recently added VXLAN and geneve helper functions to
determine the type of the netdev from its rtnl_link_ops.

Signed-off-by: John Hurley 
Reviewed-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/flower/action.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/action.c b/drivers/net/ethernet/netronome/nfp/flower/action.c
index 244dc26..2f67cd55 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/action.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/action.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "cmsg.h"
 #include "main.h"
@@ -94,13 +95,10 @@ nfp_fl_pre_lag(struct nfp_app *app, const struct tc_action *action,
 static bool nfp_fl_netdev_is_tunnel_type(struct net_device *out_dev,
 enum nfp_flower_tun_type tun_type)
 {
-   if (!out_dev->rtnl_link_ops)
-   return false;
-
-   if (!strcmp(out_dev->rtnl_link_ops->kind, "vxlan"))
+   if (netif_is_vxlan(out_dev))
return tun_type == NFP_FL_TUNNEL_VXLAN;
 
-   if (!strcmp(out_dev->rtnl_link_ops->kind, "geneve"))
+   if (netif_is_geneve(out_dev))
return tun_type == NFP_FL_TUNNEL_GENEVE;
 
return false;
-- 
2.7.4



[PATCH net-next 1/3] net: add netif_is_geneve()

2018-11-07 Thread John Hurley
Add a helper function to determine if the type of a netdev is geneve based
on its rtnl_link_ops. This allows drivers that may wish to offload tunnels
to check the underlying type of the device.

A recent patch added a similar helper to vxlan.h

Signed-off-by: John Hurley 
Reviewed-by: Jakub Kicinski 
---
 include/net/geneve.h | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/include/net/geneve.h b/include/net/geneve.h
index a7600ed..fc6a7e0 100644
--- a/include/net/geneve.h
+++ b/include/net/geneve.h
@@ -60,6 +60,12 @@ struct genevehdr {
struct geneve_opt options[];
 };
 
+static inline bool netif_is_geneve(const struct net_device *dev)
+{
+   return dev->rtnl_link_ops &&
+  !strcmp(dev->rtnl_link_ops->kind, "geneve");
+}
+
 #ifdef CONFIG_INET
 struct net_device *geneve_dev_create_fb(struct net *net, const char *name,
u8 name_assign_type, u16 dst_port);
-- 
2.7.4



[PATCH net-next 0/3] nfp: add and use tunnel netdev helpers

2018-11-07 Thread John Hurley
A recent patch introduced the function netif_is_vxlan() to verify the
tunnel type of a given netdev as vxlan.

Add a similar function to detect geneve netdevs and make use of this
function in the NFP driver. Also make use of the vxlan helper where
applicable.

John Hurley (3):
  net: add netif_is_geneve()
  nfp: flower: use geneve and vxlan helpers
  nfp: flower: include geneve as supported offload tunnel type

 drivers/net/ethernet/netronome/nfp/flower/action.c  | 8 +++-
 drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c | 2 ++
 include/net/geneve.h| 6 ++
 3 files changed, 11 insertions(+), 5 deletions(-)

-- 
2.7.4



Rules for retransmitting sk_buffs?

2018-11-05 Thread John Ousterhout
I am creating a kernel module that implements the Homa transport
protocol (see paper in SIGCOMM 2018) and as a Linux kernel newbie I'm
struggling a bit to figure out how all of Linux's network plumbing
works.

I'm currently having problems retransmitting an sk_buff after packet
loss and hoping that perhaps someone here can help me understand the
rules and/or constraints around retransmission. Pointers to any
existing documentation would also be great.

I'm currently using the naive approach: Homa saves a reference to the
sk_buff after it is first transmitted, and if retransmission is
necessary it calls ip_queue_xmit again with the same sk_buff (it also
reuses the same flowi and dst as in the first call). The behavior I'm
seeing is very strange: the second call to ip_queue_xmit is modifying
the flowi so that its daddr is 127.0.0.1. This then results in an
ICMP_DEST_UNREACH error.

Am I doing something fundamentally wrong here? E.g., do I need to
clone the sk_buff before retransmitting it? If so, are there any
restrictions on *when* I clone it (I'd prefer not to do this unless
retransmission is necessary, just to save work)?

Thanks in advance for any advice/pointers.

-John-
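
For what it is worth, the pattern used by the in-tree TCP code (see
__tcp_transmit_skb()) is to keep the original skb on the retransmit
queue and hand the IP layer a clone, since ip_queue_xmit() consumes and
may modify the skb it is given. A sketch under that assumption, with
homa_xmit_clone() as a hypothetical name:

#include <linux/skbuff.h>
#include <net/ip.h>

/* Transmit a clone so 'skb' can stay queued for retransmission; the
 * clone shares the payload pages, so this is cheap.
 */
static int homa_xmit_clone(struct sock *sk, struct sk_buff *skb,
			   struct flowi *fl)
{
	struct sk_buff *clone = skb_clone(skb, GFP_ATOMIC);

	if (unlikely(!clone))
		return -ENOBUFS;

	return ip_queue_xmit(sk, clone, fl);
}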


ethtool 4.19 released

2018-11-02 Thread John W. Linville
ethtool version 4.19 has been released.

Home page: https://www.kernel.org/pub/software/network/ethtool/
Download link:
https://www.kernel.org/pub/software/network/ethtool/ethtool-4.19.tar.xz

Release notes:

* Feature: support combinations of FEC modes
* Feature: better syntax for combinations of FEC modes
* Fix: Fix uninitialized variable use at qsfp dump

John
-- 
John W. Linville		Someday the world will need a hero, and you
linvi...@tuxdriver.com  might be all we have.  Be ready.



Re: [PATCH bpf-next] bpf_load: add map name to load_maps error message

2018-10-29 Thread John Fastabend
On 10/29/2018 02:14 PM, Shannon Nelson wrote:
> To help when debugging bpf/xdp load issues, have the load_map()
> error message include the number and name of the map that
> failed.
> 
> Signed-off-by: Shannon Nelson 
> ---
>  samples/bpf/bpf_load.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
> index 89161c9..5de0357 100644
> --- a/samples/bpf/bpf_load.c
> +++ b/samples/bpf/bpf_load.c
> @@ -282,8 +282,8 @@ static int load_maps(struct bpf_map_data *maps, int 
> nr_maps,
>   numa_node);
>   }
>   if (map_fd[i] < 0) {
> - printf("failed to create a map: %d %s\n",
> -errno, strerror(errno));
> + printf("failed to create map %d (%s): %d %s\n",
> +i, maps[i].name, errno, strerror(errno));
>   return 1;
>   }
>   maps[i].fd = map_fd[i];
> 

LGTM

Acked-by: John Fastabend 


Re: [PATCH bpf-next] xdp: sample code for redirecting vlan packets to specific cpus

2018-10-29 Thread John Fastabend
On 10/29/2018 02:19 PM, Shannon Nelson wrote:
> This is an example of using XDP to redirect the processing of
> particular vlan packets to specific CPUs.  This is in response
> to comments received on a kernel patch put forth previously
> to do something similar using RPS.
>  https://www.spinics.net/lists/netdev/msg528210.html
>  [PATCH net-next] net: enable RPS on vlan devices
> 
> This XDP application watches for inbound vlan-tagged packets
> and redirects those packets to be processed on a specific CPU
> as configured in a BPF map.  The BPF map can be modified by
> this user program, which can also load and unload the kernel
> XDP code.
> 
> One example use is for supporting VMs where we can't control the
> OS being used: we'd like to separate the VM CPU processing from
> the host's CPUs as a way to help mitigate L1TF related issues.
> When running the VM's traffic on a vlan we can stick the host's
> Rx processing on one set of CPUs separate from the VM's CPUs.
> 
> This example currently uses a vlan key and cpu value in the
> BPF map, so only can do one CPU per vlan.  This could easily
> be modified to use a bitpattern of CPUs rather than a CPU id
> to allow multiple CPUs per vlan.

Great, so does this solve your use case then? At least on drivers
with XDP support?

> 
> Signed-off-by: Shannon Nelson 
> ---

Some really small and trivial nits below.

Acked-by: John Fastabend 

[...]

> + if (install) {
> +

New line probably not needed.

> + /* check to see if already installed */
> + errno = 0;
> + access(pin_prog_name, R_OK);
> + if (errno != ENOENT) {
> + fprintf(stderr, "ERR: %s is already installed\n", 
> argv[0]);
> + return -1;
> + }
> +
> + /* load the XDP program and maps with the convenient library */
> + if (load_bpf_file(filename)) {
> + fprintf(stderr, "ERR: load_bpf_file(%s): \n%s",
> + filename, bpf_log_buf);
> + return -1;
> + }
> + if (!prog_fd[0]) {
> + fprintf(stderr, "ERR: load_bpf_file(%s): %d %s\n",
> + filename, errno, strerror(errno));
> + return -1;
> + }
> +
> + /* pin the XDP program and maps */
> + if (bpf_obj_pin(prog_fd[0], pin_prog_name) < 0) {
> + fprintf(stderr, "ERR: bpf_obj_pin(%s): %d %s\n",
> + pin_prog_name, errno, strerror(errno));
> + if (errno == 2)
> + fprintf(stderr, " (is the BPF fs mounted on 
> /sys/fs/bpf?)\n");
> + return -1;
> + }
> + if (bpf_obj_pin(map_fd[0], pin_vlanmap_name) < 0) {
> + fprintf(stderr, "ERR: bpf_obj_pin(%s): %d %s\n",
> + pin_vlanmap_name, errno, strerror(errno));
> + return -1;
> + }
> + if (bpf_obj_pin(map_fd[2], pin_countermap_name) < 0) {
> + fprintf(stderr, "ERR: bpf_obj_pin(%s): %d %s\n",
> + pin_countermap_name, errno, strerror(errno));
> + return -1;
> + }
> +
> + /* prep the vlan map with "not used" values */
> + c64 = UNDEF_CPU;
> + for (v64 = 0; v64 < 4096; v64++) {

Maybe #define MAX_VLANS 4096, just to avoid magic constants.

> + if (bpf_map_update_elem(map_fd[0], &v64, &c64, 0)) {
> + fprintf(stderr, "ERR: preping vlan map failed 
> on v=%llu: %d %s\n",
> + v64, errno, strerror(errno));
> + return -1;
> + }
> + }
> +
> + /* prep the cpumap with queue sizes */
> + c64 = 128+64;  /* see note in xdp_redirect_cpu_user.c */
> + for (v64 = 0; v64 < MAX_CPUS; v64++) {
> + if (bpf_map_update_elem(map_fd[1], &v64, &c64, 0)) {
> + if (errno == ENODEV) {
> + /* Save the last CPU number attempted
> +  * into the counters map
> +  */
> + c64 = CPU_COUNT;
> + ret = bpf_map_update_elem(map_fd[2], &c64, &v64, 0);
> + break;
> + }
> +

Re: [PATCH] bpf: tcp_bpf_recvmsg should return EAGAIN when nonblocking and no data

2018-10-29 Thread John Fastabend
On 10/29/2018 12:31 PM, John Fastabend wrote:
> We return 0 in the case of a nonblocking socket that has no data
> available. However, this is incorrect and may confuse applications.
> After this patch we do the correct thing and return the error
> EAGAIN.
> 
> Quoting return codes from recvmsg manpage,
> 
> EAGAIN or EWOULDBLOCK
>  The socket is marked nonblocking and the receive operation would
>  block, or a receive timeout had been set and the timeout expired
>  before data was received.
> 
> Signed-off-by: John Fastabend 
> ---

Add fixes tag.

Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")






[PATCH] bpf: tcp_bpf_recvmsg should return EAGAIN when nonblocking and no data

2018-10-29 Thread John Fastabend
We return 0 in the case of a nonblocking socket that has no data
available. However, this is incorrect and may confuse applications.
After this patch we do the correct thing and return the error
EAGAIN.

Quoting return codes from recvmsg manpage,

EAGAIN or EWOULDBLOCK
 The socket is marked nonblocking and the receive operation would
 block, or a receive timeout had been set and the timeout expired
 before data was received.

Signed-off-by: John Fastabend 
---
 net/ipv4/tcp_bpf.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index b7918d4..3b45fe5 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -145,6 +145,7 @@ int tcp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
ret = err;
goto out;
}
+   copied = -EAGAIN;
}
ret = copied;
 out:
-- 
1.9.1
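
For illustration, the userspace-visible difference (a hedged sketch,
not from the patch): before the fix a nonblocking read with no queued
data returned 0, which applications normally treat as EOF; with the fix
it fails with EAGAIN, as the manpage requires.

#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>

static int try_read(int fd, void *buf, size_t len)
{
	ssize_t n = recv(fd, buf, len, MSG_DONTWAIT);

	if (n > 0)
		return 1;	/* got data */
	if (n == 0)
		return 0;	/* peer closed - no longer ambiguous */
	if (errno == EAGAIN || errno == EWOULDBLOCK)
		return -1;	/* no data yet, retry later */
	perror("recv");
	return -2;		/* real error */
}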



Re: [RFC net-next v2 1/8] net: sched: register callbacks for indirect tc block binds

2018-10-29 Thread John Hurley
On Sun, Oct 28, 2018 at 11:10 AM Or Gerlitz wrote:
>
> On Thu, Oct 25, 2018 at 3:28 PM John Hurley wrote:
> > Currently drivers can register to receive TC block bind/unbind callbacks
> > by implementing the setup_tc ndo in any of their given netdevs. However,
> > drivers may also be interested in binds to higher level devices (e.g.
> > tunnel drivers) to potentially offload filters applied to them.
>
> > Introduce indirect block devs which allows drivers to register callbacks
> > for block binds on other devices. The calling driver is expected to
> > reference an 'owner' struct that it will pass to all block registrations.
> > This is used to track the callbacks from a given driver and free them if
> > the driver is removed while the upper level device is still active.
>
> Hi John,
>
> Maybe it would be better to follow the trusted environment model of the kernel
> and not protect the core from driver bugs? If the driver does things right 
> they
> will unregister before bailing out and if not, they will have to fix..
>

Hi Or,
The owner stuff just makes it easier for a driver to track the blocks
it has registered for and, in turn, release these when exiting.
We could just leave this up to the driver to ensure it properly cleans
up after itself.
I don't feel that strongly either way.

> > Freeing a callback will also trigger an unbind event (if necessary) to
> > direct the driver to remove any offloaded rules and unreg any block filter
> > callbacks.
>
> > Allow registering an indirect block dev callback for a device that is
> > already bound to a block. In this case (if it is an ingress block),
> > register and also trigger the callback meaning that any already installed
> > rules can be replayed to the calling driver.
>
> not just can be replayed.. they will be replayed, but through an
> existing (tc re-offload?)
> facility, correct?
>

Yes, currently in TC, when you register for rule callbacks to a block
that already has rules, these rules are replayed.
With the indirect block approach we still use the same mechanism for
requesting rule callbacks.

> Or.


Re: [RFC net-next v2 2/8] net: add netif_is_geneve()

2018-10-29 Thread John Hurley
On Fri, Oct 26, 2018 at 9:52 AM Sergei Shtylyov wrote:
>
> Hello!
>
> On 25.10.2018 15:26, John Hurley wrote:
>
> > Add a helper function to determine if the type of a netdev is geneve based
> > on its rtnl_link_ops. This allows drivers that may wish to ofload tunnels
>
> Offload?
>

offload encap/decap to a hardware device such as a smartNIC.
Sorry, I should have made this clearer.

> > to check the underlying type of the device.
> >
> > A recent patch added a similar helper to vxlan.h
> >
> > Signed-off-by: John Hurley 
> > Reviewed-by: Jakub Kicinski 
> [...]
>
> MBR, Sergei
>
>


Re: [RFC net-next v2 2/8] net: add netif_is_geneve()

2018-10-25 Thread John Hurley
On Thu, Oct 25, 2018 at 2:00 PM Jiri Pirko wrote:
>
> Thu, Oct 25, 2018 at 02:26:51PM CEST, john.hur...@netronome.com wrote:
> >Add a helper function to determine if the type of a netdev is geneve based
> >on its rtnl_link_ops. This allows drivers that may wish to ofload tunnels
> >to check the underlying type of the device.
> >
> >A recent patch added a similar helper to vxlan.h
> >
> >Signed-off-by: John Hurley 
> >Reviewed-by: Jakub Kicinski 
>
> I don't understand why this and the next patch are part of this
> patchset. They don't seem directly related.

This is used in later patches that implement the indirect block
offload, but I suppose it is not directly related.
We can probably move it to a separate patchset.
Thanks


Re: [RFC net-next v2 0/8] indirect tc block cb registration

2018-10-25 Thread John Hurley
On Thu, Oct 25, 2018 at 1:58 PM Jiri Pirko wrote:
>
> Thu, Oct 25, 2018 at 02:26:49PM CEST, john.hur...@netronome.com wrote:
> >This patchset introduces an alternative to egdev offload by allowing a
> >driver to register for block updates when an external device (e.g. tunnel
> >netdev) is bound to a TC block. Drivers can track new netdevs or register
> >to existing ones to receive information on such events. Based on this,
> >they may register for block offload rules using already existing
> >functions.
> >
> >The patchset also implements this new indirect block registration in the
> >NFP driver to allow the offloading of tunnel rules. The use of egdev
> >offload (which is currently only used for tunnel offload) is subsequently
> >removed.
>
> John, I'm missing v1->v2 changelog. Could you please add it?
>
> Thanks!

Hi Jiri,
There's little change outside the NFP in v2, but here's a short changelog:

v1->v2:
- free allocated owner struct in block_owner_clean function
- add geneve type helper function
- move test stub in NFP (v1 patch 2) to full tunnel offload
implementation via indirect blocks (v2 patches 3-8)


[RFC net-next v2 8/8] nfp: flower: remove unnecessary code in flow lookup

2018-10-25 Thread John Hurley
Recent changes to NFP mean that stats updates from fw to driver no longer
require a flow lookup and (because egdev offload has been removed) the
ingress netdev for a lookup is now always known.

Remove obsolete code in a flow lookup that matches on host context and
that allows for a netdev to be NULL.

Signed-off-by: John Hurley 
Reviewed-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/flower/main.h |  3 +--
 drivers/net/ethernet/netronome/nfp/flower/metadata.c | 11 +++
 drivers/net/ethernet/netronome/nfp/flower/offload.c  |  6 ++
 3 files changed, 6 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/main.h b/drivers/net/ethernet/netronome/nfp/flower/main.h
index d8c8f0d..3d3a13f 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/main.h
+++ b/drivers/net/ethernet/netronome/nfp/flower/main.h
@@ -20,7 +20,6 @@ struct nfp_fl_pre_lag;
 struct net_device;
 struct nfp_app;
 
-#define NFP_FL_STATS_CTX_DONT_CARE cpu_to_be32(0xffffffff)
 #define NFP_FL_STATS_ELEM_RS   FIELD_SIZEOF(struct nfp_fl_stats_id, \
 init_unalloc)
 #define NFP_FLOWER_MASK_ENTRY_RS   256
@@ -248,7 +247,7 @@ int nfp_modify_flow_metadata(struct nfp_app *app,
 
 struct nfp_fl_payload *
 nfp_flower_search_fl_table(struct nfp_app *app, unsigned long tc_flower_cookie,
-  struct net_device *netdev, __be32 host_ctx);
+  struct net_device *netdev);
 struct nfp_fl_payload *
nfp_flower_remove_fl_table(struct nfp_app *app, unsigned long tc_flower_cookie);
 
diff --git a/drivers/net/ethernet/netronome/nfp/flower/metadata.c b/drivers/net/ethernet/netronome/nfp/flower/metadata.c
index 9b4711c..573a440 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/metadata.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/metadata.c
@@ -21,7 +21,6 @@ struct nfp_mask_id_table {
 struct nfp_fl_flow_table_cmp_arg {
struct net_device *netdev;
unsigned long cookie;
-   __be32 host_ctx;
 };
 
 static int nfp_release_stats_entry(struct nfp_app *app, u32 stats_context_id)
@@ -76,14 +75,13 @@ static int nfp_get_stats_entry(struct nfp_app *app, u32 *stats_context_id)
 /* Must be called with either RTNL or rcu_read_lock */
 struct nfp_fl_payload *
 nfp_flower_search_fl_table(struct nfp_app *app, unsigned long tc_flower_cookie,
-  struct net_device *netdev, __be32 host_ctx)
+  struct net_device *netdev)
 {
struct nfp_fl_flow_table_cmp_arg flower_cmp_arg;
struct nfp_flower_priv *priv = app->priv;
 
flower_cmp_arg.netdev = netdev;
flower_cmp_arg.cookie = tc_flower_cookie;
-   flower_cmp_arg.host_ctx = host_ctx;
 
return rhashtable_lookup_fast(&priv->flow_table, &flower_cmp_arg,
  nfp_flower_table_params);
@@ -307,8 +305,7 @@ int nfp_compile_flow_metadata(struct nfp_app *app,
priv->stats[stats_cxt].bytes = 0;
priv->stats[stats_cxt].used = jiffies;
 
-   check_entry = nfp_flower_search_fl_table(app, flow->cookie, netdev,
-NFP_FL_STATS_CTX_DONT_CARE);
+   check_entry = nfp_flower_search_fl_table(app, flow->cookie, netdev);
if (check_entry) {
if (nfp_release_stats_entry(app, stats_cxt))
return -EINVAL;
@@ -353,9 +350,7 @@ static int nfp_fl_obj_cmpfn(struct rhashtable_compare_arg *arg,
const struct nfp_fl_flow_table_cmp_arg *cmp_arg = arg->key;
const struct nfp_fl_payload *flow_entry = obj;
 
-   if ((!cmp_arg->netdev || flow_entry->ingress_dev == cmp_arg->netdev) &&
-   (cmp_arg->host_ctx == NFP_FL_STATS_CTX_DONT_CARE ||
-flow_entry->meta.host_ctx_id == cmp_arg->host_ctx))
+   if (flow_entry->ingress_dev == cmp_arg->netdev)
return flow_entry->tc_flower_cookie != cmp_arg->cookie;
 
return 1;
diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c b/drivers/net/ethernet/netronome/nfp/flower/offload.c
index 392d292..07ff728 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
@@ -512,8 +512,7 @@ nfp_flower_del_offload(struct nfp_app *app, struct net_device *netdev,
if (nfp_netdev_is_nfp_repr(netdev))
port = nfp_port_from_netdev(netdev);
 
-   nfp_flow = nfp_flower_search_fl_table(app, flow->cookie, netdev,
- NFP_FL_STATS_CTX_DONT_CARE);
+   nfp_flow = nfp_flower_search_fl_table(app, flow->cookie, netdev);
if (!nfp_flow)
return -ENOENT;
 
@@ -561,8 +560,7 @@ nfp_flower_get_stats(struct nfp_app *app, struct net_device *netdev,
struct nfp_fl_payload *nfp_flow;
u32 ctx_id;
 
- 

[RFC net-next v2 5/8] nfp: flower: add infrastructure for indirect TC block register

2018-10-25 Thread John Hurley
Add support structures and functions that can be used by NFP to implement
the indirect block register functionality of TC.

Signed-off-by: John Hurley 
Reviewed-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/flower/main.c   |  13 +++
 drivers/net/ethernet/netronome/nfp/flower/main.h   |   8 ++
 .../net/ethernet/netronome/nfp/flower/offload.c| 129 +
 3 files changed, 150 insertions(+)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/main.c b/drivers/net/ethernet/netronome/nfp/flower/main.c
index 3a54728..518006c 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/main.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/main.c
@@ -568,8 +568,18 @@ static int nfp_flower_init(struct nfp_app *app)
goto err_cleanup_metadata;
}
 
+   INIT_LIST_HEAD(&app_priv->indr_block_cb_priv);
+   app_priv->indr_block_owner = tc_indr_block_owner_create();
+   if (!app_priv->indr_block_owner) {
+   err = -ENOMEM;
+   goto err_lag_clean;
+   }
+
return 0;
 
+err_lag_clean:
+   if (app_priv->flower_ext_feats & NFP_FL_FEATS_LAG)
+   nfp_flower_lag_cleanup(&app_priv->nfp_lag);
 err_cleanup_metadata:
nfp_flower_metadata_cleanup(app);
 err_free_app_priv:
@@ -588,6 +598,8 @@ static void nfp_flower_clean(struct nfp_app *app)
if (app_priv->flower_ext_feats & NFP_FL_FEATS_LAG)
nfp_flower_lag_cleanup(&app_priv->nfp_lag);
 
+   nfp_flower_clean_indr_block_priv(app);
+
nfp_flower_metadata_cleanup(app);
vfree(app->priv);
app->priv = NULL;
@@ -678,6 +690,7 @@ static void nfp_flower_stop(struct nfp_app *app)
unregister_netdevice_notifier(&app_priv->nfp_lag.lag_nb);
 
nfp_tunnel_config_stop(app);
+   tc_indr_block_owner_clean(app_priv->indr_block_owner);
 }
 
 const struct nfp_app_type app_flower = {
diff --git a/drivers/net/ethernet/netronome/nfp/flower/main.h b/drivers/net/ethernet/netronome/nfp/flower/main.h
index a91ac52..8b4bcf3 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/main.h
+++ b/drivers/net/ethernet/netronome/nfp/flower/main.h
@@ -133,6 +133,8 @@ struct nfp_fl_lag {
  * @reify_wait_queue:  wait queue for repr reify response counting
  * @mtu_conf:  Configuration of repr MTU value
  * @nfp_lag:   Link aggregation data block
+ * @indr_block_cb_priv:	List of priv data passed to indirect block registers
+ * @indr_block_owner:  Struct required for indirect blocks
  */
 struct nfp_flower_priv {
struct nfp_app *app;
@@ -166,6 +168,8 @@ struct nfp_flower_priv {
wait_queue_head_t reify_wait_queue;
struct nfp_mtu_conf mtu_conf;
struct nfp_fl_lag nfp_lag;
+   struct list_head indr_block_cb_priv;
+   struct tcf_indr_block_owner *indr_block_owner;
 };
 
 /**
@@ -269,5 +273,9 @@ int nfp_flower_lag_populate_pre_action(struct nfp_app *app,
   struct nfp_fl_pre_lag *pre_act);
 int nfp_flower_lag_get_output_id(struct nfp_app *app,
 struct net_device *master);
+void
+nfp_flower_register_indr_block(struct nfp_app *app, struct net_device *netdev);
+void nfp_flower_unregister_indr_block(struct net_device *netdev);
+void nfp_flower_clean_indr_block_priv(struct nfp_app *app);
 
 #endif
diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c b/drivers/net/ethernet/netronome/nfp/flower/offload.c
index 2c32edf..f701b2e 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
@@ -693,3 +693,132 @@ int nfp_flower_setup_tc(struct nfp_app *app, struct net_device *netdev,
return -EOPNOTSUPP;
}
 }
+
+struct nfp_flower_indr_block_cb_priv {
+   struct net_device *netdev;
+   struct nfp_app *app;
+   struct list_head list;
+};
+
+static struct nfp_flower_indr_block_cb_priv *
+nfp_flower_indr_block_cb_priv_lookup(struct nfp_app *app,
+struct net_device *netdev)
+{
+   struct nfp_flower_indr_block_cb_priv *cb_priv;
+   struct nfp_flower_priv *priv = app->priv;
+
+   /* All callback list access should be protected by RTNL. */
+   ASSERT_RTNL();
+
+   list_for_each_entry(cb_priv, &priv->indr_block_cb_priv, list)
+   if (cb_priv->netdev == netdev)
+   return cb_priv;
+
+   return NULL;
+}
+
+void nfp_flower_clean_indr_block_priv(struct nfp_app *app)
+{
+   struct nfp_flower_indr_block_cb_priv *cb_priv, *temp;
+   struct nfp_flower_priv *priv = app->priv;
+
+   list_for_each_entry_safe(cb_priv, temp, &priv->indr_block_cb_priv, list)
+   kfree(cb_priv);
+}
+
+static int nfp_flower_setup_indr_block_cb(enum tc_setup_type type,
+ void *type_data, void *cb_priv)
+{
+   struct nfp_flower_indr_block_cb_priv *priv = cb_priv;

[RFC net-next v2 6/8] nfp: flower: offload tunnel decap rules via indirect TC blocks

2018-10-25 Thread John Hurley
Previously, TC block tunnel decap rules were only offloaded when a
callback was triggered through registration of the rule's egress device.
This meant that the driver had no access to the ingress netdev and so
could not verify it was the same tunnel type that the rule implied.

Register tunnel devices for indirect TC block offloads in NFP, giving
access to new rules based on the ingress device rather than egress. Use
this to verify the netdev type of VXLAN and Geneve based rules and offload
the rules to HW if applicable.

Tunnel registration is done via a netdev notifier. On notifier
registration, this is triggered for already existing netdevs. This means
that NFP can register for offloads from devices that exist before it is
loaded (filter rules will be replayed from the TC core).

Signed-off-by: John Hurley 
Reviewed-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/flower/action.c  | 15 ---
 drivers/net/ethernet/netronome/nfp/flower/cmsg.h| 13 +
 drivers/net/ethernet/netronome/nfp/flower/offload.c | 11 +++
 drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c |  9 -
 4 files changed, 28 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/action.c 
b/drivers/net/ethernet/netronome/nfp/flower/action.c
index 04349c7..1260825 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/action.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/action.c
@@ -91,21 +91,6 @@ nfp_fl_pre_lag(struct nfp_app *app, const struct tc_action 
*action,
return act_size;
 }
 
-static bool nfp_fl_netdev_is_tunnel_type(struct net_device *out_dev,
-enum nfp_flower_tun_type tun_type)
-{
-   if (!out_dev->rtnl_link_ops)
-   return false;
-
-   if (!strcmp(out_dev->rtnl_link_ops->kind, "vxlan"))
-   return tun_type == NFP_FL_TUNNEL_VXLAN;
-
-   if (!strcmp(out_dev->rtnl_link_ops->kind, "geneve"))
-   return tun_type == NFP_FL_TUNNEL_GENEVE;
-
-   return false;
-}
-
 static int
 nfp_fl_output(struct nfp_app *app, struct nfp_fl_output *output,
  const struct tc_action *action, struct nfp_fl_payload *nfp_flow,
diff --git a/drivers/net/ethernet/netronome/nfp/flower/cmsg.h 
b/drivers/net/ethernet/netronome/nfp/flower/cmsg.h
index 29d673a..06e2888 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/cmsg.h
+++ b/drivers/net/ethernet/netronome/nfp/flower/cmsg.h
@@ -8,6 +8,7 @@
#include <linux/bitfield.h>
#include <linux/skbuff.h>
#include <linux/types.h>
+#include <net/geneve.h>
 
 #include "../nfp_app.h"
 #include "../nfpcore/nfp_cpp.h"
@@ -475,6 +476,18 @@ static inline int nfp_flower_cmsg_get_data_len(struct 
sk_buff *skb)
return skb->len - NFP_FLOWER_CMSG_HLEN;
 }
 
+static inline bool
+nfp_fl_netdev_is_tunnel_type(struct net_device *dev,
+enum nfp_flower_tun_type tun_type)
+{
+   if (netif_is_vxlan(dev))
+   return tun_type == NFP_FL_TUNNEL_VXLAN;
+   if (netif_is_geneve(dev))
+   return tun_type == NFP_FL_TUNNEL_GENEVE;
+
+   return false;
+}
+
 struct sk_buff *
 nfp_flower_cmsg_mac_repr_start(struct nfp_app *app, unsigned int num_ports);
 void
diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c 
b/drivers/net/ethernet/netronome/nfp/flower/offload.c
index f701b2e..1dc6044 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
@@ -128,6 +128,7 @@ nfp_flower_calc_opt_layer(struct 
flow_dissector_key_enc_opts *enc_opts,
 
 static int
 nfp_flower_calculate_key_layers(struct nfp_app *app,
+   struct net_device *netdev,
struct nfp_fl_key_ls *ret_key_ls,
struct tc_cls_flower_offload *flow,
bool egress,
@@ -186,8 +187,6 @@ nfp_flower_calculate_key_layers(struct nfp_app *app,
skb_flow_dissector_target(flow->dissector,
  
FLOW_DISSECTOR_KEY_ENC_CONTROL,
  flow->key);
-   if (!egress)
-   return -EOPNOTSUPP;
 
if (mask_enc_ctl->addr_type != 0xffff ||
enc_ctl->addr_type != FLOW_DISSECTOR_KEY_IPV4_ADDRS)
@@ -250,6 +249,10 @@ nfp_flower_calculate_key_layers(struct nfp_app *app,
default:
return -EOPNOTSUPP;
}
+
+   /* Ensure the ingress netdev matches the expected tun type. */
+   if (!nfp_fl_netdev_is_tunnel_type(netdev, *tun_type))
+   return -EOPNOTSUPP;
} else if (egress) {
/* Reject non tunnel matches offloaded to egress repr. */
return -EOPNOTSUPP;
@@ -451,8 +454,8 @@ nfp_flower_add_offload(struct nfp_app *app, struct 
net_

[RFC net-next v2 7/8] nfp: flower: remove TC egdev offloads

2018-10-25 Thread John Hurley
Previously, only tunnel decap rules required egdev registration for
offload in NFP. These are now supported via indirect TC block callbacks.

Remove the egdev code from NFP.

Signed-off-by: John Hurley 
Reviewed-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/flower/main.c   | 12 
 drivers/net/ethernet/netronome/nfp/flower/main.h   |  3 -
 .../net/ethernet/netronome/nfp/flower/metadata.c   |  1 +
 .../net/ethernet/netronome/nfp/flower/offload.c| 79 +-
 4 files changed, 17 insertions(+), 78 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/main.c 
b/drivers/net/ethernet/netronome/nfp/flower/main.c
index 518006c..45ab4be 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/main.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/main.c
@@ -146,23 +146,12 @@ nfp_flower_repr_netdev_stop(struct nfp_app *app, struct 
nfp_repr *repr)
return nfp_flower_cmsg_portmod(repr, false, repr->netdev->mtu, false);
 }
 
-static int
-nfp_flower_repr_netdev_init(struct nfp_app *app, struct net_device *netdev)
-{
-   return tc_setup_cb_egdev_register(netdev,
- nfp_flower_setup_tc_egress_cb,
- netdev_priv(netdev));
-}
-
 static void
 nfp_flower_repr_netdev_clean(struct nfp_app *app, struct net_device *netdev)
 {
struct nfp_repr *repr = netdev_priv(netdev);
 
kfree(repr->app_priv);
-
-   tc_setup_cb_egdev_unregister(netdev, nfp_flower_setup_tc_egress_cb,
-netdev_priv(netdev));
 }
 
 static void
@@ -711,7 +700,6 @@ const struct nfp_app_type app_flower = {
.vnic_init  = nfp_flower_vnic_init,
.vnic_clean = nfp_flower_vnic_clean,
 
-   .repr_init  = nfp_flower_repr_netdev_init,
.repr_preclean  = nfp_flower_repr_netdev_preclean,
.repr_clean = nfp_flower_repr_netdev_clean,
 
diff --git a/drivers/net/ethernet/netronome/nfp/flower/main.h 
b/drivers/net/ethernet/netronome/nfp/flower/main.h
index 8b4bcf3..d8c8f0d 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/main.h
+++ b/drivers/net/ethernet/netronome/nfp/flower/main.h
@@ -213,7 +213,6 @@ struct nfp_fl_payload {
char *unmasked_data;
char *mask_data;
char *action_data;
-   bool ingress_offload;
 };
 
 extern const struct rhashtable_params nfp_flower_table_params;
@@ -262,8 +261,6 @@ void nfp_tunnel_del_ipv4_off(struct nfp_app *app, __be32 
ipv4);
 void nfp_tunnel_add_ipv4_off(struct nfp_app *app, __be32 ipv4);
 void nfp_tunnel_request_route(struct nfp_app *app, struct sk_buff *skb);
 void nfp_tunnel_keep_alive(struct nfp_app *app, struct sk_buff *skb);
-int nfp_flower_setup_tc_egress_cb(enum tc_setup_type type, void *type_data,
- void *cb_priv);
 void nfp_flower_lag_init(struct nfp_fl_lag *lag);
 void nfp_flower_lag_cleanup(struct nfp_fl_lag *lag);
 int nfp_flower_lag_reset(struct nfp_fl_lag *lag);
diff --git a/drivers/net/ethernet/netronome/nfp/flower/metadata.c 
b/drivers/net/ethernet/netronome/nfp/flower/metadata.c
index 48729bf..9b4711c 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/metadata.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/metadata.c
@@ -287,6 +287,7 @@ int nfp_compile_flow_metadata(struct nfp_app *app,
 
nfp_flow->meta.host_ctx_id = cpu_to_be32(stats_cxt);
nfp_flow->meta.host_cookie = cpu_to_be64(flow->cookie);
+   nfp_flow->ingress_dev = netdev;
 
new_mask_id = 0;
if (!nfp_check_mask_add(app, nfp_flow->mask_data,
diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c 
b/drivers/net/ethernet/netronome/nfp/flower/offload.c
index 1dc6044..392d292 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
@@ -131,7 +131,6 @@ nfp_flower_calculate_key_layers(struct nfp_app *app,
struct net_device *netdev,
struct nfp_fl_key_ls *ret_key_ls,
struct tc_cls_flower_offload *flow,
-   bool egress,
enum nfp_flower_tun_type *tun_type)
 {
struct flow_dissector_key_basic *mask_basic = NULL;
@@ -253,9 +252,6 @@ nfp_flower_calculate_key_layers(struct nfp_app *app,
/* Ensure the ingress netdev matches the expected tun type. */
if (!nfp_fl_netdev_is_tunnel_type(netdev, *tun_type))
return -EOPNOTSUPP;
-   } else if (egress) {
-   /* Reject non tunnel matches offloaded to egress repr. */
-   return -EOPNOTSUPP;
}
 
if (dissector_uses_key(flow->dissector, FLOW_DISSECTOR_KEY_BASIC)) {
@@ -376,7 +372,7 @@ nfp_flower_calculate_key_layers(struct nfp_app *app,
 }
 
 static struct nfp_fl_payload *
-nfp_flower_allocate_new(struct nfp_fl_key_ls *key_layer, bool egress)
+nfp

[RFC net-next v2 2/8] net: add netif_is_geneve()

2018-10-25 Thread John Hurley
Add a helper function to determine if the type of a netdev is geneve based
on its rtnl_link_ops. This allows drivers that may wish to offload tunnels
to check the underlying type of the device.

A recent patch added a similar helper to vxlan.h

Signed-off-by: John Hurley 
Reviewed-by: Jakub Kicinski 
---
 include/net/geneve.h | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/include/net/geneve.h b/include/net/geneve.h
index a7600ed..fc6a7e0 100644
--- a/include/net/geneve.h
+++ b/include/net/geneve.h
@@ -60,6 +60,12 @@ struct genevehdr {
struct geneve_opt options[];
 };
 
+static inline bool netif_is_geneve(const struct net_device *dev)
+{
+   return dev->rtnl_link_ops &&
+  !strcmp(dev->rtnl_link_ops->kind, "geneve");
+}
+
 #ifdef CONFIG_INET
 struct net_device *geneve_dev_create_fb(struct net *net, const char *name,
u8 name_assign_type, u16 dst_port);
-- 
2.7.4



[RFC net-next v2 4/8] nfp: flower: allow non repr netdev offload

2018-10-25 Thread John Hurley
Previously the offload functions in NFP assumed that the ingress (or
egress) netdev passed to them was an nfp repr.

Modify the driver to permit the passing of non repr netdevs as the ingress
device for an offload rule candidate. This may include devices such as
tunnels. The driver should then base its offload decision on a combination
of ingress device and egress port for a rule.

Signed-off-by: John Hurley 
Reviewed-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/flower/action.c | 14 
 drivers/net/ethernet/netronome/nfp/flower/main.h   |  3 +-
 drivers/net/ethernet/netronome/nfp/flower/match.c  | 38 --
 .../net/ethernet/netronome/nfp/flower/offload.c| 33 +++
 4 files changed, 49 insertions(+), 39 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/action.c 
b/drivers/net/ethernet/netronome/nfp/flower/action.c
index 244dc26..04349c7 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/action.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/action.c
@@ -151,11 +151,12 @@ nfp_fl_output(struct nfp_app *app, struct nfp_fl_output 
*output,
/* Set action output parameters. */
output->flags = cpu_to_be16(tmp_flags);
 
-   /* Only offload if egress ports are on the same device as the
-* ingress port.
-*/
-   if (!switchdev_port_same_parent_id(in_dev, out_dev))
-   return -EOPNOTSUPP;
+   if (nfp_netdev_is_nfp_repr(in_dev)) {
+   /* Confirm ingress and egress are on same device. */
+   if (!switchdev_port_same_parent_id(in_dev, out_dev))
+   return -EOPNOTSUPP;
+   }
+
if (!nfp_netdev_is_nfp_repr(out_dev))
return -EOPNOTSUPP;
 
@@ -728,9 +729,8 @@ nfp_flower_loop_action(struct nfp_app *app, const struct 
tc_action *a,
*a_len += sizeof(struct nfp_fl_push_vlan);
} else if (is_tcf_tunnel_set(a)) {
struct ip_tunnel_info *ip_tun = tcf_tunnel_info(a);
-   struct nfp_repr *repr = netdev_priv(netdev);
 
-   *tun_type = nfp_fl_get_tun_from_act_l4_port(repr->app, a);
+   *tun_type = nfp_fl_get_tun_from_act_l4_port(app, a);
if (*tun_type == NFP_FL_TUNNEL_NONE)
return -EOPNOTSUPP;
 
diff --git a/drivers/net/ethernet/netronome/nfp/flower/main.h 
b/drivers/net/ethernet/netronome/nfp/flower/main.h
index 90045ba..a91ac52 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/main.h
+++ b/drivers/net/ethernet/netronome/nfp/flower/main.h
@@ -226,7 +226,8 @@ void nfp_flower_metadata_cleanup(struct nfp_app *app);
 
 int nfp_flower_setup_tc(struct nfp_app *app, struct net_device *netdev,
enum tc_setup_type type, void *type_data);
-int nfp_flower_compile_flow_match(struct tc_cls_flower_offload *flow,
+int nfp_flower_compile_flow_match(struct nfp_app *app,
+ struct tc_cls_flower_offload *flow,
  struct nfp_fl_key_ls *key_ls,
  struct net_device *netdev,
  struct nfp_fl_payload *nfp_flow,
diff --git a/drivers/net/ethernet/netronome/nfp/flower/match.c 
b/drivers/net/ethernet/netronome/nfp/flower/match.c
index e54fb60..cdf7559 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/match.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/match.c
@@ -52,10 +52,13 @@ nfp_flower_compile_port(struct nfp_flower_in_port *frame, 
u32 cmsg_port,
return 0;
}
 
-   if (tun_type)
+   if (tun_type) {
frame->in_port = cpu_to_be32(NFP_FL_PORT_TYPE_TUN | tun_type);
-   else
+   } else {
+   if (!cmsg_port)
+   return -EOPNOTSUPP;
frame->in_port = cpu_to_be32(cmsg_port);
+   }
 
return 0;
 }
@@ -289,17 +292,21 @@ nfp_flower_compile_ipv4_udp_tun(struct 
nfp_flower_ipv4_udp_tun *frame,
}
 }
 
-int nfp_flower_compile_flow_match(struct tc_cls_flower_offload *flow,
+int nfp_flower_compile_flow_match(struct nfp_app *app,
+ struct tc_cls_flower_offload *flow,
  struct nfp_fl_key_ls *key_ls,
  struct net_device *netdev,
  struct nfp_fl_payload *nfp_flow,
  enum nfp_flower_tun_type tun_type)
 {
-   struct nfp_repr *netdev_repr;
+   u32 cmsg_port = 0;
int err;
u8 *ext;
u8 *msk;
 
+   if (nfp_netdev_is_nfp_repr(netdev))
+   cmsg_port = nfp_repr_get_port_id(netdev);
+
memset(nfp_flow->unmasked_data, 0, key_ls->key_size);
memset(nfp_flow->mask_data, 0, key_ls->key_size);
 
@@ -327,15 +334,13 @@ int nfp_flower_c

[RFC net-next v2 3/8] nfp: flower: include geneve as supported offload tunnel type

2018-10-25 Thread John Hurley
Offload of geneve decap rules is supported in NFP. Include geneve in the
check for supported types.

Signed-off-by: John Hurley 
Reviewed-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c 
b/drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c
index 8e5bec0..170f314 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c
@@ -190,6 +190,8 @@ static bool nfp_tun_is_netdev_to_offload(struct net_device 
*netdev)
return true;
if (netif_is_vxlan(netdev))
return true;
+   if (netif_is_geneve(netdev))
+   return true;
 
return false;
 }
-- 
2.7.4



[RFC net-next v2 0/8] indirect tc block cb registration

2018-10-25 Thread John Hurley
This patchset introduces an alternative to egdev offload by allowing a
driver to register for block updates when an external device (e.g. tunnel
netdev) is bound to a TC block. Drivers can track new netdevs or register
to existing ones to receive information on such events. Based on this,
they may register for block offload rules using already existing
functions.

The patchset also implements this new indirect block registration in the
NFP driver to allow the offloading of tunnel rules. The use of egdev
offload (which is currently only used for tunnel offload) is subsequently
removed.

John Hurley (8):
  net: sched: register callbacks for indirect tc block binds
  net: add netif_is_geneve()
  nfp: flower: include geneve as supported offload tunnel type
  nfp: flower: allow non repr netdev offload
  nfp: flower: add infrastructure for indirect TC block register
  nfp: flower: offload tunnel decap rules via indirect TC blocks
  nfp: flower: remove TC egdev offloads
  nfp: flower: remove unnecessary code in flow lookup

 drivers/net/ethernet/netronome/nfp/flower/action.c |  29 +-
 drivers/net/ethernet/netronome/nfp/flower/cmsg.h   |  13 +
 drivers/net/ethernet/netronome/nfp/flower/main.c   |  25 +-
 drivers/net/ethernet/netronome/nfp/flower/main.h   |  17 +-
 drivers/net/ethernet/netronome/nfp/flower/match.c  |  38 +--
 .../net/ethernet/netronome/nfp/flower/metadata.c   |  12 +-
 .../net/ethernet/netronome/nfp/flower/offload.c| 246 +++--
 .../ethernet/netronome/nfp/flower/tunnel_conf.c|  11 +-
 include/net/geneve.h   |   6 +
 include/net/pkt_cls.h  |  56 
 include/net/sch_generic.h  |   3 +
 net/sched/cls_api.c| 299 -
 12 files changed, 609 insertions(+), 146 deletions(-)

-- 
2.7.4



[RFC net-next v2 1/8] net: sched: register callbacks for indirect tc block binds

2018-10-25 Thread John Hurley
Currently drivers can register to receive TC block bind/unbind callbacks
by implementing the setup_tc ndo in any of their given netdevs. However,
drivers may also be interested in binds to higher level devices (e.g.
tunnel drivers) to potentially offload filters applied to them.

Introduce indirect block devs which allow drivers to register callbacks
for block binds on other devices. The calling driver is expected to
reference an 'owner' struct that it will pass to all block registrations.
This is used to track the callbacks from a given driver and free them if
the driver is removed while the upper level device is still active.
Freeing a callback will also trigger an unbind event (if necessary) to
direct the driver to remove any offloaded rules and unregister any block
filter callbacks.

Allow registering an indirect block dev callback for a device that is
already bound to a block. In this case (if it is an ingress block),
register and also trigger the callback meaning that any already installed
rules can be replayed to the calling driver.
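
For reference, a minimal consumer of the new API could look as follows
(my_priv, my_block_cb and my_setup_block are hypothetical driver names;
only the tc_indr_block_* calls and the tc_indr_block_bind_cb_t
signature come from this patch):

static int my_block_cb(struct net_device *dev, void *cb_priv,
		       enum tc_setup_type type, void *type_data)
{
	struct my_priv *priv = cb_priv;

	switch (type) {
	case TC_SETUP_BLOCK:
		/* Block bound/unbound on dev; register block callbacks
		 * and offload any replayed filters here.
		 */
		return my_setup_block(priv, dev, type_data);
	default:
		return -EOPNOTSUPP;
	}
}

static int my_init(struct my_priv *priv)
{
	/* One owner per driver instance, cleaned on driver removal. */
	priv->owner = tc_indr_block_owner_create();
	if (!priv->owner)
		return -ENOMEM;
	return 0;
}

static int my_track_netdev(struct my_priv *priv, struct net_device *netdev)
{
	/* E.g. called from a netdev notifier for interesting devices. */
	return tc_indr_block_cb_register(netdev, priv, my_block_cb,
					 priv, priv->owner);
}

static void my_exit(struct my_priv *priv)
{
	/* Frees remaining callbacks and triggers unbinds as needed. */
	tc_indr_block_owner_clean(priv->owner);
}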

Signed-off-by: John Hurley 
Signed-off-by: Jakub Kicinski 
---
 include/net/pkt_cls.h |  56 +
 include/net/sch_generic.h |   3 +
 net/sched/cls_api.c   | 299 +-
 3 files changed, 357 insertions(+), 1 deletion(-)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index 72ffb31..1b47837 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -37,6 +37,7 @@ struct tcf_block_ext_info {
 };
 
 struct tcf_block_cb;
+struct tcf_indr_block_owner;
 bool tcf_queue_work(struct rcu_work *rwork, work_func_t func);
 
 #ifdef CONFIG_NET_CLS
@@ -81,6 +82,20 @@ void __tcf_block_cb_unregister(struct tcf_block *block,
   struct tcf_block_cb *block_cb);
 void tcf_block_cb_unregister(struct tcf_block *block,
 tc_setup_cb_t *cb, void *cb_ident);
+int __tc_indr_block_cb_register(struct net_device *dev, void *cb_priv,
+   tc_indr_block_bind_cb_t *cb, void *cb_ident,
+   struct tcf_indr_block_owner *owner);
+int tc_indr_block_cb_register(struct net_device *dev, void *cb_priv,
+ tc_indr_block_bind_cb_t *cb, void *cb_ident,
+ struct tcf_indr_block_owner *owner);
+void __tc_indr_block_cb_unregister(struct net_device *dev,
+  tc_indr_block_bind_cb_t *cb, void *cb_ident);
+void tc_indr_block_cb_unregister(struct net_device *dev,
+tc_indr_block_bind_cb_t *cb,
+void *cb_ident);
+
+struct tcf_indr_block_owner *tc_indr_block_owner_create(void);
+void tc_indr_block_owner_clean(struct tcf_indr_block_owner *owner);
 
 int tcf_classify(struct sk_buff *skb, const struct tcf_proto *tp,
 struct tcf_result *res, bool compat_mode);
@@ -183,6 +198,47 @@ void tcf_block_cb_unregister(struct tcf_block *block,
 {
 }
 
+static inline
+int __tc_indr_block_cb_register(struct net_device *dev, void *cb_priv,
+   tc_indr_block_bind_cb_t *cb,
+   void *cb_ident,
+   struct tcf_indr_block_owner *owner)
+{
+   return 0;
+}
+
+static inline
+int tc_indr_block_cb_register(struct net_device *dev, void *cb_priv,
+ tc_indr_block_bind_cb_t *cb, void *cb_ident,
+ struct tcf_indr_block_owner *owner)
+{
+   return 0;
+}
+
+static inline
+void __tc_indr_block_cb_unregister(struct net_device *dev,
+  tc_indr_block_bind_cb_t *cb,
+  void *cb_ident)
+{
+}
+
+static inline
+void tc_indr_block_cb_unregister(struct net_device *dev,
+tc_indr_block_bind_cb_t *cb,
+void *cb_ident)
+{
+}
+
+static inline struct tcf_indr_block_owner *tc_indr_block_owner_create(void)
+{
+   /* NULL would mean an error, only CONFIG_NET_CLS can dereference this */
+   return (void *)1;
+}
+
+static inline void tc_indr_block_owner_clean(struct tcf_indr_block_owner 
*owner)
+{
+}
+
 static inline int tcf_classify(struct sk_buff *skb, const struct tcf_proto *tp,
   struct tcf_result *res, bool compat_mode)
 {
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 4d73642..8301581 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -24,6 +24,9 @@ struct bpf_flow_keys;
 typedef int tc_setup_cb_t(enum tc_setup_type type,
  void *type_data, void *cb_priv);
 
+typedef int tc_indr_block_bind_cb_t(struct net_device *dev, void *cb_priv,
+   enum tc_setup_type type, void *type_data);
+
 struct qdisc_rate_table {
struct tc_ratespec rate;
u32 data[256];
diff --git a/net/sched/cls_api.c b/net/sched

[bpf-next v2 2/3] bpf: libbpf support for msg_push_data

2018-10-19 Thread John Fastabend
Add support for new bpf_msg_push_data in libbpf.

Signed-off-by: John Fastabend 
---
 tools/include/uapi/linux/bpf.h| 20 +++-
 tools/testing/selftests/bpf/bpf_helpers.h |  2 ++
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index a2fb333..852dc17 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2240,6 +2240,23 @@ struct bpf_stack_build_id {
  * pointer that was returned from bpf_sk_lookup_xxx\ ().
  * Return
  * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_msg_push_data(struct sk_msg_buff *msg, u32 start, u32 len, u64 flags)
+ * Description
+ * For socket policies, insert *len* bytes into msg at offset
+ * *start*.
+ *
+ * If a program of type **BPF_PROG_TYPE_SK_MSG** is run on a
+ * *msg* it may want to insert metadata or options into the msg.
+ * This can later be read and used by any of the lower layer BPF
+ * hooks.
+ *
+ * This helper may fail if under memory pressure (a malloc
+ * fails); in these cases the BPF program will get an appropriate
+ * error and will need to handle it.
+ *
+ * Return
+ * 0 on success, or a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -2331,7 +2348,8 @@ struct bpf_stack_build_id {
FN(sk_release), \
FN(map_push_elem),  \
FN(map_pop_elem),   \
-   FN(map_peek_elem),
+   FN(map_peek_elem),  \
+   FN(msg_push_data),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h 
b/tools/testing/selftests/bpf/bpf_helpers.h
index 6407a3d..686e57c 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -111,6 +111,8 @@ static int (*bpf_msg_cork_bytes)(void *ctx, int len) =
(void *) BPF_FUNC_msg_cork_bytes;
 static int (*bpf_msg_pull_data)(void *ctx, int start, int end, int flags) =
(void *) BPF_FUNC_msg_pull_data;
+static int (*bpf_msg_push_data)(void *ctx, int start, int end, int flags) =
+   (void *) BPF_FUNC_msg_push_data;
 static int (*bpf_bind)(void *ctx, void *addr, int addr_len) =
(void *) BPF_FUNC_bind;
 static int (*bpf_xdp_adjust_tail)(void *ctx, int offset) =
-- 
1.9.1



[bpf-next v2 0/3] sockmap, bpf_msg_push_data helper

2018-10-19 Thread John Fastabend
This series adds a new helper bpf_msg_push_data to be used by
sk_msg programs. The helper can be used to insert extra bytes into
the message that can then be used by the program as metadata tags
among other things.

The first patch adds the helper, second patch the libbpf support,
and last patch updates test_sockmap to run msg_push_data tests.

v2: rebase after queue map and in filter.c convert int -> u32

John Fastabend (3):
  bpf: sk_msg program helper bpf_msg_push_data
  bpf: libbpf support for msg_push_data
  bpf: test_sockmap add options to use msg_push_data

 include/linux/skmsg.h   |   5 +
 include/uapi/linux/bpf.h|  20 +++-
 net/core/filter.c   | 134 
 tools/include/uapi/linux/bpf.h  |  20 +++-
 tools/testing/selftests/bpf/bpf_helpers.h   |   2 +
 tools/testing/selftests/bpf/test_sockmap.c  |  58 +-
 tools/testing/selftests/bpf/test_sockmap_kern.h |  97 +
 7 files changed, 308 insertions(+), 28 deletions(-)

-- 
1.9.1



[bpf-next v2 1/3] bpf: sk_msg program helper bpf_msg_push_data

2018-10-19 Thread John Fastabend
This allows a user to push data into a msg using sk_msg program types.
The format is as follows,

bpf_msg_push_data(msg, offset, len, flags)

this will insert 'len' bytes at offset 'offset'. For example to
prepend 10 bytes at the front of the message the user can,

bpf_msg_push_data(msg, 0, 10, 0);

This will invalidate the data bounds, so the BPF user will have to
recheck them after calling this. After this the msg size will have been
updated and the user is free to write into the added bytes. We allow
any offset/len as long as it is within the (data, data_end) range.
However, a copy will be required if the ring is full, and it's possible
for the helper to fail with ENOMEM or EINVAL errors, which need to be
handled by the BPF program.

This can be used similar to XDP metadata to pass data between sk_msg
layer and lower layers.
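
A minimal sk_msg usage sketch (an assumed test program, not part of
this patch): reserve an 8 byte header at the front of each msg, then
recheck the bounds before writing into it:

#include <linux/bpf.h>
#include "bpf_helpers.h"

SEC("sk_msg")
int msg_prepend_md(struct sk_msg_md *msg)
{
	__u8 *data, *data_end;

	/* Insert 8 bytes at offset 0; may fail with ENOMEM/EINVAL. */
	if (bpf_msg_push_data(msg, 0, 8, 0))
		return SK_DROP;

	/* The push invalidated the old bounds; recheck before writing. */
	data = (void *)(long)msg->data;
	data_end = (void *)(long)msg->data_end;
	if (data + 8 > data_end)
		return SK_PASS;

	data[0] = 0x1;	/* hypothetical metadata tag */
	return SK_PASS;
}

char _license[] SEC("license") = "GPL";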

Signed-off-by: John Fastabend 
---
 include/linux/skmsg.h|   5 ++
 include/uapi/linux/bpf.h |  20 ++-
 net/core/filter.c| 134 +++
 3 files changed, 158 insertions(+), 1 deletion(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 84e1886..2a11e9d 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -207,6 +207,11 @@ static inline struct scatterlist *sk_msg_elem(struct 
sk_msg *msg, int which)
return &msg->sg.data[which];
 }
 
+static inline struct scatterlist sk_msg_elem_cpy(struct sk_msg *msg, int which)
+{
+   return msg->sg.data[which];
+}
+
 static inline struct page *sk_msg_page(struct sk_msg *msg, int which)
 {
return sg_page(sk_msg_elem(msg, which));
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index a2fb333..852dc17 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2240,6 +2240,23 @@ struct bpf_stack_build_id {
  * pointer that was returned from bpf_sk_lookup_xxx\ ().
  * Return
  * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_msg_push_data(struct sk_msg_buff *msg, u32 start, u32 len, u64 flags)
+ * Description
+ * For socket policies, insert *len* bytes into msg at offset
+ * *start*.
+ *
+ * If a program of type **BPF_PROG_TYPE_SK_MSG** is run on a
+ * *msg* it may want to insert metadata or options into the msg.
+ * This can later be read and used by any of the lower layer BPF
+ * hooks.
+ *
+ * This helper may fail if under memory pressure (a malloc
+ * fails); in these cases the BPF program will get an appropriate
+ * error and will need to handle it.
+ *
+ * Return
+ * 0 on success, or a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -2331,7 +2348,8 @@ struct bpf_stack_build_id {
FN(sk_release), \
FN(map_push_elem),  \
FN(map_pop_elem),   \
-   FN(map_peek_elem),
+   FN(map_peek_elem),  \
+   FN(msg_push_data),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/net/core/filter.c b/net/core/filter.c
index 5fd5139..35c6933 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2297,6 +2297,137 @@ int skb_do_redirect(struct sk_buff *skb)
.arg4_type  = ARG_ANYTHING,
 };
 
+BPF_CALL_4(bpf_msg_push_data, struct sk_msg *, msg, u32, start,
+  u32, len, u64, flags)
+{
+   struct scatterlist sge, nsge, nnsge, rsge = {0}, *psge;
+   u32 new, i = 0, l, space, copy = 0, offset = 0;
+   u8 *raw, *to, *from;
+   struct page *page;
+
+   if (unlikely(flags))
+   return -EINVAL;
+
+   /* First find the starting scatterlist element */
+   i = msg->sg.start;
+   do {
+   l = sk_msg_elem(msg, i)->length;
+
+   if (start < offset + l)
+   break;
+   offset += l;
+   sk_msg_iter_var_next(i);
+   } while (i != msg->sg.end);
+
+   if (start >= offset + l)
+   return -EINVAL;
+
+   space = MAX_MSG_FRAGS - sk_msg_elem_used(msg);
+
+   /* If no space available will fallback to copy, we need at
+* least one scatterlist elem available to push data into
+* when start aligns to the beginning of an element or two
+* when it falls inside an element. We handle the start equals
+* offset case because its the common case for inserting a
+* header.
+*/
+   if (!space || (space == 1 && start != offset))
+   copy = msg->sg.data[i].length;
+
+   page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC | __GFP_COMP,
+  get_order(copy + len));
+   if (unlikely(!page))
+   return -ENOMEM;
+
+   if (copy) {
+   int fr

[bpf-next v2 3/3] bpf: test_sockmap add options to use msg_push_data

2018-10-19 Thread John Fastabend
Add options to run msg_push_data, this patch creates two more flags
in test_sockmap that can be used to specify the offset and length
of bytes to be added. The new options are --txmsg_start_push to
specify where bytes should be inserted and --txmsg_end_push to
specify how many bytes. This is analogous to the options that are
used to pull data, --txmsg_start and --txmsg_end.

In addition to adding the options, tests are added to the test
suite to run the tests, similar to what was done for msg_pull_data.

Signed-off-by: John Fastabend 
---
 tools/testing/selftests/bpf/test_sockmap.c  | 58 ++-
 tools/testing/selftests/bpf/test_sockmap_kern.h | 97 +++--
 2 files changed, 129 insertions(+), 26 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_sockmap.c 
b/tools/testing/selftests/bpf/test_sockmap.c
index cbd1c0b..622ade0 100644
--- a/tools/testing/selftests/bpf/test_sockmap.c
+++ b/tools/testing/selftests/bpf/test_sockmap.c
@@ -77,6 +77,8 @@
 int txmsg_cork;
 int txmsg_start;
 int txmsg_end;
+int txmsg_start_push;
+int txmsg_end_push;
 int txmsg_ingress;
 int txmsg_skb;
 int ktls;
@@ -100,6 +102,8 @@
{"txmsg_cork",  required_argument,  NULL, 'k'},
{"txmsg_start", required_argument,  NULL, 's'},
{"txmsg_end",   required_argument,  NULL, 'e'},
+   {"txmsg_start_push", required_argument, NULL, 'p'},
+   {"txmsg_end_push",   required_argument, NULL, 'q'},
{"txmsg_ingress", no_argument,  _ingress, 1 },
{"txmsg_skb", no_argument,  _skb, 1 },
{"ktls", no_argument,   , 1 },
@@ -903,6 +907,30 @@ static int run_options(struct sockmap_options *options, 
int cg_fd,  int test)
}
}
 
+   if (txmsg_start_push) {
+   i = 2;
+   err = bpf_map_update_elem(map_fd[5],
+                                 &i, &txmsg_start_push, BPF_ANY);
+               if (err) {
+                       fprintf(stderr,
+                               "ERROR: bpf_map_update_elem (txmsg_start_push):  %d (%s)\n",
+                               err, strerror(errno));
+                       goto out;
+               }
+   }
+
+   if (txmsg_end_push) {
+   i = 3;
+   err = bpf_map_update_elem(map_fd[5],
+                                 &i, &txmsg_end_push, BPF_ANY);
+               if (err) {
+                       fprintf(stderr,
+                               "ERROR: bpf_map_update_elem %i@%i (txmsg_end_push):  %d (%s)\n",
+                               txmsg_end_push, i, err, strerror(errno));
+                       goto out;
+   }
+   }
+
if (txmsg_ingress) {
int in = BPF_F_INGRESS;
 
@@ -1235,6 +1263,8 @@ static int test_mixed(int cgrp)
txmsg_pass = txmsg_noisy = txmsg_redir_noisy = txmsg_drop = 0;
txmsg_apply = txmsg_cork = 0;
txmsg_start = txmsg_end = 0;
+   txmsg_start_push = txmsg_end_push = 0;
+
/* Test small and large iov_count values with pass/redir/apply/cork */
txmsg_pass = 1;
txmsg_redir = 0;
@@ -1351,6 +1381,8 @@ static int test_start_end(int cgrp)
/* Test basic start/end with lots of iov_count and iov_lengths */
txmsg_start = 1;
txmsg_end = 2;
+   txmsg_start_push = 1;
+   txmsg_end_push = 2;
err = test_txmsg(cgrp);
if (err)
goto out;
@@ -1364,6 +1396,8 @@ static int test_start_end(int cgrp)
for (i = 99; i <= 1600; i += 500) {
txmsg_start = 0;
txmsg_end = i;
+   txmsg_start_push = 0;
+   txmsg_end_push = i;
err = test_exec(cgrp, &opt);
if (err)
goto out;
@@ -1373,6 +1407,8 @@ static int test_start_end(int cgrp)
for (i = 199; i <= 1600; i += 500) {
txmsg_start = 100;
txmsg_end = i;
+   txmsg_start_push = 100;
+   txmsg_end_push = i;
err = test_exec(cgrp, &opt);
if (err)
goto out;
@@ -1381,6 +1417,8 @@ static int test_start_end(int cgrp)
/* Test start/end with cork pulling last sg entry */
txmsg_start = 1500;
txmsg_end = 1600;
+   txmsg_start_push = 1500;
+   txmsg_end_push = 1600;
err = test_exec(cgrp, &opt);
if (err)
goto out;
@@ -1388,6 +1426,8 @@ static int test_start_end(int cgrp)
/* Test start/end pull of single byte in last page */
txmsg_start = 1111;
txmsg_end = 1112;
+   txmsg_start_push = 1111;
+   txmsg_end_p

Re: [bpf-next v3 0/2] Fix kcm + sockmap by checking psock type

2018-10-19 Thread John Fastabend
On 10/19/2018 03:57 PM, Daniel Borkmann wrote:
> On 10/20/2018 12:51 AM, Daniel Borkmann wrote:
>> On 10/18/2018 10:58 PM, John Fastabend wrote:
>>> We check if the sk_user_data (the psock in skmsg) is in fact a sockmap
>>> type too late, after we read the refcnt, which is an error. This
>>> series moves the check up before reading refcnt and also adds a test
>>> to test_maps to test trying to add a KCM socket into a sockmap.
>>>
>>> While reviewing this code I also found an issue with KCM and kTLS
>>> where each uses sk_data_ready hooks and associated stream parser
>>> breaking expectations in kcm, ktls or both. But that fix will need
>>> to go to net.
>>>
>>> Thanks to Eric for reporting.
>>>
>>> v2: Fix up file +/- my scripts lost track of them
>>> v3: return EBUSY if refcnt is zero
>>>
>>> John Fastabend (2):
>>>   bpf: skmsg, fix psock create on existing kcm/tls port
>>>   bpf: test_maps add a test to catch kcm + sockmap
>>>
>>>  include/linux/skmsg.h | 25 +---
>>>  net/core/sock_map.c   | 11 +++---
>>>  tools/testing/selftests/bpf/Makefile  |  2 +-
>>>  tools/testing/selftests/bpf/sockmap_kcm.c | 14 +++
>>>  tools/testing/selftests/bpf/test_maps.c   | 64 
>>> ++-
>>>  5 files changed, 103 insertions(+), 13 deletions(-)
>>>  create mode 100644 tools/testing/selftests/bpf/sockmap_kcm.c
>>
>> Applied, thanks!
> 
> Fyi, I've only applied patch 1/2 for now to get the bug fixed. The patch
> 2/2 throws a bunch of warnings that look like the below. Also, I think we
> leak kcm socket in error paths and once we're done with testing, so would
> be good to close it once unneeded. Please respin the test as a stand-alone
> commit, thanks:
> 

Thanks, I didn't see the warnings below locally but will look
into spinning a good version tonight with the closing sock fix
as well.

John

> [...]
> bpf-next/tools/testing/selftests/bpf/libbpf.a -lcap -lelf -lrt -lpthread -o 
> /home/darkstar/trees/bpf-next-ok/tools/testing/selftests/bpf/test_maps
> test_maps.c: In function ‘test_sockmap’:
> test_maps.c:869:0: warning: "AF_KCM" redefined
>  #define AF_KCM 41
> 
> In file included from /usr/include/sys/socket.h:38:0,
>  from test_maps.c:21:
> /usr/include/bits/socket.h:133:0: note: this is the location of the previous 
> definition
>  #define AF_KCM  PF_KCM
> 



[bpf-next PATCH 0/3] sockmap, bpf_msg_push_data helper

2018-10-18 Thread John Fastabend
This series adds a new helper bpf_msg_push_data to be used by
sk_msg programs. The helper can be used to insert extra bytes into
the message that can then be used by the program as metadata tags
among other things.

The first patch adds the helper, second patch the libbpf support,
and last patch updates test_sockmap to run msg_push_data tests.

---

John Fastabend (3):
  bpf: sk_msg program helper bpf_msg_push_data
  bpf: libbpf support for msg_push_data
  bpf: test_sockmap add options to use msg_push_data


 include/linux/skmsg.h   |5 +
 include/uapi/linux/bpf.h|   20 +++
 net/core/filter.c   |  134 +++
 tools/include/uapi/linux/bpf.h  |   20 +++
 tools/testing/selftests/bpf/bpf_helpers.h   |2 
 tools/testing/selftests/bpf/test_sockmap.c  |   58 +-
 tools/testing/selftests/bpf/test_sockmap_kern.h |   97 +
 7 files changed, 308 insertions(+), 28 deletions(-)

--
Signature


[bpf-next PATCH 1/3] bpf: sk_msg program helper bpf_msg_push_data

2018-10-18 Thread John Fastabend
This allows a user to push data into a msg using sk_msg program types.
The format is as follows,

bpf_msg_push_data(msg, offset, len, flags)

this will insert 'len' bytes at offset 'offset'. For example to
prepend 10 bytes at the front of the message the user can,

bpf_msg_push_data(msg, 0, 10, 0);

This will invalidate the data bounds, so the BPF user will have to
recheck them after calling this. After this the msg size will have been
updated and the user is free to write into the added bytes. We allow
any offset/len as long as it is within the (data, data_end) range.
However, a copy will be required if the ring is full, and it's possible
for the helper to fail with ENOMEM or EINVAL errors, which need to be
handled by the BPF program.

This can be used similar to XDP metadata to pass data between sk_msg
layer and lower layers.

Signed-off-by: John Fastabend 
---
 include/linux/skmsg.h|5 ++
 include/uapi/linux/bpf.h |   20 +++
 net/core/filter.c|  134 ++
 3 files changed, 158 insertions(+), 1 deletion(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 22347b0..677b673 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -207,6 +207,11 @@ static inline struct scatterlist *sk_msg_elem(struct 
sk_msg *msg, int which)
return &msg->sg.data[which];
 }
 
+static inline struct scatterlist sk_msg_elem_cpy(struct sk_msg *msg, int which)
+{
+   return msg->sg.data[which];
+}
+
 static inline struct page *sk_msg_page(struct sk_msg *msg, int which)
 {
return sg_page(sk_msg_elem(msg, which));
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 5e46f67..1e9fbc5 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2215,6 +2215,23 @@ struct bpf_stack_build_id {
  * pointer that was returned from bpf_sk_lookup_xxx\ ().
  * Return
  * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_msg_push_data(struct sk_msg_buff *msg, u32 start, u32 len, u64 flags)
+ * Description
+ * For socket policies, insert *len* bytes into msg at offset
+ * *start*.
+ *
+ * If a program of type **BPF_PROG_TYPE_SK_MSG** is run on a
+ * *msg* it may want to insert metadata or options into the msg.
+ * This can later be read and used by any of the lower layer BPF
+ * hooks.
+ *
+ * This helper may fail if under memory pressure (a malloc
+ * fails); in these cases the BPF program will get an appropriate
+ * error and will need to handle it.
+ *
+ * Return
+ * 0 on success, or a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -2303,7 +2320,8 @@ struct bpf_stack_build_id {
FN(skb_ancestor_cgroup_id), \
FN(sk_lookup_tcp),  \
FN(sk_lookup_udp),  \
-   FN(sk_release),
+   FN(sk_release), \
+   FN(msg_push_data),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/net/core/filter.c b/net/core/filter.c
index 1a3ac6c..4bcf238 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2297,6 +2297,137 @@ int skb_do_redirect(struct sk_buff *skb)
.arg4_type  = ARG_ANYTHING,
 };
 
+BPF_CALL_4(bpf_msg_push_data, struct sk_msg *, msg, u32, start,
+  u32, len, u64, flags)
+{
+   struct scatterlist sge, nsge, nnsge, rsge = {0}, *psge;
+   int new, i = 0, l, space, copy = 0, offset = 0;
+   u8 *raw, *to, *from;
+   struct page *page;
+
+   if (unlikely(flags))
+   return -EINVAL;
+
+   /* First find the starting scatterlist element */
+   i = msg->sg.start;
+   do {
+   l = sk_msg_elem(msg, i)->length;
+
+   if (start < offset + l)
+   break;
+   offset += l;
+   sk_msg_iter_var_next(i);
+   } while (i != msg->sg.end);
+
+   if (start >= offset + l)
+   return -EINVAL;
+
+   space = MAX_MSG_FRAGS - sk_msg_elem_used(msg);
+
+   /* If no space available will fallback to copy, we need at
+* least one scatterlist elem available to push data into
+* when start aligns to the beginning of an element or two
+* when it falls inside an element. We handle the start equals
+* offset case because its the common case for inserting a
+* header.
+*/
+   if (!space || (space == 1 && start != offset))
+   copy = msg->sg.data[i].length;
+
+   page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC | __GFP_COMP,
+  get_order(copy + len));
+   if (unlikely(!page))
+   return -ENOMEM;
+
+   if (copy) {
+   int fr

[bpf-next PATCH 2/3] bpf: libbpf support for msg_push_data

2018-10-18 Thread John Fastabend
Add support for new bpf_msg_push_data in libbpf.

Signed-off-by: John Fastabend 
---
 tools/include/uapi/linux/bpf.h|   20 +++-
 tools/testing/selftests/bpf/bpf_helpers.h |2 ++
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 5e46f67..1e9fbc5 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2215,6 +2215,23 @@ struct bpf_stack_build_id {
  * pointer that was returned from bpf_sk_lookup_xxx\ ().
  * Return
  * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_msg_push_data(struct sk_msg_buff *msg, u32 start, u32 len, u64 flags)
+ * Description
+ * For socket policies, insert *len* bytes into msg at offset
+ * *start*.
+ *
+ * If a program of type **BPF_PROG_TYPE_SK_MSG** is run on a
+ * *msg* it may want to insert metadata or options into the msg.
+ * This can later be read and used by any of the lower layer BPF
+ * hooks.
+ *
+ * This helper may fail if under memory pressure (a malloc
+ * fails); in these cases the BPF program will get an appropriate
+ * error and will need to handle it.
+ *
+ * Return
+ * 0 on success, or a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -2303,7 +2320,8 @@ struct bpf_stack_build_id {
FN(skb_ancestor_cgroup_id), \
FN(sk_lookup_tcp),  \
FN(sk_lookup_udp),  \
-   FN(sk_release),
+   FN(sk_release), \
+   FN(msg_push_data),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h 
b/tools/testing/selftests/bpf/bpf_helpers.h
index fda8c16..4e33511 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -104,6 +104,8 @@ static int (*bpf_msg_cork_bytes)(void *ctx, int len) =
(void *) BPF_FUNC_msg_cork_bytes;
 static int (*bpf_msg_pull_data)(void *ctx, int start, int end, int flags) =
(void *) BPF_FUNC_msg_pull_data;
+static int (*bpf_msg_push_data)(void *ctx, int start, int end, int flags) =
+   (void *) BPF_FUNC_msg_push_data;
 static int (*bpf_bind)(void *ctx, void *addr, int addr_len) =
(void *) BPF_FUNC_bind;
 static int (*bpf_xdp_adjust_tail)(void *ctx, int offset) =



[bpf-next PATCH 3/3] bpf: test_sockmap add options to use msg_push_data

2018-10-18 Thread John Fastabend
Add options to run msg_push_data, this patch creates two more flags
in test_sockmap that can be used to specify the offset and length
of bytes to be added. The new options are --txmsg_start_push to
specify where bytes should be inserted and --txmsg_end_push to
specify how many bytes. This is analogous to the options that are
used to pull data, --txmsg_start and --txmsg_end.

In addition to adding the options, tests are added to the test
suite to run the tests, similar to what was done for msg_pull_data.

Signed-off-by: John Fastabend 
---
 tools/testing/selftests/bpf/test_sockmap.c  |   58 +-
 tools/testing/selftests/bpf/test_sockmap_kern.h |   97 ++-
 2 files changed, 129 insertions(+), 26 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_sockmap.c 
b/tools/testing/selftests/bpf/test_sockmap.c
index cbd1c0b..622ade0 100644
--- a/tools/testing/selftests/bpf/test_sockmap.c
+++ b/tools/testing/selftests/bpf/test_sockmap.c
@@ -77,6 +77,8 @@
 int txmsg_cork;
 int txmsg_start;
 int txmsg_end;
+int txmsg_start_push;
+int txmsg_end_push;
 int txmsg_ingress;
 int txmsg_skb;
 int ktls;
@@ -100,6 +102,8 @@
{"txmsg_cork",  required_argument,  NULL, 'k'},
{"txmsg_start", required_argument,  NULL, 's'},
{"txmsg_end",   required_argument,  NULL, 'e'},
+   {"txmsg_start_push", required_argument, NULL, 'p'},
+   {"txmsg_end_push",   required_argument, NULL, 'q'},
{"txmsg_ingress", no_argument,  _ingress, 1 },
{"txmsg_skb", no_argument,  _skb, 1 },
{"ktls", no_argument,   , 1 },
@@ -903,6 +907,30 @@ static int run_options(struct sockmap_options *options, 
int cg_fd,  int test)
}
}
 
+   if (txmsg_start_push) {
+   i = 2;
+   err = bpf_map_update_elem(map_fd[5],
+                                 &i, &txmsg_start_push, BPF_ANY);
+               if (err) {
+                       fprintf(stderr,
+                               "ERROR: bpf_map_update_elem (txmsg_start_push):  %d (%s)\n",
+                               err, strerror(errno));
+                       goto out;
+               }
+   }
+
+   if (txmsg_end_push) {
+   i = 3;
+   err = bpf_map_update_elem(map_fd[5],
+                                 &i, &txmsg_end_push, BPF_ANY);
+               if (err) {
+                       fprintf(stderr,
+                               "ERROR: bpf_map_update_elem %i@%i (txmsg_end_push):  %d (%s)\n",
+                               txmsg_end_push, i, err, strerror(errno));
+   goto out;
+   }
+   }
+
if (txmsg_ingress) {
int in = BPF_F_INGRESS;
 
@@ -1235,6 +1263,8 @@ static int test_mixed(int cgrp)
txmsg_pass = txmsg_noisy = txmsg_redir_noisy = txmsg_drop = 0;
txmsg_apply = txmsg_cork = 0;
txmsg_start = txmsg_end = 0;
+   txmsg_start_push = txmsg_end_push = 0;
+
/* Test small and large iov_count values with pass/redir/apply/cork */
txmsg_pass = 1;
txmsg_redir = 0;
@@ -1351,6 +1381,8 @@ static int test_start_end(int cgrp)
/* Test basic start/end with lots of iov_count and iov_lengths */
txmsg_start = 1;
txmsg_end = 2;
+   txmsg_start_push = 1;
+   txmsg_end_push = 2;
err = test_txmsg(cgrp);
if (err)
goto out;
@@ -1364,6 +1396,8 @@ static int test_start_end(int cgrp)
for (i = 99; i <= 1600; i += 500) {
txmsg_start = 0;
txmsg_end = i;
+   txmsg_start_push = 0;
+   txmsg_end_push = i;
err = test_exec(cgrp, &opt);
if (err)
goto out;
@@ -1373,6 +1407,8 @@ static int test_start_end(int cgrp)
for (i = 199; i <= 1600; i += 500) {
txmsg_start = 100;
txmsg_end = i;
+   txmsg_start_push = 100;
+   txmsg_end_push = i;
err = test_exec(cgrp, &opt);
if (err)
goto out;
@@ -1381,6 +1417,8 @@ static int test_start_end(int cgrp)
/* Test start/end with cork pulling last sg entry */
txmsg_start = 1500;
txmsg_end = 1600;
+   txmsg_start_push = 1500;
+   txmsg_end_push = 1600;
err = test_exec(cgrp, &opt);
if (err)
goto out;
@@ -1388,6 +1426,8 @@ static int test_start_end(int cgrp)
/* Test start/end pull of single byte in last page */
txmsg_start = 1111;
txmsg_end = 1112;
+   txmsg_start_push = 1111;
+   txmsg_end_p

Re: [PATCH bpf-next 2/2] samples: bpf: get ifindex from ifname

2018-10-18 Thread John Fastabend
On 10/18/2018 01:47 PM, Matteo Croce wrote:
> Find the ifindex via ioctl(SIOCGIFINDEX) instead of requiring the
> numeric ifindex.
> 
> Signed-off-by: Matteo Croce 
> ---

I don't think there is any expectation that samples have to be
stable as far as inputs over versions. And because I consistently
run this with the ifname before realizing it's the ifindex, not the
string name, I'll Ack it.

Acked-by: John Fastabend 


[bpf-next v3 0/2] Fix kcm + sockmap by checking psock type

2018-10-18 Thread John Fastabend
We check if the sk_user_data (the psock in skmsg) is in fact a sockmap
type too late, after we read the refcnt, which is an error. This
series moves the check up before reading refcnt and also adds a test
to test_maps to test trying to add a KCM socket into a sockmap.

While reviewing this code I also found an issue with KCM and kTLS
where each uses sk_data_ready hooks and associated stream parser
breaking expectations in kcm, ktls or both. But that fix will need
to go to net.

Thanks to Eric for reporting.

v2: Fix up file +/- my scripts lost track of them
v3: return EBUSY if refcnt is zero

John Fastabend (2):
  bpf: skmsg, fix psock create on existing kcm/tls port
  bpf: test_maps add a test to catch kcm + sockmap

 include/linux/skmsg.h | 25 +---
 net/core/sock_map.c   | 11 +++---
 tools/testing/selftests/bpf/Makefile  |  2 +-
 tools/testing/selftests/bpf/sockmap_kcm.c | 14 +++
 tools/testing/selftests/bpf/test_maps.c   | 64 ++-
 5 files changed, 103 insertions(+), 13 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/sockmap_kcm.c

-- 
1.9.1



[bpf-next v3 1/2] bpf: skmsg, fix psock create on existing kcm/tls port

2018-10-18 Thread John Fastabend
Before using the psock returned by sk_psock_get() when adding it to a
sockmap we need to ensure it is actually a sockmap based psock.
Previously we were only checking this after incrementing the reference
counter which was an error. This resulted in a slab-out-of-bounds
error when the psock was not actually a sockmap type.

This moves the check up so the reference counter is only used
if it is a sockmap psock.

Eric reported the following KASAN BUG,

BUG: KASAN: slab-out-of-bounds in atomic_read 
include/asm-generic/atomic-instrumented.h:21 [inline]
BUG: KASAN: slab-out-of-bounds in refcount_inc_not_zero_checked+0x97/0x2f0 
lib/refcount.c:120
Read of size 4 at addr ffff88019548be58 by task syz-executor4/22387

CPU: 1 PID: 22387 Comm: syz-executor4 Not tainted 4.19.0-rc7+ #264
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1c4/0x2b4 lib/dump_stack.c:113
 print_address_description.cold.8+0x9/0x1ff mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.9+0x242/0x309 mm/kasan/report.c:412
 check_memory_region_inline mm/kasan/kasan.c:260 [inline]
 check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
 kasan_check_read+0x11/0x20 mm/kasan/kasan.c:272
 atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
 refcount_inc_not_zero_checked+0x97/0x2f0 lib/refcount.c:120
 sk_psock_get include/linux/skmsg.h:379 [inline]
 sock_map_link.isra.6+0x41f/0xe30 net/core/sock_map.c:178
 sock_hash_update_common+0x19b/0x11e0 net/core/sock_map.c:669
 sock_hash_update_elem+0x306/0x470 net/core/sock_map.c:738
 map_update_elem+0x819/0xdf0 kernel/bpf/syscall.c:818

Signed-off-by: John Fastabend 
Reported-by: Eric Dumazet 
Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
---
 include/linux/skmsg.h | 25 -
 net/core/sock_map.c   | 11 ++-
 2 files changed, 26 insertions(+), 10 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 22347b0..84e1886 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -270,11 +270,6 @@ static inline struct sk_psock *sk_psock(const struct sock 
*sk)
return rcu_dereference_sk_user_data(sk);
 }
 
-static inline bool sk_has_psock(struct sock *sk)
-{
-   return sk_psock(sk) != NULL && sk->sk_prot->recvmsg == tcp_bpf_recvmsg;
-}
-
 static inline void sk_psock_queue_msg(struct sk_psock *psock,
  struct sk_msg *msg)
 {
@@ -374,6 +369,26 @@ static inline bool sk_psock_test_state(const struct 
sk_psock *psock,
return test_bit(bit, &psock->state);
 }
 
+static inline struct sk_psock *sk_psock_get_checked(struct sock *sk)
+{
+   struct sk_psock *psock;
+
+   rcu_read_lock();
+   psock = sk_psock(sk);
+   if (psock) {
+   if (sk->sk_prot->recvmsg != tcp_bpf_recvmsg) {
+   psock = ERR_PTR(-EBUSY);
+   goto out;
+   }
+
+   if (!refcount_inc_not_zero(&psock->refcnt))
+   psock = ERR_PTR(-EBUSY);
+   }
+out:
+   rcu_read_unlock();
+   return psock;
+}
+
 static inline struct sk_psock *sk_psock_get(struct sock *sk)
 {
struct sk_psock *psock;
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 3c0e44c..be6092a 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -175,12 +175,13 @@ static int sock_map_link(struct bpf_map *map, struct 
sk_psock_progs *progs,
}
}
 
-   psock = sk_psock_get(sk);
+   psock = sk_psock_get_checked(sk);
+   if (IS_ERR(psock)) {
+   ret = PTR_ERR(psock);
+   goto out_progs;
+   }
+
if (psock) {
-   if (!sk_has_psock(sk)) {
-   ret = -EBUSY;
-   goto out_progs;
-   }
if ((msg_parser && READ_ONCE(psock->progs.msg_parser)) ||
(skb_progs  && READ_ONCE(psock->progs.skb_parser))) {
sk_psock_put(sk, psock);
-- 
1.9.1



[bpf-next v3 2/2] bpf: test_maps add a test to catch kcm + sockmap

2018-10-18 Thread John Fastabend
Adding a socket to both sockmap and kcm is not supported due to
collision on sk_user_data usage.

If selftests is run without KCM support we will issue a warning
and continue with the tests.

Signed-off-by: John Fastabend 
---
 tools/testing/selftests/bpf/Makefile  |  2 +-
 tools/testing/selftests/bpf/sockmap_kcm.c | 14 +++
 tools/testing/selftests/bpf/test_maps.c   | 64 ++-
 3 files changed, 77 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/sockmap_kcm.c

diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index d99dd6f..f290554 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -28,7 +28,7 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps 
test_lru_map test_lpm_map test
 
 TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o 
test_obj_id.o \
test_pkt_md_access.o test_xdp_redirect.o test_xdp_meta.o 
sockmap_parse_prog.o \
-   sockmap_verdict_prog.o dev_cgroup.o sample_ret0.o test_tracepoint.o \
+   sockmap_verdict_prog.o sockmap_kcm.o dev_cgroup.o sample_ret0.o 
test_tracepoint.o \
test_l4lb_noinline.o test_xdp_noinline.o test_stacktrace_map.o \
sample_map_ret0.o test_tcpbpf_kern.o test_stacktrace_build_id.o \
sockmap_tcp_msg_prog.o connect4_prog.o connect6_prog.o 
test_adjust_tail.o \
diff --git a/tools/testing/selftests/bpf/sockmap_kcm.c 
b/tools/testing/selftests/bpf/sockmap_kcm.c
new file mode 100644
index 000..4377adc
--- /dev/null
+++ b/tools/testing/selftests/bpf/sockmap_kcm.c
@@ -0,0 +1,14 @@
+#include 
+#include "bpf_helpers.h"
+#include "bpf_util.h"
+#include "bpf_endian.h"
+
+int _version SEC("version") = 1;
+
+SEC("socket_kcm")
+int bpf_prog1(struct __sk_buff *skb)
+{
+   return skb->len;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_maps.c 
b/tools/testing/selftests/bpf/test_maps.c
index 9b552c0..be20f1d 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include <linux/kcm.h>
 
 #include 
 #include 
@@ -479,14 +480,16 @@ static void test_devmap(int task, void *data)
 #define SOCKMAP_PARSE_PROG "./sockmap_parse_prog.o"
 #define SOCKMAP_VERDICT_PROG "./sockmap_verdict_prog.o"
 #define SOCKMAP_TCP_MSG_PROG "./sockmap_tcp_msg_prog.o"
+#define KCM_PROG "./sockmap_kcm.o"
 static void test_sockmap(int tasks, void *data)
 {
struct bpf_map *bpf_map_rx, *bpf_map_tx, *bpf_map_msg, *bpf_map_break;
-   int map_fd_msg = 0, map_fd_rx = 0, map_fd_tx = 0, map_fd_break;
+   int map_fd_msg = 0, map_fd_rx = 0, map_fd_tx = 0, map_fd_break, kcm;
int ports[] = {50200, 50201, 50202, 50204};
int err, i, fd, udp, sfd[6] = {0xdeadbeef};
u8 buf[20] = {0x0, 0x5, 0x3, 0x2, 0x1, 0x0};
-   int parse_prog, verdict_prog, msg_prog;
+   int parse_prog, verdict_prog, msg_prog, kcm_prog;
+   struct kcm_attach attach_info;
struct sockaddr_in addr;
int one = 1, s, sc, rc;
struct bpf_object *obj;
@@ -744,6 +747,62 @@ static void test_sockmap(int tasks, void *data)
goto out_sockmap;
}
 
+   /* Test adding a KCM socket into map */
+#define AF_KCM 41
+   kcm = socket(AF_KCM, SOCK_DGRAM, KCMPROTO_CONNECTED);
+   if (kcm == -1) {
+   printf("Warning, KCM+Sockmap could not be tested.\n");
+   goto skip_kcm;
+   }
+
+   err = bpf_prog_load(KCM_PROG,
+   BPF_PROG_TYPE_SOCKET_FILTER,
+   &obj, &kcm_prog);
+   if (err) {
+   printf("Failed to load SK_SKB parse prog\n");
+   goto out_sockmap;
+   }
+
+   i = 2;
+   memset(&attach_info, 0, sizeof(attach_info));
+   attach_info.fd = sfd[i];
+   attach_info.bpf_fd = kcm_prog;
+   err = ioctl(kcm, SIOCKCMATTACH, &attach_info);
+   if (!err) {
+   perror("Failed KCM attached to sockmap fd: ");
+   goto out_sockmap;
+   }
+
+   err = bpf_map_delete_elem(fd, &i);
+   if (err) {
+   printf("Failed delete sockmap from empty map %i %i\n", err, 
errno);
+   goto out_sockmap;
+   }
+
+   err = ioctl(kcm, SIOCKCMATTACH, &attach_info);
+   if (err) {
+   perror("Failed KCM attach");
+   goto out_sockmap;
+   }
+
+   err = bpf_map_update_elem(fd, &i, &sfd[i], BPF_ANY);
+   if (!err) {
+   printf("Failed sockmap attached KCM sock!\n");
+   goto out_sockmap;
+   }
+   err = ioctl(kcm, SIOCKCMUNATTACH, &attach_info);
+   if (err) {
+   printf("Failed detach KCM sock!\n");
+   goto out_sockmap;
+

Re: [bpf-next v2 1/2] bpf: skmsg, fix psock create on existing kcm/tls port

2018-10-18 Thread John Fastabend
On 10/18/2018 10:34 AM, Eric Dumazet wrote:
> 
> 
> On 10/17/2018 10:20 PM, John Fastabend wrote:
>> Before using the psock returned by sk_psock_get() when adding it to a
>> sockmap we need to ensure it is actually a sockmap based psock.
>> Previously we were only checking this after incrementing the reference
>> counter which was an error. This resulted in a slab-out-of-bounds
>> error when the psock was not actually a sockmap type.
>>
>> This moves the check up so the reference counter is only used
>> if it is a sockmap psock.
>>
>> Eric reported the following KASAN BUG,
>>
>> BUG: KASAN: slab-out-of-bounds in atomic_read 
>> include/asm-generic/atomic-instrumented.h:21 [inline]
>> BUG: KASAN: slab-out-of-bounds in refcount_inc_not_zero_checked+0x97/0x2f0 
>> lib/refcount.c:120
>> Read of size 4 at addr 88019548be58 by task syz-executor4/22387
>>
>> CPU: 1 PID: 22387 Comm: syz-executor4 Not tainted 4.19.0-rc7+ #264
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
>> Google 01/01/2011
>> Call Trace:
>>  __dump_stack lib/dump_stack.c:77 [inline]
>>  dump_stack+0x1c4/0x2b4 lib/dump_stack.c:113
>>  print_address_description.cold.8+0x9/0x1ff mm/kasan/report.c:256
>>  kasan_report_error mm/kasan/report.c:354 [inline]
>>  kasan_report.cold.9+0x242/0x309 mm/kasan/report.c:412
>>  check_memory_region_inline mm/kasan/kasan.c:260 [inline]
>>  check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
>>  kasan_check_read+0x11/0x20 mm/kasan/kasan.c:272
>>  atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
>>  refcount_inc_not_zero_checked+0x97/0x2f0 lib/refcount.c:120
>>  sk_psock_get include/linux/skmsg.h:379 [inline]
>>  sock_map_link.isra.6+0x41f/0xe30 net/core/sock_map.c:178
>>  sock_hash_update_common+0x19b/0x11e0 net/core/sock_map.c:669
>>  sock_hash_update_elem+0x306/0x470 net/core/sock_map.c:738
>>  map_update_elem+0x819/0xdf0 kernel/bpf/syscall.c:818
>>
>> Signed-off-by: John Fastabend 
>> Reported-by: Eric Dumazet 
>> Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
>> ---


[...]

>> +static inline struct sk_psock *sk_psock_get_checked(struct sock *sk)
>> +{
>> +struct sk_psock *psock;
>> +
>> +rcu_read_lock();
>> +psock = sk_psock(sk);
>> +if (psock) {
>> +if (sk->sk_prot->recvmsg != tcp_bpf_recvmsg) {
>> +psock = ERR_PTR(-EBUSY);
>> +goto out;
>> +}
>> +
>> +if (!refcount_inc_not_zero(&psock->refcnt))
>> +psock = NULL;
> 
> Caller is using IS_ERR(), so you probably want to :
> 
>   psock = ERR_PTR(-E);
> 
> 

Yeah, we can make this EBUSY as well. Originally I was thinking that
we could create the psock and attach it in this case, but it would be
racy and most likely require an RCU sync.

To hit this case users would need to have multiple
maps and be adding/deleting socks from those maps at
the same time. Seems pretty rare and not worth punishing
the normal case with synchronization.
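
For reference, a minimal sketch of the helper with that change folded
in -- a sketch only, assuming -EBUSY is also what we settle on for the
lost-refcount race:

static inline struct sk_psock *sk_psock_get_checked(struct sock *sk)
{
	struct sk_psock *psock;

	rcu_read_lock();
	psock = sk_psock(sk);
	if (psock) {
		/* sk_user_data is owned by another ULP (e.g. KCM/TLS) */
		if (sk->sk_prot->recvmsg != tcp_bpf_recvmsg) {
			psock = ERR_PTR(-EBUSY);
			goto out;
		}
		/* psock is mid-teardown; return an error rather than
		 * NULL so the caller's IS_ERR() check fires.
		 */
		if (!refcount_inc_not_zero(&psock->refcnt))
			psock = ERR_PTR(-EBUSY);
	}
out:
	rcu_read_unlock();
	return psock;
}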

Nice catch.

.John


[bpf-next v2 1/2] bpf: skmsg, fix psock create on existing kcm/tls port

2018-10-17 Thread John Fastabend
Before using the psock returned by sk_psock_get() when adding it to a
sockmap we need to ensure it is actually a sockmap based psock.
Previously we were only checking this after incrementing the reference
counter which was an error. This resulted in a slab-out-of-bounds
error when the psock was not actually a sockmap type.

This moves the check up so the reference counter is only used
if it is a sockmap psock.

Eric reported the following KASAN BUG,

BUG: KASAN: slab-out-of-bounds in atomic_read 
include/asm-generic/atomic-instrumented.h:21 [inline]
BUG: KASAN: slab-out-of-bounds in refcount_inc_not_zero_checked+0x97/0x2f0 
lib/refcount.c:120
Read of size 4 at addr 88019548be58 by task syz-executor4/22387

CPU: 1 PID: 22387 Comm: syz-executor4 Not tainted 4.19.0-rc7+ #264
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1c4/0x2b4 lib/dump_stack.c:113
 print_address_description.cold.8+0x9/0x1ff mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.9+0x242/0x309 mm/kasan/report.c:412
 check_memory_region_inline mm/kasan/kasan.c:260 [inline]
 check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
 kasan_check_read+0x11/0x20 mm/kasan/kasan.c:272
 atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
 refcount_inc_not_zero_checked+0x97/0x2f0 lib/refcount.c:120
 sk_psock_get include/linux/skmsg.h:379 [inline]
 sock_map_link.isra.6+0x41f/0xe30 net/core/sock_map.c:178
 sock_hash_update_common+0x19b/0x11e0 net/core/sock_map.c:669
 sock_hash_update_elem+0x306/0x470 net/core/sock_map.c:738
 map_update_elem+0x819/0xdf0 kernel/bpf/syscall.c:818

Signed-off-by: John Fastabend 
Reported-by: Eric Dumazet 
Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
---
 include/linux/skmsg.h | 25 -
 net/core/sock_map.c   | 11 ++-
 2 files changed, 26 insertions(+), 10 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 677b673..f44ca6b 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -275,11 +275,6 @@ static inline struct sk_psock *sk_psock(const struct sock 
*sk)
return rcu_dereference_sk_user_data(sk);
 }
 
-static inline bool sk_has_psock(struct sock *sk)
-{
-   return sk_psock(sk) != NULL && sk->sk_prot->recvmsg == tcp_bpf_recvmsg;
-}
-
 static inline void sk_psock_queue_msg(struct sk_psock *psock,
  struct sk_msg *msg)
 {
@@ -379,6 +374,26 @@ static inline bool sk_psock_test_state(const struct 
sk_psock *psock,
return test_bit(bit, &psock->state);
 }
 
+static inline struct sk_psock *sk_psock_get_checked(struct sock *sk)
+{
+   struct sk_psock *psock;
+
+   rcu_read_lock();
+   psock = sk_psock(sk);
+   if (psock) {
+   if (sk->sk_prot->recvmsg != tcp_bpf_recvmsg) {
+   psock = ERR_PTR(-EBUSY);
+   goto out;
+   }
+
+   if (!refcount_inc_not_zero(&psock->refcnt))
+   psock = NULL;
+   }
+out:
+   rcu_read_unlock();
+   return psock;
+}
+
 static inline struct sk_psock *sk_psock_get(struct sock *sk)
 {
struct sk_psock *psock;
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 3c0e44c..be6092a 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -175,12 +175,13 @@ static int sock_map_link(struct bpf_map *map, struct 
sk_psock_progs *progs,
}
}
 
-   psock = sk_psock_get(sk);
+   psock = sk_psock_get_checked(sk);
+   if (IS_ERR(psock)) {
+   ret = PTR_ERR(psock);
+   goto out_progs;
+   }
+
if (psock) {
-   if (!sk_has_psock(sk)) {
-   ret = -EBUSY;
-   goto out_progs;
-   }
if ((msg_parser && READ_ONCE(psock->progs.msg_parser)) ||
(skb_progs  && READ_ONCE(psock->progs.skb_parser))) {
sk_psock_put(sk, psock);
-- 
1.9.1



[bpf-next v2 0/2] Fix kcm + sockmap by checking psock type

2018-10-17 Thread John Fastabend
We check whether the sk_user_data (the psock in skmsg) is in fact a
sockmap type too late, after we have already read the refcnt, which is
an error. This series moves the check up before reading refcnt and also
adds a test to test_maps that tries adding a KCM socket into a sockmap.

While reviewing this code I also found an issue with KCM and kTLS,
where each uses sk_data_ready hooks and an associated stream parser,
breaking expectations in kcm, ktls, or both. But that fix will need
to go to net.

Thanks to Eric for reporting.

v2: Fix up the file +/- stats; my scripts lost track of them

John Fastabend (2):
  bpf: skmsg, fix psock create on existing kcm/tls port
  bpf: test_maps add a test to catch kcm + sockmap

 include/linux/skmsg.h | 25 +---
 net/core/sock_map.c   | 11 +++---
 tools/testing/selftests/bpf/Makefile  |  2 +-
 tools/testing/selftests/bpf/sockmap_kcm.c | 14 +++
 tools/testing/selftests/bpf/test_maps.c   | 64 ++-
 5 files changed, 103 insertions(+), 13 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/sockmap_kcm.c

-- 
1.9.1



[bpf-next v2 2/2] bpf: test_maps add a test to catch kcm + sockmap

2018-10-17 Thread John Fastabend
Adding a socket to both sockmap and kcm is not supported due to
collision on sk_user_data usage.

If selftests is run without KCM support we will issue a warning
and continue with the tests.

Signed-off-by: John Fastabend 
---
 tools/testing/selftests/bpf/Makefile  |  2 +-
 tools/testing/selftests/bpf/sockmap_kcm.c | 14 +++
 tools/testing/selftests/bpf/test_maps.c   | 64 ++-
 3 files changed, 77 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/sockmap_kcm.c

diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index d99dd6f..f290554 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -28,7 +28,7 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps 
test_lru_map test_lpm_map test
 
 TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o 
test_obj_id.o \
test_pkt_md_access.o test_xdp_redirect.o test_xdp_meta.o 
sockmap_parse_prog.o \
-   sockmap_verdict_prog.o dev_cgroup.o sample_ret0.o test_tracepoint.o \
+   sockmap_verdict_prog.o sockmap_kcm.o dev_cgroup.o sample_ret0.o 
test_tracepoint.o \
test_l4lb_noinline.o test_xdp_noinline.o test_stacktrace_map.o \
sample_map_ret0.o test_tcpbpf_kern.o test_stacktrace_build_id.o \
sockmap_tcp_msg_prog.o connect4_prog.o connect6_prog.o 
test_adjust_tail.o \
diff --git a/tools/testing/selftests/bpf/sockmap_kcm.c 
b/tools/testing/selftests/bpf/sockmap_kcm.c
new file mode 100644
index 000..4377adc
--- /dev/null
+++ b/tools/testing/selftests/bpf/sockmap_kcm.c
@@ -0,0 +1,14 @@
+#include <linux/bpf.h>
+#include "bpf_helpers.h"
+#include "bpf_util.h"
+#include "bpf_endian.h"
+
+int _version SEC("version") = 1;
+
+SEC("socket_kcm")
+int bpf_prog1(struct __sk_buff *skb)
+{
+   return skb->len;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_maps.c 
b/tools/testing/selftests/bpf/test_maps.c
index 9b552c0..be20f1d 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include <linux/kcm.h>
 
 #include 
 #include 
@@ -479,14 +480,16 @@ static void test_devmap(int task, void *data)
 #define SOCKMAP_PARSE_PROG "./sockmap_parse_prog.o"
 #define SOCKMAP_VERDICT_PROG "./sockmap_verdict_prog.o"
 #define SOCKMAP_TCP_MSG_PROG "./sockmap_tcp_msg_prog.o"
+#define KCM_PROG "./sockmap_kcm.o"
 static void test_sockmap(int tasks, void *data)
 {
struct bpf_map *bpf_map_rx, *bpf_map_tx, *bpf_map_msg, *bpf_map_break;
-   int map_fd_msg = 0, map_fd_rx = 0, map_fd_tx = 0, map_fd_break;
+   int map_fd_msg = 0, map_fd_rx = 0, map_fd_tx = 0, map_fd_break, kcm;
int ports[] = {50200, 50201, 50202, 50204};
int err, i, fd, udp, sfd[6] = {0xdeadbeef};
u8 buf[20] = {0x0, 0x5, 0x3, 0x2, 0x1, 0x0};
-   int parse_prog, verdict_prog, msg_prog;
+   int parse_prog, verdict_prog, msg_prog, kcm_prog;
+   struct kcm_attach attach_info;
struct sockaddr_in addr;
int one = 1, s, sc, rc;
struct bpf_object *obj;
@@ -744,6 +747,62 @@ static void test_sockmap(int tasks, void *data)
goto out_sockmap;
}
 
+   /* Test adding a KCM socket into map */
+#define AF_KCM 41
+   kcm = socket(AF_KCM, SOCK_DGRAM, KCMPROTO_CONNECTED);
+   if (kcm == -1) {
+   printf("Warning, KCM+Sockmap could not be tested.\n");
+   goto skip_kcm;
+   }
+
+   err = bpf_prog_load(KCM_PROG,
+   BPF_PROG_TYPE_SOCKET_FILTER,
+   &obj, &kcm_prog);
+   if (err) {
+   printf("Failed to load SK_SKB parse prog\n");
+   goto out_sockmap;
+   }
+
+   i = 2;
+   memset(&attach_info, 0, sizeof(attach_info));
+   attach_info.fd = sfd[i];
+   attach_info.bpf_fd = kcm_prog;
+   err = ioctl(kcm, SIOCKCMATTACH, &attach_info);
+   if (!err) {
+   perror("Failed KCM attached to sockmap fd: ");
+   goto out_sockmap;
+   }
+
+   err = bpf_map_delete_elem(fd, &i);
+   if (err) {
+   printf("Failed delete sockmap from empty map %i %i\n", err, 
errno);
+   goto out_sockmap;
+   }
+
+   err = ioctl(kcm, SIOCKCMATTACH, &attach_info);
+   if (err) {
+   perror("Failed KCM attach");
+   goto out_sockmap;
+   }
+
+   err = bpf_map_update_elem(fd, &i, &sfd[i], BPF_ANY);
+   if (!err) {
+   printf("Failed sockmap attached KCM sock!\n");
+   goto out_sockmap;
+   }
+   err = ioctl(kcm, SIOCKCMUNATTACH, &attach_info);
+   if (err) {
+   printf("Failed detach KCM sock!\n");
+   goto out_sockmap;
+

[bpf-next PATCH 0/2] Fix kcm + sockmap by checking psock type

2018-10-17 Thread John Fastabend
We check whether the sk_user_data (the psock in skmsg) is in fact a
sockmap type too late, after we have already read the refcnt, which is
an error. This series moves the check up before reading refcnt and also
adds a test to test_maps that tries adding a KCM socket into a sockmap.

While reviewing this code I also found an issue with KCM and kTLS,
where each uses sk_data_ready hooks and an associated stream parser,
breaking expectations in kcm, ktls, or both. But that fix will need
to go to net.

Thanks to Eric for reporting.

---

John Fastabend (2):
  bpf: skmsg, fix psock create on existing kcm/tls port
  bpf: test_maps add a test to catch kcm + sockmap


 tools/testing/selftests/bpf/Makefile  |2 -
 tools/testing/selftests/bpf/sockmap_kcm.c |   14 ++
 tools/testing/selftests/bpf/test_maps.c   |   64 -
 3 files changed, 77 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/sockmap_kcm.c

--
Signature


[bpf-next PATCH 2/2] bpf: test_maps add a test to catch kcm + sockmap

2018-10-17 Thread John Fastabend
Adding a socket to both sockmap and kcm is not supported due to
collision on sk_user_data usage.

If selftests is run without KCM support we will issue a warning
and continue with the tests.

Signed-off-by: John Fastabend 
---
 tools/testing/selftests/bpf/Makefile  |2 -
 tools/testing/selftests/bpf/sockmap_kcm.c |   14 ++
 tools/testing/selftests/bpf/test_maps.c   |   64 -
 3 files changed, 77 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/sockmap_kcm.c

diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index d99dd6f..f290554 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -28,7 +28,7 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps 
test_lru_map test_lpm_map test
 
 TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o 
test_obj_id.o \
test_pkt_md_access.o test_xdp_redirect.o test_xdp_meta.o 
sockmap_parse_prog.o \
-   sockmap_verdict_prog.o dev_cgroup.o sample_ret0.o test_tracepoint.o \
+   sockmap_verdict_prog.o sockmap_kcm.o dev_cgroup.o sample_ret0.o 
test_tracepoint.o \
test_l4lb_noinline.o test_xdp_noinline.o test_stacktrace_map.o \
sample_map_ret0.o test_tcpbpf_kern.o test_stacktrace_build_id.o \
sockmap_tcp_msg_prog.o connect4_prog.o connect6_prog.o 
test_adjust_tail.o \
diff --git a/tools/testing/selftests/bpf/sockmap_kcm.c 
b/tools/testing/selftests/bpf/sockmap_kcm.c
new file mode 100644
index 000..4377adc
--- /dev/null
+++ b/tools/testing/selftests/bpf/sockmap_kcm.c
@@ -0,0 +1,14 @@
+#include <linux/bpf.h>
+#include "bpf_helpers.h"
+#include "bpf_util.h"
+#include "bpf_endian.h"
+
+int _version SEC("version") = 1;
+
+SEC("socket_kcm")
+int bpf_prog1(struct __sk_buff *skb)
+{
+   return skb->len;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_maps.c 
b/tools/testing/selftests/bpf/test_maps.c
index 9b552c0..be20f1d 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include <linux/kcm.h>
 
 #include 
 #include 
@@ -479,14 +480,16 @@ static void test_devmap(int task, void *data)
 #define SOCKMAP_PARSE_PROG "./sockmap_parse_prog.o"
 #define SOCKMAP_VERDICT_PROG "./sockmap_verdict_prog.o"
 #define SOCKMAP_TCP_MSG_PROG "./sockmap_tcp_msg_prog.o"
+#define KCM_PROG "./sockmap_kcm.o"
 static void test_sockmap(int tasks, void *data)
 {
struct bpf_map *bpf_map_rx, *bpf_map_tx, *bpf_map_msg, *bpf_map_break;
-   int map_fd_msg = 0, map_fd_rx = 0, map_fd_tx = 0, map_fd_break;
+   int map_fd_msg = 0, map_fd_rx = 0, map_fd_tx = 0, map_fd_break, kcm;
int ports[] = {50200, 50201, 50202, 50204};
int err, i, fd, udp, sfd[6] = {0xdeadbeef};
u8 buf[20] = {0x0, 0x5, 0x3, 0x2, 0x1, 0x0};
-   int parse_prog, verdict_prog, msg_prog;
+   int parse_prog, verdict_prog, msg_prog, kcm_prog;
+   struct kcm_attach attach_info;
struct sockaddr_in addr;
int one = 1, s, sc, rc;
struct bpf_object *obj;
@@ -744,6 +747,62 @@ static void test_sockmap(int tasks, void *data)
goto out_sockmap;
}
 
+   /* Test adding a KCM socket into map */
+#define AF_KCM 41
+   kcm = socket(AF_KCM, SOCK_DGRAM, KCMPROTO_CONNECTED);
+   if (kcm == -1) {
+   printf("Warning, KCM+Sockmap could not be tested.\n");
+   goto skip_kcm;
+   }
+
+   err = bpf_prog_load(KCM_PROG,
+   BPF_PROG_TYPE_SOCKET_FILTER,
+   &obj, &kcm_prog);
+   if (err) {
+   printf("Failed to load SK_SKB parse prog\n");
+   goto out_sockmap;
+   }
+
+   i = 2;
+   memset(&attach_info, 0, sizeof(attach_info));
+   attach_info.fd = sfd[i];
+   attach_info.bpf_fd = kcm_prog;
+   err = ioctl(kcm, SIOCKCMATTACH, &attach_info);
+   if (!err) {
+   perror("Failed KCM attached to sockmap fd: ");
+   goto out_sockmap;
+   }
+
+   err = bpf_map_delete_elem(fd, &i);
+   if (err) {
+   printf("Failed delete sockmap from empty map %i %i\n", err, 
errno);
+   goto out_sockmap;
+   }
+
+   err = ioctl(kcm, SIOCKCMATTACH, &attach_info);
+   if (err) {
+   perror("Failed KCM attach");
+   goto out_sockmap;
+   }
+
+   err = bpf_map_update_elem(fd, &i, &sfd[i], BPF_ANY);
+   if (!err) {
+   printf("Failed sockmap attached KCM sock!\n");
+   goto out_sockmap;
+   }
+   err = ioctl(kcm, SIOCKCMUNATTACH, &attach_info);
+   if (err) {
+   printf("Failed detach KCM sock!\n");
+   goto out_sockmap;
+

[bpf-next PATCH 1/2] bpf: skmsg, fix psock create on existing kcm/tls port

2018-10-17 Thread John Fastabend
Before using the psock returned by sk_psock_get() when adding it to a
sockmap we need to ensure it is actually a sockmap based psock.
Previously we were only checking this after incrementing the reference
counter which was an error. This resulted in a slab-out-of-bounds
error when the psock was not actually a sockmap type.

This moves the check up so the reference counter is only used
if it is a sockmap psock.

Eric reported the following KASAN BUG,

BUG: KASAN: slab-out-of-bounds in atomic_read 
include/asm-generic/atomic-instrumented.h:21 [inline]
BUG: KASAN: slab-out-of-bounds in refcount_inc_not_zero_checked+0x97/0x2f0 
lib/refcount.c:120
Read of size 4 at addr 88019548be58 by task syz-executor4/22387

CPU: 1 PID: 22387 Comm: syz-executor4 Not tainted 4.19.0-rc7+ #264
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1c4/0x2b4 lib/dump_stack.c:113
 print_address_description.cold.8+0x9/0x1ff mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.9+0x242/0x309 mm/kasan/report.c:412
 check_memory_region_inline mm/kasan/kasan.c:260 [inline]
 check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
 kasan_check_read+0x11/0x20 mm/kasan/kasan.c:272
 atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
 refcount_inc_not_zero_checked+0x97/0x2f0 lib/refcount.c:120
 sk_psock_get include/linux/skmsg.h:379 [inline]
 sock_map_link.isra.6+0x41f/0xe30 net/core/sock_map.c:178
 sock_hash_update_common+0x19b/0x11e0 net/core/sock_map.c:669
 sock_hash_update_elem+0x306/0x470 net/core/sock_map.c:738
 map_update_elem+0x819/0xdf0 kernel/bpf/syscall.c:818

Signed-off-by: John Fastabend 
Reported-by: Eric Dumazet 
Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
---
 0 files changed

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 677b673..f44ca6b 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -275,11 +275,6 @@ static inline struct sk_psock *sk_psock(const struct sock 
*sk)
return rcu_dereference_sk_user_data(sk);
 }
 
-static inline bool sk_has_psock(struct sock *sk)
-{
-   return sk_psock(sk) != NULL && sk->sk_prot->recvmsg == tcp_bpf_recvmsg;
-}
-
 static inline void sk_psock_queue_msg(struct sk_psock *psock,
  struct sk_msg *msg)
 {
@@ -379,6 +374,26 @@ static inline bool sk_psock_test_state(const struct 
sk_psock *psock,
return test_bit(bit, &psock->state);
 }
 
+static inline struct sk_psock *sk_psock_get_checked(struct sock *sk)
+{
+   struct sk_psock *psock;
+
+   rcu_read_lock();
+   psock = sk_psock(sk);
+   if (psock) {
+   if (sk->sk_prot->recvmsg != tcp_bpf_recvmsg) {
+   psock = ERR_PTR(-EBUSY);
+   goto out;
+   }
+
+   if (!refcount_inc_not_zero(&psock->refcnt))
+   psock = NULL;
+   }
+out:
+   rcu_read_unlock();
+   return psock;
+}
+
 static inline struct sk_psock *sk_psock_get(struct sock *sk)
 {
struct sk_psock *psock;
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 3c0e44c..be6092a 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -175,12 +175,13 @@ static int sock_map_link(struct bpf_map *map, struct 
sk_psock_progs *progs,
}
}
 
-   psock = sk_psock_get(sk);
+   psock = sk_psock_get_checked(sk);
+   if (IS_ERR(psock)) {
+   ret = PTR_ERR(psock);
+   goto out_progs;
+   }
+
if (psock) {
-   if (!sk_has_psock(sk)) {
-   ret = -EBUSY;
-   goto out_progs;
-   }
if ((msg_parser && READ_ONCE(psock->progs.msg_parser)) ||
(skb_progs  && READ_ONCE(psock->progs.skb_parser))) {
sk_psock_put(sk, psock);



Re: [PATCH linux-firmware] linux-firmware: liquidio: fix GPL compliance issue

2018-10-17 Thread John W. Linville
On Wed, Oct 17, 2018 at 07:34:42PM +, Manlunas, Felix wrote:
> On Fri, Sep 28, 2018 at 04:50:51PM -0700, Felix Manlunas wrote:
> > Part of the code inside the lio_vsw_23xx.bin firmware image is under GPL,
> > but the LICENCE.cavium file neglects to indicate that.  However,
> > LICENCE.cavium does correctly specify the license that covers the other
> > Cavium firmware images that do not contain any GPL code.
> > 
> > Fix the GPL compliance issue by adding a new file, LICENCE.cavium_liquidio,
> > which correctly shows the GPL boilerplate.  This new file specifies the
> > licenses for all liquidio firmware, including the ones that do not have
> > GPL code.
> > 
> > Change the liquidio section of WHENCE to point to LICENCE.cavium_liquidio.
> > 
> > Reported-by: Florian Weimer 
> > Signed-off-by: Manish Awasthi 
> > Signed-off-by: Manoj Panicker 
> > Signed-off-by: Faisal Masood 
> > Signed-off-by: Felix Manlunas 
> > ---
> >  LICENCE.cavium_liquidio | 429 
> > 
> >  WHENCE  |   2 +-
> >  2 files changed, 430 insertions(+), 1 deletion(-)
> >  create mode 100644 LICENCE.cavium_liquidio
> 
> Hello Maintainers of linux-firmware.git,
> 
> Any feedback about this patch?

I would prefer to see an offer that included a defined URL for anyone
to download the source for the kernel in question without having to
announce themselves. The "send an email to i...@cavium.com" offer may
(or may not) be sufficient for the letter of the law. But it seems
both fragile and prone to subjective frustrations and delays for
users to obtain the sources at some future date.

Respectfully,

John
-- 
John W. LinvilleSomeday the world will need a hero, and you
linvi...@tuxdriver.com  might be all we have.  Be ready.


[bpf-next PATCH 2/3] bpf: sockmap, support for msg_peek in sk_msg with redirect ingress

2018-10-16 Thread John Fastabend
This adds support for the MSG_PEEK flag when doing redirect to ingress
and receiving on the sk_msg psock queue. Previously the flag was
being ignored, which could confuse applications that expected it to
work as normal.
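
As a rough user-space illustration (the socket fd, buffer sizes, and
error handling are hypothetical/elided), the pattern this enables is:

	#include <string.h>
	#include <sys/socket.h>
	#include <sys/types.h>

	/* returns 0 when the peeked and consumed data match */
	int peek_then_read(int fd)
	{
		char peek_buf[64], buf[64];
		ssize_t n, m;

		/* look at queued ingress sk_msg data without consuming it */
		n = recv(fd, peek_buf, sizeof(peek_buf), MSG_PEEK);

		/* a following regular recv() returns the same bytes again */
		m = recv(fd, buf, sizeof(buf), 0);

		return (n == m && n > 0) ? memcmp(peek_buf, buf, n) : -1;
	}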

Signed-off-by: John Fastabend 
---
 include/net/tcp.h  |2 +-
 net/ipv4/tcp_bpf.c |   42 +++---
 net/tls/tls_sw.c   |3 ++-
 3 files changed, 30 insertions(+), 17 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 3600ae0..14fdd7c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2089,7 +2089,7 @@ int tcp_bpf_sendmsg_redir(struct sock *sk, struct sk_msg 
*msg, u32 bytes,
 int tcp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
int nonblock, int flags, int *addr_len);
 int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
- struct msghdr *msg, int len);
+ struct msghdr *msg, int len, int flags);
 
 /* Call BPF_SOCK_OPS program that returns an int. If the return value
  * is < 0, then the BPF op failed (for example if the loaded BPF
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index f9d3cf1..b7918d4 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -39,17 +39,19 @@ static int tcp_bpf_wait_data(struct sock *sk, struct 
sk_psock *psock,
 }
 
 int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
- struct msghdr *msg, int len)
+ struct msghdr *msg, int len, int flags)
 {
struct iov_iter *iter = &msg->msg_iter;
+   int peek = flags & MSG_PEEK;
int i, ret, copied = 0;
+   struct sk_msg *msg_rx;
+
+   msg_rx = list_first_entry_or_null(&psock->ingress_msg,
+ struct sk_msg, list);
 
while (copied != len) {
struct scatterlist *sge;
-   struct sk_msg *msg_rx;
 
-   msg_rx = list_first_entry_or_null(&psock->ingress_msg,
- struct sk_msg, list);
if (unlikely(!msg_rx))
break;
 
@@ -70,22 +72,30 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock 
*psock,
}
 
copied += copy;
-   sge->offset += copy;
-   sge->length -= copy;
-   sk_mem_uncharge(sk, copy);
-   msg_rx->sg.size -= copy;
-   if (!sge->length) {
-   i++;
-   if (i == MAX_SKB_FRAGS)
-   i = 0;
-   if (!msg_rx->skb)
-   put_page(page);
+   if (likely(!peek)) {
+   sge->offset += copy;
+   sge->length -= copy;
+   sk_mem_uncharge(sk, copy);
+   msg_rx->sg.size -= copy;
+
+   if (!sge->length) {
+   sk_msg_iter_var_next(i);
+   if (!msg_rx->skb)
+   put_page(page);
+   }
+   } else {
+   sk_msg_iter_var_next(i);
}
 
if (copied == len)
break;
} while (i != msg_rx->sg.end);
 
+   if (unlikely(peek)) {
+   msg_rx = list_next_entry(msg_rx, list);
+   continue;
+   }
+
msg_rx->sg.start = i;
if (!sge->length && msg_rx->sg.start == msg_rx->sg.end) {
list_del(&msg_rx->list);
@@ -93,6 +103,8 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock 
*psock,
consume_skb(msg_rx->skb);
kfree(msg_rx);
}
+   msg_rx = list_first_entry_or_null(&psock->ingress_msg,
+ struct sk_msg, list);
}
 
return copied;
@@ -115,7 +127,7 @@ int tcp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, 
size_t len,
return tcp_recvmsg(sk, msg, len, nonblock, flags, addr_len);
lock_sock(sk);
 msg_bytes_ready:
-   copied = __tcp_bpf_recvmsg(sk, psock, msg, len);
+   copied = __tcp_bpf_recvmsg(sk, psock, msg, len, flags);
if (!copied) {
int data, err = 0;
long timeo;
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index a525fc4..5cd88ba 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1478,7 +1478,8 @@ int tls_sw_recvmsg(struct sock *sk,
skb = tls_wait_data(sk, psock, flags, timeo, );
if (!skb) 

[bpf-next PATCH 1/3] bpf: skmsg, improve sk_msg_used_element to work in cork context

2018-10-16 Thread John Fastabend
Currently sk_msg_used_element is only called in the zerocopy context,
where cork is not possible, and if corking happens we fall back to
copy mode. However the helper is more useful if it works in all
contexts.

This patch resolves the case where end == head: that state can indicate
either a full or an empty ring, yet the helper always reported an empty
ring. To fix this, add a test for the full-ring case so a full ring is
no longer reported as having 0 elements. This additional functionality
will be used in the next patches from the recvmsg context, where
end == head with a full ring is a valid case.
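
In other words, end == start is ambiguous on its own for this ring;
sg.size is what tells a full ring apart from an empty one. An annotated
restatement of the resulting logic (a sketch mirroring the diff below;
the helper name here is illustrative):

	static u32 elem_used(const struct sk_msg *msg)
	{
		if (sk_msg_full(msg))		/* end == start, bytes queued */
			return MAX_MSG_FRAGS;	/* all slots used, not zero */
		if (msg->sg.end >= msg->sg.start)	/* no wraparound */
			return msg->sg.end - msg->sg.start;
		/* end has wrapped past the end of the element array */
		return msg->sg.end + (MAX_MSG_FRAGS - msg->sg.start);
	}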

Signed-off-by: John Fastabend 
---
 include/linux/skmsg.h |   13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 31df0d9..22347b0 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -187,18 +187,21 @@ static inline void sk_msg_xfer_full(struct sk_msg *dst, 
struct sk_msg *src)
sk_msg_init(src);
 }
 
+static inline bool sk_msg_full(const struct sk_msg *msg)
+{
+   return (msg->sg.end == msg->sg.start) && msg->sg.size;
+}
+
 static inline u32 sk_msg_elem_used(const struct sk_msg *msg)
 {
+   if (sk_msg_full(msg))
+   return MAX_MSG_FRAGS;
+
return msg->sg.end >= msg->sg.start ?
msg->sg.end - msg->sg.start :
msg->sg.end + (MAX_MSG_FRAGS - msg->sg.start);
 }
 
-static inline bool sk_msg_full(const struct sk_msg *msg)
-{
-   return (msg->sg.end == msg->sg.start) && msg->sg.size;
-}
-
 static inline struct scatterlist *sk_msg_elem(struct sk_msg *msg, int which)
 {
return &msg->sg.data[which];



[bpf-next PATCH 3/3] bpf: sockmap, add msg_peek tests to test_sockmap

2018-10-16 Thread John Fastabend
Add tests that do a MSG_PEEK recv followed by a regular receive to
test flag support.

Signed-off-by: John Fastabend 
---
 tools/testing/selftests/bpf/test_sockmap.c |  167 +++-
 1 file changed, 115 insertions(+), 52 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_sockmap.c 
b/tools/testing/selftests/bpf/test_sockmap.c
index 7cb69ce..cbd1c0b 100644
--- a/tools/testing/selftests/bpf/test_sockmap.c
+++ b/tools/testing/selftests/bpf/test_sockmap.c
@@ -80,6 +80,7 @@
 int txmsg_ingress;
 int txmsg_skb;
 int ktls;
+int peek_flag;
 
 static const struct option long_options[] = {
{"help",no_argument,NULL, 'h' },
@@ -102,6 +103,7 @@
{"txmsg_ingress", no_argument,  _ingress, 1 },
{"txmsg_skb", no_argument,  _skb, 1 },
{"ktls", no_argument,   , 1 },
+   {"peek", no_argument,   _flag, 1 },
{0, 0, NULL, 0 }
 };
 
@@ -352,33 +354,40 @@ static int msg_loop_sendpage(int fd, int iov_length, int 
cnt,
return 0;
 }
 
-static int msg_loop(int fd, int iov_count, int iov_length, int cnt,
-   struct msg_stats *s, bool tx,
-   struct sockmap_options *opt)
+static void msg_free_iov(struct msghdr *msg)
 {
-   struct msghdr msg = {0};
-   int err, i, flags = MSG_NOSIGNAL;
+   int i;
+
+   for (i = 0; i < msg->msg_iovlen; i++)
+   free(msg->msg_iov[i].iov_base);
+   free(msg->msg_iov);
+   msg->msg_iov = NULL;
+   msg->msg_iovlen = 0;
+}
+
+static int msg_alloc_iov(struct msghdr *msg,
+int iov_count, int iov_length,
+bool data, bool xmit)
+{
+   unsigned char k = 0;
struct iovec *iov;
-   unsigned char k;
-   bool data_test = opt->data_test;
-   bool drop = opt->drop_expected;
+   int i;
 
iov = calloc(iov_count, sizeof(struct iovec));
if (!iov)
return errno;
 
-   k = 0;
for (i = 0; i < iov_count; i++) {
unsigned char *d = calloc(iov_length, sizeof(char));
 
if (!d) {
fprintf(stderr, "iov_count %i/%i OOM\n", i, iov_count);
-   goto out_errno;
+   goto unwind_iov;
}
iov[i].iov_base = d;
iov[i].iov_len = iov_length;
 
-   if (data_test && tx) {
+   if (data && xmit) {
int j;
 
for (j = 0; j < iov_length; j++)
@@ -386,9 +395,60 @@ static int msg_loop(int fd, int iov_count, int iov_length, 
int cnt,
}
}
 
-   msg.msg_iov = iov;
-   msg.msg_iovlen = iov_count;
-   k = 0;
+   msg->msg_iov = iov;
+   msg->msg_iovlen = iov_count;
+
+   return 0;
+unwind_iov:
+   for (i--; i >= 0 ; i--)
+   free(msg->msg_iov[i].iov_base);
+   return -ENOMEM;
+}
+
+static int msg_verify_data(struct msghdr *msg, int size, int chunk_sz)
+{
+   int i, j, bytes_cnt = 0;
+   unsigned char k = 0;
+
+   for (i = 0; i < msg->msg_iovlen; i++) {
+   unsigned char *d = msg->msg_iov[i].iov_base;
+
+   for (j = 0;
+j < msg->msg_iov[i].iov_len && size; j++) {
+   if (d[j] != k++) {
+   fprintf(stderr,
+   "detected data corruption @iov[%i]:%i 
%02x != %02x, %02x ?= %02x\n",
+   i, j, d[j], k - 1, d[j+1], k);
+   return -EIO;
+   }
+   bytes_cnt++;
+   if (bytes_cnt == chunk_sz) {
+   k = 0;
+   bytes_cnt = 0;
+   }
+   size--;
+   }
+   }
+   return 0;
+}
+
+static int msg_loop(int fd, int iov_count, int iov_length, int cnt,
+   struct msg_stats *s, bool tx,
+   struct sockmap_options *opt)
+{
+   struct msghdr msg = {0}, msg_peek = {0};
+   int err, i, flags = MSG_NOSIGNAL;
+   bool drop = opt->drop_expected;
+   bool data = opt->data_test;
+
+   err = msg_alloc_iov(&msg, iov_count, iov_length, data, tx);
+   if (err)
+   goto out_errno;
+   if (peek_flag) {
+   err = msg_alloc_iov(&msg_peek, iov_count, iov_length, data, tx);
+   if (err)
+   goto out_errno;
+   }
 
if (tx) {
clock_gettime(CLOCK_MONOTONIC, &s->start);
@@ -408,19 +468,12 @@ static int msg_loop(int fd, int iov_count, int 
iov_length, int cnt,
}
clock_gettime(CLOCK_MONOTONIC, >en

[bpf-next PATCH 0/3] sockmap support for msg_peek flag

2018-10-16 Thread John Fastabend
This adds support for the MSG_PEEK flag when redirecting into an
ingress psock sk_msg queue.

The first patch adds some base support to the helpers, then the
feature, and finally we add an option for the test suite to do
a duplicate MSG_PEEK call on every recv to test the feature.

With a duplicate MSG_PEEK call on every recv, all tests continue to PASS.
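
The peek pass is enabled with the new test_sockmap flag added in the
third patch, e.g. (combined with whatever mode flags a given run
already uses):

	./test_sockmap --peek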

---

John Fastabend (3):
  bpf: skmsg, improve sk_msg_used_element to work in cork context
  bpf: sockmap, support for msg_peek in sk_msg with redirect ingress
  bpf: sockmap, add msg_peek tests to test_sockmap


 include/linux/skmsg.h  |   13 +-
 include/net/tcp.h  |2 
 net/ipv4/tcp_bpf.c |   42 +--
 net/tls/tls_sw.c   |3 -
 tools/testing/selftests/bpf/test_sockmap.c |  167 +++-
 5 files changed, 153 insertions(+), 74 deletions(-)

--
Signature


[bpf-next PATCH] bpf: sockmap, fix skmsg recvmsg handler to track size correctly

2018-10-16 Thread John Fastabend
When converting sockmap to new skmsg generic data structures we missed
that the recvmsg handler did not correctly use sg.size and instead was
using individual element lengths. The result is that if a sock is closed
with outstanding data we omit the call to sk_mem_uncharge() and can
get the warning below.

[   66.728282] WARNING: CPU: 6 PID: 5783 at net/core/stream.c:206 
sk_stream_kill_queues+0x1fa/0x210

To fix this, correct the redirect handler to xfer the size along with
the scatterlist and also decrement the size from the recvmsg handler.
Now when a sock is closed the remaining 'size' will be decremented
with sk_mem_uncharge().
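
The invariant being restored is that msg->sg.size always equals the sum
of the queued element lengths, so both counters have to move together
on every transfer and every copy. Schematically (a sketch of the two
call sites, not the literal diff below):

	/* redirect: move 'size' bytes of element 'which' from src to dst */
	dst->sg.data[which].length  = size;
	dst->sg.size               += size;	/* was missing */
	src->sg.data[which].length -= size;
	src->sg.data[which].offset += size;

	/* recvmsg: consume 'copy' bytes from one element */
	sge->length     -= copy;
	msg_rx->sg.size -= copy;		/* was missing */
	sk_mem_uncharge(sk, copy);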

Signed-off-by: John Fastabend 
---
 include/linux/skmsg.h |1 +
 net/ipv4/tcp_bpf.c|1 +
 2 files changed, 2 insertions(+)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 0b919f0..31df0d9 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -176,6 +176,7 @@ static inline void sk_msg_xfer(struct sk_msg *dst, struct 
sk_msg *src,
 {
dst->sg.data[which] = src->sg.data[which];
dst->sg.data[which].length  = size;
+   dst->sg.size   += size;
src->sg.data[which].length -= size;
src->sg.data[which].offset += size;
 }
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index 80debb0..f9d3cf1 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -73,6 +73,7 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
sge->offset += copy;
sge->length -= copy;
sk_mem_uncharge(sk, copy);
+   msg_rx->sg.size -= copy;
if (!sge->length) {
i++;
if (i == MAX_SKB_FRAGS)



[bpf-next PATCH v3 2/2] bpf: bpftool, add flag to allow non-compat map definitions

2018-10-15 Thread John Fastabend
Multiple map definition structures exist and user may have non-zero
fields in their definition that are not recognized by bpftool and
libbpf. The normal behavior is to then fail loading the map. Although
this is a good default behavior users may still want to load the map
for debugging or other reasons. This patch adds a --mapcompat flag
that can be used to override the default behavior and allow loading
the map even when it has additional non-zero fields.

For now the only user is 'bpftool prog'; we can switch over other
subcommands as needed. The library exposes an API that consumes
a flags field now but I kept the original API around also in case
users of the API don't want to expose this. The flags field is an
int in case we need more control over how the API call handles
errors/features/etc in the future.
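
For example (the object and pin paths here are hypothetical), a program
whose map definitions carry extra non-zero fields can still be loaded
with:

	# bpftool --mapcompat prog load sockmap_prog.o /sys/fs/bpf/sockmap_prog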

Signed-off-by: John Fastabend 
---
 tools/bpf/bpftool/Documentation/bpftool.rst |4 
 tools/bpf/bpftool/bash-completion/bpftool   |2 +-
 tools/bpf/bpftool/main.c|7 ++-
 tools/bpf/bpftool/main.h|3 ++-
 tools/bpf/bpftool/prog.c|2 +-
 5 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/tools/bpf/bpftool/Documentation/bpftool.rst 
b/tools/bpf/bpftool/Documentation/bpftool.rst
index 25c0872..6548831 100644
--- a/tools/bpf/bpftool/Documentation/bpftool.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool.rst
@@ -57,6 +57,10 @@ OPTIONS
-p, --pretty
  Generate human-readable JSON output. Implies **-j**.
 
+   -m, --mapcompat
+ Allow loading maps with unknown map definitions.
+
+
 SEE ALSO
 
**bpftool-map**\ (8), **bpftool-prog**\ (8), **bpftool-cgroup**\ (8)
diff --git a/tools/bpf/bpftool/bash-completion/bpftool 
b/tools/bpf/bpftool/bash-completion/bpftool
index 0826519..ac85207 100644
--- a/tools/bpf/bpftool/bash-completion/bpftool
+++ b/tools/bpf/bpftool/bash-completion/bpftool
@@ -184,7 +184,7 @@ _bpftool()
 
 # Deal with options
 if [[ ${words[cword]} == -* ]]; then
-local c='--version --json --pretty --bpffs'
+local c='--version --json --pretty --bpffs --mapcompat'
 COMPREPLY=( $( compgen -W "$c" -- "$cur" ) )
 return 0
 fi
diff --git a/tools/bpf/bpftool/main.c b/tools/bpf/bpftool/main.c
index 79dc3f1..828dde3 100644
--- a/tools/bpf/bpftool/main.c
+++ b/tools/bpf/bpftool/main.c
@@ -55,6 +55,7 @@
 bool pretty_output;
 bool json_output;
 bool show_pinned;
+int bpf_flags;
 struct pinned_obj_table prog_table;
 struct pinned_obj_table map_table;
 
@@ -341,6 +342,7 @@ int main(int argc, char **argv)
{ "pretty", no_argument,NULL,   'p' },
{ "version",no_argument,NULL,   'V' },
{ "bpffs",  no_argument,NULL,   'f' },
+   { "mapcompat",  no_argument,NULL,   'm' },
{ 0 }
};
int opt, ret;
@@ -355,7 +357,7 @@ int main(int argc, char **argv)
hash_init(map_table.table);
 
opterr = 0;
-   while ((opt = getopt_long(argc, argv, "Vhpjf",
+   while ((opt = getopt_long(argc, argv, "Vhpjfm",
  options, NULL)) >= 0) {
switch (opt) {
case 'V':
@@ -379,6 +381,9 @@ int main(int argc, char **argv)
case 'f':
show_pinned = true;
break;
+   case 'm':
+   bpf_flags = MAPS_RELAX_COMPAT;
+   break;
default:
p_err("unrecognized option '%s'", argv[optind - 1]);
if (json_output)
diff --git a/tools/bpf/bpftool/main.h b/tools/bpf/bpftool/main.h
index 40492cd..91fd697 100644
--- a/tools/bpf/bpftool/main.h
+++ b/tools/bpf/bpftool/main.h
@@ -74,7 +74,7 @@
 #define HELP_SPEC_PROGRAM  \
"PROG := { id PROG_ID | pinned FILE | tag PROG_TAG }"
 #define HELP_SPEC_OPTIONS  \
-   "OPTIONS := { {-j|--json} [{-p|--pretty}] | {-f|--bpffs} }"
+   "OPTIONS := { {-j|--json} [{-p|--pretty}] | {-f|--bpffs} | 
{-m|--mapcompat}"
 #define HELP_SPEC_MAP  \
"MAP := { id MAP_ID | pinned FILE }"
 
@@ -89,6 +89,7 @@ enum bpf_obj_type {
 extern json_writer_t *json_wtr;
 extern bool json_output;
 extern bool show_pinned;
+extern int bpf_flags;
 extern struct pinned_obj_table prog_table;
 extern struct pinned_obj_table map_table;
 
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index 99ab42c..3350289 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -908,7 +908,7 @@ static int do_load(int argc, char **argv)
}
}
 
-   obj = bpf_object

[bpf-next PATCH v3 1/2] bpf: bpftool, add support for attaching programs to maps

2018-10-15 Thread John Fastabend
Sock map/hash introduce support for attaching programs to maps. To
date I have been doing this with custom tooling but this is less than
ideal as we shift to using bpftool as the single CLI for our BPF uses.
This patch adds new sub commands 'attach' and 'detach' to the 'prog'
command to attach programs to maps and then detach them.
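
Example usage, following the syntax documented below (the pin paths are
hypothetical):

	# bpftool prog attach pinned /sys/fs/bpf/prog msg_verdict pinned /sys/fs/bpf/sock_map
	# bpftool prog detach pinned /sys/fs/bpf/prog msg_verdict pinned /sys/fs/bpf/sock_map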

Signed-off-by: John Fastabend 
Reviewed-by: Jakub Kicinski 
---
 tools/bpf/bpftool/Documentation/bpftool-prog.rst |   11 ++
 tools/bpf/bpftool/Documentation/bpftool.rst  |2 
 tools/bpf/bpftool/bash-completion/bpftool|   19 
 tools/bpf/bpftool/prog.c |   99 ++
 4 files changed, 128 insertions(+), 3 deletions(-)

diff --git a/tools/bpf/bpftool/Documentation/bpftool-prog.rst 
b/tools/bpf/bpftool/Documentation/bpftool-prog.rst
index 64156a1..12c8030 100644
--- a/tools/bpf/bpftool/Documentation/bpftool-prog.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool-prog.rst
@@ -25,6 +25,8 @@ MAP COMMANDS
 |  **bpftool** **prog dump jited**  *PROG* [{**file** *FILE* | 
**opcodes**}]
 |  **bpftool** **prog pin** *PROG* *FILE*
 |  **bpftool** **prog load** *OBJ* *FILE* [**type** *TYPE*] [**map** 
{**idx** *IDX* | **name** *NAME*} *MAP*] [**dev** *NAME*]
+|   **bpftool** **prog attach** *PROG* *ATTACH_TYPE* *MAP*
+|   **bpftool** **prog detach** *PROG* *ATTACH_TYPE* *MAP*
 |  **bpftool** **prog help**
 |
 |  *MAP* := { **id** *MAP_ID* | **pinned** *FILE* }
@@ -37,6 +39,7 @@ MAP COMMANDS
 |  **cgroup/bind4** | **cgroup/bind6** | **cgroup/post_bind4** | 
**cgroup/post_bind6** |
 |  **cgroup/connect4** | **cgroup/connect6** | **cgroup/sendmsg4** 
| **cgroup/sendmsg6**
 |  }
+|   *ATTACH_TYPE* := { **msg_verdict** | **skb_verdict** | **skb_parse** }
 
 
 DESCRIPTION
@@ -90,6 +93,14 @@ DESCRIPTION
 
  Note: *FILE* must be located in *bpffs* mount.
 
+**bpftool prog attach** *PROG* *ATTACH_TYPE* *MAP*
+  Attach bpf program *PROG* (with type specified by 
*ATTACH_TYPE*)
+  to the map *MAP*.
+
+**bpftool prog detach** *PROG* *ATTACH_TYPE* *MAP*
+  Detach bpf program *PROG* (with type specified by 
*ATTACH_TYPE*)
+  from the map *MAP*.
+
**bpftool prog help**
  Print short help message.
 
diff --git a/tools/bpf/bpftool/Documentation/bpftool.rst 
b/tools/bpf/bpftool/Documentation/bpftool.rst
index 8dda77d..25c0872 100644
--- a/tools/bpf/bpftool/Documentation/bpftool.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool.rst
@@ -26,7 +26,7 @@ SYNOPSIS
| **pin** | **event_pipe** | **help** }
 
*PROG-COMMANDS* := { **show** | **list** | **dump jited** | **dump 
xlated** | **pin**
-   | **load** | **help** }
+   | **load** | **attach** | **detach** | **help** }
 
*CGROUP-COMMANDS* := { **show** | **list** | **attach** | **detach** | 
**help** }
 
diff --git a/tools/bpf/bpftool/bash-completion/bpftool 
b/tools/bpf/bpftool/bash-completion/bpftool
index df1060b..0826519 100644
--- a/tools/bpf/bpftool/bash-completion/bpftool
+++ b/tools/bpf/bpftool/bash-completion/bpftool
@@ -292,6 +292,23 @@ _bpftool()
 fi
 return 0
 ;;
+attach|detach)
+if [[ ${#words[@]} == 7 ]]; then
+COMPREPLY=( $( compgen -W "id pinned" -- "$cur" ) )
+return 0
+fi
+
+if [[ ${#words[@]} == 6 ]]; then
+COMPREPLY=( $( compgen -W "msg_verdict skb_verdict 
skb_parse" -- "$cur" ) )
+return 0
+fi
+
+if [[ $prev == "$command" ]]; then
+COMPREPLY=( $( compgen -W "id pinned" -- "$cur" ) )
+return 0
+fi
+return 0
+;;
 load)
 local obj
 
@@ -347,7 +364,7 @@ _bpftool()
 ;;
 *)
 [[ $prev == $object ]] && \
-COMPREPLY=( $( compgen -W 'dump help pin load \
+COMPREPLY=( $( compgen -W 'dump help pin attach detach 
load \
 show list' -- "$cur" ) )
 ;;
 esac
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index b1cd3bc..99ab42c 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -77,6 +77,26 @@
[BPF_PROG_TYPE_FLOW_DISSECTOR]  = "flow_dissector",
 };
 
+static const char * const attach_type_strings[] = {
+   [BPF_SK_SKB_STREAM_PARSER] = "stream_parser",
+   [BPF_SK_SKB_STREAM_VERDICT] = "stream_verdict",
+   [BPF_SK_MSG_VERDICT] = "msg_verdi

[bpf-next PATCH v3 0/2] bpftool support for sockmap use cases

2018-10-15 Thread John Fastabend
The first patch adds support for attaching programs to maps. This is
needed to support sock{map|hash} use from bpftool. Currently, I carry
around custom code to do this so doing it using standard bpftool will
be great.

The second patch adds a compat mode to ignore non-zero entries in
the map def. This allows using bpftool with maps that have extra
fields that the user knows can be ignored. This is needed to work
correctly with maps being loaded by other tools or directly via
syscalls.

v3: add bash completion and doc updates for --mapcompat

---

John Fastabend (2):
  bpf: bpftool, add support for attaching programs to maps
  bpf: bpftool, add flag to allow non-compat map definitions


 tools/bpf/bpftool/Documentation/bpftool-prog.rst |   11 ++
 tools/bpf/bpftool/Documentation/bpftool.rst  |6 +
 tools/bpf/bpftool/bash-completion/bpftool|   21 -
 tools/bpf/bpftool/main.c |7 +-
 tools/bpf/bpftool/main.h |3 -
 tools/bpf/bpftool/prog.c |  101 ++
 6 files changed, 142 insertions(+), 7 deletions(-)

--
Signature


[bpf-next PATCH v2 2/2] bpf: bpftool, add flag to allow non-compat map definitions

2018-10-15 Thread John Fastabend
Multiple map definition structures exist and user may have non-zero
fields in their definition that are not recognized by bpftool and
libbpf. The normal behavior is to then fail loading the map. Although
this is a good default behavior users may still want to load the map
for debugging or other reasons. This patch adds a --mapcompat flag
that can be used to override the default behavior and allow loading
the map even when it has additional non-zero fields.

For now the only user is 'bpftool prog'; we can switch over other
subcommands as needed. The library exposes an API that consumes
a flags field now but I kept the original API around also in case
users of the API don't want to expose this. The flags field is an
int in case we need more control over how the API call handles
errors/features/etc in the future.

Signed-off-by: John Fastabend 
---
 tools/bpf/bpftool/main.c |7 ++-
 tools/bpf/bpftool/main.h |3 ++-
 tools/bpf/bpftool/prog.c |2 +-
 tools/lib/bpf/bpf.h  |3 +++
 tools/lib/bpf/libbpf.c   |   27 ++-
 tools/lib/bpf/libbpf.h   |2 ++
 6 files changed, 32 insertions(+), 12 deletions(-)

diff --git a/tools/bpf/bpftool/main.c b/tools/bpf/bpftool/main.c
index 79dc3f1..828dde3 100644
--- a/tools/bpf/bpftool/main.c
+++ b/tools/bpf/bpftool/main.c
@@ -55,6 +55,7 @@
 bool pretty_output;
 bool json_output;
 bool show_pinned;
+int bpf_flags;
 struct pinned_obj_table prog_table;
 struct pinned_obj_table map_table;
 
@@ -341,6 +342,7 @@ int main(int argc, char **argv)
{ "pretty", no_argument,NULL,   'p' },
{ "version",no_argument,NULL,   'V' },
{ "bpffs",  no_argument,NULL,   'f' },
+   { "mapcompat",  no_argument,NULL,   'm' },
{ 0 }
};
int opt, ret;
@@ -355,7 +357,7 @@ int main(int argc, char **argv)
hash_init(map_table.table);
 
opterr = 0;
-   while ((opt = getopt_long(argc, argv, "Vhpjf",
+   while ((opt = getopt_long(argc, argv, "Vhpjfm",
  options, NULL)) >= 0) {
switch (opt) {
case 'V':
@@ -379,6 +381,9 @@ int main(int argc, char **argv)
case 'f':
show_pinned = true;
break;
+   case 'm':
+   bpf_flags = MAPS_RELAX_COMPAT;
+   break;
default:
p_err("unrecognized option '%s'", argv[optind - 1]);
if (json_output)
diff --git a/tools/bpf/bpftool/main.h b/tools/bpf/bpftool/main.h
index 40492cd..91fd697 100644
--- a/tools/bpf/bpftool/main.h
+++ b/tools/bpf/bpftool/main.h
@@ -74,7 +74,7 @@
 #define HELP_SPEC_PROGRAM  \
"PROG := { id PROG_ID | pinned FILE | tag PROG_TAG }"
 #define HELP_SPEC_OPTIONS  \
-   "OPTIONS := { {-j|--json} [{-p|--pretty}] | {-f|--bpffs} }"
+   "OPTIONS := { {-j|--json} [{-p|--pretty}] | {-f|--bpffs} | 
{-m|--mapcompat}"
 #define HELP_SPEC_MAP  \
"MAP := { id MAP_ID | pinned FILE }"
 
@@ -89,6 +89,7 @@ enum bpf_obj_type {
 extern json_writer_t *json_wtr;
 extern bool json_output;
 extern bool show_pinned;
+extern int bpf_flags;
 extern struct pinned_obj_table prog_table;
 extern struct pinned_obj_table map_table;
 
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index 99ab42c..3350289 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -908,7 +908,7 @@ static int do_load(int argc, char **argv)
}
}
 
-   obj = bpf_object__open_xattr(&attr);
+   obj = __bpf_object__open_xattr(&attr, bpf_flags);
if (IS_ERR_OR_NULL(obj)) {
p_err("failed to open object file");
goto err_free_reuse_maps;
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 87520a8..69a4d40 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -69,6 +69,9 @@ struct bpf_load_program_attr {
__u32 prog_ifindex;
 };
 
+/* Flags to direct loading requirements */
+#define MAPS_RELAX_COMPAT  0x01
+
 /* Recommend log buffer size */
 #define BPF_LOG_BUF_SIZE (256 * 1024)
 int bpf_load_program_xattr(const struct bpf_load_program_attr *load_attr,
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 176cf55..bd71efc 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -562,8 +562,9 @@ static int compare_bpf_map(const void *_a, const void *_b)
 }
 
 static int
-bpf_object__init_maps(struct bpf_object *obj)
+bpf_object__init_maps(struct bpf_object *obj, int flags)
 {
+   bool strict = !(flags & MAPS_RELAX_COMPAT);
int i, map_idx, map_def_sz, nr_maps =

[bpf-next PATCH v2 1/2] bpf: bpftool, add support for attaching programs to maps

2018-10-15 Thread John Fastabend
Sock map/hash introduce support for attaching programs to maps. To
date I have been doing this with custom tooling but this is less than
ideal as we shift to using bpftool as the single CLI for our BPF uses.
This patch adds new sub commands 'attach' and 'detach' to the 'prog'
command to attach programs to maps and then detach them.

Signed-off-by: John Fastabend 
---
 tools/bpf/bpftool/Documentation/bpftool-prog.rst |   11 ++
 tools/bpf/bpftool/Documentation/bpftool.rst  |2 
 tools/bpf/bpftool/bash-completion/bpftool|   19 
 tools/bpf/bpftool/prog.c |   99 ++
 4 files changed, 128 insertions(+), 3 deletions(-)

diff --git a/tools/bpf/bpftool/Documentation/bpftool-prog.rst 
b/tools/bpf/bpftool/Documentation/bpftool-prog.rst
index 64156a1..12c8030 100644
--- a/tools/bpf/bpftool/Documentation/bpftool-prog.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool-prog.rst
@@ -25,6 +25,8 @@ MAP COMMANDS
 |  **bpftool** **prog dump jited**  *PROG* [{**file** *FILE* | 
**opcodes**}]
 |  **bpftool** **prog pin** *PROG* *FILE*
 |  **bpftool** **prog load** *OBJ* *FILE* [**type** *TYPE*] [**map** 
{**idx** *IDX* | **name** *NAME*} *MAP*] [**dev** *NAME*]
+|   **bpftool** **prog attach** *PROG* *ATTACH_TYPE* *MAP*
+|   **bpftool** **prog detach** *PROG* *ATTACH_TYPE* *MAP*
 |  **bpftool** **prog help**
 |
 |  *MAP* := { **id** *MAP_ID* | **pinned** *FILE* }
@@ -37,6 +39,7 @@ MAP COMMANDS
 |  **cgroup/bind4** | **cgroup/bind6** | **cgroup/post_bind4** | 
**cgroup/post_bind6** |
 |  **cgroup/connect4** | **cgroup/connect6** | **cgroup/sendmsg4** 
| **cgroup/sendmsg6**
 |  }
+|   *ATTACH_TYPE* := { **msg_verdict** | **skb_verdict** | **skb_parse** }
 
 
 DESCRIPTION
@@ -90,6 +93,14 @@ DESCRIPTION
 
  Note: *FILE* must be located in *bpffs* mount.
 
+**bpftool prog attach** *PROG* *ATTACH_TYPE* *MAP*
+  Attach bpf program *PROG* (with type specified by 
*ATTACH_TYPE*)
+  to the map *MAP*.
+
+**bpftool prog detach** *PROG* *ATTACH_TYPE* *MAP*
+  Detach bpf program *PROG* (with type specified by 
*ATTACH_TYPE*)
+  from the map *MAP*.
+
**bpftool prog help**
  Print short help message.
 
diff --git a/tools/bpf/bpftool/Documentation/bpftool.rst 
b/tools/bpf/bpftool/Documentation/bpftool.rst
index 8dda77d..25c0872 100644
--- a/tools/bpf/bpftool/Documentation/bpftool.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool.rst
@@ -26,7 +26,7 @@ SYNOPSIS
| **pin** | **event_pipe** | **help** }
 
*PROG-COMMANDS* := { **show** | **list** | **dump jited** | **dump 
xlated** | **pin**
-   | **load** | **help** }
+   | **load** | **attach** | **detach** | **help** }
 
*CGROUP-COMMANDS* := { **show** | **list** | **attach** | **detach** | 
**help** }
 
diff --git a/tools/bpf/bpftool/bash-completion/bpftool 
b/tools/bpf/bpftool/bash-completion/bpftool
index df1060b..0826519 100644
--- a/tools/bpf/bpftool/bash-completion/bpftool
+++ b/tools/bpf/bpftool/bash-completion/bpftool
@@ -292,6 +292,23 @@ _bpftool()
 fi
 return 0
 ;;
+attach|detach)
+if [[ ${#words[@]} == 7 ]]; then
+COMPREPLY=( $( compgen -W "id pinned" -- "$cur" ) )
+return 0
+fi
+
+if [[ ${#words[@]} == 6 ]]; then
+COMPREPLY=( $( compgen -W "msg_verdict skb_verdict 
skb_parse" -- "$cur" ) )
+return 0
+fi
+
+if [[ $prev == "$command" ]]; then
+COMPREPLY=( $( compgen -W "id pinned" -- "$cur" ) )
+return 0
+fi
+return 0
+;;
 load)
 local obj
 
@@ -347,7 +364,7 @@ _bpftool()
 ;;
 *)
 [[ $prev == $object ]] && \
-COMPREPLY=( $( compgen -W 'dump help pin load \
+COMPREPLY=( $( compgen -W 'dump help pin attach detach 
load \
 show list' -- "$cur" ) )
 ;;
 esac
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index b1cd3bc..99ab42c 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -77,6 +77,26 @@
[BPF_PROG_TYPE_FLOW_DISSECTOR]  = "flow_dissector",
 };
 
+static const char * const attach_type_strings[] = {
+   [BPF_SK_SKB_STREAM_PARSER] = "stream_parser",
+   [BPF_SK_SKB_STREAM_VERDICT] = "stream_verdict",
+   [BPF_SK_MSG_VERDICT] = "msg_verdict",
+   [__

[bpf-next PATCH v2 0/2] bpftool support for sockmap use cases

2018-10-15 Thread John Fastabend
The first patch adds support for attaching programs to maps. This is
needed to support sock{map|hash} use from bpftool. Currently, I carry
around custom code to do this so doing it using standard bpftool will
be great.

The second patch adds a compat mode to ignore non-zero entries in
the map def. This allows using bpftool with maps that have extra
fields that the user knows can be ignored. This is needed to work
correctly with maps being loaded by other tools or directly via
syscalls.

---

John Fastabend (2):
  bpf: bpftool, add support for attaching programs to maps
  bpf: bpftool, add flag to allow non-compat map definitions


 tools/bpf/bpftool/Documentation/bpftool-prog.rst |   11 ++
 tools/bpf/bpftool/Documentation/bpftool.rst  |2 
 tools/bpf/bpftool/bash-completion/bpftool|   19 
 tools/bpf/bpftool/main.c |7 +-
 tools/bpf/bpftool/main.h |3 -
 tools/bpf/bpftool/prog.c |  101 ++
 tools/lib/bpf/bpf.h  |3 +
 tools/lib/bpf/libbpf.c   |   27 --
 tools/lib/bpf/libbpf.h   |2 
 9 files changed, 160 insertions(+), 15 deletions(-)

--
Signature


Re: [PATCH bpf-next 3/8] bpf, sockmap: convert to generic sk_msg interface

2018-10-11 Thread John Fastabend
On 10/11/2018 03:57 PM, Alexei Starovoitov wrote:
> On Thu, Oct 11, 2018 at 02:45:42AM +0200, Daniel Borkmann wrote:
>> Add a generic sk_msg layer, and convert current sockmap and later
>> kTLS over to make use of it. While sk_buff handles network packet
>> representation from netdevice up to socket, sk_msg handles data
>> representation from application to socket layer.
>>
>> This means that sk_msg framework spans across ULP users in the
>> kernel, and enables features such as introspection or filtering
>> of data with the help of BPF programs that operate on this data
>> structure.
>>
>> Latter becomes in particular useful for kTLS where data encryption
>> is deferred into the kernel, and as such enabling the kernel to
>> perform L7 introspection and policy based on BPF for TLS connections
>> where the record is being encrypted after BPF has run and came to
>> a verdict. In order to get there, first step is to transform open
>> coding of scatter-gather list handling into a common core framework
>> that subsystems use.
>>
>> Joint work with John.
>>
>> Signed-off-by: Daniel Borkmann 
>> Signed-off-by: John Fastabend 
>> ---
>>  include/linux/bpf.h   |   33 +-
>>  include/linux/bpf_types.h |2 +-
>>  include/linux/filter.h|   21 -
>>  include/linux/skmsg.h |  371 +++
>>  include/net/tcp.h |   27 +
>>  kernel/bpf/Makefile   |5 -
>>  kernel/bpf/core.c |2 -
>>  kernel/bpf/sockmap.c  | 2610 
>> -
>>  kernel/bpf/syscall.c  |6 +-
>>  net/Kconfig   |   11 +
>>  net/core/Makefile |2 +
>>  net/core/filter.c |  270 ++---
>>  net/core/skmsg.c  |  763 +
>>  net/core/sock_map.c   | 1002 +
>>  net/ipv4/Makefile |1 +
>>  net/ipv4/tcp_bpf.c|  655 
>>  net/strparser/Kconfig |4 +-
>>  17 files changed, 2925 insertions(+), 2860 deletions(-)
> 
>> +void sk_msg_trim(struct sock *sk, struct sk_msg *msg, int len)
>> +{
>> +int trim = msg->sg.size - len;
>> +u32 i = msg->sg.end;
>> +
>> +if (trim <= 0) {
>> +WARN_ON(trim < 0);
>> +return;
>> +}
>> +
>> +sk_msg_iter_var_prev(i);
>> +msg->sg.size = len;
>> +while (msg->sg.data[i].length &&
>> +   trim >= msg->sg.data[i].length) {
>> +trim -= msg->sg.data[i].length;
>> +sk_msg_free_elem(sk, msg, i, true);
>> +sk_msg_iter_var_prev(i);
>> +if (!trim)
>> +goto out;
>> +}
>> +
>> +msg->sg.data[i].length -= trim;
>> +sk_mem_uncharge(sk, trim);
>> +out:
>> +/* If we trim data before curr pointer update copybreak and current
>> + * so that any future copy operations start at new copy location.
>> + * However trimed data that has not yet been used in a copy op
>> + * does not require an update.
>> + */
>> +if (msg->sg.curr >= i) {
>> +msg->sg.curr = i;//msg->sg.end;
> 
> is this a leftover of some debugging ?

Correct, the comment needs to be removed.

> 
> I think such giant patchset is impossible to review in reasonable
> amount of time. I guess you've considered splitting it, but
> couldn't find a way to do so ?

Right, so I looked at splitting this up, but it was hard to find a
split that made sense. This patch is moving existing code in
./kernel/bpf/sockmap.c into skmsg.c, sock_map.c, and tcp_bpf.c.
Along the way some of the naming/structures are changed a bit to
fit into the new file layout. Splitting it up to just move bits and
pieces at a time didn't seem to help much, made it more difficult
to review IMO, and I also had trouble breaking it into isolated changes.

> 
> May be expand the commit log a bit more to explain not only _why_
> (which is typical requiremnt), but _how_ the patchset is doing it?
> 

Sounds like a good idea; Daniel and I can work something up tomorrow.
The gist is to isolate the sock_map changes a bit so they can be
used by multiple ULPs (as noted in the commit log). But the how here
is to create a sock_map.c file to manage BPF map APIs, a skmsg.c
file full of all the generic code used with sk_msg structs (the
core structure throughout) and finally tcp_bpf to handle the TCP
specific parts. The nice fallout of all this is we then can add
_just_ the TLS bits in the subsequent patches. Also note the overall
+/- diff for the entire series is actually close to neutral (2925
insertions vs 2860 deletions).

Re: [RFC 0/2] net: sched: indirect/remote setup tc block cb registering

2018-10-11 Thread John Hurley
On Wed, Oct 10, 2018 at 2:38 PM Or Gerlitz  wrote:
>
> On Thu, Oct 4, 2018 at 8:19 PM Jakub Kicinski
>  wrote:
> > On Thu, 4 Oct 2018 17:20:43 +0100, John Hurley wrote:
> > > > > In this case the hw driver will receive the rules from the tunnel 
> > > > > device directly.
> > > > > The driver can then offload them as it sees fit.
> > > >
> > > > if both instances of the hw drivers (uplink0, uplink1) register to get
> > > > the rules installed on the block of the tunnel device we have exactly
> > > > what we want, isn't that?
> > > >
> > >
> > > The design here is that each hw driver should only need to register
> > > for callbacks on a 'higher level' device's block once.
> > > When a callback is triggered the driver receives one instance of the
> > > rule and can make its own decision about what to do.
> > > This is slightly different from registering ingress devs where each
> > > uplink registers for its own block.
> > > It is probably more akin to the egdev setup in that if a rule on a
> > > block egresses to an uplink, the driver receives 1 callback on the
> > > rule, irrespective of how many underlying netdevs are on the block.
> >
> > Right, though nothing stops the driver from registering multiple
> > callbacks for the same device, if its somehow useful.
>
> I must be missing something.. put uplink bonding aside. If the user
> is setting tc ingress rule
> on a tunnel device (vxlan0/gre0) over a system with multiple unrelated
> NICs/uplinks that support
> TC decap offload, wouldn't each of these netdevs want to install the
> rule into HW? why do we want
> the HW driver to duplicate the rule between the potential candidates
> among the netdev instances they created?
> and not each of them to get the callback and decide??
>
> we want each netdev instance of these NIC

Hi Or,
It depends on how we want to offload tunnels.
In the case of the NFP, we offload 1 instance of a tunnel rule, not
one instance per uplink.
With this, it makes sense to have 1 callback per tunnel netdev (and
per driver) rather than per uplink (although as Jakub pointed out, the
option is there to register more callbacks).
If we consider the egdev model for offload, we only got a single
callback per rule if the egress device was registered and did not know
the ingress dev - is this not similar in that the driver gets 1
callback for the rule and decides what to do with it?
John
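
(Purely illustrative, not part of the thread's patches: a rough sketch of
the driver-side registration this RFC points at. The register helper name
and callback signature below are assumptions based on the RFC direction
and may not match the final API; drv_setup_tc_block() is hypothetical.)

    /* Hypothetical: a hw driver registers once for block updates on a
     * tunnel netdev it has learned about, and then gets one callback
     * per rule installed on that block.
     */
    static int drv_indr_setup_tc_cb(struct net_device *netdev, void *cb_priv,
                                    enum tc_setup_type type, void *type_data)
    {
            switch (type) {
            case TC_SETUP_BLOCK:
                    /* decide here how (or whether) to offload rules
                     * installed on the tunnel dev's block
                     */
                    return drv_setup_tc_block(netdev, cb_priv, type_data);
            default:
                    return -EOPNOTSUPP;
            }
    }

    /* on discovering tun_netdev (e.g. a vxlan dev): */
    err = tc_indr_block_cb_register(tun_netdev, app_priv,
                                    drv_indr_setup_tc_cb, app_priv);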


Re: [PATCH] bpf: bpftool, add support for attaching programs to maps

2018-10-10 Thread John Fastabend
On 10/10/2018 10:11 AM, Jakub Kicinski wrote:
> On Wed, 10 Oct 2018 09:44:26 -0700, John Fastabend wrote:
>> Sock map/hash introduce support for attaching programs to maps. To
>> date I have been doing this with custom tooling but this is less than
>> ideal as we shift to using bpftool as the single CLI for our BPF uses.
>> This patch adds new sub commands 'attach' and 'detach' to the 'prog'
>> command to attach programs to maps and then detach them.
>>
>> Signed-off-by: John Fastabend 
>> ---
>>  tools/bpf/bpftool/main.h |1 +
>>  tools/bpf/bpftool/prog.c |   92 ++
>>  2 files changed, 92 insertions(+), 1 deletion(-)
>>
>> diff --git a/tools/bpf/bpftool/main.h b/tools/bpf/bpftool/main.h
>> index 40492cd..9ceb2b6 100644
>> --- a/tools/bpf/bpftool/main.h
>> +++ b/tools/bpf/bpftool/main.h
>> @@ -137,6 +137,7 @@ int cmd_select(const struct cmd *cmds, int argc, char 
>> **argv,
>>  int do_cgroup(int argc, char **arg);
>>  int do_perf(int argc, char **arg);
>>  int do_net(int argc, char **arg);
>> +int do_attach_cmd(int argc, char **arg);
> 
> Looks like a leftover?
> 

Yeah, originally I made attach/detach its own top-level command,
but it seems a better fit under prog.

[..]

>> +if (!REQ_ARGS(4)) {
> 
> Hm, 4 or 5?  id $prog $type id $map ?
> 

Yep thanks.

[...]

>> +
>> +NEXT_ARG();
> 
> nit: maybe NEXT_ARG() should be grouped with the code that consumes the
>  parameter, i.e. new line after not before?

sure.

> 
>> +mapfd = map_parse_fd(&argc, &argv);
>> +if (mapfd < 0)
>> +return mapfd;
>> +
>> +err = bpf_prog_attach(progfd, mapfd, attach_type, 0);
>> +if (err) {
>> +p_err("failed prog attach to map");
>> +return -EINVAL;
>> +}
> 
> Could you plunk a
> 
> if (json_output)
>   jsonw_null(json_wtr);
> 
> here to always produce valid JSON even for commands with no output
> today?
> 

Makes sense.

> Same comments for detach.

[...]

> Would you mind updating the man page and the bash completions?
> 

Will do this. Thanks.
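
(For the archives, an example invocation under this syntax; the program
and map IDs below are made up:

    bpftool prog attach id 42 stream_verdict id 7
    bpftool prog detach id 42 stream_verdict id 7

where the attach type is one of stream_parser, stream_verdict or
msg_verdict per the attach_type_strings table in the patch; note the
help string in the patch lists skb_verdict/skb_parse instead, which
does not match that table.)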


Re: [PATCH net-next] net: enable RPS on vlan devices

2018-10-10 Thread John Fastabend
On 10/10/2018 10:14 AM, Eric Dumazet wrote:
> 
> 
> On 10/10/2018 09:18 AM, Shannon Nelson wrote:
>> On 10/9/2018 7:17 PM, Eric Dumazet wrote:
>>>
>>>
>>> On 10/09/2018 07:11 PM, Shannon Nelson wrote:

 Hence the reason we sent this as an RFC a couple of weeks ago.  We got no 
 response, so followed up with this patch in order to get some input. Do 
 you have any suggestions for how we might accomplish this in a less ugly 
 way?
>>>
>>> I dunno, maybe a modern way for all these very specific needs would be to 
>>> use an eBPF
>>> hook to implement whatever combination of RPS/RFS/what_have_you
>>>
>>> Then, we no longer have to review the various strategies used by users.
>>
>> We're trying to make use of an existing useful feature that was designed for 
>> exactly this kind of problem.  It is already there and no new user training 
>> is needed.  We're actually fixing what could arguably be called a bug since 
>> the /sys/class/net/<dev>/queues/rx-0/rps_cpus entry exists for vlan devices 
>> but currently doesn't do anything.  We're also addressing a security concern 
>> related to the recent L1TF excitement.
>>
>> For this case, we want to target the network stack processing to happen on a 
>> certain subset of CPUs.  With admittedly only a cursory look through eBPF, I 
>> don't see an obvious way to target the packet processing to an alternative 
>> CPU, unless we add yet another field to the skb that eBPF/XDP could fill and 
>> then query that field at the same time as we currently check get_rps_cpu().  
>> But adding to the skb is usually frowned upon unless absolutely necessary, 
>> and this seems like a duplication of what we already have with RPS, so why 
>> add a competing feature?
>>
>> Back to my earlier question: are there any suggestions for how we might 
>> accomplish this in a less ugly way?
> 
> 
> What if you want to have efficient multi queue processing ?
> The Vlan device could have multiple RX queues, but you forced queue_mapping=0
> 
> Honestly, RPS & RFS show their age and complexity (look at 
> net/core/net-sysfs.c ...)
> 
> We should not expand it, we should put in place a new infrastructure, fully 
> expandable.
> With socket lookups, we can even avoid having a hashtable for flow
> information, removing one cache miss and removing flow collisions.
> 
> eBPF seems perfect to me.
> 

Latest tree has a sk_lookup() helper supported at the 'tc' layer now
to look up the socket. And XDP has support for a "cpumap" object
that allows redirect to remote CPUs. Neither was specifically
designed for this but I suspect with some extra work these might
be what is needed.

I would start by looking at bpf_sk_lookup() in filter.c and the
cpumap type in ./kernel/bpf/cpumap.c; also, in general, sk_lookup
from the XDP layer will likely be needed shortly anyways.
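
(A minimal sketch of the cpumap idea, assuming the usual selftests
bpf_helpers.h. It steers everything to one hard-coded CPU; a real policy
would hash the flow and would also need the vlan-awareness discussed
above. Userspace still has to populate cpu_map with queue sizes via
bpf_map_update_elem().)

    #include <linux/bpf.h>
    #include "bpf_helpers.h"

    struct bpf_map_def SEC("maps") cpu_map = {
            .type           = BPF_MAP_TYPE_CPUMAP,
            .key_size       = sizeof(__u32),
            .value_size     = sizeof(__u32),  /* qsize, set from userspace */
            .max_entries    = 64,
    };

    SEC("xdp")
    int xdp_steer_cpu(struct xdp_md *ctx)
    {
            __u32 cpu = 2;  /* fixed target CPU for illustration only */

            return bpf_redirect_map(&cpu_map, cpu, 0);
    }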

> It is time that we stop adding core infra that most users do not need/use.
> (RPS and RFS are default off)
> 



[PATCH] bpf: bpftool, add support for attaching programs to maps

2018-10-10 Thread John Fastabend
Sock map/hash introduce support for attaching programs to maps. To
date I have been doing this with custom tooling but this is less than
ideal as we shift to using bpftool as the single CLI for our BPF uses.
This patch adds new sub commands 'attach' and 'detach' to the 'prog'
command to attach programs to maps and then detach them.

Signed-off-by: John Fastabend 
---
 tools/bpf/bpftool/main.h |1 +
 tools/bpf/bpftool/prog.c |   92 ++
 2 files changed, 92 insertions(+), 1 deletion(-)

diff --git a/tools/bpf/bpftool/main.h b/tools/bpf/bpftool/main.h
index 40492cd..9ceb2b6 100644
--- a/tools/bpf/bpftool/main.h
+++ b/tools/bpf/bpftool/main.h
@@ -137,6 +137,7 @@ int cmd_select(const struct cmd *cmds, int argc, char 
**argv,
 int do_cgroup(int argc, char **arg);
 int do_perf(int argc, char **arg);
 int do_net(int argc, char **arg);
+int do_attach_cmd(int argc, char **arg);
 
 int prog_parse_fd(int *argc, char ***argv);
 int map_parse_fd(int *argc, char ***argv);
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index b1cd3bc..280881d 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -77,6 +77,26 @@
[BPF_PROG_TYPE_FLOW_DISSECTOR]  = "flow_dissector",
 };
 
+static const char * const attach_type_strings[] = {
+   [BPF_SK_SKB_STREAM_PARSER] = "stream_parser",
+   [BPF_SK_SKB_STREAM_VERDICT] = "stream_verdict",
+   [BPF_SK_MSG_VERDICT] = "msg_verdict",
+   [__MAX_BPF_ATTACH_TYPE] = NULL,
+};
+
+enum bpf_attach_type parse_attach_type(const char *str)
+{
+   enum bpf_attach_type type;
+
+   for (type = 0; type < __MAX_BPF_ATTACH_TYPE; type++) {
+   if (attach_type_strings[type] &&
+   is_prefix(str, attach_type_strings[type]))
+   return type;
+   }
+
+   return __MAX_BPF_ATTACH_TYPE;
+}
+
 static void print_boot_time(__u64 nsecs, char *buf, unsigned int size)
 {
struct timespec real_time_ts, boot_time_ts;
@@ -697,6 +717,71 @@ int map_replace_compar(const void *p1, const void *p2)
return a->idx - b->idx;
 }
 
+static int do_attach(int argc, char **argv)
+{
+   enum bpf_attach_type attach_type;
+   int err, mapfd, progfd;
+
+   if (!REQ_ARGS(4)) {
+   p_err("too few parameters for map attach");
+   return -EINVAL;
+   }
+
+   progfd = prog_parse_fd(&argc, &argv);
+   if (progfd < 0)
+   return progfd;
+
+   attach_type = parse_attach_type(*argv);
+   if (attach_type == __MAX_BPF_ATTACH_TYPE) {
+   p_err("invalid attach type");
+   return -EINVAL;
+   }
+
+   NEXT_ARG();
+   mapfd = map_parse_fd(&argc, &argv);
+   if (mapfd < 0)
+   return mapfd;
+
+   err = bpf_prog_attach(progfd, mapfd, attach_type, 0);
+   if (err) {
+   p_err("failed prog attach to map");
+   return -EINVAL;
+   }
+   return 0;
+}
+
+static int do_detach(int argc, char **argv)
+{
+   enum bpf_attach_type attach_type;
+   int err, mapfd, progfd;
+
+   if (!REQ_ARGS(4)) {
+   p_err("too few parameters for map detach");
+   return -EINVAL;
+   }
+
+   progfd = prog_parse_fd(&argc, &argv);
+   if (progfd < 0)
+   return progfd;
+
+   attach_type = parse_attach_type(*argv);
+   if (attach_type == __MAX_BPF_ATTACH_TYPE) {
+   p_err("invalid attach type");
+   return -EINVAL;
+   }
+
+   NEXT_ARG();
+   mapfd = map_parse_fd(&argc, &argv);
+   if (mapfd < 0)
+   return mapfd;
+
+   err = bpf_prog_detach2(progfd, mapfd, attach_type);
+   if (err) {
+   p_err("failed prog detach from map");
+   return -EINVAL;
+   }
+   return 0;
+}
 static int do_load(int argc, char **argv)
 {
enum bpf_attach_type expected_attach_type;
@@ -942,6 +1027,7 @@ static int do_help(int argc, char **argv)
"   %s %s pin   PROG FILE\n"
"   %s %s load  OBJ  FILE [type TYPE] [dev NAME] \\\n"
" [map { idx IDX | name NAME } MAP]\n"
+   "   %s %s attach PROG ATTACH_TYPE MAP\n"
"   %s %s help\n"
"\n"
"   " HELP_SPEC_MAP "\n"
@@ -953,10 +1039,12 @@ static int do_help(int argc, char **argv)
" cgroup/bind4 | cgroup/bind6 | 
cgroup/post_bind4 |\n"
" cgroup/post_bind6 | cgroup/connect4 | 
cgroup/connect6 |\n"
" cgroup/sendmsg4 | cgroup/sendmsg6 }\n"
+   "   ATTACH_TYPE := { msg_verdict | skb_verdict | skb_parse 
}\n"
  

Re: [PATCH ethtool v2] ethtool: Fix uninitialized variable use at qsfp dump

2018-10-04 Thread John W. Linville
On Tue, Oct 02, 2018 at 10:24:19AM +0300, Eran Ben Elisha wrote:
> Struct sff_diags can be used uninitialized at sff8636_show_dom, this
> caused the tool to show unreported fields (supports_alarms) by the lower
> level driver.
> 
> In addition, make sure the same struct is being initialized at
> sff8472_parse_eeprom function, to avoid the same issue here.
> 
> Fixes: a5e73bb05ee4 ("ethtool:QSFP Plus/QSFP28 Diagnostics Information 
> Support")
> Signed-off-by: Eran Ben Elisha 

OK, queued for next release...

-- 
John W. Linville		Someday the world will need a hero, and you
linvi...@tuxdriver.com  might be all we have.  Be ready.


Re: [RFC PATCH ethtool] ethtool: better syntax for combinations of FEC modes

2018-10-04 Thread John W. Linville
On Thu, Oct 04, 2018 at 05:06:29PM +0100, Edward Cree wrote:
> On 04/10/18 15:08, John W. Linville wrote:
> > Ping?
> >
> > On Mon, Oct 01, 2018 at 02:59:10PM -0400, John W. Linville wrote:
> >> Is this patch still RFC?
> Feel free to de-RFC and apply it.

Great -- queued for next release.
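
(For reference, under the de-RFC'd syntax combinations are given as
separate arguments, e.g. 'ethtool --set-fec eth0 encoding baser rs',
where the earlier patch used the comma form 'encoding baser,rs';
eth0 here is a placeholder device name.)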

John
-- 
John W. Linville		Someday the world will need a hero, and you
linvi...@tuxdriver.com  might be all we have.  Be ready.


Re: [RFC 0/2] net: sched: indirect/remote setup tc block cb registering

2018-10-04 Thread John Hurley
On Thu, Oct 4, 2018 at 4:53 PM Or Gerlitz  wrote:
>
> On Thu, Oct 4, 2018 at 6:44 PM John Hurley  wrote:
> > On Thu, Oct 4, 2018 at 3:28 PM Or Gerlitz  wrote:
> > > On Thu, Oct 4, 2018 at 7:55 AM Jakub Kicinski 
> > >  wrote:
>
> > > > This patchset introduces as an alternative to egdev offload by allowing 
> > > > a
> > > > driver to register for block updates when an external device (e.g. 
> > > > tunnel
> > > > netdev) is bound to a TC block.
>
> > > In a slightly different but hopefully somehow related context, regarding
> > > the case of flow offloading in the presence of upper devices 
> > > (specifically LAG),
> > > your ovs user patch [1]  applied TC block sharing on the slave of lag 
> > > (bond/team)
> > > device which serves as ovs port. This way, flows that are installed on
> > > the bond are propagated to both uplink devices - good!
>
> > > However, when tunneling comes into play, the bond device is not part of
> > > the virtual switch but rather the tunnel device, so the SW DP is
> > >
> > > wire --> hw driver --> bond --> stack --> tunnel driver --> virtual switch
> > >
> > > So now, if the HW driver uses your new facility to register for rules 
> > > installed on the
> > > tunnel device, we are again properly sharing (duplicating) the rules
> > > to both uplinks, right?!
> > >
> > > [1] d22f892 netdev-linux: monitor and offload LAG slaves to TC
>
> > Because the bond is not on the vSwitch, the TC rule will not be
> > offloaded to it and therefore not duplicated to its devices.
>
> indeed
>
> > In this case the hw driver will receive the rules from the tunnel device 
> > directly.
> > The driver can then offload them as it sees fit.
>
> if both instances of the hw drivers (uplink0, uplink1) register to get
> the rules installed on the block of the tunnel device we have exactly
> what we want, isn't that?
>

The design here is that each hw driver should only need to register
for callbacks on a 'higher level' device's block once.
When a callback is triggered the driver receives one instance of the
rule and can make its own decision about what to do.
This is slightly different from registering ingress devs where each
uplink registers for its own block.
It is probably more akin to the egdev setup in that if a rule on a
block egresses to an uplink, the driver receives 1 callback on the
rule, irrespective of how many underlying netdevs are on the block.

> > Currently, this setup would be offloaded via egdev.
>
> not following, egdev I thought could be removed... and it's not needed
> as I explained above, unless I miss something?

Apologies - my use of 'currently' meant with the current upstream
kernel (i.e. prior to this patch).


Re: [RFC 0/2] net: sched: indirect/remote setup tc block cb registering

2018-10-04 Thread John Hurley
On Thu, Oct 4, 2018 at 3:28 PM Or Gerlitz  wrote:
>
> On Thu, Oct 4, 2018 at 7:55 AM Jakub Kicinski
>  wrote:
> >
> > Hi!
> >
> > This set contains a rough RFC implementation of a proposed [1] replacement
> > for egdev cls_flower offloads.  I did some last minute restructuring
> > and removal of parts I felt were unnecessary, so if there are glaring bugs
> > they are probably mine, not John's :)  but hopefully this will give an idea
> > of the general direction.  We need to beef up the driver part to see how
> > it fully comes together.
> >
> > [1] http://vger.kernel.org/netconf2018_files/JakubKicinski_netconf2018.pdf
> > slides 10-13
> >
> > John's says:
> >
> > This patchset introduces as an alternative to egdev offload by allowing a
> > driver to register for block updates when an external device (e.g. tunnel
> > netdev) is bound to a TC block.
>
> In a slightly different but hopefully somehow related context, regarding
> the case of flow offloading in the presence of upper devices (specifically 
> LAG),
> your ovs user patch [1]  applied TC block sharing on the slave of lag
> (bond/team)
> device which serves as ovs port. This way, flows that are installed on
> the bond are
> propagated to both uplink devices - good!
>
> However, when tunneling comes into play, the bond device is not part of
> the virtual switch but rather the tunnel device, so the SW DP is
>
> wire --> hw driver --> bond --> stack --> tunnel driver --> virtual switch
>
> So now, if the HW driver uses your new facility to register for rules
> installed on the
> tunnel device, we are again properly sharing (duplicating) the rules
> to both uplinks, right?!
>
> [1] d22f892 netdev-linux: monitor and offload LAG slaves to TC
>

Hi Or,
In this case the hw driver will receive the rules from the tunnel
device directly.
The driver can then offload them as it sees fit.
Because the bond is not on the vSwitch, the TC rule will not be
offloaded to it and therefore not duplicated to its devices.
Currently, this setup would be offloaded via egdev.

> > Drivers can track new netdevs or register
> > to existing ones to receive information on such events. Based on this,
> > they may register for block offload rules using already existing functions.
>
> Just to make it clear, (part of) the claim to fame here is that once
> we have this
> code in, we can just go and remove all the egdev related code from the
> kernel (both
> core and drivers), right? only nfp and mlx5 use egdev, so the removal
> should be simple
> exercise.
>

Along with simplifying things and removing the need for duplicate rule
offload checks, I see (at least) 2 main functional benefits of using
this instead of egdev:
1. we can get access to the ingress netdev and so can check for things
like tunnel type rather than relying on TC rules and well known ports
to determine this.
2. we can offload rules that do not have an uplink/repr as ingress or
egress dev (currently, the hw driver will not receive a callback) -
e.g. VXLAN -->bond.

> > Included with this RFC is a patch to the NFP driver. This is only supposed
> > to provide an example of how the remote block setup can be used.
>
> We will look and play with the patches next week and provide feedback, cool
> that you took the lead to improve the facilities here!
cool, thanks


Re: [RFC PATCH ethtool] ethtool: better syntax for combinations of FEC modes

2018-10-04 Thread John W. Linville
Ping?

On Mon, Oct 01, 2018 at 02:59:10PM -0400, John W. Linville wrote:
> Is this patch still RFC?
> 
> On Wed, Sep 19, 2018 at 05:06:25PM +0100, Edward Cree wrote:
> > Instead of commas, just have them as separate argvs.
> > 
> > The parsing state machine might look heavyweight, but it makes it easy to 
> > add
> >  more parameters later and distinguish parameter names from encoding names.
> > 
> > Suggested-by: Michal Kubecek 
> > Signed-off-by: Edward Cree 
> > ---
> >  ethtool.8.in   |  6 +++---
> >  ethtool.c  | 63 
> > --
> >  test-cmdline.c | 10 +-
> >  3 files changed, 25 insertions(+), 54 deletions(-)
> > 
> > diff --git a/ethtool.8.in b/ethtool.8.in
> > index 414eaa1..7ea2cc0 100644
> > --- a/ethtool.8.in
> > +++ b/ethtool.8.in
> > @@ -390,7 +390,7 @@ ethtool \- query or control network driver and hardware 
> > settings
> >  .B ethtool \-\-set\-fec
> >  .I devname
> >  .B encoding
> > -.BR auto | off | rs | baser [ , ...]
> > +.BR auto | off | rs | baser \ [...]
> >  .
> >  .\" Adjust lines (i.e. full justification) and hyphenate.
> >  .ad
> > @@ -1120,11 +1120,11 @@ current FEC mode, the driver or firmware must take 
> > the link down
> >  administratively and report the problem in the system logs for users to 
> > correct.
> >  .RS 4
> >  .TP
> > -.BR encoding\ auto | off | rs | baser [ , ...]
> > +.BR encoding\ auto | off | rs | baser \ [...]
> >  
> >  Sets the FEC encoding for the device.  Combinations of options are 
> > specified as
> >  e.g.
> > -.B auto,rs
> > +.B encoding auto rs
> >  ; the semantics of such combinations vary between drivers.
> >  .TS
> >  nokeep;
> > diff --git a/ethtool.c b/ethtool.c
> > index 9997930..2f7e96b 100644
> > --- a/ethtool.c
> > +++ b/ethtool.c
> > @@ -4979,39 +4979,6 @@ static int fecmode_str_to_type(const char *str)
> > return 0;
> >  }
> >  
> > -/* Takes a comma-separated list of FEC modes, returns the bitwise OR of 
> > their
> > - * corresponding ETHTOOL_FEC_* constants.
> > - * Accepts repetitions (e.g. 'auto,auto') and trailing comma (e.g. 'off,').
> > - */
> > -static int parse_fecmode(const char *str)
> > -{
> > -   int fecmode = 0;
> > -   char buf[6];
> > -
> > -   if (!str)
> > -   return 0;
> > -   while (*str) {
> > -   size_t next;
> > -   int mode;
> > -
> > -   next = strcspn(str, ",");
> > -   if (next >= 6) /* Bad mode, longest name is 5 chars */
> > -   return 0;
> > -   /* Copy into temp buffer and nul-terminate */
> > -   memcpy(buf, str, next);
> > -   buf[next] = 0;
> > -   mode = fecmode_str_to_type(buf);
> > -   if (!mode) /* Bad mode encountered */
> > -   return 0;
> > -   fecmode |= mode;
> > -   str += next;
> > -   /* Skip over ',' (but not nul) */
> > -   if (*str)
> > -   str++;
> > -   }
> > -   return fecmode;
> > -}
> > -
> >  static int do_gfec(struct cmd_context *ctx)
> >  {
> > struct ethtool_fecparam feccmd = { 0 };
> > @@ -5041,22 +5008,26 @@ static int do_gfec(struct cmd_context *ctx)
> >  
> >  static int do_sfec(struct cmd_context *ctx)
> >  {
> > -   char *fecmode_str = NULL;
> > +   enum { ARG_NONE, ARG_ENCODING } state = ARG_NONE;
> > struct ethtool_fecparam feccmd;
> > -   struct cmdline_info cmdline_fec[] = {
> > -   { "encoding", CMDL_STR,  _str,  },
> > -   };
> > -   int changed;
> > -   int fecmode;
> > -   int rv;
> > +   int fecmode = 0, newmode;
> > +   int rv, i;
> >  
> > -   parse_generic_cmdline(ctx, &changed, cmdline_fec,
> > - ARRAY_SIZE(cmdline_fec));
> > -
> > -   if (!fecmode_str)
> > +   for (i = 0; i < ctx->argc; i++) {
> > +   if (!strcmp(ctx->argp[i], "encoding")) {
> > +   state = ARG_ENCODING;
> > +   continue;
> > +   }
> > +   if (state == ARG_ENCODING) {
> > +   newmode = fecmode_str_to_type(ctx->argp[i]);
> > +   if (!newmode)
> > +   exit_bad_args();
> > +   fecmode |= newmode;
> > +   continue;

Re: [RFC PATCH ethtool] ethtool: better syntax for combinations of FEC modes

2018-10-01 Thread John W. Linville
t;   { "-h|--help", 0, show_usage, "Show this help" },
>   { "--version", 0, do_version, "Show version number" },
>   {}
> diff --git a/test-cmdline.c b/test-cmdline.c
> index 9c51cca..84630a5 100644
> --- a/test-cmdline.c
> +++ b/test-cmdline.c
> @@ -268,12 +268,12 @@ static struct test_case {
>   { 1, "--set-eee devname advertise foo" },
>   { 1, "--set-fec devname" },
>   { 0, "--set-fec devname encoding auto" },
> - { 0, "--set-fec devname encoding off," },
> -     { 0, "--set-fec devname encoding baser,rs" },
> - { 0, "--set-fec devname encoding auto,auto," },
> + { 0, "--set-fec devname encoding off" },
> + { 0, "--set-fec devname encoding baser rs" },
> + { 0, "--set-fec devname encoding auto auto" },
>   { 1, "--set-fec devname encoding foo" },
> - { 1, "--set-fec devname encoding auto,foo" },
> - { 1, "--set-fec devname encoding auto,," },
> + { 1, "--set-fec devname encoding auto foo" },
> + { 1, "--set-fec devname encoding none" },
>   { 1, "--set-fec devname auto" },
>   /* can't test --set-priv-flags yet */
>   { 0, "-h" },
> 

-- 
John W. Linville		Someday the world will need a hero, and you
linvi...@tuxdriver.com  might be all we have.  Be ready.


Re: bpf: Massive skbuff_head_cache memory leak?

2018-09-26 Thread John Johansen
On 09/26/2018 02:22 PM, Daniel Borkmann wrote:
> On 09/26/2018 11:09 PM, Tetsuo Handa wrote:
>> Hello, Alexei and Daniel.
>>
>> Can you show us how to run testcases you are testing?
> 
> Sorry for the delay; currently quite backlogged but will definitely take a 
> look
> at these reports. Regarding your question: majority of test cases are in the
> kernel tree under selftests, see tools/testing/selftests/bpf/ .
> 

It's unlikely to be apparmor. I went through the reports and saw nothing
that would indicate apparmor involvement, but the primary reason is what
is being tested in upstream apparmor atm.

The current upstream code does nothing directly with skbuffs. It's
possible that the audit code paths (kernel audit does grab skbuffs)
could, but there are only a couple cases that would be triggered in
the current fuzzing so this seems to be an unlikely source for such a
large leak.

>> On 2018/09/22 22:25, Tetsuo Handa wrote:
>>> Hello.
>>>
>>> syzbot is reporting many lockup problems on bpf.git / bpf-next.git / 
>>> net.git / net-next.git trees.
>>>
>>>   INFO: rcu detected stall in br_multicast_port_group_expired (2)
>>>   
>>> https://syzkaller.appspot.com/bug?id=15c7ad8cf35a07059e8a697a22527e11d294bc94
>>>
>>>   INFO: rcu detected stall in tun_chr_close
>>>   
>>> https://syzkaller.appspot.com/bug?id=6c50618bde03e5a2eefdd0269cf9739c5ebb8270
>>>
>>>   INFO: rcu detected stall in discover_timer
>>>   
>>> https://syzkaller.appspot.com/bug?id=55da031ddb910e58ab9c6853a5784efd94f03b54
>>>
>>>   INFO: rcu detected stall in ret_from_fork (2)
>>>   
>>> https://syzkaller.appspot.com/bug?id=c83129a6683b44b39f5b8864a1325893c9218363
>>>
>>>   INFO: rcu detected stall in addrconf_rs_timer
>>>   
>>> https://syzkaller.appspot.com/bug?id=21c029af65f81488edbc07a10ed20792444711b6
>>>
>>>   INFO: rcu detected stall in kthread (2)
>>>   
>>> https://syzkaller.appspot.com/bug?id=6accd1ed11c31110fed1982f6ad38cc9676477d2
>>>
>>>   INFO: rcu detected stall in ext4_filemap_fault
>>>   
>>> https://syzkaller.appspot.com/bug?id=817e38d20e9ee53390ac361bf0fd2007eaf188af
>>>
>>>   INFO: rcu detected stall in run_timer_softirq (2)
>>>   
>>> https://syzkaller.appspot.com/bug?id=f5a230a3ff7822f8d39fddf8485931bd06ae47fe
>>>
>>>   INFO: rcu detected stall in bpf_prog_ADDR
>>>   
>>> https://syzkaller.appspot.com/bug?id=fb4911fd0e861171cc55124e209f810a0dd68744
>>>
>>>   INFO: rcu detected stall in __run_timers (2)
>>>   
>>> https://syzkaller.appspot.com/bug?id=65416569ddc8d2feb8f19066aa761f5a47f7451a
>>>
>>> The cause of the lockup seems to be a flood of printk() messages from
>>> memory allocation failures, and one of the out_of_memory() messages
>>> indicates that skbuff_head_cache usage is huge enough to suspect
>>> in-kernel memory leaks.
>>>
>>>   [ 1554.547011] skbuff_head_cache    1847887KB    1847887KB
>>>
>>> Unfortunately, we cannot find from logs what syzbot is trying to do
>>> because the constant printk() messages flood away the syzkaller messages.
>>> Can you try running your testcases with kmemleak enabled?
>>>
>>
>> On 2018/09/27 2:35, Dmitry Vyukov wrote:
>>> I also started suspecting Apparmor. We switched to Apparmor on Aug 30:
>>> https://groups.google.com/d/msg/syzkaller-bugs/o73lO4KGh0w/j9pcH2tSBAAJ
>>> Now the instances that use SELinux and Smack explicitly contain that
>>> in the name, but the rest are Apparmor.
>>> Aug 30 roughly matches these assorted "task hung" reports. Perhaps
>>> some Apparmor hook leaks a reference to skbs?
>>
>> Maybe. They have CONFIG_DEFAULT_SECURITY="apparmor". But I'm wondering why
>> this problem is not occurring on linux-next.git when this problem is 
>> occurring
>> on bpf.git / bpf-next.git / net.git / net-next.git trees. Is syzbot running
>> different testcases depending on which git tree is targeted?
>>
> 

This is another reason it is doubtful that it's apparmor.
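
(If anyone wants to chase it with kmemleak as suggested above: with
CONFIG_DEBUG_KMEMLEAK=y, 'echo scan > /sys/kernel/debug/kmemleak'
triggers a scan, and reading /sys/kernel/debug/kmemleak lists the
suspected leaks with backtraces.)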



[bpf PATCH v4 2/3] bpf: sockmap, fix transition through disconnect without close

2018-09-18 Thread John Fastabend
It is possible (via shutdown()) for TCP socks to go through TCP_CLOSE
state via tcp_disconnect() without actually calling tcp_close, which
would then call our bpf_tcp_close() callback. Because of this a user
could disconnect a socket then put it in a LISTEN state which would
break our assumptions about sockets always being ESTABLISHED state.

To resolve this rely on the unhash hook, which is called in the
disconnect case, to remove the sock from the sockmap.
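
(Not part of the patch, just for illustration: roughly the userspace
sequence at issue, as described above; address setup elided.)

    #include <sys/socket.h>
    #include <netinet/in.h>

    struct sockaddr_in addr = { .sin_family = AF_INET /* ... elided ... */ };
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    connect(fd, (struct sockaddr *)&addr, sizeof(addr));
                              /* ESTABLISHED; may now sit in a sockmap */
    shutdown(fd, SHUT_RDWR);  /* tcp_disconnect() path: TCP_CLOSE without
                               * tcp_close()/bpf_tcp_close() */
    listen(fd, 1);            /* reused as LISTEN while still in the map */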

Reported-by: Eric Dumazet 
Fixes: 1aa12bdf1bfb ("bpf: sockmap, add sock close() hook to remove socks")
Signed-off-by: John Fastabend 
Acked-by: Yonghong Song 
---
 kernel/bpf/sockmap.c |   60 ++
 1 file changed, 41 insertions(+), 19 deletions(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 1f97b55..0a0f2ec 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -132,6 +132,7 @@ struct smap_psock {
struct work_struct gc_work;
 
struct proto *sk_proto;
+   void (*save_unhash)(struct sock *sk);
void (*save_close)(struct sock *sk, long timeout);
void (*save_data_ready)(struct sock *sk);
void (*save_write_space)(struct sock *sk);
@@ -143,6 +144,7 @@ static int bpf_tcp_recvmsg(struct sock *sk, struct msghdr 
*msg, size_t len,
 static int bpf_tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
 static int bpf_tcp_sendpage(struct sock *sk, struct page *page,
int offset, size_t size, int flags);
+static void bpf_tcp_unhash(struct sock *sk);
 static void bpf_tcp_close(struct sock *sk, long timeout);
 
 static inline struct smap_psock *smap_psock_sk(const struct sock *sk)
@@ -184,6 +186,7 @@ static void build_protos(struct proto 
prot[SOCKMAP_NUM_CONFIGS],
 struct proto *base)
 {
prot[SOCKMAP_BASE]  = *base;
+   prot[SOCKMAP_BASE].unhash   = bpf_tcp_unhash;
prot[SOCKMAP_BASE].close= bpf_tcp_close;
prot[SOCKMAP_BASE].recvmsg  = bpf_tcp_recvmsg;
prot[SOCKMAP_BASE].stream_memory_read   = bpf_tcp_stream_read;
@@ -217,6 +220,7 @@ static int bpf_tcp_init(struct sock *sk)
return -EBUSY;
}
 
+   psock->save_unhash = sk->sk_prot->unhash;
psock->save_close = sk->sk_prot->close;
psock->sk_proto = sk->sk_prot;
 
@@ -305,30 +309,12 @@ static struct smap_psock_map_entry *psock_map_pop(struct 
sock *sk,
return e;
 }
 
-static void bpf_tcp_close(struct sock *sk, long timeout)
+static void bpf_tcp_remove(struct sock *sk, struct smap_psock *psock)
 {
-   void (*close_fun)(struct sock *sk, long timeout);
struct smap_psock_map_entry *e;
struct sk_msg_buff *md, *mtmp;
-   struct smap_psock *psock;
struct sock *osk;
 
-   lock_sock(sk);
-   rcu_read_lock();
-   psock = smap_psock_sk(sk);
-   if (unlikely(!psock)) {
-   rcu_read_unlock();
-   release_sock(sk);
-   return sk->sk_prot->close(sk, timeout);
-   }
-
-   /* The psock may be destroyed anytime after exiting the RCU critial
-* section so by the time we use close_fun the psock may no longer
-* be valid. However, bpf_tcp_close is called with the sock lock
-* held so the close hook and sk are still valid.
-*/
-   close_fun = psock->save_close;
-
if (psock->cork) {
free_start_sg(psock->sock, psock->cork, true);
kfree(psock->cork);
@@ -379,6 +365,42 @@ static void bpf_tcp_close(struct sock *sk, long timeout)
kfree(e);
e = psock_map_pop(sk, psock);
}
+}
+
+static void bpf_tcp_unhash(struct sock *sk)
+{
+   void (*unhash_fun)(struct sock *sk);
+   struct smap_psock *psock;
+
+   rcu_read_lock();
+   psock = smap_psock_sk(sk);
+   if (unlikely(!psock)) {
+   rcu_read_unlock();
+   if (sk->sk_prot->unhash)
+   sk->sk_prot->unhash(sk);
+   return;
+   }
+   unhash_fun = psock->save_unhash;
+   bpf_tcp_remove(sk, psock);
+   rcu_read_unlock();
+   unhash_fun(sk);
+}
+
+static void bpf_tcp_close(struct sock *sk, long timeout)
+{
+   void (*close_fun)(struct sock *sk, long timeout);
+   struct smap_psock *psock;
+
+   lock_sock(sk);
+   rcu_read_lock();
+   psock = smap_psock_sk(sk);
+   if (unlikely(!psock)) {
+   rcu_read_unlock();
+   release_sock(sk);
+   return sk->sk_prot->close(sk, timeout);
+   }
+   close_fun = psock->save_close;
+   bpf_tcp_remove(sk, psock);
rcu_read_unlock();
release_sock(sk);
close_fun(sk, timeout);



[bpf PATCH v4 1/3] bpf: sockmap only allow ESTABLISHED sock state

2018-09-18 Thread John Fastabend
After this patch we only allow socks that are in ESTABLISHED state or
are being added via a sock_ops event that is transitioning into an
ESTABLISHED state. By allowing sock_ops events we allow users to
manage sockmaps directly from sock ops programs. The two supported
sock_ops ops are BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB and
BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB.

Similar to the TLS ULP, this ensures sk_user_data is correct.
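
(Illustrative sketch, not part of the patch: a sock_ops program that adds
socks from the two still-permitted ops, assuming the selftests
bpf_helpers.h.)

    #include <linux/bpf.h>
    #include "bpf_helpers.h"

    struct bpf_map_def SEC("maps") sock_map = {
            .type           = BPF_MAP_TYPE_SOCKMAP,
            .key_size       = sizeof(int),
            .value_size     = sizeof(int),
            .max_entries    = 2,
    };

    SEC("sockops")
    int add_established(struct bpf_sock_ops *skops)
    {
            int key = 0;

            switch (skops->op) {
            case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
            case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
                    /* sock is (about to be) ESTABLISHED, so allowed */
                    bpf_sock_map_update(skops, &sock_map, &key, BPF_ANY);
                    break;
            default:
                    break;
            }
            return 0;
    }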

Reported-by: Eric Dumazet 
Fixes: 1aa12bdf1bfb ("bpf: sockmap, add sock close() hook to remove socks")
Signed-off-by: John Fastabend 
Acked-by: Yonghong Song 
---
 kernel/bpf/sockmap.c |   31 ++-
 1 file changed, 30 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 488ef96..1f97b55 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -2097,8 +2097,12 @@ static int sock_map_update_elem(struct bpf_map *map,
return -EINVAL;
}
 
+   /* ULPs are currently supported only for TCP sockets in ESTABLISHED
+* state.
+*/
if (skops.sk->sk_type != SOCK_STREAM ||
-   skops.sk->sk_protocol != IPPROTO_TCP) {
+   skops.sk->sk_protocol != IPPROTO_TCP ||
+   skops.sk->sk_state != TCP_ESTABLISHED) {
fput(socket->file);
return -EOPNOTSUPP;
}
@@ -2453,6 +2457,16 @@ static int sock_hash_update_elem(struct bpf_map *map,
return -EINVAL;
}
 
+   /* ULPs are currently supported only for TCP sockets in ESTABLISHED
+* state.
+*/
+   if (skops.sk->sk_type != SOCK_STREAM ||
+   skops.sk->sk_protocol != IPPROTO_TCP ||
+   skops.sk->sk_state != TCP_ESTABLISHED) {
+   fput(socket->file);
+   return -EOPNOTSUPP;
+   }
+
lock_sock(skops.sk);
preempt_disable();
rcu_read_lock();
@@ -2543,10 +2557,22 @@ struct sock  *__sock_hash_lookup_elem(struct bpf_map 
*map, void *key)
.map_check_btf = map_check_no_btf,
 };
 
+static bool bpf_is_valid_sock_op(struct bpf_sock_ops_kern *ops)
+{
+   return ops->op == BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB ||
+  ops->op == BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB;
+}
 BPF_CALL_4(bpf_sock_map_update, struct bpf_sock_ops_kern *, bpf_sock,
   struct bpf_map *, map, void *, key, u64, flags)
 {
WARN_ON_ONCE(!rcu_read_lock_held());
+
+   /* ULPs are currently supported only for TCP sockets in ESTABLISHED
+* state. This checks that the sock ops triggering the update is
+* one indicating we are (or will be soon) in an ESTABLISHED state.
+*/
+   if (!bpf_is_valid_sock_op(bpf_sock))
+   return -EOPNOTSUPP;
return sock_map_ctx_update_elem(bpf_sock, map, key, flags);
 }
 
@@ -2565,6 +2591,9 @@ struct sock  *__sock_hash_lookup_elem(struct bpf_map 
*map, void *key)
   struct bpf_map *, map, void *, key, u64, flags)
 {
WARN_ON_ONCE(!rcu_read_lock_held());
+
+   if (!bpf_is_valid_sock_op(bpf_sock))
+   return -EOPNOTSUPP;
return sock_hash_ctx_update_elem(bpf_sock, map, key, flags);
 }
 



[bpf PATCH v4 3/3] bpf: test_maps, only support ESTABLISHED socks

2018-09-18 Thread John Fastabend
Ensure that sockets added to a sock{map|hash} that are not in the
ESTABLISHED state are rejected.

Fixes: 1aa12bdf1bfb ("bpf: sockmap, add sock close() hook to remove socks")
Signed-off-by: John Fastabend 
Acked-by: Yonghong Song 
---
 tools/testing/selftests/bpf/test_maps.c |   10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_maps.c 
b/tools/testing/selftests/bpf/test_maps.c
index 6f54f84..9b552c0 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -580,7 +580,11 @@ static void test_sockmap(int tasks, void *data)
/* Test update without programs */
for (i = 0; i < 6; i++) {
err = bpf_map_update_elem(fd, &i, &sfd[i], BPF_ANY);
-   if (err) {
+   if (i < 2 && !err) {
+   printf("Allowed update sockmap '%i:%i' not in 
ESTABLISHED\n",
+  i, sfd[i]);
+   goto out_sockmap;
+   } else if (i >= 2 && err) {
printf("Failed noprog update sockmap '%i:%i'\n",
   i, sfd[i]);
goto out_sockmap;
@@ -741,7 +745,7 @@ static void test_sockmap(int tasks, void *data)
}
 
/* Test map update elem afterwards fd lives in fd and map_fd */
-   for (i = 0; i < 6; i++) {
+   for (i = 2; i < 6; i++) {
err = bpf_map_update_elem(map_fd_rx, &i, &sfd[i], BPF_ANY);
if (err) {
printf("Failed map_fd_rx update sockmap %i '%i:%i'\n",
@@ -845,7 +849,7 @@ static void test_sockmap(int tasks, void *data)
}
 
/* Delete the elems without programs */
-   for (i = 0; i < 6; i++) {
+   for (i = 2; i < 6; i++) {
err = bpf_map_delete_elem(fd, &i);
if (err) {
printf("Failed delete sockmap %i '%i:%i'\n",



[bpf PATCH v4 0/3] bpf, sockmap ESTABLISHED state only

2018-09-18 Thread John Fastabend
Eric noted that using the close callback is not sufficient
to catch all transitions from ESTABLISHED state to a LISTEN
state. So this series does two things. First, only allow
adding socks in ESTABLISHED state and, second, use the unhash callback
to catch tcp_disconnect() transitions.

v2: added check for ESTABLISHED state in hash update sockmap as well
v3: Do not release lock from unhash in error path, no lock was
used in the first place. And drop not so useful code comments
v4: convert
      if (unhash()) return unhash(); return;
    to
      if (unhash()) unhash(); return;

Thanks for reviewing, Yonghong; I carried your ACKs forward.

---

John Fastabend (3):
  bpf: sockmap only allow ESTABLISHED sock state
  bpf: sockmap, fix transition through disconnect without close
  bpf: test_maps, only support ESTABLISHED socks


 kernel/bpf/sockmap.c|   91 ---
 tools/testing/selftests/bpf/test_maps.c |   10 ++-
 2 files changed, 78 insertions(+), 23 deletions(-)

--
Signature


Re: [bpf PATCH 2/3] bpf: sockmap, fix transition through disconnect without close

2018-09-17 Thread John Fastabend
On 09/17/2018 02:09 PM, Y Song wrote:
> On Mon, Sep 17, 2018 at 10:32 AM John Fastabend
>  wrote:
>>
>> It is possible (via shutdown()) for TCP socks to go through TCP_CLOSE
>> state via tcp_disconnect() without actually calling tcp_close, which
>> would then call our bpf_tcp_close() callback. Because of this a user
>> could disconnect a socket then put it in a LISTEN state which would
>> break our assumptions about sockets always being ESTABLISHED state.
>>
>> To resolve this rely on the unhash hook, which is called in the
>> disconnect case, to remove the sock from the sockmap.
>>
>> Reported-by: Eric Dumazet 
>> Fixes: 1aa12bdf1bfb ("bpf: sockmap, add sock close() hook to remove socks")
>> Signed-off-by: John Fastabend 
>> ---

[...]

>> +{
>> +   void (*unhash_fun)(struct sock *sk);
>> +   struct smap_psock *psock;
>> +
>> +   rcu_read_lock();
>> +   psock = smap_psock_sk(sk);
>> +   if (unlikely(!psock)) {
>> +   rcu_read_unlock();
>> +   release_sock(sk);
> 
> Can socket be released here?
>

Right, it was an error (it cannot be released there); fixed in v3.


>> +   return sk->sk_prot->unhash(sk);
>> +   }
>> +
>> +   /* The psock may be destroyed anytime after exiting the RCU critial
>> +* section so by the time we use close_fun the psock may no longer
>> +* be valid. However, bpf_tcp_close is called with the sock lock
>> +* held so the close hook and sk are still valid.
>> +*/
> 
> the comments above are not correct. A copy-paste mistake?

I just removed the comments; they are not overly helpful at this point,
and the commit msg is more useful anyway.

Thanks,
John


Re: [bpf PATCH 3/3] bpf: test_maps, only support ESTABLISHED socks

2018-09-17 Thread John Fastabend
On 09/17/2018 02:21 PM, Y Song wrote:
> On Mon, Sep 17, 2018 at 10:33 AM John Fastabend
>  wrote:
>>
>> Ensure that sockets added to a sock{map|hash} that are not in the
>> ESTABLISHED state are rejected.
>>
>> Fixes: 1aa12bdf1bfb ("bpf: sockmap, add sock close() hook to remove socks")
>> Signed-off-by: John Fastabend 
>> ---
>>  tools/testing/selftests/bpf/test_maps.c |   10 +++---
>>  1 file changed, 7 insertions(+), 3 deletions(-)
>>
>> diff --git a/tools/testing/selftests/bpf/test_maps.c 
>> b/tools/testing/selftests/bpf/test_maps.c
>> index 6f54f84..0f2090f 100644
>> --- a/tools/testing/selftests/bpf/test_maps.c
>> +++ b/tools/testing/selftests/bpf/test_maps.c
>> @@ -580,7 +580,11 @@ static void test_sockmap(int tasks, void *data)
>> /* Test update without programs */
>> for (i = 0; i < 6; i++) {
>> err = bpf_map_update_elem(fd, &i, &sfd[i], BPF_ANY);
>> -   if (err) {
>> +   if (i < 2 && !err) {
>> +   printf("Allowed update sockmap '%i:%i' not in 
>> ESTABLISHED\n",
>> +  i, sfd[i]);
>> +   goto out_sockmap;
>> +   } else if (i > 1 && err) {
> 
> Just a nit. Maybe "i >= 2" since it will be more clear since it is
> opposite of "i < 2"?
> 

Seems reasonable changed in v3 to 'i >= 2'. Thanks.


[bpf PATCH v3 1/3] bpf: sockmap only allow ESTABLISHED sock state

2018-09-17 Thread John Fastabend
After this patch we only allow socks that are in ESTABLISHED state or
are being added via a sock_ops event that is transitioning into an
ESTABLISHED state. By allowing sock_ops events we allow users to
manage sockmaps directly from sock ops programs. The two supported
sock_ops ops are BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB and
BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB.

Similar to the TLS ULP, this ensures sk_user_data is correct.

Reported-by: Eric Dumazet 
Fixes: 1aa12bdf1bfb ("bpf: sockmap, add sock close() hook to remove socks")
Signed-off-by: John Fastabend 
Acked-by: Yonghong Song 
---
 0 files changed

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 488ef96..1f97b55 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -2097,8 +2097,12 @@ static int sock_map_update_elem(struct bpf_map *map,
return -EINVAL;
}
 
+   /* ULPs are currently supported only for TCP sockets in ESTABLISHED
+* state.
+*/
if (skops.sk->sk_type != SOCK_STREAM ||
-   skops.sk->sk_protocol != IPPROTO_TCP) {
+   skops.sk->sk_protocol != IPPROTO_TCP ||
+   skops.sk->sk_state != TCP_ESTABLISHED) {
fput(socket->file);
return -EOPNOTSUPP;
}
@@ -2453,6 +2457,16 @@ static int sock_hash_update_elem(struct bpf_map *map,
return -EINVAL;
}
 
+   /* ULPs are currently supported only for TCP sockets in ESTABLISHED
+* state.
+*/
+   if (skops.sk->sk_type != SOCK_STREAM ||
+   skops.sk->sk_protocol != IPPROTO_TCP ||
+   skops.sk->sk_state != TCP_ESTABLISHED) {
+   fput(socket->file);
+   return -EOPNOTSUPP;
+   }
+
lock_sock(skops.sk);
preempt_disable();
rcu_read_lock();
@@ -2543,10 +2557,22 @@ struct sock  *__sock_hash_lookup_elem(struct bpf_map 
*map, void *key)
.map_check_btf = map_check_no_btf,
 };
 
+static bool bpf_is_valid_sock_op(struct bpf_sock_ops_kern *ops)
+{
+   return ops->op == BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB ||
+  ops->op == BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB;
+}
 BPF_CALL_4(bpf_sock_map_update, struct bpf_sock_ops_kern *, bpf_sock,
   struct bpf_map *, map, void *, key, u64, flags)
 {
WARN_ON_ONCE(!rcu_read_lock_held());
+
+   /* ULPs are currently supported only for TCP sockets in ESTABLISHED
+* state. This checks that the sock ops triggering the update is
+* one indicating we are (or will be soon) in an ESTABLISHED state.
+*/
+   if (!bpf_is_valid_sock_op(bpf_sock))
+   return -EOPNOTSUPP;
return sock_map_ctx_update_elem(bpf_sock, map, key, flags);
 }
 
@@ -2565,6 +2591,9 @@ struct sock  *__sock_hash_lookup_elem(struct bpf_map 
*map, void *key)
   struct bpf_map *, map, void *, key, u64, flags)
 {
WARN_ON_ONCE(!rcu_read_lock_held());
+
+   if (!bpf_is_valid_sock_op(bpf_sock))
+   return -EOPNOTSUPP;
return sock_hash_ctx_update_elem(bpf_sock, map, key, flags);
 }
 



[bpf PATCH v3 3/3] bpf: test_maps, only support ESTABLISHED socks

2018-09-17 Thread John Fastabend
Ensure that sockets added to a sock{map|hash} that are not in the
ESTABLISHED state are rejected.

Fixes: 1aa12bdf1bfb ("bpf: sockmap, add sock close() hook to remove socks")
Signed-off-by: John Fastabend 
---
 tools/testing/selftests/bpf/test_maps.c |   10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_maps.c 
b/tools/testing/selftests/bpf/test_maps.c
index 6f54f84..9b552c0 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -580,7 +580,11 @@ static void test_sockmap(int tasks, void *data)
/* Test update without programs */
for (i = 0; i < 6; i++) {
err = bpf_map_update_elem(fd, &i, &sfd[i], BPF_ANY);
-   if (err) {
+   if (i < 2 && !err) {
+   printf("Allowed update sockmap '%i:%i' not in 
ESTABLISHED\n",
+  i, sfd[i]);
+   goto out_sockmap;
+   } else if (i >= 2 && err) {
printf("Failed noprog update sockmap '%i:%i'\n",
   i, sfd[i]);
goto out_sockmap;
@@ -741,7 +745,7 @@ static void test_sockmap(int tasks, void *data)
}
 
/* Test map update elem afterwards fd lives in fd and map_fd */
-   for (i = 0; i < 6; i++) {
+   for (i = 2; i < 6; i++) {
err = bpf_map_update_elem(map_fd_rx, &i, &sfd[i], BPF_ANY);
if (err) {
printf("Failed map_fd_rx update sockmap %i '%i:%i'\n",
@@ -845,7 +849,7 @@ static void test_sockmap(int tasks, void *data)
}
 
/* Delete the elems without programs */
-   for (i = 0; i < 6; i++) {
+   for (i = 2; i < 6; i++) {
err = bpf_map_delete_elem(fd, &i);
if (err) {
printf("Failed delete sockmap %i '%i:%i'\n",



[bpf PATCH v3 2/3] bpf: sockmap, fix transition through disconnect without close

2018-09-17 Thread John Fastabend
It is possible (via shutdown()) for TCP socks to go through TCP_CLOSE
state via tcp_disconnect() without actually calling tcp_close, which
would then call our bpf_tcp_close() callback. Because of this a user
could disconnect a socket then put it in a LISTEN state which would
break our assumptions about sockets always being ESTABLISHED state.

To resolve this rely on the unhash hook, which is called in the
disconnect case, to remove the sock from the sockmap.

Reported-by: Eric Dumazet 
Fixes: 1aa12bdf1bfb ("bpf: sockmap, add sock close() hook to remove socks")
Signed-off-by: John Fastabend 
---
 0 files changed

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 1f97b55..5680d65 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -132,6 +132,7 @@ struct smap_psock {
struct work_struct gc_work;
 
struct proto *sk_proto;
+   void (*save_unhash)(struct sock *sk);
void (*save_close)(struct sock *sk, long timeout);
void (*save_data_ready)(struct sock *sk);
void (*save_write_space)(struct sock *sk);
@@ -143,6 +144,7 @@ static int bpf_tcp_recvmsg(struct sock *sk, struct msghdr 
*msg, size_t len,
 static int bpf_tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
 static int bpf_tcp_sendpage(struct sock *sk, struct page *page,
int offset, size_t size, int flags);
+static void bpf_tcp_unhash(struct sock *sk);
 static void bpf_tcp_close(struct sock *sk, long timeout);
 
 static inline struct smap_psock *smap_psock_sk(const struct sock *sk)
@@ -184,6 +186,7 @@ static void build_protos(struct proto 
prot[SOCKMAP_NUM_CONFIGS],
 struct proto *base)
 {
prot[SOCKMAP_BASE]  = *base;
+   prot[SOCKMAP_BASE].unhash   = bpf_tcp_unhash;
prot[SOCKMAP_BASE].close= bpf_tcp_close;
prot[SOCKMAP_BASE].recvmsg  = bpf_tcp_recvmsg;
prot[SOCKMAP_BASE].stream_memory_read   = bpf_tcp_stream_read;
@@ -217,6 +220,7 @@ static int bpf_tcp_init(struct sock *sk)
return -EBUSY;
}
 
+   psock->save_unhash = sk->sk_prot->unhash;
psock->save_close = sk->sk_prot->close;
psock->sk_proto = sk->sk_prot;
 
@@ -305,30 +309,12 @@ static struct smap_psock_map_entry *psock_map_pop(struct 
sock *sk,
return e;
 }
 
-static void bpf_tcp_close(struct sock *sk, long timeout)
+static void bpf_tcp_remove(struct sock *sk, struct smap_psock *psock)
 {
-   void (*close_fun)(struct sock *sk, long timeout);
struct smap_psock_map_entry *e;
struct sk_msg_buff *md, *mtmp;
-   struct smap_psock *psock;
struct sock *osk;
 
-   lock_sock(sk);
-   rcu_read_lock();
-   psock = smap_psock_sk(sk);
-   if (unlikely(!psock)) {
-   rcu_read_unlock();
-   release_sock(sk);
-   return sk->sk_prot->close(sk, timeout);
-   }
-
-   /* The psock may be destroyed anytime after exiting the RCU critial
-* section so by the time we use close_fun the psock may no longer
-* be valid. However, bpf_tcp_close is called with the sock lock
-* held so the close hook and sk are still valid.
-*/
-   close_fun = psock->save_close;
-
if (psock->cork) {
free_start_sg(psock->sock, psock->cork, true);
kfree(psock->cork);
@@ -379,6 +365,42 @@ static void bpf_tcp_close(struct sock *sk, long timeout)
kfree(e);
e = psock_map_pop(sk, psock);
}
+}
+
+static void bpf_tcp_unhash(struct sock *sk)
+{
+   void (*unhash_fun)(struct sock *sk);
+   struct smap_psock *psock;
+
+   rcu_read_lock();
+   psock = smap_psock_sk(sk);
+   if (unlikely(!psock)) {
+   rcu_read_unlock();
+   if (sk->sk_prot->unhash)
+   return sk->sk_prot->unhash(sk);
+   return;
+   }
+   unhash_fun = psock->save_unhash;
+   bpf_tcp_remove(sk, psock);
+   rcu_read_unlock();
+   unhash_fun(sk);
+}
+
+static void bpf_tcp_close(struct sock *sk, long timeout)
+{
+   void (*close_fun)(struct sock *sk, long timeout);
+   struct smap_psock *psock;
+
+   lock_sock(sk);
+   rcu_read_lock();
+   psock = smap_psock_sk(sk);
+   if (unlikely(!psock)) {
+   rcu_read_unlock();
+   release_sock(sk);
+   return sk->sk_prot->close(sk, timeout);
+   }
+   close_fun = psock->save_close;
+   bpf_tcp_remove(sk, psock);
rcu_read_unlock();
release_sock(sk);
close_fun(sk, timeout);



[bpf PATCH v3 0/3] bpf, sockmap ESTABLISHED state only

2018-09-17 Thread John Fastabend
Eric noted that using the close callback is not sufficient
to catch all transitions from ESTABLISHED state to a LISTEN
state. So this series does two things. First, only allow
adding socks in ESTABLISHED state and, second, use the unhash callback
to catch tcp_disconnect() transitions.

v2: added check for ESTABLISHED state in hash update sockmap as well
v3: Do not release lock from unhash in error path, no lock was
used in the first place. And drop not so useful code comments

Thanks for reviewing, Yonghong; I carried your ACK forward
on patch 1/3.

Thanks,
John

---

John Fastabend (3):
  bpf: sockmap only allow ESTABLISHED sock state
  bpf: sockmap, fix transition through disconnect without close
  bpf: test_maps, only support ESTABLISHED socks


 tools/testing/selftests/bpf/test_maps.c |   10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

--
Signature


Re: [PATCH ethtool] ethtool: support combinations of FEC modes

2018-09-17 Thread John W. Linville
On Wed, Sep 05, 2018 at 06:54:57PM +0100, Edward Cree wrote:
> Of the three drivers that currently support FEC configuration, two (sfc
>  and cxgb4[vf]) accept configurations with more than one bit set in the
>  feccmd.fec bitmask.  (The precise semantics of these combinations vary.)
> Thus, this patch adds the ability to specify such combinations through a
>  comma-separated list of FEC modes in the 'encoding' argument on the
>  command line.
> 
> Also adds --set-fec tests to test-cmdline.c, and corrects the man page
>  (the encoding argument is not optional) while updating it.
> 
> Signed-off-by: Edward Cree 
> ---
> I've CCed the maintainers of the other drivers (cxgb4, nfp) that support
>  --set-fec, in case they have opinions on this.
> I'm not totally happy with the man page changebar; it might be clearer
>  just to leave the comma-less version in the syntax synopsis and only
>  mention the commas in the running-text.

LGTM -- queued for next release...thanks!

John
 
>  ethtool.8.in   | 11 ---
>  ethtool.c  | 50 +++---
>  test-cmdline.c |  9 +
>  3 files changed, 56 insertions(+), 14 deletions(-)
> 
> diff --git a/ethtool.8.in b/ethtool.8.in
> index c8a902a..414eaa1 100644
> --- a/ethtool.8.in
> +++ b/ethtool.8.in
> @@ -389,7 +389,8 @@ ethtool \- query or control network driver and hardware 
> settings
>  .HP
>  .B ethtool \-\-set\-fec
>  .I devname
> -.B4 encoding auto off rs baser
> +.B encoding
> +.BR auto | off | rs | baser [ , ...]
>  .
>  .\" Adjust lines (i.e. full justification) and hyphenate.
>  .ad
> @@ -1119,8 +1120,12 @@ current FEC mode, the driver or firmware must take the 
> link down
>  administratively and report the problem in the system logs for users to 
> correct.
>  .RS 4
>  .TP
> -.A4 encoding auto off rs baser
> -Sets the FEC encoding for the device.
> +.BR encoding\ auto | off | rs | baser [ , ...]
> +
> +Sets the FEC encoding for the device.  Combinations of options are specified 
> as
> +e.g.
> +.B auto,rs
> +; the semantics of such combinations vary between drivers.
>  .TS
>  nokeep;
>  lB   l.
> diff --git a/ethtool.c b/ethtool.c
> index e8b7703..9997930 100644
> --- a/ethtool.c
> +++ b/ethtool.c
> @@ -4967,20 +4967,48 @@ static int do_set_phy_tunable(struct cmd_context *ctx)
>  
>  static int fecmode_str_to_type(const char *str)
>  {
> + if (!strcasecmp(str, "auto"))
> + return ETHTOOL_FEC_AUTO;
> + if (!strcasecmp(str, "off"))
> + return ETHTOOL_FEC_OFF;
> + if (!strcasecmp(str, "rs"))
> + return ETHTOOL_FEC_RS;
> + if (!strcasecmp(str, "baser"))
> + return ETHTOOL_FEC_BASER;
> +
> + return 0;
> +}
> +
> +/* Takes a comma-separated list of FEC modes, returns the bitwise OR of their
> + * corresponding ETHTOOL_FEC_* constants.
> + * Accepts repetitions (e.g. 'auto,auto') and trailing comma (e.g. 'off,').
> + */
> +static int parse_fecmode(const char *str)
> +{
>   int fecmode = 0;
> + char buf[6];
>  
>   if (!str)
> - return fecmode;
> -
> - if (!strcasecmp(str, "auto"))
> - fecmode |= ETHTOOL_FEC_AUTO;
> - else if (!strcasecmp(str, "off"))
> - fecmode |= ETHTOOL_FEC_OFF;
> - else if (!strcasecmp(str, "rs"))
> - fecmode |= ETHTOOL_FEC_RS;
> - else if (!strcasecmp(str, "baser"))
> - fecmode |= ETHTOOL_FEC_BASER;
> + return 0;
> + while (*str) {
> + size_t next;
> + int mode;
>  
> + next = strcspn(str, ",");
> + if (next >= 6) /* Bad mode, longest name is 5 chars */
> + return 0;
> + /* Copy into temp buffer and nul-terminate */
> + memcpy(buf, str, next);
> + buf[next] = 0;
> + mode = fecmode_str_to_type(buf);
> + if (!mode) /* Bad mode encountered */
> + return 0;
> + fecmode |= mode;
> + str += next;
> + /* Skip over ',' (but not nul) */
> + if (*str)
> + str++;
> + }
>   return fecmode;
>  }
>  
> @@ -5028,7 +5056,7 @@ static int do_sfec(struct cmd_context *ctx)
>   if (!fecmode_str)
>   exit_bad_args();
>  
> - fecmode = fecmode_str_to_type(fecmode_str);
> + fecmode = parse_fecmode(fecmode_str);
>   if (!fecmode)
>   exit_bad_args();
>  
> diff --git a/test-cmdline.c b/test-cmdline.c
> index

Re: [bpf PATCH v2 2/3] bpf: sockmap, fix transition through disconnect without close

2018-09-17 Thread John Fastabend
On 09/17/2018 10:59 AM, John Fastabend wrote:
> It is possible (via shutdown()) for TCP socks to go through TCP_CLOSE
> state via tcp_disconnect() without actually calling tcp_close, which
> would then call our bpf_tcp_close() callback. Because of this a user
> could disconnect a socket then put it in a LISTEN state which would
> break our assumptions about sockets always being ESTABLISHED state.
> 
> To resolve this rely on the unhash hook, which is called in the
> disconnect case, to remove the sock from the sockmap.
> 

Sorry for the noise; will need a v3 actually.

> Reported-by: Eric Dumazet 
> Fixes: 1aa12bdf1bfb ("bpf: sockmap, add sock close() hook to remove socks")
> Signed-off-by: John Fastabend 
> ---
>  kernel/bpf/sockmap.c |   71 
> +-
>  1 file changed, 52 insertions(+), 19 deletions(-)

[...]


> +}
> +
> +static void bpf_tcp_unhash(struct sock *sk)
> +{
> + void (*unhash_fun)(struct sock *sk);
> + struct smap_psock *psock;
> +
> + rcu_read_lock();
> + psock = smap_psock_sk(sk);
> + if (unlikely(!psock)) {
> + rcu_read_unlock();
> + release_sock(sk);
 ^
> + return sk->sk_prot->unhash(sk);

if (sk->sk_prot->unhash) ...
else return;

Thanks,
John


[bpf PATCH v2 3/3] bpf: test_maps, only support ESTABLISHED socks

2018-09-17 Thread John Fastabend
Ensure that sockets added to a sock{map|hash} that are not in the
ESTABLISHED state are rejected.

Fixes: 1aa12bdf1bfb ("bpf: sockmap, add sock close() hook to remove socks")
Signed-off-by: John Fastabend 
---
 tools/testing/selftests/bpf/test_maps.c |   10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_maps.c 
b/tools/testing/selftests/bpf/test_maps.c
index 6f54f84..0f2090f 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -580,7 +580,11 @@ static void test_sockmap(int tasks, void *data)
/* Test update without programs */
for (i = 0; i < 6; i++) {
err = bpf_map_update_elem(fd, , [i], BPF_ANY);
-   if (err) {
+   if (i < 2 && !err) {
+   printf("Allowed update sockmap '%i:%i' not in 
ESTABLISHED\n",
+  i, sfd[i]);
+   goto out_sockmap;
+   } else if (i > 1 && err) {
printf("Failed noprog update sockmap '%i:%i'\n",
   i, sfd[i]);
goto out_sockmap;
@@ -741,7 +745,7 @@ static void test_sockmap(int tasks, void *data)
}
 
/* Test map update elem afterwards fd lives in fd and map_fd */
-   for (i = 0; i < 6; i++) {
+   for (i = 2; i < 6; i++) {
err = bpf_map_update_elem(map_fd_rx, , [i], BPF_ANY);
if (err) {
printf("Failed map_fd_rx update sockmap %i '%i:%i'\n",
@@ -845,7 +849,7 @@ static void test_sockmap(int tasks, void *data)
}
 
/* Delete the elems without programs */
-   for (i = 0; i < 6; i++) {
+   for (i = 2; i < 6; i++) {
err = bpf_map_delete_elem(fd, );
if (err) {
printf("Failed delete sockmap %i '%i:%i'\n",



[bpf PATCH v2 1/3] bpf: sockmap only allow ESTABLISHED sock state

2018-09-17 Thread John Fastabend
After this patch we only allow socks that are in ESTABLISHED state or
are being added via a sock_ops event that is transitioning into an
ESTABLISHED state. Accepting sock_ops events lets users manage
sockmaps directly from sock_ops programs. The two supported
sock_ops ops are BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB and
BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB.

Similar to TLS ULP this ensures sk_user_data is correct.
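
As a usage sketch (not part of this patch), a sock_ops program that
inserts sockets from the two allowed callbacks might look as follows;
the map definition style and helper declaration follow the selftests'
bpf_helpers.h and are illustrative:

#include <linux/bpf.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") sock_map = {
	.type		= BPF_MAP_TYPE_SOCKMAP,
	.key_size	= sizeof(int),
	.value_size	= sizeof(int),
	.max_entries	= 2,
};

SEC("sockops")
int bpf_add_established(struct bpf_sock_ops *skops)
{
	int key = 0;

	switch (skops->op) {
	case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
	case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
		/* Sock is (or is about to be) ESTABLISHED: allowed */
		bpf_sock_map_update(skops, &sock_map, &key, BPF_ANY);
		break;
	default:
		/* Any other op now gets -EOPNOTSUPP from the helper */
		break;
	}
	return 0;
}

char _license[] SEC("license") = "GPL";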

Reported-by: Eric Dumazet 
Fixes: 1aa12bdf1bfb ("bpf: sockmap, add sock close() hook to remove socks")
Signed-off-by: John Fastabend 
---
 kernel/bpf/sockmap.c |   31 ++-
 1 file changed, 30 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 488ef96..1f97b55 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -2097,8 +2097,12 @@ static int sock_map_update_elem(struct bpf_map *map,
return -EINVAL;
}
 
+   /* ULPs are currently supported only for TCP sockets in ESTABLISHED
+* state.
+*/
if (skops.sk->sk_type != SOCK_STREAM ||
-   skops.sk->sk_protocol != IPPROTO_TCP) {
+   skops.sk->sk_protocol != IPPROTO_TCP ||
+   skops.sk->sk_state != TCP_ESTABLISHED) {
fput(socket->file);
return -EOPNOTSUPP;
}
@@ -2453,6 +2457,16 @@ static int sock_hash_update_elem(struct bpf_map *map,
return -EINVAL;
}
 
+   /* ULPs are currently supported only for TCP sockets in ESTABLISHED
+* state.
+*/
+   if (skops.sk->sk_type != SOCK_STREAM ||
+   skops.sk->sk_protocol != IPPROTO_TCP ||
+   skops.sk->sk_state != TCP_ESTABLISHED) {
+   fput(socket->file);
+   return -EOPNOTSUPP;
+   }
+
lock_sock(skops.sk);
preempt_disable();
rcu_read_lock();
@@ -2543,10 +2557,22 @@ struct sock  *__sock_hash_lookup_elem(struct bpf_map 
*map, void *key)
.map_check_btf = map_check_no_btf,
 };
 
+static bool bpf_is_valid_sock_op(struct bpf_sock_ops_kern *ops)
+{
+   return ops->op == BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB ||
+  ops->op == BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB;
+}
 BPF_CALL_4(bpf_sock_map_update, struct bpf_sock_ops_kern *, bpf_sock,
   struct bpf_map *, map, void *, key, u64, flags)
 {
WARN_ON_ONCE(!rcu_read_lock_held());
+
+   /* ULPs are currently supported only for TCP sockets in ESTABLISHED
+* state. This checks that the sock ops triggering the update is
+* one indicating we are (or will be soon) in an ESTABLISHED state.
+*/
+   if (!bpf_is_valid_sock_op(bpf_sock))
+   return -EOPNOTSUPP;
return sock_map_ctx_update_elem(bpf_sock, map, key, flags);
 }
 
@@ -2565,6 +2591,9 @@ struct sock  *__sock_hash_lookup_elem(struct bpf_map 
*map, void *key)
   struct bpf_map *, map, void *, key, u64, flags)
 {
WARN_ON_ONCE(!rcu_read_lock_held());
+
+   if (!bpf_is_valid_sock_op(bpf_sock))
+   return -EOPNOTSUPP;
return sock_hash_ctx_update_elem(bpf_sock, map, key, flags);
 }
 



[bpf PATCH v2 2/3] bpf: sockmap, fix transition through disconnect without close

2018-09-17 Thread John Fastabend
It is possible (via shutdown()) for TCP socks to go through TCP_CLOSE
state via tcp_disconnect() without actually calling tcp_close which
would then call our bpf_tcp_close() callback. Because of this a user
could disconnect a socket then put it in a LISTEN state which would
break our assumptions about sockets always being in ESTABLISHED state.

To resolve this rely on the unhash hook, which is called in the
disconnect case, to remove the sock from the sockmap.
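
To make the transition concrete, a userspace sketch of the sequence
described above; fd is assumed to be a TCP socket that was ESTABLISHED
and already added to a sockmap:

#include <sys/socket.h>

static void reuse_as_listener(int fd)
{
	/* Takes the sock through tcp_disconnect() to TCP_CLOSE without
	 * entering tcp_close(), so bpf_tcp_close() never runs...
	 */
	shutdown(fd, SHUT_RDWR);

	/* ...after which the sock can reach TCP_LISTEN while still
	 * linked into the sockmap. The unhash hook added below removes
	 * it during the disconnect instead.
	 */
	listen(fd, 1);
}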

Reported-by: Eric Dumazet 
Fixes: 1aa12bdf1bfb ("bpf: sockmap, add sock close() hook to remove socks")
Signed-off-by: John Fastabend 
---
 kernel/bpf/sockmap.c |   71 +-
 1 file changed, 52 insertions(+), 19 deletions(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 1f97b55..7deb362 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -132,6 +132,7 @@ struct smap_psock {
struct work_struct gc_work;
 
struct proto *sk_proto;
+   void (*save_unhash)(struct sock *sk);
void (*save_close)(struct sock *sk, long timeout);
void (*save_data_ready)(struct sock *sk);
void (*save_write_space)(struct sock *sk);
@@ -143,6 +144,7 @@ static int bpf_tcp_recvmsg(struct sock *sk, struct msghdr 
*msg, size_t len,
 static int bpf_tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
 static int bpf_tcp_sendpage(struct sock *sk, struct page *page,
int offset, size_t size, int flags);
+static void bpf_tcp_unhash(struct sock *sk);
 static void bpf_tcp_close(struct sock *sk, long timeout);
 
 static inline struct smap_psock *smap_psock_sk(const struct sock *sk)
@@ -184,6 +186,7 @@ static void build_protos(struct proto 
prot[SOCKMAP_NUM_CONFIGS],
 struct proto *base)
 {
prot[SOCKMAP_BASE]  = *base;
+   prot[SOCKMAP_BASE].unhash   = bpf_tcp_unhash;
prot[SOCKMAP_BASE].close= bpf_tcp_close;
prot[SOCKMAP_BASE].recvmsg  = bpf_tcp_recvmsg;
prot[SOCKMAP_BASE].stream_memory_read   = bpf_tcp_stream_read;
@@ -217,6 +220,7 @@ static int bpf_tcp_init(struct sock *sk)
return -EBUSY;
}
 
+   psock->save_unhash = sk->sk_prot->unhash;
psock->save_close = sk->sk_prot->close;
psock->sk_proto = sk->sk_prot;
 
@@ -305,30 +309,12 @@ static struct smap_psock_map_entry *psock_map_pop(struct 
sock *sk,
return e;
 }
 
-static void bpf_tcp_close(struct sock *sk, long timeout)
+static void bpf_tcp_remove(struct sock *sk, struct smap_psock *psock)
 {
-   void (*close_fun)(struct sock *sk, long timeout);
struct smap_psock_map_entry *e;
struct sk_msg_buff *md, *mtmp;
-   struct smap_psock *psock;
struct sock *osk;
 
-   lock_sock(sk);
-   rcu_read_lock();
-   psock = smap_psock_sk(sk);
-   if (unlikely(!psock)) {
-   rcu_read_unlock();
-   release_sock(sk);
-   return sk->sk_prot->close(sk, timeout);
-   }
-
-   /* The psock may be destroyed anytime after exiting the RCU critial
-* section so by the time we use close_fun the psock may no longer
-* be valid. However, bpf_tcp_close is called with the sock lock
-* held so the close hook and sk are still valid.
-*/
-   close_fun = psock->save_close;
-
if (psock->cork) {
free_start_sg(psock->sock, psock->cork, true);
kfree(psock->cork);
@@ -379,6 +365,53 @@ static void bpf_tcp_close(struct sock *sk, long timeout)
kfree(e);
e = psock_map_pop(sk, psock);
}
+}
+
+static void bpf_tcp_unhash(struct sock *sk)
+{
+   void (*unhash_fun)(struct sock *sk);
+   struct smap_psock *psock;
+
+   rcu_read_lock();
+   psock = smap_psock_sk(sk);
+   if (unlikely(!psock)) {
+   rcu_read_unlock();
+   release_sock(sk);
+   return sk->sk_prot->unhash(sk);
+   }
+
+   /* The psock may be destroyed anytime after exiting the RCU critical
+* section so by the time we use close_fun the psock may no longer
+* be valid. However, bpf_tcp_close is called with the sock lock
+* held so the close hook and sk are still valid.
+*/
+   unhash_fun = psock->save_unhash;
+   bpf_tcp_remove(sk, psock);
+   rcu_read_unlock();
+   unhash_fun(sk);
+}
+
+static void bpf_tcp_close(struct sock *sk, long timeout)
+{
+   void (*close_fun)(struct sock *sk, long timeout);
+   struct smap_psock *psock;
+
+   lock_sock(sk);
+   rcu_read_lock();
+   psock = smap_psock_sk(sk);
+   if (unlikely(!psock)) {
+   rcu_read_unlock();
+   release_sock(sk);
+   return sk->sk_prot->close(sk, timeout);
+   }
+
+   /* The psock may be destroyed anytime 

[bpf PATCH v2 0/3] bpf, sockmap ESTABLISHED state only

2018-09-17 Thread John Fastabend
Eric noted that using the close callback is not sufficient
to catch all transitions from ESTABLISHED state to a LISTEN
state. So this series does two things. First, only allow
adding socks in the ESTABLISHED state, and second, use the unhash callback
to catch tcp_disconnect() transitions.

v2: Added check for ESTABLISH state in hash update sockmap as well.

Thanks,
John

---

John Fastabend (3):
  bpf: sockmap only allow ESTABLISHED sock state
  bpf: sockmap, fix transition through disconnect without close
  bpf: test_maps, only support ESTABLISHED socks


 kernel/bpf/sockmap.c|  102 +--
 tools/testing/selftests/bpf/test_maps.c |   10 ++-
 2 files changed, 89 insertions(+), 23 deletions(-)

--
Signature


[bpf PATCH 3/3] bpf: test_maps, only support ESTABLISHED socks

2018-09-17 Thread John Fastabend
Ensure that sockets added to a sock{map|hash} that are not in the
ESTABLISHED state are rejected.

Fixes: 1aa12bdf1bfb ("bpf: sockmap, add sock close() hook to remove socks")
Signed-off-by: John Fastabend 
---
 tools/testing/selftests/bpf/test_maps.c |   10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_maps.c 
b/tools/testing/selftests/bpf/test_maps.c
index 6f54f84..0f2090f 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -580,7 +580,11 @@ static void test_sockmap(int tasks, void *data)
/* Test update without programs */
for (i = 0; i < 6; i++) {
err = bpf_map_update_elem(fd, , [i], BPF_ANY);
-   if (err) {
+   if (i < 2 && !err) {
+   printf("Allowed update sockmap '%i:%i' not in 
ESTABLISHED\n",
+  i, sfd[i]);
+   goto out_sockmap;
+   } else if (i > 1 && err) {
printf("Failed noprog update sockmap '%i:%i'\n",
   i, sfd[i]);
goto out_sockmap;
@@ -741,7 +745,7 @@ static void test_sockmap(int tasks, void *data)
}
 
/* Test map update elem afterwards fd lives in fd and map_fd */
-   for (i = 0; i < 6; i++) {
+   for (i = 2; i < 6; i++) {
err = bpf_map_update_elem(map_fd_rx, , [i], BPF_ANY);
if (err) {
printf("Failed map_fd_rx update sockmap %i '%i:%i'\n",
@@ -845,7 +849,7 @@ static void test_sockmap(int tasks, void *data)
}
 
/* Delete the elems without programs */
-   for (i = 0; i < 6; i++) {
+   for (i = 2; i < 6; i++) {
err = bpf_map_delete_elem(fd, );
if (err) {
printf("Failed delete sockmap %i '%i:%i'\n",



[bpf PATCH 2/3] bpf: sockmap, fix transition through disconnect without close

2018-09-17 Thread John Fastabend
It is possible (via shutdown()) for TCP socks to go through TCP_CLOSE
state via tcp_disconnect() without actually calling tcp_close which
would then call our bpf_tcp_close() callback. Because of this a user
could disconnect a socket then put it in a LISTEN state which would
break our assumptions about sockets always being in ESTABLISHED state.

To resolve this rely on the unhash hook, which is called in the
disconnect case, to remove the sock from the sockmap.

Reported-by: Eric Dumazet 
Fixes: 1aa12bdf1bfb ("bpf: sockmap, add sock close() hook to remove socks")
Signed-off-by: John Fastabend 
---
 kernel/bpf/sockmap.c |   71 +-
 1 file changed, 52 insertions(+), 19 deletions(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 998b7bd..f6ab7f3 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -132,6 +132,7 @@ struct smap_psock {
struct work_struct gc_work;
 
struct proto *sk_proto;
+   void (*save_unhash)(struct sock *sk);
void (*save_close)(struct sock *sk, long timeout);
void (*save_data_ready)(struct sock *sk);
void (*save_write_space)(struct sock *sk);
@@ -143,6 +144,7 @@ static int bpf_tcp_recvmsg(struct sock *sk, struct msghdr 
*msg, size_t len,
 static int bpf_tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
 static int bpf_tcp_sendpage(struct sock *sk, struct page *page,
int offset, size_t size, int flags);
+static void bpf_tcp_unhash(struct sock *sk);
 static void bpf_tcp_close(struct sock *sk, long timeout);
 
 static inline struct smap_psock *smap_psock_sk(const struct sock *sk)
@@ -184,6 +186,7 @@ static void build_protos(struct proto 
prot[SOCKMAP_NUM_CONFIGS],
 struct proto *base)
 {
prot[SOCKMAP_BASE]  = *base;
+   prot[SOCKMAP_BASE].unhash   = bpf_tcp_unhash;
prot[SOCKMAP_BASE].close= bpf_tcp_close;
prot[SOCKMAP_BASE].recvmsg  = bpf_tcp_recvmsg;
prot[SOCKMAP_BASE].stream_memory_read   = bpf_tcp_stream_read;
@@ -217,6 +220,7 @@ static int bpf_tcp_init(struct sock *sk)
return -EBUSY;
}
 
+   psock->save_unhash = sk->sk_prot->unhash;
psock->save_close = sk->sk_prot->close;
psock->sk_proto = sk->sk_prot;
 
@@ -305,30 +309,12 @@ static struct smap_psock_map_entry *psock_map_pop(struct 
sock *sk,
return e;
 }
 
-static void bpf_tcp_close(struct sock *sk, long timeout)
+static void bpf_tcp_remove(struct sock *sk, struct smap_psock *psock)
 {
-   void (*close_fun)(struct sock *sk, long timeout);
struct smap_psock_map_entry *e;
struct sk_msg_buff *md, *mtmp;
-   struct smap_psock *psock;
struct sock *osk;
 
-   lock_sock(sk);
-   rcu_read_lock();
-   psock = smap_psock_sk(sk);
-   if (unlikely(!psock)) {
-   rcu_read_unlock();
-   release_sock(sk);
-   return sk->sk_prot->close(sk, timeout);
-   }
-
-   /* The psock may be destroyed anytime after exiting the RCU critial
-* section so by the time we use close_fun the psock may no longer
-* be valid. However, bpf_tcp_close is called with the sock lock
-* held so the close hook and sk are still valid.
-*/
-   close_fun = psock->save_close;
-
if (psock->cork) {
free_start_sg(psock->sock, psock->cork, true);
kfree(psock->cork);
@@ -379,6 +365,53 @@ static void bpf_tcp_close(struct sock *sk, long timeout)
kfree(e);
e = psock_map_pop(sk, psock);
}
+}
+
+static void bpf_tcp_unhash(struct sock *sk)
+{
+   void (*unhash_fun)(struct sock *sk);
+   struct smap_psock *psock;
+
+   rcu_read_lock();
+   psock = smap_psock_sk(sk);
+   if (unlikely(!psock)) {
+   rcu_read_unlock();
+   release_sock(sk);
+   return sk->sk_prot->unhash(sk);
+   }
+
+   /* The psock may be destroyed anytime after exiting the RCU critical
+* section so by the time we use close_fun the psock may no longer
+* be valid. However, bpf_tcp_close is called with the sock lock
+* held so the close hook and sk are still valid.
+*/
+   unhash_fun = psock->save_unhash;
+   bpf_tcp_remove(sk, psock);
+   rcu_read_unlock();
+   unhash_fun(sk);
+}
+
+static void bpf_tcp_close(struct sock *sk, long timeout)
+{
+   void (*close_fun)(struct sock *sk, long timeout);
+   struct smap_psock *psock;
+
+   lock_sock(sk);
+   rcu_read_lock();
+   psock = smap_psock_sk(sk);
+   if (unlikely(!psock)) {
+   rcu_read_unlock();
+   release_sock(sk);
+   return sk->sk_prot->close(sk, timeout);
+   }
+
+   /* The psock may be destroyed anytime 

[bpf PATCH 0/3] bpf, sockmap ESTABLISHED state only

2018-09-17 Thread John Fastabend
Eric noted that using the close callback is not sufficient
to catch all transitions from ESTABLISHED state to a LISTEN
state. So this series does two things. First, only allow
adding socks in the ESTABLISHED state, and second, use the unhash callback
to catch tcp_disconnect() transitions.

Thanks,
John


---

John Fastabend (3):
  bpf: sockmap only allow ESTABLISHED sock state
  bpf: sockmap, fix transition through disconnect without close
  bpf: test_maps, only support ESTABLISHED socks


 kernel/bpf/sockmap.c|   92 ---
 tools/testing/selftests/bpf/test_maps.c |   10 ++-
 2 files changed, 79 insertions(+), 23 deletions(-)

--
Signature


[bpf PATCH 1/3] bpf: sockmap only allow ESTABLISHED sock state

2018-09-17 Thread John Fastabend
After this patch we only allow socks that are in ESTABLISHED state or
are being added via a sock_ops event that is transitioning into an
ESTABLISHED state. Accepting sock_ops events lets users manage
sockmaps directly from sock_ops programs. The two supported
sock_ops ops are BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB and
BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB.

Similar to TLS ULP this ensures sk_user_data is correct.

Reported-by: Eric Dumazet 
Fixes: 1aa12bdf1bfb ("bpf: sockmap, add sock close() hook to remove socks")
Signed-off-by: John Fastabend 
---
 kernel/bpf/sockmap.c |   21 -
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 488ef96..998b7bd 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -2097,8 +2097,12 @@ static int sock_map_update_elem(struct bpf_map *map,
return -EINVAL;
}
 
+   /* ULPs are currently supported only for TCP sockets in ESTABLISHED
+* state.
+*/
if (skops.sk->sk_type != SOCK_STREAM ||
-   skops.sk->sk_protocol != IPPROTO_TCP) {
+   skops.sk->sk_protocol != IPPROTO_TCP ||
+   skops.sk->sk_state != TCP_ESTABLISHED) {
fput(socket->file);
return -EOPNOTSUPP;
}
@@ -2543,10 +2547,22 @@ struct sock  *__sock_hash_lookup_elem(struct bpf_map 
*map, void *key)
.map_check_btf = map_check_no_btf,
 };
 
+static bool bpf_is_valid_sock_op(struct bpf_sock_ops_kern *ops)
+{
+   return ops->op == BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB ||
+  ops->op == BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB;
+}
 BPF_CALL_4(bpf_sock_map_update, struct bpf_sock_ops_kern *, bpf_sock,
   struct bpf_map *, map, void *, key, u64, flags)
 {
WARN_ON_ONCE(!rcu_read_lock_held());
+
+   /* ULPs are currently supported only for TCP sockets in ESTABLISHED
+* state. This checks that the sock ops triggering the update is
+* one indicating we are (or will be soon) in an ESTABLISHED state.
+*/
+   if (!bpf_is_valid_sock_op(bpf_sock))
+   return -EOPNOTSUPP;
return sock_map_ctx_update_elem(bpf_sock, map, key, flags);
 }
 
@@ -2565,6 +2581,9 @@ struct sock  *__sock_hash_lookup_elem(struct bpf_map 
*map, void *key)
   struct bpf_map *, map, void *, key, u64, flags)
 {
WARN_ON_ONCE(!rcu_read_lock_held());
+
+   if (!bpf_is_valid_sock_op(bpf_sock))
+   return -EOPNOTSUPP;
return sock_hash_ctx_update_elem(bpf_sock, map, key, flags);
 }
 



[net-next PATCH] tls: async support causes out-of-bounds access in crypto APIs

2018-09-14 Thread John Fastabend
When async support was added it needed to access the sk from the async
callback to report errors up the stack. The patch tried to use space
after the aead request struct by directly setting the reqsize field in
aead_request. This is an internal field that should not be used
outside the crypto APIs. It is used by the crypto code to define extra
space for private structures used in the crypto context. Users of the
API then use crypto_aead_reqsize() and add the returned amount of
bytes to the end of the request memory allocation before posting the
request to encrypt/decrypt APIs.

So this breaks (with a general protection fault and a KASAN error, if
enabled) because the request sent to decrypt is shorter than required,
causing out-of-bounds accesses in the crypto API. It also seems
unlikely the sk is even valid by the time it gets to the callback,
because of the memset in the crypto layer.

Anyway, fix this by holding the sk in the skb->sk field when the
callback is set up; because the skb is already passed through to the
callback handler via a void *, we can access it in the handler. In the
handler we then need to be careful to NULL the pointer again before
kfree_skb(). I added comments on both the setup (in tls_do_decryption)
and the point where we clear it in the crypto callback handler,
tls_decrypt_done(). After this, the selftests pass again and the KASAN
errors/warnings are fixed.

Fixes: 94524d8fc965 ("net/tls: Add support for async decryption of tls records")
Signed-off-by: John Fastabend 
---
 include/net/tls.h |4 
 net/tls/tls_sw.c  |   39 +++
 2 files changed, 23 insertions(+), 20 deletions(-)

diff --git a/include/net/tls.h b/include/net/tls.h
index cd0a65b..8630d28 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -128,10 +128,6 @@ struct tls_sw_context_rx {
bool async_notify;
 };
 
-struct decrypt_req_ctx {
-   struct sock *sk;
-};
-
 struct tls_record_info {
struct list_head list;
u32 end_seq;
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index be4f2e9..cef69b6 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -122,25 +122,32 @@ static int skb_nsg(struct sk_buff *skb, int offset, int 
len)
 static void tls_decrypt_done(struct crypto_async_request *req, int err)
 {
struct aead_request *aead_req = (struct aead_request *)req;
-   struct decrypt_req_ctx *req_ctx =
-   (struct decrypt_req_ctx *)(aead_req + 1);
-
struct scatterlist *sgout = aead_req->dst;
-
-   struct tls_context *tls_ctx = tls_get_ctx(req_ctx->sk);
-   struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
-   int pending = atomic_dec_return(>decrypt_pending);
+   struct tls_sw_context_rx *ctx;
+   struct tls_context *tls_ctx;
struct scatterlist *sg;
+   struct sk_buff *skb;
unsigned int pages;
+   int pending;
+
+   skb = (struct sk_buff *)req->data;
+   tls_ctx = tls_get_ctx(skb->sk);
+   ctx = tls_sw_ctx_rx(tls_ctx);
+   pending = atomic_dec_return(>decrypt_pending);
 
/* Propagate if there was an err */
if (err) {
ctx->async_wait.err = err;
-   tls_err_abort(req_ctx->sk, err);
+   tls_err_abort(skb->sk, err);
}
 
+   /* After using skb->sk to propagate sk through crypto async callback
+* we need to NULL it again.
+*/
+   skb->sk = NULL;
+
/* Release the skb, pages and memory allocated for crypto req */
-   kfree_skb(req->data);
+   kfree_skb(skb);
 
/* Skip the first S/G entry as it points to AAD */
for_each_sg(sg_next(sgout), sg, UINT_MAX, pages) {
@@ -175,11 +182,13 @@ static int tls_do_decryption(struct sock *sk,
   (u8 *)iv_recv);
 
if (async) {
-   struct decrypt_req_ctx *req_ctx;
-
-   req_ctx = (struct decrypt_req_ctx *)(aead_req + 1);
-   req_ctx->sk = sk;
-
+   /* Using skb->sk to push sk through to crypto async callback
+* handler. This allows propagating errors up to the socket
+* if needed. It _must_ be cleared in the async handler
+* before kfree_skb is called. We _know_ skb->sk is NULL
+* because it is a clone from strparser.
+*/
+   skb->sk = sk;
aead_request_set_callback(aead_req,
  CRYPTO_TFM_REQ_MAY_BACKLOG,
  tls_decrypt_done, skb);
@@ -1455,8 +1464,6 @@ int tls_set_sw_offload(struct sock *sk, struct 
tls_context *ctx, int tx)
goto free_aead;
 
if (sw_ctx_rx) {
-   (*aead)->reqsize = sizeof(struct decrypt_req_ctx);
-
/* Set up strparser */
memset(, 0, sizeof(cb));
cb.rcv_msg = tls_queue;


