Re: [PATCH bpf-next 2/3] xdp: Add tracepoint for bulk XDP_TX
On Thu, 23 May 2019 19:56:47 +0900 Toshiaki Makita wrote:

> This is introduced for admins to check what is happening on XDP_TX when
> bulk XDP_TX is in use.
>
> Signed-off-by: Toshiaki Makita
> ---
>  include/trace/events/xdp.h | 25 +
>  kernel/bpf/core.c          |  1 +
>  2 files changed, 26 insertions(+)
>
> diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h
> index e95cb86..e06ea65 100644
> --- a/include/trace/events/xdp.h
> +++ b/include/trace/events/xdp.h
> @@ -50,6 +50,31 @@
>  		  __entry->ifindex)
>  );
>
> +TRACE_EVENT(xdp_bulk_tx,
> +

You are using this tracepoint like/instead of trace_xdp_devmap_xmit,
if I understand correctly?  Or maybe the trace_xdp_redirect
tracepoint.  The point is that it will be good if the tracepoints can
share the TP_STRUCT layout beginning, as it allows for attaching and
reusing eBPF code that is only interested in the top part of the
struct.

I would also want to see some identifier that trace programs can use
to group and correlate these events. You do have ifindex, but most
other XDP tracepoints also have "prog_id".

> +	TP_PROTO(const struct net_device *dev,
> +		 int sent, int drops, int err),
> +
> +	TP_ARGS(dev, sent, drops, err),
> +
> +	TP_STRUCT__entry(
> +		__field(int, ifindex)
> +		__field(int, drops)
> +		__field(int, sent)
> +		__field(int, err)
> +	),

The xdp_redirect_template have:

	TP_STRUCT__entry(
		__field(int, prog_id)
		__field(u32, act)
		__field(int, ifindex)
		__field(int, err)
		__field(int, to_ifindex)
		__field(u32, map_id)
		__field(int, map_index)
	),

And e.g. TRACE_EVENT xdp_exception have:

	TP_STRUCT__entry(
		__field(int, prog_id)
		__field(u32, act)
		__field(int, ifindex)
	),

> +
> +	TP_fast_assign(
> +		__entry->ifindex = dev->ifindex;
> +		__entry->drops   = drops;
> +		__entry->sent    = sent;
> +		__entry->err     = err;
> +	),
> +
> +	TP_printk("ifindex=%d sent=%d drops=%d err=%d",
> +		  __entry->ifindex, __entry->sent, __entry->drops, __entry->err)
> +);
> +
>  DECLARE_EVENT_CLASS(xdp_redirect_template,
>
>  	TP_PROTO(const struct net_device *dev,
>
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 242a643..7687488 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -2108,3 +2108,4 @@ int __weak skb_copy_bits(const struct sk_buff *skb, int offset, void *to,
>  #include
>
>  EXPORT_TRACEPOINT_SYMBOL_GPL(xdp_exception);
> +EXPORT_TRACEPOINT_SYMBOL_GPL(xdp_bulk_tx);

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH bpf-next 3/3] veth: Support bulk XDP_TX
On Thu, 23 May 2019 20:35:50 +0900 Toshiaki Makita wrote:

> On 2019/05/23 20:25, Toke Høiland-Jørgensen wrote:
> > Toshiaki Makita writes:
> >
> >> This improves XDP_TX performance by about 8%.
> >>
> >> Here are single core XDP_TX test results. CPU consumptions are taken
> >> from "perf report --no-child".
> >>
> >> - Before:
> >>
> >>   7.26 Mpps
> >>
> >>   _raw_spin_lock  7.83%
> >>   veth_xdp_xmit  12.23%
> >>
> >> - After:
> >>
> >>   7.84 Mpps
> >>
> >>   _raw_spin_lock  1.17%
> >>   veth_xdp_xmit   6.45%
> >>
> >> Signed-off-by: Toshiaki Makita
> >> ---
> >>  drivers/net/veth.c | 26 +-
> >>  1 file changed, 25 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> >> index 52110e5..4edc75f 100644
> >> --- a/drivers/net/veth.c
> >> +++ b/drivers/net/veth.c
> >> @@ -442,6 +442,23 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
> >>  	return ret;
> >>  }
> >>
> >> +static void veth_xdp_flush_bq(struct net_device *dev)
> >> +{
> >> +	struct xdp_tx_bulk_queue *bq = this_cpu_ptr(&xdp_tx_bq);
> >> +	int sent, i, err = 0;
> >> +
> >> +	sent = veth_xdp_xmit(dev, bq->count, bq->q, 0);
> >
> > Wait, veth_xdp_xmit() is just putting frames on a pointer ring. So
> > you're introducing an additional per-cpu bulk queue, only to avoid lock
> > contention around the existing pointer ring. But the pointer ring is
> > per-rq, so if you have lock contention, this means you must have
> > multiple CPUs servicing the same rq, no?
>
> Yes, it's possible. Not recommended though.

I think the general per-cpu TX bulk queue is overkill. There is a loop
over packets in veth_xdp_rcv(struct veth_rq *rq, budget, *status), and
the caller veth_poll() will call veth_xdp_flush(rq->dev).

Why can't you store this "temp" bulk array in struct veth_rq ?

You could even alloc/create it on the stack of veth_poll() and send it
along via a pointer to veth_xdp_rcv().
-- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH v3 2/2] net: core: support XDP generic on stacked devices.
On Thu, 23 May 2019 13:15:44 -0700 Stephen Hemminger wrote:

> On Thu, 23 May 2019 19:19:40 +
> Saeed Mahameed wrote:
>
> > On Thu, 2019-05-23 at 10:54 -0700, Stephen Hemminger wrote:
> > > When a device is stacked like (team, bonding, failsafe or netvsc) the
> > > XDP generic program for the parent device was not called.
> > >
> > > Move the call to XDP generic inside __netif_receive_skb_core where
> > > it can be done multiple times for stacked case.
> > >
> > > Suggested-by: Jiri Pirko
> > > Fixes: d445516966dc ("net: xdp: support xdp generic on virtual devices")
> > > Signed-off-by: Stephen Hemminger
> > > ---
> > > v1 - call xdp_generic in netvsc handler
> > > v2 - do xdp_generic in generic rx handler processing
> > > v3 - move xdp_generic call inside the another pass loop
> > >
> > >  net/core/dev.c | 56 ++
> > >  1 file changed, 11 insertions(+), 45 deletions(-)
> > >
> > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > index b6b8505cfb3e..696776e14d00 100644
> > > --- a/net/core/dev.c
> > > +++ b/net/core/dev.c
> > > @@ -4502,23 +4502,6 @@ static int netif_rx_internal(struct sk_buff *skb)
> > >
> > >  	trace_netif_rx(skb);
> > >
> > > -	if (static_branch_unlikely(&generic_xdp_needed_key)) {
> > > -		int ret;
> > > -
> > > -		preempt_disable();
> > > -		rcu_read_lock();
> > > -		ret = do_xdp_generic(rcu_dereference(skb->dev->xdp_prog), skb);
> > > -		rcu_read_unlock();
> > > -		preempt_enable();
> > > -
> > > -		/* Consider XDP consuming the packet a success from
> > > -		 * the netdev point of view we do not want to count
> > > -		 * this as an error.
> > > -		 */
> > > -		if (ret != XDP_PASS)
> > > -			return NET_RX_SUCCESS;
> > > -	}
> > > -
> >
> > Adding Jesper,
> >
> > There is a small behavioral change due to this patch,
> > the XDP program after this patch will run on the RPS CPU, if
> > configured, which could cause some behavioral changes in
> > xdp_redirect_cpu: bpf_redirect_map(cpu_map).
> > Maybe this is acceptable, but it should be documented, as the current
> > assumption dictates: XDP program runs on the core where the XDP
> > frame/SKB was first seen.

This does break some assumptions, that I worry about.

I've not optimized generic XDP much, as this is supposed to be
slow-path, but as you can see in my evaluation[1] generic-XDP do have a
performance potential (XDP drop: native=12Mpps and generic=8.4Mpps),
but the XDP-generic performance dies as soon as we e.g. do XDP_TX
(native=10Mpps and generic=4Mpps). The reason is lack of bulking.

We could implement the same kind of RX-bulking tricks as we do for
XDP_REDIRECT, where bpf_redirect_map store frames in the map and send
them once NAPI-poll exit via a xdp_do_flush_map(). These tricks depend
on per-CPU data (bpf_redirect_info), thus I cannot see how this could
work if XDP-generic now happens after RPS on a remote CPU.

Notice, that TX bulking at XDP-generic level is actually rather simple,
as netstack TX path-code support xmit_more via creating a list of
SKBs... Last time I hacked it up, I saw 20%-30% speedup... anyone
motivated to do this?

> > Or maybe XDP should just force off RPS (like it does gro)

I guess, we could force off RPS. But I do see one valid use-case of
combining CPUMAP redirect with RFS (Receive Flow Steering), which is
part of RPS. Yes, I know we/I *also* have to implement
generic-XDP-cpumap support. But for native-XDP CPUMAP redirect, from
1-2 RX-CPUs into N-remote CPUs via CPUMAP, and then let RFS send SKBs
to where the application runs, does make sense to me. (I do have an
"assignment" to implement this in eBPF here[2]).
[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/Documentation/blogposts/xdp25_eval_generic_xdp_tx.rst [2] https://github.com/xdp-project/xdp-project/blob/master/areas/cpumap.org#cpumap-implement-dynamic-load-balancer-that-is-ooo-safe -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH bpf-next 3/3] veth: Support bulk XDP_TX
On Thu, 23 May 2019 22:51:34 +0900 Toshiaki Makita wrote: > On 19/05/23 (木) 22:29:27, Jesper Dangaard Brouer wrote: > > On Thu, 23 May 2019 20:35:50 +0900 > > Toshiaki Makita wrote: > > > >> On 2019/05/23 20:25, Toke Høiland-Jørgensen wrote: > >>> Toshiaki Makita writes: > >>> > >>>> This improves XDP_TX performance by about 8%. > >>>> > >>>> Here are single core XDP_TX test results. CPU consumptions are taken > >>>> from "perf report --no-child". > >>>> > >>>> - Before: > >>>> > >>>>7.26 Mpps > >>>> > >>>>_raw_spin_lock 7.83% > >>>>veth_xdp_xmit 12.23% > >>>> > >>>> - After: > >>>> > >>>>7.84 Mpps > >>>> > >>>>_raw_spin_lock 1.17% > >>>>veth_xdp_xmit 6.45% > >>>> > >>>> Signed-off-by: Toshiaki Makita > >>>> --- > >>>> drivers/net/veth.c | 26 +- > >>>> 1 file changed, 25 insertions(+), 1 deletion(-) > >>>> > >>>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c > >>>> index 52110e5..4edc75f 100644 > >>>> --- a/drivers/net/veth.c > >>>> +++ b/drivers/net/veth.c > >>>> @@ -442,6 +442,23 @@ static int veth_xdp_xmit(struct net_device *dev, > >>>> int n, > >>>> return ret; > >>>> } > >>>> > >>>> +static void veth_xdp_flush_bq(struct net_device *dev) > >>>> +{ > >>>> +struct xdp_tx_bulk_queue *bq = this_cpu_ptr(&xdp_tx_bq); > >>>> +int sent, i, err = 0; > >>>> + > >>>> +sent = veth_xdp_xmit(dev, bq->count, bq->q, 0); > >>> > >>> Wait, veth_xdp_xmit() is just putting frames on a pointer ring. So > >>> you're introducing an additional per-cpu bulk queue, only to avoid lock > >>> contention around the existing pointer ring. But the pointer ring is > >>> per-rq, so if you have lock contention, this means you must have > >>> multiple CPUs servicing the same rq, no? > >> > >> Yes, it's possible. Not recommended though. > >> > > > > I think the general per-cpu TX bulk queue is overkill. There is a loop > > over packets in veth_xdp_rcv(struct veth_rq *rq, budget, *status), and > > the caller veth_poll() will call veth_xdp_flush(rq->dev). 
> > Why can't you store this "temp" bulk array in struct veth_rq ?
>
> Of course I can. But I thought tun has the same problem and we can
> decrease memory footprint by sharing the same storage between devices.
> Or if other devices want to reduce queues so that we can use XDP on
> many-cpu servers and introduce locks, we can use this storage for
> that case as well.
>
> Still do you prefer veth-specific solution?

Yes. Another reason is that with this shared/general per-cpu TX bulk
queue, I can easily see bugs resulting in xdp_frames getting
transmitted on a completely other NIC, which will be hard to debug for
people.

> > You could even alloc/create it on the stack of veth_poll() and send
> > it along via a pointer to veth_xdp_rcv().

IMHO it would be cleaner code wise to place the "temp" bulk array in
struct veth_rq. But if you worry about performance and want a hot
cacheline for this, then you could just use the call-stack for
veth_poll(), as I described. It should not be too ugly code wise to do
this I think.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH v3 2/2] net: core: support XDP generic on stacked devices.
On Fri, 24 May 2019 12:17:27 +0800 Jason Wang wrote:

> > Maybe this is acceptable, but it should be documented, as the current
> > assumption dictates: XDP program runs on the core where the XDP
> > frame/SKB was first seen.
>
> At least for TUN, this is not true. XDP frames were built by vhost_net
> and passed to TUN. There's no guarantee that vhost_net kthread won't
> move to another core.

This sounds a little scary, as we depend on per-CPU variables (e.g.
bpf_redirect_info). Can the vhost_net kthread move between CPUs
within/during the NAPI-poll?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH bpf-next v3 2/3] libbpf: add bpf_object__load_xattr() API function to pass log_level
On Fri, 24 May 2019 11:36:47 +0100 Quentin Monnet wrote:

> libbpf was recently made aware of the log_level attribute for programs,
> used to specify the level of information expected to be dumped by the
> verifier. Function bpf_prog_load_xattr() got support for this log_level
> parameter.
>
> But some applications using libbpf rely on another function to load
> programs, bpf_object__load(), which does not accept any parameter for
> log level. Create an API function based on bpf_object__load(), but
> accepting an "attr" object as a parameter. Then add a log_level field
> to that object, so that applications calling the new
> bpf_object__load_xattr() can pick the desired log level.

Does this allow us to extend struct bpf_object_load_attr later?

> v3:
> - Rewrite commit log.
>
> v2:
> - We are in a new cycle, bump libbpf extraversion number.
>
> Signed-off-by: Quentin Monnet
> Reviewed-by: Jakub Kicinski
> ---
>  tools/lib/bpf/Makefile   |  2 +-
>  tools/lib/bpf/libbpf.c   | 20 +---
>  tools/lib/bpf/libbpf.h   |  6 ++
>  tools/lib/bpf/libbpf.map |  5 +
>  4 files changed, 29 insertions(+), 4 deletions(-)
>
> diff --git a/tools/lib/bpf/Makefile b/tools/lib/bpf/Makefile
> index a2aceadf68db..9312066a1ae3 100644
> --- a/tools/lib/bpf/Makefile
> +++ b/tools/lib/bpf/Makefile
> @@ -3,7 +3,7 @@
>
>  BPF_VERSION = 0
>  BPF_PATCHLEVEL = 0
> -BPF_EXTRAVERSION = 3
> +BPF_EXTRAVERSION = 4
>
>  MAKEFLAGS += --no-print-directory
>
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 197b574406b3..1c6fb7a3201e 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -,7 +,7 @@ static bool bpf_program__is_function_storage(struct bpf_program *prog,
>  }
>
>  static int
> -bpf_object__load_progs(struct bpf_object *obj)
> +bpf_object__load_progs(struct bpf_object *obj, int log_level)
>  {
>  	size_t i;
>  	int err;
> @@ -2230,6 +2230,7 @@ bpf_object__load_progs(struct bpf_object *obj)
>  	for (i = 0; i < obj->nr_programs; i++) {
>  		if (bpf_program__is_function_storage(&obj->programs[i], obj))
>  			continue;
> +		obj->programs[i].log_level = log_level;
>  		err = bpf_program__load(&obj->programs[i],
>  					obj->license,
>  					obj->kern_version);
> @@ -2381,10 +2382,14 @@ int bpf_object__unload(struct bpf_object *obj)
>  	return 0;
>  }
>
> -int bpf_object__load(struct bpf_object *obj)
> +int bpf_object__load_xattr(struct bpf_object_load_attr *attr)
>  {
> +	struct bpf_object *obj;
>  	int err;
>
> +	if (!attr)
> +		return -EINVAL;
> +	obj = attr->obj;
>  	if (!obj)
>  		return -EINVAL;
>
> @@ -2397,7 +2402,7 @@ int bpf_object__load(struct bpf_object *obj)
>
>  	CHECK_ERR(bpf_object__create_maps(obj), err, out);
>  	CHECK_ERR(bpf_object__relocate(obj), err, out);
> -	CHECK_ERR(bpf_object__load_progs(obj), err, out);
> +	CHECK_ERR(bpf_object__load_progs(obj, attr->log_level), err, out);
>
>  	return 0;
> out:
> @@ -2406,6 +2411,15 @@ int bpf_object__load(struct bpf_object *obj)
>  	return err;
>  }
>
> +int bpf_object__load(struct bpf_object *obj)
> +{
> +	struct bpf_object_load_attr attr = {
> +		.obj = obj,
> +	};
> +
> +	return bpf_object__load_xattr(&attr);
> +}
> +
>  static int check_path(const char *path)
>  {
>  	char *cp, errmsg[STRERR_BUFSIZE];
>
> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> index c5ff00515ce7..e1c748db44f6 100644
> --- a/tools/lib/bpf/libbpf.h
> +++ b/tools/lib/bpf/libbpf.h
> @@ -89,8 +89,14 @@ LIBBPF_API int bpf_object__unpin_programs(struct bpf_object *obj,
>  LIBBPF_API int bpf_object__pin(struct bpf_object *object, const char *path);
>  LIBBPF_API void bpf_object__close(struct bpf_object *object);
>
> +struct bpf_object_load_attr {
> +	struct bpf_object *obj;
> +	int log_level;
> +};

Can this be extended later?

>  /* Load/unload object into/from kernel */
>  LIBBPF_API int bpf_object__load(struct bpf_object *obj);
> +LIBBPF_API int bpf_object__load_xattr(struct bpf_object_load_attr *attr);
>  LIBBPF_API int bpf_object__unload(struct bpf_object *obj);
>  LIBBPF_API const char *bpf_object__name(struct bpf_object *obj);
>  LIBBPF_API unsigned int bpf_object__kversion(struct bpf_object *obj);
>
> diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
> index 673001787cba..6ce61fa0baf3 100644
> --- a/tools/lib/bpf/libbpf.map
> +++ b/tools/lib/bpf/libbpf.map
> @@ -164,3 +164,8 @@ LIBBPF_0.0.3 {
>  		bpf_map_freeze;
>  		btf__finalize_data;
>  } LIBBPF_0.0.2;
> +
> +LIBBPF_0.0.4 {
> +	global:
> +		bpf_object__load_xattr;
> +} LIBBPF_0.0.3;

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH bpf-next v3 2/3] libbpf: add bpf_object__load_xattr() API function to pass log_level
On Fri, 24 May 2019 12:51:14 +0100 Quentin Monnet wrote:

> 2019-05-24 13:22 UTC+0200 ~ Jesper Dangaard Brouer
> > On Fri, 24 May 2019 11:36:47 +0100
> > Quentin Monnet wrote:
> >
> >> libbpf was recently made aware of the log_level attribute for programs,
> >> used to specify the level of information expected to be dumped by the
> >> verifier. Function bpf_prog_load_xattr() got support for this log_level
> >> parameter.
> >>
> >> But some applications using libbpf rely on another function to load
> >> programs, bpf_object__load(), which does not accept any parameter for
> >> log level. Create an API function based on bpf_object__load(), but
> >> accepting an "attr" object as a parameter. Then add a log_level field
> >> to that object, so that applications calling the new
> >> bpf_object__load_xattr() can pick the desired log level.
> >
> > Does this allow us to extend struct bpf_object_load_attr later?
>
> I see no reason why it could not. Having the _xattr() version of the
> function is precisely a way to have something extensible in the future,
> without having to create additional API functions each time we want to
> pass a new parameter. And e.g. struct bpf_prog_load_attr (used with
> bpf_prog_load_xattr()) has already been extended in the past. So, yeah,
> we can add to it in the future.

Great. I just don't know/understand how user-space handles this. If a
binary is compiled with libbpf as a dynamically loadable lib, and it
e.g. saw libbpf.so.2 when it was compiled, then can't it choose to use
libbpf.so.3 then? (e.g. when libbpf.so.2 is not on the system).
(I would actually like to learn/understand this, so links are welcome).

> Do you have something in mind?

I was playing with extending bpf_prog_load_attr, but instead I created
a bpf_prog_load_attr_maps and a new function
bpf_prog_load_xattr_maps(), e.g. see:

 https://github.com/xdp-project/xdp-tutorial/blob/master/common/common_libbpf.h
 https://github.com/xdp-project/xdp-tutorial/blob/master/common/common_libbpf.c

I guess, I could just extend bpf_prog_load_attr instead, right?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
Re: Bad XDP performance with mlx5
On Fri, 31 May 2019 08:51:43 +0200 Tom Barbette wrote:

> CCing mlx5 maintainers and commiters of bce2b2b. TL;DR: there is a
> huge CPU increase on CX5 when introducing a XDP program.
>
> See https://www.youtube.com/watch?v=o5hlJZbN4Tk&feature=youtu.be
> around 0:40. We're talking something like 15% while it's near 0 for
> other drivers. The machine is a recent Skylake. For us it makes XDP
> unusable. Is that a known problem?

I have a similar test setup, and I can reproduce. I have found the
root-cause, see below. But on my system it was even worse: with an
XDP_PASS program loaded, and iperf (6 parallel TCP flows), I would see
100% CPU usage and total 83.3 Gbits/sec. With the non-XDP case, I saw
58% CPU (43% idle) and total 89.7 Gbits/sec.

> I wonder if it doesn't simply come from mlx5/en_main.c:
> rq->buff.map_dir = rq->xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;

Nope, that is not the problem.

> Which would be inline from my observation that memory access seems
> heavier. I guess this is for the XDP_TX case.
>
> If this is indeed the problem. Any chance we can:
> a) detect automatically that a program will not return XDP_TX (I'm not
> quite sure about what the BPF limitations allow to guess in advance) or
> b) add a flag to such as XDP_FLAGS_NO_TX to avoid such hit in
> performance when not needed?

This was kind of hard to root-cause, but I solved it by increasing the
TCP socket size used by the iperf tool, like this (please reproduce):

 $ iperf -s --window 4M
 Server listening on TCP port 5001
 TCP window size:  416 KByte (WARNING: requested 4.00 MByte)

Given I could reproduce, I took a closer look at perf record/report
stats, and it was actually quite clear that this was related to
stalling on getting pages from the page allocator (function calls top#6
get_page_from_freelist and free_pcppages_bulk).

Using my tool ethtool_stats.pl:
 https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl

It was clear that the mlx5 driver page-cache was not working:

 Ethtool(mlx5p1 ) stat: 6653761 ( 6,653,761) <= rx_cache_busy /sec
 Ethtool(mlx5p1 ) stat: 6653732 ( 6,653,732) <= rx_cache_full /sec
 Ethtool(mlx5p1 ) stat:  669481 (   669,481) <= rx_cache_reuse /sec
 Ethtool(mlx5p1 ) stat:       1 (         1) <= rx_congst_umr /sec
 Ethtool(mlx5p1 ) stat: 7323230 ( 7,323,230) <= rx_csum_unnecessary /sec
 Ethtool(mlx5p1 ) stat:    1034 (     1,034) <= rx_discards_phy /sec
 Ethtool(mlx5p1 ) stat: 7323230 ( 7,323,230) <= rx_packets /sec
 Ethtool(mlx5p1 ) stat: 7324244 ( 7,324,244) <= rx_packets_phy /sec

While the non-XDP case looked like this:

 Ethtool(mlx5p1 ) stat:  298929 (   298,929) <= rx_cache_busy /sec
 Ethtool(mlx5p1 ) stat:  298971 (   298,971) <= rx_cache_full /sec
 Ethtool(mlx5p1 ) stat: 3548789 ( 3,548,789) <= rx_cache_reuse /sec
 Ethtool(mlx5p1 ) stat: 7695476 ( 7,695,476) <= rx_csum_complete /sec
 Ethtool(mlx5p1 ) stat: 7695476 ( 7,695,476) <= rx_packets /sec
 Ethtool(mlx5p1 ) stat: 7695169 ( 7,695,169) <= rx_packets_phy /sec

 Manual consistency calc: 7695476-((3548789*2)+(298971*2)) = -44

With the increased TCP window size, the mlx5 driver cache is working
better, but not optimally, see below. I'm getting 88.0 Gbits/sec with
68% CPU usage.

 Ethtool(mlx5p1 ) stat:  894438 (   894,438) <= rx_cache_busy /sec
 Ethtool(mlx5p1 ) stat:  894453 (   894,453) <= rx_cache_full /sec
 Ethtool(mlx5p1 ) stat: 6638518 ( 6,638,518) <= rx_cache_reuse /sec
 Ethtool(mlx5p1 ) stat:       6 (         6) <= rx_congst_umr /sec
 Ethtool(mlx5p1 ) stat: 7532983 ( 7,532,983) <= rx_csum_unnecessary /sec
 Ethtool(mlx5p1 ) stat:     164 (       164) <= rx_discards_phy /sec
 Ethtool(mlx5p1 ) stat: 7532983 ( 7,532,983) <= rx_packets /sec
 Ethtool(mlx5p1 ) stat: 7533193 ( 7,533,193) <= rx_packets_phy /sec

 Manual consistency calc: 7532983-(6638518+894453) = 12

To understand why this is happening, you first have to know that the
difference is between the two RX-memory modes used by mlx5 for non-XDP
vs XDP. With non-XDP, two frames are stored per memory-page, while for
XDP only a single frame per page is used. The packets available in the
RX-rings are actually the same, as the ring sizes are non-XDP=512 vs.
XDP=1024.

I believe the real issue is that TCP use the SKB->truesize (based on
frame size) for different memory pressure and window calculations,
which is why it solved the issue to increase the window size manually.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
Re: Bad XDP performance with mlx5
On Fri, 31 May 2019 18:06:01 + Saeed Mahameed wrote:

> On Fri, 2019-05-31 at 18:18 +0200, Jesper Dangaard Brouer wrote:
[...]
> >
> > To understand why this is happening, you first have to know that the
> > difference is between the two RX-memory modes used by mlx5 for
> > non-XDP vs XDP. With non-XDP two frames are stored per memory-page,
> > while for XDP only a single frame per page is used. The packets
> > available in the RX-rings are actually the same, as the ring sizes
> > are non-XDP=512 vs. XDP=1024.
>
> Thanks Jesper ! this was a well put together explanation.
> I want to point out that some other drivers are using alloc_skb APIs
> which provide a good caching mechanism, which is even better than the
> mlx5 internal one (which uses the alloc_page APIs directly), this can
> explain the difference, and your explanation shows the root cause of
> the higher cpu util with XDP on mlx5, since the mlx5 page cache works
> with half of its capacity when enabling XDP.
>
> Now do we really need to keep this page per packet in mlx5 when XDP is
> enabled ? i think it is time to drop that ..

No, we need to keep the page per packet (at least until I've solved
some corner-cases with page_pool, which could likely require getting a
page-flag).

> > I believe, the real issue is that TCP use the SKB->truesize (based
> > on frame size) for different memory pressure and window
> > calculations, which is why it solved the issue to increase the
> > window size manually.

The TCP performance issue is not solely a SKB->truesize issue, but
also an issue with how the driver level page-cache works. It is
actually very fragile, as a single page with elevated refcnt can block
the cache (see mlx5e_rx_cache_get()). Which easily happens with TCP
packets that are waiting to be re-transmitted in case of loss. This is
happening here, as indicated by rx_cache_busy and rx_cache_full being
the same.

We (Ilias, Tariq and I) have been planning to remove this small driver
cache, and instead use the page_pool, and create a page-return path
for SKBs. Which should make this problem go away. I'm going to be
working on this the next couple of weeks (the tricky part is all the
corner cases).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
Re: Bad XDP performance with mlx5
On Tue, 4 Jun 2019 09:28:22 +0200 Tom Barbette wrote: > Thanks Jesper for looking into this! > > I don't think I will be of much help further on this matter. My take > out would be: as a first-time user looking into XDP after watching a > dozen of XDP talks, I would have expected XDP default settings to be > identical to SKB, so I don't have to watch out for a set of > per-driver parameter checklist to avoid increasing my CPU consumption > by 15% when inserting "a super efficient and light BPF program". But > I understand it's not that easy... The gap should not be this large, but as I demonstrated it was primarily because you hit an unfortunate interaction with TCP and how the mlx5 driver does page-caching (p.s. we are working on removing this driver local recycle-cache). When loading an XDP/eBPF-prog then the driver change the underlying RX memory model, which waste memory to gain packets-per-sec speed, but TCP sees this memory waste and gives us a penalty. It is important to understand, that XDP is not optimized for TCP. XDP is designed and optimized for L2-L3 handling of packets (TCP is L4). Before XDP these L2-L3 use-cases were "slow", because the kernel netstack assumes a L4/socket use-case (full SKB), when less was really needed. This is actually another good example of why XDP programs per RX-queue, will be useful (notice: which is not implemented upstream, yet...). -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH bpf-next v2 1/2] net: xdp: refactor XDP_QUERY_PROG{,_HW} to netdev
On Mon, 3 Jun 2019 21:20:30 + Saeed Mahameed wrote: > On Mon, 2019-06-03 at 11:04 +0200, Björn Töpel wrote: > > On Sat, 1 Jun 2019 at 21:42, Jakub Kicinski > > wrote: > > > On Fri, 31 May 2019 19:18:17 +, Saeed Mahameed wrote: > > > > On Fri, 2019-05-31 at 11:42 +0200, Björn Töpel wrote: > > > > > From: Björn Töpel > > > > > > > > > > All XDP capable drivers need to implement the > > > > > XDP_QUERY_PROG{,_HW} command of ndo_bpf. The query code is > > > > > fairly generic. This commit refactors the query code up from > > > > > the drivers to the netdev level. > > > > > > > > > > The struct net_device has gained two new members: xdp_prog_hw > > > > > and xdp_flags. The former is the offloaded XDP program, if > > > > > any, and the latter tracks the flags that the supplied when > > > > > attaching the XDP program. The flags only apply to SKB_MODE > > > > > or DRV_MODE, not HW_MODE. > > > > > > > > > > The xdp_prog member, previously only used for SKB_MODE, is > > > > > shared with DRV_MODE. This is OK, due to the fact that > > > > > SKB_MODE and DRV_MODE are mutually exclusive. To > > > > > differentiate between the two modes, a new internal flag is > > > > > introduced as well. > > > > > > > > Just thinking out loud, why can't we allow any combination of > > > > HW/DRV/SKB modes? they are totally different attach points in a > > > > totally different checkpoints in a frame life cycle. The DRV_MODE and SKB_MODE is tricky to run concurrently. The XDP-redirect scheme (designed by John) use a BPF helper (bpf_xdp_redirect_map) that set global per-CPU state (this_cpu_ptr(&bpf_redirect_info)). The tricky part (which I warned about, and we already have some fixes for) is that the XDP/BPF-prog can call bpf_redirect_map, which update the per-CPU state, but it can still choose not to return XDP_REDIRECT, which then miss an invocation of xdp_do_redirect(). 
Later, your SKB_MODE XDP/BPF-prog can return XDP_REDIRECT without
calling the helper, and then use/reach the per-CPU info set by the
DRV_MODE program, which is NOT something I want us to "support".

> > > FWIW see Message-ID: <20190201080236.446d8...@redhat.com>
> >
> > I've always seen the SKB-mode as something that will eventually be
> > removed.
>
> I don't think so, we are too deep into SKB-mode.

I wish we could remove it. After we got native XDP support in veth,
its original purpose is gone, which was making it easier for
developers to get something working on their laptop, without having to
deploy it to the server with physical hardware all the time.

> > Clickable link:
> > https://lore.kernel.org/netdev/20190201080236.446d8...@redhat.com/
>
> So we are all hanging on Jesper's refactoring ideas that are not
> getting any priority for now :).

Well, that is not true. This patchset _is_ the refactor idea that
Bjørn is taking over and working on. Specifically [2] from above link.

[2] https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp_per_rxq01.org#refactor-idea-xdpbpf_prog-into-netdev_rx_queuenet_device

> > > > Down the road i think we will utilize this fact and start
> > > > introducing SKB helpers for SKB mode and driver helpers for DRV
> > > > mode..
> > >
> > > Any reason why we would want the extra complexity? There is
> > > cls_bpf if someone wants skb features after all..
>
> Donno, SKB mode is earlier in the stack maybe ..

From a BPF perspective you cannot introduce SKB helpers for SKB mode.
An XDP-prog have bpf prog type XDP (BPF_PROG_TYPE_XDP), and the
program itself is identical regardless of attaching for DRV_MODE or
SKB_MODE. You cannot detect this at attach time, due to tail-calls
(which have painted us into a corner).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH net-next 2/2] devmap: Allow map lookups from eBPF
On Tue, 04 Jun 2019 17:24:10 +0200 Toke Høiland-Jørgensen wrote: > We don't currently allow lookups into a devmap from eBPF, because the map > lookup returns a pointer directly to the dev->ifindex, which shouldn't be > modifiable from eBPF. > > However, being able to do lookups in devmaps is useful to know (e.g.) > whether forwarding to a specific interface is enabled. Currently, programs > work around this by keeping a shadow map of another type which indicates > whether a map index is valid. > > To allow lookups, simply copy the ifindex into a scratch variable and > return a pointer to this. If an eBPF program does modify it, this doesn't > matter since it will be overridden on the next lookup anyway. While this > does add a write to every lookup, the overhead of this is negligible > because the cache line is hot when both the write and the subsequent > read happens. When we choose the return value, here the ifindex, it basically becomes UAPI, right? Can we somehow use BTF to help us make this extensible? As Toke mentions in the cover letter, we really want to know if the chosen egress has actually enabled/allocated resources for XDP transmit, but we currently don't have an in-kernel way to query this (thus, we cannot expose such info).
> Signed-off-by: Toke Høiland-Jørgensen > --- > kernel/bpf/devmap.c |8 +++- > kernel/bpf/verifier.c |7 ++- > 2 files changed, 9 insertions(+), 6 deletions(-) > > diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c > index 5ae7cce5ef16..830650300ea4 100644 > --- a/kernel/bpf/devmap.c > +++ b/kernel/bpf/devmap.c > @@ -65,6 +65,7 @@ struct xdp_bulk_queue { > struct bpf_dtab_netdev { > struct net_device *dev; /* must be first member, due to tracepoint */ > struct bpf_dtab *dtab; > + int ifindex_scratch; > unsigned int bit; > struct xdp_bulk_queue __percpu *bulkq; > struct rcu_head rcu; > @@ -375,7 +376,12 @@ static void *dev_map_lookup_elem(struct bpf_map *map, > void *key) > struct bpf_dtab_netdev *obj = __dev_map_lookup_elem(map, *(u32 *)key); > struct net_device *dev = obj ? obj->dev : NULL; > > - return dev ? &dev->ifindex : NULL; > + if (dev) { > + obj->ifindex_scratch = dev->ifindex; > + return &obj->ifindex_scratch; > + } > + > + return NULL; > } > > static void dev_map_flush_old(struct bpf_dtab_netdev *dev) > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c > index 5c2cb5bd84ce..7128a9821481 100644 > --- a/kernel/bpf/verifier.c > +++ b/kernel/bpf/verifier.c > @@ -2893,12 +2893,9 @@ static int check_map_func_compatibility(struct > bpf_verifier_env *env, > if (func_id != BPF_FUNC_get_local_storage) > goto error; > break; > - /* devmap returns a pointer to a live net_device ifindex that we cannot > - * allow to be modified from bpf side. So do not allow lookup elements > - * for now. > - */ > case BPF_MAP_TYPE_DEVMAP: > - if (func_id != BPF_FUNC_redirect_map) > + if (func_id != BPF_FUNC_redirect_map && > + func_id != BPF_FUNC_map_lookup_elem) > goto error; > break; > /* Restrict bpf side of cpumap and xskmap, open when use-cases > -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH v4 bpf-next 1/2] bpf: Allow bpf_map_lookup_elem() on an xskmap
On Mon, 3 Jun 2019 09:38:51 -0700 Jonathan Lemon wrote: > Currently, the AF_XDP code uses a separate map in order to > determine if an xsk is bound to a queue. Instead of doing this, > have bpf_map_lookup_elem() return the queue_id, as a way of > indicating that there is a valid entry at the map index. Just a reminder that once we choose a return value, here the queue_id, it basically becomes UAPI, and we cannot change it. Can we somehow use BTF to allow us to extend this later? I was also going to point out that you cannot return a direct pointer to queue_id, as the BPF-prog side can modify this... but Daniel already pointed this out. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH v2 bpf-next 1/2] xdp: Add tracepoint for bulk XDP_TX
On Wed, 5 Jun 2019 14:36:12 +0900 Toshiaki Makita wrote: > This is introduced for admins to check what is happening on XDP_TX when > bulk XDP_TX is in use, which will be first introduced in veth in next > commit. Is the plan that this tracepoint 'xdp:xdp_bulk_tx' should be used by all drivers? (more below) > Signed-off-by: Toshiaki Makita > --- > include/trace/events/xdp.h | 25 + > kernel/bpf/core.c | 1 + > 2 files changed, 26 insertions(+) > > diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h > index e95cb86..e06ea65 100644 > --- a/include/trace/events/xdp.h > +++ b/include/trace/events/xdp.h > @@ -50,6 +50,31 @@ > __entry->ifindex) > ); > > +TRACE_EVENT(xdp_bulk_tx, > + > + TP_PROTO(const struct net_device *dev, > + int sent, int drops, int err), > + > + TP_ARGS(dev, sent, drops, err), > + > + TP_STRUCT__entry( All other tracepoints in this file start with: __field(int, prog_id) __field(u32, act) or __field(int, map_id) __field(u32, act) Could you please add those? > + __field(int, ifindex) > + __field(int, drops) > + __field(int, sent) > + __field(int, err) > + ), The reason is that this makes it easier to attach to multiple tracepoints, and extract the same value. Example with bpftrace oneliner: $ sudo bpftrace -e 'tracepoint:xdp:xdp_* { @action[args->act] = count(); }' Attaching 8 probes... ^C @action[4]: 30259246 @action[0]: 34489024 XDP_ABORTED = 0 XDP_REDIRECT = 4 > + > + TP_fast_assign( __entry->act = XDP_TX; > + __entry->ifindex = dev->ifindex; > + __entry->drops = drops; > + __entry->sent = sent; > + __entry->err = err; > + ), > + > + TP_printk("ifindex=%d sent=%d drops=%d err=%d", > + __entry->ifindex, __entry->sent, __entry->drops, __entry->err) > +); > + Other fun bpftrace stuff: sudo bpftrace -e 'tracepoint:xdp:xdp_*map* { @map_id[comm, args->map_id] = count(); }' Attaching 5 probes...
^C @map_id[swapper/2, 113]: 1428 @map_id[swapper/0, 113]: 2085 @map_id[ksoftirqd/4, 113]: 2253491 @map_id[ksoftirqd/2, 113]: 25677560 @map_id[ksoftirqd/0, 113]: 29004338 @map_id[ksoftirqd/3, 113]: 31034885 $ bpftool map list id 113 113: devmap name tx_port flags 0x0 key 4B value 4B max_entries 100 memlock 4096B p.s. People should look out for Brendan Gregg's upcoming book on BPF performance tools, from which I learned to use bpftrace :-) -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH net-next 1/2] bpf_xdp_redirect_map: Add flag to return XDP_PASS on map lookup failure
On Tue, 04 Jun 2019 17:24:10 +0200 Toke Høiland-Jørgensen wrote: > The bpf_redirect_map() helper used by XDP programs doesn't return any > indication of whether it can successfully redirect to the map index it was > given. Instead, BPF programs have to track this themselves, leading to > programs using duplicate maps to track which entries are populated in the > devmap. > > This adds a flag to the XDP version of the bpf_redirect_map() helper, which > makes the helper do a lookup in the map when called, and return XDP_PASS if > there is no value at the provided index. This enables two use cases: To Jonathan Lemon, notice this approach of adding a flag to the helper call, it actually also works for your use-case of XSK AF_XDP maps. > - A BPF program can check the return code from the helper call and react if > it is XDP_PASS (by, for instance, redirecting out a different interface). > > - Programs that just return the value of the bpf_redirect() call will > automatically fall back to the regular networking stack, simplifying > programs that (for instance) build a router with the fib_lookup() helper. > > Signed-off-by: Toke Høiland-Jørgensen > --- > include/uapi/linux/bpf.h |8 > net/core/filter.c| 10 +- > 2 files changed, 17 insertions(+), 1 deletion(-) > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > index 7c6aef253173..4c41482b7604 100644 > --- a/include/uapi/linux/bpf.h > +++ b/include/uapi/linux/bpf.h > @@ -3098,6 +3098,14 @@ enum xdp_action { > XDP_REDIRECT, > }; > > +/* Flags for bpf_xdp_redirect_map helper */ > + > +/* If set, the help will check if the entry exists in the map and return > + * XDP_PASS if it doesn't. 
> + */ > +#define XDP_REDIRECT_PASS_ON_INVALID BIT(0) > +#define XDP_REDIRECT_ALL_FLAGS XDP_REDIRECT_PASS_ON_INVALID > + > /* user accessible metadata for XDP packet hook > * new fields must be added to the end of this structure > */ > diff --git a/net/core/filter.c b/net/core/filter.c > index 55bfc941d17a..dfab8478f66c 100644 > --- a/net/core/filter.c > +++ b/net/core/filter.c > @@ -3755,9 +3755,17 @@ BPF_CALL_3(bpf_xdp_redirect_map, struct bpf_map *, > map, u32, ifindex, > { > struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); > > - if (unlikely(flags)) > + if (unlikely(flags & ~XDP_REDIRECT_ALL_FLAGS)) > return XDP_ABORTED; > > + if (flags & XDP_REDIRECT_PASS_ON_INVALID) { > + struct net_device *fwd; It is slightly misleading that '*fwd' is a 'struct net_device', as the __xdp_map_lookup_elem() call works for all the supported redirect-map types. People should realize that this patch is a general approach for all the redirect-map types. > + > + fwd = __xdp_map_lookup_elem(map, ifindex); > + if (unlikely(!fwd)) > + return XDP_PASS; > + } > + > ri->ifindex = ifindex; > ri->flags = flags; > WRITE_ONCE(ri->map, map); -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH net-next v2 2/2] devmap: Allow map lookups from eBPF
On Thu, 06 Jun 2019 15:24:14 +0200 Toke Høiland-Jørgensen wrote: > From: Toke Høiland-Jørgensen > > We don't currently allow lookups into a devmap from eBPF, because the map > lookup returns a pointer directly to the dev->ifindex, which shouldn't be > modifiable from eBPF. > > However, being able to do lookups in devmaps is useful to know (e.g.) > whether forwarding to a specific interface is enabled. Currently, programs > work around this by keeping a shadow map of another type which indicates > whether a map index is valid. > > Since we now have a flag to make maps read-only from the eBPF side, we can > simply lift the lookup restriction if we make sure this flag is always set. Nice, I didn't know this was possible. I like it! :-) > Signed-off-by: Toke Høiland-Jørgensen > --- > kernel/bpf/devmap.c |5 + > kernel/bpf/verifier.c |7 ++- > 2 files changed, 7 insertions(+), 5 deletions(-) > > diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c > index 5ae7cce5ef16..0e6875a462ef 100644 > --- a/kernel/bpf/devmap.c > +++ b/kernel/bpf/devmap.c > @@ -99,6 +99,11 @@ static struct bpf_map *dev_map_alloc(union bpf_attr *attr) > attr->value_size != 4 || attr->map_flags & ~DEV_CREATE_FLAG_MASK) > return ERR_PTR(-EINVAL); > > + /* Lookup returns a pointer straight to dev->ifindex, so make sure the > + * verifier prevents writes from the BPF side > + */ > + attr->map_flags |= BPF_F_RDONLY_PROG; > + > dtab = kzalloc(sizeof(*dtab), GFP_USER); > if (!dtab) > return ERR_PTR(-ENOMEM); > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c > index 5c2cb5bd84ce..7128a9821481 100644 > --- a/kernel/bpf/verifier.c > +++ b/kernel/bpf/verifier.c > @@ -2893,12 +2893,9 @@ static int check_map_func_compatibility(struct > bpf_verifier_env *env, > if (func_id != BPF_FUNC_get_local_storage) > goto error; > break; > - /* devmap returns a pointer to a live net_device ifindex that we cannot > - * allow to be modified from bpf side. So do not allow lookup elements > - * for now. 
> - */ > case BPF_MAP_TYPE_DEVMAP: > - if (func_id != BPF_FUNC_redirect_map) > + if (func_id != BPF_FUNC_redirect_map && > + func_id != BPF_FUNC_map_lookup_elem) > goto error; > break; > /* Restrict bpf side of cpumap and xskmap, open when use-cases > -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH v2 bpf-next 1/2] xdp: Add tracepoint for bulk XDP_TX
On Thu, 6 Jun 2019 20:04:20 +0900 Toshiaki Makita wrote: > On 2019/06/05 16:59, Jesper Dangaard Brouer wrote: > > On Wed, 5 Jun 2019 14:36:12 +0900 > > Toshiaki Makita wrote: > > > >> This is introduced for admins to check what is happening on XDP_TX when > >> bulk XDP_TX is in use, which will be first introduced in veth in next > >> commit. > > > > Is the plan that this tracepoint 'xdp:xdp_bulk_tx' should be used by > > all drivers? > > I guess you mean all drivers that implement similar mechanism should use > this? Then yes. > (I don't think all drivers needs bulk tx mechanism though) > > > (more below) > > > >> Signed-off-by: Toshiaki Makita > >> --- > >> include/trace/events/xdp.h | 25 + > >> kernel/bpf/core.c | 1 + > >> 2 files changed, 26 insertions(+) > >> > >> diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h > >> index e95cb86..e06ea65 100644 > >> --- a/include/trace/events/xdp.h > >> +++ b/include/trace/events/xdp.h > >> @@ -50,6 +50,31 @@ > >> __entry->ifindex) > >> ); > >> > >> +TRACE_EVENT(xdp_bulk_tx, > >> + > >> + TP_PROTO(const struct net_device *dev, > >> + int sent, int drops, int err), > >> + > >> + TP_ARGS(dev, sent, drops, err), > >> + > >> + TP_STRUCT__entry( > > > > All other tracepoints in this file starts with: > > > > __field(int, prog_id) > > __field(u32, act) > > or > > __field(int, map_id) > > __field(u32, act) > > > > Could you please add those? > > So... prog_id is the problem. The program can be changed while we are > enqueueing packets to the bulk queue, so the prog_id at flush may be an > unexpected one. Hmmm... that sounds problematic, if the XDP bpf_prog for veth can change underneath, before the flush. Our redirect system depends on things being stable until the xdp_do_flush_map() operation, as we will e.g. set the per-CPU (bpf_redirect_info) map_to_flush pointer (which depends on the XDP prog), and expect it to be correct/valid. > It can be fixed by disabling NAPI when changing XDP programs.
This stops > packet processing while changing XDP programs, but I guess it is an > acceptable compromise. Having said that, I'm honestly not so eager to > make this change, since this will require refurbishment of one of the > most delicate part of veth XDP, NAPI disabling/enabling mechanism. > > WDYT? Sounds like a bug if the XDP bpf_prog is not stable within the NAPI poll... > >> + __field(int, ifindex) > >> + __field(int, drops) > >> + __field(int, sent) > >> + __field(int, err) > >> + ), > > > > The reason is that this make is easier to attach to multiple > > tracepoints, and extract the same value. > > > > Example with bpftrace oneliner: > > > > $ sudo bpftrace -e 'tracepoint:xdp:xdp_* { @action[args->act] = count(); }' -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH v2 bpf-next 1/2] xdp: Add tracepoint for bulk XDP_TX
On Fri, 7 Jun 2019 11:22:00 +0900 Toshiaki Makita wrote: > On 2019/06/07 4:41, Jesper Dangaard Brouer wrote: > > On Thu, 6 Jun 2019 20:04:20 +0900 > > Toshiaki Makita wrote: > > > >> On 2019/06/05 16:59, Jesper Dangaard Brouer wrote: > >>> On Wed, 5 Jun 2019 14:36:12 +0900 > >>> Toshiaki Makita wrote: > >>> [...] > >> > >> So... prog_id is the problem. The program can be changed while we are > >> enqueueing packets to the bulk queue, so the prog_id at flush may be an > >> unexpected one. > > > > Hmmm... that sounds problematic, if the XDP bpf_prog for veth can > > change underneath, before the flush. Our redirect system, depend on > > things being stable until the xdp_do_flush_map() operation, as will > > e.g. set per-CPU (bpf_redirect_info) map_to_flush pointer (which depend > > on XDP prog), and expect it to be correct/valid. > > Sorry, I don't get how maps depend on programs. BPF/XDP programs have a reference count on the map (e.g. used for redirect) and when the XDP program is removed and the last refcnt for the map is reached, then the map is also removed (redirect maps do a call_rcu on shutdown). > At least xdp_do_redirect_map() handles map_to_flush change during NAPI. > Is there a problem when the map is not changed but the program is changed? > Also I believe this is not veth-specific behavior. Looking at tun and > i40e, they seem to change xdp_prog without stopping data path. I guess this could actually happen, but we are "saved" because the 'map_to_flush' pointer is still valid due to RCU protection. But it does look fishy, as our rcu_read_lock sections do not encapsulate this. There is an RCU read section in veth_xdp_rcv_skb(), which can call xdp_do_redirect(), which sets the per-CPU ri->map_to_flush. Do we get this protection by running under softirq, and does this prevent an RCU grace-period (call_rcu callbacks) from happening?
(between veth_xdp_rcv_skb() and xdp_do_flush_map() in veth_poll()) To Toshiaki, regarding your patch 2/2, you are not affected by this per-CPU map storing, as you pass along the bulk-queue. I do see your point that prog_id could change. Could you change the tracepoint to include the 'act' and place 'ifindex' above this in the struct? This way the 'act' member is in the same location/offset as other XDP tracepoints. I see the 'ifindex' as the identifier for this tracepoint (others have map_id or prog_id in this location). -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH bpf-next v4 3/6] xdp: Add devmap_hash map type for looking up devices by hashed index
On Mon, 22 Jul 2019 13:52:48 +0200 Toke Høiland-Jørgensen wrote: > +static inline struct hlist_head *dev_map_index_hash(struct bpf_dtab *dtab, > + int idx) > +{ > + return &dtab->dev_index_head[idx & (NETDEV_HASHENTRIES - 1)]; > +} It is good for performance that our "hash" function is simply an AND operation on the idx. We want to keep it this way. I don't like that you are using NETDEV_HASHENTRIES, because the BPF map infrastructure already have a way to specify the map size (struct bpf_map_def .max_entries). BUT for performance reasons, to keep the AND operation, we would need to round up the hash-array size to nearest power of 2 (or reject if user didn't specify a power of 2, if we want to "expose" this limit to users). > +struct bpf_dtab_netdev *__dev_map_hash_lookup_elem(struct bpf_map *map, u32 > key) > +{ > + struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map); > + struct hlist_head *head = dev_map_index_hash(dtab, key); > + struct bpf_dtab_netdev *dev; > + > + hlist_for_each_entry_rcu(dev, head, index_hlist) > + if (dev->idx == key) > + return dev; > + > + return NULL; > +} -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH bpf-next v4 3/6] xdp: Add devmap_hash map type for looking up devices by hashed index
On Thu, 25 Jul 2019 12:32:19 +0200 Toke Høiland-Jørgensen wrote: > Jesper Dangaard Brouer writes: > > > On Mon, 22 Jul 2019 13:52:48 +0200 > > Toke Høiland-Jørgensen wrote: > > > >> +static inline struct hlist_head *dev_map_index_hash(struct bpf_dtab *dtab, > >> + int idx) > >> +{ > >> + return &dtab->dev_index_head[idx & (NETDEV_HASHENTRIES - 1)]; > >> +} > > > > It is good for performance that our "hash" function is simply an AND > > operation on the idx. We want to keep it this way. > > > > I don't like that you are using NETDEV_HASHENTRIES, because the BPF map > > infrastructure already have a way to specify the map size (struct > > bpf_map_def .max_entries). BUT for performance reasons, to keep the > > AND operation, we would need to round up the hash-array size to nearest > > power of 2 (or reject if user didn't specify a power of 2, if we want > > to "expose" this limit to users). > > But do we really want the number of hash buckets to be equal to the max > number of entries? The values are not likely to be evenly distributed, > so we'll end up with big buckets if the number is small, meaning we'll > blow performance on walking long lists in each bucket. The requested change makes it user-configurable, instead of fixed 256 entries. I've seen production use-cases with >5000 net_devices, thus they need a knob to increase this (to avoid the list walking as you mention). > Also, if the size is dynamic the size needs to be loaded from memory > instead of being a compile-time constant, which will presumably hurt > performance (though not sure by how much)? To counter this, the mask value, which needs to be loaded from memory, needs to be placed next to some other struct member which is already in use (at least on the same cacheline; Intel has some 16-byte access micro-optimizations, which I've never been able to measure, as it's in the 0.5 nanosec scale).
-- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
[PATCH net-next] MAINTAINERS: Remove mailing-list entry for XDP (eXpress Data Path)
This removes the mailing list xdp-newb...@vger.kernel.org from the XDP kernel maintainers entry. Being in the kernel MAINTAINERS file successfully caused the list to receive kbuild bot warnings, syzbot reports and sometimes developer patches. The level of details in these messages, doesn't match the target audience of the XDP-newbies list. This is based on a survey on the mailing list, where 73% voted for removal from MAINTAINERS file. Signed-off-by: Jesper Dangaard Brouer --- MAINTAINERS |1 - 1 file changed, 1 deletion(-) diff --git a/MAINTAINERS b/MAINTAINERS index 9cc156c58f0c..45cb4237eddc 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -17560,7 +17560,6 @@ M: Jakub Kicinski M: Jesper Dangaard Brouer M: John Fastabend L: netdev@vger.kernel.org -L: xdp-newb...@vger.kernel.org L: b...@vger.kernel.org S: Supported F: net/core/xdp.c
Re: [PATCH bpf-next v5 1/6] include/bpf.h: Remove map_insert_ctx() stubs
On Fri, 26 Jul 2019 18:06:52 +0200 Toke Høiland-Jørgensen wrote: > From: Toke Høiland-Jørgensen > > When we changed the device and CPU maps to use linked lists instead of > bitmaps, we also removed the need for the map_insert_ctx() helpers to keep > track of the bitmaps inside each map. However, it seems I forgot to remove > the function definitions stubs, so remove those here. > > Signed-off-by: Toke Høiland-Jørgensen > Acked-by: Yonghong Song Acked-by: Jesper Dangaard Brouer -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH bpf-next v5 2/6] xdp: Refactor devmap allocation code for reuse
On Fri, 26 Jul 2019 18:06:53 +0200 Toke Høiland-Jørgensen wrote: > From: Toke Høiland-Jørgensen > > The subsequent patch to add a new devmap sub-type can re-use much of the > initialisation and allocation code, so refactor it into separate functions. > > Signed-off-by: Toke Høiland-Jørgensen > Acked-by: Yonghong Song Acked-by: Jesper Dangaard Brouer -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH bpf-next v5 3/6] xdp: Add devmap_hash map type for looking up devices by hashed index
On Fri, 26 Jul 2019 18:06:55 +0200 Toke Høiland-Jørgensen wrote: > From: Toke Høiland-Jørgensen > > A common pattern when using xdp_redirect_map() is to create a device map > where the lookup key is simply ifindex. Because device maps are arrays, > this leaves holes in the map, and the map has to be sized to fit the > largest ifindex, regardless of how many devices actually are actually > needed in the map. > > This patch adds a second type of device map where the key is looked up > using a hashmap, instead of being used as an array index. This allows maps > to be densely packed, so they can be smaller. > > Signed-off-by: Toke Høiland-Jørgensen > Acked-by: Yonghong Song Acked-by: Jesper Dangaard Brouer [...] > +static inline struct hlist_head *dev_map_index_hash(struct bpf_dtab *dtab, > + int idx) > +{ > + return &dtab->dev_index_head[idx & (dtab->n_buckets - 1)]; > +} I was about to complain that you are not using a pre-calculated MASK value, and are instead doing the -1 operation each time. But I looked at the ASM code, and the LEA operation used does the -1 operation in the same instruction, so I guess this makes no performance difference. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH bpf-next v5 4/6] tools/include/uapi: Add devmap_hash BPF map type
On Fri, 26 Jul 2019 18:06:56 +0200 Toke Høiland-Jørgensen wrote: > From: Toke Høiland-Jørgensen > > This adds the devmap_hash BPF map type to the uapi headers in tools/. > > Signed-off-by: Toke Høiland-Jørgensen > Acked-by: Yonghong Song Acked-by: Jesper Dangaard Brouer -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH bpf-next v5 5/6] tools/libbpf_probes: Add new devmap_hash type
On Fri, 26 Jul 2019 18:06:57 +0200 Toke Høiland-Jørgensen wrote: > From: Toke Høiland-Jørgensen > > This adds the definition for BPF_MAP_TYPE_DEVMAP_HASH to libbpf_probes.c in > tools/lib/bpf. > > Signed-off-by: Toke Høiland-Jørgensen > Acked-by: Yonghong Song Acked-by: Jesper Dangaard Brouer http://www.linkedin.com/in/brouer
Re: [PATCH bpf-next v5 6/6] tools: Add definitions for devmap_hash map type
On Fri, 26 Jul 2019 18:06:58 +0200 Toke Høiland-Jørgensen wrote: > From: Toke Høiland-Jørgensen > > This adds selftest and bpftool updates for the devmap_hash map type. > > Signed-off-by: Toke Høiland-Jørgensen Acked-by: Jesper Dangaard Brouer -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH 0/2] tools: bpftool: add net (un)load command to load XDP
version (in /usr/local) it was not first in $PATH for root. Load order also matters, see issue#3. [3] https://github.com/xdp-project/xdp-cpumap-tc/blob/master/src/common_user.c#L381-L493 (Issue#3) Next, to get TC and XDP programs to cooperate, pinned maps via bpffs are used. And iproute2 dictates that pinned maps are located in directory /sys/fs/bpf/tc/globals/. What could go wrong, it's only a static dir path. Sub-problem(1): If XDP loads _before_ any tc-bpf cmd, then the subdirs are not created, leading to replicating mkdir creation in C[4]. Else the XDP load will fail. (Troubleshooting this was complicated by [4] https://github.com/xdp-project/xdp-cpumap-tc/commit/25e7e56699cd75a4a Sub-problem(2): We really want to load XDP first, because libbpf creates maps in a better way, e.g. with "name" (and BTF info). The "name" part was needed by libbpf to find a given map (to avoid depending on the order maps are defined in C/ELF file) via helper bpf_object__find_map_fd_by_name(). To handle the TC noname case, I had to code up this workaround[5], which depends on extracting the name of the pinned file. [5] https://github.com/xdp-project/xdp-cpumap-tc/blob/master/src/xdp_iphash_to_cpu_user.c#L117-L131 Experience/conclusion: Getting XDP and TC to cooperate sucks, primarily because iproute2 tc is based on a separate BPF-ELF loader, whose features are not in sync / lagging behind. Calling cmdline binaries from C also sucks, and I would prefer some libbpf TC attach function, but most pain comes from the slight differences between ELF-loaders. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH 1/2] tools: bpftool: add net load command to load XDP on interface
On Wed, 31 Jul 2019 03:48:20 +0900 "Daniel T. Lee" wrote: > diff --git a/tools/bpf/bpftool/net.c b/tools/bpf/bpftool/net.c > index 67e99c56bc88..d3a4f18b5b95 100644 > --- a/tools/bpf/bpftool/net.c > +++ b/tools/bpf/bpftool/net.c > @@ -55,6 +55,35 @@ struct bpf_attach_info { > __u32 flow_dissector_id; > }; > > +enum net_load_type { > + NET_LOAD_TYPE_XDP, > + NET_LOAD_TYPE_XDP_GENERIC, > + NET_LOAD_TYPE_XDP_DRIVE, Why "DRIVE" ? Why not "DRIVER" ? > + NET_LOAD_TYPE_XDP_OFFLOAD, > + __MAX_NET_LOAD_TYPE > +}; > + > +static const char * const load_type_strings[] = { > + [NET_LOAD_TYPE_XDP] = "xdp", > + [NET_LOAD_TYPE_XDP_GENERIC] = "xdpgeneric", > + [NET_LOAD_TYPE_XDP_DRIVE] = "xdpdrv", > + [NET_LOAD_TYPE_XDP_OFFLOAD] = "xdpoffload", > + [__MAX_NET_LOAD_TYPE] = NULL, > +}; -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH 1/2] tools: bpftool: add net load command to load XDP on interface
On Wed, 31 Jul 2019 03:48:20 +0900 "Daniel T. Lee" wrote: > By this commit, using `bpftool net load`, user can load XDP prog on > interface. New type of enum 'net_load_type' has been made, as stated at > cover-letter, the meaning of 'load' is, prog will be loaded on interface. Why the keyword "load"? Why not "attach" (and "detach")? For BPF there is a clear distinction between the "load" and "attach" steps. I know this is under subcommand "net", but to follow the convention of other subcommands e.g. "prog" there are both "load" and "attach" commands. > BPF prog will be loaded through libbpf 'bpf_set_link_xdp_fd'. Again this is a "set" operation, not a "load" operation. > Signed-off-by: Daniel T. Lee [...] > static int do_show(int argc, char **argv) > { > struct bpf_attach_info attach_info = {}; > @@ -305,13 +405,17 @@ static int do_help(int argc, char **argv) > > fprintf(stderr, > "Usage: %s %s { show | list } [dev ]\n" > + " %s %s load PROG LOAD_TYPE \n" Does the "PROG" here correspond to the 'bpftool prog' syntax?: PROG := { id PROG_ID | pinned FILE | tag PROG_TAG } > " %s %s help\n" > + "\n" > + " " HELP_SPEC_PROGRAM "\n" > + " LOAD_TYPE := { xdp | xdpgeneric | xdpdrv | xdpoffload > }\n" > "Note: Only xdp and tc attachments are supported now.\n" > " For progs attached to cgroups, use \"bpftool cgroup\"\n" > " to dump program attachments. For program types\n" > " sk_{filter,skb,msg,reuseport} and lwt/seg6, please\n" > " consult iproute2.\n", -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
[net v1 PATCH 2/4] selftests/bpf: add wrapper scripts for test_xdp_vlan.sh
In-order to test both native-XDP (xdpdrv) and generic-XDP (xdpgeneric) create two wrapper test scripts, that start the test_xdp_vlan.sh script with these modes. Signed-off-by: Jesper Dangaard Brouer --- tools/testing/selftests/bpf/Makefile |3 ++- tools/testing/selftests/bpf/test_xdp_vlan.sh |5 - .../selftests/bpf/test_xdp_vlan_mode_generic.sh|9 + .../selftests/bpf/test_xdp_vlan_mode_native.sh |9 + 4 files changed, 24 insertions(+), 2 deletions(-) create mode 100755 tools/testing/selftests/bpf/test_xdp_vlan_mode_generic.sh create mode 100755 tools/testing/selftests/bpf/test_xdp_vlan_mode_native.sh diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile index 3bd0f4a0336a..29001f944db7 100644 --- a/tools/testing/selftests/bpf/Makefile +++ b/tools/testing/selftests/bpf/Makefile @@ -57,7 +57,8 @@ TEST_PROGS := test_kmod.sh \ test_lirc_mode2.sh \ test_skb_cgroup_id.sh \ test_flow_dissector.sh \ - test_xdp_vlan.sh \ + test_xdp_vlan_mode_generic.sh \ + test_xdp_vlan_mode_native.sh \ test_lwt_ip_encap.sh \ test_tcp_check_syncookie.sh \ test_tc_tunnel.sh \ diff --git a/tools/testing/selftests/bpf/test_xdp_vlan.sh b/tools/testing/selftests/bpf/test_xdp_vlan.sh index c8aed63b0ffe..7348661be815 100755 --- a/tools/testing/selftests/bpf/test_xdp_vlan.sh +++ b/tools/testing/selftests/bpf/test_xdp_vlan.sh @@ -2,7 +2,10 @@ # SPDX-License-Identifier: GPL-2.0 # Author: Jesper Dangaard Brouer -TESTNAME=xdp_vlan +# Allow wrapper scripts to name test +if [ -z "$TESTNAME" ]; then +TESTNAME=xdp_vlan +fi # Default XDP mode XDP_MODE=xdpgeneric diff --git a/tools/testing/selftests/bpf/test_xdp_vlan_mode_generic.sh b/tools/testing/selftests/bpf/test_xdp_vlan_mode_generic.sh new file mode 100755 index ..c515326d6d59 --- /dev/null +++ b/tools/testing/selftests/bpf/test_xdp_vlan_mode_generic.sh @@ -0,0 +1,9 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 + +# Exit on failure +set -e + +# Wrapper script to test generic-XDP +export 
TESTNAME=xdp_vlan_mode_generic +./test_xdp_vlan.sh --mode=xdpgeneric diff --git a/tools/testing/selftests/bpf/test_xdp_vlan_mode_native.sh b/tools/testing/selftests/bpf/test_xdp_vlan_mode_native.sh new file mode 100755 index ..5cf7ce1f16c1 --- /dev/null +++ b/tools/testing/selftests/bpf/test_xdp_vlan_mode_native.sh @@ -0,0 +1,9 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 + +# Exit on failure +set -e + +# Wrapper script to test native-XDP +export TESTNAME=xdp_vlan_mode_native +./test_xdp_vlan.sh --mode=xdpdrv
[net v1 PATCH 0/4] net: fix regressions for generic-XDP
Thanks to Brandon Cazander, who wrote a very detailed bug report that even used perf probes, on the xdp-newbies mailing list, we discovered that generic-XDP contains some regressions when using bpf_xdp_adjust_head(). The first issue was that my selftest script, which uses bpf_xdp_adjust_head(), by mistake no longer used generic-XDP. That selftest should have caught the real regression introduced in commit 458bf2f224f0 ("net: core: support XDP generic on stacked devices."). To verify that this patchset fixes the regressions, the tests can be invoked manually via: cd tools/testing/selftests/bpf/ sudo ./test_xdp_vlan_mode_generic.sh sudo ./test_xdp_vlan_mode_native.sh Link: https://www.spinics.net/lists/xdp-newbies/msg01231.html Fixes: 458bf2f224f0 ("net: core: support XDP generic on stacked devices.") Reported-by: Brandon Cazander Signed-off-by: Jesper Dangaard Brouer --- Jesper Dangaard Brouer (4): bpf: fix XDP vlan selftests test_xdp_vlan.sh selftests/bpf: add wrapper scripts for test_xdp_vlan.sh selftests/bpf: reduce time to execute test_xdp_vlan.sh net: fix bpf_xdp_adjust_head regression for generic-XDP net/core/dev.c | 15 - tools/testing/selftests/bpf/Makefile |3 + tools/testing/selftests/bpf/test_xdp_vlan.sh | 57 .../selftests/bpf/test_xdp_vlan_mode_generic.sh|9 +++ .../selftests/bpf/test_xdp_vlan_mode_native.sh |9 +++ 5 files changed, 75 insertions(+), 18 deletions(-) create mode 100755 tools/testing/selftests/bpf/test_xdp_vlan_mode_generic.sh create mode 100755 tools/testing/selftests/bpf/test_xdp_vlan_mode_native.sh --
[net v1 PATCH 1/4] bpf: fix XDP vlan selftests test_xdp_vlan.sh
Change the BPF selftest test_xdp_vlan.sh to (by default) use generic XDP. This selftest was created together with a fix for generic XDP, in commit 297249569932 ("net: fix generic XDP to handle if eth header was mangled"), and it was supposed to catch generic XDP being broken again. The tests use veth and assumed that the veth driver didn't support native driver XDP, thus they used the (ip link set) 'xdp' attach mode, which fell back to generic-XDP. But veth gained native-XDP support in commit 948d4f214fde ("veth: Add driver XDP"), which caused this test script to use native-XDP. Fixes: 948d4f214fde ("veth: Add driver XDP") Fixes: 97396ff0bc2d ("selftests/bpf: add XDP selftests for modifying and popping VLAN headers") Signed-off-by: Jesper Dangaard Brouer --- tools/testing/selftests/bpf/test_xdp_vlan.sh | 42 ++ 1 file changed, 36 insertions(+), 6 deletions(-) diff --git a/tools/testing/selftests/bpf/test_xdp_vlan.sh b/tools/testing/selftests/bpf/test_xdp_vlan.sh index 51a3a31d1aac..c8aed63b0ffe 100755 --- a/tools/testing/selftests/bpf/test_xdp_vlan.sh +++ b/tools/testing/selftests/bpf/test_xdp_vlan.sh @@ -1,7 +1,12 @@ #!/bin/bash +# SPDX-License-Identifier: GPL-2.0 +# Author: Jesper Dangaard Brouer TESTNAME=xdp_vlan +# Default XDP mode +XDP_MODE=xdpgeneric + usage() { echo "Testing XDP + TC eBPF VLAN manipulations: $TESTNAME" echo "" @@ -9,9 +14,23 @@ usage() { echo " -v | --verbose : Verbose" echo " --flush: Flush before starting (e.g. after --interactive)" echo " --interactive : Keep netns setup running after test-run" + echo " --mode=XXX : Choose XDP mode (xdp | xdpgeneric | xdpdrv)" echo "" } +valid_xdp_mode() +{ + local mode=$1 + + case "$mode" in + xdpgeneric | xdpdrv | xdp) + return 0 + ;; + *) + return 1 + esac +} + cleanup() { local status=$? @@ -37,7 +56,7 @@ cleanup() # Using external program "getopt" to get --long-options OPTIONS=$(getopt -o hvfi: \ ---long verbose,flush,help,interactive,debug -- "$@") +--long verbose,flush,help,interactive,debug,mode: -- "$@") if (( $?
!= 0 )); then usage echo "selftests: $TESTNAME [FAILED] Error calling getopt, unknown option?" @@ -60,6 +79,11 @@ while true; do cleanup shift ;; + --mode ) + shift + XDP_MODE=$1 + shift + ;; -- ) shift break @@ -81,8 +105,14 @@ if [ "$EUID" -ne 0 ]; then exit 1 fi -ip link set dev lo xdp off 2>/dev/null > /dev/null -if [ $? -ne 0 ];then +valid_xdp_mode $XDP_MODE +if [ $? -ne 0 ]; then + echo "selftests: $TESTNAME [FAILED] unknown XDP mode ($XDP_MODE)" + exit 1 +fi + +ip link set dev lo xdpgeneric off 2>/dev/null > /dev/null +if [ $? -ne 0 ]; then echo "selftests: $TESTNAME [SKIP] need ip xdp support" exit 0 fi @@ -166,7 +196,7 @@ export FILE=test_xdp_vlan.o # First test: Remove VLAN by setting VLAN ID 0, using "xdp_vlan_change" export XDP_PROG=xdp_vlan_change -ip netns exec ns1 ip link set $DEVNS1 xdp object $FILE section $XDP_PROG +ip netns exec ns1 ip link set $DEVNS1 $XDP_MODE object $FILE section $XDP_PROG # In ns1: egress use TC to add back VLAN tag 4011 # (del cmd) @@ -187,8 +217,8 @@ ip netns exec ns1 ping -W 2 -c 3 $IPADDR2 # ETH_P_8021Q indication, and this cause overwriting of our changes. # export XDP_PROG=xdp_vlan_remove_outer2 -ip netns exec ns1 ip link set $DEVNS1 xdp off -ip netns exec ns1 ip link set $DEVNS1 xdp object $FILE section $XDP_PROG +ip netns exec ns1 ip link set $DEVNS1 $XDP_MODE off +ip netns exec ns1 ip link set $DEVNS1 $XDP_MODE object $FILE section $XDP_PROG # Now the namespaces should still be able reach each-other, test with ping: ip netns exec ns2 ping -W 2 -c 3 $IPADDR1
[net v1 PATCH 4/4] net: fix bpf_xdp_adjust_head regression for generic-XDP
When generic-XDP was moved to a later processing step by commit 458bf2f224f0 ("net: core: support XDP generic on stacked devices.") a regression was introduced when using bpf_xdp_adjust_head. The issue is that after this commit the skb->network_header is now changed prior to calling generic XDP and not after. Thus, if the header is changed by XDP (via bpf_xdp_adjust_head), then skb->network_header also needs to be updated again. Fix this by calling skb_reset_network_header(). Fixes: 458bf2f224f0 ("net: core: support XDP generic on stacked devices.") Reported-by: Brandon Cazander Signed-off-by: Jesper Dangaard Brouer --- net/core/dev.c | 15 ++- 1 file changed, 10 insertions(+), 5 deletions(-) diff --git a/net/core/dev.c b/net/core/dev.c index fc676b2610e3..b5533795c3c1 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -4374,12 +4374,17 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb, act = bpf_prog_run_xdp(xdp_prog, xdp); + /* check if bpf_xdp_adjust_head was used */ off = xdp->data - orig_data; - if (off > 0) - __skb_pull(skb, off); - else if (off < 0) - __skb_push(skb, -off); - skb->mac_header += off; + if (off) { + if (off > 0) + __skb_pull(skb, off); + else if (off < 0) + __skb_push(skb, -off); + + skb->mac_header += off; + skb_reset_network_header(skb); + } /* check if bpf_xdp_adjust_tail was used. it can only "shrink" * pckt.
[net v1 PATCH 3/4] selftests/bpf: reduce time to execute test_xdp_vlan.sh
Given the increasing number of BPF selftests, it makes sense to reduce the time to execute these tests. The ping parameters are adjusted to reduce the time from a measured 9 sec to approx 2.8 sec. Signed-off-by: Jesper Dangaard Brouer --- tools/testing/selftests/bpf/test_xdp_vlan.sh | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/tools/testing/selftests/bpf/test_xdp_vlan.sh b/tools/testing/selftests/bpf/test_xdp_vlan.sh index 7348661be815..bb8b0da91686 100755 --- a/tools/testing/selftests/bpf/test_xdp_vlan.sh +++ b/tools/testing/selftests/bpf/test_xdp_vlan.sh @@ -188,7 +188,7 @@ ip netns exec ns2 ip link set lo up # At this point, the hosts cannot reach each-other, # because ns2 are using VLAN tags on the packets. -ip netns exec ns2 sh -c 'ping -W 1 -c 1 100.64.41.1 || echo "Okay ping fails"' +ip netns exec ns2 sh -c 'ping -W 1 -c 1 100.64.41.1 || echo "Success: First ping must fail"' # Now we can use the test_xdp_vlan.c program to pop/push these VLAN tags @@ -210,8 +210,8 @@ ip netns exec ns1 tc filter add dev $DEVNS1 egress \ prio 1 handle 1 bpf da obj $FILE sec tc_vlan_push # Now the namespaces can reach each-other, test with ping: -ip netns exec ns2 ping -W 2 -c 3 $IPADDR1 -ip netns exec ns1 ping -W 2 -c 3 $IPADDR2 +ip netns exec ns2 ping -i 0.2 -W 2 -c 2 $IPADDR1 +ip netns exec ns1 ping -i 0.2 -W 2 -c 2 $IPADDR2 # Second test: Replace xdp prog, that fully remove vlan header # @@ -224,5 +224,5 @@ ip netns exec ns1 ip link set $DEVNS1 $XDP_MODE off ip netns exec ns1 ip link set $DEVNS1 $XDP_MODE object $FILE section $XDP_PROG # Now the namespaces should still be able reach each-other, test with ping: -ip netns exec ns2 ping -W 2 -c 3 $IPADDR1 -ip netns exec ns1 ping -W 2 -c 3 $IPADDR2 +ip netns exec ns2 ping -i 0.2 -W 2 -c 2 $IPADDR1 +ip netns exec ns1 ping -i 0.2 -W 2 -c 2 $IPADDR2
Re: [net v1 PATCH 4/4] net: fix bpf_xdp_adjust_head regression for generic-XDP
On Thu, 1 Aug 2019 17:44:06 -0700 Jakub Kicinski wrote: > On Thu, 01 Aug 2019 20:00:31 +0200, Jesper Dangaard Brouer wrote: > > When generic-XDP was moved to a later processing step by commit > > 458bf2f224f0 ("net: core: support XDP generic on stacked devices.") > > a regression was introduced when using bpf_xdp_adjust_head. > > > > The issue is that after this commit the skb->network_header is now > > changed prior to calling generic XDP and not after. Thus, if the header > > is changed by XDP (via bpf_xdp_adjust_head), then skb->network_header > > also need to be updated again. Fix by calling skb_reset_network_header(). > > > > Fixes: 458bf2f224f0 ("net: core: support XDP generic on stacked devices.") > > Reported-by: Brandon Cazander > > Signed-off-by: Jesper Dangaard Brouer > > Out of curiosity what was your conclusion regarding resetting the > transport header as well? Well, I don't know... I need some review, e.g. from Stephen, who changed this. I've added code snippets below my signature to help reviewers (they also help in understanding the paragraph below). I think we perhaps should call skb_reset_transport_header(), as we change skb->data (via either __skb_pull() or __skb_push()), *BUT* I'm not sure it is needed/required, as someone/something afterwards still needs to call skb_set_transport_header(), which also calls skb_reset_transport_header() anyway. I also want to ask reviewers whether we should move the call to skb_reset_mac_len() (which Stephen added) from __netif_receive_skb_core() into netif_receive_generic_xdp() (after the call to skb_reset_network_header()); it would be more natural to keep them together.
--
Best regards,
Jesper Dangaard Brouer

Code snippet

	/* check if bpf_xdp_adjust_head was used */
	off = xdp->data - orig_data;
	if (off) {
		if (off > 0)
			__skb_pull(skb, off);
		else if (off < 0)
			__skb_push(skb, -off);

		skb->mac_header += off;
		skb_reset_network_header(skb);
	}

static inline bool skb_transport_header_was_set(const struct sk_buff *skb)
{
	return skb->transport_header != (typeof(skb->transport_header))~0U;
}

static inline unsigned char *skb_transport_header(const struct sk_buff *skb)
{
	return skb->head + skb->transport_header;
}

static inline void skb_reset_transport_header(struct sk_buff *skb)
{
	skb->transport_header = skb->data - skb->head;
}

static inline void skb_set_transport_header(struct sk_buff *skb, const int offset)
{
	skb_reset_transport_header(skb);
	skb->transport_header += offset;
}
[bpf-next PATCH 0/3] bpf: improvements to xdp_fwd sample
This patchset is focused on improvements for the XDP forwarding sample named xdp_fwd, which leverages the existing FIB routing tables as described in the LPC2018[1] talk by David Ahern. The primary motivation is to illustrate how Toke's recent work improves usability of XDP_REDIRECT via lookups in devmap. The other patches are to help users understand the sample. I have more improvements to xdp_fwd, but those might require changes to libbpf. Thus, I am sending these patches first as they are isolated. [1] http://vger.kernel.org/lpc-networking2018.html#session-1 --- Jesper Dangaard Brouer (3): samples/bpf: xdp_fwd rename devmap name to be xdp_tx_ports samples/bpf: make xdp_fwd more practically usable via devmap lookup samples/bpf: xdp_fwd explain bpf_fib_lookup return codes samples/bpf/xdp_fwd_kern.c | 42 +- samples/bpf/xdp_fwd_user.c | 38 ++ 2 files changed, 59 insertions(+), 21 deletions(-) --
[bpf-next PATCH 2/3] samples/bpf: make xdp_fwd more practically usable via devmap lookup
This addresses the TODO in samples/bpf/xdp_fwd_kern.c, which points out that the chosen egress index should be checked for existence in the devmap. This can now be done by taking advantage of Toke's work in commit 0cdbb4b09a06 ("devmap: Allow map lookups from eBPF"). This change makes xdp_fwd more practically usable, as it allows for a mixed environment where IP-forwarding falls back to the network stack if the egress device isn't configured to use XDP. Signed-off-by: Jesper Dangaard Brouer --- samples/bpf/xdp_fwd_kern.c | 20 ++-- samples/bpf/xdp_fwd_user.c | 36 +--- 2 files changed, 39 insertions(+), 17 deletions(-) diff --git a/samples/bpf/xdp_fwd_kern.c b/samples/bpf/xdp_fwd_kern.c index e6ffc4ea06f4..4a5ad381ed2a 100644 --- a/samples/bpf/xdp_fwd_kern.c +++ b/samples/bpf/xdp_fwd_kern.c @@ -104,13 +104,21 @@ static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags) rc = bpf_fib_lookup(ctx, &fib_params, sizeof(fib_params), flags); - /* verify egress index has xdp support -* TO-DO bpf_map_lookup_elem(&tx_port, &key) fails with -* cannot pass map_type 14 into func bpf_map_lookup_elem#1: -* NOTE: without verification that egress index supports XDP -* forwarding packets are dropped. -*/ if (rc == 0) { + int *val; + + /* Verify egress index has been configured as TX-port. +* (Note: User can still have inserted an egress ifindex that +* doesn't support XDP xmit, which will result in packet drops). +* +* Note: lookup in devmap supported since 0cdbb4b09a0.
+* If not supported will fail with: +* cannot pass map_type 14 into func bpf_map_lookup_elem#1: +*/ + val = bpf_map_lookup_elem(&tx_port, &fib_params.ifindex); + if (!val) + return XDP_PASS; + if (h_proto == htons(ETH_P_IP)) ip_decrease_ttl(iph); else if (h_proto == htons(ETH_P_IPV6)) diff --git a/samples/bpf/xdp_fwd_user.c b/samples/bpf/xdp_fwd_user.c index ba012d9f93dd..20951bc27477 100644 --- a/samples/bpf/xdp_fwd_user.c +++ b/samples/bpf/xdp_fwd_user.c @@ -27,14 +27,20 @@ #include "libbpf.h" #include - -static int do_attach(int idx, int fd, const char *name) +static int do_attach(int idx, int prog_fd, int map_fd, const char *name) { int err; - err = bpf_set_link_xdp_fd(idx, fd, 0); - if (err < 0) + err = bpf_set_link_xdp_fd(idx, prog_fd, 0); + if (err < 0) { printf("ERROR: failed to attach program to %s\n", name); + return err; + } + + /* Adding ifindex as a possible egress TX port */ + err = bpf_map_update_elem(map_fd, &idx, &idx, 0); + if (err) + printf("ERROR: failed using device %s as TX-port\n", name); return err; } @@ -47,6 +53,9 @@ static int do_detach(int idx, const char *name) if (err < 0) printf("ERROR: failed to detach program from %s\n", name); + /* TODO: Remember to cleanup map, when adding use of shared map +* bpf_map_delete_elem((map_fd, &idx); +*/ return err; } @@ -67,10 +76,10 @@ int main(int argc, char **argv) }; const char *prog_name = "xdp_fwd"; struct bpf_program *prog; + int prog_fd, map_fd = -1; char filename[PATH_MAX]; struct bpf_object *obj; int opt, i, idx, err; - int prog_fd, map_fd; int attach = 1; int ret = 0; @@ -103,8 +112,17 @@ int main(int argc, char **argv) return 1; } - if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd)) + err = bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd); + if (err) { + if (err == -22) { + printf("Does kernel support devmap lookup?\n"); + /* If not, the error message will be: +* "cannot pass map_type 14 into func +* bpf_map_lookup_elem#1" +*/ + } return 1; + } prog = 
bpf_object__find_program_by_title(obj, prog_name); prog_fd = bpf_program__fd(prog); @@ -119,10 +137,6 @@ int main(int argc, char **argv) return 1; } } - if (attach) { - for (i = 1; i < 64; ++i) - bpf_map_update_elem(map_fd, &i, &i, 0); - }
[bpf-next PATCH 1/3] samples/bpf: xdp_fwd rename devmap name to be xdp_tx_ports
The devmap name 'tx_port' came from a copy-paste from xdp_redirect_map, which only has a single TX port. Change the name to xdp_tx_ports to make it more descriptive. Signed-off-by: Jesper Dangaard Brouer --- samples/bpf/xdp_fwd_kern.c |5 +++-- samples/bpf/xdp_fwd_user.c |2 +- 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/samples/bpf/xdp_fwd_kern.c b/samples/bpf/xdp_fwd_kern.c index a7e94e7ff87d..e6ffc4ea06f4 100644 --- a/samples/bpf/xdp_fwd_kern.c +++ b/samples/bpf/xdp_fwd_kern.c @@ -23,7 +23,8 @@ #define IPV6_FLOWINFO_MASK cpu_to_be32(0x0FFF) -struct bpf_map_def SEC("maps") tx_port = { +/* For TX-traffic redirect requires net_device ifindex to be in this devmap */ +struct bpf_map_def SEC("maps") xdp_tx_ports = { .type = BPF_MAP_TYPE_DEVMAP, .key_size = sizeof(int), .value_size = sizeof(int), @@ -117,7 +118,7 @@ static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags) memcpy(eth->h_dest, fib_params.dmac, ETH_ALEN); memcpy(eth->h_source, fib_params.smac, ETH_ALEN); - return bpf_redirect_map(&tx_port, fib_params.ifindex, 0); + return bpf_redirect_map(&xdp_tx_ports, fib_params.ifindex, 0); } return XDP_PASS; diff --git a/samples/bpf/xdp_fwd_user.c b/samples/bpf/xdp_fwd_user.c index 5b46ee12c696..ba012d9f93dd 100644 --- a/samples/bpf/xdp_fwd_user.c +++ b/samples/bpf/xdp_fwd_user.c @@ -113,7 +113,7 @@ int main(int argc, char **argv) return 1; } map_fd = bpf_map__fd(bpf_object__find_map_by_name(obj, - "tx_port")); + "xdp_tx_ports")); if (map_fd < 0) { printf("map not found: %s\n", strerror(map_fd)); return 1;
[bpf-next PATCH 3/3] samples/bpf: xdp_fwd explain bpf_fib_lookup return codes
Make it clear that this XDP program depends on the network stack to do the ARP resolution. This is connected with the BPF_FIB_LKUP_RET_NO_NEIGH return code from bpf_fib_lookup(). Another common mistake (seen via XDP-tutorial) is that users don't realize that the sysctl net.ipv{4,6}.conf.all.forwarding setting is honored by bpf_fib_lookup. Reported-by: Anton Protopopov Signed-off-by: Jesper Dangaard Brouer --- samples/bpf/xdp_fwd_kern.c | 19 +-- 1 file changed, 17 insertions(+), 2 deletions(-) diff --git a/samples/bpf/xdp_fwd_kern.c b/samples/bpf/xdp_fwd_kern.c index 4a5ad381ed2a..df11163454e7 100644 --- a/samples/bpf/xdp_fwd_kern.c +++ b/samples/bpf/xdp_fwd_kern.c @@ -103,8 +103,23 @@ static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags) fib_params.ifindex = ctx->ingress_ifindex; rc = bpf_fib_lookup(ctx, &fib_params, sizeof(fib_params), flags); - - if (rc == 0) { + /* +* Some rc (return codes) from bpf_fib_lookup() are important, +* to understand how this XDP-prog interacts with network stack. +* +* BPF_FIB_LKUP_RET_NO_NEIGH: +* Even if route lookup was a success, then the MAC-addresses are also +* needed. This is obtained from arp/neighbour table, but if table is +* (still) empty then BPF_FIB_LKUP_RET_NO_NEIGH is returned. To avoid +* doing ARP lookup directly from XDP, then send packet to normal +* network stack via XDP_PASS and expect it will do ARP resolution. +* +* BPF_FIB_LKUP_RET_FWD_DISABLED: +* The bpf_fib_lookup respect sysctl net.ipv{4,6}.conf.all.forwarding +* setting, and will return BPF_FIB_LKUP_RET_FWD_DISABLED if not +* enabled this on ingress device. +*/ + if (rc == BPF_FIB_LKUP_RET_SUCCESS) { /* Verify egress index has been configured as TX-port. * (Note: User can still have inserted an egress ifindex that * doesn't support XDP xmit, which will result in packet drops).
Re: [bpf-next PATCH 2/3] samples/bpf: make xdp_fwd more practically usable via devmap lookup
On Wed, 7 Aug 2019 11:04:17 -0700 Y Song wrote: > On Wed, Aug 7, 2019 at 5:37 AM Jesper Dangaard Brouer > wrote: > > > > This address the TODO in samples/bpf/xdp_fwd_kern.c, which points out > > that the chosen egress index should be checked for existence in the > > devmap. This can now be done via taking advantage of Toke's work in > > commit 0cdbb4b09a06 ("devmap: Allow map lookups from eBPF"). > > > > This change makes xdp_fwd more practically usable, as this allows for > > a mixed environment, where IP-forwarding fallback to network stack, if > > the egress device isn't configured to use XDP. > > > > Signed-off-by: Jesper Dangaard Brouer > > --- > > samples/bpf/xdp_fwd_kern.c | 20 ++-- > > samples/bpf/xdp_fwd_user.c | 36 +--- > > 2 files changed, 39 insertions(+), 17 deletions(-) > > > > diff --git a/samples/bpf/xdp_fwd_kern.c b/samples/bpf/xdp_fwd_kern.c > > index e6ffc4ea06f4..4a5ad381ed2a 100644 > > --- a/samples/bpf/xdp_fwd_kern.c > > +++ b/samples/bpf/xdp_fwd_kern.c > > @@ -104,13 +104,21 @@ static __always_inline int xdp_fwd_flags(struct > > xdp_md *ctx, u32 flags) > > > > rc = bpf_fib_lookup(ctx, &fib_params, sizeof(fib_params), flags); > > > > - /* verify egress index has xdp support > > -* TO-DO bpf_map_lookup_elem(&tx_port, &key) fails with > > -* cannot pass map_type 14 into func bpf_map_lookup_elem#1: > > -* NOTE: without verification that egress index supports XDP > > -* forwarding packets are dropped. > > -*/ > > if (rc == 0) { > > + int *val; > > + > > + /* Verify egress index has been configured as TX-port. > > +* (Note: User can still have inserted an egress ifindex > > that > > +* doesn't support XDP xmit, which will result in packet > > drops). > > +* > > +* Note: lookup in devmap supported since 0cdbb4b09a0. > > +* If not supported will fail with: > > +* cannot pass map_type 14 into func bpf_map_lookup_elem#1: > > +*/ > > + val = bpf_map_lookup_elem(&tx_port, &fib_params.ifindex); > > It should be "xdp_tx_ports". 
Otherwise, you will have compilation errors. Ups. This happened in my rebase, where I moved the rename patch [1/3] before this one. Thanks for catching this! > > + if (!val) > > + return XDP_PASS; > > Also, maybe we can do > if (!bpf_map_lookup_elem(&tx_port, &fib_params.ifindex)) > return XDP_PASS; > so we do not need to define val at all. I had it this way, because I also checked the contents (of the pointer *val) to check if this was the correct ifindex, but I removed that check again (as user side always insert correctly). So, I guess I could take your suggestion now. > > + > > if (h_proto == htons(ETH_P_IP)) > > ip_decrease_ttl(iph); > > else if (h_proto == htons(ETH_P_IPV6)) > > diff --git a/samples/bpf/xdp_fwd_user.c b/samples/bpf/xdp_fwd_user.c > > index ba012d9f93dd..20951bc27477 100644 > > --- a/samples/bpf/xdp_fwd_user.c > > +++ b/samples/bpf/xdp_fwd_user.c > > @@ -27,14 +27,20 @@ > > #include "libbpf.h" > > #include > > > > - > > -static int do_attach(int idx, int fd, const char *name) > > +static int do_attach(int idx, int prog_fd, int map_fd, const char > > *name) { > > int err; > > > > - err = bpf_set_link_xdp_fd(idx, fd, 0); > > - if (err < 0) > > + err = bpf_set_link_xdp_fd(idx, prog_fd, 0); > > + if (err < 0) { > > printf("ERROR: failed to attach program to %s\n", > > name); > > + return err; > > + } > > + > > + /* Adding ifindex as a possible egress TX port */ > > + err = bpf_map_update_elem(map_fd, &idx, &idx, 0); > > + if (err) > > + printf("ERROR: failed using device %s as > > TX-port\n", name); > > > > return err; > > } > > @@ -47,6 +53,9 @@ static int do_detach(int idx, const char *name) > > if (err < 0) > > printf("ERROR: failed to detach program from %s\n", > > name); > > > > + /* TODO: Remember to cleanup map, when adding use o
Re: [bpf-next PATCH 0/3] bpf: improvements to xdp_fwd sample
On Wed, 7 Aug 2019 15:09:09 -0700 Zvi Effron wrote: > On Wed, Aug 7, 2019 at 6:00 AM Jesper Dangaard Brouer > wrote: > > > > Toke's devmap lookup improvement is first avail in kernel v5.3. > > Thus, not part of XDP-tutorial yet. > > > I probably missed this in an earlier email, but what are Toke's devmap > improvements? Performance? Capability? Toke's devmap and redirect improvements are primarily about usability. Currently, from BPF-context (kernel-side) you cannot read the contents of a devmap (or cpumap or xskmap(AF_XDP)), because for devmap you get the real pointer to the net_device for the ifindex, and we cannot allow you to write/change that from BPF (the kernel would likely crash or become inconsistent). The work-around is to keep a shadow map that contains the "config" of the devmap, which you check/validate against instead. It is just a pain to maintain this shadow map. Toke's change allows you to read the devmap from BPF-context. Thus, you can avoid the shadow map. Another improvement from Toke is that the bpf_redirect_map() helper now also checks whether the redirect index is valid in the map. If not, it returns another value than XDP_REDIRECT. You can choose the alternative return value yourself via "flags", e.g. XDP_PASS. Thus, you don't even need to check/validate the devmap in your BPF-code, as it is part of the bpf_redirect_map() call now. action = bpf_redirect_map(&map, index, flags_as_xdp_value) The default flags value used in most programs today is 0, which maps to XDP_ABORTED. This is sort of a small UAPI change, but for the better. As of today, the packet is dropped later and only diagnosed/seen via the tracepoint xdp:xdp_redirect_map_err. -- Best regards, Jesper Dangaard Brouer
[bpf-next v2 PATCH 3/3] samples/bpf: xdp_fwd explain bpf_fib_lookup return codes
Make it clear that this XDP program depends on the network stack to do the ARP resolution. This is connected with the BPF_FIB_LKUP_RET_NO_NEIGH return code from bpf_fib_lookup(). Another common mistake (seen via XDP-tutorial) is that users don't realize that the sysctl net.ipv{4,6}.conf.all.forwarding setting is honored by bpf_fib_lookup. Reported-by: Anton Protopopov Signed-off-by: Jesper Dangaard Brouer Reviewed-by: David Ahern Acked-by: Yonghong Song --- samples/bpf/xdp_fwd_kern.c | 19 +-- 1 file changed, 17 insertions(+), 2 deletions(-) diff --git a/samples/bpf/xdp_fwd_kern.c b/samples/bpf/xdp_fwd_kern.c index a43d6953c054..701a30f258b1 100644 --- a/samples/bpf/xdp_fwd_kern.c +++ b/samples/bpf/xdp_fwd_kern.c @@ -103,8 +103,23 @@ static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags) fib_params.ifindex = ctx->ingress_ifindex; rc = bpf_fib_lookup(ctx, &fib_params, sizeof(fib_params), flags); - - if (rc == 0) { + /* +* Some rc (return codes) from bpf_fib_lookup() are important, +* to understand how this XDP-prog interacts with network stack. +* +* BPF_FIB_LKUP_RET_NO_NEIGH: +* Even if route lookup was a success, then the MAC-addresses are also +* needed. This is obtained from arp/neighbour table, but if table is +* (still) empty then BPF_FIB_LKUP_RET_NO_NEIGH is returned. To avoid +* doing ARP lookup directly from XDP, then send packet to normal +* network stack via XDP_PASS and expect it will do ARP resolution. +* +* BPF_FIB_LKUP_RET_FWD_DISABLED: +* The bpf_fib_lookup respect sysctl net.ipv{4,6}.conf.all.forwarding +* setting, and will return BPF_FIB_LKUP_RET_FWD_DISABLED if not +* enabled this on ingress device. +*/ + if (rc == BPF_FIB_LKUP_RET_SUCCESS) { /* Verify egress index has been configured as TX-port. * (Note: User can still have inserted an egress ifindex that * doesn't support XDP xmit, which will result in packet drops).
[bpf-next v2 PATCH 2/3] samples/bpf: make xdp_fwd more practically usable via devmap lookup
This addresses the TODO in samples/bpf/xdp_fwd_kern.c, which points out that the chosen egress index should be checked for existence in the devmap. This can now be done by taking advantage of Toke's work in commit 0cdbb4b09a06 ("devmap: Allow map lookups from eBPF"). This change makes xdp_fwd more practically usable, as it allows for a mixed environment where IP-forwarding falls back to the network stack if the egress device isn't configured to use XDP. Signed-off-by: Jesper Dangaard Brouer Reviewed-by: David Ahern --- samples/bpf/xdp_fwd_kern.c | 17 +++-- samples/bpf/xdp_fwd_user.c | 36 +--- 2 files changed, 36 insertions(+), 17 deletions(-) diff --git a/samples/bpf/xdp_fwd_kern.c b/samples/bpf/xdp_fwd_kern.c index e6ffc4ea06f4..a43d6953c054 100644 --- a/samples/bpf/xdp_fwd_kern.c +++ b/samples/bpf/xdp_fwd_kern.c @@ -104,13 +104,18 @@ static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags) rc = bpf_fib_lookup(ctx, &fib_params, sizeof(fib_params), flags); - /* verify egress index has xdp support -* TO-DO bpf_map_lookup_elem(&tx_port, &key) fails with -* cannot pass map_type 14 into func bpf_map_lookup_elem#1: -* NOTE: without verification that egress index supports XDP -* forwarding packets are dropped. -*/ if (rc == 0) { + /* Verify egress index has been configured as TX-port. +* (Note: User can still have inserted an egress ifindex that +* doesn't support XDP xmit, which will result in packet drops). +* +* Note: lookup in devmap supported since 0cdbb4b09a0.
+* If not supported will fail with: +* cannot pass map_type 14 into func bpf_map_lookup_elem#1: +*/ + if (!bpf_map_lookup_elem(&xdp_tx_ports, &fib_params.ifindex)) + return XDP_PASS; + if (h_proto == htons(ETH_P_IP)) ip_decrease_ttl(iph); else if (h_proto == htons(ETH_P_IPV6)) diff --git a/samples/bpf/xdp_fwd_user.c b/samples/bpf/xdp_fwd_user.c index ba012d9f93dd..20951bc27477 100644 --- a/samples/bpf/xdp_fwd_user.c +++ b/samples/bpf/xdp_fwd_user.c @@ -27,14 +27,20 @@ #include "libbpf.h" #include - -static int do_attach(int idx, int fd, const char *name) +static int do_attach(int idx, int prog_fd, int map_fd, const char *name) { int err; - err = bpf_set_link_xdp_fd(idx, fd, 0); - if (err < 0) + err = bpf_set_link_xdp_fd(idx, prog_fd, 0); + if (err < 0) { printf("ERROR: failed to attach program to %s\n", name); + return err; + } + + /* Adding ifindex as a possible egress TX port */ + err = bpf_map_update_elem(map_fd, &idx, &idx, 0); + if (err) + printf("ERROR: failed using device %s as TX-port\n", name); return err; } @@ -47,6 +53,9 @@ static int do_detach(int idx, const char *name) if (err < 0) printf("ERROR: failed to detach program from %s\n", name); + /* TODO: Remember to cleanup map, when adding use of shared map +* bpf_map_delete_elem((map_fd, &idx); +*/ return err; } @@ -67,10 +76,10 @@ int main(int argc, char **argv) }; const char *prog_name = "xdp_fwd"; struct bpf_program *prog; + int prog_fd, map_fd = -1; char filename[PATH_MAX]; struct bpf_object *obj; int opt, i, idx, err; - int prog_fd, map_fd; int attach = 1; int ret = 0; @@ -103,8 +112,17 @@ int main(int argc, char **argv) return 1; } - if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd)) + err = bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd); + if (err) { + if (err == -22) { + printf("Does kernel support devmap lookup?\n"); + /* If not, the error message will be: +* "cannot pass map_type 14 into func +* bpf_map_lookup_elem#1" +*/ + } return 1; + } prog = 
bpf_object__find_program_by_title(obj, prog_name); prog_fd = bpf_program__fd(prog); @@ -119,10 +137,6 @@ int main(int argc, char **argv) return 1; } } - if (attach) { - for (i = 1; i < 64; ++i) - bpf_map_update_elem(map_fd, &i, &i, 0); - } for (i = optind; i < argc; ++i) {
[bpf-next v2 PATCH 0/3] bpf: improvements to xdp_fwd sample
V2: Addressed issues pointed out by Yonghong Song - Please ACK patch 2/3 again - Added ACKs and reviewed-by to other patches This patchset is focused on improvements for the XDP forwarding sample named xdp_fwd, which leverages the existing FIB routing tables as described in the LPC2018[1] talk by David Ahern. The primary motivation is to illustrate how Toke's recent work improves usability of XDP_REDIRECT via lookups in devmap. The other patches are to help users understand the sample. I have more improvements to xdp_fwd, but those might require changes to libbpf. Thus, I am sending these patches first as they are isolated. [1] http://vger.kernel.org/lpc-networking2018.html#session-1 --- Jesper Dangaard Brouer (3): samples/bpf: xdp_fwd rename devmap name to be xdp_tx_ports samples/bpf: make xdp_fwd more practically usable via devmap lookup samples/bpf: xdp_fwd explain bpf_fib_lookup return codes samples/bpf/xdp_fwd_kern.c | 39 ++- samples/bpf/xdp_fwd_user.c | 38 ++ 2 files changed, 56 insertions(+), 21 deletions(-) --
[bpf-next v2 PATCH 1/3] samples/bpf: xdp_fwd rename devmap name to be xdp_tx_ports
The devmap name 'tx_port' came from a copy-paste from xdp_redirect_map which only have a single TX port. Change name to xdp_tx_ports to make it more descriptive. Signed-off-by: Jesper Dangaard Brouer Reviewed-by: David Ahern Acked-by: Yonghong Song --- samples/bpf/xdp_fwd_kern.c |5 +++-- samples/bpf/xdp_fwd_user.c |2 +- 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/samples/bpf/xdp_fwd_kern.c b/samples/bpf/xdp_fwd_kern.c index a7e94e7ff87d..e6ffc4ea06f4 100644 --- a/samples/bpf/xdp_fwd_kern.c +++ b/samples/bpf/xdp_fwd_kern.c @@ -23,7 +23,8 @@ #define IPV6_FLOWINFO_MASK cpu_to_be32(0x0FFF) -struct bpf_map_def SEC("maps") tx_port = { +/* For TX-traffic redirect requires net_device ifindex to be in this devmap */ +struct bpf_map_def SEC("maps") xdp_tx_ports = { .type = BPF_MAP_TYPE_DEVMAP, .key_size = sizeof(int), .value_size = sizeof(int), @@ -117,7 +118,7 @@ static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags) memcpy(eth->h_dest, fib_params.dmac, ETH_ALEN); memcpy(eth->h_source, fib_params.smac, ETH_ALEN); - return bpf_redirect_map(&tx_port, fib_params.ifindex, 0); + return bpf_redirect_map(&xdp_tx_ports, fib_params.ifindex, 0); } return XDP_PASS; diff --git a/samples/bpf/xdp_fwd_user.c b/samples/bpf/xdp_fwd_user.c index 5b46ee12c696..ba012d9f93dd 100644 --- a/samples/bpf/xdp_fwd_user.c +++ b/samples/bpf/xdp_fwd_user.c @@ -113,7 +113,7 @@ int main(int argc, char **argv) return 1; } map_fd = bpf_map__fd(bpf_object__find_map_by_name(obj, - "tx_port")); + "xdp_tx_ports")); if (map_fd < 0) { printf("map not found: %s\n", strerror(map_fd)); return 1;
Re: [bpf-next v2 PATCH 2/3] samples/bpf: make xdp_fwd more practically usable via devmap lookup
On Thu, 08 Aug 2019 17:46:46 +0200 Jesper Dangaard Brouer wrote: > @@ -103,8 +112,17 @@ int main(int argc, char **argv) > return 1; > } > > - if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd)) > + err = bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd); > + if (err) { > + if (err == -22) { Darn - I forgot this, need to be changed. I'll send a V3. > + printf("Does kernel support devmap lookup?\n"); > + /* If not, the error message will be: > + * "cannot pass map_type 14 into func > + * bpf_map_lookup_elem#1" > + */ > + } > return 1; -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
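For readers: the magic constant -22 being flagged above is -EINVAL. A small sketch of doing the comparison via <errno.h> instead of a hard-coded number; handle_load_error() is a hypothetical wrapper, not part of the sample:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical wrapper: compare a bpf_prog_load_xattr()-style negative
 * error against -EINVAL from <errno.h> rather than the magic number -22.
 * Returns 1 on any failure (matching the sample's "return 1" style). */
static int handle_load_error(int err)
{
	if (err == -EINVAL) {
		fprintf(stderr, "Does kernel support devmap lookup?\n");
		return 1;
	}
	return err ? 1 : 0;
}
```

On Linux EINVAL is 22, which is why the hard-coded -22 "worked", but spelling the errno constant out is what the V3 revision needs to do.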
[bpf-next v3 PATCH 0/3] bpf: improvements to xdp_fwd sample
V3: Hopefully fixed all issues pointed out by Yonghong Song V2: Addressed issues pointed out by Yonghong Song - Please ACK patch 2/3 again - Added ACKs and reviewed-by to other patches This patchset is focused on improvements for the XDP forwarding sample named xdp_fwd, which leverages the existing FIB routing tables as described in the LPC2018[1] talk by David Ahern. The primary motivation is to illustrate how Toke's recent work improves usability of XDP_REDIRECT via lookups in devmap. The other patches are to help users understand the sample. I have more improvements to xdp_fwd, but those might require changes to libbpf. Thus, sending these patches first as they are isolated. [1] http://vger.kernel.org/lpc-networking2018.html#session-1 --- Jesper Dangaard Brouer (3): samples/bpf: xdp_fwd rename devmap name to be xdp_tx_ports samples/bpf: make xdp_fwd more practically usable via devmap lookup samples/bpf: xdp_fwd explain bpf_fib_lookup return codes samples/bpf/xdp_fwd_kern.c | 39 ++- samples/bpf/xdp_fwd_user.c | 35 +++ 2 files changed, 53 insertions(+), 21 deletions(-) --
[bpf-next v3 PATCH 3/3] samples/bpf: xdp_fwd explain bpf_fib_lookup return codes
Make it clear that this XDP program depends on the network stack to do the ARP resolution. This is connected with the BPF_FIB_LKUP_RET_NO_NEIGH return code from bpf_fib_lookup(). Another common mistake (seen via the XDP-tutorial) is that users don't realize that the sysctl net.ipv{4,6}.conf.all.forwarding setting is honored by bpf_fib_lookup. Reported-by: Anton Protopopov Signed-off-by: Jesper Dangaard Brouer Reviewed-by: David Ahern Acked-by: Yonghong Song --- samples/bpf/xdp_fwd_kern.c | 19 +-- 1 file changed, 17 insertions(+), 2 deletions(-) diff --git a/samples/bpf/xdp_fwd_kern.c b/samples/bpf/xdp_fwd_kern.c index a43d6953c054..701a30f258b1 100644 --- a/samples/bpf/xdp_fwd_kern.c +++ b/samples/bpf/xdp_fwd_kern.c @@ -103,8 +103,23 @@ static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags) fib_params.ifindex = ctx->ingress_ifindex; rc = bpf_fib_lookup(ctx, &fib_params, sizeof(fib_params), flags); - - if (rc == 0) { + /* +* Some rc (return codes) from bpf_fib_lookup() are important, +* to understand how this XDP-prog interacts with network stack. +* +* BPF_FIB_LKUP_RET_NO_NEIGH: +* Even if route lookup was a success, then the MAC-addresses are also +* needed. This is obtained from arp/neighbour table, but if table is +* (still) empty then BPF_FIB_LKUP_RET_NO_NEIGH is returned. To avoid +* doing ARP lookup directly from XDP, then send packet to normal +* network stack via XDP_PASS and expect it will do ARP resolution. +* +* BPF_FIB_LKUP_RET_FWD_DISABLED: +* The bpf_fib_lookup respect sysctl net.ipv{4,6}.conf.all.forwarding +* setting, and will return BPF_FIB_LKUP_RET_FWD_DISABLED if not +* enabled this on ingress device. +*/ + if (rc == BPF_FIB_LKUP_RET_SUCCESS) { /* Verify egress index has been configured as TX-port. * (Note: User can still have inserted an egress ifindex that * doesn't support XDP xmit, which will result in packet drops).
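For readers hitting BPF_FIB_LKUP_RET_FWD_DISABLED with this sample: the sysctl settings the helper honors can be enabled like so (a configuration sketch; enable per-device keys instead of "all" if you want tighter control):

```shell
# bpf_fib_lookup() returns BPF_FIB_LKUP_RET_FWD_DISABLED unless
# forwarding is enabled on the ingress device; "all" covers every device.
sysctl -w net.ipv4.conf.all.forwarding=1
sysctl -w net.ipv6.conf.all.forwarding=1
```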
[bpf-next v3 PATCH 1/3] samples/bpf: xdp_fwd rename devmap name to be xdp_tx_ports
The devmap name 'tx_port' came from a copy-paste from xdp_redirect_map which only have a single TX port. Change name to xdp_tx_ports to make it more descriptive. Signed-off-by: Jesper Dangaard Brouer Reviewed-by: David Ahern Acked-by: Yonghong Song --- samples/bpf/xdp_fwd_kern.c |5 +++-- samples/bpf/xdp_fwd_user.c |2 +- 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/samples/bpf/xdp_fwd_kern.c b/samples/bpf/xdp_fwd_kern.c index a7e94e7ff87d..e6ffc4ea06f4 100644 --- a/samples/bpf/xdp_fwd_kern.c +++ b/samples/bpf/xdp_fwd_kern.c @@ -23,7 +23,8 @@ #define IPV6_FLOWINFO_MASK cpu_to_be32(0x0FFF) -struct bpf_map_def SEC("maps") tx_port = { +/* For TX-traffic redirect requires net_device ifindex to be in this devmap */ +struct bpf_map_def SEC("maps") xdp_tx_ports = { .type = BPF_MAP_TYPE_DEVMAP, .key_size = sizeof(int), .value_size = sizeof(int), @@ -117,7 +118,7 @@ static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags) memcpy(eth->h_dest, fib_params.dmac, ETH_ALEN); memcpy(eth->h_source, fib_params.smac, ETH_ALEN); - return bpf_redirect_map(&tx_port, fib_params.ifindex, 0); + return bpf_redirect_map(&xdp_tx_ports, fib_params.ifindex, 0); } return XDP_PASS; diff --git a/samples/bpf/xdp_fwd_user.c b/samples/bpf/xdp_fwd_user.c index 5b46ee12c696..ba012d9f93dd 100644 --- a/samples/bpf/xdp_fwd_user.c +++ b/samples/bpf/xdp_fwd_user.c @@ -113,7 +113,7 @@ int main(int argc, char **argv) return 1; } map_fd = bpf_map__fd(bpf_object__find_map_by_name(obj, - "tx_port")); + "xdp_tx_ports")); if (map_fd < 0) { printf("map not found: %s\n", strerror(map_fd)); return 1;
[bpf-next v3 PATCH 2/3] samples/bpf: make xdp_fwd more practically usable via devmap lookup
This addresses the TODO in samples/bpf/xdp_fwd_kern.c, which points out that the chosen egress index should be checked for existence in the devmap. This can now be done by taking advantage of Toke's work in commit 0cdbb4b09a06 ("devmap: Allow map lookups from eBPF"). This change makes xdp_fwd more practically usable, as it allows for a mixed environment, where IP forwarding falls back to the network stack if the egress device isn't configured to use XDP. Signed-off-by: Jesper Dangaard Brouer Reviewed-by: David Ahern --- samples/bpf/xdp_fwd_kern.c | 17 +++-- samples/bpf/xdp_fwd_user.c | 33 ++--- 2 files changed, 33 insertions(+), 17 deletions(-) diff --git a/samples/bpf/xdp_fwd_kern.c b/samples/bpf/xdp_fwd_kern.c index e6ffc4ea06f4..a43d6953c054 100644 --- a/samples/bpf/xdp_fwd_kern.c +++ b/samples/bpf/xdp_fwd_kern.c @@ -104,13 +104,18 @@ static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags) rc = bpf_fib_lookup(ctx, &fib_params, sizeof(fib_params), flags); - /* verify egress index has xdp support -* TO-DO bpf_map_lookup_elem(&tx_port, &key) fails with -* cannot pass map_type 14 into func bpf_map_lookup_elem#1: -* NOTE: without verification that egress index supports XDP -* forwarding packets are dropped. -*/ if (rc == 0) { + /* Verify egress index has been configured as TX-port. +* (Note: User can still have inserted an egress ifindex that +* doesn't support XDP xmit, which will result in packet drops). +* +* Note: lookup in devmap supported since 0cdbb4b09a0.
+* If not supported will fail with: +* cannot pass map_type 14 into func bpf_map_lookup_elem#1: +*/ + if (!bpf_map_lookup_elem(&xdp_tx_ports, &fib_params.ifindex)) + return XDP_PASS; + if (h_proto == htons(ETH_P_IP)) ip_decrease_ttl(iph); else if (h_proto == htons(ETH_P_IPV6)) diff --git a/samples/bpf/xdp_fwd_user.c b/samples/bpf/xdp_fwd_user.c index ba012d9f93dd..97ff1dad7669 100644 --- a/samples/bpf/xdp_fwd_user.c +++ b/samples/bpf/xdp_fwd_user.c @@ -27,14 +27,20 @@ #include "libbpf.h" #include - -static int do_attach(int idx, int fd, const char *name) +static int do_attach(int idx, int prog_fd, int map_fd, const char *name) { int err; - err = bpf_set_link_xdp_fd(idx, fd, 0); - if (err < 0) + err = bpf_set_link_xdp_fd(idx, prog_fd, 0); + if (err < 0) { printf("ERROR: failed to attach program to %s\n", name); + return err; + } + + /* Adding ifindex as a possible egress TX port */ + err = bpf_map_update_elem(map_fd, &idx, &idx, 0); + if (err) + printf("ERROR: failed using device %s as TX-port\n", name); return err; } @@ -47,6 +53,9 @@ static int do_detach(int idx, const char *name) if (err < 0) printf("ERROR: failed to detach program from %s\n", name); + /* TODO: Remember to cleanup map, when adding use of shared map +* bpf_map_delete_elem((map_fd, &idx); +*/ return err; } @@ -67,10 +76,10 @@ int main(int argc, char **argv) }; const char *prog_name = "xdp_fwd"; struct bpf_program *prog; + int prog_fd, map_fd = -1; char filename[PATH_MAX]; struct bpf_object *obj; int opt, i, idx, err; - int prog_fd, map_fd; int attach = 1; int ret = 0; @@ -103,8 +112,14 @@ int main(int argc, char **argv) return 1; } - if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd)) + err = bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd); + if (err) { + printf("Does kernel support devmap lookup?\n"); + /* If not, the error message will be: +* "cannot pass map_type 14 into func bpf_map_lookup_elem#1" +*/ return 1; + } prog = bpf_object__find_program_by_title(obj, prog_name); prog_fd = 
bpf_program__fd(prog); @@ -119,10 +134,6 @@ int main(int argc, char **argv) return 1; } } - if (attach) { - for (i = 1; i < 64; ++i) - bpf_map_update_elem(map_fd, &i, &i, 0); - } for (i = optind; i < argc; ++i) { idx = if_nametoindex(argv[i]); @@ -138,7 +149,7 @@ int main(int argc, char **argv) if (err)
Re: [PATCH bpf-next v5 0/6] xdp: Add devmap_hash map type
On Thu, 8 Aug 2019 12:57:05 -0700 Y Song wrote: > On Thu, Aug 8, 2019 at 12:43 PM Toke Høiland-Jørgensen > wrote: > > > > Alexei Starovoitov writes: > > > > > On Fri, Jul 26, 2019 at 9:06 AM Toke Høiland-Jørgensen > > > wrote: > > >> > > >> This series adds a new map type, devmap_hash, that works like the > > >> existing > > >> devmap type, but using a hash-based indexing scheme. This is useful for > > >> the use > > >> case where a devmap is indexed by ifindex (for instance for use with the > > >> routing > > >> table lookup helper). For this use case, the regular devmap needs to be > > >> sized > > >> after the maximum ifindex number, not the number of devices in it. A > > >> hash-based > > >> indexing scheme makes it possible to size the map after the number of > > >> devices it > > >> should contain instead. > > >> > > >> This was previously part of my patch series that also turned the regular > > >> bpf_redirect() helper into a map-based one; for this series I just > > >> pulled out > > >> the patches that introduced the new map type. > > >> > > >> Changelog: > > >> > > >> v5: > > >> > > >> - Dynamically set the number of hash buckets by rounding up max_entries > > >> to the > > >> nearest power of two (mirroring the regular hashmap), as suggested by > > >> Jesper. > > > > > > fyi I'm waiting for Jesper to review this new version. > > > > Ping Jesper? :) > > Toke, the patch set has been merged to net-next. > https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=d3406913561c322323ec2898cc58f55e79786be7 > Yes, and I did review this... :-) -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
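The v5 change mentioned in the cover letter above — dynamically sizing the hash buckets by rounding max_entries up to the nearest power of two — follows the kernel's roundup_pow_of_two(). A minimal userspace model of that sizing rule (a sketch, not the kernel implementation, which uses fls_long() rather than a loop):

```c
#include <assert.h>

/* Userspace model of rounding a devmap_hash bucket count up to the
 * next power of two, mirroring what roundup_pow_of_two() computes. */
static unsigned int round_up_pow2(unsigned int n)
{
	unsigned int p = 1;

	while (p < n)
		p <<= 1;	/* double until we reach or pass n */
	return p;
}
```

Sizing buckets to a power of two lets the map mask the hash (h & (n_buckets - 1)) instead of taking an expensive modulo.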
Re: [PATCH bpf-next V11 4/7] bpf: add BPF-helper for MTU checking
On Thu, 14 Jan 2021 00:07:14 +0100 Daniel Borkmann wrote: > On 1/12/21 6:45 PM, Jesper Dangaard Brouer wrote: > > This BPF-helper bpf_check_mtu() works for both XDP and TC-BPF programs. > [...] > > + * int bpf_check_mtu(void *ctx, u32 ifindex, u32 *mtu_len, s32 len_diff, > > u64 flags) > > + * Description > > + * Check ctx packet size against MTU of net device (based on > > + * *ifindex*). This helper will likely be used in combination with > > + * helpers that adjust/change the packet size. The argument > > + * *len_diff* can be used for querying with a planned size > > + * change. This allows to check MTU prior to changing packet ctx. > > + * > > + * Specifying *ifindex* zero means the MTU check is performed > > + * against the current net device. This is practical if this isn't > > + * used prior to redirect. > > + * > > + * The Linux kernel route table can configure MTUs on a more > > + * specific per route level, which is not provided by this helper. > > + * For route level MTU checks use the **bpf_fib_lookup**\ () > > + * helper. > > + * > > + * *ctx* is either **struct xdp_md** for XDP programs or > > + * **struct sk_buff** for tc cls_act programs. > > + * > > + * The *flags* argument can be a combination of one or more of the > > + * following values: > > + * > > + * **BPF_MTU_CHK_SEGS** > > + * This flag will only works for *ctx* **struct sk_buff**. > > + * If packet context contains extra packet segment buffers > > + * (often knows as GSO skb), then MTU check is harder to > > + * check at this point, because in transmit path it is > > + * possible for the skb packet to get re-segmented > > + * (depending on net device features). This could still be > > + * a MTU violation, so this flag enables performing MTU > > + * check against segments, with a different violation > > + * return code to tell it apart. Check cannot use len_diff. > > + * > > + * On return *mtu_len* pointer contains the MTU value of the net > > + * device. 
Remember the net device configured MTU is the L3 size, > > + * which is returned here and XDP and TX length operate at L2. > > + * Helper take this into account for you, but remember when using > > + * MTU value in your BPF-code. On input *mtu_len* must be a valid > > + * pointer and be initialized (to zero), else verifier will reject > > + * BPF program. > > + * > > + * Return > > + * * 0 on success, and populate MTU value in *mtu_len* pointer. > > + * > > + * * < 0 if any input argument is invalid (*mtu_len* not updated) > > + * > > + * MTU violations return positive values, but also populate MTU > > + * value in *mtu_len* pointer, as this can be needed for > > + * implementing PMTU handing: > > + * > > + * * **BPF_MTU_CHK_RET_FRAG_NEEDED** > > + * * **BPF_MTU_CHK_RET_SEGS_TOOBIG** > > + * > >*/ > > #define __BPF_FUNC_MAPPER(FN) \ > > FN(unspec), \ > > @@ -3998,6 +4053,7 @@ union bpf_attr { > > FN(ktime_get_coarse_ns),\ > > FN(ima_inode_hash), \ > > FN(sock_from_file), \ > > + FN(check_mtu), \ > > /* */ > > > > /* integer value in 'imm' field of BPF_CALL instruction selects which > > helper > > @@ -5030,6 +5086,17 @@ struct bpf_redir_neigh { > > }; > > }; > > > > +/* bpf_check_mtu flags*/ > > +enum bpf_check_mtu_flags { > > + BPF_MTU_CHK_SEGS = (1U << 0), > > +}; > > + > > +enum bpf_check_mtu_ret { > > + BPF_MTU_CHK_RET_SUCCESS, /* check and lookup successful */ > > + BPF_MTU_CHK_RET_FRAG_NEEDED, /* fragmentation required to fwd */ > > + BPF_MTU_CHK_RET_SEGS_TOOBIG, /* GSO re-segmentation needed to fwd */ > > +}; > > + > > enum bpf_task_fd_type { > > BPF_FD_TYPE_RAW_TRACEPOINT, /* tp name */ > > BPF_FD_TYPE_TRACEPOINT, /* tp name */ > > diff --git a/net/core/filter.c b/net/core/filter.c > > index db59ab55572c..3f2e593244ca 100644 > > --- a/net/core/filter.c > > +++ b/net/core/filter.c > > @@ -5604,6 +5604,124 @@ static const struct bpf_func_proto >
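The L2/L3 distinction in the helper documentation above can be summarized with a small userspace model. This is only a sketch of the length arithmetic (assuming dev_len = mtu + hard_header_len, as the patch describes), not the kernel helper itself, and it covers just the non-GSO path:

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model of the non-GSO part of the MTU check: the device's
 * configured MTU is an L3 size, so the L2 header length is added
 * before comparing against the (possibly len_diff-adjusted) packet. */
static bool mtu_check_passes(int pkt_len, int len_diff,
			     int mtu, int hard_header_len)
{
	int dev_len = mtu + hard_header_len;	/* L3 MTU -> L2 limit */
	int skb_len = pkt_len + len_diff;	/* planned size change */

	return skb_len <= dev_len;
}
```

This also shows why a negative len_diff can make an otherwise oversized packet pass: the helper checks the planned post-adjustment size.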
Re: [PATCH bpf-next V11 4/7] bpf: add BPF-helper for MTU checking
On Tue, 12 Jan 2021 11:23:33 -0800 Andrii Nakryiko wrote: > On Tue, Jan 12, 2021 at 9:49 AM Jesper Dangaard Brouer > wrote: > > > > This BPF-helper bpf_check_mtu() works for both XDP and TC-BPF programs. > > > > The SKB object is complex and the skb->len value (accessible from > > BPF-prog) also include the length of any extra GRO/GSO segments, but > > without taking into account that these GRO/GSO segments get added > > transport (L4) and network (L3) headers before being transmitted. Thus, > > this BPF-helper is created such that the BPF-programmer don't need to > > handle these details in the BPF-prog. > > > > The API is designed to help the BPF-programmer, that want to do packet > > context size changes, which involves other helpers. These other helpers > > usually does a delta size adjustment. This helper also support a delta > > size (len_diff), which allow BPF-programmer to reuse arguments needed by > > these other helpers, and perform the MTU check prior to doing any actual > > size adjustment of the packet context. > > > > It is on purpose, that we allow the len adjustment to become a negative > > result, that will pass the MTU check. This might seem weird, but it's not > > this helpers responsibility to "catch" wrong len_diff adjustments. Other > > helpers will take care of these checks, if BPF-programmer chooses to do > > actual size adjustment. 
> > > > V9: > > - Use dev->hard_header_len (instead of ETH_HLEN) > > - Annotate with unlikely req from Daniel > > - Fix logic error using skb_gso_validate_network_len from Daniel > > > > V6: > > - Took John's advice and dropped BPF_MTU_CHK_RELAX > > - Returned MTU is kept at L3-level (like fib_lookup) > > > > V4: Lot of changes > > - ifindex 0 now use current netdev for MTU lookup > > - rename helper from bpf_mtu_check to bpf_check_mtu > > - fix bug for GSO pkt length (as skb->len is total len) > > - remove __bpf_len_adj_positive, simply allow negative len adj > > > > Signed-off-by: Jesper Dangaard Brouer > > --- > > include/uapi/linux/bpf.h | 67 ++ > > net/core/filter.c | 122 > > > > tools/include/uapi/linux/bpf.h | 67 ++ > > 3 files changed, 256 insertions(+) > > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > > index 649586d656b6..fa2e99351758 100644 > > --- a/include/uapi/linux/bpf.h > > +++ b/include/uapi/linux/bpf.h > > @@ -3833,6 +3833,61 @@ union bpf_attr { > > * Return > > * A pointer to a struct socket on success or NULL if the file > > is > > * not a socket. > > + * > > + * int bpf_check_mtu(void *ctx, u32 ifindex, u32 *mtu_len, s32 len_diff, > > u64 flags) > > should return long, same as most other helpers Is it enough to change it here? (as this will be used for generating the helpers header file, via ./scripts/bpf_helpers_doc.py --header) Or do I also need to change bpf_func_proto.ret_type ? > > + * Description > > + * Check ctx packet size against MTU of net device (based on > > + * *ifindex*). This helper will likely be used in combination > > with > > + * helpers that adjust/change the packet size. The argument > > + * *len_diff* can be used for querying with a planned size > > + * change. This allows to check MTU prior to changing packet > > ctx. > > + * > > [...] > -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH bpf-next V11 5/7] bpf: drop MTU check when doing TC-BPF redirect to ingress
On Thu, 14 Jan 2021 01:03:33 -0800 John Fastabend wrote: > Jesper Dangaard Brouer wrote: > > The use-case for dropping the MTU check when TC-BPF does redirect to > > ingress, is described by Eyal Birger in email[0]. The summary is the > > ability to increase packet size (e.g. with IPv6 headers for NAT64) and > > ingress redirect packet and let normal netstack fragment packet as needed. > > > > [0] > > https://lore.kernel.org/netdev/CAHsH6Gug-hsLGHQ6N0wtixdOa85LDZ3HNRHVd0opR=19qo4...@mail.gmail.com/ > > > > V9: > > - Make net_device "up" (IFF_UP) check explicit in skb_do_redirect > > > > V4: > > - Keep net_device "up" (IFF_UP) check. > > - Adjustment to handle bpf_redirect_peer() helper > > > > Signed-off-by: Jesper Dangaard Brouer > > --- > > include/linux/netdevice.h | 31 +-- > > net/core/dev.c| 19 ++- > > net/core/filter.c | 14 +++--- > > 3 files changed, 42 insertions(+), 22 deletions(-) > > > > [...] > > > diff --git a/net/core/filter.c b/net/core/filter.c > > index 3f2e593244ca..1908800b671c 100644 > > --- a/net/core/filter.c > > +++ b/net/core/filter.c > > @@ -2083,13 +2083,21 @@ static const struct bpf_func_proto > > bpf_csum_level_proto = { > > > > static inline int __bpf_rx_skb(struct net_device *dev, struct sk_buff *skb) > > { > > - return dev_forward_skb(dev, skb); > > > + int ret = dev_forward_skb(dev, skb, false); > > + > > + if (likely(!ret)) { > > + skb->protocol = eth_type_trans(skb, dev); > > + skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN); > > + ret = netif_rx(skb); > > + } > > + > > + return ret; > > How about putting above block into a dev.c routine call it > > dev_forward_skb_nomtu(...) > > or something like that. Then we keep this code next to its pair > with mtu check, dev_forward_skb(). > > dev_forward_skb() also uses netif_rx_internal() looks like maybe we should > just do the same here? I love the idea. I'm coding it up and it looks much nicer. 
And yes, we obviously can use netif_rx_internal() once the code is in core/dev.c. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH bpf-next V11 4/7] bpf: add BPF-helper for MTU checking
On Thu, 14 Jan 2021 23:28:57 +0100 Daniel Borkmann wrote: > On 1/14/21 3:36 PM, Jesper Dangaard Brouer wrote: > [...] > >>> +BPF_CALL_5(bpf_skb_check_mtu, struct sk_buff *, skb, > >>> +u32, ifindex, u32 *, mtu_len, s32, len_diff, u64, flags) > >>> +{ > >>> + int ret = BPF_MTU_CHK_RET_FRAG_NEEDED; > >>> + struct net_device *dev = skb->dev; > >>> + int skb_len, dev_len; > >>> + int mtu; > >>> + > >>> + if (unlikely(flags & ~(BPF_MTU_CHK_SEGS))) > >>> + return -EINVAL; > >>> + > >>> + dev = __dev_via_ifindex(dev, ifindex); > >>> + if (unlikely(!dev)) > >>> + return -ENODEV; > >>> + > >>> + mtu = READ_ONCE(dev->mtu); > >>> + > >>> + dev_len = mtu + dev->hard_header_len; > >>> + skb_len = skb->len + len_diff; /* minus result pass check */ > >>> + if (skb_len <= dev_len) { > >>> + ret = BPF_MTU_CHK_RET_SUCCESS; > >>> + goto out; > >>> + } > >>> + /* At this point, skb->len exceed MTU, but as it include length of all > >>> + * segments, it can still be below MTU. The SKB can possibly get > >>> + * re-segmented in transmit path (see validate_xmit_skb). Thus, user > >>> + * must choose if segs are to be MTU checked. Last SKB "headlen" is > >>> + * checked against MTU. > >>> + */ > >>> + if (skb_is_gso(skb)) { > >>> + ret = BPF_MTU_CHK_RET_SUCCESS; > >>> + > >>> + if (!(flags & BPF_MTU_CHK_SEGS)) > >>> + goto out; > >>> + > >>> + if (!skb_gso_validate_network_len(skb, mtu)) { > >>> + ret = BPF_MTU_CHK_RET_SEGS_TOOBIG; > >>> + goto out; > >>> + } > >>> + > >>> + skb_len = skb_headlen(skb) + len_diff; > >>> + if (skb_len > dev_len) { > [...] > >> Do you have a particular use case for the BPF_MTU_CHK_SEGS? > > > > The complaint from Maze (and others) was that when skb_is_gso then all > > the MTU checks are bypassed. This flag enables checking the GSO part > > via skb_gso_validate_network_len(). We cannot enable it per default, > > as you say, it is not universally correct in all cases.
> > If there is a desire to have access to the skb_gso_validate_network_len(), I'd > keep that behind the flag then, but would drop the skb_headlen(skb) + len_diff > case given the mentioned case on rx where it would yield misleading results to > users that might be unintuitive & hard to debug. Okay, I will update the patch, and drop those lines. > >> I also don't see the flag being used anywhere in your selftests, so I > >> presume > >> not as otherwise you would have added an example there? > > > > I'm using the flag in the bpf-examples code[1], this is how I've tested > > the code path. > > > > I've not found a way to generate GSO packet via the selftests > > infrastructure via bpf_prog_test_run_xattr(). I'm > > > > [1] > > https://github.com/xdp-project/bpf-examples/blob/master/MTU-tests/tc_mtu_enforce.c > > > > Haven't checked but likely something as prog_tests/skb_ctx.c might not be > sufficient > to pass it into the helper. For real case you might need a netns + veth setup > like > some of the other tests are doing and then generating TCP stream from one end > to the > other. I have looked at prog_tests/skb_ctx.c and (as you say yourself) this is not sufficient. I can look into creating a netns+veth setup, but I will appreciate if we can merge this patchset to make forward progress, as I'm sure the netns+veth setup will require its own round of nitpicking. I have created netns+veth test scripts before (see test_xdp_vlan.sh), but my experience is that people/maintainers forget/don't to run these separate shell scripts. Thus, if I create a netns+veth test, then I will prefer if I can integrate this into the "test_progs", as I know that will be run by people/maintainers. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
[PATCH bpf-next V12 1/7] bpf: Remove MTU check in __bpf_skb_max_len
Multiple BPF-helpers that can manipulate/increase the size of the SKB use __bpf_skb_max_len() as the max-length. This function limits the size against the current net_device MTU (skb->dev->mtu). When a BPF-prog grows the packet size, it should not be limited to the MTU. The MTU is a transmit limitation, and software receiving this packet should be allowed to increase the size. Furthermore, the current MTU check in __bpf_skb_max_len uses the MTU from the ingress/current net_device, which in case of redirects uses the wrong net_device. This patch keeps a sanity max limit of SKB_MAX_ALLOC (16KiB). The real limit is elsewhere in the system. Jesper's testing[1] showed it was not possible to exceed 8KiB when expanding the SKB size via a BPF-helper. The limiting factor is the define KMALLOC_MAX_CACHE_SIZE, which is 8192 for the SLUB-allocator (CONFIG_SLUB) in case PAGE_SIZE is 4096. This define is in effect because this is called from softirq context; see __gfp_pfmemalloc_flags() and __do_kmalloc_node(). Jakub's testing showed that frames above 16KiB can cause NICs to reset (but not crash). Keep this sanity limit at this level, as the memory layer can differ based on kernel config. [1] https://github.com/xdp-project/bpf-examples/tree/master/MTU-tests Signed-off-by: Jesper Dangaard Brouer Acked-by: John Fastabend --- net/core/filter.c | 12 1 file changed, 4 insertions(+), 8 deletions(-) diff --git a/net/core/filter.c b/net/core/filter.c index 9ab94e90d660..5beadd659091 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -3552,11 +3552,7 @@ static int bpf_skb_net_shrink(struct sk_buff *skb, u32 off, u32 len_diff, return 0; } -static u32 __bpf_skb_max_len(const struct sk_buff *skb) -{ - return skb->dev ?
skb->dev->mtu + skb->dev->hard_header_len : - SKB_MAX_ALLOC; -} +#define BPF_SKB_MAX_LEN SKB_MAX_ALLOC BPF_CALL_4(sk_skb_adjust_room, struct sk_buff *, skb, s32, len_diff, u32, mode, u64, flags) @@ -3605,7 +3601,7 @@ BPF_CALL_4(bpf_skb_adjust_room, struct sk_buff *, skb, s32, len_diff, { u32 len_cur, len_diff_abs = abs(len_diff); u32 len_min = bpf_skb_net_base_len(skb); - u32 len_max = __bpf_skb_max_len(skb); + u32 len_max = BPF_SKB_MAX_LEN; __be16 proto = skb->protocol; bool shrink = len_diff < 0; u32 off; @@ -3688,7 +3684,7 @@ static int bpf_skb_trim_rcsum(struct sk_buff *skb, unsigned int new_len) static inline int __bpf_skb_change_tail(struct sk_buff *skb, u32 new_len, u64 flags) { - u32 max_len = __bpf_skb_max_len(skb); + u32 max_len = BPF_SKB_MAX_LEN; u32 min_len = __bpf_skb_min_len(skb); int ret; @@ -3764,7 +3760,7 @@ static const struct bpf_func_proto sk_skb_change_tail_proto = { static inline int __bpf_skb_change_head(struct sk_buff *skb, u32 head_room, u64 flags) { - u32 max_len = __bpf_skb_max_len(skb); + u32 max_len = BPF_SKB_MAX_LEN; u32 new_len = skb->len + head_room; int ret;
[PATCH bpf-next V12 4/7] bpf: add BPF-helper for MTU checking
This BPF-helper bpf_check_mtu() works for both XDP and TC-BPF programs. The SKB object is complex and the skb->len value (accessible from the BPF-prog) also includes the length of any extra GRO/GSO segments, but without taking into account that these GRO/GSO segments get transport (L4) and network (L3) headers added before being transmitted. Thus, this BPF-helper is created such that the BPF-programmer doesn't need to handle these details in the BPF-prog. The API is designed to help the BPF-programmer who wants to do packet context size changes, which involve other helpers. These other helpers usually do a delta size adjustment. This helper also supports a delta size (len_diff), which allows the BPF-programmer to reuse arguments needed by these other helpers, and perform the MTU check prior to doing any actual size adjustment of the packet context. It is on purpose that we allow the len adjustment to become a negative result that will pass the MTU check. This might seem weird, but it's not this helper's responsibility to "catch" wrong len_diff adjustments. Other helpers will take care of these checks, if the BPF-programmer chooses to do an actual size adjustment. V12: - Simplify segment check that calls skb_gso_validate_network_len.
- Helpers should return long V9: - Use dev->hard_header_len (instead of ETH_HLEN) - Annotate with unlikely req from Daniel - Fix logic error using skb_gso_validate_network_len from Daniel V6: - Took John's advice and dropped BPF_MTU_CHK_RELAX - Returned MTU is kept at L3-level (like fib_lookup) V4: Lot of changes - ifindex 0 now use current netdev for MTU lookup - rename helper from bpf_mtu_check to bpf_check_mtu - fix bug for GSO pkt length (as skb->len is total len) - remove __bpf_len_adj_positive, simply allow negative len adj Signed-off-by: Jesper Dangaard Brouer --- include/uapi/linux/bpf.h | 67 net/core/filter.c | 111 tools/include/uapi/linux/bpf.h | 67 3 files changed, 245 insertions(+) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 05bfc8c843dc..f17381a337ec 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -3839,6 +3839,61 @@ union bpf_attr { * Return * A pointer to a struct socket on success or NULL if the file is * not a socket. + * + * long bpf_check_mtu(void *ctx, u32 ifindex, u32 *mtu_len, s32 len_diff, u64 flags) + * Description + * Check ctx packet size against MTU of net device (based on + * *ifindex*). This helper will likely be used in combination with + * helpers that adjust/change the packet size. The argument + * *len_diff* can be used for querying with a planned size + * change. This allows to check MTU prior to changing packet ctx. + * + * Specifying *ifindex* zero means the MTU check is performed + * against the current net device. This is practical if this isn't + * used prior to redirect. + * + * The Linux kernel route table can configure MTUs on a more + * specific per route level, which is not provided by this helper. + * For route level MTU checks use the **bpf_fib_lookup**\ () + * helper. + * + * *ctx* is either **struct xdp_md** for XDP programs or + * **struct sk_buff** for tc cls_act programs. 
+ *
+ *		The *flags* argument can be a combination of one or more of the
+ *		following values:
+ *
+ *		**BPF_MTU_CHK_SEGS**
+ *			This flag only works for *ctx* **struct sk_buff**.
+ *			If the packet context contains extra packet segment buffers
+ *			(often known as GSO skb), then the MTU check is harder to
+ *			do at this point, because in the transmit path it is
+ *			possible for the skb packet to get re-segmented
+ *			(depending on net device features). This could still be
+ *			an MTU violation, so this flag enables performing the MTU
+ *			check against segments, with a different violation
+ *			return code to tell it apart. Check cannot use len_diff.
+ *
+ *		On return the *mtu_len* pointer contains the MTU value of the net
+ *		device. Remember the net device configured MTU is the L3 size,
+ *		which is returned here, while XDP and TC length operate at L2.
+ *		The helper takes this into account for you, but remember it when
+ *		using the MTU value in your BPF-code. On input *mtu_len* must be
+ *		a valid pointer and be initialized (to zero), else the verifier
+ *		will reject the BPF program.
+ *
+ *	Return
+ *		* 0 on success, and populate MTU value in *mtu_len* pointer.
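The L3-vs-L2 bookkeeping described in the helper doc above can be illustrated with a small userspace model. This is only a sketch of the length math (the function name `mtu_check_passes` and its signature are made up for illustration; it is not the kernel implementation): the netdev MTU is an L3 size, the packet length is L2, and a caller-supplied len_diff — possibly negative — is applied before comparing.

```c
#include <stdbool.h>

/* Illustrative userspace model of the bpf_check_mtu() length math.
 * The helper compares the L2 packet length (plus a caller-supplied
 * len_diff) against the netdev MTU, where MTU is an L3 size and the
 * device adds hard_header_len (L2 header) on transmit.  A negative
 * len_diff that shrinks the packet is allowed to pass, as the commit
 * message explains: catching wrong deltas is left to other helpers.
 */
static bool mtu_check_passes(unsigned int pkt_len_l2, /* current L2 length */
			     int len_diff,            /* planned size change */
			     unsigned int mtu,        /* L3 MTU from netdev */
			     unsigned int hard_header_len)
{
	int new_len = (int)pkt_len_l2 + len_diff;

	/* mtu is L3; add the L2 header to compare against an L2 length */
	return new_len <= (int)(mtu + hard_header_len);
}
```

For an Ethernet device (hard_header_len 14, MTU 1500), a full-sized 1514-byte frame passes with len_diff 0, fails with a planned +20 byte encap, and an oversized frame passes again once a negative len_diff shrinks it below the limit.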
[PATCH bpf-next V12 3/7] bpf: bpf_fib_lookup return MTU value as output when looked up
The BPF-helpers for FIB lookup (bpf_xdp_fib_lookup and bpf_skb_fib_lookup) can perform an MTU check and return BPF_FIB_LKUP_RET_FRAG_NEEDED, but the BPF-prog doesn't know the MTU value that caused this rejection. If the BPF-prog wants to implement PMTU (Path MTU Discovery, RFC 1191) it needs to know this MTU value for the ICMP packet.

This patch changes the lookup and result struct bpf_fib_lookup to contain this MTU value as output, via a union with 'tot_len', as this is the value used for the MTU lookup.

V5:
- Fixed uninit value spotted by Dan Carpenter.
- Name struct output member mtu_result

Reported-by: kernel test robot
Reported-by: Dan Carpenter
Signed-off-by: Jesper Dangaard Brouer
---
 include/uapi/linux/bpf.h       | 11 +--
 net/core/filter.c              | 22 +++---
 tools/include/uapi/linux/bpf.h | 11 +--
 3 files changed, 33 insertions(+), 11 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c001766adcbc..05bfc8c843dc 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2231,6 +2231,9 @@ union bpf_attr {
 *		* > 0 one of **BPF_FIB_LKUP_RET_** codes explaining why the
 *		  packet is not forwarded or needs assist from full stack
 *
+ *		If lookup fails with BPF_FIB_LKUP_RET_FRAG_NEEDED, then the MTU
+ *		was exceeded and output params->mtu_result contains the MTU.
+ *
 * long bpf_sock_hash_update(struct bpf_sock_ops *skops, struct bpf_map *map, void *key, u64 flags)
 *	Description
 *		Add an entry to, or update a sockhash *map* referencing sockets.
@@ -4981,9 +4984,13 @@ struct bpf_fib_lookup {
 	__be16	sport;
 	__be16	dport;

-	/* total length of packet from network header - used for MTU check */
-	__u16	tot_len;
+	union {	/* used for MTU check */
+		/* input to lookup */
+		__u16	tot_len; /* L3 length from network hdr (iph->tot_len) */
+		/* output: MTU value */
+		__u16	mtu_result;
+	};

 	/* input: L3 device index for lookup
 	 * output: device index from FIB lookup
 	 */
diff --git a/net/core/filter.c b/net/core/filter.c
index d5e6f395cf64..da162e64578a 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5289,12 +5289,14 @@ static const struct bpf_func_proto bpf_skb_get_xfrm_state_proto = {
 #if IS_ENABLED(CONFIG_INET) || IS_ENABLED(CONFIG_IPV6)
 static int bpf_fib_set_fwd_params(struct bpf_fib_lookup *params,
 				  const struct neighbour *neigh,
-				  const struct net_device *dev)
+				  const struct net_device *dev, u32 mtu)
 {
 	memcpy(params->dmac, neigh->ha, ETH_ALEN);
 	memcpy(params->smac, dev->dev_addr, ETH_ALEN);
 	params->h_vlan_TCI = 0;
 	params->h_vlan_proto = 0;
+	if (mtu)
+		params->mtu_result = mtu; /* union with tot_len */

 	return 0;
 }
@@ -5310,8 +5312,8 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	struct net_device *dev;
 	struct fib_result res;
 	struct flowi4 fl4;
+	u32 mtu = 0;
 	int err;
-	u32 mtu;

 	dev = dev_get_by_index_rcu(net, params->ifindex);
 	if (unlikely(!dev))
@@ -5378,8 +5380,10 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	if (check_mtu) {
 		mtu = ip_mtu_from_fib_result(&res, params->ipv4_dst);
-		if (params->tot_len > mtu)
+		if (params->tot_len > mtu) {
+			params->mtu_result = mtu; /* union with tot_len */
 			return BPF_FIB_LKUP_RET_FRAG_NEEDED;
+		}
 	}

 	nhc = res.nhc;
@@ -5413,7 +5417,7 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	if (!neigh)
 		return BPF_FIB_LKUP_RET_NO_NEIGH;

-	return bpf_fib_set_fwd_params(params, neigh, dev);
+	return bpf_fib_set_fwd_params(params, neigh, dev, mtu);
 }
 #endif

@@ -5430,7 +5434,7 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	struct flowi6 fl6;
 	int strict = 0;
 	int oif, err;
-	u32 mtu;
+	u32 mtu = 0;

 	/* link local addresses are never forwarded */
 	if (rt6_need_strict(dst) || rt6_need_strict(src))
@@ -5505,8 +5509,10 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	if (check_mtu) {
 		mtu = ipv6_stub->ip6_mtu_from_fib6(&res, dst, src);
-		if (params->tot_len > mtu)
+		if (params->tot_len > mtu) {
+			params->mtu_result = mtu; /* union with tot_len */
 			return BPF_FIB_LKUP_RET_FRAG_NEEDED;
+		}
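The input/output union that this patch adds to struct bpf_fib_lookup can be sketched in plain C. This is an illustrative model only (the struct and function names `fib_params_model` and `fib_mtu_check` are made up): the same 16-bit slot carries 'tot_len' going in and 'mtu_result' coming back on the FRAG_NEEDED path.

```c
#include <stdbool.h>

/* Sketch of the union trick used by struct bpf_fib_lookup above:
 * the same 16-bit slot is 'tot_len' on input and 'mtu_result' on
 * output.  Names here are illustrative, not the kernel's.
 */
struct fib_params_model {
	union {
		unsigned short tot_len;    /* input: L3 packet length */
		unsigned short mtu_result; /* output: MTU that was exceeded */
	};
};

/* Returns true on the "FRAG_NEEDED" path; there the union is
 * overwritten with the MTU so the caller can build an ICMP
 * "fragmentation needed" reply for PMTU discovery. */
static bool fib_mtu_check(struct fib_params_model *p, unsigned short mtu)
{
	if (p->tot_len > mtu) {
		p->mtu_result = mtu; /* union with tot_len */
		return true;
	}
	return false;
}
```

The design choice mirrors the patch: reusing the slot avoids growing the uapi struct, at the cost that the input length is no longer readable after a failed lookup.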
[PATCH bpf-next V12 2/7] bpf: fix bpf_fib_lookup helper MTU check for SKB ctx
A BPF end-user on the Cilium slack channel (Carlo Carraro) wants to use bpf_fib_lookup for doing an MTU check, but *prior* to extending the packet size, by adjusting fib_params 'tot_len' with the packet length plus the expected encap size. (Just like the bpf_check_mtu helper supports.)

He discovered that for the SKB ctx the params->tot_len was not used; instead skb->len was used (via the MTU check in is_skb_forwardable() that checks against the netdev MTU).

Fix this by using fib_params 'tot_len' for the MTU check. If it is not provided (e.g. zero) then keep the existing TC behaviour intact. Notice that the 'tot_len' MTU check is done like in the XDP code-path, which checks against the FIB-dst MTU.

V10:
- Use same method as XDP for 'tot_len' MTU check

Fixes: 4c79579b44b1 ("bpf: Change bpf_fib_lookup to return lookup status")
Reported-by: Carlo Carraro
Signed-off-by: Jesper Dangaard Brouer
---
 net/core/filter.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 5beadd659091..d5e6f395cf64 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5569,6 +5569,7 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
 {
 	struct net *net = dev_net(skb->dev);
 	int rc = -EAFNOSUPPORT;
+	bool check_mtu = false;

 	if (plen < sizeof(*params))
 		return -EINVAL;
@@ -5576,22 +5577,28 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
 	if (flags & ~(BPF_FIB_LOOKUP_DIRECT | BPF_FIB_LOOKUP_OUTPUT))
 		return -EINVAL;

+	if (params->tot_len)
+		check_mtu = true;
+
 	switch (params->family) {
 #if IS_ENABLED(CONFIG_INET)
 	case AF_INET:
-		rc = bpf_ipv4_fib_lookup(net, params, flags, false);
+		rc = bpf_ipv4_fib_lookup(net, params, flags, check_mtu);
 		break;
 #endif
 #if IS_ENABLED(CONFIG_IPV6)
 	case AF_INET6:
-		rc = bpf_ipv6_fib_lookup(net, params, flags, false);
+		rc = bpf_ipv6_fib_lookup(net, params, flags, check_mtu);
 		break;
 #endif
 	}

-	if (!rc) {
+	if (rc == BPF_FIB_LKUP_RET_SUCCESS && !check_mtu) {
 		struct net_device *dev;

+		/* When tot_len isn't provided by user, check skb
+		 * against net_device MTU
+		 */
 		dev = dev_get_by_index_rcu(net, params->ifindex);
 		if (!is_skb_forwardable(dev, skb))
 			rc = BPF_FIB_LKUP_RET_FRAG_NEEDED;
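The control flow this fix introduces can be modeled in a few lines of userspace C (purely illustrative; `skb_fib_mtu_model` and the RET_* constants are made up, not kernel symbols): a non-zero user-supplied tot_len enables the FIB-level MTU check like the XDP path, while tot_len == 0 falls back to the old behaviour of checking skb->len against the netdev limit.

```c
#include <stdbool.h>

#define RET_SUCCESS     0
#define RET_FRAG_NEEDED 1

/* Illustrative model of the bpf_skb_fib_lookup() MTU-check selection:
 * tot_len != 0 -> check tot_len against the FIB-dst MTU (XDP-style);
 * tot_len == 0 -> old TC behaviour, skb length vs netdev limit
 * (what is_skb_forwardable() does in the kernel).
 */
static int skb_fib_mtu_model(unsigned short tot_len, unsigned int fib_mtu,
			     unsigned int skb_len, unsigned int netdev_limit)
{
	bool check_mtu = tot_len != 0;

	if (check_mtu) /* FIB-dst MTU check, like the XDP code-path */
		return tot_len > fib_mtu ? RET_FRAG_NEEDED : RET_SUCCESS;

	/* fallback: skb->len against the netdev MTU */
	return skb_len > netdev_limit ? RET_FRAG_NEEDED : RET_SUCCESS;
}
```

Note how, when tot_len is supplied, the actual skb length is deliberately ignored — that is exactly what lets a BPF-prog query "would this fit *after* adding encap?" before growing the packet.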
[PATCH bpf-next V12 7/7] bpf/selftests: tests using bpf_check_mtu BPF-helper
Adding selftest for BPF-helper bpf_check_mtu(). Making sure it can be used from both XDP and TC. V11: - Addresse nitpicks from Andrii Nakryiko V10: - Remove errno non-zero test in CHECK_ATTR() - Addresse comments from Andrii Nakryiko Signed-off-by: Jesper Dangaard Brouer Acked-by: Andrii Nakryiko --- tools/testing/selftests/bpf/prog_tests/check_mtu.c | 216 tools/testing/selftests/bpf/progs/test_check_mtu.c | 198 ++ 2 files changed, 414 insertions(+) create mode 100644 tools/testing/selftests/bpf/prog_tests/check_mtu.c create mode 100644 tools/testing/selftests/bpf/progs/test_check_mtu.c diff --git a/tools/testing/selftests/bpf/prog_tests/check_mtu.c b/tools/testing/selftests/bpf/prog_tests/check_mtu.c new file mode 100644 index ..9e2fd01b7c65 --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/check_mtu.c @@ -0,0 +1,216 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2020 Jesper Dangaard Brouer */ + +#include /* before test_progs.h, avoid bpf_util.h redefines */ +#include +#include "test_check_mtu.skel.h" +#include "network_helpers.h" + +#include +#include + +#define IFINDEX_LO 1 + +static __u32 duration; /* Hint: needed for CHECK macro */ + +static int read_mtu_device_lo(void) +{ + const char *filename = "/sys/class/net/lo/mtu"; + char buf[11] = {}; + int value, n, fd; + + fd = open(filename, 0, O_RDONLY); + if (fd == -1) + return -1; + + n = read(fd, buf, sizeof(buf)); + close(fd); + + if (n == -1) + return -2; + + value = strtoimax(buf, NULL, 10); + if (errno == ERANGE) + return -3; + + return value; +} + +static void test_check_mtu_xdp_attach() +{ + struct bpf_link_info link_info; + __u32 link_info_len = sizeof(link_info); + struct test_check_mtu *skel; + struct bpf_program *prog; + struct bpf_link *link; + int err = 0; + int fd; + + skel = test_check_mtu__open_and_load(); + if (CHECK(!skel, "open and load skel", "failed")) + return; /* Exit if e.g. 
helper unknown to kernel */ + + prog = skel->progs.xdp_use_helper_basic; + + link = bpf_program__attach_xdp(prog, IFINDEX_LO); + if (CHECK(IS_ERR(link), "link_attach", "failed: %ld\n", PTR_ERR(link))) + goto out; + skel->links.xdp_use_helper_basic = link; + + memset(&link_info, 0, sizeof(link_info)); + fd = bpf_link__fd(link); + err = bpf_obj_get_info_by_fd(fd, &link_info, &link_info_len); + if (CHECK(err, "link_info", "failed: %d\n", err)) + goto out; + + CHECK(link_info.type != BPF_LINK_TYPE_XDP, "link_type", + "got %u != exp %u\n", link_info.type, BPF_LINK_TYPE_XDP); + CHECK(link_info.xdp.ifindex != IFINDEX_LO, "link_ifindex", + "got %u != exp %u\n", link_info.xdp.ifindex, IFINDEX_LO); + + err = bpf_link__detach(link); + CHECK(err, "link_detach", "failed %d\n", err); + +out: + test_check_mtu__destroy(skel); +} + +static void test_check_mtu_run_xdp(struct test_check_mtu *skel, + struct bpf_program *prog, + __u32 mtu_expect) +{ + const char *prog_name = bpf_program__name(prog); + int retval_expect = XDP_PASS; + __u32 mtu_result = 0; + char buf[256] = {}; + int err; + struct bpf_prog_test_run_attr tattr = { + .repeat = 1, + .data_in = &pkt_v4, + .data_size_in = sizeof(pkt_v4), + .data_out = buf, + .data_size_out = sizeof(buf), + .prog_fd = bpf_program__fd(prog), + }; + + err = bpf_prog_test_run_xattr(&tattr); + CHECK_ATTR(err != 0, "bpf_prog_test_run", + "prog_name:%s (err %d errno %d retval %d)\n", + prog_name, err, errno, tattr.retval); + + CHECK(tattr.retval != retval_expect, "retval", + "progname:%s unexpected retval=%d expected=%d\n", + prog_name, tattr.retval, retval_expect); + + /* Extract MTU that BPF-prog got */ + mtu_result = skel->bss->global_bpf_mtu_xdp; + ASSERT_EQ(mtu_result, mtu_expect, "MTU-compare-user"); +} + + +static void test_check_mtu_xdp(__u32 mtu, __u32 ifindex) +{ + struct test_check_mtu *skel; + int err; + + skel = test_check_mtu__open(); + if (CHECK(!skel, "skel_open", "failed")) + return; + + /* Update "constants" in BPF-prog *BEFORE* 
libbpf load */ + skel->rodata->GLOBAL_USER_MTU = mtu; + skel->rodata->GLOBAL_USER_IFINDEX = ifindex;
[PATCH bpf-next V12 5/7] bpf: drop MTU check when doing TC-BPF redirect to ingress
The use-case for dropping the MTU check when TC-BPF does redirect to ingress, is described by Eyal Birger in email[0]. The summary is the ability to increase packet size (e.g. with IPv6 headers for NAT64) and ingress redirect packet and let normal netstack fragment packet as needed. [0] https://lore.kernel.org/netdev/CAHsH6Gug-hsLGHQ6N0wtixdOa85LDZ3HNRHVd0opR=19qo4...@mail.gmail.com/ V9: - Make net_device "up" (IFF_UP) check explicit in skb_do_redirect V4: - Keep net_device "up" (IFF_UP) check. - Adjustment to handle bpf_redirect_peer() helper Signed-off-by: Jesper Dangaard Brouer --- include/linux/netdevice.h | 32 ++-- net/core/dev.c| 32 ++-- net/core/filter.c |6 +++--- 3 files changed, 47 insertions(+), 23 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 5b949076ed23..b7915484369c 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -3926,14 +3926,42 @@ int xdp_umem_query(struct net_device *dev, u16 queue_id); int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb); int dev_forward_skb(struct net_device *dev, struct sk_buff *skb); +int dev_forward_skb_nomtu(struct net_device *dev, struct sk_buff *skb); bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff *skb); +static __always_inline bool __is_skb_forwardable(const struct net_device *dev, +const struct sk_buff *skb, +const bool check_mtu) +{ + const u32 vlan_hdr_len = 4; /* VLAN_HLEN */ + unsigned int len; + + if (!(dev->flags & IFF_UP)) + return false; + + if (!check_mtu) + return true; + + len = dev->mtu + dev->hard_header_len + vlan_hdr_len; + if (skb->len <= len) + return true; + + /* if TSO is enabled, we don't care about the length as the packet +* could be forwarded without being segmented before +*/ + if (skb_is_gso(skb)) + return true; + + return false; +} + static __always_inline int dev_forward_skb(struct net_device *dev, - struct sk_buff *skb) + struct sk_buff *skb, + const bool check_mtu) { if 
(skb_orphan_frags(skb, GFP_ATOMIC) || - unlikely(!is_skb_forwardable(dev, skb))) { + unlikely(!__is_skb_forwardable(dev, skb, check_mtu))) { atomic_long_inc(&dev->rx_dropped); kfree_skb(skb); return NET_RX_DROP; diff --git a/net/core/dev.c b/net/core/dev.c index bae35c1ae192..87bb2cd62189 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -2194,28 +2194,14 @@ static inline void net_timestamp_set(struct sk_buff *skb) bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff *skb) { - unsigned int len; - - if (!(dev->flags & IFF_UP)) - return false; - - len = dev->mtu + dev->hard_header_len + VLAN_HLEN; - if (skb->len <= len) - return true; - - /* if TSO is enabled, we don't care about the length as the packet -* could be forwarded without being segmented before -*/ - if (skb_is_gso(skb)) - return true; - - return false; + return __is_skb_forwardable(dev, skb, true); } EXPORT_SYMBOL_GPL(is_skb_forwardable); -int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb) +int __dev_forward_skb2(struct net_device *dev, struct sk_buff *skb, +bool check_mtu) { - int ret = dev_forward_skb(dev, skb); + int ret = dev_forward_skb(dev, skb, check_mtu); if (likely(!ret)) { skb->protocol = eth_type_trans(skb, dev); @@ -2224,6 +2210,11 @@ int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb) return ret; } + +int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb) +{ + return __dev_forward_skb2(dev, skb, true); +} EXPORT_SYMBOL_GPL(__dev_forward_skb); /** @@ -2250,6 +2241,11 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb) } EXPORT_SYMBOL_GPL(dev_forward_skb); +int dev_forward_skb_nomtu(struct net_device *dev, struct sk_buff *skb) +{ + return __dev_forward_skb2(dev, skb, false) ?: netif_rx_internal(skb); +} + static inline int deliver_skb(struct sk_buff *skb, struct packet_type *pt_prev, struct net_device *orig_dev) diff --git a/net/core/filter.c b/net/core/filter.c index 0be81f499f51..07876e9ab5e3 100644 --- 
a/net/core/filter.c +++ b/net/core/filter.c @@ -2083,13 +2083,13 @@ static const struct bpf
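The reworked __is_skb_forwardable() above (now taking a check_mtu flag) can be modeled in standalone userspace C. This is an illustrative sketch under assumed names (`dev_model`, `skb_forwardable_model` are not kernel identifiers); it reproduces the IFF_UP check, the MTU + L2-header + VLAN length budget, and the GSO escape hatch.

```c
#include <stdbool.h>

/* Userspace model of the __is_skb_forwardable() logic in the patch
 * above.  Field names loosely mirror struct net_device. */
struct dev_model {
	bool up;                      /* IFF_UP */
	unsigned int mtu;             /* L3 MTU */
	unsigned int hard_header_len; /* L2 header size */
};

#define VLAN_HDR_LEN 4 /* VLAN_HLEN */

static bool skb_forwardable_model(const struct dev_model *dev,
				  unsigned int skb_len, bool skb_is_gso,
				  bool check_mtu)
{
	if (!dev->up)
		return false;

	if (!check_mtu) /* the new TC-redirect-to-ingress path */
		return true;

	if (skb_len <= dev->mtu + dev->hard_header_len + VLAN_HDR_LEN)
		return true;

	/* if TSO is enabled, we don't care about the length as the
	 * packet could be forwarded without being segmented before */
	return skb_is_gso;
}
```

With check_mtu == false only the "up" check remains, which is the whole point of the patch: let the netstack fragment an over-MTU packet redirected to ingress instead of dropping it here.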
[PATCH bpf-next V12 6/7] selftests/bpf: use bpf_check_mtu in selftest test_cls_redirect
This demonstrates how the bpf_check_mtu() helper can easily be used together with the bpf_skb_adjust_room() helper, prior to doing the size adjustment, as the delta argument is already set up.

Hint: This specific test can be selected like this:
 ./test_progs -t cls_redirect

Signed-off-by: Jesper Dangaard Brouer
---
 .../selftests/bpf/progs/test_cls_redirect.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/tools/testing/selftests/bpf/progs/test_cls_redirect.c b/tools/testing/selftests/bpf/progs/test_cls_redirect.c
index c9f8464996ea..3c1e042962e6 100644
--- a/tools/testing/selftests/bpf/progs/test_cls_redirect.c
+++ b/tools/testing/selftests/bpf/progs/test_cls_redirect.c
@@ -70,6 +70,7 @@ typedef struct {
 	uint64_t errors_total_encap_adjust_failed;
 	uint64_t errors_total_encap_buffer_too_small;
 	uint64_t errors_total_redirect_loop;
+	uint64_t errors_total_encap_mtu_violate;
 } metrics_t;

 typedef enum {
@@ -407,6 +408,7 @@ static INLINING ret_t forward_with_gre(struct __sk_buff *skb, encap_headers_t *e
 		payload_off - sizeof(struct ethhdr) - sizeof(struct iphdr);
 	int32_t delta = sizeof(struct gre_base_hdr) - encap_overhead;
 	uint16_t proto = ETH_P_IP;
+	uint32_t mtu_len = 0;

 	/* Loop protection: the inner packet's TTL is decremented as a safeguard
 	 * against any forwarding loop. As the only interesting field is the TTL
@@ -479,6 +481,11 @@ static INLINING ret_t forward_with_gre(struct __sk_buff *skb, encap_headers_t *e
 		}
 	}

+	if (bpf_check_mtu(skb, skb->ifindex, &mtu_len, delta, 0)) {
+		metrics->errors_total_encap_mtu_violate++;
+		return TC_ACT_SHOT;
+	}
+
 	if (bpf_skb_adjust_room(skb, delta, BPF_ADJ_ROOM_NET,
 				BPF_F_ADJ_ROOM_FIXED_GSO |
 				BPF_F_ADJ_ROOM_NO_CSUM_RESET) ||
[PATCH bpf-next V12 0/7] bpf: New approach for BPF MTU handling
This patchset drops all the MTU checks in TC BPF-helpers that limit growing the packet size. This is done because these BPF-helpers don't take redirect into account, which can result in their MTU check being done against the wrong netdev.

The new approach is to give BPF-programs knowledge about the MTU on a netdev (via ifindex) and at the fib route lookup level. Meaning some BPF-helpers are added and extended to make it possible to do MTU checks in the BPF-code. If the BPF-prog doesn't comply with the MTU, then the packet will eventually get dropped at some other layer. In some cases the existing kernel MTU checks will drop the packet, but there are also cases where BPF can bypass these checks. Specifically, doing a TC-redirect from the ingress step (sch_handle_ingress) into the egress code path (basically calling dev_queue_xmit()). It is left up to driver code to handle these kinds of MTU violations.

One advantage of this approach is that an ingress-to-egress BPF-prog can send information via packet data. With the MTU checks removed in the helpers, and also not done in the skb_do_redirect() call, this allows an ingress BPF-prog to communicate with an egress BPF-prog via packet data, as long as the egress BPF-prog removes this prior to transmitting the packet.

This patchset is primarily focused on TC-BPF, but I've made sure that the MTU BPF-helpers also work for XDP BPF-programs.

V2: Change BPF-helper API from lookup to check.
V3: Drop enforcement of MTU in net-core, leave it to drivers.
V4: Keep sanity limit + netdev "up" checks + rename BPF-helper.
V5: Fix uninit variable + name struct output member mtu_result.
V6: Use bpf_check_mtu() in selftest
V7: Fix logic using tot_len and add another selftest
V8: Add better selftests for BPF-helper bpf_check_mtu
V9: Remove patch that used skb_set_redirected
V10: Fix selftests and 'tot_len' MTU check like XDP
V11: Fix nitpicks in selftests
V12: Adjustments requested by Daniel

---

Jesper Dangaard Brouer (7):
      bpf: Remove MTU check in __bpf_skb_max_len
      bpf: fix bpf_fib_lookup helper MTU check for SKB ctx
      bpf: bpf_fib_lookup return MTU value as output when looked up
      bpf: add BPF-helper for MTU checking
      bpf: drop MTU check when doing TC-BPF redirect to ingress
      selftests/bpf: use bpf_check_mtu in selftest test_cls_redirect
      bpf/selftests: tests using bpf_check_mtu BPF-helper

 include/linux/netdevice.h                          |  32 +++
 include/uapi/linux/bpf.h                           |  78 +++
 net/core/dev.c                                     |  32 +--
 net/core/filter.c                                  | 164 +--
 tools/include/uapi/linux/bpf.h                     |  78 +++
 tools/testing/selftests/bpf/prog_tests/check_mtu.c | 216
 tools/testing/selftests/bpf/progs/test_check_mtu.c | 198 ++
 .../selftests/bpf/progs/test_cls_redirect.c        |   7 +
 8 files changed, 760 insertions(+), 45 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/check_mtu.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_check_mtu.c
--
Re: [PATCHv8 bpf-next] samples/bpf: add xdp program on egress for xdp_redirect_map
return 1; > + } > > if (optind == argc) { > printf("usage: %s _IN _OUT\n", > argv[0]); > @@ -150,24 +191,28 @@ int main(int argc, char **argv) > if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd)) > return 1; > > - prog = bpf_program__next(NULL, obj); > - dummy_prog = bpf_program__next(prog, obj); > - if (!prog || !dummy_prog) { > - printf("finding a prog in obj file failed\n"); > - return 1; > + if (xdp_flags & XDP_FLAGS_SKB_MODE) { > + prog = bpf_object__find_program_by_name(obj, > "xdp_redirect_map_general"); > + tx_port_map_fd = bpf_object__find_map_fd_by_name(obj, > "tx_port_general"); > + } else { > + prog = bpf_object__find_program_by_name(obj, > "xdp_redirect_map_native"); > + tx_port_map_fd = bpf_object__find_map_fd_by_name(obj, > "tx_port_native"); > } > - /* bpf_prog_load_xattr gives us the pointer to first prog's fd, > - * so we're missing only the fd for dummy prog > - */ > + dummy_prog = bpf_object__find_program_by_name(obj, > "xdp_redirect_dummy_prog"); > + if (!prog || dummy_prog < 0 || tx_port_map_fd < 0) { > + printf("finding prog/dummy_prog/tx_port_map in obj file > failed\n"); > + goto out; > + } > + prog_fd = bpf_program__fd(prog); > dummy_prog_fd = bpf_program__fd(dummy_prog); > - if (prog_fd < 0 || dummy_prog_fd < 0) { > + if (prog_fd < 0 || dummy_prog_fd < 0 || tx_port_map_fd < 0) { > printf("bpf_prog_load_xattr: %s\n", strerror(errno)); > return 1; > } > > - tx_port_map_fd = bpf_object__find_map_fd_by_name(obj, "tx_port"); > + tx_mac_map_fd = bpf_object__find_map_fd_by_name(obj, "tx_mac"); > rxcnt_map_fd = bpf_object__find_map_fd_by_name(obj, "rxcnt"); > - if (tx_port_map_fd < 0 || rxcnt_map_fd < 0) { > + if (tx_mac_map_fd < 0 || rxcnt_map_fd < 0) { > printf("bpf_object__find_map_fd_by_name failed\n"); > return 1; > } > @@ -199,11 +244,66 @@ int main(int argc, char **argv) > } > dummy_prog_id = info.id; > > + /* Load 2nd xdp prog on egress. 
*/ > + if (xdp_devmap_attached) { > + unsigned char mac_addr[6]; > + > + devmap_prog = bpf_object__find_program_by_name(obj, > "xdp_redirect_map_egress_0"); > + if (!devmap_prog) { > + printf("finding devmap_prog in obj file failed\n"); > + goto out; > + } > + devmap_prog_fd_0 = bpf_program__fd(devmap_prog); > + if (devmap_prog_fd_0 < 0) { > + printf("finding devmap_prog fd failed\n"); > + goto out; > + } > + > + devmap_prog = bpf_object__find_program_by_name(obj, > "xdp_redirect_map_egress_1"); > + if (!devmap_prog) { > + printf("finding devmap_prog in obj file failed\n"); > + goto out; > + } > + devmap_prog_fd_1 = bpf_program__fd(devmap_prog); > + if (devmap_prog_fd_1 < 0) { > + printf("finding devmap_prog fd failed\n"); > + goto out; > + } > + > + if (get_mac_addr(ifindex_out, mac_addr) < 0) { > + printf("get interface %d mac failed\n", ifindex_out); > + goto out; > + } > + > + ret = bpf_map_update_elem(tx_mac_map_fd, &key, mac_addr, 0); > + if (ret) { > + perror("bpf_update_elem tx_mac_map_fd"); > + goto out; > + } > + } > + > signal(SIGINT, int_exit); > signal(SIGTERM, int_exit); > > - /* populate virtual to physical port map */ > - ret = bpf_map_update_elem(tx_port_map_fd, &key, &ifindex_out, 0); > + /* devmap prog 0 will set src mac to 00:00:00:00:00:01 > + * if 2nd xdp prog attached on egress > + */ > + key = 0; > + devmap_val.ifindex = ifindex_out; (ifindex_out same as below) > + devmap_val.bpf_prog.fd = devmap_prog_fd_0; > + ret = bpf_map_update_elem(tx_port_map_fd, &key, &devmap_val, 0); > + if (ret) { > + perror("bpf_update_elem"); > + goto out; > + } > + > + /* devmap prog 1 will set src mac to 00:00:00:00:01:01 > + * if 2nd xdp prog attached on egress > + */ > + key = 1; > + devmap_val.ifindex = ifindex_out; (ifindex_out same as above) > + devmap_val.bpf_prog.fd = devmap_prog_fd_1; > + ret = bpf_map_update_elem(tx_port_map_fd, &key, &devmap_val, 0); > if (ret) { > perror("bpf_update_elem"); > goto out; -- Best regards, Jesper Dangaard Brouer MSc.CS, 
Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH bpf-next v2 0/8] Introduce bpf_redirect_xsk() helper
On Wed, 20 Jan 2021 15:15:22 +0200 Maxim Mikityanskiy wrote: > On 2021-01-19 17:50, Björn Töpel wrote: > > This series extends bind() for XDP sockets, so that the bound socket > > is added to the netdev_rx_queue _rx array in the netdevice. We call > > this to register the socket. To redirect packets to the registered > > socket, a new BPF helper is used: bpf_redirect_xsk(). > > > > For shared XDP sockets, only the first bound socket is > > registered. Users that need more complex setup has to use XSKMAP and > > bpf_redirect_map(). > > > > Now, why would one use bpf_redirect_xsk() over the regular > > bpf_redirect_map() helper? > > > > * Better performance! > > * Convenience; Most user use one socket per queue. This scenario is > >what registered sockets support. There is no need to create an > >XSKMAP. This can also reduce complexity from containerized setups, > >where users might what to use XDP sockets without CAP_SYS_ADMIN > >capabilities. I'm buying into the convenience and reduce complexity, and XDP sockets without CAP_SYS_ADMIN into containers. People might be surprised that I'm actually NOT buying into the better performance argument here. At these speeds we are basically comparing how close we are to zero (and have to use nanosec time scale for our comparisons), more below. > > The first patch restructures xdp_do_redirect() a bit, to make it > > easier to add the new helper. This restructure also give us a slight > > performance benefit. The following three patches extends bind() and > > adds the new helper. After that, two libbpf patches that selects XDP > > program based on what kernel is running. Finally, selftests for the new > > functionality is added. > > > > Note that the libbpf "auto-selection" is based on kernel version, so > > it is hard coded to the "-next" version (5.12). If you would like to > > try this is out, you will need to change the libbpf patch locally! > > > > Thanks to Maciej and Magnus for the internal review/comments! 
> >
> > Performance (rxdrop, zero-copy)
> >
> > Baseline
> >   Two cores: 21.3 Mpps
> >   One core:  24.5 Mpps
>
> Two cores is slower? It used to be faster all the time, didn't it?
>
> > Patched
> >   Two cores, bpf_redirect_map: 21.7 Mpps + 2%
> >   One core,  bpf_redirect_map: 24.9 Mpps + 2%
> >
> >   Two cores, bpf_redirect_xsk: 24.0 Mpps +13%
>
> Nice, impressive improvement!

I do appreciate your work and performance optimizations at this level,
because when we are using this few CPU cycles per packet, it is really
hard to find new ways to reduce cycles further.

Thank you for saying +13% instead of saying +2.7 Mpps.

 21.3 Mpps = 46.94 nanosec per packet
 24.0 Mpps = 41.66 nanosec per packet

 21.3 Mpps -> 24.0 Mpps = 5.28 nanosec saved

On my 3.60GHz testlab machine that gives me 19 cycles.

> >   One core,  bpf_redirect_xsk: 25.5 Mpps + 4%

 24.5 Mpps -> 25.5 Mpps = 1.6 nanosec saved

At this point, saving 1.6 ns is around the cost of a function call
(1.3 ns).

We still need these optimizations in the kernel, but end-users in
userspace are very quickly going to waste the 19 cycles we found.
I still support/believe that the OS needs to have as little overhead as
possible, but for me 42 nanosec overhead is close to zero overhead. For
comparison, I just ran a udp_sink[3] test, and it "cost" 625 ns for
delivery of UDP packets into the socket (almost 15 times slower).

I guess my point is that with XDP we have already achieved and exceeded
(my original) performance goals; making it even faster is just showing
off ;-P

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c
[2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_memset.c
[3] https://github.com/netoptimizer/network-testing/blob/master/src/udp_sink.c
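The back-of-envelope arithmetic in the mail above (Mpps to nanoseconds per packet, and the saved nanoseconds to CPU cycles at a given clock) is easy to get wrong by a factor of 1000, so here it is as two tiny helper functions. Function names are my own for illustration; the 3.60 GHz figure is the testlab clock quoted in the mail.

```c
/* Convert a packet rate in millions of packets per second (Mpps)
 * to the per-packet time budget in nanoseconds:
 *   1e9 ns/s  /  (mpps * 1e6 pkt/s)  =  1000 / mpps  ns/pkt
 */
static double ns_per_packet(double mpps)
{
	return 1000.0 / mpps;
}

/* Convert a nanosecond delta to CPU cycles at a given clock in GHz
 * (cycles = ns * GHz, since 1 GHz == 1 cycle per nanosecond). */
static double ns_to_cycles(double ns, double ghz)
{
	return ns * ghz;
}
```

Plugging in the numbers from the mail: 21.3 Mpps is ~46.95 ns/pkt, 24.0 Mpps is ~41.67 ns/pkt, the delta is ~5.28 ns, which at 3.6 GHz is ~19 cycles — matching the figures quoted above.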
Re: [PATCHv14 bpf-next 1/6] bpf: run devmap xdp_prog on flush instead of bulk enqueue
On Mon, 18 Jan 2021 18:07:17 +0800 Hangbin Liu wrote: > On Sun, Jan 17, 2021 at 02:57:02PM -0800, John Fastabend wrote: > [...] > > It looks like we could embed xdp_buff in xdp_frame and then keep the > > metadata > > at the end. > > > > Because you are working performance here wdyt? <- @Jesper as well. > > Leave this question to Jesper. The struct xdp_buff is larger than struct xdp_frame. The size of xdp_frame matters. It is a reserved areas in top of the frame. An XDP BPF-program cannot access this area (and limit headroom grow). This is why this code works, as afterwards xdp_frame is still valid. Looking at the code xdp_update_frame_from_buff() we do seem to update more fields than actually needed. > > > > > > - sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags); > > > + if (unlikely(bq->xdp_prog)) { > > > > Whats the rational for making above unlikely()? Seems for users its not > > unlikely. Can you measure a performance increase/decrease here? I think > > its probably fine to just let compiler/prefetcher do its thing here. Or > > I'm not reading this right, but seems users of bq->xdp_prog would disagree > > on unlikely case? > > > > Either way a comment might be nice to give us some insight in 6 months > > why we decided this is unlikely. > > I agree that there is no need to use unlikely() here. I added the unlikely() to preserve the baseline performance when not having the 2nd prog loaded. But I'm fine with removing that. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCHv9 bpf-next] samples/bpf: add xdp program on egress for xdp_redirect_map
On Thu, 21 Jan 2021 21:06:42 +0800 Hangbin Liu wrote: > This patch add a xdp program on egress to show that we can modify > the packet on egress. In this sample we will set the pkt's src > mac to egress's mac address. The xdp_prog will be attached when > -X option supplied. > > Signed-off-by: Hangbin Liu > > --- > v9: > roll back to just set src mac to egress interface mac on 2nd xdp prog, > this could avoid packet reorder introduce in patch v6. I like this V9 much better. One (hopefully) last nitpick below. > diff --git a/samples/bpf/xdp_redirect_map_kern.c > b/samples/bpf/xdp_redirect_map_kern.c > index 6489352ab7a4..35ce5e26bbe5 100644 > --- a/samples/bpf/xdp_redirect_map_kern.c > +++ b/samples/bpf/xdp_redirect_map_kern.c > -SEC("xdp_redirect_map") > -int xdp_redirect_map_prog(struct xdp_md *ctx) > +static int xdp_redirect_map(struct xdp_md *ctx, void *redirect_map) Can you make this function always inline? > { > void *data_end = (void *)(long)ctx->data_end; > void *data = (void *)(long)ctx->data; > struct ethhdr *eth = data; > int rc = XDP_DROP; > - int vport, port = 0, m = 0; > long *value; > u32 key = 0; > u64 nh_off; > + int vport; > > nh_off = sizeof(*eth); > if (data + nh_off > data_end) > @@ -79,7 +96,40 @@ int xdp_redirect_map_prog(struct xdp_md *ctx) > swap_src_dst_mac(data); > > /* send packet out physical port */ > - return bpf_redirect_map(&tx_port, vport, 0); > + return bpf_redirect_map(redirect_map, vport, 0); > +} > + > +SEC("xdp_redirect_general") > +int xdp_redirect_map_general(struct xdp_md *ctx) > +{ > + return xdp_redirect_map(ctx, &tx_port_general); > +} > + > +SEC("xdp_redirect_native") > +int xdp_redirect_map_native(struct xdp_md *ctx) > +{ > + return xdp_redirect_map(ctx, &tx_port_native); > +} -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH bpf-next v2 0/8] Introduce bpf_redirect_xsk() helper
On Wed, 20 Jan 2021 17:19:31 +0100 Maciej Fijalkowski wrote: > On Wed, Jan 20, 2021 at 04:57:08PM +0100, Jesper Dangaard Brouer wrote: > > On Wed, 20 Jan 2021 15:15:22 +0200 > > Maxim Mikityanskiy wrote: > > > > > On 2021-01-19 17:50, Björn Töpel wrote: > > > > This series extends bind() for XDP sockets, so that the bound socket > > > > is added to the netdev_rx_queue _rx array in the netdevice. We call > > > > this to register the socket. To redirect packets to the registered > > > > socket, a new BPF helper is used: bpf_redirect_xsk(). > > > > > > > > For shared XDP sockets, only the first bound socket is > > > > registered. Users that need more complex setup has to use XSKMAP and > > > > bpf_redirect_map(). > > > > > > > > Now, why would one use bpf_redirect_xsk() over the regular > > > > bpf_redirect_map() helper? > > > > > > > > * Better performance! > > > > * Convenience; Most user use one socket per queue. This scenario is > > > >what registered sockets support. There is no need to create an > > > >XSKMAP. This can also reduce complexity from containerized setups, > > > >where users might what to use XDP sockets without CAP_SYS_ADMIN > > > >capabilities. > > > > I'm buying into the convenience and reduce complexity, and XDP sockets > > without CAP_SYS_ADMIN into containers. > > > > People might be surprised that I'm actually NOT buying into the better > > performance argument here. At these speeds we are basically comparing > > how close we are to zero (and have to use nanosec time scale for our > > comparisons), more below. > > > > > > > > The first patch restructures xdp_do_redirect() a bit, to make it > > > > easier to add the new helper. This restructure also give us a slight > > > > performance benefit. The following three patches extends bind() and > > > > adds the new helper. After that, two libbpf patches that selects XDP > > > > program based on what kernel is running. Finally, selftests for the new > > > > functionality is added. 
> > > > > > > > Note that the libbpf "auto-selection" is based on kernel version, so > > > > it is hard coded to the "-next" version (5.12). If you would like to > > > > try this is out, you will need to change the libbpf patch locally! > > > > > > > > Thanks to Maciej and Magnus for the internal review/comments! > > > > > > > > Performance (rxdrop, zero-copy) > > > > > > > > Baseline > > > > Two cores: 21.3 Mpps > > > > One core:24.5 Mpps > > > > > > Two cores is slower? It used to be faster all the time, didn't it? > > > > > > > Patched > > > > Two cores, bpf_redirect_map: 21.7 Mpps + 2% > > > > One core, bpf_redirect_map: 24.9 Mpps + 2% > > > > > > > > Two cores, bpf_redirect_xsk: 24.0 Mpps +13% > > > > > > Nice, impressive improvement! > > > > I do appreciate you work and performance optimizations at this level, > > because when we are using this few CPU cycles per packet, then it is > > really hard to find new ways to reduce cycles further. > > > > Thank for you saying +13% instead of saying +2.7 Mpps. > > It *is* impressive to basically reduce cycles with 13%. > > > > 21.3 Mpps = 46.94 nanosec per packet > > 24.0 Mpps = 41.66 nanosec per packet > > > > 21.3 Mpps -> 24.0 Mpps = 5.28 nanosec saved > > > > On my 3.60GHz testlab machine that gives me 19 cycles. > > > > > > > > One core, bpf_redirect_xsk: 25.5 Mpps + 4% > > > > > > > > 24.5 Mpps -> 25.5 Mpps = 1.6 nanosec saved > > > > At this point with saving 1.6 ns this is around the cost of a function call > > 1.3 ns. > > > > > > We still need these optimization in the kernel, but end-users in > > userspace are very quickly going to waste the 19 cycles we found. > > I still support/believe that the OS need to have a little overhead as > > possible, but for me 42 nanosec overhead is close to zero overhead. For > > comparison, I just ran a udp_sink[3] test, and it "cost" 625 ns for > > delivery of UDP packets into socket (almost 15 times slower). 
> > > > I guess my point is that with XDP we have already achieved and exceed
Re: [PATCH net-next] sfc: reduce the number of requested xdp ev queues
On Wed, 20 Jan 2021 13:27:59 -0800 Ivan Babrou wrote: > Without this change the driver tries to allocate too many queues, > breaching the number of available msi-x interrupts on machines > with many logical cpus and default adapter settings: > > Insufficient resources for 12 XDP event queues (24 other channels, max 32) > > Which in turn triggers EINVAL on XDP processing: > > sfc :86:00.0 ext0: XDP TX failed (-22) > > Signed-off-by: Ivan Babrou > --- I guess the patch is good in itself, given the limited number of available MSI-X interrupts. Per earlier discussion: what will happen if a CPU with an ID higher than the number of available XDP TX-queues redirects a packet out of this driver? > drivers/net/ethernet/sfc/efx_channels.c | 6 -- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/drivers/net/ethernet/sfc/efx_channels.c > b/drivers/net/ethernet/sfc/efx_channels.c > index a4a626e9cd9a..1bfeee283ea9 100644 > --- a/drivers/net/ethernet/sfc/efx_channels.c > +++ b/drivers/net/ethernet/sfc/efx_channels.c > @@ -17,6 +17,7 @@ > #include "rx_common.h" > #include "nic.h" > #include "sriov.h" > +#include "workarounds.h" > > /* This is the first interrupt mode to try out of: > * 0 => MSI-X > @@ -137,6 +138,7 @@ static int efx_allocate_msix_channels(struct efx_nic *efx, > { > unsigned int n_channels = parallelism; > int vec_count; > + int tx_per_ev; > int n_xdp_tx; > int n_xdp_ev; > > @@ -149,9 +151,9 @@ static int efx_allocate_msix_channels(struct efx_nic *efx, >* multiple tx queues, assuming tx and ev queues are both >* maximum size. >*/ > - > + tx_per_ev = EFX_MAX_EVQ_SIZE / EFX_TXQ_MAX_ENT(efx); > n_xdp_tx = num_possible_cpus(); > - n_xdp_ev = DIV_ROUND_UP(n_xdp_tx, EFX_MAX_TXQ_PER_CHANNEL); > + n_xdp_ev = DIV_ROUND_UP(n_xdp_tx, tx_per_ev); > > vec_count = pci_msix_vec_count(efx->pci_dev); > if (vec_count < 0) -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCHv10 bpf-next] samples/bpf: add xdp program on egress for xdp_redirect_map
On Fri, 22 Jan 2021 10:50:07 +0800 Hangbin Liu wrote: > This patch add a xdp program on egress to show that we can modify > the packet on egress. In this sample we will set the pkt's src > mac to egress's mac address. The xdp_prog will be attached when > -X option supplied. > > Signed-off-by: Hangbin Liu > --- I think we have nitpicked this enough ;-) Acked-by: Jesper Dangaard Brouer > v10: > make xdp_redirect_map() always inline. > > v9: > roll back to just set src mac to egress interface mac on 2nd xdp prog, > this could avoid packet reorder introduce in patch v6. > > v8: Fix some checkpatch issues. > > v7: > a) use bpf_object__find_program_by_name() instad of >bpf_object__find_program_by_title() > b) set default devmap fd to 0 > > v6: no code update, only rebase the code on latest bpf-next > > v5: > a) close fd when err out in get_mac_addr() > b) exit program when both -S and -X supplied. > > v4: > a) Update get_mac_addr socket create > b) Load dummy prog regardless of 2nd xdp prog on egress > > v3: > a) modify the src mac address based on egress mac > > v2: > a) use pkt counter instead of IP ttl modification on egress program > b) make the egress program selectable by option -X > --- > samples/bpf/xdp_redirect_map_kern.c | 60 +-- > samples/bpf/xdp_redirect_map_user.c | 112 +++- > 2 files changed, 147 insertions(+), 25 deletions(-) -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Explaining XDP redirect bulk size design (Was: [PATCH v2 bpf 1/5] net: ethtool: add xdp properties flag set)
On Wed, 9 Dec 2020 08:44:33 -0700 David Ahern wrote: > On 12/9/20 4:52 AM, Jesper Dangaard Brouer wrote: > > But I have redesigned the ndo_xdp_xmit call to take a bulk of packets > > (up-to 16) so it should not be a problem to solve this by sharing > > TX-queue and talking a lock per 16 packets. I still recommend that, > > for fallback case, you allocated a number a TX-queue and distribute > > this across CPUs to avoid hitting a congested lock (above measurements > > are the optimal non-congested atomic lock operation) > > I have been meaning to ask you why 16 for the XDP batching? If the > netdev budget is 64, why not something higher like 32 or 64? Thank you for asking, as there are multiple good reasons and considerations behind this 16 batch size. Notice cpumap has batch size 8, which is also an explicit choice. And AF_XDP went in the wrong direction IMHO and I think has 256. I designed this to be a choice in the map code, for the level of bulking it needs/wants. The low level explanation is that these 8 and 16 batch sizes are optimized towards cache sizes and Intel's Line-Fill-Buffer (prefetcher with 10 elements). I'm betting that the memory backing these 8 or 16 packets has a higher chance of remaining in cache, and I can prefetch them without evicting them from cache again. In some cases the pointers to these packets are queued into a ptr_ring, and it is more optimal to write cacheline sizes 1 (8 pointers) or 2 (16 pointers) into the ptr_ring. The general explanation is my goal to do bulking without adding latency. This is explicitly stated in my presentation[1] as of Feb 2016, slide 20. Sure, you/we can likely make the micro-benchmarks look better by using 64 batch size, but that will introduce added latency and likely shoot ourselves in the foot for real workloads. With experience from bufferbloat and real networks, we know that massive TX bulking has bad effects. 
Still, XDP-redirect does massive bulking (the NIC flush is after the full 64 budget) and we don't have pushback or a queue mechanism (so I know we are already shooting ourselves in the foot) ... Fortunately we now have a PhD student working on queuing for XDP. It is also important to understand that this is an adaptive bulking scheme, which comes from NAPI. We don't wait for packets arriving shortly; we pick up what the NIC has available, but by only taking 8 or 16 packets (instead of emptying the entire RX-queue) and then spending some time to send them along, I'm hoping the NIC could have received some more frames. For cpumap and veth (in some cases) they can start to consume packets from these batches, but NIC drivers get the XDP_XMIT_FLUSH signal at NAPI-end (xdp_do_flush). Still, the design allows NIC drivers to update their internal queue state (and BQL), and if it gets close to full they can choose to flush/doorbell the NIC earlier. When doing queuing for XDP we need to expose these NIC queue states, and having 4 calls with 16 packets (64 budget) also gives us more chances to get NIC queue state info, which the NIC already touches. [1] https://people.netfilter.org/hawk/presentations/devconf2016/net_stack_challenges_100G_Feb2016.pdf -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [Intel-wired-lan] Explaining XDP redirect bulk size design (Was: [PATCH v2 bpf 1/5] net: ethtool: add xdp properties flag set)
On Thu, 10 Dec 2020 15:14:18 +0100 Magnus Karlsson wrote: > On Thu, Dec 10, 2020 at 2:32 PM Jesper Dangaard Brouer > wrote: > > > > On Wed, 9 Dec 2020 08:44:33 -0700 > > David Ahern wrote: > > > > > On 12/9/20 4:52 AM, Jesper Dangaard Brouer wrote: > > > > But I have redesigned the ndo_xdp_xmit call to take a bulk of packets > > > > (up-to 16) so it should not be a problem to solve this by sharing > > > > TX-queue and talking a lock per 16 packets. I still recommend that, > > > > for fallback case, you allocated a number a TX-queue and distribute > > > > this across CPUs to avoid hitting a congested lock (above measurements > > > > are the optimal non-congested atomic lock operation) > > > > > > I have been meaning to ask you why 16 for the XDP batching? If the > > > netdev budget is 64, why not something higher like 32 or 64? > > > > Thanks you for asking as there are multiple good reasons and > > consideration for this 16 batch size. Notice cpumap have batch size 8, > > which is also an explicit choice. And AF_XDP went in the wrong > > direction IMHO and I think have 256. I designed this to be a choice in > > the map code, for the level of bulking it needs/wants. > > FYI, as far as I know, there is nothing in AF_XDP that says bulking > should be 256. There is a 256 number in the i40e driver that states > the maximum number of packets to be sent within one napi_poll loop. > But this is just a maximum number and only for that driver. (In case > you wonder, that number was inherited from the original skb Tx > implementation in the driver.) Ah, that explains the issue I have on the production system that runs the EDT-pacer[2]. I see that the i40e function i40e_clean_tx_irq() ignores napi_budget and uses its own budget, which defaults to 256. Looks like I can adjust this via ethtool -C tx-frames-irq. I turned this down to 64 (32 was giving worse results, and below 16 the system acted strangely). 
Now the issue is gone. The issue was that if TX-DMA completion (i40e_clean_tx_irq()) was running on the same CPU that sends packets via the FQ-pacer qdisc, then the pacing was not accurate and sending was too bursty. The system has already tuned "net/core/dev_weight" and the RX/TX-bias to reduce bulking, as this can influence latency and the EDT-pacing accuracy. (It is a middlebox bridging VLANs, doing BPF-EDT timestamping and FQ-pacing of packets to solve bursts overflowing switch ports.)

sudo sysctl net/core/dev_weight
net.core.dev_weight = 1
net.core.dev_weight_rx_bias = 32
net.core.dev_weight_tx_bias = 1

This net.core.dev_weight_tx_bias=1 (together with dev_weight=1) causes the qdisc transmit budget to become one packet, cycling through NET_TX_SOFTIRQ, which consumes time and gives a little more pacing space for the packets. > The actual batch size is controlled by > the application. If it puts 1 packet in the Tx ring and calls send(), > the batch size will be 1. If it puts 128 packets in the Tx ring and > calls send(), you get a batch size of 128, and so on. It is flexible, > so you can trade-off latency with throughput in the way the > application desires. Rx batch size has also become flexible now with > the introduction of Björn's prefer_busy_poll patch set [1]. > > [1] > https://lore.kernel.org/netdev/20201130185205.196029-1-bjorn.to...@gmail.com/ This looks like a cool trick, to get even more accurate packet scheduling. I played with the tunings, and could see changed behavior with mpstat, but ended up turning it off again, as I could not measure a direct correlation with the bpftrace tools[3]. > > The low level explanation is that these 8 and 16 batch sizes are > > optimized towards cache sizes and Intel's Line-Fill-Buffer (prefetcher > > with 10 elements). I'm betting on that memory backing these 8 or 16 > > packets have higher chance to remain/being in cache, and I can prefetch > > them without evicting them from cache again. 
In some cases the pointer > > to these packets are queued into a ptr_ring, and it is more optimal to > > write cacheline sizes 1 (8 pointers) or 2 (16 pointers) into the ptr_ring. > > > > The general explanation is my goal to do bulking without adding latency. > > This is explicitly stated in my presentation[1] as of Feb 2016, slide 20. > > Sure, you/we can likely make the micro-benchmarks look better by using > > 64 batch size, but that will introduce added latency and likely shoot > > our-selves in the foot for real workloads. With experience from > > bufferbloat and real networks, we know that massive TX bulking have bad > > effects. Still XDP-redirect does massive bulking (NIC flush is after > > full 64 budget) and we don't
Re: [PATCH bpf-next V7 4/8] bpf: add BPF-helper for MTU checking
On Thu, 3 Dec 2020 10:25:41 -0800 s...@google.com wrote: > > +BPF_CALL_5(bpf_skb_check_mtu, struct sk_buff *, skb, > > + u32, ifindex, u32 *, mtu_len, s32, len_diff, u64, flags) > > +{ > > + int ret = BPF_MTU_CHK_RET_FRAG_NEEDED; > > + struct net_device *dev = skb->dev; > > + int len; > > + int mtu; > > + > > + if (flags & ~(BPF_MTU_CHK_SEGS)) > > + return -EINVAL; > > + > > + dev = __dev_via_ifindex(dev, ifindex); > > + if (!dev) > > + return -ENODEV; > > + > > + mtu = READ_ONCE(dev->mtu); > > + > > + /* TC len is L2, remove L2-header as dev MTU is L3 size */ > > [..] > > + len = skb->len - ETH_HLEN; > Any reason not to do s/ETH_HLEN/dev->hard_header_len/ (or min_header_len?) > thought this patch? Will fix in V9. There is a very small (performance) overhead, but mostly because the net_device struct layout has placed mtu and hard_header_len on different cache-lines. (This is something that should be fixed separately.) -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH bpf-next V8 4/8] bpf: add BPF-helper for MTU checking
On Thu, 3 Dec 2020 00:23:14 +0100 Daniel Borkmann wrote: > On 11/27/20 7:06 PM, Jesper Dangaard Brouer wrote: > [...] > > +static struct net_device *__dev_via_ifindex(struct net_device *dev_curr, > > + u32 ifindex) > > +{ > > + struct net *netns = dev_net(dev_curr); > > + > > + /* Non-redirect use-cases can use ifindex=0 and save ifindex lookup */ > > + if (ifindex == 0) > > + return dev_curr; > > + > > + return dev_get_by_index_rcu(netns, ifindex); > > +} > > + > > +BPF_CALL_5(bpf_skb_check_mtu, struct sk_buff *, skb, > > + u32, ifindex, u32 *, mtu_len, s32, len_diff, u64, flags) > > +{ > > + int ret = BPF_MTU_CHK_RET_FRAG_NEEDED; > > + struct net_device *dev = skb->dev; > > + int len; > > + int mtu; > > + > > + if (flags & ~(BPF_MTU_CHK_SEGS)) > > nit: unlikely() (similar for XDP case) ok > > + return -EINVAL; > > + > > + dev = __dev_via_ifindex(dev, ifindex); > > + if (!dev) > > nit: unlikely() (ditto XDP) ok > > + return -ENODEV; > > + > > + mtu = READ_ONCE(dev->mtu); > > + > > + /* TC len is L2, remove L2-header as dev MTU is L3 size */ > > + len = skb->len - ETH_HLEN; > > s/ETH_HLEN/dev->hard_header_len/ ? ok > > + len += len_diff; /* len_diff can be negative, minus result pass check */ > > + if (len <= mtu) { > > + ret = BPF_MTU_CHK_RET_SUCCESS; > > Wouldn't it be more intuitive to do ... > > len_dev = READ_ONCE(dev->mtu) + dev->hard_header_len + VLAN_HLEN; > len_skb = skb->len + len_diff; > if (len_skb <= len_dev) { >ret = BPF_MTU_CHK_RET_SUCCESS; >got out; > } Yes, that is more intuitive to read. > > + goto out; > > + } > > + /* At this point, skb->len exceed MTU, but as it include length of all > > +* segments, it can still be below MTU. The SKB can possibly get > > +* re-segmented in transmit path (see validate_xmit_skb). Thus, user > > +* must choose if segs are to be MTU checked. Last SKB "headlen" is > > +* checked against MTU. 
> > +*/ > > + if (skb_is_gso(skb)) { > > + ret = BPF_MTU_CHK_RET_SUCCESS; > > + > > + if (flags & BPF_MTU_CHK_SEGS && > > + skb_gso_validate_network_len(skb, mtu)) { > > + ret = BPF_MTU_CHK_RET_SEGS_TOOBIG; > > + goto out; > > Maybe my lack of coffe, but looking at ip_exceeds_mtu() for example, shouldn't > the above test be on !skb_gso_validate_network_len() instead? Yes, you are right! > skb_is_gso(skb) && skb_gso_validate_network_len(skb, mtu) would indicate that > it does /not/ exceed mtu. > > > + } > > + > > + len = skb_headlen(skb) - ETH_HLEN + len_diff; > > How does this work with GRO when we invoke this helper at tc ingress, e.g. > when > there is still non-linear data in skb_shinfo(skb)->frags[]? In the case of skb_is_gso(), this code will check the linear part skb_headlen(skb) against the MTU. I thought this was an improvement over what we have today, where skb_is_gso() packets skip all checks, which has caused a lot of confusion among end-users. I will put this under the BPF_MTU_CHK_SEGS flag (in V9), as I understand from your comment that you don't think this is correct at tc ingress. > > + if (len > mtu) { > > + ret = BPF_MTU_CHK_RET_FRAG_NEEDED; > > + goto out; > > + } > > + } > > +out: > > + /* BPF verifier guarantees valid pointer */ > > + *mtu_len = mtu; > > + > > + return ret; > > +} [...] -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH net-next] sfc: reduce the number of requested xdp ev queues
On Mon, 14 Dec 2020 17:29:06 -0800 Ivan Babrou wrote: > Without this change the driver tries to allocate too many queues, > breaching the number of available msi-x interrupts on machines > with many logical cpus and default adapter settings: > > Insufficient resources for 12 XDP event queues (24 other channels, max 32) > > Which in turn triggers EINVAL on XDP processing: > > sfc :86:00.0 ext0: XDP TX failed (-22) I have a similar QA report with XDP_REDIRECT: sfc :05:00.0 ens1f0np0: XDP redirect failed (-22) Here we are back to the issue we discussed with ixgbe: NIC MSI-X interrupt hardware resources are not enough on machines with many logical CPUs. After this fix, what will happen if (cpu >= efx->xdp_tx_queue_count) ? (Copied efx_xdp_tx_buffers code below signature) The question leads to: does this driver need a fallback mechanism when HW resources or the system's logical CPU count exceed the one-TX-queue-per-CPU assumption? > Signed-off-by: Ivan Babrou > --- > drivers/net/ethernet/sfc/efx_channels.c | 6 -- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/drivers/net/ethernet/sfc/efx_channels.c > b/drivers/net/ethernet/sfc/efx_channels.c index > a4a626e9cd9a..1bfeee283ea9 100644 --- > a/drivers/net/ethernet/sfc/efx_channels.c +++ > b/drivers/net/ethernet/sfc/efx_channels.c @@ -17,6 +17,7 @@ > #include "rx_common.h" > #include "nic.h" > #include "sriov.h" > +#include "workarounds.h" > > /* This is the first interrupt mode to try out of: > * 0 => MSI-X > @@ -137,6 +138,7 @@ static int efx_allocate_msix_channels(struct > efx_nic *efx, { > unsigned int n_channels = parallelism; > int vec_count; > + int tx_per_ev; > int n_xdp_tx; > int n_xdp_ev; > > @@ -149,9 +151,9 @@ static int efx_allocate_msix_channels(struct > efx_nic *efx, >* multiple tx queues, assuming tx and ev queues are both >* maximum size. 
>*/ > - > + tx_per_ev = EFX_MAX_EVQ_SIZE / EFX_TXQ_MAX_ENT(efx); > n_xdp_tx = num_possible_cpus(); > - n_xdp_ev = DIV_ROUND_UP(n_xdp_tx, EFX_MAX_TXQ_PER_CHANNEL); > + n_xdp_ev = DIV_ROUND_UP(n_xdp_tx, tx_per_ev); > > vec_count = pci_msix_vec_count(efx->pci_dev); > if (vec_count < 0) -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer /* Transmit a packet from an XDP buffer * * Returns number of packets sent on success, error code otherwise. * Runs in NAPI context, either in our poll (for XDP TX) or a different NIC * (for XDP redirect). */ int efx_xdp_tx_buffers(struct efx_nic *efx, int n, struct xdp_frame **xdpfs, bool flush) { struct efx_tx_buffer *tx_buffer; struct efx_tx_queue *tx_queue; struct xdp_frame *xdpf; dma_addr_t dma_addr; unsigned int len; int space; int cpu; int i; cpu = raw_smp_processor_id(); if (!efx->xdp_tx_queue_count || unlikely(cpu >= efx->xdp_tx_queue_count)) return -EINVAL; tx_queue = efx->xdp_tx_queues[cpu]; if (unlikely(!tx_queue)) return -EINVAL; if (unlikely(n && !xdpfs)) return -EINVAL; if (!n) return 0; /* Check for available space. We should never need multiple * descriptors per frame. */ space = efx->txq_entries + tx_queue->read_count - tx_queue->insert_count; for (i = 0; i < n; i++) { xdpf = xdpfs[i]; if (i >= space) break; /* We'll want a descriptor for this tx. */ prefetchw(__efx_tx_queue_get_insert_buffer(tx_queue)); len = xdpf->len; /* Map for DMA. */ dma_addr = dma_map_single(&efx->pci_dev->dev, xdpf->data, len, DMA_TO_DEVICE); if (dma_mapping_error(&efx->pci_dev->dev, dma_addr)) break; /* Create descriptor and set up for unmapping DMA. */ tx_buffer = efx_tx_map_chunk(tx_queue, dma_addr, len); tx_buffer->xdpf = xdpf; tx_buffer->flags = EFX_TX_BUF_XDP | EFX_TX_BUF_MAP_SINGLE; tx_buffer->dma_offset = 0; tx_buffer->unmap_len = len; tx_queue->tx_packets++; } /* Pass mapped frames to hardware. 
*/ if (flush && i > 0) efx_nic_push_buffers(tx_queue); if (i == 0) return -EIO; efx_xdp_return_frames(n - i, xdpfs + i); return i; }
Re: [PATCH v3 bpf-next 2/2] net: xdp: introduce xdp_prepare_buff utility routine
On Tue, 15 Dec 2020 16:13:44 +0100 Maciej Fijalkowski wrote: > On Tue, Dec 15, 2020 at 04:06:20PM +0100, Lorenzo Bianconi wrote: > > > On 12/15/20 2:47 PM, Lorenzo Bianconi wrote: > > > [...] > > > > > > diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c > > > > > > index 329397c60d84..61d3f5f8b7f3 100644 > > > > > > --- a/drivers/net/xen-netfront.c > > > > > > +++ b/drivers/net/xen-netfront.c > > > > > > @@ -866,10 +866,8 @@ static u32 xennet_run_xdp(struct > > > > > > netfront_queue *queue, struct page *pdata, > > > > > > xdp_init_buff(xdp, XEN_PAGE_SIZE - XDP_PACKET_HEADROOM, > > > > > > &queue->xdp_rxq); > > > > > > - xdp->data_hard_start = page_address(pdata); > > > > > > - xdp->data = xdp->data_hard_start + XDP_PACKET_HEADROOM; > > > > > > + xdp_prepare_buff(xdp, page_address(pdata), XDP_PACKET_HEADROOM, > > > > > > len); > > > > > > xdp_set_data_meta_invalid(xdp); > > > > > > - xdp->data_end = xdp->data + len; > > > > > > act = bpf_prog_run_xdp(prog, xdp); > > > > > > switch (act) { > > > > > > diff --git a/include/net/xdp.h b/include/net/xdp.h > > > > > > index 3fb3a9aa1b71..66d8a4b317a3 100644 > > > > > > --- a/include/net/xdp.h > > > > > > +++ b/include/net/xdp.h > > > > > > @@ -83,6 +83,18 @@ xdp_init_buff(struct xdp_buff *xdp, u32 > > > > > > frame_sz, struct xdp_rxq_info *rxq) > > > > > > xdp->rxq = rxq; > > > > > > } > > > > > > +static inline void > > > > > > nit: maybe __always_inline > > > > ack, I will add in v4 > > > > > > > > > > > +xdp_prepare_buff(struct xdp_buff *xdp, unsigned char *hard_start, > > > > > > +int headroom, int data_len) > > > > > > +{ > > > > > > + unsigned char *data = hard_start + headroom; > > > > > > + > > > > > > + xdp->data_hard_start = hard_start; > > > > > > + xdp->data = data; > > > > > > + xdp->data_end = data + data_len; > > > > > > + xdp->data_meta = data; > > > > > > +} > > > > > > + > > > > > > /* Reserve memory area at end-of data area. 
> > > > > >* > > > > > > For the drivers with xdp_set_data_meta_invalid(), we're basically setting > > > xdp->data_meta > > > twice unless compiler is smart enough to optimize the first one away (did > > > you double check?). > > > Given this is supposed to be a cleanup, why not integrate this logic as > > > well so the > > > xdp_set_data_meta_invalid() doesn't get extra treatment? > > That's what I was trying to say previously. > > > > > we discussed it before, but I am fine to add it in v4. Something like: > > > > static __always_inline void > > xdp_prepare_buff(struct xdp_buff *xdp, unsigned char *hard_start, > > int headroom, int data_len, bool meta_valid) > > { > > unsigned char *data = hard_start + headroom; > > > > xdp->data_hard_start = hard_start; > > xdp->data = data; > > xdp->data_end = data + data_len; > > xdp->data_meta = meta_valid ? data : data + 1; > > This will introduce branch, so for intel drivers we're getting the > overhead of one add and a branch. I'm still opting for a separate helper. I should think, as this gets inlined the compiler should be able to remove the branch. I assume that the usage of 'meta_valid' will be a const in the drivers. Maybe we should have the API be 'const bool meta_valid'? > static __always_inline void > xdp_prepare_buff(struct xdp_buff *xdp, unsigned char *hard_start, >int headroom, int data_len) > { > unsigned char *data = hard_start + headroom; > > xdp->data_hard_start = hard_start; > xdp->data = data; > xdp->data_end = data + data_len; > xdp_set_data_meta_invalid(xdp); > } > > static __always_inline void > xdp_prepare_buff_meta(struct xdp_buff *xdp, unsigned char *hard_start, > int headroom, int data_len) > { > unsigned char *data = hard_start + headroom; > > xdp->data_hard_start = hard_start; > xdp->data = data; > xdp->data_end = data + data_len; > xdp->data_meta = data; > } Thanks to you Maciej for reviewing this! 
:-) -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH v3 bpf-next 1/2] net: xdp: introduce xdp_init_buff utility routine
On Sat, 12 Dec 2020 18:41:48 +0100 Lorenzo Bianconi wrote: > diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c > b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c > index fcc262064766..b7942c3440c0 100644 > --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c > +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c > @@ -133,12 +133,11 @@ bool bnxt_rx_xdp(struct bnxt *bp, struct > bnxt_rx_ring_info *rxr, u16 cons, > dma_sync_single_for_cpu(&pdev->dev, mapping + offset, *len, bp->rx_dir); > > txr = rxr->bnapi->tx_ring; > + xdp_init_buff(&xdp, PAGE_SIZE, &rxr->xdp_rxq); > xdp.data_hard_start = *data_ptr - offset; > xdp.data = *data_ptr; > xdp_set_data_meta_invalid(&xdp); > xdp.data_end = *data_ptr + *len; > - xdp.rxq = &rxr->xdp_rxq; > - xdp.frame_sz = PAGE_SIZE; /* BNXT_RX_PAGE_MODE(bp) when XDP enabled */ > orig_data = xdp.data; I don't like losing the comment here. Other developers reading this code might assume that size is always PAGE_SIZE, which is only the case when XDP is enabled. Let's save them from making this mistake. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH net-next] sfc: reduce the number of requested xdp ev queues
On Tue, 15 Dec 2020 18:49:55 + Edward Cree wrote: > On 15/12/2020 09:43, Jesper Dangaard Brouer wrote: > > On Mon, 14 Dec 2020 17:29:06 -0800 > > Ivan Babrou wrote: > > > >> Without this change the driver tries to allocate too many queues, > >> breaching the number of available msi-x interrupts on machines > >> with many logical cpus and default adapter settings: > >> > >> Insufficient resources for 12 XDP event queues (24 other channels, max 32) > >> > >> Which in turn triggers EINVAL on XDP processing: > >> > >> sfc :86:00.0 ext0: XDP TX failed (-22) > > > > I have a similar QA report with XDP_REDIRECT: > > sfc :05:00.0 ens1f0np0: XDP redirect failed (-22) > > > > Here we are back to the issue we discussed with ixgbe, that NIC / msi-x > > interrupts hardware resources are not enough on machines with many > > logical cpus. > > > > After this fix, what will happen if (cpu >= efx->xdp_tx_queue_count) ? > > Same as happened before: the "failed -22". But this fix will make that > less likely to happen, because it ties more TXQs to each EVQ, and it's > the EVQs that are in short supply. > So, what I hear is that this fix is just papering over the real issue. I suggest that you/we detect the situation, and have a code path that will take a lock (per 16-packet bulk) and solve the issue. If you care about maximum performance, you can implement this via changing the ndo_xdp_xmit pointer to the fallback function when needed, to avoid having to check for the fallback mode in the fast-path. > > (Strictly speaking, I believe the limitation is a software one, that > comes from the driver's channel structures having been designed a > decade ago when 32 cpus ought to be enough for anybody... AFAIR the > hardware is capable of giving us something like 1024 evqs if we ask > for them, it just might not have that many msi-x vectors for us.) 
> Anyway, the patch looks correct, so > Acked-by: Edward Cree -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH v3 bpf-next 2/2] net: xdp: introduce xdp_prepare_buff utility routine
On Tue, 15 Dec 2020 14:47:10 +0100 Lorenzo Bianconi wrote: > [...] > > > xdp_act = bpf_prog_run_xdp(xdp_prog, &xdp); > > > diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c > > > b/drivers/net/ethernet/intel/i40e/i40e_txrx.c > > > index 4dbbbd49c389..fcd1ca3343fb 100644 > > > --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c > > > +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c > > > @@ -2393,12 +2393,12 @@ static int i40e_clean_rx_irq(struct i40e_ring > > > *rx_ring, int budget) > > > > > > /* retrieve a buffer from the ring */ > > > if (!skb) { > > > - xdp.data = page_address(rx_buffer->page) + > > > -rx_buffer->page_offset; > > > - xdp.data_meta = xdp.data; > > > - xdp.data_hard_start = xdp.data - > > > - i40e_rx_offset(rx_ring); > > > - xdp.data_end = xdp.data + size; > > > + unsigned int offset = i40e_rx_offset(rx_ring); > > > > I now see that we could call the i40e_rx_offset() once per napi, so can > > you pull this variable out and have it initialized a single time? Applies > > to other intel drivers as well. > > ack, fine. I will fix in v4. Be careful with the Intel drivers. They have two modes (at compile time) depending on the PAGE_SIZE of the system. In one of the modes (the default one) you can place the init of xdp.frame_sz outside the NAPI loop and init it a single time. In the other mode you cannot, and it becomes dynamic per packet. Intel, please review this carefully! -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH bpf-next V8 5/8] bpf: drop MTU check when doing TC-BPF redirect to ingress
On Thu, 3 Dec 2020 00:43:36 +0100 Daniel Borkmann wrote: > On 11/27/20 7:06 PM, Jesper Dangaard Brouer wrote: > > The use-case for dropping the MTU check when TC-BPF does redirect to > > ingress, is described by Eyal Birger in email[0]. The summary is the > > ability to increase packet size (e.g. with IPv6 headers for NAT64) and > > ingress redirect packet and let normal netstack fragment packet as needed. > > > > [0] > > https://lore.kernel.org/netdev/CAHsH6Gug-hsLGHQ6N0wtixdOa85LDZ3HNRHVd0opR=19qo4...@mail.gmail.com/ > > > > V4: > > - Keep net_device "up" (IFF_UP) check. > > - Adjustment to handle bpf_redirect_peer() helper > > > > Signed-off-by: Jesper Dangaard Brouer > > --- > > include/linux/netdevice.h | 31 +-- > > net/core/dev.c| 19 ++- > > net/core/filter.c | 14 +++--- > > 3 files changed, 42 insertions(+), 22 deletions(-) > > > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h > > index 7ce648a564f7..4a854e09e918 100644 > > --- a/include/linux/netdevice.h > > +++ b/include/linux/netdevice.h > > @@ -3917,11 +3917,38 @@ int dev_forward_skb(struct net_device *dev, struct > > sk_buff *skb); > > bool is_skb_forwardable(const struct net_device *dev, > > const struct sk_buff *skb); > > > > +static __always_inline bool __is_skb_forwardable(const struct net_device > > *dev, > > +const struct sk_buff *skb, > > +const bool check_mtu) > > +{ > > + const u32 vlan_hdr_len = 4; /* VLAN_HLEN */ > > + unsigned int len; > > + > > + if (!(dev->flags & IFF_UP)) > > + return false; > > + > > + if (!check_mtu) > > + return true; > > + > > + len = dev->mtu + dev->hard_header_len + vlan_hdr_len; > > + if (skb->len <= len) > > + return true; > > + > > + /* if TSO is enabled, we don't care about the length as the packet > > +* could be forwarded without being segmented before > > +*/ > > + if (skb_is_gso(skb)) > > + return true; > > + > > + return false; > > +} > > + > > static __always_inline int dev_forward_skb(struct net_device *dev, > > - struct sk_buff *skb) 
> > + struct sk_buff *skb, > > + const bool check_mtu) > > { > > if (skb_orphan_frags(skb, GFP_ATOMIC) || > > - unlikely(!is_skb_forwardable(dev, skb))) { > > + unlikely(!__is_skb_forwardable(dev, skb, check_mtu))) { > > atomic_long_inc(&dev->rx_dropped); > > kfree_skb(skb); > > return NET_RX_DROP; > > diff --git a/net/core/dev.c b/net/core/dev.c > > index 60d325bda0d7..6ceb6412ee97 100644 > > --- a/net/core/dev.c > > +++ b/net/core/dev.c > > @@ -2189,28 +2189,13 @@ static inline void net_timestamp_set(struct sk_buff > > *skb) > > > > bool is_skb_forwardable(const struct net_device *dev, const struct > > sk_buff *skb) > > { > > - unsigned int len; > > - > > - if (!(dev->flags & IFF_UP)) > > - return false; > > - > > - len = dev->mtu + dev->hard_header_len + VLAN_HLEN; > > - if (skb->len <= len) > > - return true; > > - > > - /* if TSO is enabled, we don't care about the length as the packet > > -* could be forwarded without being segmented before > > -*/ > > - if (skb_is_gso(skb)) > > - return true; > > - > > - return false; > > + return __is_skb_forwardable(dev, skb, true); > > } > > EXPORT_SYMBOL_GPL(is_skb_forwardable); > > Only user of is_skb_forwardable() that is left after this patch is bridge, > maybe > the whole thing should be moved into the header? Well, yes, maybe... I just felt it belongs in another patchset. > > int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb) > > { > > - int ret = dev_forward_skb(dev, skb); > > + int ret = dev_forward_skb(dev, skb, true); > > > > if (likely(!ret)) { > > skb->protocol = eth_type_trans(skb, dev); > > diff --git a/net/core/filter.c b/net/core/filter.c > > index d6125cfc49c3..4673afe59533 100644 > > --- a/net/core/filter
Re: [PATCH bpf-next V8 5/8] bpf: drop MTU check when doing TC-BPF redirect to ingress
On Thu, 17 Dec 2020 15:46:55 +0100 Jesper Dangaard Brouer wrote: > > > diff --git a/net/core/filter.c b/net/core/filter.c > > > index d6125cfc49c3..4673afe59533 100644 > > > --- a/net/core/filter.c > > > +++ b/net/core/filter.c > > > @@ -2083,13 +2083,21 @@ static const struct bpf_func_proto > > > bpf_csum_level_proto = { > > > > > > static inline int __bpf_rx_skb(struct net_device *dev, struct sk_buff > > > *skb) > > > { > > > - return dev_forward_skb(dev, skb); > > > + int ret = dev_forward_skb(dev, skb, false); > > > + > > > + if (likely(!ret)) { > > > + skb->protocol = eth_type_trans(skb, dev); > > > + skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN); > > > + ret = netif_rx(skb); > > > > Why netif_rx() and not netif_rx_internal() as in dev_forward_skb() > > originally? > > One extra call otherwise. > > This is because the function below calls netif_rx(), which is just > outside patch-diff-window. Thus, it looked wrong/strange to call > netif_rx_internal(), but sure I can use netif_rx_internal() instead. Well, when building I found that we obviously cannot call netif_rx_internal() as this is filter.c, else we get a build error: net/core/filter.c:2091:9: error: implicit declaration of function ‘netif_rx_internal’ [-Werror=implicit-function-declaration] 2091 | ret = netif_rx_internal(skb); | ^ -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
[PATCH bpf-next V9 0/7] bpf: New approach for BPF MTU handling
This patchset drops all the MTU checks in TC BPF-helpers that limit growing the packet size. This is done because these BPF-helpers don't take redirect into account, which can result in their MTU check being done against the wrong netdev. The new approach is to give BPF-programs knowledge about the MTU on a netdev (via ifindex) and fib route lookup level. Meaning some BPF-helpers are added and extended to make it possible to do MTU checks in the BPF-code. If BPF-prog doesn't comply with the MTU then the packet will eventually get dropped at some other layer. In some cases the existing kernel MTU checks will drop the packet, but there are also cases where BPF can bypass these checks. Specifically doing TC-redirect from ingress step (sch_handle_ingress) into egress code path (basically calling dev_queue_xmit()). It is left up to driver code to handle these kinds of MTU violations. One advantage of this approach is that an ingress-to-egress BPF-prog can send information via packet data. With the MTU checks removed in the helpers, and also not done in the skb_do_redirect() call, this allows for an ingress BPF-prog to communicate with an egress BPF-prog via packet data, as long as the egress BPF-prog removes this prior to transmitting the packet. This patchset is primarily focused on TC-BPF, but I've made sure that the MTU BPF-helpers also work for XDP BPF-programs. V2: Change BPF-helper API from lookup to check. V3: Drop enforcement of MTU in net-core, leave it to drivers. V4: Keep sanity limit + netdev "up" checks + rename BPF-helper. V5: Fix uninit variable + name struct output member mtu_result. 
V6: Use bpf_check_mtu() in selftest V7: Fix logic using tot_len and add another selftest V8: Add better selftests for BPF-helper bpf_check_mtu V9: Remove patch that use skb_set_redirected --- Jesper Dangaard Brouer (7): bpf: Remove MTU check in __bpf_skb_max_len bpf: fix bpf_fib_lookup helper MTU check for SKB ctx bpf: bpf_fib_lookup return MTU value as output when looked up bpf: add BPF-helper for MTU checking bpf: drop MTU check when doing TC-BPF redirect to ingress selftests/bpf: use bpf_check_mtu in selftest test_cls_redirect bpf/selftests: tests using bpf_check_mtu BPF-helper include/linux/netdevice.h | 31 +++ include/uapi/linux/bpf.h | 78 +++- net/core/dev.c | 19 -- net/core/filter.c | 184 -- tools/include/uapi/linux/bpf.h | 78 +++- tools/testing/selftests/bpf/prog_tests/check_mtu.c | 204 tools/testing/selftests/bpf/progs/test_check_mtu.c | 196 +++ .../selftests/bpf/progs/test_cls_redirect.c|7 + 8 files changed, 754 insertions(+), 43 deletions(-) create mode 100644 tools/testing/selftests/bpf/prog_tests/check_mtu.c create mode 100644 tools/testing/selftests/bpf/progs/test_check_mtu.c --
[PATCH bpf-next V9 1/7] bpf: Remove MTU check in __bpf_skb_max_len
Multiple BPF-helpers that can manipulate/increase the size of the SKB use __bpf_skb_max_len() as the max-length. This function limits the size against the current net_device MTU (skb->dev->mtu). When a BPF-prog grows the packet size, it should not be limited to the MTU. The MTU is a transmit limitation, and software receiving this packet should be allowed to increase the size. Furthermore, the current MTU check in __bpf_skb_max_len uses the MTU from the ingress/current net_device, which in case of redirects uses the wrong net_device. This patch keeps a sanity max limit of SKB_MAX_ALLOC (16KiB). The real limit is elsewhere in the system. Jesper's testing[1] showed it was not possible to exceed 8KiB when expanding the SKB size via BPF-helper. The limiting factor is the define KMALLOC_MAX_CACHE_SIZE which is 8192 for the SLUB-allocator (CONFIG_SLUB) in case PAGE_SIZE is 4096. This define is in effect due to this being called from softirq context, see code __gfp_pfmemalloc_flags() and __do_kmalloc_node(). Jakub's testing showed that frames above 16KiB can cause NICs to reset (but not crash). Keep this sanity limit at this level as the memory layer can differ based on kernel config. [1] https://github.com/xdp-project/bpf-examples/tree/master/MTU-tests V3: replace __bpf_skb_max_len() with define and use IPv6 max MTU size. Signed-off-by: Jesper Dangaard Brouer --- net/core/filter.c | 12 1 file changed, 4 insertions(+), 8 deletions(-) diff --git a/net/core/filter.c b/net/core/filter.c index 255aeee72402..f8f198252ff2 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -3552,11 +3552,7 @@ static int bpf_skb_net_shrink(struct sk_buff *skb, u32 off, u32 len_diff, return 0; } -static u32 __bpf_skb_max_len(const struct sk_buff *skb) -{ - return skb->dev ? 
skb->dev->mtu + skb->dev->hard_header_len : - SKB_MAX_ALLOC; -} +#define BPF_SKB_MAX_LEN SKB_MAX_ALLOC BPF_CALL_4(sk_skb_adjust_room, struct sk_buff *, skb, s32, len_diff, u32, mode, u64, flags) @@ -3605,7 +3601,7 @@ BPF_CALL_4(bpf_skb_adjust_room, struct sk_buff *, skb, s32, len_diff, { u32 len_cur, len_diff_abs = abs(len_diff); u32 len_min = bpf_skb_net_base_len(skb); - u32 len_max = __bpf_skb_max_len(skb); + u32 len_max = BPF_SKB_MAX_LEN; __be16 proto = skb->protocol; bool shrink = len_diff < 0; u32 off; @@ -3688,7 +3684,7 @@ static int bpf_skb_trim_rcsum(struct sk_buff *skb, unsigned int new_len) static inline int __bpf_skb_change_tail(struct sk_buff *skb, u32 new_len, u64 flags) { - u32 max_len = __bpf_skb_max_len(skb); + u32 max_len = BPF_SKB_MAX_LEN; u32 min_len = __bpf_skb_min_len(skb); int ret; @@ -3764,7 +3760,7 @@ static const struct bpf_func_proto sk_skb_change_tail_proto = { static inline int __bpf_skb_change_head(struct sk_buff *skb, u32 head_room, u64 flags) { - u32 max_len = __bpf_skb_max_len(skb); + u32 max_len = BPF_SKB_MAX_LEN; u32 new_len = skb->len + head_room; int ret;
[PATCH bpf-next V9 3/7] bpf: bpf_fib_lookup return MTU value as output when looked up
The BPF-helpers for FIB lookup (bpf_xdp_fib_lookup and bpf_skb_fib_lookup) can perform MTU check and return BPF_FIB_LKUP_RET_FRAG_NEEDED. The BPF-prog doesn't know the MTU value that caused this rejection. If the BPF-prog wants to implement PMTU (Path MTU Discovery) (RFC 1191) it needs to know this MTU value for the ICMP packet. This patch changes struct bpf_fib_lookup to contain this MTU value as output, via a union with 'tot_len', as this is the value used for the MTU lookup. V5: - Fixed uninit value spotted by Dan Carpenter. - Name struct output member mtu_result Reported-by: kernel test robot Reported-by: Dan Carpenter Signed-off-by: Jesper Dangaard Brouer --- include/uapi/linux/bpf.h | 11 +-- net/core/filter.c | 22 +++--- tools/include/uapi/linux/bpf.h | 11 +-- 3 files changed, 33 insertions(+), 11 deletions(-) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 77d7c1bb2923..649586d656b6 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -2225,6 +2225,9 @@ union bpf_attr { * * > 0 one of **BPF_FIB_LKUP_RET_** codes explaining why the * packet is not forwarded or needs assist from full stack * + * If lookup fails with BPF_FIB_LKUP_RET_FRAG_NEEDED, then the MTU + * was exceeded and output params->mtu_result contains the MTU. + * * long bpf_sock_hash_update(struct bpf_sock_ops *skops, struct bpf_map *map, void *key, u64 flags) * Description * Add an entry to, or update a sockhash *map* referencing sockets. 
@@ -4975,9 +4978,13 @@ struct bpf_fib_lookup { __be16 sport; __be16 dport; - /* total length of packet from network header - used for MTU check */ - __u16 tot_len; + union { /* used for MTU check */ + /* input to lookup */ + __u16 tot_len; /* L3 length from network hdr (iph->tot_len) */ + /* output: MTU value */ + __u16 mtu_result; + }; /* input: L3 device index for lookup * output: device index from FIB lookup */ diff --git a/net/core/filter.c b/net/core/filter.c index 95c6fdebd0db..722910cbfe0b 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -5268,12 +5268,14 @@ static const struct bpf_func_proto bpf_skb_get_xfrm_state_proto = { #if IS_ENABLED(CONFIG_INET) || IS_ENABLED(CONFIG_IPV6) static int bpf_fib_set_fwd_params(struct bpf_fib_lookup *params, const struct neighbour *neigh, - const struct net_device *dev) + const struct net_device *dev, u32 mtu) { memcpy(params->dmac, neigh->ha, ETH_ALEN); memcpy(params->smac, dev->dev_addr, ETH_ALEN); params->h_vlan_TCI = 0; params->h_vlan_proto = 0; + if (mtu) + params->mtu_result = mtu; /* union with tot_len */ return 0; } @@ -5289,8 +5291,8 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params, struct net_device *dev; struct fib_result res; struct flowi4 fl4; + u32 mtu = 0; int err; - u32 mtu; dev = dev_get_by_index_rcu(net, params->ifindex); if (unlikely(!dev)) @@ -5357,8 +5359,10 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params, if (check_mtu) { mtu = ip_mtu_from_fib_result(&res, params->ipv4_dst); - if (params->tot_len > mtu) + if (params->tot_len > mtu) { + params->mtu_result = mtu; /* union with tot_len */ return BPF_FIB_LKUP_RET_FRAG_NEEDED; + } } nhc = res.nhc; @@ -5392,7 +5396,7 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params, if (!neigh) return BPF_FIB_LKUP_RET_NO_NEIGH; - return bpf_fib_set_fwd_params(params, neigh, dev); + return bpf_fib_set_fwd_params(params, neigh, dev, mtu); } #endif @@ -5409,7 +5413,7 @@ 
static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params, struct flowi6 fl6; int strict = 0; int oif, err; - u32 mtu; + u32 mtu = 0; /* link local addresses are never forwarded */ if (rt6_need_strict(dst) || rt6_need_strict(src)) @@ -5484,8 +5488,10 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params, if (check_mtu) { mtu = ipv6_stub->ip6_mtu_from_fib6(&res, dst, src); - if (params->tot_len > mtu) + if (params->tot_len > mtu) { + params->mtu_result = mtu; /* union with tot_len */ return BPF_FIB_LKUP_RET_FRAG_NEEDED; +
[PATCH bpf-next V9 2/7] bpf: fix bpf_fib_lookup helper MTU check for SKB ctx
BPF end-user on Cilium slack-channel (Carlo Carraro) wants to use bpf_fib_lookup for doing MTU-check, but *prior* to extending packet size, by adjusting fib_params 'tot_len' with the packet length plus the expected encap size. (Just like the bpf_check_mtu helper supports). He discovered that for SKB ctx the param->tot_len was not used, instead skb->len was used (via MTU check in is_skb_forwardable()). Fix this by using fib_params 'tot_len' for MTU check. If not provided (e.g. zero) then keep existing behaviour intact. Fixes: 4c79579b44b1 ("bpf: Change bpf_fib_lookup to return lookup status") Reported-by: Carlo Carraro Signed-off-by: Jesper Dangaard Brouer --- net/core/filter.c | 14 -- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/net/core/filter.c b/net/core/filter.c index f8f198252ff2..95c6fdebd0db 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -5568,11 +5568,21 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb, #endif } - if (!rc) { + if (rc == BPF_FIB_LKUP_RET_SUCCESS) { struct net_device *dev; + u32 mtu; dev = dev_get_by_index_rcu(net, params->ifindex); - if (!is_skb_forwardable(dev, skb)) + mtu = READ_ONCE(dev->mtu); + + /* Using tot_len for (L3) MTU check if provided by user */ + if (params->tot_len && params->tot_len > mtu) + rc = BPF_FIB_LKUP_RET_FRAG_NEEDED; + + /* Notice at this TC cls_bpf level skb->len contains L2 size, +* but is_skb_forwardable takes that into account +*/ + if (params->tot_len == 0 && !is_skb_forwardable(dev, skb)) rc = BPF_FIB_LKUP_RET_FRAG_NEEDED; }
[PATCH bpf-next V9 4/7] bpf: add BPF-helper for MTU checking
This BPF-helper bpf_check_mtu() works for both XDP and TC-BPF programs. The SKB object is complex and the skb->len value (accessible from BPF-prog) also includes the length of any extra GRO/GSO segments, but without taking into account that these GRO/GSO segments get transport (L4) and network (L3) headers added before being transmitted. Thus, this BPF-helper is created such that the BPF-programmer doesn't need to handle these details in the BPF-prog. The API is designed to help the BPF-programmer who wants to do packet context size changes, which involve other helpers. These other helpers usually do a delta size adjustment. This helper also supports a delta size (len_diff), which allows the BPF-programmer to reuse arguments needed by these other helpers, and perform the MTU check prior to doing any actual size adjustment of the packet context. It is on purpose that we allow the len adjustment to become a negative result that will pass the MTU check. This might seem weird, but it's not this helper's responsibility to "catch" wrong len_diff adjustments. Other helpers will take care of these checks, if the BPF-programmer chooses to do actual size adjustment. 
V9: - Use dev->hard_header_len (instead of ETH_HLEN) - Annotate with unlikely req from Daniel - Fix logic error using skb_gso_validate_network_len from Daniel V6: - Took John's advice and dropped BPF_MTU_CHK_RELAX - Returned MTU is kept at L3-level (like fib_lookup) V4: Lot of changes - ifindex 0 now use current netdev for MTU lookup - rename helper from bpf_mtu_check to bpf_check_mtu - fix bug for GSO pkt length (as skb->len is total len) - remove __bpf_len_adj_positive, simply allow negative len adj Signed-off-by: Jesper Dangaard Brouer --- include/uapi/linux/bpf.h | 67 ++ net/core/filter.c | 122 tools/include/uapi/linux/bpf.h | 67 ++ 3 files changed, 256 insertions(+) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 649586d656b6..fa2e99351758 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -3833,6 +3833,61 @@ union bpf_attr { * Return * A pointer to a struct socket on success or NULL if the file is * not a socket. + * + * int bpf_check_mtu(void *ctx, u32 ifindex, u32 *mtu_len, s32 len_diff, u64 flags) + * Description + * Check ctx packet size against MTU of net device (based on + * *ifindex*). This helper will likely be used in combination with + * helpers that adjust/change the packet size. The argument + * *len_diff* can be used for querying with a planned size + * change. This allows to check MTU prior to changing packet ctx. + * + * Specifying *ifindex* zero means the MTU check is performed + * against the current net device. This is practical if this isn't + * used prior to redirect. + * + * The Linux kernel route table can configure MTUs on a more + * specific per route level, which is not provided by this helper. + * For route level MTU checks use the **bpf_fib_lookup**\ () + * helper. + * + * *ctx* is either **struct xdp_md** for XDP programs or + * **struct sk_buff** for tc cls_act programs. 
+ * + * The *flags* argument can be a combination of one or more of the + * following values: + * + * **BPF_MTU_CHK_SEGS** + * This flag only works for *ctx* **struct sk_buff**. + * If packet context contains extra packet segment buffers + * (often known as GSO skb), then MTU check is harder to + * check at this point, because in transmit path it is + * possible for the skb packet to get re-segmented + * (depending on net device features). This could still be + * a MTU violation, so this flag enables performing MTU + * check against segments, with a different violation + * return code to tell it apart. Check cannot use len_diff. + * + * On return *mtu_len* pointer contains the MTU value of the net + * device. Remember the net device configured MTU is the L3 size, + * which is returned here and XDP and TX length operate at L2. + * The helper takes this into account for you, but remember this + * when using the MTU value in your BPF-code. On input *mtu_len* must be a valid + * pointer and be initialized (to zero), else verifier will reject + * BPF program. + * + * Return + * * 0 on success, and populate MTU value in *mtu_len* pointer. + * + * * < 0 if any input argument is invalid (*mtu_len* not updated) + * + * MTU v
[PATCH bpf-next V9 5/7] bpf: drop MTU check when doing TC-BPF redirect to ingress
The use-case for dropping the MTU check when TC-BPF does redirect to ingress, is described by Eyal Birger in email[0]. The summary is the ability to increase packet size (e.g. with IPv6 headers for NAT64) and ingress redirect packet and let normal netstack fragment packet as needed. [0] https://lore.kernel.org/netdev/CAHsH6Gug-hsLGHQ6N0wtixdOa85LDZ3HNRHVd0opR=19qo4...@mail.gmail.com/ V9: - Make net_device "up" (IFF_UP) check explicit in skb_do_redirect V4: - Keep net_device "up" (IFF_UP) check. - Adjustment to handle bpf_redirect_peer() helper Signed-off-by: Jesper Dangaard Brouer --- include/linux/netdevice.h | 31 +-- net/core/dev.c| 19 ++- net/core/filter.c | 14 +++--- 3 files changed, 42 insertions(+), 22 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 7bf167993c05..4c78f73db1ec 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -3947,11 +3947,38 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb); bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff *skb); +static __always_inline bool __is_skb_forwardable(const struct net_device *dev, +const struct sk_buff *skb, +const bool check_mtu) +{ + const u32 vlan_hdr_len = 4; /* VLAN_HLEN */ + unsigned int len; + + if (!(dev->flags & IFF_UP)) + return false; + + if (!check_mtu) + return true; + + len = dev->mtu + dev->hard_header_len + vlan_hdr_len; + if (skb->len <= len) + return true; + + /* if TSO is enabled, we don't care about the length as the packet +* could be forwarded without being segmented before +*/ + if (skb_is_gso(skb)) + return true; + + return false; +} + static __always_inline int dev_forward_skb(struct net_device *dev, - struct sk_buff *skb) + struct sk_buff *skb, + const bool check_mtu) { if (skb_orphan_frags(skb, GFP_ATOMIC) || - unlikely(!is_skb_forwardable(dev, skb))) { + unlikely(!__is_skb_forwardable(dev, skb, check_mtu))) { atomic_long_inc(&dev->rx_dropped); kfree_skb(skb); return NET_RX_DROP; 
diff --git a/net/core/dev.c b/net/core/dev.c index a46334906c94..ffead13c158f 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -2176,28 +2176,13 @@ static inline void net_timestamp_set(struct sk_buff *skb) bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff *skb) { - unsigned int len; - - if (!(dev->flags & IFF_UP)) - return false; - - len = dev->mtu + dev->hard_header_len + VLAN_HLEN; - if (skb->len <= len) - return true; - - /* if TSO is enabled, we don't care about the length as the packet -* could be forwarded without being segmented before -*/ - if (skb_is_gso(skb)) - return true; - - return false; + return __is_skb_forwardable(dev, skb, true); } EXPORT_SYMBOL_GPL(is_skb_forwardable); int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb) { - int ret = dev_forward_skb(dev, skb); + int ret = dev_forward_skb(dev, skb, true); if (likely(!ret)) { skb->protocol = eth_type_trans(skb, dev); diff --git a/net/core/filter.c b/net/core/filter.c index 6e113de2baae..b682b11fbbec 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -2083,13 +2083,21 @@ static const struct bpf_func_proto bpf_csum_level_proto = { static inline int __bpf_rx_skb(struct net_device *dev, struct sk_buff *skb) { - return dev_forward_skb(dev, skb); + int ret = dev_forward_skb(dev, skb, false); + + if (likely(!ret)) { + skb->protocol = eth_type_trans(skb, dev); + skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN); + ret = netif_rx(skb); + } + + return ret; } static inline int __bpf_rx_skb_no_mac(struct net_device *dev, struct sk_buff *skb) { - int ret = dev_forward_skb(dev, skb); + int ret = dev_forward_skb(dev, skb, false); if (likely(!ret)) { skb->dev = dev; @@ -2480,7 +2488,7 @@ int skb_do_redirect(struct sk_buff *skb) goto out_drop; dev = ops->ndo_get_peer_dev(dev); if (unlikely(!dev || -!is_skb_forwardable(dev, skb) || +!(dev->flags & IFF_UP) ||
[PATCH bpf-next V9 6/7] selftests/bpf: use bpf_check_mtu in selftest test_cls_redirect
This demonstrate how bpf_check_mtu() helper can easily be used together with bpf_skb_adjust_room() helper, prior to doing size adjustment, as delta argument is already setup. Hint: This specific test can be selected like this: ./test_progs -t cls_redirect Signed-off-by: Jesper Dangaard Brouer --- .../selftests/bpf/progs/test_cls_redirect.c|7 +++ 1 file changed, 7 insertions(+) diff --git a/tools/testing/selftests/bpf/progs/test_cls_redirect.c b/tools/testing/selftests/bpf/progs/test_cls_redirect.c index c9f8464996ea..3c1e042962e6 100644 --- a/tools/testing/selftests/bpf/progs/test_cls_redirect.c +++ b/tools/testing/selftests/bpf/progs/test_cls_redirect.c @@ -70,6 +70,7 @@ typedef struct { uint64_t errors_total_encap_adjust_failed; uint64_t errors_total_encap_buffer_too_small; uint64_t errors_total_redirect_loop; + uint64_t errors_total_encap_mtu_violate; } metrics_t; typedef enum { @@ -407,6 +408,7 @@ static INLINING ret_t forward_with_gre(struct __sk_buff *skb, encap_headers_t *e payload_off - sizeof(struct ethhdr) - sizeof(struct iphdr); int32_t delta = sizeof(struct gre_base_hdr) - encap_overhead; uint16_t proto = ETH_P_IP; + uint32_t mtu_len = 0; /* Loop protection: the inner packet's TTL is decremented as a safeguard * against any forwarding loop. As the only interesting field is the TTL @@ -479,6 +481,11 @@ static INLINING ret_t forward_with_gre(struct __sk_buff *skb, encap_headers_t *e } } + if (bpf_check_mtu(skb, skb->ifindex, &mtu_len, delta, 0)) { + metrics->errors_total_encap_mtu_violate++; + return TC_ACT_SHOT; + } + if (bpf_skb_adjust_room(skb, delta, BPF_ADJ_ROOM_NET, BPF_F_ADJ_ROOM_FIXED_GSO | BPF_F_ADJ_ROOM_NO_CSUM_RESET) ||
[PATCH bpf-next V9 7/7] bpf/selftests: tests using bpf_check_mtu BPF-helper
Adding selftest for BPF-helper bpf_check_mtu(). Making sure it can be used from both XDP and TC. Signed-off-by: Jesper Dangaard Brouer --- tools/testing/selftests/bpf/prog_tests/check_mtu.c | 204 tools/testing/selftests/bpf/progs/test_check_mtu.c | 196 +++ 2 files changed, 400 insertions(+) create mode 100644 tools/testing/selftests/bpf/prog_tests/check_mtu.c create mode 100644 tools/testing/selftests/bpf/progs/test_check_mtu.c diff --git a/tools/testing/selftests/bpf/prog_tests/check_mtu.c b/tools/testing/selftests/bpf/prog_tests/check_mtu.c new file mode 100644 index ..b5d0c3a9abe8 --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/check_mtu.c @@ -0,0 +1,204 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2020 Jesper Dangaard Brouer */ + +#include /* before test_progs.h, avoid bpf_util.h redefines */ + +#include +#include "test_check_mtu.skel.h" +#include + +#include +#include + +#define IFINDEX_LO 1 + +static __u32 duration; /* Hint: needed for CHECK macro */ + +static int read_mtu_device_lo(void) +{ + const char *filename = "/sys/devices/virtual/net/lo/mtu"; + char buf[11] = {}; + int value; + int fd; + + fd = open(filename, 0, O_RDONLY); + if (fd == -1) + return -1; + + if (read(fd, buf, sizeof(buf)) == -1) + return -2; + close(fd); + + value = strtoimax(buf, NULL, 10); + if (errno == ERANGE) + return -3; + + return value; +} + +static void test_check_mtu_xdp_attach(struct bpf_program *prog) +{ + int err = 0; + int fd; + + fd = bpf_program__fd(prog); + err = bpf_set_link_xdp_fd(IFINDEX_LO, fd, XDP_FLAGS_SKB_MODE); + if (CHECK(err, "XDP-attach", "failed")) + return; + + bpf_set_link_xdp_fd(IFINDEX_LO, -1, 0); +} + +static void test_check_mtu_run_xdp(struct test_check_mtu *skel, + struct bpf_program *prog, + __u32 mtu_expect) +{ + const char *prog_name = bpf_program__name(prog); + int retval_expect = XDP_PASS; + __u32 mtu_result = 0; + char buf[256]; + int err; + + struct bpf_prog_test_run_attr tattr = { + .repeat = 1, + .data_in = &pkt_v4, + 
.data_size_in = sizeof(pkt_v4), + .data_out = buf, + .data_size_out = sizeof(buf), + .prog_fd = bpf_program__fd(prog), + }; + + memset(buf, 0, sizeof(buf)); + + err = bpf_prog_test_run_xattr(&tattr); + CHECK_ATTR(err != 0 || errno != 0, "bpf_prog_test_run", + "prog_name:%s (err %d errno %d retval %d)\n", + prog_name, err, errno, tattr.retval); + +CHECK(tattr.retval != retval_expect, "retval", + "progname:%s unexpected retval=%d expected=%d\n", + prog_name, tattr.retval, retval_expect); + + /* Extract MTU that BPF-prog got */ + mtu_result = skel->bss->global_bpf_mtu_xdp; + CHECK(mtu_result != mtu_expect, "MTU-compare-user", + "failed (MTU user:%d bpf:%d)", mtu_expect, mtu_result); +} + +static void test_check_mtu_xdp(__u32 mtu, __u32 ifindex) +{ + struct test_check_mtu *skel; + int err; + + skel = test_check_mtu__open(); + if (CHECK(!skel, "skel_open", "failed")) + return; + + /* Update "constants" in BPF-prog *BEFORE* libbpf load */ + skel->rodata->GLOBAL_USER_MTU = mtu; + skel->rodata->GLOBAL_USER_IFINDEX = ifindex; + + err = test_check_mtu__load(skel); + if (CHECK(err, "skel_load", "failed: %d\n", err)) + goto cleanup; + + test_check_mtu_run_xdp(skel, skel->progs.xdp_use_helper, mtu); + test_check_mtu_run_xdp(skel, skel->progs.xdp_exceed_mtu, mtu); + test_check_mtu_run_xdp(skel, skel->progs.xdp_minus_delta, mtu); + +cleanup: + test_check_mtu__destroy(skel); +} + +static void test_check_mtu_run_tc(struct test_check_mtu *skel, + struct bpf_program *prog, + __u32 mtu_expect) +{ + const char *prog_name = bpf_program__name(prog); + int retval_expect = BPF_OK; + __u32 mtu_result = 0; + char buf[256]; + int err; + + struct bpf_prog_test_run_attr tattr = { + .repeat = 1, + .data_in = &pkt_v4, + .data_size_in = sizeof(pkt_v4), + .data_out = buf, + .data_size_out = sizeof(buf), + .prog_fd = bpf_program__fd(prog), + }; + + memset(buf, 0, sizeof(buf)); + + err = bpf_prog_test_run_xattr(&tattr); + CHECK_ATTR(err != 0 || errno != 0, "
Re: [PATCH bpf-next V9 7/7] bpf/selftests: tests using bpf_check_mtu BPF-helper
On Thu, 17 Dec 2020 18:26:50 +0100 Jesper Dangaard Brouer wrote: > Adding selftest for BPF-helper bpf_check_mtu(). Making sure > it can be used from both XDP and TC. > > Signed-off-by: Jesper Dangaard Brouer > --- > tools/testing/selftests/bpf/prog_tests/check_mtu.c | 204 > > tools/testing/selftests/bpf/progs/test_check_mtu.c | 196 +++ > 2 files changed, 400 insertions(+) > create mode 100644 tools/testing/selftests/bpf/prog_tests/check_mtu.c > create mode 100644 tools/testing/selftests/bpf/progs/test_check_mtu.c Will send V10 as I have an error in this selftests > diff --git a/tools/testing/selftests/bpf/prog_tests/check_mtu.c > b/tools/testing/selftests/bpf/prog_tests/check_mtu.c > new file mode 100644 > index ..b5d0c3a9abe8 > --- /dev/null > +++ b/tools/testing/selftests/bpf/prog_tests/check_mtu.c [...] > +static void test_check_mtu_run_xdp(struct test_check_mtu *skel, > +struct bpf_program *prog, > +__u32 mtu_expect) > +{ > + const char *prog_name = bpf_program__name(prog); > + int retval_expect = XDP_PASS; > + __u32 mtu_result = 0; > + char buf[256]; > + int err; > + > + struct bpf_prog_test_run_attr tattr = { > + .repeat = 1, > + .data_in = &pkt_v4, > + .data_size_in = sizeof(pkt_v4), > + .data_out = buf, > + .data_size_out = sizeof(buf), > + .prog_fd = bpf_program__fd(prog), > + }; > + > + memset(buf, 0, sizeof(buf)); > + > + err = bpf_prog_test_run_xattr(&tattr); > + CHECK_ATTR(err != 0 || errno != 0, "bpf_prog_test_run", ^^^ You/I cannot use the check "errno != 0" here, as something else could have set it earlier. 
> +"prog_name:%s (err %d errno %d retval %d)\n", > +prog_name, err, errno, tattr.retval); > + > +CHECK(tattr.retval != retval_expect, "retval", > + "progname:%s unexpected retval=%d expected=%d\n", > + prog_name, tattr.retval, retval_expect); > + > + /* Extract MTU that BPF-prog got */ > + mtu_result = skel->bss->global_bpf_mtu_xdp; > + CHECK(mtu_result != mtu_expect, "MTU-compare-user", > + "failed (MTU user:%d bpf:%d)", mtu_expect, mtu_result); > +} > +static void test_check_mtu_run_tc(struct test_check_mtu *skel, > + struct bpf_program *prog, > + __u32 mtu_expect) > +{ [...] > + err = bpf_prog_test_run_xattr(&tattr); > + CHECK_ATTR(err != 0 || errno != 0, "bpf_prog_test_run", > +"prog_name:%s (err %d errno %d retval %d)\n", > +prog_name, err, errno, tattr.retval); Same issue here. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH v4 bpf-next 1/2] net: xdp: introduce xdp_init_buff utility routine
On Sat, 19 Dec 2020 18:55:00 +0100 Lorenzo Bianconi wrote: > diff --git a/include/net/xdp.h b/include/net/xdp.h > index 11ec93f827c0..323340caef88 100644 > --- a/include/net/xdp.h > +++ b/include/net/xdp.h > @@ -76,6 +76,13 @@ struct xdp_buff { > u32 frame_sz; /* frame size to deduce data_hard_end/reserved tailroom*/ > }; > > +static __always_inline void > +xdp_init_buff(struct xdp_buff *xdp, u32 frame_sz, struct xdp_rxq_info *rxq) > +{ > + xdp->frame_sz = frame_sz; > + xdp->rxq = rxq; Later you will add 'xdp->mb = 0' here. > +} Via the names of your functions, I assume that xdp_init_buff() is called before xdp_prepare_buff(), right? (And your pending 'xdp->mb = 0' also prefer this.) Below in bpf_prog_test_run_xdp() and netif_receive_generic_xdp() you violate this order... which will give you headaches when implementing the multi-buff support. It is also a bad example for driver developer that need to figure out this calling-order from the function names. Below, will it be possible to have 'init' before 'prepare'? > + > /* Reserve memory area at end-of data area. 
> * > * This macro reserves tailroom in the XDP buffer by limiting the > diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c > index c1c30a9f76f3..a8fa5a9e4137 100644 > --- a/net/bpf/test_run.c > +++ b/net/bpf/test_run.c > @@ -640,10 +640,10 @@ int bpf_prog_test_run_xdp(struct bpf_prog *prog, const > union bpf_attr *kattr, > xdp.data = data + headroom; > xdp.data_meta = xdp.data; > xdp.data_end = xdp.data + size; > - xdp.frame_sz = headroom + max_data_sz + tailroom; > > rxqueue = __netif_get_rx_queue(current->nsproxy->net_ns->loopback_dev, > 0); > - xdp.rxq = &rxqueue->xdp_rxq; > + xdp_init_buff(&xdp, headroom + max_data_sz + tailroom, > + &rxqueue->xdp_rxq); > bpf_prog_change_xdp(NULL, prog); > ret = bpf_test_run(prog, &xdp, repeat, &retval, &duration, true); > if (ret) > diff --git a/net/core/dev.c b/net/core/dev.c > index a46334906c94..b1a765900c01 100644 > --- a/net/core/dev.c > +++ b/net/core/dev.c > @@ -4588,11 +4588,11 @@ static u32 netif_receive_generic_xdp(struct sk_buff > *skb, > struct netdev_rx_queue *rxqueue; > void *orig_data, *orig_data_end; > u32 metalen, act = XDP_DROP; > + u32 mac_len, frame_sz; > __be16 orig_eth_type; > struct ethhdr *eth; > bool orig_bcast; > int hlen, off; > - u32 mac_len; > > /* Reinjected packets coming from act_mirred or similar should >* not get XDP generic processing. 
> @@ -4631,8 +4631,8 @@ static u32 netif_receive_generic_xdp(struct sk_buff > *skb, > xdp->data_hard_start = skb->data - skb_headroom(skb); > > /* SKB "head" area always have tailroom for skb_shared_info */ > - xdp->frame_sz = (void *)skb_end_pointer(skb) - xdp->data_hard_start; > - xdp->frame_sz += SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); > + frame_sz = (void *)skb_end_pointer(skb) - xdp->data_hard_start; > + frame_sz += SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); > > orig_data_end = xdp->data_end; > orig_data = xdp->data; > @@ -4641,7 +4641,7 @@ static u32 netif_receive_generic_xdp(struct sk_buff > *skb, > orig_eth_type = eth->h_proto; > > rxqueue = netif_get_rxqueue(skb); > - xdp->rxq = &rxqueue->xdp_rxq; > + xdp_init_buff(xdp, frame_sz, &rxqueue->xdp_rxq); > > act = bpf_prog_run_xdp(xdp_prog, xdp); -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH v5 bpf-next 03/14] xdp: add xdp_shared_info data structure
On Sat, 19 Dec 2020 10:30:57 -0500 Jamal Hadi Salim wrote: > On 2020-12-19 9:53 a.m., Shay Agroskin wrote: > > > > Lorenzo Bianconi writes: > > > > >> for the moment I do not know if this area is used for other purposes. > >> Do you think there are other use-cases for it? Yes, all the same use-cases as SKBs have. I wanted to keep this the same as skb_shared_info, but Lorenzo chose to take John's advice and it is going in this direction (which is fine, we can always change and adjust this later). > Sorry to interject: > Does it make sense to use it to store arbitrary metadata or a scratchpad > in this space? Something equivalent to skb->cb which is lacking in > XDP. Well, XDP has the data_meta area. But it is difficult to rely on because a lot of drivers don't implement it. And Saeed and I plan to use this area and populate it with driver info from the RX-descriptor. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH bpf-next V9 7/7] bpf/selftests: tests using bpf_check_mtu BPF-helper
On Fri, 18 Dec 2020 12:13:45 -0800 Andrii Nakryiko wrote: > On Thu, Dec 17, 2020 at 9:30 AM Jesper Dangaard Brouer > wrote: > > > > Adding selftest for BPF-helper bpf_check_mtu(). Making sure > > it can be used from both XDP and TC. > > > > Signed-off-by: Jesper Dangaard Brouer > > --- > > tools/testing/selftests/bpf/prog_tests/check_mtu.c | 204 > > > > tools/testing/selftests/bpf/progs/test_check_mtu.c | 196 > > +++ > > 2 files changed, 400 insertions(+) > > create mode 100644 tools/testing/selftests/bpf/prog_tests/check_mtu.c > > create mode 100644 tools/testing/selftests/bpf/progs/test_check_mtu.c > > > > diff --git a/tools/testing/selftests/bpf/prog_tests/check_mtu.c > > b/tools/testing/selftests/bpf/prog_tests/check_mtu.c > > new file mode 100644 > > index ..b5d0c3a9abe8 > > --- /dev/null > > +++ b/tools/testing/selftests/bpf/prog_tests/check_mtu.c > > @@ -0,0 +1,204 @@ > > +// SPDX-License-Identifier: GPL-2.0 > > +/* Copyright (c) 2020 Jesper Dangaard Brouer */ > > + > > +#include /* before test_progs.h, avoid bpf_util.h > > redefines */ > > + > > +#include > > +#include "test_check_mtu.skel.h" > > +#include > > + > > +#include > > +#include > > + > > +#define IFINDEX_LO 1 > > + > > +static __u32 duration; /* Hint: needed for CHECK macro */ > > + > > +static int read_mtu_device_lo(void) > > +{ > > + const char *filename = "/sys/devices/virtual/net/lo/mtu"; I will change this to: /sys/class/net/lo/mtu > > + char buf[11] = {}; > > + int value; > > + int fd; > > + > > + fd = open(filename, 0, O_RDONLY); > > + if (fd == -1) > > + return -1; > > + > > + if (read(fd, buf, sizeof(buf)) == -1) > > close fd missing here? ack, fixed. 
> > + return -2; > > + close(fd); > > + > > + value = strtoimax(buf, NULL, 10); > > + if (errno == ERANGE) > > + return -3; > > + > > + return value; > > +} > > + > > +static void test_check_mtu_xdp_attach(struct bpf_program *prog) > > +{ > > + int err = 0; > > + int fd; > > + > > + fd = bpf_program__fd(prog); > > + err = bpf_set_link_xdp_fd(IFINDEX_LO, fd, XDP_FLAGS_SKB_MODE); > > + if (CHECK(err, "XDP-attach", "failed")) > > + return; > > + > > + bpf_set_link_xdp_fd(IFINDEX_LO, -1, 0); > > can you please use bpf_link-based bpf_program__attach_xdp() which will > provide auto-cleanup in case of crash? Sure, that will be good for me to learn. > also check that it succeeded? > > > +} > > + > > +static void test_check_mtu_run_xdp(struct test_check_mtu *skel, > > + struct bpf_program *prog, > > + __u32 mtu_expect) > > +{ > > + const char *prog_name = bpf_program__name(prog); > > + int retval_expect = XDP_PASS; > > + __u32 mtu_result = 0; > > + char buf[256]; > > + int err; > > + > > + struct bpf_prog_test_run_attr tattr = { > > + .repeat = 1, > > + .data_in = &pkt_v4, > > + .data_size_in = sizeof(pkt_v4), > > + .data_out = buf, > > + .data_size_out = sizeof(buf), > > + .prog_fd = bpf_program__fd(prog), > > + }; > > nit: it's a variable declaration, so keep it all in one block. There > is also opts-based variant, which might be good to use here instead. > > > + > > + memset(buf, 0, sizeof(buf)); > > char buf[256] = {}; would make this unnecessary ok. > > > + > > + err = bpf_prog_test_run_xattr(&tattr); > > + CHECK_ATTR(err != 0 || errno != 0, "bpf_prog_test_run", > > + "prog_name:%s (err %d errno %d retval %d)\n", > > + prog_name, err, errno, tattr.retval); > > + > > +CHECK(tattr.retval != retval_expect, "retval", > > whitespaces are off? Yes, I noticed with scripts/checkpatch.pl. And there are a couple more, that I've already fixed. 
> > + "progname:%s unexpected retval=%d expected=%d\n", > > + prog_name, tattr.retval, retval_expect); > > + > > + /* Extract MTU that BPF-prog go
[PATCH bpf-next V10 0/7] bpf: New approach for BPF MTU handling
This patchset drops all the MTU checks in TC BPF-helpers that limit growing the packet size. This is done because these BPF-helpers don't take redirect into account, which can result in their MTU check being done against the wrong netdev. The new approach is to give BPF-programs knowledge about the MTU on a netdev (via ifindex) and at the fib route lookup level. Meaning some BPF-helpers are added and extended to make it possible to do MTU checks in the BPF-code. If the BPF-prog doesn't comply with the MTU then the packet will eventually get dropped at some other layer. In some cases the existing kernel MTU checks will drop the packet, but there are also cases where BPF can bypass these checks. Specifically when doing TC-redirect from the ingress step (sch_handle_ingress) into the egress code path (basically calling dev_queue_xmit()). It is left up to driver code to handle these kinds of MTU violations. One advantage of this approach is that an ingress-to-egress BPF-prog can send information via packet data. With the MTU checks removed in the helpers, and also not done in the skb_do_redirect() call, this allows an ingress BPF-prog to communicate with an egress BPF-prog via packet data, as long as the egress BPF-prog removes this prior to transmitting the packet. This patchset is primarily focused on TC-BPF, but I've made sure that the MTU BPF-helpers also work for XDP BPF-programs. V2: Change BPF-helper API from lookup to check. V3: Drop enforcement of MTU in net-core, leave it to drivers. V4: Keep sanity limit + netdev "up" checks + rename BPF-helper. V5: Fix uninit variable + name struct output member mtu_result.
V6: Use bpf_check_mtu() in selftest V7: Fix logic using tot_len and add another selftest V8: Add better selftests for BPF-helper bpf_check_mtu V9: Remove patch that use skb_set_redirected V10: Fix selftests and 'tot_len' MTU check like XDP --- Jesper Dangaard Brouer (7): bpf: Remove MTU check in __bpf_skb_max_len bpf: fix bpf_fib_lookup helper MTU check for SKB ctx bpf: bpf_fib_lookup return MTU value as output when looked up bpf: add BPF-helper for MTU checking bpf: drop MTU check when doing TC-BPF redirect to ingress selftests/bpf: use bpf_check_mtu in selftest test_cls_redirect bpf/selftests: tests using bpf_check_mtu BPF-helper include/linux/netdevice.h | 31 +++ include/uapi/linux/bpf.h | 78 +++ net/core/dev.c | 19 -- net/core/filter.c | 183 +++-- tools/include/uapi/linux/bpf.h | 78 +++ tools/testing/selftests/bpf/prog_tests/check_mtu.c | 218 tools/testing/selftests/bpf/progs/test_check_mtu.c | 199 ++ .../selftests/bpf/progs/test_cls_redirect.c|7 + 8 files changed, 769 insertions(+), 44 deletions(-) create mode 100644 tools/testing/selftests/bpf/prog_tests/check_mtu.c create mode 100644 tools/testing/selftests/bpf/progs/test_check_mtu.c --
[PATCH bpf-next V10 1/7] bpf: Remove MTU check in __bpf_skb_max_len
Multiple BPF-helpers that can manipulate/increase the size of the SKB use __bpf_skb_max_len() as the max-length. This function limits the size against the current net_device MTU (skb->dev->mtu). When a BPF-prog grows the packet size, it should not be limited to the MTU. The MTU is a transmit limitation, and software receiving this packet should be allowed to increase the size. Furthermore, the current MTU check in __bpf_skb_max_len uses the MTU from the ingress/current net_device, which in case of redirects uses the wrong net_device. This patch keeps a sanity max limit of SKB_MAX_ALLOC (16KiB). The real limit is elsewhere in the system. Jesper's testing[1] showed it was not possible to exceed 8KiB when expanding the SKB size via a BPF-helper. The limiting factor is the define KMALLOC_MAX_CACHE_SIZE which is 8192 for the SLUB-allocator (CONFIG_SLUB) in case PAGE_SIZE is 4096. This define is in effect due to this being called from softirq context; see __gfp_pfmemalloc_flags() and __do_kmalloc_node(). Jakub's testing showed that frames above 16KiB can cause NICs to reset (but not crash). Keep this sanity limit at this level as the memory layer can differ based on kernel config. [1] https://github.com/xdp-project/bpf-examples/tree/master/MTU-tests V3: replace __bpf_skb_max_len() with define and use IPv6 max MTU size. Signed-off-by: Jesper Dangaard Brouer --- net/core/filter.c | 12 1 file changed, 4 insertions(+), 8 deletions(-) diff --git a/net/core/filter.c b/net/core/filter.c index 255aeee72402..f8f198252ff2 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -3552,11 +3552,7 @@ static int bpf_skb_net_shrink(struct sk_buff *skb, u32 off, u32 len_diff, return 0; } -static u32 __bpf_skb_max_len(const struct sk_buff *skb) -{ - return skb->dev ?
skb->dev->mtu + skb->dev->hard_header_len : - SKB_MAX_ALLOC; -} +#define BPF_SKB_MAX_LEN SKB_MAX_ALLOC BPF_CALL_4(sk_skb_adjust_room, struct sk_buff *, skb, s32, len_diff, u32, mode, u64, flags) @@ -3605,7 +3601,7 @@ BPF_CALL_4(bpf_skb_adjust_room, struct sk_buff *, skb, s32, len_diff, { u32 len_cur, len_diff_abs = abs(len_diff); u32 len_min = bpf_skb_net_base_len(skb); - u32 len_max = __bpf_skb_max_len(skb); + u32 len_max = BPF_SKB_MAX_LEN; __be16 proto = skb->protocol; bool shrink = len_diff < 0; u32 off; @@ -3688,7 +3684,7 @@ static int bpf_skb_trim_rcsum(struct sk_buff *skb, unsigned int new_len) static inline int __bpf_skb_change_tail(struct sk_buff *skb, u32 new_len, u64 flags) { - u32 max_len = __bpf_skb_max_len(skb); + u32 max_len = BPF_SKB_MAX_LEN; u32 min_len = __bpf_skb_min_len(skb); int ret; @@ -3764,7 +3760,7 @@ static const struct bpf_func_proto sk_skb_change_tail_proto = { static inline int __bpf_skb_change_head(struct sk_buff *skb, u32 head_room, u64 flags) { - u32 max_len = __bpf_skb_max_len(skb); + u32 max_len = BPF_SKB_MAX_LEN; u32 new_len = skb->len + head_room; int ret;
[PATCH bpf-next V10 2/7] bpf: fix bpf_fib_lookup helper MTU check for SKB ctx
BPF end-user on Cilium slack-channel (Carlo Carraro) wants to use bpf_fib_lookup for doing MTU-check, but *prior* to extending packet size, by adjusting fib_params 'tot_len' with the packet length plus the expected encap size. (Just like the bpf_check_mtu helper supports). He discovered that for SKB ctx the param->tot_len was not used, instead skb->len was used (via MTU check in is_skb_forwardable() that checks against netdev MTU). Fix this by using fib_params 'tot_len' for MTU check. If not provided (e.g. zero) then keep existing TC behaviour intact. Notice that 'tot_len' for MTU check is done like XDP code-path, which checks against FIB-dst MTU. V10: - Use same method as XDP for 'tot_len' MTU check Fixes: 4c79579b44b1 ("bpf: Change bpf_fib_lookup to return lookup status") Reported-by: Carlo Carraro Signed-off-by: Jesper Dangaard Brouer --- net/core/filter.c | 13 ++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/net/core/filter.c b/net/core/filter.c index f8f198252ff2..c1e460193bae 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -5548,6 +5548,7 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb, { struct net *net = dev_net(skb->dev); int rc = -EAFNOSUPPORT; + bool check_mtu = false; if (plen < sizeof(*params)) return -EINVAL; @@ -,22 +5556,28 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb, if (flags & ~(BPF_FIB_LOOKUP_DIRECT | BPF_FIB_LOOKUP_OUTPUT)) return -EINVAL; + if (params->tot_len) + check_mtu = true; + switch (params->family) { #if IS_ENABLED(CONFIG_INET) case AF_INET: - rc = bpf_ipv4_fib_lookup(net, params, flags, false); + rc = bpf_ipv4_fib_lookup(net, params, flags, check_mtu); break; #endif #if IS_ENABLED(CONFIG_IPV6) case AF_INET6: - rc = bpf_ipv6_fib_lookup(net, params, flags, false); + rc = bpf_ipv6_fib_lookup(net, params, flags, check_mtu); break; #endif } - if (!rc) { + if (rc == BPF_FIB_LKUP_RET_SUCCESS && !check_mtu) { struct net_device *dev; + /* When tot_len isn't provided by user, +* check skb against 
net_device MTU +*/ dev = dev_get_by_index_rcu(net, params->ifindex); if (!is_skb_forwardable(dev, skb)) rc = BPF_FIB_LKUP_RET_FRAG_NEEDED;
[PATCH bpf-next V10 4/7] bpf: add BPF-helper for MTU checking
This BPF-helper bpf_check_mtu() works for both XDP and TC-BPF programs. The SKB object is complex and the skb->len value (accessible from the BPF-prog) also includes the length of any extra GRO/GSO segments, but without taking into account that these GRO/GSO segments get transport (L4) and network (L3) headers added before being transmitted. Thus, this BPF-helper is created such that the BPF-programmer doesn't need to handle these details in the BPF-prog. The API is designed to help the BPF-programmer who wants to do packet context size changes, which involve other helpers. These other helpers usually do a delta size adjustment. This helper also supports a delta size (len_diff), which allows the BPF-programmer to reuse arguments needed by these other helpers, and perform the MTU check prior to doing any actual size adjustment of the packet context. It is on purpose that we allow the len adjustment to become a negative result that will pass the MTU check. This might seem weird, but it's not this helper's responsibility to "catch" wrong len_diff adjustments. Other helpers will take care of these checks, if the BPF-programmer chooses to do an actual size adjustment.
V9: - Use dev->hard_header_len (instead of ETH_HLEN) - Annotate with unlikely req from Daniel - Fix logic error using skb_gso_validate_network_len from Daniel V6: - Took John's advice and dropped BPF_MTU_CHK_RELAX - Returned MTU is kept at L3-level (like fib_lookup) V4: Lot of changes - ifindex 0 now use current netdev for MTU lookup - rename helper from bpf_mtu_check to bpf_check_mtu - fix bug for GSO pkt length (as skb->len is total len) - remove __bpf_len_adj_positive, simply allow negative len adj Signed-off-by: Jesper Dangaard Brouer --- include/uapi/linux/bpf.h | 67 ++ net/core/filter.c | 122 tools/include/uapi/linux/bpf.h | 67 ++ 3 files changed, 256 insertions(+) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 649586d656b6..fa2e99351758 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -3833,6 +3833,61 @@ union bpf_attr { * Return * A pointer to a struct socket on success or NULL if the file is * not a socket. + * + * int bpf_check_mtu(void *ctx, u32 ifindex, u32 *mtu_len, s32 len_diff, u64 flags) + * Description + * Check ctx packet size against MTU of net device (based on + * *ifindex*). This helper will likely be used in combination with + * helpers that adjust/change the packet size. The argument + * *len_diff* can be used for querying with a planned size + * change. This allows to check MTU prior to changing packet ctx. + * + * Specifying *ifindex* zero means the MTU check is performed + * against the current net device. This is practical if this isn't + * used prior to redirect. + * + * The Linux kernel route table can configure MTUs on a more + * specific per route level, which is not provided by this helper. + * For route level MTU checks use the **bpf_fib_lookup**\ () + * helper. + * + * *ctx* is either **struct xdp_md** for XDP programs or + * **struct sk_buff** for tc cls_act programs. 
+ * + * The *flags* argument can be a combination of one or more of the + * following values: + * + * **BPF_MTU_CHK_SEGS** + * This flag only works for *ctx* **struct sk_buff**. + * If the packet context contains extra packet segment buffers + * (often known as a GSO skb), then the MTU check is harder to + * do at this point, because in the transmit path it is + * possible for the skb packet to get re-segmented + * (depending on net device features). This could still be + * a MTU violation, so this flag enables performing the MTU + * check against segments, with a different violation + * return code to tell it apart. The check cannot use len_diff. + * + * On return the *mtu_len* pointer contains the MTU value of the net + * device. Remember the net device configured MTU is the L3 size, + * which is returned here, while XDP and TC length operate at L2. + * The helper takes this into account for you, but remember this when + * using the MTU value in your BPF-code. On input *mtu_len* must be a valid + * pointer and be initialized (to zero), else the verifier will reject + * the BPF program. + * + * Return + * * 0 on success, and populate MTU value in *mtu_len* pointer. + * + * * < 0 if any input argument is invalid (*mtu_len* not updated) + * + * MTU v
[PATCH bpf-next V10 3/7] bpf: bpf_fib_lookup return MTU value as output when looked up
The BPF-helpers for FIB lookup (bpf_xdp_fib_lookup and bpf_skb_fib_lookup) can perform an MTU check and return BPF_FIB_LKUP_RET_FRAG_NEEDED. The BPF-prog doesn't know the MTU value that caused this rejection. If the BPF-prog wants to implement PMTU (Path MTU Discovery) (rfc1191) it needs to know this MTU value for the ICMP packet. This patch changes the lookup and result struct bpf_fib_lookup to contain this MTU value as output via a union with 'tot_len', as this is the value used for the MTU lookup. V5: - Fixed uninit value spotted by Dan Carpenter. - Name struct output member mtu_result Reported-by: kernel test robot Reported-by: Dan Carpenter Signed-off-by: Jesper Dangaard Brouer --- include/uapi/linux/bpf.h | 11 +-- net/core/filter.c | 22 +++--- tools/include/uapi/linux/bpf.h | 11 +-- 3 files changed, 33 insertions(+), 11 deletions(-) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 77d7c1bb2923..649586d656b6 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -2225,6 +2225,9 @@ union bpf_attr { * * > 0 one of **BPF_FIB_LKUP_RET_** codes explaining why the * packet is not forwarded or needs assist from full stack * + * If lookup fails with BPF_FIB_LKUP_RET_FRAG_NEEDED, then the MTU + * was exceeded and output params->mtu_result contains the MTU. + * * long bpf_sock_hash_update(struct bpf_sock_ops *skops, struct bpf_map *map, void *key, u64 flags) * Description * Add an entry to, or update a sockhash *map* referencing sockets.
@@ -4975,9 +4978,13 @@ struct bpf_fib_lookup { __be16 sport; __be16 dport; - /* total length of packet from network header - used for MTU check */ - __u16 tot_len; + union { /* used for MTU check */ + /* input to lookup */ + __u16 tot_len; /* L3 length from network hdr (iph->tot_len) */ + /* output: MTU value */ + __u16 mtu_result; + }; /* input: L3 device index for lookup * output: device index from FIB lookup */ diff --git a/net/core/filter.c b/net/core/filter.c index c1e460193bae..db59ab55572c 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -5268,12 +5268,14 @@ static const struct bpf_func_proto bpf_skb_get_xfrm_state_proto = { #if IS_ENABLED(CONFIG_INET) || IS_ENABLED(CONFIG_IPV6) static int bpf_fib_set_fwd_params(struct bpf_fib_lookup *params, const struct neighbour *neigh, - const struct net_device *dev) + const struct net_device *dev, u32 mtu) { memcpy(params->dmac, neigh->ha, ETH_ALEN); memcpy(params->smac, dev->dev_addr, ETH_ALEN); params->h_vlan_TCI = 0; params->h_vlan_proto = 0; + if (mtu) + params->mtu_result = mtu; /* union with tot_len */ return 0; } @@ -5289,8 +5291,8 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params, struct net_device *dev; struct fib_result res; struct flowi4 fl4; + u32 mtu = 0; int err; - u32 mtu; dev = dev_get_by_index_rcu(net, params->ifindex); if (unlikely(!dev)) @@ -5357,8 +5359,10 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params, if (check_mtu) { mtu = ip_mtu_from_fib_result(&res, params->ipv4_dst); - if (params->tot_len > mtu) + if (params->tot_len > mtu) { + params->mtu_result = mtu; /* union with tot_len */ return BPF_FIB_LKUP_RET_FRAG_NEEDED; + } } nhc = res.nhc; @@ -5392,7 +5396,7 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params, if (!neigh) return BPF_FIB_LKUP_RET_NO_NEIGH; - return bpf_fib_set_fwd_params(params, neigh, dev); + return bpf_fib_set_fwd_params(params, neigh, dev, mtu); } #endif @@ -5409,7 +5413,7 @@ 
static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params, struct flowi6 fl6; int strict = 0; int oif, err; - u32 mtu; + u32 mtu = 0; /* link local addresses are never forwarded */ if (rt6_need_strict(dst) || rt6_need_strict(src)) @@ -5484,8 +5488,10 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params, if (check_mtu) { mtu = ipv6_stub->ip6_mtu_from_fib6(&res, dst, src); - if (params->tot_len > mtu) + if (params->tot_len > mtu) { + params->mtu_result = mtu; /* union with tot_len */ return BPF_FIB_LKUP_RET_FRAG_NEEDED; +
[PATCH bpf-next V10 5/7] bpf: drop MTU check when doing TC-BPF redirect to ingress
The use-case for dropping the MTU check when TC-BPF does redirect to ingress, is described by Eyal Birger in email[0]. The summary is the ability to increase packet size (e.g. with IPv6 headers for NAT64) and ingress redirect packet and let normal netstack fragment packet as needed. [0] https://lore.kernel.org/netdev/CAHsH6Gug-hsLGHQ6N0wtixdOa85LDZ3HNRHVd0opR=19qo4...@mail.gmail.com/ V9: - Make net_device "up" (IFF_UP) check explicit in skb_do_redirect V4: - Keep net_device "up" (IFF_UP) check. - Adjustment to handle bpf_redirect_peer() helper Signed-off-by: Jesper Dangaard Brouer --- include/linux/netdevice.h | 31 +-- net/core/dev.c| 19 ++- net/core/filter.c | 14 +++--- 3 files changed, 42 insertions(+), 22 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 7bf167993c05..4c78f73db1ec 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -3947,11 +3947,38 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb); bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff *skb); +static __always_inline bool __is_skb_forwardable(const struct net_device *dev, +const struct sk_buff *skb, +const bool check_mtu) +{ + const u32 vlan_hdr_len = 4; /* VLAN_HLEN */ + unsigned int len; + + if (!(dev->flags & IFF_UP)) + return false; + + if (!check_mtu) + return true; + + len = dev->mtu + dev->hard_header_len + vlan_hdr_len; + if (skb->len <= len) + return true; + + /* if TSO is enabled, we don't care about the length as the packet +* could be forwarded without being segmented before +*/ + if (skb_is_gso(skb)) + return true; + + return false; +} + static __always_inline int dev_forward_skb(struct net_device *dev, - struct sk_buff *skb) + struct sk_buff *skb, + const bool check_mtu) { if (skb_orphan_frags(skb, GFP_ATOMIC) || - unlikely(!is_skb_forwardable(dev, skb))) { + unlikely(!__is_skb_forwardable(dev, skb, check_mtu))) { atomic_long_inc(&dev->rx_dropped); kfree_skb(skb); return NET_RX_DROP; 
diff --git a/net/core/dev.c b/net/core/dev.c index a46334906c94..ffead13c158f 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -2176,28 +2176,13 @@ static inline void net_timestamp_set(struct sk_buff *skb) bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff *skb) { - unsigned int len; - - if (!(dev->flags & IFF_UP)) - return false; - - len = dev->mtu + dev->hard_header_len + VLAN_HLEN; - if (skb->len <= len) - return true; - - /* if TSO is enabled, we don't care about the length as the packet -* could be forwarded without being segmented before -*/ - if (skb_is_gso(skb)) - return true; - - return false; + return __is_skb_forwardable(dev, skb, true); } EXPORT_SYMBOL_GPL(is_skb_forwardable); int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb) { - int ret = dev_forward_skb(dev, skb); + int ret = dev_forward_skb(dev, skb, true); if (likely(!ret)) { skb->protocol = eth_type_trans(skb, dev); diff --git a/net/core/filter.c b/net/core/filter.c index 3f2e593244ca..1908800b671c 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -2083,13 +2083,21 @@ static const struct bpf_func_proto bpf_csum_level_proto = { static inline int __bpf_rx_skb(struct net_device *dev, struct sk_buff *skb) { - return dev_forward_skb(dev, skb); + int ret = dev_forward_skb(dev, skb, false); + + if (likely(!ret)) { + skb->protocol = eth_type_trans(skb, dev); + skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN); + ret = netif_rx(skb); + } + + return ret; } static inline int __bpf_rx_skb_no_mac(struct net_device *dev, struct sk_buff *skb) { - int ret = dev_forward_skb(dev, skb); + int ret = dev_forward_skb(dev, skb, false); if (likely(!ret)) { skb->dev = dev; @@ -2480,7 +2488,7 @@ int skb_do_redirect(struct sk_buff *skb) goto out_drop; dev = ops->ndo_get_peer_dev(dev); if (unlikely(!dev || -!is_skb_forwardable(dev, skb) || +!(dev->flags & IFF_UP) ||