Re: [PATCH bpf-next 2/3] xdp: Add tracepoint for bulk XDP_TX
On Thu, 23 May 2019 19:56:47 +0900 Toshiaki Makita wrote:

> This is introduced for admins to check what is happening on XDP_TX when
> bulk XDP_TX is in use.
>
> Signed-off-by: Toshiaki Makita
> ---
>  include/trace/events/xdp.h | 25 +
>  kernel/bpf/core.c          |  1 +
>  2 files changed, 26 insertions(+)
>
> diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h
> index e95cb86..e06ea65 100644
> --- a/include/trace/events/xdp.h
> +++ b/include/trace/events/xdp.h
> @@ -50,6 +50,31 @@
>  		  __entry->ifindex)
>  );
>
> +TRACE_EVENT(xdp_bulk_tx,
> +

You are using this tracepoint like/instead of trace_xdp_devmap_xmit,
if I understand correctly?  Or maybe the trace_xdp_redirect
tracepoint.  The point is that it will be good if the tracepoints can
share the TP_STRUCT layout beginning, as it allows for attaching and
reusing eBPF code that is only interested in the top part of the
struct.

I would also want to see some identifier that trace programs can use
to group and correlate these events. You do have ifindex, but most
other XDP tracepoints also have "prog_id".

> +	TP_PROTO(const struct net_device *dev,
> +		 int sent, int drops, int err),
> +
> +	TP_ARGS(dev, sent, drops, err),
> +
> +	TP_STRUCT__entry(
> +		__field(int, ifindex)
> +		__field(int, drops)
> +		__field(int, sent)
> +		__field(int, err)
> +	),

The xdp_redirect_template have:

	TP_STRUCT__entry(
		__field(int, prog_id)
		__field(u32, act)
		__field(int, ifindex)
		__field(int, err)
		__field(int, to_ifindex)
		__field(u32, map_id)
		__field(int, map_index)
	),

And e.g. TRACE_EVENT xdp_exception have:

	TP_STRUCT__entry(
		__field(int, prog_id)
		__field(u32, act)
		__field(int, ifindex)
	),

> +
> +	TP_fast_assign(
> +		__entry->ifindex = dev->ifindex;
> +		__entry->drops   = drops;
> +		__entry->sent    = sent;
> +		__entry->err     = err;
> +	),
> +
> +	TP_printk("ifindex=%d sent=%d drops=%d err=%d",
> +		  __entry->ifindex, __entry->sent, __entry->drops, __entry->err)
> +);
> +
>  DECLARE_EVENT_CLASS(xdp_redirect_template,
>
>  	TP_PROTO(const struct net_device *dev,
>
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 242a643..7687488 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -2108,3 +2108,4 @@ int __weak skb_copy_bits(const struct sk_buff *skb, int offset, void *to,
>  #include
>
>  EXPORT_TRACEPOINT_SYMBOL_GPL(xdp_exception);
> +EXPORT_TRACEPOINT_SYMBOL_GPL(xdp_bulk_tx);

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH bpf-next 3/3] veth: Support bulk XDP_TX
On Thu, 23 May 2019 20:35:50 +0900 Toshiaki Makita wrote:

> On 2019/05/23 20:25, Toke Høiland-Jørgensen wrote:
> > Toshiaki Makita writes:
> >
> >> This improves XDP_TX performance by about 8%.
> >>
> >> Here are single core XDP_TX test results. CPU consumptions are taken
> >> from "perf report --no-child".
> >>
> >> - Before:
> >>
> >>   7.26 Mpps
> >>
> >>   _raw_spin_lock  7.83%
> >>   veth_xdp_xmit  12.23%
> >>
> >> - After:
> >>
> >>   7.84 Mpps
> >>
> >>   _raw_spin_lock  1.17%
> >>   veth_xdp_xmit   6.45%
> >>
> >> Signed-off-by: Toshiaki Makita
> >> ---
> >>  drivers/net/veth.c | 26 +-
> >>  1 file changed, 25 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> >> index 52110e5..4edc75f 100644
> >> --- a/drivers/net/veth.c
> >> +++ b/drivers/net/veth.c
> >> @@ -442,6 +442,23 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
> >>  	return ret;
> >>  }
> >>
> >> +static void veth_xdp_flush_bq(struct net_device *dev)
> >> +{
> >> +	struct xdp_tx_bulk_queue *bq = this_cpu_ptr(&xdp_tx_bq);
> >> +	int sent, i, err = 0;
> >> +
> >> +	sent = veth_xdp_xmit(dev, bq->count, bq->q, 0);
> >
> > Wait, veth_xdp_xmit() is just putting frames on a pointer ring. So
> > you're introducing an additional per-cpu bulk queue, only to avoid lock
> > contention around the existing pointer ring. But the pointer ring is
> > per-rq, so if you have lock contention, this means you must have
> > multiple CPUs servicing the same rq, no?
>
> Yes, it's possible. Not recommended though.

I think the general per-cpu TX bulk queue is overkill. There is a loop
over packets in veth_xdp_rcv(struct veth_rq *rq, budget, *status), and
the caller veth_poll() will call veth_xdp_flush(rq->dev).

Why can't you store this "temp" bulk array in struct veth_rq ?

You could even alloc/create it on the stack of veth_poll() and send it
along via a pointer to veth_xdp_rcv().
-- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH v3 2/2] net: core: support XDP generic on stacked devices.
On Thu, 23 May 2019 13:15:44 -0700 Stephen Hemminger wrote:

> On Thu, 23 May 2019 19:19:40 +
> Saeed Mahameed wrote:
>
> > On Thu, 2019-05-23 at 10:54 -0700, Stephen Hemminger wrote:
> > > When a device is stacked like (team, bonding, failsafe or netvsc) the
> > > XDP generic program for the parent device was not called.
> > >
> > > Move the call to XDP generic inside __netif_receive_skb_core where
> > > it can be done multiple times for stacked case.
> > >
> > > Suggested-by: Jiri Pirko
> > > Fixes: d445516966dc ("net: xdp: support xdp generic on virtual devices")
> > > Signed-off-by: Stephen Hemminger
> > > ---
> > > v1 - call xdp_generic in netvsc handler
> > > v2 - do xdp_generic in generic rx handler processing
> > > v3 - move xdp_generic call inside the another pass loop
> > >
> > >  net/core/dev.c | 56 ++
> > >  1 file changed, 11 insertions(+), 45 deletions(-)
> > >
> > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > index b6b8505cfb3e..696776e14d00 100644
> > > --- a/net/core/dev.c
> > > +++ b/net/core/dev.c
> > > @@ -4502,23 +4502,6 @@ static int netif_rx_internal(struct sk_buff *skb)
> > >
> > >  	trace_netif_rx(skb);
> > >
> > > -	if (static_branch_unlikely(&generic_xdp_needed_key)) {
> > > -		int ret;
> > > -
> > > -		preempt_disable();
> > > -		rcu_read_lock();
> > > -		ret = do_xdp_generic(rcu_dereference(skb->dev->xdp_prog), skb);
> > > -		rcu_read_unlock();
> > > -		preempt_enable();
> > > -
> > > -		/* Consider XDP consuming the packet a success from
> > > -		 * the netdev point of view we do not want to count
> > > -		 * this as an error.
> > > -		 */
> > > -		if (ret != XDP_PASS)
> > > -			return NET_RX_SUCCESS;
> > > -	}
> > > -
> >
> > Adding Jesper,
> >
> > There is a small behavioral change due to this patch,
> > the XDP program after this patch will run on the RPS CPU, if
> > configured, which could cause some behavioral changes in
> > xdp_redirect_cpu: bpf_redirect_map(cpu_map).
> > Maybe this is acceptable, but it should be documented, as the current
> > assumption dictates: XDP program runs on the core where the XDP
> > frame/SKB was first seen.

This does break some assumptions, that I worry about.

I've not optimized generic XDP much, as this is supposed to be
slow-path, but as you can see in my evaluation[1] generic-XDP do have a
performance potential (XDP drop: native=12Mpps and generic=8.4Mpps),
but the XDP-generic performance dies as soon as we e.g. do XDP_TX
(native=10Mpps and generic=4Mpps). The reason is lack of bulking.

We could implement the same kind of RX-bulking tricks as we do for
XDP_REDIRECT, where bpf_redirect_map store frames in the map and send
them once NAPI-poll exit via a xdp_do_flush_map(). These tricks depend
on per-CPU data (bpf_redirect_info), thus I cannot see how this could
work if XDP-generic now happens after RPS on a remote CPU.

Notice, that TX bulking at XDP-generic level is actually rather simple,
as netstack TX path-code support xmit_more via creating a list of
SKBs... Last time I hacked it up, I saw 20%-30% speedup... anyone
motivated to do this?

> > Or maybe XDP should just force off RPS (like it does gro)

I guess, we could force off RPS. But I do see one valid use-case of
combining CPUMAP redirect with RFS (Receive Flow Steering), which is
part of RPS. Yes, I know we/I *also* have to implement
generic-XDP-cpumap support. But for native-XDP CPUMAP redirect, from
1-2 RX-CPUs into N-remote CPUs via CPUMAP, and then let RFS send SKBs
to where the application runs, does make sense to me. (I do have an
"assignment" to implement this in eBPF here[2]).
[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/Documentation/blogposts/xdp25_eval_generic_xdp_tx.rst [2] https://github.com/xdp-project/xdp-project/blob/master/areas/cpumap.org#cpumap-implement-dynamic-load-balancer-that-is-ooo-safe -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH bpf-next 3/3] veth: Support bulk XDP_TX
On Thu, 23 May 2019 22:51:34 +0900 Toshiaki Makita wrote: > On 19/05/23 (木) 22:29:27, Jesper Dangaard Brouer wrote: > > On Thu, 23 May 2019 20:35:50 +0900 > > Toshiaki Makita wrote: > > > >> On 2019/05/23 20:25, Toke Høiland-Jørgensen wrote: > >>> Toshiaki Makita writes: > >>> > >>>> This improves XDP_TX performance by about 8%. > >>>> > >>>> Here are single core XDP_TX test results. CPU consumptions are taken > >>>> from "perf report --no-child". > >>>> > >>>> - Before: > >>>> > >>>>7.26 Mpps > >>>> > >>>>_raw_spin_lock 7.83% > >>>>veth_xdp_xmit 12.23% > >>>> > >>>> - After: > >>>> > >>>>7.84 Mpps > >>>> > >>>>_raw_spin_lock 1.17% > >>>>veth_xdp_xmit 6.45% > >>>> > >>>> Signed-off-by: Toshiaki Makita > >>>> --- > >>>> drivers/net/veth.c | 26 +- > >>>> 1 file changed, 25 insertions(+), 1 deletion(-) > >>>> > >>>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c > >>>> index 52110e5..4edc75f 100644 > >>>> --- a/drivers/net/veth.c > >>>> +++ b/drivers/net/veth.c > >>>> @@ -442,6 +442,23 @@ static int veth_xdp_xmit(struct net_device *dev, > >>>> int n, > >>>> return ret; > >>>> } > >>>> > >>>> +static void veth_xdp_flush_bq(struct net_device *dev) > >>>> +{ > >>>> +struct xdp_tx_bulk_queue *bq = this_cpu_ptr(&xdp_tx_bq); > >>>> +int sent, i, err = 0; > >>>> + > >>>> +sent = veth_xdp_xmit(dev, bq->count, bq->q, 0); > >>> > >>> Wait, veth_xdp_xmit() is just putting frames on a pointer ring. So > >>> you're introducing an additional per-cpu bulk queue, only to avoid lock > >>> contention around the existing pointer ring. But the pointer ring is > >>> per-rq, so if you have lock contention, this means you must have > >>> multiple CPUs servicing the same rq, no? > >> > >> Yes, it's possible. Not recommended though. > >> > > > > I think the general per-cpu TX bulk queue is overkill. There is a loop > > over packets in veth_xdp_rcv(struct veth_rq *rq, budget, *status), and > > the caller veth_poll() will call veth_xdp_flush(rq->dev). 
> > Why can't you store this "temp" bulk array in struct veth_rq ?
>
> Of course I can. But I thought tun has the same problem and we can
> decrease memory footprint by sharing the same storage between devices.
> Or if other devices want to reduce queues so that we can use XDP on
> many-cpu servers and introduce locks, we can use this storage for
> that case as well.
>
> Still do you prefer veth-specific solution?

Yes. Another reason is that with this shared/general per-cpu TX bulk
queue, I can easily see bugs resulting in xdp_frames getting
transmitted on a completely other NIC, which will be hard to debug for
people.

> > You could even alloc/create it on the stack of veth_poll() and send
> > it along via a pointer to veth_xdp_rcv().

IMHO it would be cleaner code wise to place the "temp" bulk array in
struct veth_rq. But if you worry about performance and want a hot
cacheline for this, then you could just use the call-stack for
veth_poll(), as I described. It should not be too ugly code wise to do
this I think.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH v3 2/2] net: core: support XDP generic on stacked devices.
On Fri, 24 May 2019 12:17:27 +0800 Jason Wang wrote:

> > Maybe this is acceptable, but it should be documented, as the current
> > assumption dictates: XDP program runs on the core where the XDP
> > frame/SKB was first seen.
>
> At least for TUN, this is not true. XDP frames were built by vhost_net
> and passed to TUN. There's no guarantee that vhost_net kthread won't
> move to another core.

This sounds a little scary, as we depend on per-CPU variables (e.g.
bpf_redirect_info). Can the vhost_net kthread move between CPUs
within/during the NAPI-poll?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH bpf-next v3 2/3] libbpf: add bpf_object__load_xattr() API function to pass log_level
On Fri, 24 May 2019 11:36:47 +0100 Quentin Monnet wrote:

> libbpf was recently made aware of the log_level attribute for programs,
> used to specify the level of information expected to be dumped by the
> verifier. Function bpf_prog_load_xattr() got support for this log_level
> parameter.
>
> But some applications using libbpf rely on another function to load
> programs, bpf_object__load(), which does not accept any parameter for
> log level. Create an API function based on bpf_object__load(), but
> accepting an "attr" object as a parameter. Then add a log_level field
> to that object, so that applications calling the new
> bpf_object__load_xattr() can pick the desired log level.

Does this allow us to extend struct bpf_object_load_attr later?

> v3:
> - Rewrite commit log.
>
> v2:
> - We are in a new cycle, bump libbpf extraversion number.
>
> Signed-off-by: Quentin Monnet
> Reviewed-by: Jakub Kicinski
> ---
>  tools/lib/bpf/Makefile   |  2 +-
>  tools/lib/bpf/libbpf.c   | 20 +---
>  tools/lib/bpf/libbpf.h   |  6 ++
>  tools/lib/bpf/libbpf.map |  5 +
>  4 files changed, 29 insertions(+), 4 deletions(-)
>
> diff --git a/tools/lib/bpf/Makefile b/tools/lib/bpf/Makefile
> index a2aceadf68db..9312066a1ae3 100644
> --- a/tools/lib/bpf/Makefile
> +++ b/tools/lib/bpf/Makefile
> @@ -3,7 +3,7 @@
>
>  BPF_VERSION = 0
>  BPF_PATCHLEVEL = 0
> -BPF_EXTRAVERSION = 3
> +BPF_EXTRAVERSION = 4
>
>  MAKEFLAGS += --no-print-directory
>
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 197b574406b3..1c6fb7a3201e 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -,7 +,7 @@ static bool bpf_program__is_function_storage(struct bpf_program *prog,
>  }
>
>  static int
> -bpf_object__load_progs(struct bpf_object *obj)
> +bpf_object__load_progs(struct bpf_object *obj, int log_level)
>  {
>  	size_t i;
>  	int err;
> @@ -2230,6 +2230,7 @@ bpf_object__load_progs(struct bpf_object *obj)
>  	for (i = 0; i < obj->nr_programs; i++) {
>  		if (bpf_program__is_function_storage(&obj->programs[i], obj))
>  			continue;
> +		obj->programs[i].log_level = log_level;
>  		err = bpf_program__load(&obj->programs[i],
>  					obj->license,
>  					obj->kern_version);
> @@ -2381,10 +2382,14 @@ int bpf_object__unload(struct bpf_object *obj)
>  	return 0;
>  }
>
> -int bpf_object__load(struct bpf_object *obj)
> +int bpf_object__load_xattr(struct bpf_object_load_attr *attr)
>  {
> +	struct bpf_object *obj;
>  	int err;
>
> +	if (!attr)
> +		return -EINVAL;
> +	obj = attr->obj;
>  	if (!obj)
>  		return -EINVAL;
>
> @@ -2397,7 +2402,7 @@ int bpf_object__load(struct bpf_object *obj)
>
>  	CHECK_ERR(bpf_object__create_maps(obj), err, out);
>  	CHECK_ERR(bpf_object__relocate(obj), err, out);
> -	CHECK_ERR(bpf_object__load_progs(obj), err, out);
> +	CHECK_ERR(bpf_object__load_progs(obj, attr->log_level), err, out);
>
>  	return 0;
> out:
> @@ -2406,6 +2411,15 @@ int bpf_object__load(struct bpf_object *obj)
>  	return err;
>  }
>
> +int bpf_object__load(struct bpf_object *obj)
> +{
> +	struct bpf_object_load_attr attr = {
> +		.obj = obj,
> +	};
> +
> +	return bpf_object__load_xattr(&attr);
> +}
> +
>  static int check_path(const char *path)
>  {
>  	char *cp, errmsg[STRERR_BUFSIZE];
>
> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> index c5ff00515ce7..e1c748db44f6 100644
> --- a/tools/lib/bpf/libbpf.h
> +++ b/tools/lib/bpf/libbpf.h
> @@ -89,8 +89,14 @@ LIBBPF_API int bpf_object__unpin_programs(struct bpf_object *obj,
>  LIBBPF_API int bpf_object__pin(struct bpf_object *object, const char *path);
>  LIBBPF_API void bpf_object__close(struct bpf_object *object);
>
> +struct bpf_object_load_attr {
> +	struct bpf_object *obj;
> +	int log_level;
> +};

Can this be extended later?

>  /* Load/unload object into/from kernel */
>  LIBBPF_API int bpf_object__load(struct bpf_object *obj);
> +LIBBPF_API int bpf_object__load_xattr(struct bpf_object_load_attr *attr);
>  LIBBPF_API int bpf_object__unload(struct bpf_object *obj);
>  LIBBPF_API const char *bpf_object__name(struct bpf_object *obj);
>  LIBBPF_API unsigned int bpf_object__kversion(struct bpf_object *obj);
>
> diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
> index 673001787cba..6ce61fa0baf3 100644
> --- a/tools/lib/bpf/libbpf.map
> +++ b/tools/lib/bpf/libbpf.map
> @@ -164,3 +164,8 @@ LIBBPF_0.0.3 {
>  		bpf_map_freeze;
>  		btf__finalize_data;
>  } LIBBPF_0.0.2;
> +
> +LIBBPF_0.0.4 {
> +	global:
> +		bpf_object__load_xattr;
> +} LIBBPF_0.0.3;

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH bpf-next v3 2/3] libbpf: add bpf_object__load_xattr() API function to pass log_level
On Fri, 24 May 2019 12:51:14 +0100 Quentin Monnet wrote:

> 2019-05-24 13:22 UTC+0200 ~ Jesper Dangaard Brouer
> > On Fri, 24 May 2019 11:36:47 +0100
> > Quentin Monnet wrote:
> >
> >> libbpf was recently made aware of the log_level attribute for programs,
> >> used to specify the level of information expected to be dumped by the
> >> verifier. Function bpf_prog_load_xattr() got support for this log_level
> >> parameter.
> >>
> >> But some applications using libbpf rely on another function to load
> >> programs, bpf_object__load(), which does not accept any parameter for
> >> log level. Create an API function based on bpf_object__load(), but
> >> accepting an "attr" object as a parameter. Then add a log_level field
> >> to that object, so that applications calling the new
> >> bpf_object__load_xattr() can pick the desired log level.
> >
> > Does this allow us to extend struct bpf_object_load_attr later?
>
> I see no reason why it could not. Having the _xattr() version of the
> function is precisely a way to have something extensible in the future,
> without having to create additional API functions each time we want to
> pass a new parameter. And e.g. struct bpf_prog_load_attr (used with
> bpf_prog_load_xattr()) has already been extended in the past. So, yeah,
> we can add to it in the future.

Great. I just don't know/understand how user-space handles this. If a
binary is compiled with libbpf as a dynamically loadable lib, and it
e.g. saw libbpf.so.2 when it was compiled, then can't it choose to use
libbpf.so.3 then? (e.g. when libbpf.so.2 is not on the system).
(I would actually like to learn/understand this, so links are welcome).

> Do you have something in mind?

I was playing with extending bpf_prog_load_attr, but instead I created
a bpf_prog_load_attr_maps and a new function
bpf_prog_load_xattr_maps(), e.g. see:

 https://github.com/xdp-project/xdp-tutorial/blob/master/common/common_libbpf.h
 https://github.com/xdp-project/xdp-tutorial/blob/master/common/common_libbpf.c

I guess, I could just extend bpf_prog_load_attr instead, right?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
Re: Bad XDP performance with mlx5
On Fri, 31 May 2019 08:51:43 +0200 Tom Barbette wrote:

> CCing mlx5 maintainers and commiters of bce2b2b. TL;DR: there is a
> huge CPU increase on CX5 when introducing a XDP program.
>
> See https://www.youtube.com/watch?v=o5hlJZbN4Tk&feature=youtu.be
> around 0:40. We're talking something like 15% while it's near 0 for
> other drivers. The machine is a recent Skylake. For us it makes XDP
> unusable. Is that a known problem?

I have a similar test setup, and I can reproduce. I have found the
root-cause, see below. But on my system it was even worse: with an
XDP_PASS program loaded, and iperf (6 parallel TCP flows), I would see
100% CPU usage and total 83.3 Gbits/sec. With the non-XDP case, I saw
58% CPU (43% idle) and total 89.7 Gbits/sec.

> I wonder if it doesn't simply come from mlx5/en_main.c:
> rq->buff.map_dir = rq->xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;

Nope, that is not the problem.

> Which would be inline from my observation that memory access seems
> heavier. I guess this is for the XDP_TX case.
>
> If this is indeed the problem. Any chance we can:
> a) detect automatically that a program will not return XDP_TX (I'm not
> quite sure about what the BPF limitations allow to guess in advance) or
> b) add a flag to such as XDP_FLAGS_NO_TX to avoid such hit in
> performance when not needed?

This was kind of hard to root-cause, but I solved it by increasing the
TCP socket size used by the iperf tool, like this (please reproduce):

 $ iperf -s --window 4M
 Server listening on TCP port 5001
 TCP window size:  416 KByte (WARNING: requested 4.00 MByte)

Given I could reproduce, I took a closer look at perf record/report
stats, and it was actually quite clear that this was related to
stalling on getting pages from the page allocator (function calls top#6
get_page_from_freelist and free_pcppages_bulk).

Using my tool ethtool_stats.pl:
 https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl

It was clear that the mlx5 driver page-cache was not working:

 Ethtool(mlx5p1 ) stat: 6653761 ( 6,653,761) <= rx_cache_busy /sec
 Ethtool(mlx5p1 ) stat: 6653732 ( 6,653,732) <= rx_cache_full /sec
 Ethtool(mlx5p1 ) stat:  669481 (   669,481) <= rx_cache_reuse /sec
 Ethtool(mlx5p1 ) stat:       1 (         1) <= rx_congst_umr /sec
 Ethtool(mlx5p1 ) stat: 7323230 ( 7,323,230) <= rx_csum_unnecessary /sec
 Ethtool(mlx5p1 ) stat:    1034 (     1,034) <= rx_discards_phy /sec
 Ethtool(mlx5p1 ) stat: 7323230 ( 7,323,230) <= rx_packets /sec
 Ethtool(mlx5p1 ) stat: 7324244 ( 7,324,244) <= rx_packets_phy /sec

While the non-XDP case looked like this:

 Ethtool(mlx5p1 ) stat:  298929 (   298,929) <= rx_cache_busy /sec
 Ethtool(mlx5p1 ) stat:  298971 (   298,971) <= rx_cache_full /sec
 Ethtool(mlx5p1 ) stat: 3548789 ( 3,548,789) <= rx_cache_reuse /sec
 Ethtool(mlx5p1 ) stat: 7695476 ( 7,695,476) <= rx_csum_complete /sec
 Ethtool(mlx5p1 ) stat: 7695476 ( 7,695,476) <= rx_packets /sec
 Ethtool(mlx5p1 ) stat: 7695169 ( 7,695,169) <= rx_packets_phy /sec

 Manual consistency calc: 7695476-((3548789*2)+(298971*2)) = -44

With the increased TCP window size, the mlx5 driver cache is working
better, but not optimally, see below. I'm getting 88.0 Gbits/sec with
68% CPU usage.

 Ethtool(mlx5p1 ) stat:  894438 (   894,438) <= rx_cache_busy /sec
 Ethtool(mlx5p1 ) stat:  894453 (   894,453) <= rx_cache_full /sec
 Ethtool(mlx5p1 ) stat: 6638518 ( 6,638,518) <= rx_cache_reuse /sec
 Ethtool(mlx5p1 ) stat:       6 (         6) <= rx_congst_umr /sec
 Ethtool(mlx5p1 ) stat: 7532983 ( 7,532,983) <= rx_csum_unnecessary /sec
 Ethtool(mlx5p1 ) stat:     164 (       164) <= rx_discards_phy /sec
 Ethtool(mlx5p1 ) stat: 7532983 ( 7,532,983) <= rx_packets /sec
 Ethtool(mlx5p1 ) stat: 7533193 ( 7,533,193) <= rx_packets_phy /sec

 Manual consistency calc: 7532983-(6638518+894453) = 12

To understand why this is happening, you first have to know that the
difference is between the two RX-memory modes used by mlx5 for non-XDP
vs XDP. With non-XDP, two frames are stored per memory-page, while for
XDP only a single frame per page is used. The packets available in the
RX-rings are actually the same, as the ring sizes are non-XDP=512 vs.
XDP=1024.

I believe the real issue is that TCP use the SKB->truesize (based on
frame size) for different memory pressure and window calculations,
which is why it solved the issue to increase the window size manually.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
Re: Bad XDP performance with mlx5
On Fri, 31 May 2019 18:06:01 + Saeed Mahameed wrote:

> On Fri, 2019-05-31 at 18:18 +0200, Jesper Dangaard Brouer wrote:
[...]
> >
> > To understand why this is happening, you first have to know that the
> > difference is between the two RX-memory modes used by mlx5 for
> > non-XDP vs XDP. With non-XDP two frames are stored per memory-page,
> > while for XDP only a single frame per page is used. The packets
> > available in the RX-rings are actually the same, as the ring sizes
> > are non-XDP=512 vs. XDP=1024.
>
> Thanks Jesper ! this was a well put together explanation.
> I want to point out that some other drivers are using alloc_skb APIs
> which provide a good caching mechanism, which is even better than the
> mlx5 internal one (which uses the alloc_page APIs directly), this can
> explain the difference, and your explanation shows the root cause of
> the higher cpu util with XDP on mlx5, since the mlx5 page cache works
> with half of its capacity when enabling XDP.
>
> Now do we really need to keep this page per packet in mlx5 when XDP is
> enabled ? i think it is time to drop that ..

No, we need to keep the page per packet (at least until I've solved
some corner-cases with page_pool, which could likely require getting a
page-flag).

> > I believe, the real issue is that TCP use the SKB->truesize (based
> > on frame size) for different memory pressure and window
> > calculations, which is why it solved the issue to increase the
> > window size manually.

The TCP performance issue is not solely a SKB->truesize issue, but
also an issue with how the driver level page-cache works. It is
actually very fragile, as a single page with elevated refcnt can block
the cache (see mlx5e_rx_cache_get()). Which easily happens with TCP
packets that are waiting to be re-transmitted in case of loss. This is
happening here, as indicated by rx_cache_busy and rx_cache_full being
the same.

We (Ilias, Tariq and I) have been planning to remove this small driver
cache, and instead use the page_pool, and create a page-return path
for SKBs. Which should make this problem go away. I'm going to be
working on this the next couple of weeks (the tricky part is all the
corner cases).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
Re: Bad XDP performance with mlx5
On Tue, 4 Jun 2019 09:28:22 +0200 Tom Barbette wrote: > Thanks Jesper for looking into this! > > I don't think I will be of much help further on this matter. My take > out would be: as a first-time user looking into XDP after watching a > dozen of XDP talks, I would have expected XDP default settings to be > identical to SKB, so I don't have to watch out for a set of > per-driver parameter checklist to avoid increasing my CPU consumption > by 15% when inserting "a super efficient and light BPF program". But > I understand it's not that easy... The gap should not be this large, but as I demonstrated it was primarily because you hit an unfortunate interaction with TCP and how the mlx5 driver does page-caching (p.s. we are working on removing this driver local recycle-cache). When loading an XDP/eBPF-prog then the driver change the underlying RX memory model, which waste memory to gain packets-per-sec speed, but TCP sees this memory waste and gives us a penalty. It is important to understand, that XDP is not optimized for TCP. XDP is designed and optimized for L2-L3 handling of packets (TCP is L4). Before XDP these L2-L3 use-cases were "slow", because the kernel netstack assumes a L4/socket use-case (full SKB), when less was really needed. This is actually another good example of why XDP programs per RX-queue, will be useful (notice: which is not implemented upstream, yet...). -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH bpf-next v2 1/2] net: xdp: refactor XDP_QUERY_PROG{,_HW} to netdev
On Mon, 3 Jun 2019 21:20:30 + Saeed Mahameed wrote: > On Mon, 2019-06-03 at 11:04 +0200, Björn Töpel wrote: > > On Sat, 1 Jun 2019 at 21:42, Jakub Kicinski > > wrote: > > > On Fri, 31 May 2019 19:18:17 +, Saeed Mahameed wrote: > > > > On Fri, 2019-05-31 at 11:42 +0200, Björn Töpel wrote: > > > > > From: Björn Töpel > > > > > > > > > > All XDP capable drivers need to implement the > > > > > XDP_QUERY_PROG{,_HW} command of ndo_bpf. The query code is > > > > > fairly generic. This commit refactors the query code up from > > > > > the drivers to the netdev level. > > > > > > > > > > The struct net_device has gained two new members: xdp_prog_hw > > > > > and xdp_flags. The former is the offloaded XDP program, if > > > > > any, and the latter tracks the flags that the supplied when > > > > > attaching the XDP program. The flags only apply to SKB_MODE > > > > > or DRV_MODE, not HW_MODE. > > > > > > > > > > The xdp_prog member, previously only used for SKB_MODE, is > > > > > shared with DRV_MODE. This is OK, due to the fact that > > > > > SKB_MODE and DRV_MODE are mutually exclusive. To > > > > > differentiate between the two modes, a new internal flag is > > > > > introduced as well. > > > > > > > > Just thinking out loud, why can't we allow any combination of > > > > HW/DRV/SKB modes? they are totally different attach points in a > > > > totally different checkpoints in a frame life cycle. The DRV_MODE and SKB_MODE is tricky to run concurrently. The XDP-redirect scheme (designed by John) use a BPF helper (bpf_xdp_redirect_map) that set global per-CPU state (this_cpu_ptr(&bpf_redirect_info)). The tricky part (which I warned about, and we already have some fixes for) is that the XDP/BPF-prog can call bpf_redirect_map, which update the per-CPU state, but it can still choose not to return XDP_REDIRECT, which then miss an invocation of xdp_do_redirect(). 
Later, your SKB_MODE XDP/BPF-prog can return XDP_REDIRECT without
calling the helper, and then use/reach the per-CPU info set by the
DRV_MODE program, which is NOT something I want us to "support".

> > > FWIW see Message-ID: <20190201080236.446d8...@redhat.com>
> >
> > I've always seen the SKB-mode as something that will eventually be
> > removed.
>
> I don't think so, we are too deep into SKB-mode.

I wish we could remove it. After we got native XDP support in veth,
its original purpose is gone, which was making it easier for
developers to get something working on their laptop, without having to
deploy it to the server with physical hardware all the time.

> > Clickable link:
> > https://lore.kernel.org/netdev/20190201080236.446d8...@redhat.com/
>
> So we are all hanging on Jesper's refactoring ideas that are not
> getting any priority for now :).

Well, that is not true. This patchset _is_ the refactor idea that
Bjørn is taking over and working on. Specifically [2] from above link.

[2] https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp_per_rxq01.org#refactor-idea-xdpbpf_prog-into-netdev_rx_queuenet_device

> > > > Down the road i think we will utilize this fact and start
> > > > introducing SKB helpers for SKB mode and driver helpers for DRV
> > > > mode..
> > >
> > > Any reason why we would want the extra complexity? There is
> > > cls_bpf if someone wants skb features after all..
>
> Donno, SKB mode is earlier in the stack maybe ..

From a BPF perspective you cannot introduce SKB helpers for SKB mode.
An XDP-prog have bpf prog type XDP (BPF_PROG_TYPE_XDP), and the
program itself is identical regardless of attaching for DRV_MODE or
SKB_MODE. You cannot detect this at attach time, due to tail-calls
(which have painted us into a corner).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH net-next 2/2] devmap: Allow map lookups from eBPF
On Tue, 04 Jun 2019 17:24:10 +0200 Toke Høiland-Jørgensen wrote: > We don't currently allow lookups into a devmap from eBPF, because the map > lookup returns a pointer directly to the dev->ifindex, which shouldn't be > modifiable from eBPF. > > However, being able to do lookups in devmaps is useful to know (e.g.) > whether forwarding to a specific interface is enabled. Currently, programs > work around this by keeping a shadow map of another type which indicates > whether a map index is valid. > > To allow lookups, simply copy the ifindex into a scratch variable and > return a pointer to this. If an eBPF program does modify it, this doesn't > matter since it will be overridden on the next lookup anyway. While this > does add a write to every lookup, the overhead of this is negligible > because the cache line is hot when both the write and the subsequent > read happens. When we choose the return value, here the ifindex, it basically becomes UAPI, right? Can we somehow use BTF to help us make this extensible? As Toke mentions in the cover letter, we really want to know if the chosen egress has actually enabled/allocated resources for XDP transmit, but we currently don't have an in-kernel way to query this (thus, we cannot expose such info).
> Signed-off-by: Toke Høiland-Jørgensen > --- > kernel/bpf/devmap.c |8 +++- > kernel/bpf/verifier.c |7 ++- > 2 files changed, 9 insertions(+), 6 deletions(-) > > diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c > index 5ae7cce5ef16..830650300ea4 100644 > --- a/kernel/bpf/devmap.c > +++ b/kernel/bpf/devmap.c > @@ -65,6 +65,7 @@ struct xdp_bulk_queue { > struct bpf_dtab_netdev { > struct net_device *dev; /* must be first member, due to tracepoint */ > struct bpf_dtab *dtab; > + int ifindex_scratch; > unsigned int bit; > struct xdp_bulk_queue __percpu *bulkq; > struct rcu_head rcu; > @@ -375,7 +376,12 @@ static void *dev_map_lookup_elem(struct bpf_map *map, > void *key) > struct bpf_dtab_netdev *obj = __dev_map_lookup_elem(map, *(u32 *)key); > struct net_device *dev = obj ? obj->dev : NULL; > > - return dev ? &dev->ifindex : NULL; > + if (dev) { > + obj->ifindex_scratch = dev->ifindex; > + return &obj->ifindex_scratch; > + } > + > + return NULL; > } > > static void dev_map_flush_old(struct bpf_dtab_netdev *dev) > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c > index 5c2cb5bd84ce..7128a9821481 100644 > --- a/kernel/bpf/verifier.c > +++ b/kernel/bpf/verifier.c > @@ -2893,12 +2893,9 @@ static int check_map_func_compatibility(struct > bpf_verifier_env *env, > if (func_id != BPF_FUNC_get_local_storage) > goto error; > break; > - /* devmap returns a pointer to a live net_device ifindex that we cannot > - * allow to be modified from bpf side. So do not allow lookup elements > - * for now. > - */ > case BPF_MAP_TYPE_DEVMAP: > - if (func_id != BPF_FUNC_redirect_map) > + if (func_id != BPF_FUNC_redirect_map && > + func_id != BPF_FUNC_map_lookup_elem) > goto error; > break; > /* Restrict bpf side of cpumap and xskmap, open when use-cases > -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH v4 bpf-next 1/2] bpf: Allow bpf_map_lookup_elem() on an xskmap
On Mon, 3 Jun 2019 09:38:51 -0700 Jonathan Lemon wrote: > Currently, the AF_XDP code uses a separate map in order to > determine if an xsk is bound to a queue. Instead of doing this, > have bpf_map_lookup_elem() return the queue_id, as a way of > indicating that there is a valid entry at the map index. Just a reminder that once we choose a return value, here the queue_id, it basically becomes UAPI, and we cannot change it. Can we somehow use BTF to allow us to extend this later? I was also going to point out that you cannot return a direct pointer to queue_id, as the BPF-prog side can modify this... but Daniel already pointed this out. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH v2 bpf-next 1/2] xdp: Add tracepoint for bulk XDP_TX
On Wed, 5 Jun 2019 14:36:12 +0900 Toshiaki Makita wrote: > This is introduced for admins to check what is happening on XDP_TX when > bulk XDP_TX is in use, which will be first introduced in veth in next > commit. Is the plan that this tracepoint 'xdp:xdp_bulk_tx' should be used by all drivers? (more below) > Signed-off-by: Toshiaki Makita > --- > include/trace/events/xdp.h | 25 + > kernel/bpf/core.c | 1 + > 2 files changed, 26 insertions(+) > > diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h > index e95cb86..e06ea65 100644 > --- a/include/trace/events/xdp.h > +++ b/include/trace/events/xdp.h > @@ -50,6 +50,31 @@ > __entry->ifindex) > ); > > +TRACE_EVENT(xdp_bulk_tx, > + > + TP_PROTO(const struct net_device *dev, > + int sent, int drops, int err), > + > + TP_ARGS(dev, sent, drops, err), > + > + TP_STRUCT__entry( All other tracepoints in this file start with: __field(int, prog_id) __field(u32, act) or __field(int, map_id) __field(u32, act) Could you please add those? > + __field(int, ifindex) > + __field(int, drops) > + __field(int, sent) > + __field(int, err) > + ), The reason is that this makes it easier to attach to multiple tracepoints, and extract the same value. Example with bpftrace oneliner: $ sudo bpftrace -e 'tracepoint:xdp:xdp_* { @action[args->act] = count(); }' Attaching 8 probes... ^C @action[4]: 30259246 @action[0]: 34489024 XDP_ABORTED = 0 XDP_REDIRECT = 4 > + > + TP_fast_assign( __entry->act = XDP_TX; > + __entry->ifindex = dev->ifindex; > + __entry->drops = drops; > + __entry->sent = sent; > + __entry->err = err; > + ), > + > + TP_printk("ifindex=%d sent=%d drops=%d err=%d", > + __entry->ifindex, __entry->sent, __entry->drops, __entry->err) > +); > + Other fun bpftrace stuff: sudo bpftrace -e 'tracepoint:xdp:xdp_*map* { @map_id[comm, args->map_id] = count(); }' Attaching 5 probes...
^C @map_id[swapper/2, 113]: 1428 @map_id[swapper/0, 113]: 2085 @map_id[ksoftirqd/4, 113]: 2253491 @map_id[ksoftirqd/2, 113]: 25677560 @map_id[ksoftirqd/0, 113]: 29004338 @map_id[ksoftirqd/3, 113]: 31034885 $ bpftool map list id 113 113: devmap name tx_port flags 0x0 key 4B value 4B max_entries 100 memlock 4096B p.s. People should look out for Brendan Gregg's upcoming book on BPF performance tools, from which I learned to use bpftrace :-) -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH net-next 1/2] bpf_xdp_redirect_map: Add flag to return XDP_PASS on map lookup failure
On Tue, 04 Jun 2019 17:24:10 +0200 Toke Høiland-Jørgensen wrote: > The bpf_redirect_map() helper used by XDP programs doesn't return any > indication of whether it can successfully redirect to the map index it was > given. Instead, BPF programs have to track this themselves, leading to > programs using duplicate maps to track which entries are populated in the > devmap. > > This adds a flag to the XDP version of the bpf_redirect_map() helper, which > makes the helper do a lookup in the map when called, and return XDP_PASS if > there is no value at the provided index. This enables two use cases: To Jonathan Lemon, notice this approach of adding a flag to the helper call, it actually also works for your use-case of XSK AF_XDP maps. > - A BPF program can check the return code from the helper call and react if > it is XDP_PASS (by, for instance, redirecting out a different interface). > > - Programs that just return the value of the bpf_redirect() call will > automatically fall back to the regular networking stack, simplifying > programs that (for instance) build a router with the fib_lookup() helper. > > Signed-off-by: Toke Høiland-Jørgensen > --- > include/uapi/linux/bpf.h |8 > net/core/filter.c| 10 +- > 2 files changed, 17 insertions(+), 1 deletion(-) > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > index 7c6aef253173..4c41482b7604 100644 > --- a/include/uapi/linux/bpf.h > +++ b/include/uapi/linux/bpf.h > @@ -3098,6 +3098,14 @@ enum xdp_action { > XDP_REDIRECT, > }; > > +/* Flags for bpf_xdp_redirect_map helper */ > + > +/* If set, the help will check if the entry exists in the map and return > + * XDP_PASS if it doesn't. 
> + */ > +#define XDP_REDIRECT_PASS_ON_INVALID BIT(0) > +#define XDP_REDIRECT_ALL_FLAGS XDP_REDIRECT_PASS_ON_INVALID > + > /* user accessible metadata for XDP packet hook > * new fields must be added to the end of this structure > */ > diff --git a/net/core/filter.c b/net/core/filter.c > index 55bfc941d17a..dfab8478f66c 100644 > --- a/net/core/filter.c > +++ b/net/core/filter.c > @@ -3755,9 +3755,17 @@ BPF_CALL_3(bpf_xdp_redirect_map, struct bpf_map *, > map, u32, ifindex, > { > struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); > > - if (unlikely(flags)) > + if (unlikely(flags & ~XDP_REDIRECT_ALL_FLAGS)) > return XDP_ABORTED; > > + if (flags & XDP_REDIRECT_PASS_ON_INVALID) { > + struct net_device *fwd; It is slightly misleading that '*fwd' is a 'struct net_device', as the __xdp_map_lookup_elem() call works for all the supported redirect-map types. People should realize that this patch is a general approach for all the redirect-map types. > + > + fwd = __xdp_map_lookup_elem(map, ifindex); > + if (unlikely(!fwd)) > + return XDP_PASS; > + } > + > ri->ifindex = ifindex; > ri->flags = flags; > WRITE_ONCE(ri->map, map); -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH net-next v2 2/2] devmap: Allow map lookups from eBPF
On Thu, 06 Jun 2019 15:24:14 +0200 Toke Høiland-Jørgensen wrote: > From: Toke Høiland-Jørgensen > > We don't currently allow lookups into a devmap from eBPF, because the map > lookup returns a pointer directly to the dev->ifindex, which shouldn't be > modifiable from eBPF. > > However, being able to do lookups in devmaps is useful to know (e.g.) > whether forwarding to a specific interface is enabled. Currently, programs > work around this by keeping a shadow map of another type which indicates > whether a map index is valid. > > Since we now have a flag to make maps read-only from the eBPF side, we can > simply lift the lookup restriction if we make sure this flag is always set. Nice, I didn't know this was possible. I like it! :-) > Signed-off-by: Toke Høiland-Jørgensen > --- > kernel/bpf/devmap.c |5 + > kernel/bpf/verifier.c |7 ++- > 2 files changed, 7 insertions(+), 5 deletions(-) > > diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c > index 5ae7cce5ef16..0e6875a462ef 100644 > --- a/kernel/bpf/devmap.c > +++ b/kernel/bpf/devmap.c > @@ -99,6 +99,11 @@ static struct bpf_map *dev_map_alloc(union bpf_attr *attr) > attr->value_size != 4 || attr->map_flags & ~DEV_CREATE_FLAG_MASK) > return ERR_PTR(-EINVAL); > > + /* Lookup returns a pointer straight to dev->ifindex, so make sure the > + * verifier prevents writes from the BPF side > + */ > + attr->map_flags |= BPF_F_RDONLY_PROG; > + > dtab = kzalloc(sizeof(*dtab), GFP_USER); > if (!dtab) > return ERR_PTR(-ENOMEM); > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c > index 5c2cb5bd84ce..7128a9821481 100644 > --- a/kernel/bpf/verifier.c > +++ b/kernel/bpf/verifier.c > @@ -2893,12 +2893,9 @@ static int check_map_func_compatibility(struct > bpf_verifier_env *env, > if (func_id != BPF_FUNC_get_local_storage) > goto error; > break; > - /* devmap returns a pointer to a live net_device ifindex that we cannot > - * allow to be modified from bpf side. So do not allow lookup elements > - * for now. 
> - */ > case BPF_MAP_TYPE_DEVMAP: > - if (func_id != BPF_FUNC_redirect_map) > + if (func_id != BPF_FUNC_redirect_map && > + func_id != BPF_FUNC_map_lookup_elem) > goto error; > break; > /* Restrict bpf side of cpumap and xskmap, open when use-cases > -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH v2 bpf-next 1/2] xdp: Add tracepoint for bulk XDP_TX
On Thu, 6 Jun 2019 20:04:20 +0900 Toshiaki Makita wrote: > On 2019/06/05 16:59, Jesper Dangaard Brouer wrote: > > On Wed, 5 Jun 2019 14:36:12 +0900 > > Toshiaki Makita wrote: > > > >> This is introduced for admins to check what is happening on XDP_TX when > >> bulk XDP_TX is in use, which will be first introduced in veth in next > >> commit. > > > > Is the plan that this tracepoint 'xdp:xdp_bulk_tx' should be used by > > all drivers? > > I guess you mean all drivers that implement similar mechanism should use > this? Then yes. > (I don't think all drivers needs bulk tx mechanism though) > > > (more below) > > > >> Signed-off-by: Toshiaki Makita > >> --- > >> include/trace/events/xdp.h | 25 + > >> kernel/bpf/core.c | 1 + > >> 2 files changed, 26 insertions(+) > >> > >> diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h > >> index e95cb86..e06ea65 100644 > >> --- a/include/trace/events/xdp.h > >> +++ b/include/trace/events/xdp.h > >> @@ -50,6 +50,31 @@ > >> __entry->ifindex) > >> ); > >> > >> +TRACE_EVENT(xdp_bulk_tx, > >> + > >> + TP_PROTO(const struct net_device *dev, > >> + int sent, int drops, int err), > >> + > >> + TP_ARGS(dev, sent, drops, err), > >> + > >> + TP_STRUCT__entry( > > > > All other tracepoints in this file starts with: > > > > __field(int, prog_id) > > __field(u32, act) > > or > > __field(int, map_id) > > __field(u32, act) > > > > Could you please add those? > > So... prog_id is the problem. The program can be changed while we are > enqueueing packets to the bulk queue, so the prog_id at flush may be an > unexpected one. Hmmm... that sounds problematic, if the XDP bpf_prog for veth can change underneath, before the flush. Our redirect system depends on things being stable until the xdp_do_flush_map() operation, as we will e.g. set the per-CPU (bpf_redirect_info) map_to_flush pointer (which depends on the XDP prog), and expect it to be correct/valid. > It can be fixed by disabling NAPI when changing XDP programs.
This stops > packet processing while changing XDP programs, but I guess it is an > acceptable compromise. Having said that, I'm honestly not so eager to > make this change, since this will require refurbishment of one of the > most delicate part of veth XDP, NAPI disabling/enabling mechanism. > > WDYT? Sounds like a bug if the XDP bpf_prog is not stable within the NAPI poll... > >> + __field(int, ifindex) > >> + __field(int, drops) > >> + __field(int, sent) > >> + __field(int, err) > >> + ), > > > > The reason is that this make is easier to attach to multiple > > tracepoints, and extract the same value. > > > > Example with bpftrace oneliner: > > > > $ sudo bpftrace -e 'tracepoint:xdp:xdp_* { @action[args->act] = count(); }' -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH v2 bpf-next 1/2] xdp: Add tracepoint for bulk XDP_TX
On Fri, 7 Jun 2019 11:22:00 +0900 Toshiaki Makita wrote: > On 2019/06/07 4:41, Jesper Dangaard Brouer wrote: > > On Thu, 6 Jun 2019 20:04:20 +0900 > > Toshiaki Makita wrote: > > > >> On 2019/06/05 16:59, Jesper Dangaard Brouer wrote: > >>> On Wed, 5 Jun 2019 14:36:12 +0900 > >>> Toshiaki Makita wrote: > >>> [...] > >> > >> So... prog_id is the problem. The program can be changed while we are > >> enqueueing packets to the bulk queue, so the prog_id at flush may be an > >> unexpected one. > > > > Hmmm... that sounds problematic, if the XDP bpf_prog for veth can > > change underneath, before the flush. Our redirect system, depend on > > things being stable until the xdp_do_flush_map() operation, as will > > e.g. set per-CPU (bpf_redirect_info) map_to_flush pointer (which depend > > on XDP prog), and expect it to be correct/valid. > > Sorry, I don't get how maps depend on programs. BPF/XDP programs have a reference count on the map (e.g. used for redirect) and when the XDP program is removed and the last refcnt for the map is reached, then the map is also removed (redirect maps do a call_rcu on shutdown). > At least xdp_do_redirect_map() handles map_to_flush change during NAPI. > Is there a problem when the map is not changed but the program is changed? > Also I believe this is not veth-specific behavior. Looking at tun and > i40e, they seem to change xdp_prog without stopping data path. I guess this could actually happen, but we are "saved" because the 'map_to_flush' pointer is still valid due to RCU protection. But it does look fishy, as our rcu_read_lock sections do not encapsulate this. There is an RCU read section in veth_xdp_rcv_skb(), which can call xdp_do_redirect(), which sets the per-CPU ri->map_to_flush. Do we get this protection by running under softirq, and does this prevent an RCU grace-period (call_rcu callbacks) from happening?
(between veth_xdp_rcv_skb() and xdp_do_flush_map() in veth_poll()) To Toshiaki, regarding your patch 2/2, you are not affected by this per-CPU map storing, as you pass along the bulk-queue. I do see your point that prog_id could change. Could you change the tracepoint to include the 'act' and place 'ifindex' above this in the struct? This way the 'act' member is in the same location/offset as other XDP tracepoints. I see the 'ifindex' as the identifier for this tracepoint (others have map_id or prog_id in this location). -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH bpf-next v4 3/6] xdp: Add devmap_hash map type for looking up devices by hashed index
On Mon, 22 Jul 2019 13:52:48 +0200 Toke Høiland-Jørgensen wrote: > +static inline struct hlist_head *dev_map_index_hash(struct bpf_dtab *dtab, > + int idx) > +{ > + return &dtab->dev_index_head[idx & (NETDEV_HASHENTRIES - 1)]; > +} It is good for performance that our "hash" function is simply an AND operation on the idx. We want to keep it this way. I don't like that you are using NETDEV_HASHENTRIES, because the BPF map infrastructure already have a way to specify the map size (struct bpf_map_def .max_entries). BUT for performance reasons, to keep the AND operation, we would need to round up the hash-array size to nearest power of 2 (or reject if user didn't specify a power of 2, if we want to "expose" this limit to users). > +struct bpf_dtab_netdev *__dev_map_hash_lookup_elem(struct bpf_map *map, u32 > key) > +{ > + struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map); > + struct hlist_head *head = dev_map_index_hash(dtab, key); > + struct bpf_dtab_netdev *dev; > + > + hlist_for_each_entry_rcu(dev, head, index_hlist) > + if (dev->idx == key) > + return dev; > + > + return NULL; > +} -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH bpf-next v4 3/6] xdp: Add devmap_hash map type for looking up devices by hashed index
On Thu, 25 Jul 2019 12:32:19 +0200 Toke Høiland-Jørgensen wrote: > Jesper Dangaard Brouer writes: > > > On Mon, 22 Jul 2019 13:52:48 +0200 > > Toke Høiland-Jørgensen wrote: > > > >> +static inline struct hlist_head *dev_map_index_hash(struct bpf_dtab *dtab, > >> + int idx) > >> +{ > >> + return &dtab->dev_index_head[idx & (NETDEV_HASHENTRIES - 1)]; > >> +} > > > > It is good for performance that our "hash" function is simply an AND > > operation on the idx. We want to keep it this way. > > > > I don't like that you are using NETDEV_HASHENTRIES, because the BPF map > > infrastructure already have a way to specify the map size (struct > > bpf_map_def .max_entries). BUT for performance reasons, to keep the > > AND operation, we would need to round up the hash-array size to nearest > > power of 2 (or reject if user didn't specify a power of 2, if we want > > to "expose" this limit to users). > > But do we really want the number of hash buckets to be equal to the max > number of entries? The values are not likely to be evenly distributed, > so we'll end up with big buckets if the number is small, meaning we'll > blow performance on walking long lists in each bucket. The requested change makes it user-configurable, instead of fixed 256 entries. I've seen production use-cases with >5000 net_devices, thus they need a knob to increase this (to avoid the list walking as you mention). > Also, if the size is dynamic the size needs to be loaded from memory > instead of being a compile-time constant, which will presumably hurt > performance (though not sure by how much)? To counter this, the mask value, which needs to be loaded from memory, needs to be placed next to some other struct member which is already in use (at least on the same cacheline; Intel has some 16-byte access micro-optimizations, which I've never been able to measure, as it's in the 0.5 nanosec scale).
-- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
[PATCH net-next] MAINTAINERS: Remove mailing-list entry for XDP (eXpress Data Path)
This removes the mailing list xdp-newb...@vger.kernel.org from the XDP kernel maintainers entry. Being in the kernel MAINTAINERS file successfully caused the list to receive kbuild bot warnings, syzbot reports and sometimes developer patches. The level of details in these messages, doesn't match the target audience of the XDP-newbies list. This is based on a survey on the mailing list, where 73% voted for removal from MAINTAINERS file. Signed-off-by: Jesper Dangaard Brouer --- MAINTAINERS |1 - 1 file changed, 1 deletion(-) diff --git a/MAINTAINERS b/MAINTAINERS index 9cc156c58f0c..45cb4237eddc 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -17560,7 +17560,6 @@ M: Jakub Kicinski M: Jesper Dangaard Brouer M: John Fastabend L: netdev@vger.kernel.org -L: xdp-newb...@vger.kernel.org L: b...@vger.kernel.org S: Supported F: net/core/xdp.c
Re: [PATCH bpf-next v5 1/6] include/bpf.h: Remove map_insert_ctx() stubs
On Fri, 26 Jul 2019 18:06:52 +0200 Toke Høiland-Jørgensen wrote: > From: Toke Høiland-Jørgensen > > When we changed the device and CPU maps to use linked lists instead of > bitmaps, we also removed the need for the map_insert_ctx() helpers to keep > track of the bitmaps inside each map. However, it seems I forgot to remove > the function definitions stubs, so remove those here. > > Signed-off-by: Toke Høiland-Jørgensen > Acked-by: Yonghong Song Acked-by: Jesper Dangaard Brouer -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH bpf-next v5 2/6] xdp: Refactor devmap allocation code for reuse
On Fri, 26 Jul 2019 18:06:53 +0200 Toke Høiland-Jørgensen wrote: > From: Toke Høiland-Jørgensen > > The subsequent patch to add a new devmap sub-type can re-use much of the > initialisation and allocation code, so refactor it into separate functions. > > Signed-off-by: Toke Høiland-Jørgensen > Acked-by: Yonghong Song Acked-by: Jesper Dangaard Brouer -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH bpf-next v5 3/6] xdp: Add devmap_hash map type for looking up devices by hashed index
On Fri, 26 Jul 2019 18:06:55 +0200 Toke Høiland-Jørgensen wrote: > From: Toke Høiland-Jørgensen > > A common pattern when using xdp_redirect_map() is to create a device map > where the lookup key is simply ifindex. Because device maps are arrays, > this leaves holes in the map, and the map has to be sized to fit the > largest ifindex, regardless of how many devices actually are actually > needed in the map. > > This patch adds a second type of device map where the key is looked up > using a hashmap, instead of being used as an array index. This allows maps > to be densely packed, so they can be smaller. > > Signed-off-by: Toke Høiland-Jørgensen > Acked-by: Yonghong Song Acked-by: Jesper Dangaard Brouer [...] > +static inline struct hlist_head *dev_map_index_hash(struct bpf_dtab *dtab, > + int idx) > +{ > + return &dtab->dev_index_head[idx & (dtab->n_buckets - 1)]; > +} I was about to complain that you are not using a pre-calculated MASK value, and are instead doing the -1 operation each time. But I looked at the ASM code, and the LEA operation used does the -1 operation in the same instruction, so I guess this makes no performance difference. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH bpf-next v5 4/6] tools/include/uapi: Add devmap_hash BPF map type
On Fri, 26 Jul 2019 18:06:56 +0200 Toke Høiland-Jørgensen wrote: > From: Toke Høiland-Jørgensen > > This adds the devmap_hash BPF map type to the uapi headers in tools/. > > Signed-off-by: Toke Høiland-Jørgensen > Acked-by: Yonghong Song Acked-by: Jesper Dangaard Brouer -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH bpf-next v5 5/6] tools/libbpf_probes: Add new devmap_hash type
On Fri, 26 Jul 2019 18:06:57 +0200 Toke Høiland-Jørgensen wrote: > From: Toke Høiland-Jørgensen > > This adds the definition for BPF_MAP_TYPE_DEVMAP_HASH to libbpf_probes.c in > tools/lib/bpf. > > Signed-off-by: Toke Høiland-Jørgensen > Acked-by: Yonghong Song Acked-by: Jesper Dangaard Brouer http://www.linkedin.com/in/brouer
Re: [PATCH bpf-next v5 6/6] tools: Add definitions for devmap_hash map type
On Fri, 26 Jul 2019 18:06:58 +0200 Toke Høiland-Jørgensen wrote: > From: Toke Høiland-Jørgensen > > This adds selftest and bpftool updates for the devmap_hash map type. > > Signed-off-by: Toke Høiland-Jørgensen Acked-by: Jesper Dangaard Brouer -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH 0/2] tools: bpftool: add net (un)load command to load XDP
version (in /usr/local) it was not first in $PATH for root. Load order also matters, see issue#3. [3] https://github.com/xdp-project/xdp-cpumap-tc/blob/master/src/common_user.c#L381-L493 (Issue#3) Next, to get TC and XDP programs to cooperate, pinned maps via bpffs are used. And iproute2 dictates that pinned maps are located in directory /sys/fs/bpf/tc/globals/. What could go wrong, it's only a static dir path. Sub-problem(1): If XDP loads _before_ any tc-bpf cmd, then the subdirs are not created, leading to replicating mkdir creation in C[4]. Else the XDP load will fail. (Troubleshooting this was complicated by [4] https://github.com/xdp-project/xdp-cpumap-tc/commit/25e7e56699cd75a4a Sub-problem(2): We really want to load XDP first, because libbpf creates maps in a better way, e.g. with "name" (and BTF info). The "name" part was needed by libbpf to find a given map (to avoid depending on the order maps are defined in C/ELF file) via helper bpf_object__find_map_fd_by_name(). To handle the TC noname case, I had to code up this workaround[5], which depends on extracting the name of the pinned file. [5] https://github.com/xdp-project/xdp-cpumap-tc/blob/master/src/xdp_iphash_to_cpu_user.c#L117-L131 Experience/conclusion: Getting XDP and TC to cooperate sucks, primarily because iproute2 tc is based on a separate BPF-ELF loader, whose features are not in sync / lagging behind. Calling cmdline binaries from C also sucks, and I would prefer some libbpf TC attach function, but most pain comes from the slight differences between ELF-loaders. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH 1/2] tools: bpftool: add net load command to load XDP on interface
On Wed, 31 Jul 2019 03:48:20 +0900 "Daniel T. Lee" wrote: > diff --git a/tools/bpf/bpftool/net.c b/tools/bpf/bpftool/net.c > index 67e99c56bc88..d3a4f18b5b95 100644 > --- a/tools/bpf/bpftool/net.c > +++ b/tools/bpf/bpftool/net.c > @@ -55,6 +55,35 @@ struct bpf_attach_info { > __u32 flow_dissector_id; > }; > > +enum net_load_type { > + NET_LOAD_TYPE_XDP, > + NET_LOAD_TYPE_XDP_GENERIC, > + NET_LOAD_TYPE_XDP_DRIVE, Why "DRIVE" ? Why not "DRIVER" ? > + NET_LOAD_TYPE_XDP_OFFLOAD, > + __MAX_NET_LOAD_TYPE > +}; > + > +static const char * const load_type_strings[] = { > + [NET_LOAD_TYPE_XDP] = "xdp", > + [NET_LOAD_TYPE_XDP_GENERIC] = "xdpgeneric", > + [NET_LOAD_TYPE_XDP_DRIVE] = "xdpdrv", > + [NET_LOAD_TYPE_XDP_OFFLOAD] = "xdpoffload", > + [__MAX_NET_LOAD_TYPE] = NULL, > +}; -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH 1/2] tools: bpftool: add net load command to load XDP on interface
On Wed, 31 Jul 2019 03:48:20 +0900 "Daniel T. Lee" wrote: > By this commit, using `bpftool net load`, user can load XDP prog on > interface. New type of enum 'net_load_type' has been made, as stated at > cover-letter, the meaning of 'load' is, prog will be loaded on interface. Why the keyword "load"? Why not "attach" (and "detach")? For BPF there is a clear distinction between the "load" and "attach" steps. I know this is under subcommand "net", but to follow the convention of other subcommands e.g. "prog" there are both "load" and "attach" commands. > BPF prog will be loaded through libbpf 'bpf_set_link_xdp_fd'. Again this is a "set" operation, not a "load" operation. > Signed-off-by: Daniel T. Lee [...] > static int do_show(int argc, char **argv) > { > struct bpf_attach_info attach_info = {}; > @@ -305,13 +405,17 @@ static int do_help(int argc, char **argv) > > fprintf(stderr, > "Usage: %s %s { show | list } [dev ]\n" > + " %s %s load PROG LOAD_TYPE \n" Does the "PROG" here correspond to the 'bpftool prog' syntax?: PROG := { id PROG_ID | pinned FILE | tag PROG_TAG } > " %s %s help\n" > + "\n" > + " " HELP_SPEC_PROGRAM "\n" > + " LOAD_TYPE := { xdp | xdpgeneric | xdpdrv | xdpoffload > }\n" > "Note: Only xdp and tc attachments are supported now.\n" > " For progs attached to cgroups, use \"bpftool cgroup\"\n" > " to dump program attachments. For program types\n" > " sk_{filter,skb,msg,reuseport} and lwt/seg6, please\n" > " consult iproute2.\n", -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
[net v1 PATCH 2/4] selftests/bpf: add wrapper scripts for test_xdp_vlan.sh
In-order to test both native-XDP (xdpdrv) and generic-XDP (xdpgeneric) create two wrapper test scripts, that start the test_xdp_vlan.sh script with these modes. Signed-off-by: Jesper Dangaard Brouer --- tools/testing/selftests/bpf/Makefile |3 ++- tools/testing/selftests/bpf/test_xdp_vlan.sh |5 - .../selftests/bpf/test_xdp_vlan_mode_generic.sh|9 + .../selftests/bpf/test_xdp_vlan_mode_native.sh |9 + 4 files changed, 24 insertions(+), 2 deletions(-) create mode 100755 tools/testing/selftests/bpf/test_xdp_vlan_mode_generic.sh create mode 100755 tools/testing/selftests/bpf/test_xdp_vlan_mode_native.sh diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile index 3bd0f4a0336a..29001f944db7 100644 --- a/tools/testing/selftests/bpf/Makefile +++ b/tools/testing/selftests/bpf/Makefile @@ -57,7 +57,8 @@ TEST_PROGS := test_kmod.sh \ test_lirc_mode2.sh \ test_skb_cgroup_id.sh \ test_flow_dissector.sh \ - test_xdp_vlan.sh \ + test_xdp_vlan_mode_generic.sh \ + test_xdp_vlan_mode_native.sh \ test_lwt_ip_encap.sh \ test_tcp_check_syncookie.sh \ test_tc_tunnel.sh \ diff --git a/tools/testing/selftests/bpf/test_xdp_vlan.sh b/tools/testing/selftests/bpf/test_xdp_vlan.sh index c8aed63b0ffe..7348661be815 100755 --- a/tools/testing/selftests/bpf/test_xdp_vlan.sh +++ b/tools/testing/selftests/bpf/test_xdp_vlan.sh @@ -2,7 +2,10 @@ # SPDX-License-Identifier: GPL-2.0 # Author: Jesper Dangaard Brouer -TESTNAME=xdp_vlan +# Allow wrapper scripts to name test +if [ -z "$TESTNAME" ]; then +TESTNAME=xdp_vlan +fi # Default XDP mode XDP_MODE=xdpgeneric diff --git a/tools/testing/selftests/bpf/test_xdp_vlan_mode_generic.sh b/tools/testing/selftests/bpf/test_xdp_vlan_mode_generic.sh new file mode 100755 index ..c515326d6d59 --- /dev/null +++ b/tools/testing/selftests/bpf/test_xdp_vlan_mode_generic.sh @@ -0,0 +1,9 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 + +# Exit on failure +set -e + +# Wrapper script to test generic-XDP +export 
TESTNAME=xdp_vlan_mode_generic +./test_xdp_vlan.sh --mode=xdpgeneric diff --git a/tools/testing/selftests/bpf/test_xdp_vlan_mode_native.sh b/tools/testing/selftests/bpf/test_xdp_vlan_mode_native.sh new file mode 100755 index ..5cf7ce1f16c1 --- /dev/null +++ b/tools/testing/selftests/bpf/test_xdp_vlan_mode_native.sh @@ -0,0 +1,9 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 + +# Exit on failure +set -e + +# Wrapper script to test native-XDP +export TESTNAME=xdp_vlan_mode_native +./test_xdp_vlan.sh --mode=xdpdrv
[net v1 PATCH 0/4] net: fix regressions for generic-XDP
Thanks to Brandon Cazander, who wrote a very detailed bug report that even used perf probes, on the xdp-newbies mailing list, we discovered that generic-XDP contains some regressions when using bpf_xdp_adjust_head(). The first issue was that my selftest script, which uses bpf_xdp_adjust_head(), by mistake no longer used generic-XDP. That selftest should have caught the real regression introduced in commit 458bf2f224f0 ("net: core: support XDP generic on stacked devices."). To verify that this patchset fixes the regressions, the tests can be invoked manually via: cd tools/testing/selftests/bpf/ sudo ./test_xdp_vlan_mode_generic.sh sudo ./test_xdp_vlan_mode_native.sh Link: https://www.spinics.net/lists/xdp-newbies/msg01231.html Fixes: 458bf2f224f0 ("net: core: support XDP generic on stacked devices.") Reported-by: Brandon Cazander Signed-off-by: Jesper Dangaard Brouer --- Jesper Dangaard Brouer (4): bpf: fix XDP vlan selftests test_xdp_vlan.sh selftests/bpf: add wrapper scripts for test_xdp_vlan.sh selftests/bpf: reduce time to execute test_xdp_vlan.sh net: fix bpf_xdp_adjust_head regression for generic-XDP net/core/dev.c | 15 - tools/testing/selftests/bpf/Makefile |3 + tools/testing/selftests/bpf/test_xdp_vlan.sh | 57 .../selftests/bpf/test_xdp_vlan_mode_generic.sh|9 +++ .../selftests/bpf/test_xdp_vlan_mode_native.sh |9 +++ 5 files changed, 75 insertions(+), 18 deletions(-) create mode 100755 tools/testing/selftests/bpf/test_xdp_vlan_mode_generic.sh create mode 100755 tools/testing/selftests/bpf/test_xdp_vlan_mode_native.sh --
[net v1 PATCH 1/4] bpf: fix XDP vlan selftests test_xdp_vlan.sh
Change the BPF selftest test_xdp_vlan.sh to (by default) use generic XDP. This selftest was created together with a fix for generic XDP, in commit 297249569932 ("net: fix generic XDP to handle if eth header was mangled"), and it was supposed to catch generic XDP being broken again. The tests use veth and assumed that the veth driver didn't support native driver XDP, thus they used the (ip link set) 'xdp' attach mode, which fell back to generic-XDP. But veth gained native-XDP support in commit 948d4f214fde ("veth: Add driver XDP"), which caused this test script to use native-XDP. Fixes: 948d4f214fde ("veth: Add driver XDP") Fixes: 97396ff0bc2d ("selftests/bpf: add XDP selftests for modifying and popping VLAN headers") Signed-off-by: Jesper Dangaard Brouer --- tools/testing/selftests/bpf/test_xdp_vlan.sh | 42 ++ 1 file changed, 36 insertions(+), 6 deletions(-) diff --git a/tools/testing/selftests/bpf/test_xdp_vlan.sh b/tools/testing/selftests/bpf/test_xdp_vlan.sh index 51a3a31d1aac..c8aed63b0ffe 100755 --- a/tools/testing/selftests/bpf/test_xdp_vlan.sh +++ b/tools/testing/selftests/bpf/test_xdp_vlan.sh @@ -1,7 +1,12 @@ #!/bin/bash +# SPDX-License-Identifier: GPL-2.0 +# Author: Jesper Dangaard Brouer TESTNAME=xdp_vlan +# Default XDP mode +XDP_MODE=xdpgeneric + usage() { echo "Testing XDP + TC eBPF VLAN manipulations: $TESTNAME" echo "" @@ -9,9 +14,23 @@ usage() { echo " -v | --verbose : Verbose" echo " --flush: Flush before starting (e.g. after --interactive)" echo " --interactive : Keep netns setup running after test-run" + echo " --mode=XXX : Choose XDP mode (xdp | xdpgeneric | xdpdrv)" echo "" } +valid_xdp_mode() +{ + local mode=$1 + + case "$mode" in + xdpgeneric | xdpdrv | xdp) + return 0 + ;; + *) + return 1 + esac +} + cleanup() { local status=$? @@ -37,7 +56,7 @@ cleanup() # Using external program "getopt" to get --long-options OPTIONS=$(getopt -o hvfi: \ ---long verbose,flush,help,interactive,debug -- "$@") +--long verbose,flush,help,interactive,debug,mode: -- "$@") if (( $?
!= 0 )); then usage echo "selftests: $TESTNAME [FAILED] Error calling getopt, unknown option?" @@ -60,6 +79,11 @@ while true; do cleanup shift ;; + --mode ) + shift + XDP_MODE=$1 + shift + ;; -- ) shift break @@ -81,8 +105,14 @@ if [ "$EUID" -ne 0 ]; then exit 1 fi -ip link set dev lo xdp off 2>/dev/null > /dev/null -if [ $? -ne 0 ];then +valid_xdp_mode $XDP_MODE +if [ $? -ne 0 ]; then + echo "selftests: $TESTNAME [FAILED] unknown XDP mode ($XDP_MODE)" + exit 1 +fi + +ip link set dev lo xdpgeneric off 2>/dev/null > /dev/null +if [ $? -ne 0 ]; then echo "selftests: $TESTNAME [SKIP] need ip xdp support" exit 0 fi @@ -166,7 +196,7 @@ export FILE=test_xdp_vlan.o # First test: Remove VLAN by setting VLAN ID 0, using "xdp_vlan_change" export XDP_PROG=xdp_vlan_change -ip netns exec ns1 ip link set $DEVNS1 xdp object $FILE section $XDP_PROG +ip netns exec ns1 ip link set $DEVNS1 $XDP_MODE object $FILE section $XDP_PROG # In ns1: egress use TC to add back VLAN tag 4011 # (del cmd) @@ -187,8 +217,8 @@ ip netns exec ns1 ping -W 2 -c 3 $IPADDR2 # ETH_P_8021Q indication, and this cause overwriting of our changes. # export XDP_PROG=xdp_vlan_remove_outer2 -ip netns exec ns1 ip link set $DEVNS1 xdp off -ip netns exec ns1 ip link set $DEVNS1 xdp object $FILE section $XDP_PROG +ip netns exec ns1 ip link set $DEVNS1 $XDP_MODE off +ip netns exec ns1 ip link set $DEVNS1 $XDP_MODE object $FILE section $XDP_PROG # Now the namespaces should still be able reach each-other, test with ping: ip netns exec ns2 ping -W 2 -c 3 $IPADDR1
[net v1 PATCH 4/4] net: fix bpf_xdp_adjust_head regression for generic-XDP
When generic-XDP was moved to a later processing step by commit 458bf2f224f0 ("net: core: support XDP generic on stacked devices.") a regression was introduced when using bpf_xdp_adjust_head. The issue is that after this commit the skb->network_header is now changed prior to calling generic XDP and not after. Thus, if the header is changed by XDP (via bpf_xdp_adjust_head), then skb->network_header also needs to be updated again. Fix this by calling skb_reset_network_header(). Fixes: 458bf2f224f0 ("net: core: support XDP generic on stacked devices.") Reported-by: Brandon Cazander Signed-off-by: Jesper Dangaard Brouer --- net/core/dev.c | 15 ++- 1 file changed, 10 insertions(+), 5 deletions(-) diff --git a/net/core/dev.c b/net/core/dev.c index fc676b2610e3..b5533795c3c1 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -4374,12 +4374,17 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb, act = bpf_prog_run_xdp(xdp_prog, xdp); + /* check if bpf_xdp_adjust_head was used */ off = xdp->data - orig_data; - if (off > 0) - __skb_pull(skb, off); - else if (off < 0) - __skb_push(skb, -off); - skb->mac_header += off; + if (off) { + if (off > 0) + __skb_pull(skb, off); + else if (off < 0) + __skb_push(skb, -off); + + skb->mac_header += off; + skb_reset_network_header(skb); + } /* check if bpf_xdp_adjust_tail was used. it can only "shrink" * pckt.
[net v1 PATCH 3/4] selftests/bpf: reduce time to execute test_xdp_vlan.sh
Given the increasing number of BPF selftests, it makes sense to reduce the time to execute these tests. The ping parameters are adjusted to reduce the time from a measured 9 sec to approx 2.8 sec. Signed-off-by: Jesper Dangaard Brouer --- tools/testing/selftests/bpf/test_xdp_vlan.sh | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/tools/testing/selftests/bpf/test_xdp_vlan.sh b/tools/testing/selftests/bpf/test_xdp_vlan.sh index 7348661be815..bb8b0da91686 100755 --- a/tools/testing/selftests/bpf/test_xdp_vlan.sh +++ b/tools/testing/selftests/bpf/test_xdp_vlan.sh @@ -188,7 +188,7 @@ ip netns exec ns2 ip link set lo up # At this point, the hosts cannot reach each-other, # because ns2 are using VLAN tags on the packets. -ip netns exec ns2 sh -c 'ping -W 1 -c 1 100.64.41.1 || echo "Okay ping fails"' +ip netns exec ns2 sh -c 'ping -W 1 -c 1 100.64.41.1 || echo "Success: First ping must fail"' # Now we can use the test_xdp_vlan.c program to pop/push these VLAN tags @@ -210,8 +210,8 @@ ip netns exec ns1 tc filter add dev $DEVNS1 egress \ prio 1 handle 1 bpf da obj $FILE sec tc_vlan_push # Now the namespaces can reach each-other, test with ping: -ip netns exec ns2 ping -W 2 -c 3 $IPADDR1 -ip netns exec ns1 ping -W 2 -c 3 $IPADDR2 +ip netns exec ns2 ping -i 0.2 -W 2 -c 2 $IPADDR1 +ip netns exec ns1 ping -i 0.2 -W 2 -c 2 $IPADDR2 # Second test: Replace xdp prog, that fully remove vlan header # @@ -224,5 +224,5 @@ ip netns exec ns1 ip link set $DEVNS1 $XDP_MODE off ip netns exec ns1 ip link set $DEVNS1 $XDP_MODE object $FILE section $XDP_PROG # Now the namespaces should still be able reach each-other, test with ping: -ip netns exec ns2 ping -W 2 -c 3 $IPADDR1 -ip netns exec ns1 ping -W 2 -c 3 $IPADDR2 +ip netns exec ns2 ping -i 0.2 -W 2 -c 2 $IPADDR1 +ip netns exec ns1 ping -i 0.2 -W 2 -c 2 $IPADDR2
Re: [net v1 PATCH 4/4] net: fix bpf_xdp_adjust_head regression for generic-XDP
On Thu, 1 Aug 2019 17:44:06 -0700 Jakub Kicinski wrote: > On Thu, 01 Aug 2019 20:00:31 +0200, Jesper Dangaard Brouer wrote: > > When generic-XDP was moved to a later processing step by commit > > 458bf2f224f0 ("net: core: support XDP generic on stacked devices.") > > a regression was introduced when using bpf_xdp_adjust_head. > > > > The issue is that after this commit the skb->network_header is now > > changed prior to calling generic XDP and not after. Thus, if the header > > is changed by XDP (via bpf_xdp_adjust_head), then skb->network_header > > also need to be updated again. Fix by calling skb_reset_network_header(). > > > > Fixes: 458bf2f224f0 ("net: core: support XDP generic on stacked devices.") > > Reported-by: Brandon Cazander > > Signed-off-by: Jesper Dangaard Brouer > > Out of curiosity what was your conclusion regarding resetting the > transport header as well? Well, I don't know... I need some review, e.g. from Stephen, who changed this. I've added code snippets below my signature to help reviewers (they also help in understanding the paragraph below). I think we perhaps should call skb_reset_transport_header(), as we change skb->data (via either __skb_pull() or __skb_push()), *BUT* I'm not sure it is needed/required, as someone/something afterwards still needs to call skb_set_transport_header(), which also calls skb_reset_transport_header() anyway. I also want to ask reviewers whether we should move the call to skb_reset_mac_len() (which Stephen added) from __netif_receive_skb_core() into netif_receive_generic_xdp() (after the call to skb_reset_network_header()); it would be more natural to keep them together.
--
Best regards,
Jesper Dangaard Brouer

Code snippet

	/* check if bpf_xdp_adjust_head was used */
	off = xdp->data - orig_data;
	if (off) {
		if (off > 0)
			__skb_pull(skb, off);
		else if (off < 0)
			__skb_push(skb, -off);

		skb->mac_header += off;
		skb_reset_network_header(skb);
	}

static inline bool skb_transport_header_was_set(const struct sk_buff *skb)
{
	return skb->transport_header != (typeof(skb->transport_header))~0U;
}

static inline unsigned char *skb_transport_header(const struct sk_buff *skb)
{
	return skb->head + skb->transport_header;
}

static inline void skb_reset_transport_header(struct sk_buff *skb)
{
	skb->transport_header = skb->data - skb->head;
}

static inline void skb_set_transport_header(struct sk_buff *skb, const int offset)
{
	skb_reset_transport_header(skb);
	skb->transport_header += offset;
}
[bpf-next PATCH 0/3] bpf: improvements to xdp_fwd sample
This patchset is focused on improvements for the XDP forwarding sample named xdp_fwd, which leverages the existing FIB routing tables as described in the LPC2018[1] talk by David Ahern. The primary motivation is to illustrate how Toke's recent work improves usability of XDP_REDIRECT via lookups in devmap. The other patches are to help users understand the sample. I have more improvements to xdp_fwd, but those might require changes to libbpf. Thus, I am sending these patches first as they are isolated. [1] http://vger.kernel.org/lpc-networking2018.html#session-1 --- Jesper Dangaard Brouer (3): samples/bpf: xdp_fwd rename devmap name to be xdp_tx_ports samples/bpf: make xdp_fwd more practically usable via devmap lookup samples/bpf: xdp_fwd explain bpf_fib_lookup return codes samples/bpf/xdp_fwd_kern.c | 42 +- samples/bpf/xdp_fwd_user.c | 38 ++ 2 files changed, 59 insertions(+), 21 deletions(-) --
[bpf-next PATCH 2/3] samples/bpf: make xdp_fwd more practically usable via devmap lookup
This addresses the TODO in samples/bpf/xdp_fwd_kern.c, which points out that the chosen egress index should be checked for existence in the devmap. This can now be done by taking advantage of Toke's work in commit 0cdbb4b09a06 ("devmap: Allow map lookups from eBPF"). This change makes xdp_fwd more practically usable, as it allows for a mixed environment where IP-forwarding falls back to the network stack if the egress device isn't configured to use XDP. Signed-off-by: Jesper Dangaard Brouer --- samples/bpf/xdp_fwd_kern.c | 20 ++-- samples/bpf/xdp_fwd_user.c | 36 +--- 2 files changed, 39 insertions(+), 17 deletions(-) diff --git a/samples/bpf/xdp_fwd_kern.c b/samples/bpf/xdp_fwd_kern.c index e6ffc4ea06f4..4a5ad381ed2a 100644 --- a/samples/bpf/xdp_fwd_kern.c +++ b/samples/bpf/xdp_fwd_kern.c @@ -104,13 +104,21 @@ static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags) rc = bpf_fib_lookup(ctx, &fib_params, sizeof(fib_params), flags); - /* verify egress index has xdp support -* TO-DO bpf_map_lookup_elem(&tx_port, &key) fails with -* cannot pass map_type 14 into func bpf_map_lookup_elem#1: -* NOTE: without verification that egress index supports XDP -* forwarding packets are dropped. -*/ if (rc == 0) { + int *val; + + /* Verify egress index has been configured as TX-port. +* (Note: User can still have inserted an egress ifindex that +* doesn't support XDP xmit, which will result in packet drops). +* +* Note: lookup in devmap supported since 0cdbb4b09a0.
+* If not supported will fail with: +* cannot pass map_type 14 into func bpf_map_lookup_elem#1: +*/ + val = bpf_map_lookup_elem(&tx_port, &fib_params.ifindex); + if (!val) + return XDP_PASS; + if (h_proto == htons(ETH_P_IP)) ip_decrease_ttl(iph); else if (h_proto == htons(ETH_P_IPV6)) diff --git a/samples/bpf/xdp_fwd_user.c b/samples/bpf/xdp_fwd_user.c index ba012d9f93dd..20951bc27477 100644 --- a/samples/bpf/xdp_fwd_user.c +++ b/samples/bpf/xdp_fwd_user.c @@ -27,14 +27,20 @@ #include "libbpf.h" #include - -static int do_attach(int idx, int fd, const char *name) +static int do_attach(int idx, int prog_fd, int map_fd, const char *name) { int err; - err = bpf_set_link_xdp_fd(idx, fd, 0); - if (err < 0) + err = bpf_set_link_xdp_fd(idx, prog_fd, 0); + if (err < 0) { printf("ERROR: failed to attach program to %s\n", name); + return err; + } + + /* Adding ifindex as a possible egress TX port */ + err = bpf_map_update_elem(map_fd, &idx, &idx, 0); + if (err) + printf("ERROR: failed using device %s as TX-port\n", name); return err; } @@ -47,6 +53,9 @@ static int do_detach(int idx, const char *name) if (err < 0) printf("ERROR: failed to detach program from %s\n", name); + /* TODO: Remember to cleanup map, when adding use of shared map +* bpf_map_delete_elem((map_fd, &idx); +*/ return err; } @@ -67,10 +76,10 @@ int main(int argc, char **argv) }; const char *prog_name = "xdp_fwd"; struct bpf_program *prog; + int prog_fd, map_fd = -1; char filename[PATH_MAX]; struct bpf_object *obj; int opt, i, idx, err; - int prog_fd, map_fd; int attach = 1; int ret = 0; @@ -103,8 +112,17 @@ int main(int argc, char **argv) return 1; } - if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd)) + err = bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd); + if (err) { + if (err == -22) { + printf("Does kernel support devmap lookup?\n"); + /* If not, the error message will be: +* "cannot pass map_type 14 into func +* bpf_map_lookup_elem#1" +*/ + } return 1; + } prog = 
bpf_object__find_program_by_title(obj, prog_name); prog_fd = bpf_program__fd(prog); @@ -119,10 +137,6 @@ int main(int argc, char **argv) return 1; } } - if (attach) { - for (i = 1; i < 64; ++i) - bpf_map_update_elem(map_fd, &i, &i, 0); - }
[bpf-next PATCH 1/3] samples/bpf: xdp_fwd rename devmap name to be xdp_tx_ports
The devmap name 'tx_port' came from a copy-paste from xdp_redirect_map, which only has a single TX port. Change the name to xdp_tx_ports to make it more descriptive. Signed-off-by: Jesper Dangaard Brouer --- samples/bpf/xdp_fwd_kern.c |5 +++-- samples/bpf/xdp_fwd_user.c |2 +- 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/samples/bpf/xdp_fwd_kern.c b/samples/bpf/xdp_fwd_kern.c index a7e94e7ff87d..e6ffc4ea06f4 100644 --- a/samples/bpf/xdp_fwd_kern.c +++ b/samples/bpf/xdp_fwd_kern.c @@ -23,7 +23,8 @@ #define IPV6_FLOWINFO_MASK cpu_to_be32(0x0FFF) -struct bpf_map_def SEC("maps") tx_port = { +/* For TX-traffic redirect requires net_device ifindex to be in this devmap */ +struct bpf_map_def SEC("maps") xdp_tx_ports = { .type = BPF_MAP_TYPE_DEVMAP, .key_size = sizeof(int), .value_size = sizeof(int), @@ -117,7 +118,7 @@ static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags) memcpy(eth->h_dest, fib_params.dmac, ETH_ALEN); memcpy(eth->h_source, fib_params.smac, ETH_ALEN); - return bpf_redirect_map(&tx_port, fib_params.ifindex, 0); + return bpf_redirect_map(&xdp_tx_ports, fib_params.ifindex, 0); } return XDP_PASS; diff --git a/samples/bpf/xdp_fwd_user.c b/samples/bpf/xdp_fwd_user.c index 5b46ee12c696..ba012d9f93dd 100644 --- a/samples/bpf/xdp_fwd_user.c +++ b/samples/bpf/xdp_fwd_user.c @@ -113,7 +113,7 @@ int main(int argc, char **argv) return 1; } map_fd = bpf_map__fd(bpf_object__find_map_by_name(obj, - "tx_port")); + "xdp_tx_ports")); if (map_fd < 0) { printf("map not found: %s\n", strerror(map_fd)); return 1;
[bpf-next PATCH 3/3] samples/bpf: xdp_fwd explain bpf_fib_lookup return codes
Make it clear that this XDP program depends on the network stack to do the ARP resolution. This is connected with the BPF_FIB_LKUP_RET_NO_NEIGH return code from bpf_fib_lookup(). Another common mistake (seen via XDP-tutorial) is that users don't realize that the sysctl net.ipv{4,6}.conf.all.forwarding setting is honored by bpf_fib_lookup. Reported-by: Anton Protopopov Signed-off-by: Jesper Dangaard Brouer --- samples/bpf/xdp_fwd_kern.c | 19 +-- 1 file changed, 17 insertions(+), 2 deletions(-) diff --git a/samples/bpf/xdp_fwd_kern.c b/samples/bpf/xdp_fwd_kern.c index 4a5ad381ed2a..df11163454e7 100644 --- a/samples/bpf/xdp_fwd_kern.c +++ b/samples/bpf/xdp_fwd_kern.c @@ -103,8 +103,23 @@ static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags) fib_params.ifindex = ctx->ingress_ifindex; rc = bpf_fib_lookup(ctx, &fib_params, sizeof(fib_params), flags); - - if (rc == 0) { + /* +* Some rc (return codes) from bpf_fib_lookup() are important, +* to understand how this XDP-prog interacts with network stack. +* +* BPF_FIB_LKUP_RET_NO_NEIGH: +* Even if route lookup was a success, then the MAC-addresses are also +* needed. This is obtained from arp/neighbour table, but if table is +* (still) empty then BPF_FIB_LKUP_RET_NO_NEIGH is returned. To avoid +* doing ARP lookup directly from XDP, then send packet to normal +* network stack via XDP_PASS and expect it will do ARP resolution. +* +* BPF_FIB_LKUP_RET_FWD_DISABLED: +* The bpf_fib_lookup respect sysctl net.ipv{4,6}.conf.all.forwarding +* setting, and will return BPF_FIB_LKUP_RET_FWD_DISABLED if not +* enabled this on ingress device. +*/ + if (rc == BPF_FIB_LKUP_RET_SUCCESS) { /* Verify egress index has been configured as TX-port. * (Note: User can still have inserted an egress ifindex that * doesn't support XDP xmit, which will result in packet drops).
Re: [bpf-next PATCH 2/3] samples/bpf: make xdp_fwd more practically usable via devmap lookup
On Wed, 7 Aug 2019 11:04:17 -0700 Y Song wrote: > On Wed, Aug 7, 2019 at 5:37 AM Jesper Dangaard Brouer > wrote: > > > > This address the TODO in samples/bpf/xdp_fwd_kern.c, which points out > > that the chosen egress index should be checked for existence in the > > devmap. This can now be done via taking advantage of Toke's work in > > commit 0cdbb4b09a06 ("devmap: Allow map lookups from eBPF"). > > > > This change makes xdp_fwd more practically usable, as this allows for > > a mixed environment, where IP-forwarding fallback to network stack, if > > the egress device isn't configured to use XDP. > > > > Signed-off-by: Jesper Dangaard Brouer > > --- > > samples/bpf/xdp_fwd_kern.c | 20 ++-- > > samples/bpf/xdp_fwd_user.c | 36 +--- > > 2 files changed, 39 insertions(+), 17 deletions(-) > > > > diff --git a/samples/bpf/xdp_fwd_kern.c b/samples/bpf/xdp_fwd_kern.c > > index e6ffc4ea06f4..4a5ad381ed2a 100644 > > --- a/samples/bpf/xdp_fwd_kern.c > > +++ b/samples/bpf/xdp_fwd_kern.c > > @@ -104,13 +104,21 @@ static __always_inline int xdp_fwd_flags(struct > > xdp_md *ctx, u32 flags) > > > > rc = bpf_fib_lookup(ctx, &fib_params, sizeof(fib_params), flags); > > > > - /* verify egress index has xdp support > > -* TO-DO bpf_map_lookup_elem(&tx_port, &key) fails with > > -* cannot pass map_type 14 into func bpf_map_lookup_elem#1: > > -* NOTE: without verification that egress index supports XDP > > -* forwarding packets are dropped. > > -*/ > > if (rc == 0) { > > + int *val; > > + > > + /* Verify egress index has been configured as TX-port. > > +* (Note: User can still have inserted an egress ifindex > > that > > +* doesn't support XDP xmit, which will result in packet > > drops). > > +* > > +* Note: lookup in devmap supported since 0cdbb4b09a0. > > +* If not supported will fail with: > > +* cannot pass map_type 14 into func bpf_map_lookup_elem#1: > > +*/ > > + val = bpf_map_lookup_elem(&tx_port, &fib_params.ifindex); > > It should be "xdp_tx_ports". 
Otherwise, you will have compilation errors. Ups. This happened in my rebase, where I moved the rename patch [1/3] before this one. Thanks for catching this! > > + if (!val) > > + return XDP_PASS; > > Also, maybe we can do > if (!bpf_map_lookup_elem(&tx_port, &fib_params.ifindex)) > return XDP_PASS; > so we do not need to define val at all. I had it this way, because I also checked the contents (of the pointer *val) to check if this was the correct ifindex, but I removed that check again (as user side always insert correctly). So, I guess I could take your suggestion now. > > + > > if (h_proto == htons(ETH_P_IP)) > > ip_decrease_ttl(iph); > > else if (h_proto == htons(ETH_P_IPV6)) > > diff --git a/samples/bpf/xdp_fwd_user.c b/samples/bpf/xdp_fwd_user.c > > index ba012d9f93dd..20951bc27477 100644 > > --- a/samples/bpf/xdp_fwd_user.c > > +++ b/samples/bpf/xdp_fwd_user.c > > @@ -27,14 +27,20 @@ > > #include "libbpf.h" > > #include > > > > - > > -static int do_attach(int idx, int fd, const char *name) > > +static int do_attach(int idx, int prog_fd, int map_fd, const char > > *name) { > > int err; > > > > - err = bpf_set_link_xdp_fd(idx, fd, 0); > > - if (err < 0) > > + err = bpf_set_link_xdp_fd(idx, prog_fd, 0); > > + if (err < 0) { > > printf("ERROR: failed to attach program to %s\n", > > name); > > + return err; > > + } > > + > > + /* Adding ifindex as a possible egress TX port */ > > + err = bpf_map_update_elem(map_fd, &idx, &idx, 0); > > + if (err) > > + printf("ERROR: failed using device %s as > > TX-port\n", name); > > > > return err; > > } > > @@ -47,6 +53,9 @@ static int do_detach(int idx, const char *name) > > if (err < 0) > > printf("ERROR: failed to detach program from %s\n", > > name); > > > > + /* TODO: Remember to cleanup map, when adding use o
Re: [bpf-next PATCH 0/3] bpf: improvements to xdp_fwd sample
On Wed, 7 Aug 2019 15:09:09 -0700 Zvi Effron wrote: > On Wed, Aug 7, 2019 at 6:00 AM Jesper Dangaard Brouer > wrote: > > > > Toke's devmap lookup improvement is first avail in kernel v5.3. > > Thus, not part of XDP-tutorial yet. > > > I probably missed this in an earlier email, but what are Toke's devmap > improvements? Performance? Capability? Toke's devmap and redirect improvements are primarily about usability. Currently, from BPF-context (kernel-side) you cannot read the contents of a devmap (or cpumap or xskmap(AF_XDP)), because for devmap you get the real pointer to the net_device for the ifindex, and we cannot allow you to write/change that from BPF (the kernel would likely crash or become inconsistent). The work-around is to keep a shadow map that contains the "config" of the devmap, which you check/validate against instead. It is just a pain to maintain this shadow map. Toke's change allows you to read the devmap from BPF-context. Thus, you can avoid the shadow map. Another improvement from Toke is that the bpf_redirect_map() helper now also checks whether the redirect index is valid in the map. If not, it returns another value than XDP_REDIRECT. You can choose the alternative return value yourself via "flags", e.g. XDP_PASS. Thus, you don't even need to check/validate the devmap in your BPF-code, as it is part of the bpf_redirect_map() call now. action = bpf_redirect_map(&map, index, flags_as_xdp_value) The default flags value used in most programs today is 0, which maps to XDP_ABORTED. This is sort of a small UAPI change, but for the better. As of today, the packet is dropped later and only diagnosed/seen via the tracepoint xdp:xdp_redirect_map_err. -- Best regards, Jesper Dangaard Brouer
[bpf-next v2 PATCH 3/3] samples/bpf: xdp_fwd explain bpf_fib_lookup return codes
Make it clear that this XDP program depends on the network stack to do the ARP resolution. This is connected with the BPF_FIB_LKUP_RET_NO_NEIGH return code from bpf_fib_lookup(). Another common mistake (seen via XDP-tutorial) is that users don't realize that the sysctl net.ipv{4,6}.conf.all.forwarding setting is honored by bpf_fib_lookup. Reported-by: Anton Protopopov Signed-off-by: Jesper Dangaard Brouer Reviewed-by: David Ahern Acked-by: Yonghong Song --- samples/bpf/xdp_fwd_kern.c | 19 +-- 1 file changed, 17 insertions(+), 2 deletions(-) diff --git a/samples/bpf/xdp_fwd_kern.c b/samples/bpf/xdp_fwd_kern.c index a43d6953c054..701a30f258b1 100644 --- a/samples/bpf/xdp_fwd_kern.c +++ b/samples/bpf/xdp_fwd_kern.c @@ -103,8 +103,23 @@ static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags) fib_params.ifindex = ctx->ingress_ifindex; rc = bpf_fib_lookup(ctx, &fib_params, sizeof(fib_params), flags); - - if (rc == 0) { + /* +* Some rc (return codes) from bpf_fib_lookup() are important, +* to understand how this XDP-prog interacts with network stack. +* +* BPF_FIB_LKUP_RET_NO_NEIGH: +* Even if route lookup was a success, then the MAC-addresses are also +* needed. This is obtained from arp/neighbour table, but if table is +* (still) empty then BPF_FIB_LKUP_RET_NO_NEIGH is returned. To avoid +* doing ARP lookup directly from XDP, then send packet to normal +* network stack via XDP_PASS and expect it will do ARP resolution. +* +* BPF_FIB_LKUP_RET_FWD_DISABLED: +* The bpf_fib_lookup respect sysctl net.ipv{4,6}.conf.all.forwarding +* setting, and will return BPF_FIB_LKUP_RET_FWD_DISABLED if not +* enabled this on ingress device. +*/ + if (rc == BPF_FIB_LKUP_RET_SUCCESS) { /* Verify egress index has been configured as TX-port. * (Note: User can still have inserted an egress ifindex that * doesn't support XDP xmit, which will result in packet drops).
[bpf-next v2 PATCH 2/3] samples/bpf: make xdp_fwd more practically usable via devmap lookup
This addresses the TODO in samples/bpf/xdp_fwd_kern.c, which points out that the chosen egress index should be checked for existence in the devmap. This can now be done by taking advantage of Toke's work in commit 0cdbb4b09a06 ("devmap: Allow map lookups from eBPF"). This change makes xdp_fwd more practically usable, as it allows for a mixed environment where IP-forwarding falls back to the network stack if the egress device isn't configured to use XDP. Signed-off-by: Jesper Dangaard Brouer Reviewed-by: David Ahern --- samples/bpf/xdp_fwd_kern.c | 17 +++-- samples/bpf/xdp_fwd_user.c | 36 +--- 2 files changed, 36 insertions(+), 17 deletions(-) diff --git a/samples/bpf/xdp_fwd_kern.c b/samples/bpf/xdp_fwd_kern.c index e6ffc4ea06f4..a43d6953c054 100644 --- a/samples/bpf/xdp_fwd_kern.c +++ b/samples/bpf/xdp_fwd_kern.c @@ -104,13 +104,18 @@ static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags) rc = bpf_fib_lookup(ctx, &fib_params, sizeof(fib_params), flags); - /* verify egress index has xdp support -* TO-DO bpf_map_lookup_elem(&tx_port, &key) fails with -* cannot pass map_type 14 into func bpf_map_lookup_elem#1: -* NOTE: without verification that egress index supports XDP -* forwarding packets are dropped. -*/ if (rc == 0) { + /* Verify egress index has been configured as TX-port. +* (Note: User can still have inserted an egress ifindex that +* doesn't support XDP xmit, which will result in packet drops). +* +* Note: lookup in devmap supported since 0cdbb4b09a0.
+* If not supported will fail with: +* cannot pass map_type 14 into func bpf_map_lookup_elem#1: +*/ + if (!bpf_map_lookup_elem(&xdp_tx_ports, &fib_params.ifindex)) + return XDP_PASS; + if (h_proto == htons(ETH_P_IP)) ip_decrease_ttl(iph); else if (h_proto == htons(ETH_P_IPV6)) diff --git a/samples/bpf/xdp_fwd_user.c b/samples/bpf/xdp_fwd_user.c index ba012d9f93dd..20951bc27477 100644 --- a/samples/bpf/xdp_fwd_user.c +++ b/samples/bpf/xdp_fwd_user.c @@ -27,14 +27,20 @@ #include "libbpf.h" #include - -static int do_attach(int idx, int fd, const char *name) +static int do_attach(int idx, int prog_fd, int map_fd, const char *name) { int err; - err = bpf_set_link_xdp_fd(idx, fd, 0); - if (err < 0) + err = bpf_set_link_xdp_fd(idx, prog_fd, 0); + if (err < 0) { printf("ERROR: failed to attach program to %s\n", name); + return err; + } + + /* Adding ifindex as a possible egress TX port */ + err = bpf_map_update_elem(map_fd, &idx, &idx, 0); + if (err) + printf("ERROR: failed using device %s as TX-port\n", name); return err; } @@ -47,6 +53,9 @@ static int do_detach(int idx, const char *name) if (err < 0) printf("ERROR: failed to detach program from %s\n", name); + /* TODO: Remember to cleanup map, when adding use of shared map +* bpf_map_delete_elem((map_fd, &idx); +*/ return err; } @@ -67,10 +76,10 @@ int main(int argc, char **argv) }; const char *prog_name = "xdp_fwd"; struct bpf_program *prog; + int prog_fd, map_fd = -1; char filename[PATH_MAX]; struct bpf_object *obj; int opt, i, idx, err; - int prog_fd, map_fd; int attach = 1; int ret = 0; @@ -103,8 +112,17 @@ int main(int argc, char **argv) return 1; } - if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd)) + err = bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd); + if (err) { + if (err == -22) { + printf("Does kernel support devmap lookup?\n"); + /* If not, the error message will be: +* "cannot pass map_type 14 into func +* bpf_map_lookup_elem#1" +*/ + } return 1; + } prog = 
bpf_object__find_program_by_title(obj, prog_name); prog_fd = bpf_program__fd(prog); @@ -119,10 +137,6 @@ int main(int argc, char **argv) return 1; } } - if (attach) { - for (i = 1; i < 64; ++i) - bpf_map_update_elem(map_fd, &i, &i, 0); - } for (i = optind; i < argc; ++i) {
[bpf-next v2 PATCH 0/3] bpf: improvements to xdp_fwd sample
V2: Addressed issues pointed out by Yonghong Song - Please ACK patch 2/3 again - Added ACKs and reviewed-by to other patches This patchset is focused on improvements for the XDP forwarding sample named xdp_fwd, which leverages the existing FIB routing tables as described in the LPC2018[1] talk by David Ahern. The primary motivation is to illustrate how Toke's recent work improves usability of XDP_REDIRECT via lookups in devmap. The other patches are to help users understand the sample. I have more improvements to xdp_fwd, but those might require changes to libbpf. Thus, I am sending these patches first as they are isolated. [1] http://vger.kernel.org/lpc-networking2018.html#session-1 --- Jesper Dangaard Brouer (3): samples/bpf: xdp_fwd rename devmap name to be xdp_tx_ports samples/bpf: make xdp_fwd more practically usable via devmap lookup samples/bpf: xdp_fwd explain bpf_fib_lookup return codes samples/bpf/xdp_fwd_kern.c | 39 ++- samples/bpf/xdp_fwd_user.c | 38 ++ 2 files changed, 56 insertions(+), 21 deletions(-) --
[bpf-next v2 PATCH 1/3] samples/bpf: xdp_fwd rename devmap name to be xdp_tx_ports
The devmap name 'tx_port' came from a copy-paste from xdp_redirect_map which only have a single TX port. Change name to xdp_tx_ports to make it more descriptive. Signed-off-by: Jesper Dangaard Brouer Reviewed-by: David Ahern Acked-by: Yonghong Song --- samples/bpf/xdp_fwd_kern.c |5 +++-- samples/bpf/xdp_fwd_user.c |2 +- 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/samples/bpf/xdp_fwd_kern.c b/samples/bpf/xdp_fwd_kern.c index a7e94e7ff87d..e6ffc4ea06f4 100644 --- a/samples/bpf/xdp_fwd_kern.c +++ b/samples/bpf/xdp_fwd_kern.c @@ -23,7 +23,8 @@ #define IPV6_FLOWINFO_MASK cpu_to_be32(0x0FFF) -struct bpf_map_def SEC("maps") tx_port = { +/* For TX-traffic redirect requires net_device ifindex to be in this devmap */ +struct bpf_map_def SEC("maps") xdp_tx_ports = { .type = BPF_MAP_TYPE_DEVMAP, .key_size = sizeof(int), .value_size = sizeof(int), @@ -117,7 +118,7 @@ static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags) memcpy(eth->h_dest, fib_params.dmac, ETH_ALEN); memcpy(eth->h_source, fib_params.smac, ETH_ALEN); - return bpf_redirect_map(&tx_port, fib_params.ifindex, 0); + return bpf_redirect_map(&xdp_tx_ports, fib_params.ifindex, 0); } return XDP_PASS; diff --git a/samples/bpf/xdp_fwd_user.c b/samples/bpf/xdp_fwd_user.c index 5b46ee12c696..ba012d9f93dd 100644 --- a/samples/bpf/xdp_fwd_user.c +++ b/samples/bpf/xdp_fwd_user.c @@ -113,7 +113,7 @@ int main(int argc, char **argv) return 1; } map_fd = bpf_map__fd(bpf_object__find_map_by_name(obj, - "tx_port")); + "xdp_tx_ports")); if (map_fd < 0) { printf("map not found: %s\n", strerror(map_fd)); return 1;
Re: [bpf-next v2 PATCH 2/3] samples/bpf: make xdp_fwd more practically usable via devmap lookup
On Thu, 08 Aug 2019 17:46:46 +0200 Jesper Dangaard Brouer wrote: > @@ -103,8 +112,17 @@ int main(int argc, char **argv) > return 1; > } > > - if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd)) > + err = bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd); > + if (err) { > + if (err == -22) { Darn - I forgot this, need to be changed. I'll send a V3. > + printf("Does kernel support devmap lookup?\n"); > + /* If not, the error message will be: > + * "cannot pass map_type 14 into func > + * bpf_map_lookup_elem#1" > + */ > + } > return 1; -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
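For readers: the magic constant -22 being flagged above is -EINVAL. A small sketch of doing the comparison via <errno.h> instead of a hard-coded number; handle_load_error() is a hypothetical wrapper, not part of the sample:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical wrapper: compare a bpf_prog_load_xattr()-style negative
 * error against -EINVAL from <errno.h> rather than the magic number -22.
 * Returns 1 on any failure (matching the sample's "return 1" style). */
static int handle_load_error(int err)
{
	if (err == -EINVAL) {
		fprintf(stderr, "Does kernel support devmap lookup?\n");
		return 1;
	}
	return err ? 1 : 0;
}
```

On Linux EINVAL is 22, which is why the hard-coded -22 "worked", but spelling the errno constant out is what the V3 revision needs to do.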
[bpf-next v3 PATCH 0/3] bpf: improvements to xdp_fwd sample
V3: Hopefully fixed all issues pointed out by Yonghong Song V2: Addressed issues pointed out by Yonghong Song - Please ACK patch 2/3 again - Added ACKs and reviewed-by to other patches This patchset is focused on improvements for the XDP forwarding sample named xdp_fwd, which leverages the existing FIB routing tables as described in the LPC2018[1] talk by David Ahern. The primary motivation is to illustrate how Toke's recent work improves usability of XDP_REDIRECT via lookups in devmap. The other patches are to help users understand the sample. I have more improvements to xdp_fwd, but those might require changes to libbpf. Thus, sending these patches first as they are isolated. [1] http://vger.kernel.org/lpc-networking2018.html#session-1 --- Jesper Dangaard Brouer (3): samples/bpf: xdp_fwd rename devmap name to be xdp_tx_ports samples/bpf: make xdp_fwd more practically usable via devmap lookup samples/bpf: xdp_fwd explain bpf_fib_lookup return codes samples/bpf/xdp_fwd_kern.c | 39 ++- samples/bpf/xdp_fwd_user.c | 35 +++ 2 files changed, 53 insertions(+), 21 deletions(-) --
[bpf-next v3 PATCH 3/3] samples/bpf: xdp_fwd explain bpf_fib_lookup return codes
Make it clear that this XDP program depends on the network stack to do the ARP resolution. This is connected with the BPF_FIB_LKUP_RET_NO_NEIGH return code from bpf_fib_lookup(). Another common mistake (seen via the XDP-tutorial) is that users don't realize that the sysctl net.ipv{4,6}.conf.all.forwarding setting is honored by bpf_fib_lookup. Reported-by: Anton Protopopov Signed-off-by: Jesper Dangaard Brouer Reviewed-by: David Ahern Acked-by: Yonghong Song --- samples/bpf/xdp_fwd_kern.c | 19 +-- 1 file changed, 17 insertions(+), 2 deletions(-) diff --git a/samples/bpf/xdp_fwd_kern.c b/samples/bpf/xdp_fwd_kern.c index a43d6953c054..701a30f258b1 100644 --- a/samples/bpf/xdp_fwd_kern.c +++ b/samples/bpf/xdp_fwd_kern.c @@ -103,8 +103,23 @@ static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags) fib_params.ifindex = ctx->ingress_ifindex; rc = bpf_fib_lookup(ctx, &fib_params, sizeof(fib_params), flags); - - if (rc == 0) { + /* +* Some rc (return codes) from bpf_fib_lookup() are important, +* to understand how this XDP-prog interacts with network stack. +* +* BPF_FIB_LKUP_RET_NO_NEIGH: +* Even if route lookup was a success, then the MAC-addresses are also +* needed. This is obtained from arp/neighbour table, but if table is +* (still) empty then BPF_FIB_LKUP_RET_NO_NEIGH is returned. To avoid +* doing ARP lookup directly from XDP, then send packet to normal +* network stack via XDP_PASS and expect it will do ARP resolution. +* +* BPF_FIB_LKUP_RET_FWD_DISABLED: +* The bpf_fib_lookup respect sysctl net.ipv{4,6}.conf.all.forwarding +* setting, and will return BPF_FIB_LKUP_RET_FWD_DISABLED if not +* enabled this on ingress device. +*/ + if (rc == BPF_FIB_LKUP_RET_SUCCESS) { /* Verify egress index has been configured as TX-port. * (Note: User can still have inserted an egress ifindex that * doesn't support XDP xmit, which will result in packet drops).
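For readers hitting BPF_FIB_LKUP_RET_FWD_DISABLED with this sample: the sysctl settings the helper honors can be enabled like so (a configuration sketch; enable per-device keys instead of "all" if you want tighter control):

```shell
# bpf_fib_lookup() returns BPF_FIB_LKUP_RET_FWD_DISABLED unless
# forwarding is enabled on the ingress device; "all" covers every device.
sysctl -w net.ipv4.conf.all.forwarding=1
sysctl -w net.ipv6.conf.all.forwarding=1
```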
[bpf-next v3 PATCH 1/3] samples/bpf: xdp_fwd rename devmap name to be xdp_tx_ports
The devmap name 'tx_port' came from a copy-paste from xdp_redirect_map which only have a single TX port. Change name to xdp_tx_ports to make it more descriptive. Signed-off-by: Jesper Dangaard Brouer Reviewed-by: David Ahern Acked-by: Yonghong Song --- samples/bpf/xdp_fwd_kern.c |5 +++-- samples/bpf/xdp_fwd_user.c |2 +- 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/samples/bpf/xdp_fwd_kern.c b/samples/bpf/xdp_fwd_kern.c index a7e94e7ff87d..e6ffc4ea06f4 100644 --- a/samples/bpf/xdp_fwd_kern.c +++ b/samples/bpf/xdp_fwd_kern.c @@ -23,7 +23,8 @@ #define IPV6_FLOWINFO_MASK cpu_to_be32(0x0FFF) -struct bpf_map_def SEC("maps") tx_port = { +/* For TX-traffic redirect requires net_device ifindex to be in this devmap */ +struct bpf_map_def SEC("maps") xdp_tx_ports = { .type = BPF_MAP_TYPE_DEVMAP, .key_size = sizeof(int), .value_size = sizeof(int), @@ -117,7 +118,7 @@ static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags) memcpy(eth->h_dest, fib_params.dmac, ETH_ALEN); memcpy(eth->h_source, fib_params.smac, ETH_ALEN); - return bpf_redirect_map(&tx_port, fib_params.ifindex, 0); + return bpf_redirect_map(&xdp_tx_ports, fib_params.ifindex, 0); } return XDP_PASS; diff --git a/samples/bpf/xdp_fwd_user.c b/samples/bpf/xdp_fwd_user.c index 5b46ee12c696..ba012d9f93dd 100644 --- a/samples/bpf/xdp_fwd_user.c +++ b/samples/bpf/xdp_fwd_user.c @@ -113,7 +113,7 @@ int main(int argc, char **argv) return 1; } map_fd = bpf_map__fd(bpf_object__find_map_by_name(obj, - "tx_port")); + "xdp_tx_ports")); if (map_fd < 0) { printf("map not found: %s\n", strerror(map_fd)); return 1;
[bpf-next v3 PATCH 2/3] samples/bpf: make xdp_fwd more practically usable via devmap lookup
This addresses the TODO in samples/bpf/xdp_fwd_kern.c, which points out that the chosen egress index should be checked for existence in the devmap. This can now be done by taking advantage of Toke's work in commit 0cdbb4b09a06 ("devmap: Allow map lookups from eBPF"). This change makes xdp_fwd more practically usable, as it allows for a mixed environment, where IP forwarding falls back to the network stack if the egress device isn't configured to use XDP. Signed-off-by: Jesper Dangaard Brouer Reviewed-by: David Ahern --- samples/bpf/xdp_fwd_kern.c | 17 +++-- samples/bpf/xdp_fwd_user.c | 33 ++--- 2 files changed, 33 insertions(+), 17 deletions(-) diff --git a/samples/bpf/xdp_fwd_kern.c b/samples/bpf/xdp_fwd_kern.c index e6ffc4ea06f4..a43d6953c054 100644 --- a/samples/bpf/xdp_fwd_kern.c +++ b/samples/bpf/xdp_fwd_kern.c @@ -104,13 +104,18 @@ static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags) rc = bpf_fib_lookup(ctx, &fib_params, sizeof(fib_params), flags); - /* verify egress index has xdp support -* TO-DO bpf_map_lookup_elem(&tx_port, &key) fails with -* cannot pass map_type 14 into func bpf_map_lookup_elem#1: -* NOTE: without verification that egress index supports XDP -* forwarding packets are dropped. -*/ if (rc == 0) { + /* Verify egress index has been configured as TX-port. +* (Note: User can still have inserted an egress ifindex that +* doesn't support XDP xmit, which will result in packet drops). +* +* Note: lookup in devmap supported since 0cdbb4b09a0.
+* If not supported will fail with: +* cannot pass map_type 14 into func bpf_map_lookup_elem#1: +*/ + if (!bpf_map_lookup_elem(&xdp_tx_ports, &fib_params.ifindex)) + return XDP_PASS; + if (h_proto == htons(ETH_P_IP)) ip_decrease_ttl(iph); else if (h_proto == htons(ETH_P_IPV6)) diff --git a/samples/bpf/xdp_fwd_user.c b/samples/bpf/xdp_fwd_user.c index ba012d9f93dd..97ff1dad7669 100644 --- a/samples/bpf/xdp_fwd_user.c +++ b/samples/bpf/xdp_fwd_user.c @@ -27,14 +27,20 @@ #include "libbpf.h" #include - -static int do_attach(int idx, int fd, const char *name) +static int do_attach(int idx, int prog_fd, int map_fd, const char *name) { int err; - err = bpf_set_link_xdp_fd(idx, fd, 0); - if (err < 0) + err = bpf_set_link_xdp_fd(idx, prog_fd, 0); + if (err < 0) { printf("ERROR: failed to attach program to %s\n", name); + return err; + } + + /* Adding ifindex as a possible egress TX port */ + err = bpf_map_update_elem(map_fd, &idx, &idx, 0); + if (err) + printf("ERROR: failed using device %s as TX-port\n", name); return err; } @@ -47,6 +53,9 @@ static int do_detach(int idx, const char *name) if (err < 0) printf("ERROR: failed to detach program from %s\n", name); + /* TODO: Remember to cleanup map, when adding use of shared map +* bpf_map_delete_elem((map_fd, &idx); +*/ return err; } @@ -67,10 +76,10 @@ int main(int argc, char **argv) }; const char *prog_name = "xdp_fwd"; struct bpf_program *prog; + int prog_fd, map_fd = -1; char filename[PATH_MAX]; struct bpf_object *obj; int opt, i, idx, err; - int prog_fd, map_fd; int attach = 1; int ret = 0; @@ -103,8 +112,14 @@ int main(int argc, char **argv) return 1; } - if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd)) + err = bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd); + if (err) { + printf("Does kernel support devmap lookup?\n"); + /* If not, the error message will be: +* "cannot pass map_type 14 into func bpf_map_lookup_elem#1" +*/ return 1; + } prog = bpf_object__find_program_by_title(obj, prog_name); prog_fd = 
bpf_program__fd(prog); @@ -119,10 +134,6 @@ int main(int argc, char **argv) return 1; } } - if (attach) { - for (i = 1; i < 64; ++i) - bpf_map_update_elem(map_fd, &i, &i, 0); - } for (i = optind; i < argc; ++i) { idx = if_nametoindex(argv[i]); @@ -138,7 +149,7 @@ int main(int argc, char **argv) if (err)
Re: [PATCH bpf-next v5 0/6] xdp: Add devmap_hash map type
On Thu, 8 Aug 2019 12:57:05 -0700 Y Song wrote: > On Thu, Aug 8, 2019 at 12:43 PM Toke Høiland-Jørgensen > wrote: > > > > Alexei Starovoitov writes: > > > > > On Fri, Jul 26, 2019 at 9:06 AM Toke Høiland-Jørgensen > > > wrote: > > >> > > >> This series adds a new map type, devmap_hash, that works like the > > >> existing > > >> devmap type, but using a hash-based indexing scheme. This is useful for > > >> the use > > >> case where a devmap is indexed by ifindex (for instance for use with the > > >> routing > > >> table lookup helper). For this use case, the regular devmap needs to be > > >> sized > > >> after the maximum ifindex number, not the number of devices in it. A > > >> hash-based > > >> indexing scheme makes it possible to size the map after the number of > > >> devices it > > >> should contain instead. > > >> > > >> This was previously part of my patch series that also turned the regular > > >> bpf_redirect() helper into a map-based one; for this series I just > > >> pulled out > > >> the patches that introduced the new map type. > > >> > > >> Changelog: > > >> > > >> v5: > > >> > > >> - Dynamically set the number of hash buckets by rounding up max_entries > > >> to the > > >> nearest power of two (mirroring the regular hashmap), as suggested by > > >> Jesper. > > > > > > fyi I'm waiting for Jesper to review this new version. > > > > Ping Jesper? :) > > Toke, the patch set has been merged to net-next. > https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=d3406913561c322323ec2898cc58f55e79786be7 > Yes, and I did review this... :-) -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
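The v5 change mentioned in the cover letter above — dynamically sizing the hash buckets by rounding max_entries up to the nearest power of two — follows the kernel's roundup_pow_of_two(). A minimal userspace model of that sizing rule (a sketch, not the kernel implementation, which uses fls_long() rather than a loop):

```c
#include <assert.h>

/* Userspace model of rounding a devmap_hash bucket count up to the
 * next power of two, mirroring what roundup_pow_of_two() computes. */
static unsigned int round_up_pow2(unsigned int n)
{
	unsigned int p = 1;

	while (p < n)
		p <<= 1;	/* double until we reach or pass n */
	return p;
}
```

Sizing buckets to a power of two lets the map mask the hash (h & (n_buckets - 1)) instead of taking an expensive modulo.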
Re: [PATCH bpf-next V11 4/7] bpf: add BPF-helper for MTU checking
On Thu, 14 Jan 2021 00:07:14 +0100 Daniel Borkmann wrote: > On 1/12/21 6:45 PM, Jesper Dangaard Brouer wrote: > > This BPF-helper bpf_check_mtu() works for both XDP and TC-BPF programs. > [...] > > + * int bpf_check_mtu(void *ctx, u32 ifindex, u32 *mtu_len, s32 len_diff, > > u64 flags) > > + * Description > > + * Check ctx packet size against MTU of net device (based on > > + * *ifindex*). This helper will likely be used in combination with > > + * helpers that adjust/change the packet size. The argument > > + * *len_diff* can be used for querying with a planned size > > + * change. This allows to check MTU prior to changing packet ctx. > > + * > > + * Specifying *ifindex* zero means the MTU check is performed > > + * against the current net device. This is practical if this isn't > > + * used prior to redirect. > > + * > > + * The Linux kernel route table can configure MTUs on a more > > + * specific per route level, which is not provided by this helper. > > + * For route level MTU checks use the **bpf_fib_lookup**\ () > > + * helper. > > + * > > + * *ctx* is either **struct xdp_md** for XDP programs or > > + * **struct sk_buff** for tc cls_act programs. > > + * > > + * The *flags* argument can be a combination of one or more of the > > + * following values: > > + * > > + * **BPF_MTU_CHK_SEGS** > > + * This flag will only works for *ctx* **struct sk_buff**. > > + * If packet context contains extra packet segment buffers > > + * (often knows as GSO skb), then MTU check is harder to > > + * check at this point, because in transmit path it is > > + * possible for the skb packet to get re-segmented > > + * (depending on net device features). This could still be > > + * a MTU violation, so this flag enables performing MTU > > + * check against segments, with a different violation > > + * return code to tell it apart. Check cannot use len_diff. > > + * > > + * On return *mtu_len* pointer contains the MTU value of the net > > + * device. 
Remember the net device configured MTU is the L3 size, > > + * which is returned here and XDP and TX length operate at L2. > > + * Helper take this into account for you, but remember when using > > + * MTU value in your BPF-code. On input *mtu_len* must be a valid > > + * pointer and be initialized (to zero), else verifier will reject > > + * BPF program. > > + * > > + * Return > > + * * 0 on success, and populate MTU value in *mtu_len* pointer. > > + * > > + * * < 0 if any input argument is invalid (*mtu_len* not updated) > > + * > > + * MTU violations return positive values, but also populate MTU > > + * value in *mtu_len* pointer, as this can be needed for > > + * implementing PMTU handing: > > + * > > + * * **BPF_MTU_CHK_RET_FRAG_NEEDED** > > + * * **BPF_MTU_CHK_RET_SEGS_TOOBIG** > > + * > >*/ > > #define __BPF_FUNC_MAPPER(FN) \ > > FN(unspec), \ > > @@ -3998,6 +4053,7 @@ union bpf_attr { > > FN(ktime_get_coarse_ns),\ > > FN(ima_inode_hash), \ > > FN(sock_from_file), \ > > + FN(check_mtu), \ > > /* */ > > > > /* integer value in 'imm' field of BPF_CALL instruction selects which > > helper > > @@ -5030,6 +5086,17 @@ struct bpf_redir_neigh { > > }; > > }; > > > > +/* bpf_check_mtu flags*/ > > +enum bpf_check_mtu_flags { > > + BPF_MTU_CHK_SEGS = (1U << 0), > > +}; > > + > > +enum bpf_check_mtu_ret { > > + BPF_MTU_CHK_RET_SUCCESS, /* check and lookup successful */ > > + BPF_MTU_CHK_RET_FRAG_NEEDED, /* fragmentation required to fwd */ > > + BPF_MTU_CHK_RET_SEGS_TOOBIG, /* GSO re-segmentation needed to fwd */ > > +}; > > + > > enum bpf_task_fd_type { > > BPF_FD_TYPE_RAW_TRACEPOINT, /* tp name */ > > BPF_FD_TYPE_TRACEPOINT, /* tp name */ > > diff --git a/net/core/filter.c b/net/core/filter.c > > index db59ab55572c..3f2e593244ca 100644 > > --- a/net/core/filter.c > > +++ b/net/core/filter.c > > @@ -5604,6 +5604,124 @@ static const struct bpf_func_proto >
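The L2/L3 distinction in the helper documentation above can be summarized with a small userspace model. This is only a sketch of the length arithmetic (assuming dev_len = mtu + hard_header_len, as the patch describes), not the kernel helper itself, and it covers just the non-GSO path:

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model of the non-GSO part of the MTU check: the device's
 * configured MTU is an L3 size, so the L2 header length is added
 * before comparing against the (possibly len_diff-adjusted) packet. */
static bool mtu_check_passes(int pkt_len, int len_diff,
			     int mtu, int hard_header_len)
{
	int dev_len = mtu + hard_header_len;	/* L3 MTU -> L2 limit */
	int skb_len = pkt_len + len_diff;	/* planned size change */

	return skb_len <= dev_len;
}
```

This also shows why a negative len_diff can make an otherwise oversized packet pass: the helper checks the planned post-adjustment size.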
Re: [PATCH bpf-next V11 4/7] bpf: add BPF-helper for MTU checking
On Tue, 12 Jan 2021 11:23:33 -0800 Andrii Nakryiko wrote: > On Tue, Jan 12, 2021 at 9:49 AM Jesper Dangaard Brouer > wrote: > > > > This BPF-helper bpf_check_mtu() works for both XDP and TC-BPF programs. > > > > The SKB object is complex and the skb->len value (accessible from > > BPF-prog) also include the length of any extra GRO/GSO segments, but > > without taking into account that these GRO/GSO segments get added > > transport (L4) and network (L3) headers before being transmitted. Thus, > > this BPF-helper is created such that the BPF-programmer don't need to > > handle these details in the BPF-prog. > > > > The API is designed to help the BPF-programmer, that want to do packet > > context size changes, which involves other helpers. These other helpers > > usually does a delta size adjustment. This helper also support a delta > > size (len_diff), which allow BPF-programmer to reuse arguments needed by > > these other helpers, and perform the MTU check prior to doing any actual > > size adjustment of the packet context. > > > > It is on purpose, that we allow the len adjustment to become a negative > > result, that will pass the MTU check. This might seem weird, but it's not > > this helpers responsibility to "catch" wrong len_diff adjustments. Other > > helpers will take care of these checks, if BPF-programmer chooses to do > > actual size adjustment. 
> > > > V9: > > - Use dev->hard_header_len (instead of ETH_HLEN) > > - Annotate with unlikely req from Daniel > > - Fix logic error using skb_gso_validate_network_len from Daniel > > > > V6: > > - Took John's advice and dropped BPF_MTU_CHK_RELAX > > - Returned MTU is kept at L3-level (like fib_lookup) > > > > V4: Lot of changes > > - ifindex 0 now use current netdev for MTU lookup > > - rename helper from bpf_mtu_check to bpf_check_mtu > > - fix bug for GSO pkt length (as skb->len is total len) > > - remove __bpf_len_adj_positive, simply allow negative len adj > > > > Signed-off-by: Jesper Dangaard Brouer > > --- > > include/uapi/linux/bpf.h | 67 ++ > > net/core/filter.c | 122 > > > > tools/include/uapi/linux/bpf.h | 67 ++ > > 3 files changed, 256 insertions(+) > > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > > index 649586d656b6..fa2e99351758 100644 > > --- a/include/uapi/linux/bpf.h > > +++ b/include/uapi/linux/bpf.h > > @@ -3833,6 +3833,61 @@ union bpf_attr { > > * Return > > * A pointer to a struct socket on success or NULL if the file > > is > > * not a socket. > > + * > > + * int bpf_check_mtu(void *ctx, u32 ifindex, u32 *mtu_len, s32 len_diff, > > u64 flags) > > should return long, same as most other helpers Is it enough to change it here? (as this will be used for generating the helpers header file, via ./scripts/bpf_helpers_doc.py --header) Or do I also need to change bpf_func_proto.ret_type ? > > + * Description > > + * Check ctx packet size against MTU of net device (based on > > + * *ifindex*). This helper will likely be used in combination > > with > > + * helpers that adjust/change the packet size. The argument > > + * *len_diff* can be used for querying with a planned size > > + * change. This allows to check MTU prior to changing packet > > ctx. > > + * > > [...] > -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH bpf-next V11 5/7] bpf: drop MTU check when doing TC-BPF redirect to ingress
On Thu, 14 Jan 2021 01:03:33 -0800 John Fastabend wrote: > Jesper Dangaard Brouer wrote: > > The use-case for dropping the MTU check when TC-BPF does redirect to > > ingress, is described by Eyal Birger in email[0]. The summary is the > > ability to increase packet size (e.g. with IPv6 headers for NAT64) and > > ingress redirect packet and let normal netstack fragment packet as needed. > > > > [0] > > https://lore.kernel.org/netdev/CAHsH6Gug-hsLGHQ6N0wtixdOa85LDZ3HNRHVd0opR=19qo4...@mail.gmail.com/ > > > > V9: > > - Make net_device "up" (IFF_UP) check explicit in skb_do_redirect > > > > V4: > > - Keep net_device "up" (IFF_UP) check. > > - Adjustment to handle bpf_redirect_peer() helper > > > > Signed-off-by: Jesper Dangaard Brouer > > --- > > include/linux/netdevice.h | 31 +-- > > net/core/dev.c| 19 ++- > > net/core/filter.c | 14 +++--- > > 3 files changed, 42 insertions(+), 22 deletions(-) > > > > [...] > > > diff --git a/net/core/filter.c b/net/core/filter.c > > index 3f2e593244ca..1908800b671c 100644 > > --- a/net/core/filter.c > > +++ b/net/core/filter.c > > @@ -2083,13 +2083,21 @@ static const struct bpf_func_proto > > bpf_csum_level_proto = { > > > > static inline int __bpf_rx_skb(struct net_device *dev, struct sk_buff *skb) > > { > > - return dev_forward_skb(dev, skb); > > > + int ret = dev_forward_skb(dev, skb, false); > > + > > + if (likely(!ret)) { > > + skb->protocol = eth_type_trans(skb, dev); > > + skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN); > > + ret = netif_rx(skb); > > + } > > + > > + return ret; > > How about putting above block into a dev.c routine call it > > dev_forward_skb_nomtu(...) > > or something like that. Then we keep this code next to its pair > with mtu check, dev_forward_skb(). > > dev_forward_skb() also uses netif_rx_internal() looks like maybe we should > just do the same here? I love the idea. I'm coding it up and it looks much nicer. 
And yes, we obviously can use netif_rx_internal() once the code is in core/dev.c. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH bpf-next V11 4/7] bpf: add BPF-helper for MTU checking
On Thu, 14 Jan 2021 23:28:57 +0100 Daniel Borkmann wrote: > On 1/14/21 3:36 PM, Jesper Dangaard Brouer wrote: > [...] > >>> +BPF_CALL_5(bpf_skb_check_mtu, struct sk_buff *, skb, > >>> +u32, ifindex, u32 *, mtu_len, s32, len_diff, u64, flags) > >>> +{ > >>> + int ret = BPF_MTU_CHK_RET_FRAG_NEEDED; > >>> + struct net_device *dev = skb->dev; > >>> + int skb_len, dev_len; > >>> + int mtu; > >>> + > >>> + if (unlikely(flags & ~(BPF_MTU_CHK_SEGS))) > >>> + return -EINVAL; > >>> + > >>> + dev = __dev_via_ifindex(dev, ifindex); > >>> + if (unlikely(!dev)) > >>> + return -ENODEV; > >>> + > >>> + mtu = READ_ONCE(dev->mtu); > >>> + > >>> + dev_len = mtu + dev->hard_header_len; > >>> + skb_len = skb->len + len_diff; /* minus result pass check */ > >>> + if (skb_len <= dev_len) { > >>> + ret = BPF_MTU_CHK_RET_SUCCESS; > >>> + goto out; > >>> + } > >>> + /* At this point, skb->len exceed MTU, but as it include length of all > >>> + * segments, it can still be below MTU. The SKB can possibly get > >>> + * re-segmented in transmit path (see validate_xmit_skb). Thus, user > >>> + * must choose if segs are to be MTU checked. Last SKB "headlen" is > >>> + * checked against MTU. > >>> + */ > >>> + if (skb_is_gso(skb)) { > >>> + ret = BPF_MTU_CHK_RET_SUCCESS; > >>> + > >>> + if (!(flags & BPF_MTU_CHK_SEGS)) > >>> + goto out; > >>> + > >>> + if (!skb_gso_validate_network_len(skb, mtu)) { > >>> + ret = BPF_MTU_CHK_RET_SEGS_TOOBIG; > >>> + goto out; > >>> + } > >>> + > >>> + skb_len = skb_headlen(skb) + len_diff; > >>> + if (skb_len > dev_len) { > [...] > >> Do you have a particular use case for the BPF_MTU_CHK_SEGS? > > > > The complaint from Maze (and others) was that when skb_is_gso then all > > the MTU checks are bypassed. This flag enables checking the GSO part > > via skb_gso_validate_network_len(). We cannot enable it per default, > > as you say, it is not universally correct in all cases.
> > If there is a desire to have access to the skb_gso_validate_network_len(), I'd > keep that behind the flag then, but would drop the skb_headlen(skb) + len_diff > case given the mentioned case on rx where it would yield misleading results to > users that might be unintuitive & hard to debug. Okay, I will update the patch, and drop those lines. > >> I also don't see the flag being used anywhere in your selftests, so I > >> presume > >> not as otherwise you would have added an example there? > > > > I'm using the flag in the bpf-examples code[1], this is how I've tested > > the code path. > > > > I've not found a way to generate GSO packet via the selftests > > infrastructure via bpf_prog_test_run_xattr(). I'm > > > > [1] > > https://github.com/xdp-project/bpf-examples/blob/master/MTU-tests/tc_mtu_enforce.c > > > > Haven't checked but likely something as prog_tests/skb_ctx.c might not be > sufficient > to pass it into the helper. For real case you might need a netns + veth setup > like > some of the other tests are doing and then generating TCP stream from one end > to the > other. I have looked at prog_tests/skb_ctx.c and (as you say yourself) this is not sufficient. I can look into creating a netns+veth setup, but I will appreciate if we can merge this patchset to make forward progress, as I'm sure the netns+veth setup will require its own round of nitpicking. I have created netns+veth test scripts before (see test_xdp_vlan.sh), but my experience is that people/maintainers forget/don't to run these separate shell scripts. Thus, if I create a netns+veth test, then I will prefer if I can integrate this into the "test_progs", as I know that will be run by people/maintainers. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
[PATCH bpf-next V12 1/7] bpf: Remove MTU check in __bpf_skb_max_len
Multiple BPF-helpers that can manipulate/increase the size of the SKB use __bpf_skb_max_len() as the max-length. This function limits the size against the current net_device MTU (skb->dev->mtu). When a BPF-prog grows the packet size, it should not be limited to the MTU. The MTU is a transmit limitation, and software receiving this packet should be allowed to increase the size. Furthermore, the current MTU check in __bpf_skb_max_len uses the MTU from the ingress/current net_device, which in case of redirects uses the wrong net_device. This patch keeps a sanity max limit of SKB_MAX_ALLOC (16KiB). The real limit is elsewhere in the system. Jesper's testing[1] showed it was not possible to exceed 8KiB when expanding the SKB size via a BPF-helper. The limiting factor is the define KMALLOC_MAX_CACHE_SIZE, which is 8192 for the SLUB-allocator (CONFIG_SLUB) in case PAGE_SIZE is 4096. This define is in effect because this is called from softirq context; see __gfp_pfmemalloc_flags() and __do_kmalloc_node(). Jakub's testing showed that frames above 16KiB can cause NICs to reset (but not crash). Keep this sanity limit at this level, as the memory layer can differ based on kernel config. [1] https://github.com/xdp-project/bpf-examples/tree/master/MTU-tests Signed-off-by: Jesper Dangaard Brouer Acked-by: John Fastabend --- net/core/filter.c | 12 1 file changed, 4 insertions(+), 8 deletions(-) diff --git a/net/core/filter.c b/net/core/filter.c index 9ab94e90d660..5beadd659091 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -3552,11 +3552,7 @@ static int bpf_skb_net_shrink(struct sk_buff *skb, u32 off, u32 len_diff, return 0; } -static u32 __bpf_skb_max_len(const struct sk_buff *skb) -{ - return skb->dev ?
skb->dev->mtu + skb->dev->hard_header_len : - SKB_MAX_ALLOC; -} +#define BPF_SKB_MAX_LEN SKB_MAX_ALLOC BPF_CALL_4(sk_skb_adjust_room, struct sk_buff *, skb, s32, len_diff, u32, mode, u64, flags) @@ -3605,7 +3601,7 @@ BPF_CALL_4(bpf_skb_adjust_room, struct sk_buff *, skb, s32, len_diff, { u32 len_cur, len_diff_abs = abs(len_diff); u32 len_min = bpf_skb_net_base_len(skb); - u32 len_max = __bpf_skb_max_len(skb); + u32 len_max = BPF_SKB_MAX_LEN; __be16 proto = skb->protocol; bool shrink = len_diff < 0; u32 off; @@ -3688,7 +3684,7 @@ static int bpf_skb_trim_rcsum(struct sk_buff *skb, unsigned int new_len) static inline int __bpf_skb_change_tail(struct sk_buff *skb, u32 new_len, u64 flags) { - u32 max_len = __bpf_skb_max_len(skb); + u32 max_len = BPF_SKB_MAX_LEN; u32 min_len = __bpf_skb_min_len(skb); int ret; @@ -3764,7 +3760,7 @@ static const struct bpf_func_proto sk_skb_change_tail_proto = { static inline int __bpf_skb_change_head(struct sk_buff *skb, u32 head_room, u64 flags) { - u32 max_len = __bpf_skb_max_len(skb); + u32 max_len = BPF_SKB_MAX_LEN; u32 new_len = skb->len + head_room; int ret;
[PATCH bpf-next V12 4/7] bpf: add BPF-helper for MTU checking
This BPF-helper bpf_check_mtu() works for both XDP and TC-BPF programs. The SKB object is complex and the skb->len value (accessible from the BPF-prog) also includes the length of any extra GRO/GSO segments, but without taking into account that these GRO/GSO segments get transport (L4) and network (L3) headers added before being transmitted. Thus, this BPF-helper is created such that the BPF-programmer doesn't need to handle these details in the BPF-prog. The API is designed to help the BPF-programmer who wants to do packet context size changes, which involve other helpers. These other helpers usually do a delta size adjustment. This helper also supports a delta size (len_diff), which allows the BPF-programmer to reuse arguments needed by these other helpers, and perform the MTU check prior to doing any actual size adjustment of the packet context. It is on purpose that we allow the len adjustment to become a negative result that will pass the MTU check. This might seem weird, but it's not this helper's responsibility to "catch" wrong len_diff adjustments. Other helpers will take care of these checks, if the BPF-programmer chooses to do an actual size adjustment. V12: - Simplify segment check that calls skb_gso_validate_network_len.
- Helpers should return long V9: - Use dev->hard_header_len (instead of ETH_HLEN) - Annotate with unlikely req from Daniel - Fix logic error using skb_gso_validate_network_len from Daniel V6: - Took John's advice and dropped BPF_MTU_CHK_RELAX - Returned MTU is kept at L3-level (like fib_lookup) V4: Lot of changes - ifindex 0 now use current netdev for MTU lookup - rename helper from bpf_mtu_check to bpf_check_mtu - fix bug for GSO pkt length (as skb->len is total len) - remove __bpf_len_adj_positive, simply allow negative len adj Signed-off-by: Jesper Dangaard Brouer --- include/uapi/linux/bpf.h | 67 net/core/filter.c | 111 tools/include/uapi/linux/bpf.h | 67 3 files changed, 245 insertions(+) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 05bfc8c843dc..f17381a337ec 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -3839,6 +3839,61 @@ union bpf_attr { * Return * A pointer to a struct socket on success or NULL if the file is * not a socket. + * + * long bpf_check_mtu(void *ctx, u32 ifindex, u32 *mtu_len, s32 len_diff, u64 flags) + * Description + * Check ctx packet size against MTU of net device (based on + * *ifindex*). This helper will likely be used in combination with + * helpers that adjust/change the packet size. The argument + * *len_diff* can be used for querying with a planned size + * change. This allows to check MTU prior to changing packet ctx. + * + * Specifying *ifindex* zero means the MTU check is performed + * against the current net device. This is practical if this isn't + * used prior to redirect. + * + * The Linux kernel route table can configure MTUs on a more + * specific per route level, which is not provided by this helper. + * For route level MTU checks use the **bpf_fib_lookup**\ () + * helper. + * + * *ctx* is either **struct xdp_md** for XDP programs or + * **struct sk_buff** for tc cls_act programs. 
+ *
+ *		The *flags* argument can be a combination of one or more of the
+ *		following values:
+ *
+ *		**BPF_MTU_CHK_SEGS**
+ *			This flag only works for *ctx* **struct sk_buff**.
+ *			If the packet context contains extra packet segment buffers
+ *			(often known as GSO skb), then the MTU check is harder to
+ *			do at this point, because in the transmit path it is
+ *			possible for the skb packet to get re-segmented
+ *			(depending on net device features). This could still be
+ *			an MTU violation, so this flag enables performing the MTU
+ *			check against segments, with a different violation
+ *			return code to tell it apart. Check cannot use len_diff.
+ *
+ *		On return the *mtu_len* pointer contains the MTU value of the net
+ *		device. Remember the net device configured MTU is the L3 size,
+ *		which is returned here, while XDP and TC length operate at L2.
+ *		The helper takes this into account for you, but remember it when
+ *		using the MTU value in your BPF-code. On input *mtu_len* must be
+ *		a valid pointer and be initialized (to zero), else the verifier
+ *		will reject the BPF program.
+ *
+ *	Return
+ *		* 0 on success, and populate MTU value in *mtu_len* pointer.
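The L3-vs-L2 bookkeeping described in the helper doc above can be illustrated with a small userspace model. This is only a sketch of the length math (the function name `mtu_check_passes` and its signature are made up for illustration; it is not the kernel implementation): the netdev MTU is an L3 size, the packet length is L2, and a caller-supplied len_diff — possibly negative — is applied before comparing.

```c
#include <stdbool.h>

/* Illustrative userspace model of the bpf_check_mtu() length math.
 * The helper compares the L2 packet length (plus a caller-supplied
 * len_diff) against the netdev MTU, where MTU is an L3 size and the
 * device adds hard_header_len (L2 header) on transmit.  A negative
 * len_diff that shrinks the packet is allowed to pass, as the commit
 * message explains: catching wrong deltas is left to other helpers.
 */
static bool mtu_check_passes(unsigned int pkt_len_l2, /* current L2 length */
			     int len_diff,            /* planned size change */
			     unsigned int mtu,        /* L3 MTU from netdev */
			     unsigned int hard_header_len)
{
	int new_len = (int)pkt_len_l2 + len_diff;

	/* mtu is L3; add the L2 header to compare against an L2 length */
	return new_len <= (int)(mtu + hard_header_len);
}
```

For an Ethernet device (hard_header_len 14, MTU 1500), a full-sized 1514-byte frame passes with len_diff 0, fails with a planned +20 byte encap, and an oversized frame passes again once a negative len_diff shrinks it below the limit.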
[PATCH bpf-next V12 3/7] bpf: bpf_fib_lookup return MTU value as output when looked up
The BPF-helpers for FIB lookup (bpf_xdp_fib_lookup and bpf_skb_fib_lookup) can perform an MTU check and return BPF_FIB_LKUP_RET_FRAG_NEEDED, but the BPF-prog doesn't know the MTU value that caused this rejection. If the BPF-prog wants to implement PMTU (Path MTU Discovery, RFC 1191) it needs to know this MTU value for the ICMP packet.

This patch changes the lookup and result struct bpf_fib_lookup to contain this MTU value as output, via a union with 'tot_len', as this is the value used for the MTU lookup.

V5:
- Fixed uninit value spotted by Dan Carpenter.
- Name struct output member mtu_result

Reported-by: kernel test robot
Reported-by: Dan Carpenter
Signed-off-by: Jesper Dangaard Brouer
---
 include/uapi/linux/bpf.h       | 11 +--
 net/core/filter.c              | 22 +++---
 tools/include/uapi/linux/bpf.h | 11 +--
 3 files changed, 33 insertions(+), 11 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c001766adcbc..05bfc8c843dc 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2231,6 +2231,9 @@ union bpf_attr {
 *		* > 0 one of **BPF_FIB_LKUP_RET_** codes explaining why the
 *		  packet is not forwarded or needs assist from full stack
 *
+ *		If lookup fails with BPF_FIB_LKUP_RET_FRAG_NEEDED, then the MTU
+ *		was exceeded and output params->mtu_result contains the MTU.
+ *
 * long bpf_sock_hash_update(struct bpf_sock_ops *skops, struct bpf_map *map, void *key, u64 flags)
 *	Description
 *		Add an entry to, or update a sockhash *map* referencing sockets.
@@ -4981,9 +4984,13 @@ struct bpf_fib_lookup {
 	__be16	sport;
 	__be16	dport;

-	/* total length of packet from network header - used for MTU check */
-	__u16	tot_len;
+	union {	/* used for MTU check */
+		/* input to lookup */
+		__u16	tot_len; /* L3 length from network hdr (iph->tot_len) */
+		/* output: MTU value */
+		__u16	mtu_result;
+	};

 	/* input: L3 device index for lookup
 	 * output: device index from FIB lookup
 	 */
diff --git a/net/core/filter.c b/net/core/filter.c
index d5e6f395cf64..da162e64578a 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5289,12 +5289,14 @@ static const struct bpf_func_proto bpf_skb_get_xfrm_state_proto = {
 #if IS_ENABLED(CONFIG_INET) || IS_ENABLED(CONFIG_IPV6)
 static int bpf_fib_set_fwd_params(struct bpf_fib_lookup *params,
 				  const struct neighbour *neigh,
-				  const struct net_device *dev)
+				  const struct net_device *dev, u32 mtu)
 {
 	memcpy(params->dmac, neigh->ha, ETH_ALEN);
 	memcpy(params->smac, dev->dev_addr, ETH_ALEN);
 	params->h_vlan_TCI = 0;
 	params->h_vlan_proto = 0;
+	if (mtu)
+		params->mtu_result = mtu; /* union with tot_len */

 	return 0;
 }
@@ -5310,8 +5312,8 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	struct net_device *dev;
 	struct fib_result res;
 	struct flowi4 fl4;
+	u32 mtu = 0;
 	int err;
-	u32 mtu;

 	dev = dev_get_by_index_rcu(net, params->ifindex);
 	if (unlikely(!dev))
@@ -5378,8 +5380,10 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	if (check_mtu) {
 		mtu = ip_mtu_from_fib_result(&res, params->ipv4_dst);
-		if (params->tot_len > mtu)
+		if (params->tot_len > mtu) {
+			params->mtu_result = mtu; /* union with tot_len */
 			return BPF_FIB_LKUP_RET_FRAG_NEEDED;
+		}
 	}

 	nhc = res.nhc;
@@ -5413,7 +5417,7 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	if (!neigh)
 		return BPF_FIB_LKUP_RET_NO_NEIGH;

-	return bpf_fib_set_fwd_params(params, neigh, dev);
+	return bpf_fib_set_fwd_params(params, neigh, dev, mtu);
 }
 #endif

@@ -5430,7 +5434,7 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	struct flowi6 fl6;
 	int strict = 0;
 	int oif, err;
-	u32 mtu;
+	u32 mtu = 0;

 	/* link local addresses are never forwarded */
 	if (rt6_need_strict(dst) || rt6_need_strict(src))
@@ -5505,8 +5509,10 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	if (check_mtu) {
 		mtu = ipv6_stub->ip6_mtu_from_fib6(&res, dst, src);
-		if (params->tot_len > mtu)
+		if (params->tot_len > mtu) {
+			params->mtu_result = mtu; /* union with tot_len */
 			return BPF_FIB_LKUP_RET_FRAG_NEEDED;
+		}
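The input/output union that this patch adds to struct bpf_fib_lookup can be sketched in plain C. This is an illustrative model only (the struct and function names `fib_params_model` and `fib_mtu_check` are made up): the same 16-bit slot carries 'tot_len' going in and 'mtu_result' coming back on the FRAG_NEEDED path.

```c
#include <stdbool.h>

/* Sketch of the union trick used by struct bpf_fib_lookup above:
 * the same 16-bit slot is 'tot_len' on input and 'mtu_result' on
 * output.  Names here are illustrative, not the kernel's.
 */
struct fib_params_model {
	union {
		unsigned short tot_len;    /* input: L3 packet length */
		unsigned short mtu_result; /* output: MTU that was exceeded */
	};
};

/* Returns true on the "FRAG_NEEDED" path; there the union is
 * overwritten with the MTU so the caller can build an ICMP
 * "fragmentation needed" reply for PMTU discovery. */
static bool fib_mtu_check(struct fib_params_model *p, unsigned short mtu)
{
	if (p->tot_len > mtu) {
		p->mtu_result = mtu; /* union with tot_len */
		return true;
	}
	return false;
}
```

The design choice mirrors the patch: reusing the slot avoids growing the uapi struct, at the cost that the input length is no longer readable after a failed lookup.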
[PATCH bpf-next V12 2/7] bpf: fix bpf_fib_lookup helper MTU check for SKB ctx
A BPF end-user on the Cilium slack channel (Carlo Carraro) wants to use bpf_fib_lookup for doing an MTU check, but *prior* to extending the packet size, by adjusting fib_params 'tot_len' with the packet length plus the expected encap size. (Just like the bpf_check_mtu helper supports.)

He discovered that for the SKB ctx the params->tot_len was not used; instead skb->len was used (via the MTU check in is_skb_forwardable() that checks against the netdev MTU).

Fix this by using fib_params 'tot_len' for the MTU check. If it is not provided (e.g. zero) then keep the existing TC behaviour intact. Notice that the 'tot_len' MTU check is done like in the XDP code-path, which checks against the FIB-dst MTU.

V10:
- Use same method as XDP for 'tot_len' MTU check

Fixes: 4c79579b44b1 ("bpf: Change bpf_fib_lookup to return lookup status")
Reported-by: Carlo Carraro
Signed-off-by: Jesper Dangaard Brouer
---
 net/core/filter.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 5beadd659091..d5e6f395cf64 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5569,6 +5569,7 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
 {
 	struct net *net = dev_net(skb->dev);
 	int rc = -EAFNOSUPPORT;
+	bool check_mtu = false;

 	if (plen < sizeof(*params))
 		return -EINVAL;
@@ -5576,22 +5577,28 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
 	if (flags & ~(BPF_FIB_LOOKUP_DIRECT | BPF_FIB_LOOKUP_OUTPUT))
 		return -EINVAL;

+	if (params->tot_len)
+		check_mtu = true;
+
 	switch (params->family) {
 #if IS_ENABLED(CONFIG_INET)
 	case AF_INET:
-		rc = bpf_ipv4_fib_lookup(net, params, flags, false);
+		rc = bpf_ipv4_fib_lookup(net, params, flags, check_mtu);
 		break;
 #endif
 #if IS_ENABLED(CONFIG_IPV6)
 	case AF_INET6:
-		rc = bpf_ipv6_fib_lookup(net, params, flags, false);
+		rc = bpf_ipv6_fib_lookup(net, params, flags, check_mtu);
 		break;
 #endif
 	}

-	if (!rc) {
+	if (rc == BPF_FIB_LKUP_RET_SUCCESS && !check_mtu) {
 		struct net_device *dev;

+		/* When tot_len isn't provided by user, check skb
+		 * against net_device MTU
+		 */
 		dev = dev_get_by_index_rcu(net, params->ifindex);
 		if (!is_skb_forwardable(dev, skb))
 			rc = BPF_FIB_LKUP_RET_FRAG_NEEDED;
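The control flow this fix introduces can be modeled in a few lines of userspace C (purely illustrative; `skb_fib_mtu_model` and the RET_* constants are made up, not kernel symbols): a non-zero user-supplied tot_len enables the FIB-level MTU check like the XDP path, while tot_len == 0 falls back to the old behaviour of checking skb->len against the netdev limit.

```c
#include <stdbool.h>

#define RET_SUCCESS     0
#define RET_FRAG_NEEDED 1

/* Illustrative model of the bpf_skb_fib_lookup() MTU-check selection:
 * tot_len != 0 -> check tot_len against the FIB-dst MTU (XDP-style);
 * tot_len == 0 -> old TC behaviour, skb length vs netdev limit
 * (what is_skb_forwardable() does in the kernel).
 */
static int skb_fib_mtu_model(unsigned short tot_len, unsigned int fib_mtu,
			     unsigned int skb_len, unsigned int netdev_limit)
{
	bool check_mtu = tot_len != 0;

	if (check_mtu) /* FIB-dst MTU check, like the XDP code-path */
		return tot_len > fib_mtu ? RET_FRAG_NEEDED : RET_SUCCESS;

	/* fallback: skb->len against the netdev MTU */
	return skb_len > netdev_limit ? RET_FRAG_NEEDED : RET_SUCCESS;
}
```

Note how, when tot_len is supplied, the actual skb length is deliberately ignored — that is exactly what lets a BPF-prog query "would this fit *after* adding encap?" before growing the packet.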
[PATCH bpf-next V12 7/7] bpf/selftests: tests using bpf_check_mtu BPF-helper
Adding selftest for BPF-helper bpf_check_mtu(). Making sure it can be used from both XDP and TC. V11: - Addresse nitpicks from Andrii Nakryiko V10: - Remove errno non-zero test in CHECK_ATTR() - Addresse comments from Andrii Nakryiko Signed-off-by: Jesper Dangaard Brouer Acked-by: Andrii Nakryiko --- tools/testing/selftests/bpf/prog_tests/check_mtu.c | 216 tools/testing/selftests/bpf/progs/test_check_mtu.c | 198 ++ 2 files changed, 414 insertions(+) create mode 100644 tools/testing/selftests/bpf/prog_tests/check_mtu.c create mode 100644 tools/testing/selftests/bpf/progs/test_check_mtu.c diff --git a/tools/testing/selftests/bpf/prog_tests/check_mtu.c b/tools/testing/selftests/bpf/prog_tests/check_mtu.c new file mode 100644 index ..9e2fd01b7c65 --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/check_mtu.c @@ -0,0 +1,216 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2020 Jesper Dangaard Brouer */ + +#include /* before test_progs.h, avoid bpf_util.h redefines */ +#include +#include "test_check_mtu.skel.h" +#include "network_helpers.h" + +#include +#include + +#define IFINDEX_LO 1 + +static __u32 duration; /* Hint: needed for CHECK macro */ + +static int read_mtu_device_lo(void) +{ + const char *filename = "/sys/class/net/lo/mtu"; + char buf[11] = {}; + int value, n, fd; + + fd = open(filename, 0, O_RDONLY); + if (fd == -1) + return -1; + + n = read(fd, buf, sizeof(buf)); + close(fd); + + if (n == -1) + return -2; + + value = strtoimax(buf, NULL, 10); + if (errno == ERANGE) + return -3; + + return value; +} + +static void test_check_mtu_xdp_attach() +{ + struct bpf_link_info link_info; + __u32 link_info_len = sizeof(link_info); + struct test_check_mtu *skel; + struct bpf_program *prog; + struct bpf_link *link; + int err = 0; + int fd; + + skel = test_check_mtu__open_and_load(); + if (CHECK(!skel, "open and load skel", "failed")) + return; /* Exit if e.g. 
helper unknown to kernel */ + + prog = skel->progs.xdp_use_helper_basic; + + link = bpf_program__attach_xdp(prog, IFINDEX_LO); + if (CHECK(IS_ERR(link), "link_attach", "failed: %ld\n", PTR_ERR(link))) + goto out; + skel->links.xdp_use_helper_basic = link; + + memset(&link_info, 0, sizeof(link_info)); + fd = bpf_link__fd(link); + err = bpf_obj_get_info_by_fd(fd, &link_info, &link_info_len); + if (CHECK(err, "link_info", "failed: %d\n", err)) + goto out; + + CHECK(link_info.type != BPF_LINK_TYPE_XDP, "link_type", + "got %u != exp %u\n", link_info.type, BPF_LINK_TYPE_XDP); + CHECK(link_info.xdp.ifindex != IFINDEX_LO, "link_ifindex", + "got %u != exp %u\n", link_info.xdp.ifindex, IFINDEX_LO); + + err = bpf_link__detach(link); + CHECK(err, "link_detach", "failed %d\n", err); + +out: + test_check_mtu__destroy(skel); +} + +static void test_check_mtu_run_xdp(struct test_check_mtu *skel, + struct bpf_program *prog, + __u32 mtu_expect) +{ + const char *prog_name = bpf_program__name(prog); + int retval_expect = XDP_PASS; + __u32 mtu_result = 0; + char buf[256] = {}; + int err; + struct bpf_prog_test_run_attr tattr = { + .repeat = 1, + .data_in = &pkt_v4, + .data_size_in = sizeof(pkt_v4), + .data_out = buf, + .data_size_out = sizeof(buf), + .prog_fd = bpf_program__fd(prog), + }; + + err = bpf_prog_test_run_xattr(&tattr); + CHECK_ATTR(err != 0, "bpf_prog_test_run", + "prog_name:%s (err %d errno %d retval %d)\n", + prog_name, err, errno, tattr.retval); + + CHECK(tattr.retval != retval_expect, "retval", + "progname:%s unexpected retval=%d expected=%d\n", + prog_name, tattr.retval, retval_expect); + + /* Extract MTU that BPF-prog got */ + mtu_result = skel->bss->global_bpf_mtu_xdp; + ASSERT_EQ(mtu_result, mtu_expect, "MTU-compare-user"); +} + + +static void test_check_mtu_xdp(__u32 mtu, __u32 ifindex) +{ + struct test_check_mtu *skel; + int err; + + skel = test_check_mtu__open(); + if (CHECK(!skel, "skel_open", "failed")) + return; + + /* Update "constants" in BPF-prog *BEFORE* 
libbpf load */ + skel->rodata->GLOBAL_USER_MTU = mtu; + skel->rodata->GLOBAL_USER_IFINDEX = ifindex;
[PATCH bpf-next V12 5/7] bpf: drop MTU check when doing TC-BPF redirect to ingress
The use-case for dropping the MTU check when TC-BPF does redirect to ingress, is described by Eyal Birger in email[0]. The summary is the ability to increase packet size (e.g. with IPv6 headers for NAT64) and ingress redirect packet and let normal netstack fragment packet as needed. [0] https://lore.kernel.org/netdev/CAHsH6Gug-hsLGHQ6N0wtixdOa85LDZ3HNRHVd0opR=19qo4...@mail.gmail.com/ V9: - Make net_device "up" (IFF_UP) check explicit in skb_do_redirect V4: - Keep net_device "up" (IFF_UP) check. - Adjustment to handle bpf_redirect_peer() helper Signed-off-by: Jesper Dangaard Brouer --- include/linux/netdevice.h | 32 ++-- net/core/dev.c| 32 ++-- net/core/filter.c |6 +++--- 3 files changed, 47 insertions(+), 23 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 5b949076ed23..b7915484369c 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -3926,14 +3926,42 @@ int xdp_umem_query(struct net_device *dev, u16 queue_id); int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb); int dev_forward_skb(struct net_device *dev, struct sk_buff *skb); +int dev_forward_skb_nomtu(struct net_device *dev, struct sk_buff *skb); bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff *skb); +static __always_inline bool __is_skb_forwardable(const struct net_device *dev, +const struct sk_buff *skb, +const bool check_mtu) +{ + const u32 vlan_hdr_len = 4; /* VLAN_HLEN */ + unsigned int len; + + if (!(dev->flags & IFF_UP)) + return false; + + if (!check_mtu) + return true; + + len = dev->mtu + dev->hard_header_len + vlan_hdr_len; + if (skb->len <= len) + return true; + + /* if TSO is enabled, we don't care about the length as the packet +* could be forwarded without being segmented before +*/ + if (skb_is_gso(skb)) + return true; + + return false; +} + static __always_inline int dev_forward_skb(struct net_device *dev, - struct sk_buff *skb) + struct sk_buff *skb, + const bool check_mtu) { if 
(skb_orphan_frags(skb, GFP_ATOMIC) || - unlikely(!is_skb_forwardable(dev, skb))) { + unlikely(!__is_skb_forwardable(dev, skb, check_mtu))) { atomic_long_inc(&dev->rx_dropped); kfree_skb(skb); return NET_RX_DROP; diff --git a/net/core/dev.c b/net/core/dev.c index bae35c1ae192..87bb2cd62189 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -2194,28 +2194,14 @@ static inline void net_timestamp_set(struct sk_buff *skb) bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff *skb) { - unsigned int len; - - if (!(dev->flags & IFF_UP)) - return false; - - len = dev->mtu + dev->hard_header_len + VLAN_HLEN; - if (skb->len <= len) - return true; - - /* if TSO is enabled, we don't care about the length as the packet -* could be forwarded without being segmented before -*/ - if (skb_is_gso(skb)) - return true; - - return false; + return __is_skb_forwardable(dev, skb, true); } EXPORT_SYMBOL_GPL(is_skb_forwardable); -int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb) +int __dev_forward_skb2(struct net_device *dev, struct sk_buff *skb, +bool check_mtu) { - int ret = dev_forward_skb(dev, skb); + int ret = dev_forward_skb(dev, skb, check_mtu); if (likely(!ret)) { skb->protocol = eth_type_trans(skb, dev); @@ -2224,6 +2210,11 @@ int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb) return ret; } + +int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb) +{ + return __dev_forward_skb2(dev, skb, true); +} EXPORT_SYMBOL_GPL(__dev_forward_skb); /** @@ -2250,6 +2241,11 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb) } EXPORT_SYMBOL_GPL(dev_forward_skb); +int dev_forward_skb_nomtu(struct net_device *dev, struct sk_buff *skb) +{ + return __dev_forward_skb2(dev, skb, false) ?: netif_rx_internal(skb); +} + static inline int deliver_skb(struct sk_buff *skb, struct packet_type *pt_prev, struct net_device *orig_dev) diff --git a/net/core/filter.c b/net/core/filter.c index 0be81f499f51..07876e9ab5e3 100644 --- 
a/net/core/filter.c +++ b/net/core/filter.c @@ -2083,13 +2083,13 @@ static const struct bpf
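The reworked __is_skb_forwardable() above (now taking a check_mtu flag) can be modeled in standalone userspace C. This is an illustrative sketch under assumed names (`dev_model`, `skb_forwardable_model` are not kernel identifiers); it reproduces the IFF_UP check, the MTU + L2-header + VLAN length budget, and the GSO escape hatch.

```c
#include <stdbool.h>

/* Userspace model of the __is_skb_forwardable() logic in the patch
 * above.  Field names loosely mirror struct net_device. */
struct dev_model {
	bool up;                      /* IFF_UP */
	unsigned int mtu;             /* L3 MTU */
	unsigned int hard_header_len; /* L2 header size */
};

#define VLAN_HDR_LEN 4 /* VLAN_HLEN */

static bool skb_forwardable_model(const struct dev_model *dev,
				  unsigned int skb_len, bool skb_is_gso,
				  bool check_mtu)
{
	if (!dev->up)
		return false;

	if (!check_mtu) /* the new TC-redirect-to-ingress path */
		return true;

	if (skb_len <= dev->mtu + dev->hard_header_len + VLAN_HDR_LEN)
		return true;

	/* if TSO is enabled, we don't care about the length as the
	 * packet could be forwarded without being segmented before */
	return skb_is_gso;
}
```

With check_mtu == false only the "up" check remains, which is the whole point of the patch: let the netstack fragment an over-MTU packet redirected to ingress instead of dropping it here.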
[PATCH bpf-next V12 6/7] selftests/bpf: use bpf_check_mtu in selftest test_cls_redirect
This demonstrates how the bpf_check_mtu() helper can easily be used together with the bpf_skb_adjust_room() helper, prior to doing the size adjustment, as the delta argument is already set up.

Hint: This specific test can be selected like this:
 ./test_progs -t cls_redirect

Signed-off-by: Jesper Dangaard Brouer
---
 .../selftests/bpf/progs/test_cls_redirect.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/tools/testing/selftests/bpf/progs/test_cls_redirect.c b/tools/testing/selftests/bpf/progs/test_cls_redirect.c
index c9f8464996ea..3c1e042962e6 100644
--- a/tools/testing/selftests/bpf/progs/test_cls_redirect.c
+++ b/tools/testing/selftests/bpf/progs/test_cls_redirect.c
@@ -70,6 +70,7 @@ typedef struct {
 	uint64_t errors_total_encap_adjust_failed;
 	uint64_t errors_total_encap_buffer_too_small;
 	uint64_t errors_total_redirect_loop;
+	uint64_t errors_total_encap_mtu_violate;
 } metrics_t;

 typedef enum {
@@ -407,6 +408,7 @@ static INLINING ret_t forward_with_gre(struct __sk_buff *skb, encap_headers_t *e
 		payload_off - sizeof(struct ethhdr) - sizeof(struct iphdr);
 	int32_t delta = sizeof(struct gre_base_hdr) - encap_overhead;
 	uint16_t proto = ETH_P_IP;
+	uint32_t mtu_len = 0;

 	/* Loop protection: the inner packet's TTL is decremented as a safeguard
 	 * against any forwarding loop. As the only interesting field is the TTL
@@ -479,6 +481,11 @@ static INLINING ret_t forward_with_gre(struct __sk_buff *skb, encap_headers_t *e
 		}
 	}

+	if (bpf_check_mtu(skb, skb->ifindex, &mtu_len, delta, 0)) {
+		metrics->errors_total_encap_mtu_violate++;
+		return TC_ACT_SHOT;
+	}
+
 	if (bpf_skb_adjust_room(skb, delta, BPF_ADJ_ROOM_NET,
 				BPF_F_ADJ_ROOM_FIXED_GSO |
 				BPF_F_ADJ_ROOM_NO_CSUM_RESET) ||
[PATCH bpf-next V12 0/7] bpf: New approach for BPF MTU handling
This patchset drops all the MTU checks in TC BPF-helpers that limit growing the packet size. This is done because these BPF-helpers don't take redirect into account, which can result in their MTU check being done against the wrong netdev.

The new approach is to give BPF-programs knowledge about the MTU on a netdev (via ifindex) and at the fib route lookup level. Meaning some BPF-helpers are added and extended to make it possible to do MTU checks in the BPF-code. If the BPF-prog doesn't comply with the MTU, then the packet will eventually get dropped at some other layer. In some cases the existing kernel MTU checks will drop the packet, but there are also cases where BPF can bypass these checks. Specifically, doing a TC-redirect from the ingress step (sch_handle_ingress) into the egress code path (basically calling dev_queue_xmit()). It is left up to driver code to handle these kinds of MTU violations.

One advantage of this approach is that an ingress-to-egress BPF-prog can send information via packet data. With the MTU checks removed in the helpers, and also not done in the skb_do_redirect() call, this allows an ingress BPF-prog to communicate with an egress BPF-prog via packet data, as long as the egress BPF-prog removes this prior to transmitting the packet.

This patchset is primarily focused on TC-BPF, but I've made sure that the MTU BPF-helpers also work for XDP BPF-programs.

V2: Change BPF-helper API from lookup to check.
V3: Drop enforcement of MTU in net-core, leave it to drivers.
V4: Keep sanity limit + netdev "up" checks + rename BPF-helper.
V5: Fix uninit variable + name struct output member mtu_result.
V6: Use bpf_check_mtu() in selftest
V7: Fix logic using tot_len and add another selftest
V8: Add better selftests for BPF-helper bpf_check_mtu
V9: Remove patch that used skb_set_redirected
V10: Fix selftests and 'tot_len' MTU check like XDP
V11: Fix nitpicks in selftests
V12: Adjustments requested by Daniel

---

Jesper Dangaard Brouer (7):
      bpf: Remove MTU check in __bpf_skb_max_len
      bpf: fix bpf_fib_lookup helper MTU check for SKB ctx
      bpf: bpf_fib_lookup return MTU value as output when looked up
      bpf: add BPF-helper for MTU checking
      bpf: drop MTU check when doing TC-BPF redirect to ingress
      selftests/bpf: use bpf_check_mtu in selftest test_cls_redirect
      bpf/selftests: tests using bpf_check_mtu BPF-helper

 include/linux/netdevice.h                          |  32 +++
 include/uapi/linux/bpf.h                           |  78 +++
 net/core/dev.c                                     |  32 +--
 net/core/filter.c                                  | 164 +--
 tools/include/uapi/linux/bpf.h                     |  78 +++
 tools/testing/selftests/bpf/prog_tests/check_mtu.c | 216
 tools/testing/selftests/bpf/progs/test_check_mtu.c | 198 ++
 .../selftests/bpf/progs/test_cls_redirect.c        |   7 +
 8 files changed, 760 insertions(+), 45 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/check_mtu.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_check_mtu.c
--
Re: [PATCHv8 bpf-next] samples/bpf: add xdp program on egress for xdp_redirect_map
return 1; > + } > > if (optind == argc) { > printf("usage: %s _IN _OUT\n", > argv[0]); > @@ -150,24 +191,28 @@ int main(int argc, char **argv) > if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd)) > return 1; > > - prog = bpf_program__next(NULL, obj); > - dummy_prog = bpf_program__next(prog, obj); > - if (!prog || !dummy_prog) { > - printf("finding a prog in obj file failed\n"); > - return 1; > + if (xdp_flags & XDP_FLAGS_SKB_MODE) { > + prog = bpf_object__find_program_by_name(obj, > "xdp_redirect_map_general"); > + tx_port_map_fd = bpf_object__find_map_fd_by_name(obj, > "tx_port_general"); > + } else { > + prog = bpf_object__find_program_by_name(obj, > "xdp_redirect_map_native"); > + tx_port_map_fd = bpf_object__find_map_fd_by_name(obj, > "tx_port_native"); > } > - /* bpf_prog_load_xattr gives us the pointer to first prog's fd, > - * so we're missing only the fd for dummy prog > - */ > + dummy_prog = bpf_object__find_program_by_name(obj, > "xdp_redirect_dummy_prog"); > + if (!prog || dummy_prog < 0 || tx_port_map_fd < 0) { > + printf("finding prog/dummy_prog/tx_port_map in obj file > failed\n"); > + goto out; > + } > + prog_fd = bpf_program__fd(prog); > dummy_prog_fd = bpf_program__fd(dummy_prog); > - if (prog_fd < 0 || dummy_prog_fd < 0) { > + if (prog_fd < 0 || dummy_prog_fd < 0 || tx_port_map_fd < 0) { > printf("bpf_prog_load_xattr: %s\n", strerror(errno)); > return 1; > } > > - tx_port_map_fd = bpf_object__find_map_fd_by_name(obj, "tx_port"); > + tx_mac_map_fd = bpf_object__find_map_fd_by_name(obj, "tx_mac"); > rxcnt_map_fd = bpf_object__find_map_fd_by_name(obj, "rxcnt"); > - if (tx_port_map_fd < 0 || rxcnt_map_fd < 0) { > + if (tx_mac_map_fd < 0 || rxcnt_map_fd < 0) { > printf("bpf_object__find_map_fd_by_name failed\n"); > return 1; > } > @@ -199,11 +244,66 @@ int main(int argc, char **argv) > } > dummy_prog_id = info.id; > > + /* Load 2nd xdp prog on egress. 
*/ > + if (xdp_devmap_attached) { > + unsigned char mac_addr[6]; > + > + devmap_prog = bpf_object__find_program_by_name(obj, > "xdp_redirect_map_egress_0"); > + if (!devmap_prog) { > + printf("finding devmap_prog in obj file failed\n"); > + goto out; > + } > + devmap_prog_fd_0 = bpf_program__fd(devmap_prog); > + if (devmap_prog_fd_0 < 0) { > + printf("finding devmap_prog fd failed\n"); > + goto out; > + } > + > + devmap_prog = bpf_object__find_program_by_name(obj, > "xdp_redirect_map_egress_1"); > + if (!devmap_prog) { > + printf("finding devmap_prog in obj file failed\n"); > + goto out; > + } > + devmap_prog_fd_1 = bpf_program__fd(devmap_prog); > + if (devmap_prog_fd_1 < 0) { > + printf("finding devmap_prog fd failed\n"); > + goto out; > + } > + > + if (get_mac_addr(ifindex_out, mac_addr) < 0) { > + printf("get interface %d mac failed\n", ifindex_out); > + goto out; > + } > + > + ret = bpf_map_update_elem(tx_mac_map_fd, &key, mac_addr, 0); > + if (ret) { > + perror("bpf_update_elem tx_mac_map_fd"); > + goto out; > + } > + } > + > signal(SIGINT, int_exit); > signal(SIGTERM, int_exit); > > - /* populate virtual to physical port map */ > - ret = bpf_map_update_elem(tx_port_map_fd, &key, &ifindex_out, 0); > + /* devmap prog 0 will set src mac to 00:00:00:00:00:01 > + * if 2nd xdp prog attached on egress > + */ > + key = 0; > + devmap_val.ifindex = ifindex_out; (ifindex_out same as below) > + devmap_val.bpf_prog.fd = devmap_prog_fd_0; > + ret = bpf_map_update_elem(tx_port_map_fd, &key, &devmap_val, 0); > + if (ret) { > + perror("bpf_update_elem"); > + goto out; > + } > + > + /* devmap prog 1 will set src mac to 00:00:00:00:01:01 > + * if 2nd xdp prog attached on egress > + */ > + key = 1; > + devmap_val.ifindex = ifindex_out; (ifindex_out same as above) > + devmap_val.bpf_prog.fd = devmap_prog_fd_1; > + ret = bpf_map_update_elem(tx_port_map_fd, &key, &devmap_val, 0); > if (ret) { > perror("bpf_update_elem"); > goto out; -- Best regards, Jesper Dangaard Brouer MSc.CS, 
Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH bpf-next v2 0/8] Introduce bpf_redirect_xsk() helper
On Wed, 20 Jan 2021 15:15:22 +0200 Maxim Mikityanskiy wrote: > On 2021-01-19 17:50, Björn Töpel wrote: > > This series extends bind() for XDP sockets, so that the bound socket > > is added to the netdev_rx_queue _rx array in the netdevice. We call > > this to register the socket. To redirect packets to the registered > > socket, a new BPF helper is used: bpf_redirect_xsk(). > > > > For shared XDP sockets, only the first bound socket is > > registered. Users that need more complex setup has to use XSKMAP and > > bpf_redirect_map(). > > > > Now, why would one use bpf_redirect_xsk() over the regular > > bpf_redirect_map() helper? > > > > * Better performance! > > * Convenience; Most user use one socket per queue. This scenario is > >what registered sockets support. There is no need to create an > >XSKMAP. This can also reduce complexity from containerized setups, > >where users might what to use XDP sockets without CAP_SYS_ADMIN > >capabilities. I'm buying into the convenience and reduce complexity, and XDP sockets without CAP_SYS_ADMIN into containers. People might be surprised that I'm actually NOT buying into the better performance argument here. At these speeds we are basically comparing how close we are to zero (and have to use nanosec time scale for our comparisons), more below. > > The first patch restructures xdp_do_redirect() a bit, to make it > > easier to add the new helper. This restructure also give us a slight > > performance benefit. The following three patches extends bind() and > > adds the new helper. After that, two libbpf patches that selects XDP > > program based on what kernel is running. Finally, selftests for the new > > functionality is added. > > > > Note that the libbpf "auto-selection" is based on kernel version, so > > it is hard coded to the "-next" version (5.12). If you would like to > > try this is out, you will need to change the libbpf patch locally! > > > > Thanks to Maciej and Magnus for the internal review/comments! 
> >
> > Performance (rxdrop, zero-copy)
> >
> > Baseline
> >   Two cores: 21.3 Mpps
> >   One core:  24.5 Mpps
>
> Two cores is slower? It used to be faster all the time, didn't it?
>
> > Patched
> >   Two cores, bpf_redirect_map: 21.7 Mpps + 2%
> >   One core,  bpf_redirect_map: 24.9 Mpps + 2%
> >
> >   Two cores, bpf_redirect_xsk: 24.0 Mpps +13%
>
> Nice, impressive improvement!

I do appreciate your work and performance optimizations at this level,
because when we are using this few CPU cycles per packet, it is really
hard to find new ways to reduce cycles further.

Thank you for saying +13% instead of saying +2.7 Mpps.

 21.3 Mpps = 46.94 nanosec per packet
 24.0 Mpps = 41.66 nanosec per packet

 21.3 Mpps -> 24.0 Mpps = 5.28 nanosec saved

On my 3.60GHz testlab machine that gives me 19 cycles.

> >   One core,  bpf_redirect_xsk: 25.5 Mpps + 4%

 24.5 Mpps -> 25.5 Mpps = 1.6 nanosec saved

At this point, saving 1.6 ns is around the cost of a function call
(1.3 ns).

We still need these optimizations in the kernel, but end-users in
userspace are very quickly going to waste the 19 cycles we found.
I still support/believe that the OS needs to have as little overhead as
possible, but for me 42 nanosec overhead is close to zero overhead. For
comparison, I just ran a udp_sink[3] test, and it "cost" 625 ns for
delivery of UDP packets into the socket (almost 15 times slower).

I guess my point is that with XDP we have already achieved and exceeded
(my original) performance goals; making it even faster is just showing
off ;-P

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c
[2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_memset.c
[3] https://github.com/netoptimizer/network-testing/blob/master/src/udp_sink.c
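The back-of-envelope arithmetic in the mail above (Mpps to nanoseconds per packet, and the saved nanoseconds to CPU cycles at a given clock) is easy to get wrong by a factor of 1000, so here it is as two tiny helper functions. Function names are my own for illustration; the 3.60 GHz figure is the testlab clock quoted in the mail.

```c
/* Convert a packet rate in millions of packets per second (Mpps)
 * to the per-packet time budget in nanoseconds:
 *   1e9 ns/s  /  (mpps * 1e6 pkt/s)  =  1000 / mpps  ns/pkt
 */
static double ns_per_packet(double mpps)
{
	return 1000.0 / mpps;
}

/* Convert a nanosecond delta to CPU cycles at a given clock in GHz
 * (cycles = ns * GHz, since 1 GHz == 1 cycle per nanosecond). */
static double ns_to_cycles(double ns, double ghz)
{
	return ns * ghz;
}
```

Plugging in the numbers from the mail: 21.3 Mpps is ~46.95 ns/pkt, 24.0 Mpps is ~41.67 ns/pkt, the delta is ~5.28 ns, which at 3.6 GHz is ~19 cycles — matching the figures quoted above.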
Re: [PATCHv14 bpf-next 1/6] bpf: run devmap xdp_prog on flush instead of bulk enqueue
On Mon, 18 Jan 2021 18:07:17 +0800 Hangbin Liu wrote: > On Sun, Jan 17, 2021 at 02:57:02PM -0800, John Fastabend wrote: > [...] > > It looks like we could embed xdp_buff in xdp_frame and then keep the > > metadata > > at the end. > > > > Because you are working performance here wdyt? <- @Jesper as well. > > Leave this question to Jesper. The struct xdp_buff is larger than struct xdp_frame. The size of xdp_frame matters. It is a reserved areas in top of the frame. An XDP BPF-program cannot access this area (and limit headroom grow). This is why this code works, as afterwards xdp_frame is still valid. Looking at the code xdp_update_frame_from_buff() we do seem to update more fields than actually needed. > > > > > > - sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags); > > > + if (unlikely(bq->xdp_prog)) { > > > > Whats the rational for making above unlikely()? Seems for users its not > > unlikely. Can you measure a performance increase/decrease here? I think > > its probably fine to just let compiler/prefetcher do its thing here. Or > > I'm not reading this right, but seems users of bq->xdp_prog would disagree > > on unlikely case? > > > > Either way a comment might be nice to give us some insight in 6 months > > why we decided this is unlikely. > > I agree that there is no need to use unlikely() here. I added the unlikely() to preserve the baseline performance when not having the 2nd prog loaded. But I'm fine with removing that. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCHv9 bpf-next] samples/bpf: add xdp program on egress for xdp_redirect_map
On Thu, 21 Jan 2021 21:06:42 +0800 Hangbin Liu wrote: > This patch add a xdp program on egress to show that we can modify > the packet on egress. In this sample we will set the pkt's src > mac to egress's mac address. The xdp_prog will be attached when > -X option supplied. > > Signed-off-by: Hangbin Liu > > --- > v9: > roll back to just set src mac to egress interface mac on 2nd xdp prog, > this could avoid packet reorder introduce in patch v6. I like this V9 much better. One (hopefully) last nitpick below. > diff --git a/samples/bpf/xdp_redirect_map_kern.c > b/samples/bpf/xdp_redirect_map_kern.c > index 6489352ab7a4..35ce5e26bbe5 100644 > --- a/samples/bpf/xdp_redirect_map_kern.c > +++ b/samples/bpf/xdp_redirect_map_kern.c > -SEC("xdp_redirect_map") > -int xdp_redirect_map_prog(struct xdp_md *ctx) > +static int xdp_redirect_map(struct xdp_md *ctx, void *redirect_map) Can you make this function always inline? > { > void *data_end = (void *)(long)ctx->data_end; > void *data = (void *)(long)ctx->data; > struct ethhdr *eth = data; > int rc = XDP_DROP; > - int vport, port = 0, m = 0; > long *value; > u32 key = 0; > u64 nh_off; > + int vport; > > nh_off = sizeof(*eth); > if (data + nh_off > data_end) > @@ -79,7 +96,40 @@ int xdp_redirect_map_prog(struct xdp_md *ctx) > swap_src_dst_mac(data); > > /* send packet out physical port */ > - return bpf_redirect_map(&tx_port, vport, 0); > + return bpf_redirect_map(redirect_map, vport, 0); > +} > + > +SEC("xdp_redirect_general") > +int xdp_redirect_map_general(struct xdp_md *ctx) > +{ > + return xdp_redirect_map(ctx, &tx_port_general); > +} > + > +SEC("xdp_redirect_native") > +int xdp_redirect_map_native(struct xdp_md *ctx) > +{ > + return xdp_redirect_map(ctx, &tx_port_native); > +} -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH bpf-next v2 0/8] Introduce bpf_redirect_xsk() helper
On Wed, 20 Jan 2021 17:19:31 +0100 Maciej Fijalkowski wrote: > On Wed, Jan 20, 2021 at 04:57:08PM +0100, Jesper Dangaard Brouer wrote: > > On Wed, 20 Jan 2021 15:15:22 +0200 > > Maxim Mikityanskiy wrote: > > > > > On 2021-01-19 17:50, Björn Töpel wrote: > > > > This series extends bind() for XDP sockets, so that the bound socket > > > > is added to the netdev_rx_queue _rx array in the netdevice. We call > > > > this to register the socket. To redirect packets to the registered > > > > socket, a new BPF helper is used: bpf_redirect_xsk(). > > > > > > > > For shared XDP sockets, only the first bound socket is > > > > registered. Users that need more complex setup has to use XSKMAP and > > > > bpf_redirect_map(). > > > > > > > > Now, why would one use bpf_redirect_xsk() over the regular > > > > bpf_redirect_map() helper? > > > > > > > > * Better performance! > > > > * Convenience; Most user use one socket per queue. This scenario is > > > >what registered sockets support. There is no need to create an > > > >XSKMAP. This can also reduce complexity from containerized setups, > > > >where users might what to use XDP sockets without CAP_SYS_ADMIN > > > >capabilities. > > > > I'm buying into the convenience and reduce complexity, and XDP sockets > > without CAP_SYS_ADMIN into containers. > > > > People might be surprised that I'm actually NOT buying into the better > > performance argument here. At these speeds we are basically comparing > > how close we are to zero (and have to use nanosec time scale for our > > comparisons), more below. > > > > > > > > The first patch restructures xdp_do_redirect() a bit, to make it > > > > easier to add the new helper. This restructure also give us a slight > > > > performance benefit. The following three patches extends bind() and > > > > adds the new helper. After that, two libbpf patches that selects XDP > > > > program based on what kernel is running. Finally, selftests for the new > > > > functionality is added. 
> > > > > > > > Note that the libbpf "auto-selection" is based on kernel version, so > > > > it is hard coded to the "-next" version (5.12). If you would like to > > > > try this is out, you will need to change the libbpf patch locally! > > > > > > > > Thanks to Maciej and Magnus for the internal review/comments! > > > > > > > > Performance (rxdrop, zero-copy) > > > > > > > > Baseline > > > > Two cores: 21.3 Mpps > > > > One core:24.5 Mpps > > > > > > Two cores is slower? It used to be faster all the time, didn't it? > > > > > > > Patched > > > > Two cores, bpf_redirect_map: 21.7 Mpps + 2% > > > > One core, bpf_redirect_map: 24.9 Mpps + 2% > > > > > > > > Two cores, bpf_redirect_xsk: 24.0 Mpps +13% > > > > > > Nice, impressive improvement! > > > > I do appreciate you work and performance optimizations at this level, > > because when we are using this few CPU cycles per packet, then it is > > really hard to find new ways to reduce cycles further. > > > > Thank for you saying +13% instead of saying +2.7 Mpps. > > It *is* impressive to basically reduce cycles with 13%. > > > > 21.3 Mpps = 46.94 nanosec per packet > > 24.0 Mpps = 41.66 nanosec per packet > > > > 21.3 Mpps -> 24.0 Mpps = 5.28 nanosec saved > > > > On my 3.60GHz testlab machine that gives me 19 cycles. > > > > > > > > One core, bpf_redirect_xsk: 25.5 Mpps + 4% > > > > > > > > 24.5 Mpps -> 25.5 Mpps = 1.6 nanosec saved > > > > At this point with saving 1.6 ns this is around the cost of a function call > > 1.3 ns. > > > > > > We still need these optimization in the kernel, but end-users in > > userspace are very quickly going to waste the 19 cycles we found. > > I still support/believe that the OS need to have a little overhead as > > possible, but for me 42 nanosec overhead is close to zero overhead. For > > comparison, I just ran a udp_sink[3] test, and it "cost" 625 ns for > > delivery of UDP packets into socket (almost 15 times slower). 
> > > > I guess my point is that with XDP we have already achieved and exceed
Re: [PATCH net-next] sfc: reduce the number of requested xdp ev queues
On Wed, 20 Jan 2021 13:27:59 -0800 Ivan Babrou wrote: > Without this change the driver tries to allocate too many queues, > breaching the number of available msi-x interrupts on machines > with many logical cpus and default adapter settings: > > Insufficient resources for 12 XDP event queues (24 other channels, max 32) > > Which in turn triggers EINVAL on XDP processing: > > sfc :86:00.0 ext0: XDP TX failed (-22) > > Signed-off-by: Ivan Babrou > --- I guess the patch is good in itself, given the limited number of available MSI-X interrupts. Per earlier discussion: what will happen if a CPU with an ID higher than the number of available XDP TX-queues redirects a packet out of this driver? > drivers/net/ethernet/sfc/efx_channels.c | 6 -- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/drivers/net/ethernet/sfc/efx_channels.c > b/drivers/net/ethernet/sfc/efx_channels.c > index a4a626e9cd9a..1bfeee283ea9 100644 > --- a/drivers/net/ethernet/sfc/efx_channels.c > +++ b/drivers/net/ethernet/sfc/efx_channels.c > @@ -17,6 +17,7 @@ > #include "rx_common.h" > #include "nic.h" > #include "sriov.h" > +#include "workarounds.h" > > /* This is the first interrupt mode to try out of: > * 0 => MSI-X > @@ -137,6 +138,7 @@ static int efx_allocate_msix_channels(struct efx_nic *efx, > { > unsigned int n_channels = parallelism; > int vec_count; > + int tx_per_ev; > int n_xdp_tx; > int n_xdp_ev; > > @@ -149,9 +151,9 @@ static int efx_allocate_msix_channels(struct efx_nic *efx, >* multiple tx queues, assuming tx and ev queues are both >* maximum size. >*/ > - > + tx_per_ev = EFX_MAX_EVQ_SIZE / EFX_TXQ_MAX_ENT(efx); > n_xdp_tx = num_possible_cpus(); > - n_xdp_ev = DIV_ROUND_UP(n_xdp_tx, EFX_MAX_TXQ_PER_CHANNEL); > + n_xdp_ev = DIV_ROUND_UP(n_xdp_tx, tx_per_ev); > > vec_count = pci_msix_vec_count(efx->pci_dev); > if (vec_count < 0) -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCHv10 bpf-next] samples/bpf: add xdp program on egress for xdp_redirect_map
On Fri, 22 Jan 2021 10:50:07 +0800 Hangbin Liu wrote: > This patch add a xdp program on egress to show that we can modify > the packet on egress. In this sample we will set the pkt's src > mac to egress's mac address. The xdp_prog will be attached when > -X option supplied. > > Signed-off-by: Hangbin Liu > --- I think we have nitpicked this enough ;-) Acked-by: Jesper Dangaard Brouer > v10: > make xdp_redirect_map() always inline. > > v9: > roll back to just set src mac to egress interface mac on 2nd xdp prog, > this could avoid packet reorder introduce in patch v6. > > v8: Fix some checkpatch issues. > > v7: > a) use bpf_object__find_program_by_name() instad of >bpf_object__find_program_by_title() > b) set default devmap fd to 0 > > v6: no code update, only rebase the code on latest bpf-next > > v5: > a) close fd when err out in get_mac_addr() > b) exit program when both -S and -X supplied. > > v4: > a) Update get_mac_addr socket create > b) Load dummy prog regardless of 2nd xdp prog on egress > > v3: > a) modify the src mac address based on egress mac > > v2: > a) use pkt counter instead of IP ttl modification on egress program > b) make the egress program selectable by option -X > --- > samples/bpf/xdp_redirect_map_kern.c | 60 +-- > samples/bpf/xdp_redirect_map_user.c | 112 +++- > 2 files changed, 147 insertions(+), 25 deletions(-) -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Explaining XDP redirect bulk size design (Was: [PATCH v2 bpf 1/5] net: ethtool: add xdp properties flag set)
On Wed, 9 Dec 2020 08:44:33 -0700 David Ahern wrote: > On 12/9/20 4:52 AM, Jesper Dangaard Brouer wrote: > > But I have redesigned the ndo_xdp_xmit call to take a bulk of packets > > (up-to 16) so it should not be a problem to solve this by sharing > > TX-queue and talking a lock per 16 packets. I still recommend that, > > for fallback case, you allocated a number a TX-queue and distribute > > this across CPUs to avoid hitting a congested lock (above measurements > > are the optimal non-congested atomic lock operation) > > I have been meaning to ask you why 16 for the XDP batching? If the > netdev budget is 64, why not something higher like 32 or 64? Thank you for asking, as there are multiple good reasons and considerations behind this 16 batch size. Notice cpumap has batch size 8, which is also an explicit choice. And AF_XDP went in the wrong direction IMHO and I think has 256. I designed this to be a choice in the map code, for the level of bulking it needs/wants. The low level explanation is that these 8 and 16 batch sizes are optimized towards cache sizes and Intel's Line-Fill-Buffer (prefetcher with 10 elements). I'm betting that the memory backing these 8 or 16 packets has a higher chance of remaining in cache, and I can prefetch them without evicting them from cache again. In some cases the pointers to these packets are queued into a ptr_ring, and it is more optimal to write cacheline sizes 1 (8 pointers) or 2 (16 pointers) into the ptr_ring. The general explanation is my goal to do bulking without adding latency. This is explicitly stated in my presentation[1] as of Feb 2016, slide 20. Sure, you/we can likely make the micro-benchmarks look better by using 64 batch size, but that will introduce added latency and likely shoot ourselves in the foot for real workloads. With experience from bufferbloat and real networks, we know that massive TX bulking has bad effects. 
Still, XDP-redirect does massive bulking (the NIC flush is after the full 64 budget) and we don't have pushback or a queue mechanism (so I know we are already shooting ourselves in the foot) ... Fortunately we now have a PhD student working on queuing for XDP. It is also important to understand that this is an adaptive bulking scheme, which comes from NAPI. We don't wait for packets arriving shortly; we pick up what the NIC has available, but by only taking 8 or 16 packets (instead of emptying the entire RX-queue) and then spending some time to send them along, I'm hoping the NIC could have received some more frames. For cpumap and veth (in some cases) they can start to consume packets from these batches, but NIC drivers get the XDP_XMIT_FLUSH signal at NAPI-end (xdp_do_flush). Still, the design allows NIC drivers to update their internal queue state (and BQL), and if it gets close to full they can choose to flush/doorbell the NIC earlier. When doing queuing for XDP we need to expose these NIC queue states, and having 4 calls with 16 packets (64 budget) also gives us more chances to get NIC queue state info, which the NIC already touches. [1] https://people.netfilter.org/hawk/presentations/devconf2016/net_stack_challenges_100G_Feb2016.pdf -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [Intel-wired-lan] Explaining XDP redirect bulk size design (Was: [PATCH v2 bpf 1/5] net: ethtool: add xdp properties flag set)
On Thu, 10 Dec 2020 15:14:18 +0100 Magnus Karlsson wrote: > On Thu, Dec 10, 2020 at 2:32 PM Jesper Dangaard Brouer > wrote: > > > > On Wed, 9 Dec 2020 08:44:33 -0700 > > David Ahern wrote: > > > > > On 12/9/20 4:52 AM, Jesper Dangaard Brouer wrote: > > > > But I have redesigned the ndo_xdp_xmit call to take a bulk of packets > > > > (up-to 16) so it should not be a problem to solve this by sharing > > > > TX-queue and talking a lock per 16 packets. I still recommend that, > > > > for fallback case, you allocated a number a TX-queue and distribute > > > > this across CPUs to avoid hitting a congested lock (above measurements > > > > are the optimal non-congested atomic lock operation) > > > > > > I have been meaning to ask you why 16 for the XDP batching? If the > > > netdev budget is 64, why not something higher like 32 or 64? > > > > Thanks you for asking as there are multiple good reasons and > > consideration for this 16 batch size. Notice cpumap have batch size 8, > > which is also an explicit choice. And AF_XDP went in the wrong > > direction IMHO and I think have 256. I designed this to be a choice in > > the map code, for the level of bulking it needs/wants. > > FYI, as far as I know, there is nothing in AF_XDP that says bulking > should be 256. There is a 256 number in the i40e driver that states > the maximum number of packets to be sent within one napi_poll loop. > But this is just a maximum number and only for that driver. (In case > you wonder, that number was inherited from the original skb Tx > implementation in the driver.) Ah, that explains the issue I have on the production system that runs the EDT-pacer[2]. I see that the i40e function i40e_clean_tx_irq() ignores napi_budget and uses its own budget, which defaults to 256. Looks like I can adjust this via ethtool -C tx-frames-irq. I turned this down to 64 (32 was giving worse results, and below 16 the system acted strangely). 
Now the issue is gone. The issue was that if TX-DMA completion (i40e_clean_tx_irq()) was running on the same CPU that sends packets via the FQ-pacer qdisc, then the pacing was not accurate and sending was too bursty. The system has already tuned "net/core/dev_weight" and the RX/TX-bias to reduce bulking, as this can influence latency and the EDT-pacing accuracy. (It is a middlebox bridging VLANs, doing BPF-EDT timestamping and FQ-pacing of packets to solve bursts overflowing switch ports.)

sudo sysctl net/core/dev_weight
net.core.dev_weight = 1
net.core.dev_weight_rx_bias = 32
net.core.dev_weight_tx_bias = 1

This net.core.dev_weight_tx_bias=1 (together with dev_weight=1) causes the qdisc transmit budget to become one packet, cycling through NET_TX_SOFTIRQ, which consumes time and gives a little more pacing space for the packets. > The actual batch size is controlled by > the application. If it puts 1 packet in the Tx ring and calls send(), > the batch size will be 1. If it puts 128 packets in the Tx ring and > calls send(), you get a batch size of 128, and so on. It is flexible, > so you can trade-off latency with throughput in the way the > application desires. Rx batch size has also become flexible now with > the introduction of Björn's prefer_busy_poll patch set [1]. > > [1] > https://lore.kernel.org/netdev/20201130185205.196029-1-bjorn.to...@gmail.com/ This looks like a cool trick, to get even more accurate packet scheduling. I played with the tunings, and could see changed behavior with mpstat, but ended up turning it off again, as I could not measure a direct correlation with the bpftrace tools[3]. > > The low level explanation is that these 8 and 16 batch sizes are > > optimized towards cache sizes and Intel's Line-Fill-Buffer (prefetcher > > with 10 elements). I'm betting on that memory backing these 8 or 16 > > packets have higher chance to remain/being in cache, and I can prefetch > > them without evicting them from cache again. 
In some cases the pointer > > to these packets are queued into a ptr_ring, and it is more optimal to > > write cacheline sizes 1 (8 pointers) or 2 (16 pointers) into the ptr_ring. > > > > The general explanation is my goal to do bulking without adding latency. > > This is explicitly stated in my presentation[1] as of Feb 2016, slide 20. > > Sure, you/we can likely make the micro-benchmarks look better by using > > 64 batch size, but that will introduce added latency and likely shoot > > our-selves in the foot for real workloads. With experience from > > bufferbloat and real networks, we know that massive TX bulking have bad > > effects. Still XDP-redirect does massive bulking (NIC flush is after > > full 64 budget) and we don't
Re: [PATCH bpf-next V7 4/8] bpf: add BPF-helper for MTU checking
On Thu, 3 Dec 2020 10:25:41 -0800 s...@google.com wrote: > > +BPF_CALL_5(bpf_skb_check_mtu, struct sk_buff *, skb, > > + u32, ifindex, u32 *, mtu_len, s32, len_diff, u64, flags) > > +{ > > + int ret = BPF_MTU_CHK_RET_FRAG_NEEDED; > > + struct net_device *dev = skb->dev; > > + int len; > > + int mtu; > > + > > + if (flags & ~(BPF_MTU_CHK_SEGS)) > > + return -EINVAL; > > + > > + dev = __dev_via_ifindex(dev, ifindex); > > + if (!dev) > > + return -ENODEV; > > + > > + mtu = READ_ONCE(dev->mtu); > > + > > + /* TC len is L2, remove L2-header as dev MTU is L3 size */ > > [..] > > + len = skb->len - ETH_HLEN; > Any reason not to do s/ETH_HLEN/dev->hard_header_len/ (or min_header_len?) > thought this patch? Will fix in V9. There is a very small (performance) overhead, but mostly because the net_device struct layout has placed mtu and hard_header_len on different cache-lines. (This is something that should be fixed separately.) -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH bpf-next V8 4/8] bpf: add BPF-helper for MTU checking
On Thu, 3 Dec 2020 00:23:14 +0100 Daniel Borkmann wrote: > On 11/27/20 7:06 PM, Jesper Dangaard Brouer wrote: > [...] > > +static struct net_device *__dev_via_ifindex(struct net_device *dev_curr, > > + u32 ifindex) > > +{ > > + struct net *netns = dev_net(dev_curr); > > + > > + /* Non-redirect use-cases can use ifindex=0 and save ifindex lookup */ > > + if (ifindex == 0) > > + return dev_curr; > > + > > + return dev_get_by_index_rcu(netns, ifindex); > > +} > > + > > +BPF_CALL_5(bpf_skb_check_mtu, struct sk_buff *, skb, > > + u32, ifindex, u32 *, mtu_len, s32, len_diff, u64, flags) > > +{ > > + int ret = BPF_MTU_CHK_RET_FRAG_NEEDED; > > + struct net_device *dev = skb->dev; > > + int len; > > + int mtu; > > + > > + if (flags & ~(BPF_MTU_CHK_SEGS)) > > nit: unlikely() (similar for XDP case) ok > > + return -EINVAL; > > + > > + dev = __dev_via_ifindex(dev, ifindex); > > + if (!dev) > > nit: unlikely() (ditto XDP) ok > > + return -ENODEV; > > + > > + mtu = READ_ONCE(dev->mtu); > > + > > + /* TC len is L2, remove L2-header as dev MTU is L3 size */ > > + len = skb->len - ETH_HLEN; > > s/ETH_HLEN/dev->hard_header_len/ ? ok > > + len += len_diff; /* len_diff can be negative, minus result pass check */ > > + if (len <= mtu) { > > + ret = BPF_MTU_CHK_RET_SUCCESS; > > Wouldn't it be more intuitive to do ... > > len_dev = READ_ONCE(dev->mtu) + dev->hard_header_len + VLAN_HLEN; > len_skb = skb->len + len_diff; > if (len_skb <= len_dev) { >ret = BPF_MTU_CHK_RET_SUCCESS; >got out; > } Yes, that is more intuitive to read. > > + goto out; > > + } > > + /* At this point, skb->len exceed MTU, but as it include length of all > > +* segments, it can still be below MTU. The SKB can possibly get > > +* re-segmented in transmit path (see validate_xmit_skb). Thus, user > > +* must choose if segs are to be MTU checked. Last SKB "headlen" is > > +* checked against MTU. 
> > +*/ > > + if (skb_is_gso(skb)) { > > + ret = BPF_MTU_CHK_RET_SUCCESS; > > + > > + if (flags & BPF_MTU_CHK_SEGS && > > + skb_gso_validate_network_len(skb, mtu)) { > > + ret = BPF_MTU_CHK_RET_SEGS_TOOBIG; > > + goto out; > > Maybe my lack of coffe, but looking at ip_exceeds_mtu() for example, shouldn't > the above test be on !skb_gso_validate_network_len() instead? Yes, you are right! > skb_is_gso(skb) && skb_gso_validate_network_len(skb, mtu) would indicate that > it does /not/ exceed mtu. > > > + } > > + > > + len = skb_headlen(skb) - ETH_HLEN + len_diff; > > How does this work with GRO when we invoke this helper at tc ingress, e.g. > when > there is still non-linear data in skb_shinfo(skb)->frags[]? In the case of skb_is_gso(), this code will check the linear part skb_headlen(skb) against the MTU. I thought this was an improvement over what we have today, where skb_is_gso() packets skip all checks, which has caused a lot of confusion among end-users. I will put this under the BPF_MTU_CHK_SEGS flag (in V9), as I understand from your comment that you don't think this is correct at tc ingress. > > + if (len > mtu) { > > + ret = BPF_MTU_CHK_RET_FRAG_NEEDED; > > + goto out; > > + } > > + } > > +out: > > + /* BPF verifier guarantees valid pointer */ > > + *mtu_len = mtu; > > + > > + return ret; > > +} [...] -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH net-next] sfc: reduce the number of requested xdp ev queues
On Mon, 14 Dec 2020 17:29:06 -0800 Ivan Babrou wrote: > Without this change the driver tries to allocate too many queues, > breaching the number of available msi-x interrupts on machines > with many logical cpus and default adapter settings: > > Insufficient resources for 12 XDP event queues (24 other channels, max 32) > > Which in turn triggers EINVAL on XDP processing: > > sfc :86:00.0 ext0: XDP TX failed (-22) I have a similar QA report with XDP_REDIRECT: sfc :05:00.0 ens1f0np0: XDP redirect failed (-22) Here we are back to the issue we discussed with ixgbe: NIC MSI-X interrupt hardware resources are not enough on machines with many logical CPUs. After this fix, what will happen if (cpu >= efx->xdp_tx_queue_count) ? (Copied efx_xdp_tx_buffers code below signature) The question leads to: does this driver need a fallback mechanism when HW resources or the system's logical CPU count exceed the one-TX-queue-per-CPU assumption? > Signed-off-by: Ivan Babrou > --- > drivers/net/ethernet/sfc/efx_channels.c | 6 -- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/drivers/net/ethernet/sfc/efx_channels.c > b/drivers/net/ethernet/sfc/efx_channels.c index > a4a626e9cd9a..1bfeee283ea9 100644 --- > a/drivers/net/ethernet/sfc/efx_channels.c +++ > b/drivers/net/ethernet/sfc/efx_channels.c @@ -17,6 +17,7 @@ > #include "rx_common.h" > #include "nic.h" > #include "sriov.h" > +#include "workarounds.h" > > /* This is the first interrupt mode to try out of: > * 0 => MSI-X > @@ -137,6 +138,7 @@ static int efx_allocate_msix_channels(struct > efx_nic *efx, { > unsigned int n_channels = parallelism; > int vec_count; > + int tx_per_ev; > int n_xdp_tx; > int n_xdp_ev; > > @@ -149,9 +151,9 @@ static int efx_allocate_msix_channels(struct > efx_nic *efx, >* multiple tx queues, assuming tx and ev queues are both >* maximum size. 
>*/ > - > + tx_per_ev = EFX_MAX_EVQ_SIZE / EFX_TXQ_MAX_ENT(efx); > n_xdp_tx = num_possible_cpus(); > - n_xdp_ev = DIV_ROUND_UP(n_xdp_tx, EFX_MAX_TXQ_PER_CHANNEL); > + n_xdp_ev = DIV_ROUND_UP(n_xdp_tx, tx_per_ev); > > vec_count = pci_msix_vec_count(efx->pci_dev); > if (vec_count < 0) -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer /* Transmit a packet from an XDP buffer * * Returns number of packets sent on success, error code otherwise. * Runs in NAPI context, either in our poll (for XDP TX) or a different NIC * (for XDP redirect). */ int efx_xdp_tx_buffers(struct efx_nic *efx, int n, struct xdp_frame **xdpfs, bool flush) { struct efx_tx_buffer *tx_buffer; struct efx_tx_queue *tx_queue; struct xdp_frame *xdpf; dma_addr_t dma_addr; unsigned int len; int space; int cpu; int i; cpu = raw_smp_processor_id(); if (!efx->xdp_tx_queue_count || unlikely(cpu >= efx->xdp_tx_queue_count)) return -EINVAL; tx_queue = efx->xdp_tx_queues[cpu]; if (unlikely(!tx_queue)) return -EINVAL; if (unlikely(n && !xdpfs)) return -EINVAL; if (!n) return 0; /* Check for available space. We should never need multiple * descriptors per frame. */ space = efx->txq_entries + tx_queue->read_count - tx_queue->insert_count; for (i = 0; i < n; i++) { xdpf = xdpfs[i]; if (i >= space) break; /* We'll want a descriptor for this tx. */ prefetchw(__efx_tx_queue_get_insert_buffer(tx_queue)); len = xdpf->len; /* Map for DMA. */ dma_addr = dma_map_single(&efx->pci_dev->dev, xdpf->data, len, DMA_TO_DEVICE); if (dma_mapping_error(&efx->pci_dev->dev, dma_addr)) break; /* Create descriptor and set up for unmapping DMA. */ tx_buffer = efx_tx_map_chunk(tx_queue, dma_addr, len); tx_buffer->xdpf = xdpf; tx_buffer->flags = EFX_TX_BUF_XDP | EFX_TX_BUF_MAP_SINGLE; tx_buffer->dma_offset = 0; tx_buffer->unmap_len = len; tx_queue->tx_packets++; } /* Pass mapped frames to hardware. 
*/ if (flush && i > 0) efx_nic_push_buffers(tx_queue); if (i == 0) return -EIO; efx_xdp_return_frames(n - i, xdpfs + i); return i; }
Re: [PATCH v3 bpf-next 2/2] net: xdp: introduce xdp_prepare_buff utility routine
On Tue, 15 Dec 2020 16:13:44 +0100 Maciej Fijalkowski wrote: > On Tue, Dec 15, 2020 at 04:06:20PM +0100, Lorenzo Bianconi wrote: > > > On 12/15/20 2:47 PM, Lorenzo Bianconi wrote: > > > [...] > > > > > > diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c > > > > > > index 329397c60d84..61d3f5f8b7f3 100644 > > > > > > --- a/drivers/net/xen-netfront.c > > > > > > +++ b/drivers/net/xen-netfront.c > > > > > > @@ -866,10 +866,8 @@ static u32 xennet_run_xdp(struct > > > > > > netfront_queue *queue, struct page *pdata, > > > > > > xdp_init_buff(xdp, XEN_PAGE_SIZE - XDP_PACKET_HEADROOM, > > > > > > &queue->xdp_rxq); > > > > > > - xdp->data_hard_start = page_address(pdata); > > > > > > - xdp->data = xdp->data_hard_start + XDP_PACKET_HEADROOM; > > > > > > + xdp_prepare_buff(xdp, page_address(pdata), XDP_PACKET_HEADROOM, > > > > > > len); > > > > > > xdp_set_data_meta_invalid(xdp); > > > > > > - xdp->data_end = xdp->data + len; > > > > > > act = bpf_prog_run_xdp(prog, xdp); > > > > > > switch (act) { > > > > > > diff --git a/include/net/xdp.h b/include/net/xdp.h > > > > > > index 3fb3a9aa1b71..66d8a4b317a3 100644 > > > > > > --- a/include/net/xdp.h > > > > > > +++ b/include/net/xdp.h > > > > > > @@ -83,6 +83,18 @@ xdp_init_buff(struct xdp_buff *xdp, u32 > > > > > > frame_sz, struct xdp_rxq_info *rxq) > > > > > > xdp->rxq = rxq; > > > > > > } > > > > > > +static inline void > > > > > > nit: maybe __always_inline > > > > ack, I will add in v4 > > > > > > > > > > > +xdp_prepare_buff(struct xdp_buff *xdp, unsigned char *hard_start, > > > > > > +int headroom, int data_len) > > > > > > +{ > > > > > > + unsigned char *data = hard_start + headroom; > > > > > > + > > > > > > + xdp->data_hard_start = hard_start; > > > > > > + xdp->data = data; > > > > > > + xdp->data_end = data + data_len; > > > > > > + xdp->data_meta = data; > > > > > > +} > > > > > > + > > > > > > /* Reserve memory area at end-of data area. 
> > > > > >* > > > > > > For the drivers with xdp_set_data_meta_invalid(), we're basically setting > > > xdp->data_meta > > > twice unless compiler is smart enough to optimize the first one away (did > > > you double check?). > > > Given this is supposed to be a cleanup, why not integrate this logic as > > > well so the > > > xdp_set_data_meta_invalid() doesn't get extra treatment? > > That's what I was trying to say previously. > > > > > we discussed it before, but I am fine to add it in v4. Something like: > > > > static __always_inline void > > xdp_prepare_buff(struct xdp_buff *xdp, unsigned char *hard_start, > > int headroom, int data_len, bool meta_valid) > > { > > unsigned char *data = hard_start + headroom; > > > > xdp->data_hard_start = hard_start; > > xdp->data = data; > > xdp->data_end = data + data_len; > > xdp->data_meta = meta_valid ? data : data + 1; > > This will introduce branch, so for intel drivers we're getting the > overhead of one add and a branch. I'm still opting for a separate helper. I should think, as this gets inlined the compiler should be able to remove the branch. I assume that the usage of 'meta_valid' will be a const in the drivers. Maybe we should have the API be 'const bool meta_valid'? > static __always_inline void > xdp_prepare_buff(struct xdp_buff *xdp, unsigned char *hard_start, >int headroom, int data_len) > { > unsigned char *data = hard_start + headroom; > > xdp->data_hard_start = hard_start; > xdp->data = data; > xdp->data_end = data + data_len; > xdp_set_data_meta_invalid(xdp); > } > > static __always_inline void > xdp_prepare_buff_meta(struct xdp_buff *xdp, unsigned char *hard_start, > int headroom, int data_len) > { > unsigned char *data = hard_start + headroom; > > xdp->data_hard_start = hard_start; > xdp->data = data; > xdp->data_end = data + data_len; > xdp->data_meta = data; > } Thanks to you Maciej for reviewing this! 
:-) -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH v3 bpf-next 1/2] net: xdp: introduce xdp_init_buff utility routine
On Sat, 12 Dec 2020 18:41:48 +0100 Lorenzo Bianconi wrote: > diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c > b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c > index fcc262064766..b7942c3440c0 100644 > --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c > +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c > @@ -133,12 +133,11 @@ bool bnxt_rx_xdp(struct bnxt *bp, struct > bnxt_rx_ring_info *rxr, u16 cons, > dma_sync_single_for_cpu(&pdev->dev, mapping + offset, *len, bp->rx_dir); > > txr = rxr->bnapi->tx_ring; > + xdp_init_buff(&xdp, PAGE_SIZE, &rxr->xdp_rxq); > xdp.data_hard_start = *data_ptr - offset; > xdp.data = *data_ptr; > xdp_set_data_meta_invalid(&xdp); > xdp.data_end = *data_ptr + *len; > - xdp.rxq = &rxr->xdp_rxq; > - xdp.frame_sz = PAGE_SIZE; /* BNXT_RX_PAGE_MODE(bp) when XDP enabled */ > orig_data = xdp.data; I don't like losing the comment here. Other developers reading this code might assume that size is always PAGE_SIZE, which is only the case when XDP is enabled. Let's save them from making this mistake. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH net-next] sfc: reduce the number of requested xdp ev queues
On Tue, 15 Dec 2020 18:49:55 + Edward Cree wrote: > On 15/12/2020 09:43, Jesper Dangaard Brouer wrote: > > On Mon, 14 Dec 2020 17:29:06 -0800 > > Ivan Babrou wrote: > > > >> Without this change the driver tries to allocate too many queues, > >> breaching the number of available msi-x interrupts on machines > >> with many logical cpus and default adapter settings: > >> > >> Insufficient resources for 12 XDP event queues (24 other channels, max 32) > >> > >> Which in turn triggers EINVAL on XDP processing: > >> > >> sfc :86:00.0 ext0: XDP TX failed (-22) > > > > I have a similar QA report with XDP_REDIRECT: > > sfc :05:00.0 ens1f0np0: XDP redirect failed (-22) > > > > Here we are back to the issue we discussed with ixgbe, that NIC / msi-x > > interrupts hardware resources are not enough on machines with many > > logical cpus. > > > > After this fix, what will happen if (cpu >= efx->xdp_tx_queue_count) ? > > Same as happened before: the "failed -22". But this fix will make that > less likely to happen, because it ties more TXQs to each EVQ, and it's > the EVQs that are in short supply. > So, what I hear is that this fix is just papering over the real issue. I suggest that you/we detect the situation, and have a code path that will take a lock (per 16-packet bulk) and solve the issue. If you care about maximum performance, you can implement this via changing the ndo_xdp_xmit pointer to the fallback function when needed, to avoid having to check for the fallback mode in the fast-path. > > (Strictly speaking, I believe the limitation is a software one, that > comes from the driver's channel structures having been designed a > decade ago when 32 cpus ought to be enough for anybody... AFAIR the > hardware is capable of giving us something like 1024 evqs if we ask > for them, it just might not have that many msi-x vectors for us.) 
> Anyway, the patch looks correct, so > Acked-by: Edward Cree -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH v3 bpf-next 2/2] net: xdp: introduce xdp_prepare_buff utility routine
On Tue, 15 Dec 2020 14:47:10 +0100 Lorenzo Bianconi wrote: > [...] > > > xdp_act = bpf_prog_run_xdp(xdp_prog, &xdp); > > > diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c > > > b/drivers/net/ethernet/intel/i40e/i40e_txrx.c > > > index 4dbbbd49c389..fcd1ca3343fb 100644 > > > --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c > > > +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c > > > @@ -2393,12 +2393,12 @@ static int i40e_clean_rx_irq(struct i40e_ring > > > *rx_ring, int budget) > > > > > > /* retrieve a buffer from the ring */ > > > if (!skb) { > > > - xdp.data = page_address(rx_buffer->page) + > > > -rx_buffer->page_offset; > > > - xdp.data_meta = xdp.data; > > > - xdp.data_hard_start = xdp.data - > > > - i40e_rx_offset(rx_ring); > > > - xdp.data_end = xdp.data + size; > > > + unsigned int offset = i40e_rx_offset(rx_ring); > > > > I now see that we could call the i40e_rx_offset() once per napi, so can > > you pull this variable out and have it initialized a single time? Applies > > to other intel drivers as well. > > ack, fine. I will fix in v4. Be careful with the Intel drivers. They have two modes (at compile time) depending on the PAGE_SIZE of the system. In one of the modes (the default one) you can place the init of xdp.frame_sz outside the NAPI loop and init it a single time. In the other mode you cannot, and it becomes dynamic per packet. Intel, please review this carefully! -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH bpf-next V8 5/8] bpf: drop MTU check when doing TC-BPF redirect to ingress
On Thu, 3 Dec 2020 00:43:36 +0100 Daniel Borkmann wrote: > On 11/27/20 7:06 PM, Jesper Dangaard Brouer wrote: > > The use-case for dropping the MTU check when TC-BPF does redirect to > > ingress, is described by Eyal Birger in email[0]. The summary is the > > ability to increase packet size (e.g. with IPv6 headers for NAT64) and > > ingress redirect packet and let normal netstack fragment packet as needed. > > > > [0] > > https://lore.kernel.org/netdev/CAHsH6Gug-hsLGHQ6N0wtixdOa85LDZ3HNRHVd0opR=19qo4...@mail.gmail.com/ > > > > V4: > > - Keep net_device "up" (IFF_UP) check. > > - Adjustment to handle bpf_redirect_peer() helper > > > > Signed-off-by: Jesper Dangaard Brouer > > --- > > include/linux/netdevice.h | 31 +-- > > net/core/dev.c| 19 ++- > > net/core/filter.c | 14 +++--- > > 3 files changed, 42 insertions(+), 22 deletions(-) > > > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h > > index 7ce648a564f7..4a854e09e918 100644 > > --- a/include/linux/netdevice.h > > +++ b/include/linux/netdevice.h > > @@ -3917,11 +3917,38 @@ int dev_forward_skb(struct net_device *dev, struct > > sk_buff *skb); > > bool is_skb_forwardable(const struct net_device *dev, > > const struct sk_buff *skb); > > > > +static __always_inline bool __is_skb_forwardable(const struct net_device > > *dev, > > +const struct sk_buff *skb, > > +const bool check_mtu) > > +{ > > + const u32 vlan_hdr_len = 4; /* VLAN_HLEN */ > > + unsigned int len; > > + > > + if (!(dev->flags & IFF_UP)) > > + return false; > > + > > + if (!check_mtu) > > + return true; > > + > > + len = dev->mtu + dev->hard_header_len + vlan_hdr_len; > > + if (skb->len <= len) > > + return true; > > + > > + /* if TSO is enabled, we don't care about the length as the packet > > +* could be forwarded without being segmented before > > +*/ > > + if (skb_is_gso(skb)) > > + return true; > > + > > + return false; > > +} > > + > > static __always_inline int dev_forward_skb(struct net_device *dev, > > - struct sk_buff *skb) 
> > + struct sk_buff *skb, > > + const bool check_mtu) > > { > > if (skb_orphan_frags(skb, GFP_ATOMIC) || > > - unlikely(!is_skb_forwardable(dev, skb))) { > > + unlikely(!__is_skb_forwardable(dev, skb, check_mtu))) { > > atomic_long_inc(&dev->rx_dropped); > > kfree_skb(skb); > > return NET_RX_DROP; > > diff --git a/net/core/dev.c b/net/core/dev.c > > index 60d325bda0d7..6ceb6412ee97 100644 > > --- a/net/core/dev.c > > +++ b/net/core/dev.c > > @@ -2189,28 +2189,13 @@ static inline void net_timestamp_set(struct sk_buff > > *skb) > > > > bool is_skb_forwardable(const struct net_device *dev, const struct > > sk_buff *skb) > > { > > - unsigned int len; > > - > > - if (!(dev->flags & IFF_UP)) > > - return false; > > - > > - len = dev->mtu + dev->hard_header_len + VLAN_HLEN; > > - if (skb->len <= len) > > - return true; > > - > > - /* if TSO is enabled, we don't care about the length as the packet > > -* could be forwarded without being segmented before > > -*/ > > - if (skb_is_gso(skb)) > > - return true; > > - > > - return false; > > + return __is_skb_forwardable(dev, skb, true); > > } > > EXPORT_SYMBOL_GPL(is_skb_forwardable); > > Only user of is_skb_forwardable() that is left after this patch is bridge, > maybe > the whole thing should be moved into the header? Well, yes, maybe... I just felt it belongs in another patchset. > > int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb) > > { > > - int ret = dev_forward_skb(dev, skb); > > + int ret = dev_forward_skb(dev, skb, true); > > > > if (likely(!ret)) { > > skb->protocol = eth_type_trans(skb, dev); > > diff --git a/net/core/filter.c b/net/core/filter.c > > index d6125cfc49c3..4673afe59533 100644 > > --- a/net/core/filter
Re: [PATCH bpf-next V8 5/8] bpf: drop MTU check when doing TC-BPF redirect to ingress
On Thu, 17 Dec 2020 15:46:55 +0100 Jesper Dangaard Brouer wrote: > > > diff --git a/net/core/filter.c b/net/core/filter.c > > > index d6125cfc49c3..4673afe59533 100644 > > > --- a/net/core/filter.c > > > +++ b/net/core/filter.c > > > @@ -2083,13 +2083,21 @@ static const struct bpf_func_proto > > > bpf_csum_level_proto = { > > > > > > static inline int __bpf_rx_skb(struct net_device *dev, struct sk_buff > > > *skb) > > > { > > > - return dev_forward_skb(dev, skb); > > > + int ret = dev_forward_skb(dev, skb, false); > > > + > > > + if (likely(!ret)) { > > > + skb->protocol = eth_type_trans(skb, dev); > > > + skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN); > > > + ret = netif_rx(skb); > > > > Why netif_rx() and not netif_rx_internal() as in dev_forward_skb() > > originally? > > One extra call otherwise. > > This is because the function below calls netif_rx(), which is just > outside patch-diff-window. Thus, it looked wrong/strange to call > netif_rx_internal(), but sure I can use netif_rx_internal() instead. Well, when building I found that we obviously cannot call netif_rx_internal() as this is filter.c, else we get a build error: net/core/filter.c:2091:9: error: implicit declaration of function ‘netif_rx_internal’ [-Werror=implicit-function-declaration] 2091 | ret = netif_rx_internal(skb); | ^ -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
[PATCH bpf-next V9 0/7] bpf: New approach for BPF MTU handling
This patchset drops all the MTU checks in TC BPF-helpers that limit growing the packet size. This is done because these BPF-helpers don't take redirect into account, which can result in their MTU check being done against the wrong netdev. The new approach is to give BPF-programs knowledge about the MTU on a netdev (via ifindex) and fib route lookup level. Meaning some BPF-helpers are added and extended to make it possible to do MTU checks in the BPF-code. If BPF-prog doesn't comply with the MTU then the packet will eventually get dropped at some other layer. In some cases the existing kernel MTU checks will drop the packet, but there are also cases where BPF can bypass these checks. Specifically doing TC-redirect from ingress step (sch_handle_ingress) into egress code path (basically calling dev_queue_xmit()). It is left up to driver code to handle these kinds of MTU violations. One advantage of this approach is that an ingress-to-egress BPF-prog can send information via packet data. With the MTU checks removed in the helpers, and also not done in the skb_do_redirect() call, this allows for an ingress BPF-prog to communicate with an egress BPF-prog via packet data, as long as the egress BPF-prog removes this prior to transmitting the packet. This patchset is primarily focused on TC-BPF, but I've made sure that the MTU BPF-helpers also work for XDP BPF-programs. V2: Change BPF-helper API from lookup to check. V3: Drop enforcement of MTU in net-core, leave it to drivers. V4: Keep sanity limit + netdev "up" checks + rename BPF-helper. V5: Fix uninit variable + name struct output member mtu_result. 
V6: Use bpf_check_mtu() in selftest V7: Fix logic using tot_len and add another selftest V8: Add better selftests for BPF-helper bpf_check_mtu V9: Remove patch that use skb_set_redirected --- Jesper Dangaard Brouer (7): bpf: Remove MTU check in __bpf_skb_max_len bpf: fix bpf_fib_lookup helper MTU check for SKB ctx bpf: bpf_fib_lookup return MTU value as output when looked up bpf: add BPF-helper for MTU checking bpf: drop MTU check when doing TC-BPF redirect to ingress selftests/bpf: use bpf_check_mtu in selftest test_cls_redirect bpf/selftests: tests using bpf_check_mtu BPF-helper include/linux/netdevice.h | 31 +++ include/uapi/linux/bpf.h | 78 +++- net/core/dev.c | 19 -- net/core/filter.c | 184 -- tools/include/uapi/linux/bpf.h | 78 +++- tools/testing/selftests/bpf/prog_tests/check_mtu.c | 204 tools/testing/selftests/bpf/progs/test_check_mtu.c | 196 +++ .../selftests/bpf/progs/test_cls_redirect.c|7 + 8 files changed, 754 insertions(+), 43 deletions(-) create mode 100644 tools/testing/selftests/bpf/prog_tests/check_mtu.c create mode 100644 tools/testing/selftests/bpf/progs/test_check_mtu.c --
[PATCH bpf-next V9 1/7] bpf: Remove MTU check in __bpf_skb_max_len
Multiple BPF-helpers that can manipulate/increase the size of the SKB use __bpf_skb_max_len() as the max-length. This function limits the size against the current net_device MTU (skb->dev->mtu). When a BPF-prog grows the packet size, it should not be limited to the MTU. The MTU is a transmit limitation, and software receiving this packet should be allowed to increase the size. Furthermore, the current MTU check in __bpf_skb_max_len uses the MTU from the ingress/current net_device, which in case of redirects uses the wrong net_device. This patch keeps a sanity max limit of SKB_MAX_ALLOC (16KiB). The real limit is elsewhere in the system. Jesper's testing[1] showed it was not possible to exceed 8KiB when expanding the SKB size via BPF-helper. The limiting factor is the define KMALLOC_MAX_CACHE_SIZE which is 8192 for the SLUB-allocator (CONFIG_SLUB) in case PAGE_SIZE is 4096. This define is in effect due to this being called from softirq context, see code __gfp_pfmemalloc_flags() and __do_kmalloc_node(). Jakub's testing showed that frames above 16KiB can cause NICs to reset (but not crash). Keep this sanity limit at this level as the memory layer can differ based on kernel config. [1] https://github.com/xdp-project/bpf-examples/tree/master/MTU-tests V3: replace __bpf_skb_max_len() with define and use IPv6 max MTU size. Signed-off-by: Jesper Dangaard Brouer --- net/core/filter.c | 12 1 file changed, 4 insertions(+), 8 deletions(-) diff --git a/net/core/filter.c b/net/core/filter.c index 255aeee72402..f8f198252ff2 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -3552,11 +3552,7 @@ static int bpf_skb_net_shrink(struct sk_buff *skb, u32 off, u32 len_diff, return 0; } -static u32 __bpf_skb_max_len(const struct sk_buff *skb) -{ - return skb->dev ? 
skb->dev->mtu + skb->dev->hard_header_len : - SKB_MAX_ALLOC; -} +#define BPF_SKB_MAX_LEN SKB_MAX_ALLOC BPF_CALL_4(sk_skb_adjust_room, struct sk_buff *, skb, s32, len_diff, u32, mode, u64, flags) @@ -3605,7 +3601,7 @@ BPF_CALL_4(bpf_skb_adjust_room, struct sk_buff *, skb, s32, len_diff, { u32 len_cur, len_diff_abs = abs(len_diff); u32 len_min = bpf_skb_net_base_len(skb); - u32 len_max = __bpf_skb_max_len(skb); + u32 len_max = BPF_SKB_MAX_LEN; __be16 proto = skb->protocol; bool shrink = len_diff < 0; u32 off; @@ -3688,7 +3684,7 @@ static int bpf_skb_trim_rcsum(struct sk_buff *skb, unsigned int new_len) static inline int __bpf_skb_change_tail(struct sk_buff *skb, u32 new_len, u64 flags) { - u32 max_len = __bpf_skb_max_len(skb); + u32 max_len = BPF_SKB_MAX_LEN; u32 min_len = __bpf_skb_min_len(skb); int ret; @@ -3764,7 +3760,7 @@ static const struct bpf_func_proto sk_skb_change_tail_proto = { static inline int __bpf_skb_change_head(struct sk_buff *skb, u32 head_room, u64 flags) { - u32 max_len = __bpf_skb_max_len(skb); + u32 max_len = BPF_SKB_MAX_LEN; u32 new_len = skb->len + head_room; int ret;
[PATCH bpf-next V9 3/7] bpf: bpf_fib_lookup return MTU value as output when looked up
The BPF-helpers for FIB lookup (bpf_xdp_fib_lookup and bpf_skb_fib_lookup) can perform MTU check and return BPF_FIB_LKUP_RET_FRAG_NEEDED. The BPF-prog doesn't know the MTU value that caused this rejection. If the BPF-prog wants to implement PMTU (Path MTU Discovery) (RFC 1191) it needs to know this MTU value for the ICMP packet. This patch changes struct bpf_fib_lookup to contain this MTU value as output, via a union with 'tot_len', as this is the value used for the MTU lookup. V5: - Fixed uninit value spotted by Dan Carpenter. - Name struct output member mtu_result Reported-by: kernel test robot Reported-by: Dan Carpenter Signed-off-by: Jesper Dangaard Brouer --- include/uapi/linux/bpf.h | 11 +-- net/core/filter.c | 22 +++--- tools/include/uapi/linux/bpf.h | 11 +-- 3 files changed, 33 insertions(+), 11 deletions(-) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 77d7c1bb2923..649586d656b6 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -2225,6 +2225,9 @@ union bpf_attr { * * > 0 one of **BPF_FIB_LKUP_RET_** codes explaining why the * packet is not forwarded or needs assist from full stack * + * If lookup fails with BPF_FIB_LKUP_RET_FRAG_NEEDED, then the MTU + * was exceeded and output params->mtu_result contains the MTU. + * * long bpf_sock_hash_update(struct bpf_sock_ops *skops, struct bpf_map *map, void *key, u64 flags) * Description * Add an entry to, or update a sockhash *map* referencing sockets. 
@@ -4975,9 +4978,13 @@ struct bpf_fib_lookup { __be16 sport; __be16 dport; - /* total length of packet from network header - used for MTU check */ - __u16 tot_len; + union { /* used for MTU check */ + /* input to lookup */ + __u16 tot_len; /* L3 length from network hdr (iph->tot_len) */ + /* output: MTU value */ + __u16 mtu_result; + }; /* input: L3 device index for lookup * output: device index from FIB lookup */ diff --git a/net/core/filter.c b/net/core/filter.c index 95c6fdebd0db..722910cbfe0b 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -5268,12 +5268,14 @@ static const struct bpf_func_proto bpf_skb_get_xfrm_state_proto = { #if IS_ENABLED(CONFIG_INET) || IS_ENABLED(CONFIG_IPV6) static int bpf_fib_set_fwd_params(struct bpf_fib_lookup *params, const struct neighbour *neigh, - const struct net_device *dev) + const struct net_device *dev, u32 mtu) { memcpy(params->dmac, neigh->ha, ETH_ALEN); memcpy(params->smac, dev->dev_addr, ETH_ALEN); params->h_vlan_TCI = 0; params->h_vlan_proto = 0; + if (mtu) + params->mtu_result = mtu; /* union with tot_len */ return 0; } @@ -5289,8 +5291,8 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params, struct net_device *dev; struct fib_result res; struct flowi4 fl4; + u32 mtu = 0; int err; - u32 mtu; dev = dev_get_by_index_rcu(net, params->ifindex); if (unlikely(!dev)) @@ -5357,8 +5359,10 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params, if (check_mtu) { mtu = ip_mtu_from_fib_result(&res, params->ipv4_dst); - if (params->tot_len > mtu) + if (params->tot_len > mtu) { + params->mtu_result = mtu; /* union with tot_len */ return BPF_FIB_LKUP_RET_FRAG_NEEDED; + } } nhc = res.nhc; @@ -5392,7 +5396,7 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params, if (!neigh) return BPF_FIB_LKUP_RET_NO_NEIGH; - return bpf_fib_set_fwd_params(params, neigh, dev); + return bpf_fib_set_fwd_params(params, neigh, dev, mtu); } #endif @@ -5409,7 +5413,7 @@ 
static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params, struct flowi6 fl6; int strict = 0; int oif, err; - u32 mtu; + u32 mtu = 0; /* link local addresses are never forwarded */ if (rt6_need_strict(dst) || rt6_need_strict(src)) @@ -5484,8 +5488,10 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params, if (check_mtu) { mtu = ipv6_stub->ip6_mtu_from_fib6(&res, dst, src); - if (params->tot_len > mtu) + if (params->tot_len > mtu) { + params->mtu_result = mtu; /* union with tot_len */ return BPF_FIB_LKUP_RET_FRAG_NEEDED; +
[PATCH bpf-next V9 2/7] bpf: fix bpf_fib_lookup helper MTU check for SKB ctx
BPF end-user on Cilium slack-channel (Carlo Carraro) wants to use bpf_fib_lookup for doing MTU-check, but *prior* to extending packet size, by adjusting fib_params 'tot_len' with the packet length plus the expected encap size. (Just like the bpf_check_mtu helper supports). He discovered that for SKB ctx the param->tot_len was not used, instead skb->len was used (via MTU check in is_skb_forwardable()). Fix this by using fib_params 'tot_len' for MTU check. If not provided (e.g. zero) then keep existing behaviour intact. Fixes: 4c79579b44b1 ("bpf: Change bpf_fib_lookup to return lookup status") Reported-by: Carlo Carraro Signed-off-by: Jesper Dangaard Brouer --- net/core/filter.c | 14 -- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/net/core/filter.c b/net/core/filter.c index f8f198252ff2..95c6fdebd0db 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -5568,11 +5568,21 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb, #endif } - if (!rc) { + if (rc == BPF_FIB_LKUP_RET_SUCCESS) { struct net_device *dev; + u32 mtu; dev = dev_get_by_index_rcu(net, params->ifindex); - if (!is_skb_forwardable(dev, skb)) + mtu = READ_ONCE(dev->mtu); + + /* Using tot_len for (L3) MTU check if provided by user */ + if (params->tot_len && params->tot_len > mtu) + rc = BPF_FIB_LKUP_RET_FRAG_NEEDED; + + /* Notice at this TC cls_bpf level skb->len contains L2 size, +* but is_skb_forwardable takes that into account +*/ + if (params->tot_len == 0 && !is_skb_forwardable(dev, skb)) rc = BPF_FIB_LKUP_RET_FRAG_NEEDED; }
[PATCH bpf-next V9 4/7] bpf: add BPF-helper for MTU checking
This BPF-helper bpf_check_mtu() works for both XDP and TC-BPF programs. The SKB object is complex and the skb->len value (accessible from BPF-prog) also includes the length of any extra GRO/GSO segments, but without taking into account that these GRO/GSO segments get transport (L4) and network (L3) headers added before being transmitted. Thus, this BPF-helper is created such that the BPF-programmer doesn't need to handle these details in the BPF-prog. The API is designed to help the BPF-programmer who wants to do packet context size changes, which involve other helpers. These other helpers usually do a delta size adjustment. This helper also supports a delta size (len_diff), which allows the BPF-programmer to reuse arguments needed by these other helpers, and perform the MTU check prior to doing any actual size adjustment of the packet context. It is on purpose that we allow the len adjustment to become a negative result that will pass the MTU check. This might seem weird, but it's not this helper's responsibility to "catch" wrong len_diff adjustments. Other helpers will take care of these checks, if the BPF-programmer chooses to do actual size adjustment. 
V9: - Use dev->hard_header_len (instead of ETH_HLEN) - Annotate with unlikely req from Daniel - Fix logic error using skb_gso_validate_network_len from Daniel V6: - Took John's advice and dropped BPF_MTU_CHK_RELAX - Returned MTU is kept at L3-level (like fib_lookup) V4: Lot of changes - ifindex 0 now use current netdev for MTU lookup - rename helper from bpf_mtu_check to bpf_check_mtu - fix bug for GSO pkt length (as skb->len is total len) - remove __bpf_len_adj_positive, simply allow negative len adj Signed-off-by: Jesper Dangaard Brouer --- include/uapi/linux/bpf.h | 67 ++ net/core/filter.c | 122 tools/include/uapi/linux/bpf.h | 67 ++ 3 files changed, 256 insertions(+) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 649586d656b6..fa2e99351758 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -3833,6 +3833,61 @@ union bpf_attr { * Return * A pointer to a struct socket on success or NULL if the file is * not a socket. + * + * int bpf_check_mtu(void *ctx, u32 ifindex, u32 *mtu_len, s32 len_diff, u64 flags) + * Description + * Check ctx packet size against MTU of net device (based on + * *ifindex*). This helper will likely be used in combination with + * helpers that adjust/change the packet size. The argument + * *len_diff* can be used for querying with a planned size + * change. This allows to check MTU prior to changing packet ctx. + * + * Specifying *ifindex* zero means the MTU check is performed + * against the current net device. This is practical if this isn't + * used prior to redirect. + * + * The Linux kernel route table can configure MTUs on a more + * specific per route level, which is not provided by this helper. + * For route level MTU checks use the **bpf_fib_lookup**\ () + * helper. + * + * *ctx* is either **struct xdp_md** for XDP programs or + * **struct sk_buff** for tc cls_act programs. 
+ * + * The *flags* argument can be a combination of one or more of the + * following values: + * + * **BPF_MTU_CHK_SEGS** + * This flag only works for *ctx* **struct sk_buff**. + * If packet context contains extra packet segment buffers + * (often known as GSO skb), then MTU check is harder to + * check at this point, because in transmit path it is + * possible for the skb packet to get re-segmented + * (depending on net device features). This could still be + * a MTU violation, so this flag enables performing MTU + * check against segments, with a different violation + * return code to tell it apart. Check cannot use len_diff. + * + * On return *mtu_len* pointer contains the MTU value of the net + * device. Remember the net device configured MTU is the L3 size, + * which is returned here and XDP and TX length operate at L2. + * The helper takes this into account for you, but remember this + * when using the MTU value in your BPF-code. On input *mtu_len* must be a valid + * pointer and be initialized (to zero), else verifier will reject + * BPF program. + * + * Return + * * 0 on success, and populate MTU value in *mtu_len* pointer. + * + * * < 0 if any input argument is invalid (*mtu_len* not updated) + * + * MTU v
[PATCH bpf-next V9 5/7] bpf: drop MTU check when doing TC-BPF redirect to ingress
The use-case for dropping the MTU check when TC-BPF does redirect to ingress, is described by Eyal Birger in email[0]. The summary is the ability to increase packet size (e.g. with IPv6 headers for NAT64) and ingress redirect packet and let normal netstack fragment packet as needed. [0] https://lore.kernel.org/netdev/CAHsH6Gug-hsLGHQ6N0wtixdOa85LDZ3HNRHVd0opR=19qo4...@mail.gmail.com/ V9: - Make net_device "up" (IFF_UP) check explicit in skb_do_redirect V4: - Keep net_device "up" (IFF_UP) check. - Adjustment to handle bpf_redirect_peer() helper Signed-off-by: Jesper Dangaard Brouer --- include/linux/netdevice.h | 31 +-- net/core/dev.c| 19 ++- net/core/filter.c | 14 +++--- 3 files changed, 42 insertions(+), 22 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 7bf167993c05..4c78f73db1ec 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -3947,11 +3947,38 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb); bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff *skb); +static __always_inline bool __is_skb_forwardable(const struct net_device *dev, +const struct sk_buff *skb, +const bool check_mtu) +{ + const u32 vlan_hdr_len = 4; /* VLAN_HLEN */ + unsigned int len; + + if (!(dev->flags & IFF_UP)) + return false; + + if (!check_mtu) + return true; + + len = dev->mtu + dev->hard_header_len + vlan_hdr_len; + if (skb->len <= len) + return true; + + /* if TSO is enabled, we don't care about the length as the packet +* could be forwarded without being segmented before +*/ + if (skb_is_gso(skb)) + return true; + + return false; +} + static __always_inline int dev_forward_skb(struct net_device *dev, - struct sk_buff *skb) + struct sk_buff *skb, + const bool check_mtu) { if (skb_orphan_frags(skb, GFP_ATOMIC) || - unlikely(!is_skb_forwardable(dev, skb))) { + unlikely(!__is_skb_forwardable(dev, skb, check_mtu))) { atomic_long_inc(&dev->rx_dropped); kfree_skb(skb); return NET_RX_DROP; 
diff --git a/net/core/dev.c b/net/core/dev.c index a46334906c94..ffead13c158f 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -2176,28 +2176,13 @@ static inline void net_timestamp_set(struct sk_buff *skb) bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff *skb) { - unsigned int len; - - if (!(dev->flags & IFF_UP)) - return false; - - len = dev->mtu + dev->hard_header_len + VLAN_HLEN; - if (skb->len <= len) - return true; - - /* if TSO is enabled, we don't care about the length as the packet -* could be forwarded without being segmented before -*/ - if (skb_is_gso(skb)) - return true; - - return false; + return __is_skb_forwardable(dev, skb, true); } EXPORT_SYMBOL_GPL(is_skb_forwardable); int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb) { - int ret = dev_forward_skb(dev, skb); + int ret = dev_forward_skb(dev, skb, true); if (likely(!ret)) { skb->protocol = eth_type_trans(skb, dev); diff --git a/net/core/filter.c b/net/core/filter.c index 6e113de2baae..b682b11fbbec 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -2083,13 +2083,21 @@ static const struct bpf_func_proto bpf_csum_level_proto = { static inline int __bpf_rx_skb(struct net_device *dev, struct sk_buff *skb) { - return dev_forward_skb(dev, skb); + int ret = dev_forward_skb(dev, skb, false); + + if (likely(!ret)) { + skb->protocol = eth_type_trans(skb, dev); + skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN); + ret = netif_rx(skb); + } + + return ret; } static inline int __bpf_rx_skb_no_mac(struct net_device *dev, struct sk_buff *skb) { - int ret = dev_forward_skb(dev, skb); + int ret = dev_forward_skb(dev, skb, false); if (likely(!ret)) { skb->dev = dev; @@ -2480,7 +2488,7 @@ int skb_do_redirect(struct sk_buff *skb) goto out_drop; dev = ops->ndo_get_peer_dev(dev); if (unlikely(!dev || -!is_skb_forwardable(dev, skb) || +!(dev->flags & IFF_UP) ||
[PATCH bpf-next V9 6/7] selftests/bpf: use bpf_check_mtu in selftest test_cls_redirect
This demonstrate how bpf_check_mtu() helper can easily be used together with bpf_skb_adjust_room() helper, prior to doing size adjustment, as delta argument is already setup. Hint: This specific test can be selected like this: ./test_progs -t cls_redirect Signed-off-by: Jesper Dangaard Brouer --- .../selftests/bpf/progs/test_cls_redirect.c|7 +++ 1 file changed, 7 insertions(+) diff --git a/tools/testing/selftests/bpf/progs/test_cls_redirect.c b/tools/testing/selftests/bpf/progs/test_cls_redirect.c index c9f8464996ea..3c1e042962e6 100644 --- a/tools/testing/selftests/bpf/progs/test_cls_redirect.c +++ b/tools/testing/selftests/bpf/progs/test_cls_redirect.c @@ -70,6 +70,7 @@ typedef struct { uint64_t errors_total_encap_adjust_failed; uint64_t errors_total_encap_buffer_too_small; uint64_t errors_total_redirect_loop; + uint64_t errors_total_encap_mtu_violate; } metrics_t; typedef enum { @@ -407,6 +408,7 @@ static INLINING ret_t forward_with_gre(struct __sk_buff *skb, encap_headers_t *e payload_off - sizeof(struct ethhdr) - sizeof(struct iphdr); int32_t delta = sizeof(struct gre_base_hdr) - encap_overhead; uint16_t proto = ETH_P_IP; + uint32_t mtu_len = 0; /* Loop protection: the inner packet's TTL is decremented as a safeguard * against any forwarding loop. As the only interesting field is the TTL @@ -479,6 +481,11 @@ static INLINING ret_t forward_with_gre(struct __sk_buff *skb, encap_headers_t *e } } + if (bpf_check_mtu(skb, skb->ifindex, &mtu_len, delta, 0)) { + metrics->errors_total_encap_mtu_violate++; + return TC_ACT_SHOT; + } + if (bpf_skb_adjust_room(skb, delta, BPF_ADJ_ROOM_NET, BPF_F_ADJ_ROOM_FIXED_GSO | BPF_F_ADJ_ROOM_NO_CSUM_RESET) ||
[PATCH bpf-next V9 7/7] bpf/selftests: tests using bpf_check_mtu BPF-helper
Adding selftest for BPF-helper bpf_check_mtu(). Making sure it can be used from both XDP and TC. Signed-off-by: Jesper Dangaard Brouer --- tools/testing/selftests/bpf/prog_tests/check_mtu.c | 204 tools/testing/selftests/bpf/progs/test_check_mtu.c | 196 +++ 2 files changed, 400 insertions(+) create mode 100644 tools/testing/selftests/bpf/prog_tests/check_mtu.c create mode 100644 tools/testing/selftests/bpf/progs/test_check_mtu.c diff --git a/tools/testing/selftests/bpf/prog_tests/check_mtu.c b/tools/testing/selftests/bpf/prog_tests/check_mtu.c new file mode 100644 index ..b5d0c3a9abe8 --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/check_mtu.c @@ -0,0 +1,204 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2020 Jesper Dangaard Brouer */ + +#include /* before test_progs.h, avoid bpf_util.h redefines */ + +#include +#include "test_check_mtu.skel.h" +#include + +#include +#include + +#define IFINDEX_LO 1 + +static __u32 duration; /* Hint: needed for CHECK macro */ + +static int read_mtu_device_lo(void) +{ + const char *filename = "/sys/devices/virtual/net/lo/mtu"; + char buf[11] = {}; + int value; + int fd; + + fd = open(filename, 0, O_RDONLY); + if (fd == -1) + return -1; + + if (read(fd, buf, sizeof(buf)) == -1) + return -2; + close(fd); + + value = strtoimax(buf, NULL, 10); + if (errno == ERANGE) + return -3; + + return value; +} + +static void test_check_mtu_xdp_attach(struct bpf_program *prog) +{ + int err = 0; + int fd; + + fd = bpf_program__fd(prog); + err = bpf_set_link_xdp_fd(IFINDEX_LO, fd, XDP_FLAGS_SKB_MODE); + if (CHECK(err, "XDP-attach", "failed")) + return; + + bpf_set_link_xdp_fd(IFINDEX_LO, -1, 0); +} + +static void test_check_mtu_run_xdp(struct test_check_mtu *skel, + struct bpf_program *prog, + __u32 mtu_expect) +{ + const char *prog_name = bpf_program__name(prog); + int retval_expect = XDP_PASS; + __u32 mtu_result = 0; + char buf[256]; + int err; + + struct bpf_prog_test_run_attr tattr = { + .repeat = 1, + .data_in = &pkt_v4, + 
.data_size_in = sizeof(pkt_v4), + .data_out = buf, + .data_size_out = sizeof(buf), + .prog_fd = bpf_program__fd(prog), + }; + + memset(buf, 0, sizeof(buf)); + + err = bpf_prog_test_run_xattr(&tattr); + CHECK_ATTR(err != 0 || errno != 0, "bpf_prog_test_run", + "prog_name:%s (err %d errno %d retval %d)\n", + prog_name, err, errno, tattr.retval); + +CHECK(tattr.retval != retval_expect, "retval", + "progname:%s unexpected retval=%d expected=%d\n", + prog_name, tattr.retval, retval_expect); + + /* Extract MTU that BPF-prog got */ + mtu_result = skel->bss->global_bpf_mtu_xdp; + CHECK(mtu_result != mtu_expect, "MTU-compare-user", + "failed (MTU user:%d bpf:%d)", mtu_expect, mtu_result); +} + +static void test_check_mtu_xdp(__u32 mtu, __u32 ifindex) +{ + struct test_check_mtu *skel; + int err; + + skel = test_check_mtu__open(); + if (CHECK(!skel, "skel_open", "failed")) + return; + + /* Update "constants" in BPF-prog *BEFORE* libbpf load */ + skel->rodata->GLOBAL_USER_MTU = mtu; + skel->rodata->GLOBAL_USER_IFINDEX = ifindex; + + err = test_check_mtu__load(skel); + if (CHECK(err, "skel_load", "failed: %d\n", err)) + goto cleanup; + + test_check_mtu_run_xdp(skel, skel->progs.xdp_use_helper, mtu); + test_check_mtu_run_xdp(skel, skel->progs.xdp_exceed_mtu, mtu); + test_check_mtu_run_xdp(skel, skel->progs.xdp_minus_delta, mtu); + +cleanup: + test_check_mtu__destroy(skel); +} + +static void test_check_mtu_run_tc(struct test_check_mtu *skel, + struct bpf_program *prog, + __u32 mtu_expect) +{ + const char *prog_name = bpf_program__name(prog); + int retval_expect = BPF_OK; + __u32 mtu_result = 0; + char buf[256]; + int err; + + struct bpf_prog_test_run_attr tattr = { + .repeat = 1, + .data_in = &pkt_v4, + .data_size_in = sizeof(pkt_v4), + .data_out = buf, + .data_size_out = sizeof(buf), + .prog_fd = bpf_program__fd(prog), + }; + + memset(buf, 0, sizeof(buf)); + + err = bpf_prog_test_run_xattr(&tattr); + CHECK_ATTR(err != 0 || errno != 0, "
Re: [PATCH bpf-next V9 7/7] bpf/selftests: tests using bpf_check_mtu BPF-helper
On Thu, 17 Dec 2020 18:26:50 +0100 Jesper Dangaard Brouer wrote: > Adding selftest for BPF-helper bpf_check_mtu(). Making sure > it can be used from both XDP and TC. > > Signed-off-by: Jesper Dangaard Brouer > --- > tools/testing/selftests/bpf/prog_tests/check_mtu.c | 204 > > tools/testing/selftests/bpf/progs/test_check_mtu.c | 196 +++ > 2 files changed, 400 insertions(+) > create mode 100644 tools/testing/selftests/bpf/prog_tests/check_mtu.c > create mode 100644 tools/testing/selftests/bpf/progs/test_check_mtu.c Will send V10 as I have an error in this selftests > diff --git a/tools/testing/selftests/bpf/prog_tests/check_mtu.c > b/tools/testing/selftests/bpf/prog_tests/check_mtu.c > new file mode 100644 > index ..b5d0c3a9abe8 > --- /dev/null > +++ b/tools/testing/selftests/bpf/prog_tests/check_mtu.c [...] > +static void test_check_mtu_run_xdp(struct test_check_mtu *skel, > +struct bpf_program *prog, > +__u32 mtu_expect) > +{ > + const char *prog_name = bpf_program__name(prog); > + int retval_expect = XDP_PASS; > + __u32 mtu_result = 0; > + char buf[256]; > + int err; > + > + struct bpf_prog_test_run_attr tattr = { > + .repeat = 1, > + .data_in = &pkt_v4, > + .data_size_in = sizeof(pkt_v4), > + .data_out = buf, > + .data_size_out = sizeof(buf), > + .prog_fd = bpf_program__fd(prog), > + }; > + > + memset(buf, 0, sizeof(buf)); > + > + err = bpf_prog_test_run_xattr(&tattr); > + CHECK_ATTR(err != 0 || errno != 0, "bpf_prog_test_run", ^^^ You/I cannot use the check "errno != 0" here, as something else could have set it earlier. 
> +"prog_name:%s (err %d errno %d retval %d)\n", > +prog_name, err, errno, tattr.retval); > + > +CHECK(tattr.retval != retval_expect, "retval", > + "progname:%s unexpected retval=%d expected=%d\n", > + prog_name, tattr.retval, retval_expect); > + > + /* Extract MTU that BPF-prog got */ > + mtu_result = skel->bss->global_bpf_mtu_xdp; > + CHECK(mtu_result != mtu_expect, "MTU-compare-user", > + "failed (MTU user:%d bpf:%d)", mtu_expect, mtu_result); > +} > +static void test_check_mtu_run_tc(struct test_check_mtu *skel, > + struct bpf_program *prog, > + __u32 mtu_expect) > +{ [...] > + err = bpf_prog_test_run_xattr(&tattr); > + CHECK_ATTR(err != 0 || errno != 0, "bpf_prog_test_run", > +"prog_name:%s (err %d errno %d retval %d)\n", > +prog_name, err, errno, tattr.retval); Same issue here. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH v4 bpf-next 1/2] net: xdp: introduce xdp_init_buff utility routine
On Sat, 19 Dec 2020 18:55:00 +0100 Lorenzo Bianconi wrote: > diff --git a/include/net/xdp.h b/include/net/xdp.h > index 11ec93f827c0..323340caef88 100644 > --- a/include/net/xdp.h > +++ b/include/net/xdp.h > @@ -76,6 +76,13 @@ struct xdp_buff { > u32 frame_sz; /* frame size to deduce data_hard_end/reserved tailroom*/ > }; > > +static __always_inline void > +xdp_init_buff(struct xdp_buff *xdp, u32 frame_sz, struct xdp_rxq_info *rxq) > +{ > + xdp->frame_sz = frame_sz; > + xdp->rxq = rxq; Later you will add 'xdp->mb = 0' here. > +} Via the names of your functions, I assume that xdp_init_buff() is called before xdp_prepare_buff(), right? (And your pending 'xdp->mb = 0' also prefer this.) Below in bpf_prog_test_run_xdp() and netif_receive_generic_xdp() you violate this order... which will give you headaches when implementing the multi-buff support. It is also a bad example for driver developer that need to figure out this calling-order from the function names. Below, will it be possible to have 'init' before 'prepare'? > + > /* Reserve memory area at end-of data area. 
> * > * This macro reserves tailroom in the XDP buffer by limiting the > diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c > index c1c30a9f76f3..a8fa5a9e4137 100644 > --- a/net/bpf/test_run.c > +++ b/net/bpf/test_run.c > @@ -640,10 +640,10 @@ int bpf_prog_test_run_xdp(struct bpf_prog *prog, const > union bpf_attr *kattr, > xdp.data = data + headroom; > xdp.data_meta = xdp.data; > xdp.data_end = xdp.data + size; > - xdp.frame_sz = headroom + max_data_sz + tailroom; > > rxqueue = __netif_get_rx_queue(current->nsproxy->net_ns->loopback_dev, > 0); > - xdp.rxq = &rxqueue->xdp_rxq; > + xdp_init_buff(&xdp, headroom + max_data_sz + tailroom, > + &rxqueue->xdp_rxq); > bpf_prog_change_xdp(NULL, prog); > ret = bpf_test_run(prog, &xdp, repeat, &retval, &duration, true); > if (ret) > diff --git a/net/core/dev.c b/net/core/dev.c > index a46334906c94..b1a765900c01 100644 > --- a/net/core/dev.c > +++ b/net/core/dev.c > @@ -4588,11 +4588,11 @@ static u32 netif_receive_generic_xdp(struct sk_buff > *skb, > struct netdev_rx_queue *rxqueue; > void *orig_data, *orig_data_end; > u32 metalen, act = XDP_DROP; > + u32 mac_len, frame_sz; > __be16 orig_eth_type; > struct ethhdr *eth; > bool orig_bcast; > int hlen, off; > - u32 mac_len; > > /* Reinjected packets coming from act_mirred or similar should >* not get XDP generic processing. 
> @@ -4631,8 +4631,8 @@ static u32 netif_receive_generic_xdp(struct sk_buff > *skb, > xdp->data_hard_start = skb->data - skb_headroom(skb); > > /* SKB "head" area always have tailroom for skb_shared_info */ > - xdp->frame_sz = (void *)skb_end_pointer(skb) - xdp->data_hard_start; > - xdp->frame_sz += SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); > + frame_sz = (void *)skb_end_pointer(skb) - xdp->data_hard_start; > + frame_sz += SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); > > orig_data_end = xdp->data_end; > orig_data = xdp->data; > @@ -4641,7 +4641,7 @@ static u32 netif_receive_generic_xdp(struct sk_buff > *skb, > orig_eth_type = eth->h_proto; > > rxqueue = netif_get_rxqueue(skb); > - xdp->rxq = &rxqueue->xdp_rxq; > + xdp_init_buff(xdp, frame_sz, &rxqueue->xdp_rxq); > > act = bpf_prog_run_xdp(xdp_prog, xdp); -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH v5 bpf-next 03/14] xdp: add xdp_shared_info data structure
On Sat, 19 Dec 2020 10:30:57 -0500 Jamal Hadi Salim wrote: > On 2020-12-19 9:53 a.m., Shay Agroskin wrote: > > > > Lorenzo Bianconi writes: > > > > >> for the moment I do not know if this area is used for other purposes. > >> Do you think there are other use-cases for it? Yes, all the same use-cases as SKBs have. I wanted to keep this the same as skb_shared_info, but Lorenzo chose to take John's advice and it is going in this direction (which is fine, we can always change and adjust this later). > Sorry to interject: > Does it make sense to use it to store arbitrary metadata or a scratchpad > in this space? Something equivalent to skb->cb which is lacking in > XDP. Well, XDP has the data_meta area. But it is difficult to rely on because a lot of drivers don't implement it. And Saeed and I plan to use this area and populate it with driver info from the RX-descriptor. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH bpf-next V9 7/7] bpf/selftests: tests using bpf_check_mtu BPF-helper
On Fri, 18 Dec 2020 12:13:45 -0800 Andrii Nakryiko wrote: > On Thu, Dec 17, 2020 at 9:30 AM Jesper Dangaard Brouer > wrote: > > > > Adding selftest for BPF-helper bpf_check_mtu(). Making sure > > it can be used from both XDP and TC. > > > > Signed-off-by: Jesper Dangaard Brouer > > --- > > tools/testing/selftests/bpf/prog_tests/check_mtu.c | 204 > > > > tools/testing/selftests/bpf/progs/test_check_mtu.c | 196 > > +++ > > 2 files changed, 400 insertions(+) > > create mode 100644 tools/testing/selftests/bpf/prog_tests/check_mtu.c > > create mode 100644 tools/testing/selftests/bpf/progs/test_check_mtu.c > > > > diff --git a/tools/testing/selftests/bpf/prog_tests/check_mtu.c > > b/tools/testing/selftests/bpf/prog_tests/check_mtu.c > > new file mode 100644 > > index ..b5d0c3a9abe8 > > --- /dev/null > > +++ b/tools/testing/selftests/bpf/prog_tests/check_mtu.c > > @@ -0,0 +1,204 @@ > > +// SPDX-License-Identifier: GPL-2.0 > > +/* Copyright (c) 2020 Jesper Dangaard Brouer */ > > + > > +#include /* before test_progs.h, avoid bpf_util.h > > redefines */ > > + > > +#include > > +#include "test_check_mtu.skel.h" > > +#include > > + > > +#include > > +#include > > + > > +#define IFINDEX_LO 1 > > + > > +static __u32 duration; /* Hint: needed for CHECK macro */ > > + > > +static int read_mtu_device_lo(void) > > +{ > > + const char *filename = "/sys/devices/virtual/net/lo/mtu"; I will change this to: /sys/class/net/lo/mtu > > + char buf[11] = {}; > > + int value; > > + int fd; > > + > > + fd = open(filename, 0, O_RDONLY); > > + if (fd == -1) > > + return -1; > > + > > + if (read(fd, buf, sizeof(buf)) == -1) > > close fd missing here? ack, fixed. 
> > + return -2; > > + close(fd); > > + > > + value = strtoimax(buf, NULL, 10); > > + if (errno == ERANGE) > > + return -3; > > + > > + return value; > > +} > > + > > +static void test_check_mtu_xdp_attach(struct bpf_program *prog) > > +{ > > + int err = 0; > > + int fd; > > + > > + fd = bpf_program__fd(prog); > > + err = bpf_set_link_xdp_fd(IFINDEX_LO, fd, XDP_FLAGS_SKB_MODE); > > + if (CHECK(err, "XDP-attach", "failed")) > > + return; > > + > > + bpf_set_link_xdp_fd(IFINDEX_LO, -1, 0); > > can you please use bpf_link-based bpf_program__attach_xdp() which will > provide auto-cleanup in case of crash? Sure, that will be good for me to learn. > also check that it succeeded? > > > +} > > + > > +static void test_check_mtu_run_xdp(struct test_check_mtu *skel, > > + struct bpf_program *prog, > > + __u32 mtu_expect) > > +{ > > + const char *prog_name = bpf_program__name(prog); > > + int retval_expect = XDP_PASS; > > + __u32 mtu_result = 0; > > + char buf[256]; > > + int err; > > + > > + struct bpf_prog_test_run_attr tattr = { > > + .repeat = 1, > > + .data_in = &pkt_v4, > > + .data_size_in = sizeof(pkt_v4), > > + .data_out = buf, > > + .data_size_out = sizeof(buf), > > + .prog_fd = bpf_program__fd(prog), > > + }; > > nit: it's a variable declaration, so keep it all in one block. There > is also opts-based variant, which might be good to use here instead. > > > + > > + memset(buf, 0, sizeof(buf)); > > char buf[256] = {}; would make this unnecessary ok. > > > + > > + err = bpf_prog_test_run_xattr(&tattr); > > + CHECK_ATTR(err != 0 || errno != 0, "bpf_prog_test_run", > > + "prog_name:%s (err %d errno %d retval %d)\n", > > + prog_name, err, errno, tattr.retval); > > + > > +CHECK(tattr.retval != retval_expect, "retval", > > whitespaces are off? Yes, I noticed with scripts/checkpatch.pl. And there are a couple more, that I've already fixed. 
> > + "progname:%s unexpected retval=%d expected=%d\n", > > + prog_name, tattr.retval, retval_expect); > > + > > + /* Extract MTU that BPF-prog go
[PATCH bpf-next V10 0/7] bpf: New approach for BPF MTU handling
This patchset drops all the MTU checks in TC BPF-helpers that limit growing the packet size. This is done because these BPF-helpers don't take redirect into account, which can result in their MTU check being done against the wrong netdev. The new approach is to give BPF-programs knowledge about the MTU on a netdev (via ifindex) and at the fib route lookup level. Meaning some BPF-helpers are added and extended to make it possible to do MTU checks in the BPF-code. If the BPF-prog doesn't comply with the MTU then the packet will eventually get dropped at some other layer. In some cases the existing kernel MTU checks will drop the packet, but there are also cases where BPF can bypass these checks. Specifically when doing TC-redirect from the ingress step (sch_handle_ingress) into the egress code path (basically calling dev_queue_xmit()). It is left up to driver code to handle these kinds of MTU violations. One advantage of this approach is that an ingress-to-egress BPF-prog can send information via packet data. With the MTU checks removed in the helpers, and also not done in the skb_do_redirect() call, this allows an ingress BPF-prog to communicate with an egress BPF-prog via packet data, as long as the egress BPF-prog removes this prior to transmitting the packet. This patchset is primarily focused on TC-BPF, but I've made sure that the MTU BPF-helpers also work for XDP BPF-programs. V2: Change BPF-helper API from lookup to check. V3: Drop enforcement of MTU in net-core, leave it to drivers. V4: Keep sanity limit + netdev "up" checks + rename BPF-helper. V5: Fix uninit variable + name struct output member mtu_result.
V6: Use bpf_check_mtu() in selftest V7: Fix logic using tot_len and add another selftest V8: Add better selftests for BPF-helper bpf_check_mtu V9: Remove patch that use skb_set_redirected V10: Fix selftests and 'tot_len' MTU check like XDP --- Jesper Dangaard Brouer (7): bpf: Remove MTU check in __bpf_skb_max_len bpf: fix bpf_fib_lookup helper MTU check for SKB ctx bpf: bpf_fib_lookup return MTU value as output when looked up bpf: add BPF-helper for MTU checking bpf: drop MTU check when doing TC-BPF redirect to ingress selftests/bpf: use bpf_check_mtu in selftest test_cls_redirect bpf/selftests: tests using bpf_check_mtu BPF-helper include/linux/netdevice.h | 31 +++ include/uapi/linux/bpf.h | 78 +++ net/core/dev.c | 19 -- net/core/filter.c | 183 +++-- tools/include/uapi/linux/bpf.h | 78 +++ tools/testing/selftests/bpf/prog_tests/check_mtu.c | 218 tools/testing/selftests/bpf/progs/test_check_mtu.c | 199 ++ .../selftests/bpf/progs/test_cls_redirect.c|7 + 8 files changed, 769 insertions(+), 44 deletions(-) create mode 100644 tools/testing/selftests/bpf/prog_tests/check_mtu.c create mode 100644 tools/testing/selftests/bpf/progs/test_check_mtu.c --
[PATCH bpf-next V10 1/7] bpf: Remove MTU check in __bpf_skb_max_len
Multiple BPF-helpers that can manipulate/increase the size of the SKB use __bpf_skb_max_len() as the max-length. This function limits the size against the current net_device MTU (skb->dev->mtu). When a BPF-prog grows the packet size, it should not be limited to the MTU. The MTU is a transmit limitation, and software receiving this packet should be allowed to increase the size. Furthermore, the current MTU check in __bpf_skb_max_len uses the MTU from the ingress/current net_device, which in case of redirects uses the wrong net_device. This patch keeps a sanity max limit of SKB_MAX_ALLOC (16KiB). The real limit is elsewhere in the system. Jesper's testing[1] showed it was not possible to exceed 8KiB when expanding the SKB size via a BPF-helper. The limiting factor is the define KMALLOC_MAX_CACHE_SIZE which is 8192 for the SLUB-allocator (CONFIG_SLUB) in case PAGE_SIZE is 4096. This define is in effect due to this being called from softirq context; see __gfp_pfmemalloc_flags() and __do_kmalloc_node(). Jakub's testing showed that frames above 16KiB can cause NICs to reset (but not crash). Keep this sanity limit at this level as the memory layer can differ based on kernel config. [1] https://github.com/xdp-project/bpf-examples/tree/master/MTU-tests V3: replace __bpf_skb_max_len() with define and use IPv6 max MTU size. Signed-off-by: Jesper Dangaard Brouer --- net/core/filter.c | 12 1 file changed, 4 insertions(+), 8 deletions(-) diff --git a/net/core/filter.c b/net/core/filter.c index 255aeee72402..f8f198252ff2 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -3552,11 +3552,7 @@ static int bpf_skb_net_shrink(struct sk_buff *skb, u32 off, u32 len_diff, return 0; } -static u32 __bpf_skb_max_len(const struct sk_buff *skb) -{ - return skb->dev ?
skb->dev->mtu + skb->dev->hard_header_len : - SKB_MAX_ALLOC; -} +#define BPF_SKB_MAX_LEN SKB_MAX_ALLOC BPF_CALL_4(sk_skb_adjust_room, struct sk_buff *, skb, s32, len_diff, u32, mode, u64, flags) @@ -3605,7 +3601,7 @@ BPF_CALL_4(bpf_skb_adjust_room, struct sk_buff *, skb, s32, len_diff, { u32 len_cur, len_diff_abs = abs(len_diff); u32 len_min = bpf_skb_net_base_len(skb); - u32 len_max = __bpf_skb_max_len(skb); + u32 len_max = BPF_SKB_MAX_LEN; __be16 proto = skb->protocol; bool shrink = len_diff < 0; u32 off; @@ -3688,7 +3684,7 @@ static int bpf_skb_trim_rcsum(struct sk_buff *skb, unsigned int new_len) static inline int __bpf_skb_change_tail(struct sk_buff *skb, u32 new_len, u64 flags) { - u32 max_len = __bpf_skb_max_len(skb); + u32 max_len = BPF_SKB_MAX_LEN; u32 min_len = __bpf_skb_min_len(skb); int ret; @@ -3764,7 +3760,7 @@ static const struct bpf_func_proto sk_skb_change_tail_proto = { static inline int __bpf_skb_change_head(struct sk_buff *skb, u32 head_room, u64 flags) { - u32 max_len = __bpf_skb_max_len(skb); + u32 max_len = BPF_SKB_MAX_LEN; u32 new_len = skb->len + head_room; int ret;
[PATCH bpf-next V10 2/7] bpf: fix bpf_fib_lookup helper MTU check for SKB ctx
BPF end-user on Cilium slack-channel (Carlo Carraro) wants to use bpf_fib_lookup for doing MTU-check, but *prior* to extending packet size, by adjusting fib_params 'tot_len' with the packet length plus the expected encap size. (Just like the bpf_check_mtu helper supports). He discovered that for SKB ctx the param->tot_len was not used, instead skb->len was used (via MTU check in is_skb_forwardable() that checks against netdev MTU). Fix this by using fib_params 'tot_len' for MTU check. If not provided (e.g. zero) then keep existing TC behaviour intact. Notice that 'tot_len' for MTU check is done like XDP code-path, which checks against FIB-dst MTU. V10: - Use same method as XDP for 'tot_len' MTU check Fixes: 4c79579b44b1 ("bpf: Change bpf_fib_lookup to return lookup status") Reported-by: Carlo Carraro Signed-off-by: Jesper Dangaard Brouer --- net/core/filter.c | 13 ++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/net/core/filter.c b/net/core/filter.c index f8f198252ff2..c1e460193bae 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -5548,6 +5548,7 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb, { struct net *net = dev_net(skb->dev); int rc = -EAFNOSUPPORT; + bool check_mtu = false; if (plen < sizeof(*params)) return -EINVAL; @@ -,22 +5556,28 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb, if (flags & ~(BPF_FIB_LOOKUP_DIRECT | BPF_FIB_LOOKUP_OUTPUT)) return -EINVAL; + if (params->tot_len) + check_mtu = true; + switch (params->family) { #if IS_ENABLED(CONFIG_INET) case AF_INET: - rc = bpf_ipv4_fib_lookup(net, params, flags, false); + rc = bpf_ipv4_fib_lookup(net, params, flags, check_mtu); break; #endif #if IS_ENABLED(CONFIG_IPV6) case AF_INET6: - rc = bpf_ipv6_fib_lookup(net, params, flags, false); + rc = bpf_ipv6_fib_lookup(net, params, flags, check_mtu); break; #endif } - if (!rc) { + if (rc == BPF_FIB_LKUP_RET_SUCCESS && !check_mtu) { struct net_device *dev; + /* When tot_len isn't provided by user, +* check skb against 
net_device MTU +*/ dev = dev_get_by_index_rcu(net, params->ifindex); if (!is_skb_forwardable(dev, skb)) rc = BPF_FIB_LKUP_RET_FRAG_NEEDED;
[PATCH bpf-next V10 4/7] bpf: add BPF-helper for MTU checking
This BPF-helper bpf_check_mtu() works for both XDP and TC-BPF programs. The SKB object is complex and the skb->len value (accessible from the BPF-prog) also includes the length of any extra GRO/GSO segments, but without taking into account that these GRO/GSO segments get transport (L4) and network (L3) headers added before being transmitted. Thus, this BPF-helper is created such that the BPF-programmer doesn't need to handle these details in the BPF-prog. The API is designed to help the BPF-programmer who wants to do packet context size changes, which involve other helpers. These other helpers usually do a delta size adjustment. This helper also supports a delta size (len_diff), which allows the BPF-programmer to reuse arguments needed by these other helpers, and perform the MTU check prior to doing any actual size adjustment of the packet context. It is on purpose that we allow the len adjustment to become a negative result that will pass the MTU check. This might seem weird, but it's not this helper's responsibility to "catch" wrong len_diff adjustments. Other helpers will take care of these checks, if the BPF-programmer chooses to do an actual size adjustment.
V9: - Use dev->hard_header_len (instead of ETH_HLEN) - Annotate with unlikely req from Daniel - Fix logic error using skb_gso_validate_network_len from Daniel V6: - Took John's advice and dropped BPF_MTU_CHK_RELAX - Returned MTU is kept at L3-level (like fib_lookup) V4: Lot of changes - ifindex 0 now use current netdev for MTU lookup - rename helper from bpf_mtu_check to bpf_check_mtu - fix bug for GSO pkt length (as skb->len is total len) - remove __bpf_len_adj_positive, simply allow negative len adj Signed-off-by: Jesper Dangaard Brouer --- include/uapi/linux/bpf.h | 67 ++ net/core/filter.c | 122 tools/include/uapi/linux/bpf.h | 67 ++ 3 files changed, 256 insertions(+) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 649586d656b6..fa2e99351758 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -3833,6 +3833,61 @@ union bpf_attr { * Return * A pointer to a struct socket on success or NULL if the file is * not a socket. + * + * int bpf_check_mtu(void *ctx, u32 ifindex, u32 *mtu_len, s32 len_diff, u64 flags) + * Description + * Check ctx packet size against MTU of net device (based on + * *ifindex*). This helper will likely be used in combination with + * helpers that adjust/change the packet size. The argument + * *len_diff* can be used for querying with a planned size + * change. This allows to check MTU prior to changing packet ctx. + * + * Specifying *ifindex* zero means the MTU check is performed + * against the current net device. This is practical if this isn't + * used prior to redirect. + * + * The Linux kernel route table can configure MTUs on a more + * specific per route level, which is not provided by this helper. + * For route level MTU checks use the **bpf_fib_lookup**\ () + * helper. + * + * *ctx* is either **struct xdp_md** for XDP programs or + * **struct sk_buff** for tc cls_act programs. 
+ * + * The *flags* argument can be a combination of one or more of the + * following values: + * + * **BPF_MTU_CHK_SEGS** + * This flag only works for *ctx* **struct sk_buff**. + * If the packet context contains extra packet segment buffers + * (often known as a GSO skb), then the MTU check is harder to + * do at this point, because in the transmit path it is + * possible for the skb packet to get re-segmented + * (depending on net device features). This could still be + * a MTU violation, so this flag enables performing the MTU + * check against segments, with a different violation + * return code to tell it apart. The check cannot use len_diff. + * + * On return the *mtu_len* pointer contains the MTU value of the net + * device. Remember the net device configured MTU is the L3 size, + * which is returned here, while XDP and TC length operate at L2. + * The helper takes this into account for you, but remember this when + * using the MTU value in your BPF-code. On input *mtu_len* must be a valid + * pointer and be initialized (to zero), else the verifier will reject + * the BPF program. + * + * Return + * * 0 on success, and populate MTU value in *mtu_len* pointer. + * + * * < 0 if any input argument is invalid (*mtu_len* not updated) + * + * MTU v
[PATCH bpf-next V10 3/7] bpf: bpf_fib_lookup return MTU value as output when looked up
The BPF-helpers for FIB lookup (bpf_xdp_fib_lookup and bpf_skb_fib_lookup) can perform an MTU check and return BPF_FIB_LKUP_RET_FRAG_NEEDED. The BPF-prog doesn't know the MTU value that caused this rejection. If the BPF-prog wants to implement PMTU (Path MTU Discovery) (rfc1191) it needs to know this MTU value for the ICMP packet. This patch changes the lookup and result struct bpf_fib_lookup to contain this MTU value as output via a union with 'tot_len', as this is the value used for the MTU lookup. V5: - Fixed uninit value spotted by Dan Carpenter. - Name struct output member mtu_result Reported-by: kernel test robot Reported-by: Dan Carpenter Signed-off-by: Jesper Dangaard Brouer --- include/uapi/linux/bpf.h | 11 +-- net/core/filter.c | 22 +++--- tools/include/uapi/linux/bpf.h | 11 +-- 3 files changed, 33 insertions(+), 11 deletions(-) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 77d7c1bb2923..649586d656b6 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -2225,6 +2225,9 @@ union bpf_attr { * * > 0 one of **BPF_FIB_LKUP_RET_** codes explaining why the * packet is not forwarded or needs assist from full stack * + * If lookup fails with BPF_FIB_LKUP_RET_FRAG_NEEDED, then the MTU + * was exceeded and output params->mtu_result contains the MTU. + * * long bpf_sock_hash_update(struct bpf_sock_ops *skops, struct bpf_map *map, void *key, u64 flags) * Description * Add an entry to, or update a sockhash *map* referencing sockets.
@@ -4975,9 +4978,13 @@ struct bpf_fib_lookup { __be16 sport; __be16 dport; - /* total length of packet from network header - used for MTU check */ - __u16 tot_len; + union { /* used for MTU check */ + /* input to lookup */ + __u16 tot_len; /* L3 length from network hdr (iph->tot_len) */ + /* output: MTU value */ + __u16 mtu_result; + }; /* input: L3 device index for lookup * output: device index from FIB lookup */ diff --git a/net/core/filter.c b/net/core/filter.c index c1e460193bae..db59ab55572c 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -5268,12 +5268,14 @@ static const struct bpf_func_proto bpf_skb_get_xfrm_state_proto = { #if IS_ENABLED(CONFIG_INET) || IS_ENABLED(CONFIG_IPV6) static int bpf_fib_set_fwd_params(struct bpf_fib_lookup *params, const struct neighbour *neigh, - const struct net_device *dev) + const struct net_device *dev, u32 mtu) { memcpy(params->dmac, neigh->ha, ETH_ALEN); memcpy(params->smac, dev->dev_addr, ETH_ALEN); params->h_vlan_TCI = 0; params->h_vlan_proto = 0; + if (mtu) + params->mtu_result = mtu; /* union with tot_len */ return 0; } @@ -5289,8 +5291,8 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params, struct net_device *dev; struct fib_result res; struct flowi4 fl4; + u32 mtu = 0; int err; - u32 mtu; dev = dev_get_by_index_rcu(net, params->ifindex); if (unlikely(!dev)) @@ -5357,8 +5359,10 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params, if (check_mtu) { mtu = ip_mtu_from_fib_result(&res, params->ipv4_dst); - if (params->tot_len > mtu) + if (params->tot_len > mtu) { + params->mtu_result = mtu; /* union with tot_len */ return BPF_FIB_LKUP_RET_FRAG_NEEDED; + } } nhc = res.nhc; @@ -5392,7 +5396,7 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params, if (!neigh) return BPF_FIB_LKUP_RET_NO_NEIGH; - return bpf_fib_set_fwd_params(params, neigh, dev); + return bpf_fib_set_fwd_params(params, neigh, dev, mtu); } #endif @@ -5409,7 +5413,7 @@ 
static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params, struct flowi6 fl6; int strict = 0; int oif, err; - u32 mtu; + u32 mtu = 0; /* link local addresses are never forwarded */ if (rt6_need_strict(dst) || rt6_need_strict(src)) @@ -5484,8 +5488,10 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params, if (check_mtu) { mtu = ipv6_stub->ip6_mtu_from_fib6(&res, dst, src); - if (params->tot_len > mtu) + if (params->tot_len > mtu) { + params->mtu_result = mtu; /* union with tot_len */ return BPF_FIB_LKUP_RET_FRAG_NEEDED; +
[PATCH bpf-next V10 5/7] bpf: drop MTU check when doing TC-BPF redirect to ingress
The use-case for dropping the MTU check when TC-BPF does redirect to ingress, is described by Eyal Birger in email[0]. The summary is the ability to increase packet size (e.g. with IPv6 headers for NAT64) and ingress redirect packet and let normal netstack fragment packet as needed. [0] https://lore.kernel.org/netdev/CAHsH6Gug-hsLGHQ6N0wtixdOa85LDZ3HNRHVd0opR=19qo4...@mail.gmail.com/ V9: - Make net_device "up" (IFF_UP) check explicit in skb_do_redirect V4: - Keep net_device "up" (IFF_UP) check. - Adjustment to handle bpf_redirect_peer() helper Signed-off-by: Jesper Dangaard Brouer --- include/linux/netdevice.h | 31 +-- net/core/dev.c| 19 ++- net/core/filter.c | 14 +++--- 3 files changed, 42 insertions(+), 22 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 7bf167993c05..4c78f73db1ec 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -3947,11 +3947,38 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb); bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff *skb); +static __always_inline bool __is_skb_forwardable(const struct net_device *dev, +const struct sk_buff *skb, +const bool check_mtu) +{ + const u32 vlan_hdr_len = 4; /* VLAN_HLEN */ + unsigned int len; + + if (!(dev->flags & IFF_UP)) + return false; + + if (!check_mtu) + return true; + + len = dev->mtu + dev->hard_header_len + vlan_hdr_len; + if (skb->len <= len) + return true; + + /* if TSO is enabled, we don't care about the length as the packet +* could be forwarded without being segmented before +*/ + if (skb_is_gso(skb)) + return true; + + return false; +} + static __always_inline int dev_forward_skb(struct net_device *dev, - struct sk_buff *skb) + struct sk_buff *skb, + const bool check_mtu) { if (skb_orphan_frags(skb, GFP_ATOMIC) || - unlikely(!is_skb_forwardable(dev, skb))) { + unlikely(!__is_skb_forwardable(dev, skb, check_mtu))) { atomic_long_inc(&dev->rx_dropped); kfree_skb(skb); return NET_RX_DROP; 
diff --git a/net/core/dev.c b/net/core/dev.c index a46334906c94..ffead13c158f 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -2176,28 +2176,13 @@ static inline void net_timestamp_set(struct sk_buff *skb) bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff *skb) { - unsigned int len; - - if (!(dev->flags & IFF_UP)) - return false; - - len = dev->mtu + dev->hard_header_len + VLAN_HLEN; - if (skb->len <= len) - return true; - - /* if TSO is enabled, we don't care about the length as the packet -* could be forwarded without being segmented before -*/ - if (skb_is_gso(skb)) - return true; - - return false; + return __is_skb_forwardable(dev, skb, true); } EXPORT_SYMBOL_GPL(is_skb_forwardable); int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb) { - int ret = dev_forward_skb(dev, skb); + int ret = dev_forward_skb(dev, skb, true); if (likely(!ret)) { skb->protocol = eth_type_trans(skb, dev); diff --git a/net/core/filter.c b/net/core/filter.c index 3f2e593244ca..1908800b671c 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -2083,13 +2083,21 @@ static const struct bpf_func_proto bpf_csum_level_proto = { static inline int __bpf_rx_skb(struct net_device *dev, struct sk_buff *skb) { - return dev_forward_skb(dev, skb); + int ret = dev_forward_skb(dev, skb, false); + + if (likely(!ret)) { + skb->protocol = eth_type_trans(skb, dev); + skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN); + ret = netif_rx(skb); + } + + return ret; } static inline int __bpf_rx_skb_no_mac(struct net_device *dev, struct sk_buff *skb) { - int ret = dev_forward_skb(dev, skb); + int ret = dev_forward_skb(dev, skb, false); if (likely(!ret)) { skb->dev = dev; @@ -2480,7 +2488,7 @@ int skb_do_redirect(struct sk_buff *skb) goto out_drop; dev = ops->ndo_get_peer_dev(dev); if (unlikely(!dev || -!is_skb_forwardable(dev, skb) || +!(dev->flags & IFF_UP) ||