Re: [RFC net PATCH] virtio_net: disable XDP_REDIRECT in receive_mergeable() case
On Thu, Feb 15, 2018 at 11:43:21PM +0100, Jesper Dangaard Brouer wrote: > The virtio_net code have three different RX code-paths in receive_buf(). > Two of these code paths can handle XDP, but one of them is broken for > at least XDP_REDIRECT. > > Function(1): receive_big() does not support XDP. > Function(2): receive_small() support XDP fully and uses build_skb(). > Function(3): receive_mergeable() broken XDP_REDIRECT uses napi_alloc_skb(). > > The simple explanation is that receive_mergeable() is broken because > it uses napi_alloc_skb(), which violates XDP given XDP assumes packet > header+data in single page and enough tail room for skb_shared_info. > > The longer explaination is that receive_mergeable() tries to > work-around and satisfy these XDP requiresments e.g. by having a > function xdp_linearize_page() that allocates and memcpy RX buffers > around (in case packet is scattered across multiple rx buffers). This > does currently satisfy XDP_PASS, XDP_DROP and XDP_TX (but only because > we have not implemented bpf_xdp_adjust_tail yet). > > The XDP_REDIRECT action combined with cpumap is broken, and cause hard > to debug crashes. The main issue is that the RX packet does not have > the needed tail-room (SKB_DATA_ALIGN(skb_shared_info)), causing > skb_shared_info to overlap the next packets head-room (in which cpumap > stores info). > > Reproducing depend on the packet payload length and if RX-buffer size > happened to have tail-room for skb_shared_info or not. But to make > this even harder to troubleshoot, the RX-buffer size is runtime > dynamically change based on an Exponentially Weighted Moving Average > (EWMA) over the packet length, when refilling RX rings. > > This patch only disable XDP_REDIRECT support in receive_mergeable() > case, because it can cause a real crash. > > But IMHO we should NOT support XDP in receive_mergeable() at all, > because the principles behind XDP are to gain speed by (1) code > simplicity, (2) sacrificing memory and (3) where possible moving > runtime checks to setup time. These principles are clearly being > violated in receive_mergeable(), that e.g. runtime track average > buffer size to save memory consumption. As long as buffers we supply are large enough, we can handle mergeable by receive_small normally. The only issue is with outstanding buffers submitted before XDP was enabled. It should be possible to just copy these out. > Besides the described bug: > > Update(1): There is also a OOM leak in the XDP_REDIRECT code, which > receive_small() is likely also affected by. > > Update(2): Also observed a guest crash when redirecting out an > another virtio_net device, when device is down. > > Fixes: 186b3c998c50 ("virtio-net: support XDP_REDIRECT") > Signed-off-by: Jesper Dangaard Brouer> --- > drivers/net/virtio_net.c |7 --- > 1 file changed, 7 deletions(-) > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > index 626c27352ae2..0ca91942a884 100644 > --- a/drivers/net/virtio_net.c > +++ b/drivers/net/virtio_net.c > @@ -677,7 +677,6 @@ static struct sk_buff *receive_mergeable(struct > net_device *dev, > struct bpf_prog *xdp_prog; > unsigned int truesize; > unsigned int headroom = mergeable_ctx_to_headroom(ctx); > - int err; > > head_skb = NULL; > > @@ -754,12 +753,6 @@ static struct sk_buff *receive_mergeable(struct > net_device *dev, > goto err_xdp; > rcu_read_unlock(); > goto xdp_xmit; > - case XDP_REDIRECT: > - err = xdp_do_redirect(dev, , xdp_prog); > - if (!err) > - *xdp_xmit = true; > - rcu_read_unlock(); > - goto xdp_xmit; > default: > bpf_warn_invalid_xdp_action(act); > case XDP_ABORTED:
Re: [RFC net PATCH] virtio_net: disable XDP_REDIRECT in receive_mergeable() case
On Fri, Feb 16, 2018 at 04:41:26PM +0100, Jesper Dangaard Brouer wrote: > On Fri, 16 Feb 2018 13:31:37 +0800 > Jason Wangwrote: > > > On 2018年02月16日 06:43, Jesper Dangaard Brouer wrote: > > > The virtio_net code have three different RX code-paths in receive_buf(). > > > Two of these code paths can handle XDP, but one of them is broken for > > > at least XDP_REDIRECT. > > > > > > Function(1): receive_big() does not support XDP. > > > Function(2): receive_small() support XDP fully and uses build_skb(). > > > Function(3): receive_mergeable() broken XDP_REDIRECT uses > > > napi_alloc_skb(). > > > > > > The simple explanation is that receive_mergeable() is broken because > > > it uses napi_alloc_skb(), which violates XDP given XDP assumes packet > > > header+data in single page and enough tail room for skb_shared_info. > > > > > > The longer explaination is that receive_mergeable() tries to > > > work-around and satisfy these XDP requiresments e.g. by having a > > > function xdp_linearize_page() that allocates and memcpy RX buffers > > > around (in case packet is scattered across multiple rx buffers). This > > > does currently satisfy XDP_PASS, XDP_DROP and XDP_TX (but only because > > > we have not implemented bpf_xdp_adjust_tail yet). > > > > > > The XDP_REDIRECT action combined with cpumap is broken, and cause hard > > > to debug crashes. The main issue is that the RX packet does not have > > > the needed tail-room (SKB_DATA_ALIGN(skb_shared_info)), causing > > > skb_shared_info to overlap the next packets head-room (in which cpumap > > > stores info). > > > > > > Reproducing depend on the packet payload length and if RX-buffer size > > > happened to have tail-room for skb_shared_info or not. But to make > > > this even harder to troubleshoot, the RX-buffer size is runtime > > > dynamically change based on an Exponentially Weighted Moving Average > > > (EWMA) over the packet length, when refilling RX rings. > > > > > > This patch only disable XDP_REDIRECT support in receive_mergeable() > > > case, because it can cause a real crash. > > > > > > But IMHO we should NOT support XDP in receive_mergeable() at all, > > > because the principles behind XDP are to gain speed by (1) code > > > simplicity, (2) sacrificing memory and (3) where possible moving > > > runtime checks to setup time. These principles are clearly being > > > violated in receive_mergeable(), that e.g. runtime track average > > > buffer size to save memory consumption. > > > > I agree to disable it for -net now. > > Okay... I'll send an official patch later. > > > For net-next, we probably can do: > > > > - drop xdp_linearize_page() and do XDP through generic XDP helper > > after skb was built > > I disagree strongly here - it makes no sense. > > Why do you want to explicit fallback to Generic-XDP? > (... then all the performance gain is gone!) > And besides, a couple of function calls later, the generic XDP code > will/can get invoked anyhow... > > > Take a step back: > What is the reason/use-case for implementing XDP inside virtio_net? Firewall within guest, controllable by guest admin. > From a DDoS/performance perspective XDP in virtio_net happens on the > "wrong-side" as it is activated _inside_ the guest OS, which is too > late for a DDoS filter, as the guest kick/switch overhead have already > occurred. > > I do use XDP_DROP inside the guest (driver virtio_net), but just to > perform what I can zoom-in benchmarking, for perf-record isolating the > early RX code path in the guest. (Using iptables "raw" table drop is > almost as useful for that purpose). > > > > The XDP ndo_xdp_xmit in tuntap/tun.c (that you also implemented) is > significantly more interesting. As it allow us to skip large parts of > the network stack and redirect from a physical device (ixgbe) into a > guest device. Ran a benchmark: > - 0.5 Mpps with normal code path into device with driver tun > - 3.7 Mpps with XDP_REDIRECT from ixgbe into same device > > Plus, there are indications that 3.7Mpps is not the real limit, as > guest CPU doing XDP_DROP is 75% idle... thus this is a likely a > scheduling + queue size issue. > > > > - disable EWMA when XDP is set and reserve enough tailroom. > > > > > > > > Besides the described bug: > > > > > > Update(1): There is also a OOM leak in the XDP_REDIRECT code, which > > > receive_small() is likely also affected by. > > > > > > Update(2): Also observed a guest crash when redirecting out an > > > another virtio_net device, when device is down. > > > > Will have a look at these issues. (Holiday in china now, so will do it > > after). > > > > -- > Best regards, > Jesper Dangaard Brouer > MSc.CS, Principal Kernel Engineer at Red Hat > LinkedIn: http://www.linkedin.com/in/brouer
Re: [RFC net PATCH] virtio_net: disable XDP_REDIRECT in receive_mergeable() case
On 2018年02月21日 00:52, John Fastabend wrote: On 02/20/2018 03:17 AM, Jesper Dangaard Brouer wrote: On Fri, 16 Feb 2018 09:19:02 -0800 John Fastabendwrote: On 02/16/2018 07:41 AM, Jesper Dangaard Brouer wrote: On Fri, 16 Feb 2018 13:31:37 +0800 Jason Wang wrote: On 2018年02月16日 06:43, Jesper Dangaard Brouer wrote: The virtio_net code have three different RX code-paths in receive_buf(). Two of these code paths can handle XDP, but one of them is broken for at least XDP_REDIRECT. Function(1): receive_big() does not support XDP. Function(2): receive_small() support XDP fully and uses build_skb(). Function(3): receive_mergeable() broken XDP_REDIRECT uses napi_alloc_skb(). The simple explanation is that receive_mergeable() is broken because it uses napi_alloc_skb(), which violates XDP given XDP assumes packet header+data in single page and enough tail room for skb_shared_info. The longer explaination is that receive_mergeable() tries to work-around and satisfy these XDP requiresments e.g. by having a function xdp_linearize_page() that allocates and memcpy RX buffers around (in case packet is scattered across multiple rx buffers). This does currently satisfy XDP_PASS, XDP_DROP and XDP_TX (but only because we have not implemented bpf_xdp_adjust_tail yet). The XDP_REDIRECT action combined with cpumap is broken, and cause hard to debug crashes. The main issue is that the RX packet does not have the needed tail-room (SKB_DATA_ALIGN(skb_shared_info)), causing skb_shared_info to overlap the next packets head-room (in which cpumap stores info). Reproducing depend on the packet payload length and if RX-buffer size happened to have tail-room for skb_shared_info or not. But to make this even harder to troubleshoot, the RX-buffer size is runtime dynamically change based on an Exponentially Weighted Moving Average (EWMA) over the packet length, when refilling RX rings. This patch only disable XDP_REDIRECT support in receive_mergeable() case, because it can cause a real crash. But IMHO we should NOT support XDP in receive_mergeable() at all, because the principles behind XDP are to gain speed by (1) code simplicity, (2) sacrificing memory and (3) where possible moving runtime checks to setup time. These principles are clearly being violated in receive_mergeable(), that e.g. runtime track average buffer size to save memory consumption. I agree to disable it for -net now. Okay... I'll send an official patch later. For net-next, we probably can do: - drop xdp_linearize_page() and do XDP through generic XDP helper after skb was built I disagree strongly here - it makes no sense. Why do you want to explicit fallback to Generic-XDP? (... then all the performance gain is gone!) And besides, a couple of function calls later, the generic XDP code will/can get invoked anyhow... Hi, Can we get EWMA to ensure for majority of cases we have the extra head room? Seems we could just over-estimate the size by N-bytes. In some cases we may under-estimate and then would need to fall back to generic-xdp or otherwise growing the buffer which of course would be painful and slow, but presumably would happen rarely. Hmmm... (first of all it is missing tail-room not head-room). Second having all this extra size estimating code, and fallback options leaves a very bad taste in my mouth... this sounds like a sure way to kill performance. I think it would be much better to keep this feature vs kill it and make its configuration even more painful to get XDP working on virtio. Based on you request, I'm going to fixing as much as possible of the XDP code path in driver virtio_net... I now have 4 fix patches... Thanks a lot! There is no way around disabling XDP_REDIRECT in receive_mergeable(), as XDP does not have a way to detect/know the "data_hard_end" of the data "frame". Disabling EWMA also seems reasonable to me. To me, it seems more reasonable to have a separate RX function call when an XDP program gets attached, and in that process change to the memory model so it is compatible with XDP. I would be OK with that but would be curious to see what Jason and Michael think. When I original wrote the XDP for virtio support the XDP infra was still primitive and we didn't have metadata, cpu maps, etc. Yes, that why cpumap fails. yet. I suspect there might need to be some additional coordination between guest and host though to switch the packet modes. If I recall this was where some of the original trouble came from. Maybe, but I think just have a separate refill function should be sufficient? Then we could reuse the exist code to deal with e.g synchronization. Take a step back: What is the reason/use-case for implementing XDP inside virtio_net? From a DDoS/performance perspective XDP in virtio_net happens on the "wrong-side" as it is activated _inside_ the guest OS, which is too late for a DDoS filter, as the guest kick/switch
Re: [RFC net PATCH] virtio_net: disable XDP_REDIRECT in receive_mergeable() case
On 2018年02月16日 23:41, Jesper Dangaard Brouer wrote: On Fri, 16 Feb 2018 13:31:37 +0800 Jason Wangwrote: On 2018年02月16日 06:43, Jesper Dangaard Brouer wrote: The virtio_net code have three different RX code-paths in receive_buf(). Two of these code paths can handle XDP, but one of them is broken for at least XDP_REDIRECT. Function(1): receive_big() does not support XDP. Function(2): receive_small() support XDP fully and uses build_skb(). Function(3): receive_mergeable() broken XDP_REDIRECT uses napi_alloc_skb(). The simple explanation is that receive_mergeable() is broken because it uses napi_alloc_skb(), which violates XDP given XDP assumes packet header+data in single page and enough tail room for skb_shared_info. The longer explaination is that receive_mergeable() tries to work-around and satisfy these XDP requiresments e.g. by having a function xdp_linearize_page() that allocates and memcpy RX buffers around (in case packet is scattered across multiple rx buffers). This does currently satisfy XDP_PASS, XDP_DROP and XDP_TX (but only because we have not implemented bpf_xdp_adjust_tail yet). The XDP_REDIRECT action combined with cpumap is broken, and cause hard to debug crashes. The main issue is that the RX packet does not have the needed tail-room (SKB_DATA_ALIGN(skb_shared_info)), causing skb_shared_info to overlap the next packets head-room (in which cpumap stores info). Reproducing depend on the packet payload length and if RX-buffer size happened to have tail-room for skb_shared_info or not. But to make this even harder to troubleshoot, the RX-buffer size is runtime dynamically change based on an Exponentially Weighted Moving Average (EWMA) over the packet length, when refilling RX rings. This patch only disable XDP_REDIRECT support in receive_mergeable() case, because it can cause a real crash. But IMHO we should NOT support XDP in receive_mergeable() at all, because the principles behind XDP are to gain speed by (1) code simplicity, (2) sacrificing memory and (3) where possible moving runtime checks to setup time. These principles are clearly being violated in receive_mergeable(), that e.g. runtime track average buffer size to save memory consumption. I agree to disable it for -net now. Okay... I'll send an official patch later. For net-next, we probably can do: - drop xdp_linearize_page() and do XDP through generic XDP helper after skb was built I disagree strongly here - it makes no sense. Why do you want to explicit fallback to Generic-XDP? (... then all the performance gain is gone!) Note this only happens when: 1) Rx buffer size is under estimated, we could disable estimation and then this won't happen 2) headroom is not sufficient, we try hard to not stop device during XDP set, so this can happen but only for the first several packets So this looks pretty fine and remove a lot of complex codes. And besides, a couple of function calls later, the generic XDP code will/can get invoked anyhow... How if we choose to use native mode of XDP? Take a step back: What is the reason/use-case for implementing XDP inside virtio_net? From a DDoS/performance perspective XDP in virtio_net happens on the "wrong-side" as it is activated _inside_ the guest OS, which is too late for a DDoS filter, as the guest kick/switch overhead have already occurred. I don't see any difference of virtio-net now. Consider a real hardward NIC, XDP (except for the offload case) also start to drop packet after it reach hardware. I do use XDP_DROP inside the guest (driver virtio_net), but just to perform what I can zoom-in benchmarking, for perf-record isolating the early RX code path in the guest. (Using iptables "raw" table drop is almost as useful for that purpose). We could not assume the type of virtio-net backend, it could be dpdk or other high speed implementation. And I'm pretty sure there are more use cases, here are two: - Use XDP to accelerate nest VM - XDP offload to host The XDP ndo_xdp_xmit in tuntap/tun.c (that you also implemented) is significantly more interesting. As it allow us to skip large parts of the network stack and redirect from a physical device (ixgbe) into a guest device. Ran a benchmark: - 0.5 Mpps with normal code path into device with driver tun - 3.7 Mpps with XDP_REDIRECT from ixgbe into same device Plus, there are indications that 3.7Mpps is not the real limit, as guest CPU doing XDP_DROP is 75% idle... thus this is a likely a scheduling + queue size issue. Yes, XDP_DROP can do more (but I forget the exact number). Btw testpmd (in guest) can give me about 3Mpps when doing forwarding (io mode). The main bottleneck in this case is vhost, XDP_REDIRECT can provides about 8Mpps to tun, but vhost can only receive about 4Mpps, and vhost tx can only have 3Mpps. Thanks - disable EWMA when XDP is set and reserve enough tailroom. Besides the described bug: Update(1): There is also a OOM leak in the
Re: [RFC net PATCH] virtio_net: disable XDP_REDIRECT in receive_mergeable() case
On 02/20/2018 03:17 AM, Jesper Dangaard Brouer wrote: > On Fri, 16 Feb 2018 09:19:02 -0800 > John Fastabendwrote: > >> On 02/16/2018 07:41 AM, Jesper Dangaard Brouer wrote: >>> On Fri, 16 Feb 2018 13:31:37 +0800 >>> Jason Wang wrote: >>> On 2018年02月16日 06:43, Jesper Dangaard Brouer wrote: > The virtio_net code have three different RX code-paths in receive_buf(). > Two of these code paths can handle XDP, but one of them is broken for > at least XDP_REDIRECT. > > Function(1): receive_big() does not support XDP. > Function(2): receive_small() support XDP fully and uses build_skb(). > Function(3): receive_mergeable() broken XDP_REDIRECT uses > napi_alloc_skb(). > > The simple explanation is that receive_mergeable() is broken because > it uses napi_alloc_skb(), which violates XDP given XDP assumes packet > header+data in single page and enough tail room for skb_shared_info. > > The longer explaination is that receive_mergeable() tries to > work-around and satisfy these XDP requiresments e.g. by having a > function xdp_linearize_page() that allocates and memcpy RX buffers > around (in case packet is scattered across multiple rx buffers). This > does currently satisfy XDP_PASS, XDP_DROP and XDP_TX (but only because > we have not implemented bpf_xdp_adjust_tail yet). > > The XDP_REDIRECT action combined with cpumap is broken, and cause hard > to debug crashes. The main issue is that the RX packet does not have > the needed tail-room (SKB_DATA_ALIGN(skb_shared_info)), causing > skb_shared_info to overlap the next packets head-room (in which cpumap > stores info). > > Reproducing depend on the packet payload length and if RX-buffer size > happened to have tail-room for skb_shared_info or not. But to make > this even harder to troubleshoot, the RX-buffer size is runtime > dynamically change based on an Exponentially Weighted Moving Average > (EWMA) over the packet length, when refilling RX rings. > > This patch only disable XDP_REDIRECT support in receive_mergeable() > case, because it can cause a real crash. > > But IMHO we should NOT support XDP in receive_mergeable() at all, > because the principles behind XDP are to gain speed by (1) code > simplicity, (2) sacrificing memory and (3) where possible moving > runtime checks to setup time. These principles are clearly being > violated in receive_mergeable(), that e.g. runtime track average > buffer size to save memory consumption. I agree to disable it for -net now. >>> >>> Okay... I'll send an official patch later. >>> For net-next, we probably can do: - drop xdp_linearize_page() and do XDP through generic XDP helper after skb was built >>> >>> I disagree strongly here - it makes no sense. >>> >>> Why do you want to explicit fallback to Generic-XDP? >>> (... then all the performance gain is gone!) >>> And besides, a couple of function calls later, the generic XDP code >>> will/can get invoked anyhow... >>> >> >> Hi, Can we get EWMA to ensure for majority of cases we have the extra >> head room? Seems we could just over-estimate the size by N-bytes. In >> some cases we may under-estimate and then would need to fall back to >> generic-xdp or otherwise growing the buffer which of course would be >> painful and slow, but presumably would happen rarely. > > Hmmm... (first of all it is missing tail-room not head-room). > Second having all this extra size estimating code, and fallback options > leaves a very bad taste in my mouth... this sounds like a sure way to > kill performance. > > >> I think it would be much better to keep this feature vs kill it and >> make its configuration even more painful to get XDP working on virtio. > > Based on you request, I'm going to fixing as much as possible of the > XDP code path in driver virtio_net... I now have 4 fix patches... > Thanks a lot! > There is no way around disabling XDP_REDIRECT in receive_mergeable(), > as XDP does not have a way to detect/know the "data_hard_end" of the > data "frame". > >> Disabling EWMA also seems reasonable to me. > > To me, it seems more reasonable to have a separate RX function call > when an XDP program gets attached, and in that process change to the > memory model so it is compatible with XDP. > I would be OK with that but would be curious to see what Jason and Michael think. When I original wrote the XDP for virtio support the XDP infra was still primitive and we didn't have metadata, cpu maps, etc. yet. I suspect there might need to be some additional coordination between guest and host though to switch the packet modes. If I recall this was where some of the original trouble came from. > >>> Take a step back: >>> What is the reason/use-case for implementing XDP inside virtio_net? >>> >>> From a
Re: [RFC net PATCH] virtio_net: disable XDP_REDIRECT in receive_mergeable() case
On Fri, 16 Feb 2018 09:19:02 -0800 John Fastabendwrote: > On 02/16/2018 07:41 AM, Jesper Dangaard Brouer wrote: > > On Fri, 16 Feb 2018 13:31:37 +0800 > > Jason Wang wrote: > > > >> On 2018年02月16日 06:43, Jesper Dangaard Brouer wrote: > >>> The virtio_net code have three different RX code-paths in receive_buf(). > >>> Two of these code paths can handle XDP, but one of them is broken for > >>> at least XDP_REDIRECT. > >>> > >>> Function(1): receive_big() does not support XDP. > >>> Function(2): receive_small() support XDP fully and uses build_skb(). > >>> Function(3): receive_mergeable() broken XDP_REDIRECT uses > >>> napi_alloc_skb(). > >>> > >>> The simple explanation is that receive_mergeable() is broken because > >>> it uses napi_alloc_skb(), which violates XDP given XDP assumes packet > >>> header+data in single page and enough tail room for skb_shared_info. > >>> > >>> The longer explaination is that receive_mergeable() tries to > >>> work-around and satisfy these XDP requiresments e.g. by having a > >>> function xdp_linearize_page() that allocates and memcpy RX buffers > >>> around (in case packet is scattered across multiple rx buffers). This > >>> does currently satisfy XDP_PASS, XDP_DROP and XDP_TX (but only because > >>> we have not implemented bpf_xdp_adjust_tail yet). > >>> > >>> The XDP_REDIRECT action combined with cpumap is broken, and cause hard > >>> to debug crashes. The main issue is that the RX packet does not have > >>> the needed tail-room (SKB_DATA_ALIGN(skb_shared_info)), causing > >>> skb_shared_info to overlap the next packets head-room (in which cpumap > >>> stores info). > >>> > >>> Reproducing depend on the packet payload length and if RX-buffer size > >>> happened to have tail-room for skb_shared_info or not. But to make > >>> this even harder to troubleshoot, the RX-buffer size is runtime > >>> dynamically change based on an Exponentially Weighted Moving Average > >>> (EWMA) over the packet length, when refilling RX rings. > >>> > >>> This patch only disable XDP_REDIRECT support in receive_mergeable() > >>> case, because it can cause a real crash. > >>> > >>> But IMHO we should NOT support XDP in receive_mergeable() at all, > >>> because the principles behind XDP are to gain speed by (1) code > >>> simplicity, (2) sacrificing memory and (3) where possible moving > >>> runtime checks to setup time. These principles are clearly being > >>> violated in receive_mergeable(), that e.g. runtime track average > >>> buffer size to save memory consumption. > >> > >> I agree to disable it for -net now. > > > > Okay... I'll send an official patch later. > > > >> For net-next, we probably can do: > >> > >> - drop xdp_linearize_page() and do XDP through generic XDP helper > >> after skb was built > > > > I disagree strongly here - it makes no sense. > > > > Why do you want to explicit fallback to Generic-XDP? > > (... then all the performance gain is gone!) > > And besides, a couple of function calls later, the generic XDP code > > will/can get invoked anyhow... > > > > Hi, Can we get EWMA to ensure for majority of cases we have the extra > head room? Seems we could just over-estimate the size by N-bytes. In > some cases we may under-estimate and then would need to fall back to > generic-xdp or otherwise growing the buffer which of course would be > painful and slow, but presumably would happen rarely. Hmmm... (first of all it is missing tail-room not head-room). Second having all this extra size estimating code, and fallback options leaves a very bad taste in my mouth... this sounds like a sure way to kill performance. > I think it would be much better to keep this feature vs kill it and > make its configuration even more painful to get XDP working on virtio. Based on you request, I'm going to fixing as much as possible of the XDP code path in driver virtio_net... I now have 4 fix patches... There is no way around disabling XDP_REDIRECT in receive_mergeable(), as XDP does not have a way to detect/know the "data_hard_end" of the data "frame". > Disabling EWMA also seems reasonable to me. To me, it seems more reasonable to have a separate RX function call when an XDP program gets attached, and in that process change to the memory model so it is compatible with XDP. > > Take a step back: > > What is the reason/use-case for implementing XDP inside virtio_net? > > > > From a DDoS/performance perspective XDP in virtio_net happens on the > > "wrong-side" as it is activated _inside_ the guest OS, which is too > > late for a DDoS filter, as the guest kick/switch overhead have already > > occurred. > > > > The hypervisor may not "know" how to detect DDOS if its specific to > the VMs domain. In these cases we aren't measuring pps but are looking > at cycles/packet. In this case dropping cycles/packet frees up CPU > for useful work. Here I expect we can see real CPU % drop in VM by > using XDP. Well, my
Re: [RFC net PATCH] virtio_net: disable XDP_REDIRECT in receive_mergeable() case
On Fri, 16 Feb 2018 13:31:37 +0800 Jason Wangwrote: > > Besides the described bug: > > > > Update(1): There is also a OOM leak in the XDP_REDIRECT code, which > > receive_small() is likely also affected by. Found the issue behind this memory leak... page refcnt issues when hitting xmit error paths. > > Update(2): Also observed a guest crash when redirecting out an > > another virtio_net device, when device is down. Also found reason behind this. It is not related to a "down" device. It caused by redirect into a virtio_net device that does not have setup sufficient XDP TX queues (which it assumes is one per CPU). I have a preliminary fix patch. > Will have a look at these issues. (Holiday in china now, so will do it > after). No worry. I can take care of this... I'll cleanup my patches and test them Monday, should have patches ready Tuesday... as I want to make sure they work in all the different error cases. I'll be working on improvements for the XDP_REDIRECT code paths anyway next week... in preparations for supporting the AF_XDP use case. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [RFC net PATCH] virtio_net: disable XDP_REDIRECT in receive_mergeable() case
On 02/16/2018 07:41 AM, Jesper Dangaard Brouer wrote: > On Fri, 16 Feb 2018 13:31:37 +0800 > Jason Wangwrote: > >> On 2018年02月16日 06:43, Jesper Dangaard Brouer wrote: >>> The virtio_net code have three different RX code-paths in receive_buf(). >>> Two of these code paths can handle XDP, but one of them is broken for >>> at least XDP_REDIRECT. >>> >>> Function(1): receive_big() does not support XDP. >>> Function(2): receive_small() support XDP fully and uses build_skb(). >>> Function(3): receive_mergeable() broken XDP_REDIRECT uses napi_alloc_skb(). >>> >>> The simple explanation is that receive_mergeable() is broken because >>> it uses napi_alloc_skb(), which violates XDP given XDP assumes packet >>> header+data in single page and enough tail room for skb_shared_info. >>> >>> The longer explaination is that receive_mergeable() tries to >>> work-around and satisfy these XDP requiresments e.g. by having a >>> function xdp_linearize_page() that allocates and memcpy RX buffers >>> around (in case packet is scattered across multiple rx buffers). This >>> does currently satisfy XDP_PASS, XDP_DROP and XDP_TX (but only because >>> we have not implemented bpf_xdp_adjust_tail yet). >>> >>> The XDP_REDIRECT action combined with cpumap is broken, and cause hard >>> to debug crashes. The main issue is that the RX packet does not have >>> the needed tail-room (SKB_DATA_ALIGN(skb_shared_info)), causing >>> skb_shared_info to overlap the next packets head-room (in which cpumap >>> stores info). >>> >>> Reproducing depend on the packet payload length and if RX-buffer size >>> happened to have tail-room for skb_shared_info or not. But to make >>> this even harder to troubleshoot, the RX-buffer size is runtime >>> dynamically change based on an Exponentially Weighted Moving Average >>> (EWMA) over the packet length, when refilling RX rings. >>> >>> This patch only disable XDP_REDIRECT support in receive_mergeable() >>> case, because it can cause a real crash. >>> >>> But IMHO we should NOT support XDP in receive_mergeable() at all, >>> because the principles behind XDP are to gain speed by (1) code >>> simplicity, (2) sacrificing memory and (3) where possible moving >>> runtime checks to setup time. These principles are clearly being >>> violated in receive_mergeable(), that e.g. runtime track average >>> buffer size to save memory consumption. >> >> I agree to disable it for -net now. > > Okay... I'll send an official patch later. > >> For net-next, we probably can do: >> >> - drop xdp_linearize_page() and do XDP through generic XDP helper >> after skb was built > > I disagree strongly here - it makes no sense. > > Why do you want to explicit fallback to Generic-XDP? > (... then all the performance gain is gone!) > And besides, a couple of function calls later, the generic XDP code > will/can get invoked anyhow... > Hi, Can we get EWMA to ensure for majority of cases we have the extra head room? Seems we could just over-estimate the size by N-bytes. In some cases we may under-estimate and then would need to fall back to generic-xdp or otherwise growing the buffer which of course would be painful and slow, but presumably would happen rarely. I think it would be much better to keep this feature vs kill it and make its configuration even more painful to get XDP working on virtio. Disabling EWMA also seems reasonable to me. > > Take a step back: > What is the reason/use-case for implementing XDP inside virtio_net? > > From a DDoS/performance perspective XDP in virtio_net happens on the > "wrong-side" as it is activated _inside_ the guest OS, which is too > late for a DDoS filter, as the guest kick/switch overhead have already > occurred. > The hypervisor may not "know" how to detect DDOS if its specific to the VMs domain. In these cases we aren't measuring pps but are looking at cycles/packet. In this case dropping cycles/packet frees up CPU for useful work. Here I expect we can see real CPU % drop in VM by using XDP. The other use case is once we have a fast path NIC to VM in kernel we can expect, from you numbers below, 3+Mpps. I seem to remember from Jason's netdevconf talk that he had some experimental numbers that were even better. The other case is the hypervisor is not Linux and is feeding packets even faster DPDK numbers, again from Jason's slides, seemed to show 11+Mpps. XDP makes a lot of sense here IMO. (those specific pps numbers I pulled from memory but the point is feeding many Mpps into a VM should be doable) The last thing is we may see hardware VFs emulating virtio at some point. Then XDP would be needed. With the newer virtio spec under development my impression is the hardware emulation piece is becoming more of a focus. But, someone actually working on that could probably provide a more informed comment. > I do use XDP_DROP inside the guest (driver virtio_net), but just to > perform what I can zoom-in benchmarking, for perf-record isolating the > early RX
Re: [RFC net PATCH] virtio_net: disable XDP_REDIRECT in receive_mergeable() case
On Fri, 16 Feb 2018 13:31:37 +0800 Jason Wangwrote: > On 2018年02月16日 06:43, Jesper Dangaard Brouer wrote: > > The virtio_net code have three different RX code-paths in receive_buf(). > > Two of these code paths can handle XDP, but one of them is broken for > > at least XDP_REDIRECT. > > > > Function(1): receive_big() does not support XDP. > > Function(2): receive_small() support XDP fully and uses build_skb(). > > Function(3): receive_mergeable() broken XDP_REDIRECT uses napi_alloc_skb(). > > > > The simple explanation is that receive_mergeable() is broken because > > it uses napi_alloc_skb(), which violates XDP given XDP assumes packet > > header+data in single page and enough tail room for skb_shared_info. > > > > The longer explaination is that receive_mergeable() tries to > > work-around and satisfy these XDP requiresments e.g. by having a > > function xdp_linearize_page() that allocates and memcpy RX buffers > > around (in case packet is scattered across multiple rx buffers). This > > does currently satisfy XDP_PASS, XDP_DROP and XDP_TX (but only because > > we have not implemented bpf_xdp_adjust_tail yet). > > > > The XDP_REDIRECT action combined with cpumap is broken, and cause hard > > to debug crashes. The main issue is that the RX packet does not have > > the needed tail-room (SKB_DATA_ALIGN(skb_shared_info)), causing > > skb_shared_info to overlap the next packets head-room (in which cpumap > > stores info). > > > > Reproducing depend on the packet payload length and if RX-buffer size > > happened to have tail-room for skb_shared_info or not. But to make > > this even harder to troubleshoot, the RX-buffer size is runtime > > dynamically change based on an Exponentially Weighted Moving Average > > (EWMA) over the packet length, when refilling RX rings. > > > > This patch only disable XDP_REDIRECT support in receive_mergeable() > > case, because it can cause a real crash. > > > > But IMHO we should NOT support XDP in receive_mergeable() at all, > > because the principles behind XDP are to gain speed by (1) code > > simplicity, (2) sacrificing memory and (3) where possible moving > > runtime checks to setup time. These principles are clearly being > > violated in receive_mergeable(), that e.g. runtime track average > > buffer size to save memory consumption. > > I agree to disable it for -net now. Okay... I'll send an official patch later. > For net-next, we probably can do: > > - drop xdp_linearize_page() and do XDP through generic XDP helper > after skb was built I disagree strongly here - it makes no sense. Why do you want to explicit fallback to Generic-XDP? (... then all the performance gain is gone!) And besides, a couple of function calls later, the generic XDP code will/can get invoked anyhow... Take a step back: What is the reason/use-case for implementing XDP inside virtio_net? >From a DDoS/performance perspective XDP in virtio_net happens on the "wrong-side" as it is activated _inside_ the guest OS, which is too late for a DDoS filter, as the guest kick/switch overhead have already occurred. I do use XDP_DROP inside the guest (driver virtio_net), but just to perform what I can zoom-in benchmarking, for perf-record isolating the early RX code path in the guest. (Using iptables "raw" table drop is almost as useful for that purpose). The XDP ndo_xdp_xmit in tuntap/tun.c (that you also implemented) is significantly more interesting. As it allow us to skip large parts of the network stack and redirect from a physical device (ixgbe) into a guest device. Ran a benchmark: - 0.5 Mpps with normal code path into device with driver tun - 3.7 Mpps with XDP_REDIRECT from ixgbe into same device Plus, there are indications that 3.7Mpps is not the real limit, as guest CPU doing XDP_DROP is 75% idle... thus this is a likely a scheduling + queue size issue. > - disable EWMA when XDP is set and reserve enough tailroom. > > > > > Besides the described bug: > > > > Update(1): There is also a OOM leak in the XDP_REDIRECT code, which > > receive_small() is likely also affected by. > > > > Update(2): Also observed a guest crash when redirecting out an > > another virtio_net device, when device is down. > > Will have a look at these issues. (Holiday in china now, so will do it > after). -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [RFC net PATCH] virtio_net: disable XDP_REDIRECT in receive_mergeable() case
On 2018年02月16日 06:43, Jesper Dangaard Brouer wrote: The virtio_net code have three different RX code-paths in receive_buf(). Two of these code paths can handle XDP, but one of them is broken for at least XDP_REDIRECT. Function(1): receive_big() does not support XDP. Function(2): receive_small() support XDP fully and uses build_skb(). Function(3): receive_mergeable() broken XDP_REDIRECT uses napi_alloc_skb(). The simple explanation is that receive_mergeable() is broken because it uses napi_alloc_skb(), which violates XDP given XDP assumes packet header+data in single page and enough tail room for skb_shared_info. The longer explaination is that receive_mergeable() tries to work-around and satisfy these XDP requiresments e.g. by having a function xdp_linearize_page() that allocates and memcpy RX buffers around (in case packet is scattered across multiple rx buffers). This does currently satisfy XDP_PASS, XDP_DROP and XDP_TX (but only because we have not implemented bpf_xdp_adjust_tail yet). The XDP_REDIRECT action combined with cpumap is broken, and cause hard to debug crashes. The main issue is that the RX packet does not have the needed tail-room (SKB_DATA_ALIGN(skb_shared_info)), causing skb_shared_info to overlap the next packets head-room (in which cpumap stores info). Reproducing depend on the packet payload length and if RX-buffer size happened to have tail-room for skb_shared_info or not. But to make this even harder to troubleshoot, the RX-buffer size is runtime dynamically change based on an Exponentially Weighted Moving Average (EWMA) over the packet length, when refilling RX rings. This patch only disable XDP_REDIRECT support in receive_mergeable() case, because it can cause a real crash. But IMHO we should NOT support XDP in receive_mergeable() at all, because the principles behind XDP are to gain speed by (1) code simplicity, (2) sacrificing memory and (3) where possible moving runtime checks to setup time. These principles are clearly being violated in receive_mergeable(), that e.g. runtime track average buffer size to save memory consumption. I agree to disable it for -net now. For net-next, we probably can do: - drop xdp_linearize_page() and do XDP through generic XDP helper after skb was built - disable EWMA when XDP is set and reserve enough tailroom. Besides the described bug: Update(1): There is also a OOM leak in the XDP_REDIRECT code, which receive_small() is likely also affected by. Update(2): Also observed a guest crash when redirecting out an another virtio_net device, when device is down. Will have a look at these issues. (Holiday in china now, so will do it after). Thanks Fixes: 186b3c998c50 ("virtio-net: support XDP_REDIRECT") Signed-off-by: Jesper Dangaard Brouer--- drivers/net/virtio_net.c |7 --- 1 file changed, 7 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 626c27352ae2..0ca91942a884 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -677,7 +677,6 @@ static struct sk_buff *receive_mergeable(struct net_device *dev, struct bpf_prog *xdp_prog; unsigned int truesize; unsigned int headroom = mergeable_ctx_to_headroom(ctx); - int err; head_skb = NULL; @@ -754,12 +753,6 @@ static struct sk_buff *receive_mergeable(struct net_device *dev, goto err_xdp; rcu_read_unlock(); goto xdp_xmit; - case XDP_REDIRECT: - err = xdp_do_redirect(dev, , xdp_prog); - if (!err) - *xdp_xmit = true; - rcu_read_unlock(); - goto xdp_xmit; default: bpf_warn_invalid_xdp_action(act); case XDP_ABORTED:
[RFC net PATCH] virtio_net: disable XDP_REDIRECT in receive_mergeable() case
The virtio_net code have three different RX code-paths in receive_buf(). Two of these code paths can handle XDP, but one of them is broken for at least XDP_REDIRECT. Function(1): receive_big() does not support XDP. Function(2): receive_small() support XDP fully and uses build_skb(). Function(3): receive_mergeable() broken XDP_REDIRECT uses napi_alloc_skb(). The simple explanation is that receive_mergeable() is broken because it uses napi_alloc_skb(), which violates XDP given XDP assumes packet header+data in single page and enough tail room for skb_shared_info. The longer explaination is that receive_mergeable() tries to work-around and satisfy these XDP requiresments e.g. by having a function xdp_linearize_page() that allocates and memcpy RX buffers around (in case packet is scattered across multiple rx buffers). This does currently satisfy XDP_PASS, XDP_DROP and XDP_TX (but only because we have not implemented bpf_xdp_adjust_tail yet). The XDP_REDIRECT action combined with cpumap is broken, and cause hard to debug crashes. The main issue is that the RX packet does not have the needed tail-room (SKB_DATA_ALIGN(skb_shared_info)), causing skb_shared_info to overlap the next packets head-room (in which cpumap stores info). Reproducing depend on the packet payload length and if RX-buffer size happened to have tail-room for skb_shared_info or not. But to make this even harder to troubleshoot, the RX-buffer size is runtime dynamically change based on an Exponentially Weighted Moving Average (EWMA) over the packet length, when refilling RX rings. This patch only disable XDP_REDIRECT support in receive_mergeable() case, because it can cause a real crash. But IMHO we should NOT support XDP in receive_mergeable() at all, because the principles behind XDP are to gain speed by (1) code simplicity, (2) sacrificing memory and (3) where possible moving runtime checks to setup time. These principles are clearly being violated in receive_mergeable(), that e.g. runtime track average buffer size to save memory consumption. Besides the described bug: Update(1): There is also a OOM leak in the XDP_REDIRECT code, which receive_small() is likely also affected by. Update(2): Also observed a guest crash when redirecting out an another virtio_net device, when device is down. Fixes: 186b3c998c50 ("virtio-net: support XDP_REDIRECT") Signed-off-by: Jesper Dangaard Brouer--- drivers/net/virtio_net.c |7 --- 1 file changed, 7 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 626c27352ae2..0ca91942a884 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -677,7 +677,6 @@ static struct sk_buff *receive_mergeable(struct net_device *dev, struct bpf_prog *xdp_prog; unsigned int truesize; unsigned int headroom = mergeable_ctx_to_headroom(ctx); - int err; head_skb = NULL; @@ -754,12 +753,6 @@ static struct sk_buff *receive_mergeable(struct net_device *dev, goto err_xdp; rcu_read_unlock(); goto xdp_xmit; - case XDP_REDIRECT: - err = xdp_do_redirect(dev, , xdp_prog); - if (!err) - *xdp_xmit = true; - rcu_read_unlock(); - goto xdp_xmit; default: bpf_warn_invalid_xdp_action(act); case XDP_ABORTED: